New rapid voice cloning tech could spell end for voice authentication

Sat, 10th Mar 2018


A newly announced technology can clone a fairly convincing version of your voice from less than four seconds of audio.

Chinese tech giant Baidu unveiled its latest advances in Deep Voice, a project the company began around a year ago.

In its early days, Deep Voice needed 30 minutes of audio to create a new fake clip. The drop from 30 minutes to a few seconds illustrates how rapidly machine learning and artificial intelligence (AI) are developing.

The more audio samples the system is given, the better the quality and believability of the fake voice. At its bare minimum of 3.7 seconds, the output does sound a little distorted, but not much worse than a low-quality audio file.

The system can also change a male voice to a female one, or a British accent to an American one (and vice versa). This shows how well AI can learn and imitate different speaking styles, and it takes text-to-speech to an entirely new level.

Similar technologies have emerged in recent years. Adobe revealed its VoCo software in 2016, which could generate speech from text after listening to a voice for 20 minutes, and Lyrebird, a Montreal-based AI startup, claims it can do the same with just one minute of audio.

Of course, there is considerable apprehension about the rapid progress of text-to-speech, especially given the rise of machine learning software that makes it easy to create fake videos, making any media on the web increasingly hard to trust.

AI researchers and theorists have voiced concern that if all that's needed is a few seconds of someone's voice and a dataset of their face, it would not be hard at all to concoct an entire interview, press conference or news segment.

Aeriandi chief product officer and co-founder Tom Harwood says this creates a huge security problem, given that we are now in the era of voice biometric authentication.

“This technology is poised to transform personalisation in human-machine interfaces, but it raises serious concerns about voice biometric security systems,” Harwood says.

“Soon, criminals will need just a few seconds of someone's voice to cheat a voice recognition security system: voice biometric authentication will be rendered useless.”

Harwood says organisations need to be thinking now about which new technologies they can implement to stay ahead of the curve.

“Voice fraud detection technology is the primary candidate, as it looks at far more than the user's voice print; it considers hundreds, if not thousands, of other parameters,” Harwood says.

“For example, is the phone number being used legitimate? Has it been used fraudulently before? Increasingly, phone fraud attacks come from overseas. Voice fraud technology has been proven to protect against this as well as domestic threats.”