WaveNet
WaveNet is a deep neural network for generating raw audio, created by researchers at the London-based artificial intelligence firm DeepMind. The technique, outlined in a paper in September 2016,[1] is able to generate realistic-sounding human-like voices by sampling real human speech and directly modelling waveforms. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech systems, although it is still less convincing than actual human speech.[2] WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.[3] Canada-based start-up Lyrebird offers similar technology, but it is based on a different deep learning model.[4]
History
Generating speech from text is becoming an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft’s Cortana, Amazon Alexa, and the Google Assistant.[5]
Most of today's systems use a variation of a technique that involves stitching sound fragments together to form recognisable sounds and words.[6] The most common of these is called concatenative TTS.[7] It relies on a large library of speech fragments, recorded from a single speaker, which are then combined - or concatenated - to produce complete words and sounds. The technique can often sound unnatural, with an unconvincing cadence and tone.[8] The reliance on a recorded library also makes it difficult to modify or change the voice.[9]
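The core idea of concatenation can be illustrated with a minimal sketch. The unit names, fragment lengths, and crossfade width below are illustrative placeholders, not those of any real concatenative system; production systems select units from far larger databases.

```python
import numpy as np

# Hypothetical unit library: each unit name maps to a recorded waveform.
# Random arrays stand in for real single-speaker recordings.
unit_library = {
    "h-e": np.random.randn(800),
    "e-l": np.random.randn(900),
    "l-o": np.random.randn(700),
}

def concatenate_units(unit_sequence, crossfade=80):
    """Join recorded fragments, crossfading at each seam to soften the join."""
    out = unit_library[unit_sequence[0]].copy()
    for name in unit_sequence[1:]:
        nxt = unit_library[name].copy()
        fade = np.linspace(0.0, 1.0, crossfade)
        # Blend the tail of the running output with the head of the next unit.
        out[-crossfade:] = out[-crossfade:] * (1 - fade) + nxt[:crossfade] * fade
        out = np.concatenate([out, nxt[crossfade:]])
    return out

wave = concatenate_units(["h-e", "e-l", "l-o"])
```

The audible seams produced by such joins are one source of the unconvincing cadence described above.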
Another technique, known as parametric TTS,[10] uses mathematical models to recreate known sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.
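A classic way to picture parametric synthesis is the source-filter model: a pitch-controlled pulse train (the "source") is shaped by a resonant filter whose coefficients play the role of the model's parameters. The sketch below is a bare illustration of that idea with made-up values, not any particular vocoder's algorithm.

```python
import numpy as np

fs = 16000          # sample rate in Hz
f0 = 120            # fundamental frequency (pitch) in Hz
n = fs // 10        # 100 ms of audio

# Source: glottal pulses spaced 1/f0 seconds apart.
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: a two-pole resonator around 700 Hz, i.e. a single "formant"
# whose centre frequency and bandwidth act as model parameters.
freq, bw = 700, 100
r = np.exp(-np.pi * bw / fs)
a1 = 2 * r * np.cos(2 * np.pi * freq / fs)
a2 = -r * r
out = np.zeros(n)
for t in range(n):
    out[t] = source[t] + a1 * out[t - 1] + a2 * out[t - 2]
```

Changing `f0`, `freq`, or `bw` changes the character of the output without re-recording anything, which is exactly the flexibility parametric TTS trades against naturalness.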
WaveNet
WaveNet is a type of feed-forward artificial neural network known as a deep convolutional neural network (CNN). It consists of layers of interconnected nodes, loosely analogous to the brain's neurons. The CNN takes a raw audio signal as input and synthesises an output one sample at a time.[11]
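A key property of such a network is causality: each output sample may depend only on present and past inputs, never future ones. The toy function below illustrates a causal dilated convolution in plain NumPy; the kernel values and dilation schedule are illustrative, and this is a sketch of the principle rather than DeepMind's implementation (which adds gated activations, skip connections, and learned filters).

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: output at time t depends only on inputs
    at t, t - dilation, t - 2*dilation, ... (never on the future)."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so nothing leaks from the future
    return np.array([
        sum(w[k] * xp[pad + t - k * dilation] for k in range(len(w)))
        for t in range(len(x))
    ])

# Stacking layers while doubling the dilation grows the receptive field
# exponentially: with kernel size 2 and dilations 1, 2, 4, 8, each output
# sample depends on the 16 most recent inputs.
x = np.random.randn(64)
h = x
for d in (1, 2, 4, 8):
    h = causal_dilated_conv(h, np.array([0.5, 0.5]), d)
```

The exponentially growing receptive field is what lets a convolutional model capture dependencies spanning many audio samples at a manageable depth.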
In the 2016 paper, the network was fed real waveforms of English and Mandarin speech. As the waveforms pass through the network, it learns a set of rules describing how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms from scratch at 16,000 samples per second. These waveforms include realistic breaths and lip smacks - but do not conform to any language.[12]
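Generation proceeds autoregressively: the network outputs a probability distribution over quantised amplitude values, one sample is drawn, and that sample is fed back in for the next step. The paper quantises audio to 256 levels with 8-bit mu-law companding; the sketch below shows that loop with a random stub standing in for the trained network.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a [-1, 1] sample to one of 256 levels (8-bit mu-law)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu).astype(int)

def mu_law_decode(q, mu=255):
    """Invert mu-law: map a level in [0, 255] back to a [-1, 1] sample."""
    y = 2 * q.astype(float) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

rng = np.random.default_rng(0)

def predict_distribution(history):
    # Stub: the real model would be the trained CNN, conditioning on
    # `history` and returning a softmax over the 256 levels. Here we
    # just return a random distribution for illustration.
    logits = rng.normal(size=256)
    p = np.exp(logits - logits.max())
    return p / p.sum()

samples = []
for _ in range(100):               # one sample at a time
    p = predict_distribution(samples)
    samples.append(rng.choice(256, p=p))
audio = mu_law_decode(np.array(samples))
```

Sampling one value at a time is what made the original model slow at inference: generating a second of audio requires 16,000 sequential forward passes.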
WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained with German speech, it produces German speech.[13] This ability to clone voices has raised ethical concerns about WaveNet's ability to mimic anyone's voice.
The capability also means that if WaveNet is fed other inputs - such as music - its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce classical-sounding music.[14]
Applications
At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications.[15]
References
- ↑ Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499. Bibcode:2016arXiv160903499V.
- ↑ Kahn, Jeremy (2016-09-09). "Google’s DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
- ↑ Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech". Fortune. Retrieved 2017-07-06.
- ↑ Gholipour, Bahar (2017-05-02). "New AI Tech Can Mimic Any Voice". Scientific American. Retrieved 2017-07-06.
- ↑ Kahn, Jeremy (2016-09-09). "Google’s DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
- ↑ Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen". MIT Technology Review. Retrieved 2017-07-06.
- ↑ Hunt, A. J.; Black, A. W. (May 1996). "Unit selection in a concatenative speech synthesis system using a large speech database" (PDF). 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. 1: 373–376. ISBN 0-7803-3192-3. doi:10.1109/ICASSP.1996.541110.
- ↑ Coldewey, Devin (2016-09-09). "Google’s WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
- ↑ van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
- ↑ Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication. 51 (11): 1039–1064.
- ↑ Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499. Bibcode:2016arXiv160903499V.
- ↑ Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sound eerily lifelike". Quartz. Retrieved 2017-07-06.
- ↑ Coldewey, Devin (2016-09-09). "Google’s WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
- ↑ van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
- ↑ "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07. Retrieved 2017-07-06.