In a recent article, the Economist examined how “voice technology is transforming computing.” rSpeak TTS voices are blazing a trail with their expressive high quality and market-leading accuracy. But the rSpeak TTS team are also championing ground-breaking technologies that are radically changing the world of speech.

Today’s generation of text-to-speech (TTS) voices, such as the ones offered by rSpeak Technologies, use a synthesis technique called unit selection synthesis. Read about it here. Although the resulting speech sounds very natural, it involves recording many hours of speech with a professional speaker, which is costly. And in order to avoid “glitches” at points where speech units are pasted together, the speech is recorded in a very neutral speaking style with little variation in pitch.

Since the 1970s, researchers have been developing rule-based synthesizers that translate the linguistic properties of an input text into acoustic features of the speech (pitch, spectrum, duration, voicing). A vocoder then translates those acoustic features into speech. The rules were very complex, but not complex enough to describe human speech accurately, so the result sounded very robotic, with stilted intonation. In the early 2000s, the rules were replaced by trained Hidden Markov Models (HMMs), which had already been used successfully in speech recognition. This made it possible to generate speech using a limited acoustic database for training. Given a training set of linguistic properties with matching acoustic features, the HMMs learn to group together linguistic properties that produce similar acoustic features. The trained HMMs thus replace the hand-crafted rules. The output speech sounded more natural than the rule-based vocoded speech, although there were still some artifacts such as buzziness.

We’re now seeing an exciting revolution in TTS, which is set to change the future of speech. In the last few years, computers have become more and more powerful and deep learning has become increasingly popular. The mapping of linguistic properties to acoustic features is now handled by Deep Neural Networks (DNNs) instead of HMMs. An iterative learning process tries to minimize objectively measurable differences between the predicted acoustic features and the ones observed in the training set.
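To make the iterative learning process more concrete, here is a minimal sketch in plain Python. It is not rSpeak's system: the three linguistic features, the single normalized pitch target, the toy data, and the function name `train_toy_acoustic_model` are all assumptions made up for illustration. A tiny network with one hidden layer is adjusted step by step so that the squared difference between predicted and observed values shrinks.

```python
import math
import random

# Hypothetical toy data: (linguistic features, observed acoustic value).
# Features: [is_vowel, is_stressed, is_question_final]; the target is a
# made-up normalized pitch value, not a real measurement.
TOY_DATA = [
    ([1.0, 0.0, 0.0], 0.2),
    ([1.0, 1.0, 0.0], 0.5),
    ([1.0, 0.0, 1.0], 0.6),
    ([1.0, 1.0, 1.0], 0.9),
    ([0.0, 0.0, 0.0], 0.0),
]


def train_toy_acoustic_model(data, hidden=4, epochs=2000, lr=0.1, seed=0):
    """Fit a one-hidden-layer network by gradient descent on squared error.

    Returns (predict, losses): a prediction function and the mean squared
    error recorded for each pass over the data.
    """
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # Hidden layer with tanh activation, then a linear output.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        y = sum(w * hi for w, hi in zip(w2, h)) + b2
        return h, y

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            h, y = forward(x)
            err = y - target  # predicted minus observed
            total += err * err
            # Backpropagate the error (the constant factor 2 of the
            # squared-error gradient is absorbed into the learning rate).
            for j in range(hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        losses.append(total / len(data))

    return (lambda x: forward(x)[1]), losses


predict, losses = train_toy_acoustic_model(TOY_DATA)
print(f"first-pass error: {losses[0]:.4f}, last-pass error: {losses[-1]:.6f}")
```

A production voice would predict many acoustic features per frame (spectrum, duration, voicing as well as pitch) from a far richer set of linguistic properties, but the principle of repeatedly reducing a measurable prediction error is the same.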

Several companies are working on this technology. A startup called Lyrebird claims to be able to train a new DNN TTS voice using only one minute of speech. They have quite an impressive demo featuring Barack Obama, Donald Trump, and Hillary Clinton. However, the trained ear can hear many artifacts in these examples. They are due to different factors, including the quality of the recorded speech, the accuracy of the annotation of the database (where each sound is located), and the accuracy of the acoustic feature extraction.

Accurately extracting acoustic features is easier for some voices than for others. For instance, female voices can be more challenging due to the inherently higher pitch. But with a slightly larger acoustic database, a robust linguistic preprocessing module to determine the correct linguistic properties of the sentences in the acoustic database, and high-quality annotations and acoustic feature extraction, we can get much closer to natural-sounding speech.

Baidu’s Deep Voice also looks promising, as does Google’s WaveNet. However, like the other players, they don’t have an end-to-end system yet. The difference between WaveNet and Deep Voice on the one hand, and the rSpeak and Lyrebird systems on the other, is that the first two use acoustic features from previous frames as input features along with the linguistic features. While that produces even more natural-sounding speech, it also makes the system considerably more complex. For WaveNet, it can currently take over a minute to generate one sentence.
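The distinction between the two families of systems can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of any of the systems named above: the function names, the single number per frame, and the toy `predict` callables are all hypothetical. In the first function each frame depends only on its linguistic input; in the second, each frame also receives the frames generated before it, so the frames have to be produced one after another.

```python
def generate_independent(linguistic_frames, predict):
    """Each output frame is computed from its linguistic input alone."""
    return [predict(frame) for frame in linguistic_frames]


def generate_autoregressive(linguistic_frames, predict, context=2, initial=0.0):
    """Each output frame also sees the `context` previously generated frames."""
    outputs = []
    for frame in linguistic_frames:
        history = outputs[-context:]
        # Pad with a start value until enough frames have been generated.
        history = [initial] * (context - len(history)) + history
        outputs.append(predict(frame, history))
    return outputs


# Toy stand-ins for trained models (hypothetical): one number per frame.
frames = [1.0, 1.0, 0.0, 0.0]
independent = generate_independent(frames, lambda f: 0.5 * f)
autoregressive = generate_autoregressive(
    frames, lambda f, hist: 0.5 * f + 0.5 * hist[-1]
)
print(independent)     # [0.5, 0.5, 0.0, 0.0]
print(autoregressive)  # [0.5, 0.75, 0.375, 0.1875]
```

In the toy output the second sequence changes more gradually because every frame carries part of its predecessor forward. The same dependency means a frame cannot be computed until the previous ones exist, which is one reason this style of generation is harder to make fast.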

The advantage of the new DNN TTS method is that the acoustic database needed is much smaller than for a unit selection voice. Indeed, the new, smart TTS voices can be created based on just a few minutes of recordings. The DNN TTS method is also more flexible. We can record more expressive speech, and then control how much emphasis we want to put on a word and which words we want to emphasize. We can direct the pitch up or down to signal a question or a declarative sentence. The result is even more human-like.
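As a rough illustration of directing the pitch up or down, the sketch below ramps the end of a pitch contour by a number of semitones. It works directly on a list of per-frame pitch values, which is a simplifying assumption: the function name `apply_final_pitch_move`, its parameters, and the flat 100 Hz contour are hypothetical, and a real DNN voice would more likely apply such control through its input features or predicted parameters.

```python
def apply_final_pitch_move(f0_hz, semitones, tail_frames):
    """Ramp the pitch of the last `tail_frames` frames by up to `semitones`.

    f0_hz: per-frame fundamental frequency in Hz; 0.0 marks an unvoiced frame.
    A positive value raises the ending (question-like), a negative value
    lowers it (statement-like). Unvoiced frames are left at 0.0.
    """
    if tail_frames <= 0:
        return list(f0_hz)
    start = max(0, len(f0_hz) - tail_frames)
    span = len(f0_hz) - start
    shifted = list(f0_hz[:start])
    for i, f0 in enumerate(f0_hz[start:]):
        # Linear ramp in semitones; 12 semitones is one octave (a factor of 2).
        shift = semitones * (i + 1) / span
        shifted.append(f0 * 2.0 ** (shift / 12.0) if f0 > 0.0 else 0.0)
    return shifted


flat = [100.0, 100.0, 0.0, 100.0, 100.0]
question = apply_final_pitch_move(flat, semitones=12.0, tail_frames=2)
statement = apply_final_pitch_move(flat, semitones=-12.0, tail_frames=2)
print(question[-1], statement[-1])  # 200.0 50.0
```

A full octave is an exaggerated move chosen to keep the arithmetic easy to follow; natural intonation changes are usually much smaller.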

There is still work to be done before DNN TTS matches the quality of unit selection synthesis. At rSpeak Technologies, we are working on finding innovative solutions to overcome the hurdles to making DNN TTS a viable commercial product.

Compare the current rSpeak TTS voices with their counterparts created from a few minutes of speech using DNN technologies:

Listen to rSpeak TTS Sophie

Listen to rSpeak DNN voice Sophie

Listen to rSpeak TTS Mark

Listen to rSpeak DNN voice Mark

While we prepare DNN TTS voices for you, rSpeak text-to-speech voices are ready to power your online business today – just as they do for countless customers around the world. Click here to get in touch with the rSpeak Technologies team.