Still from My Fair Lady
In our exploration of rSpeak Technologies text-to-speech software, we’ve looked at how rSpeak text-to-speech voices are created. In this post, we take another look under the hood of the speech engine. One of the most important functions in the text-to-speech process is converting every word into its pronunciation. This task is referred to as letter-to-sound conversion.

Starting from a word, which is a sequence of letters, the text-to-speech engine needs to establish which sounds are needed to pronounce it. The sounds of a language are either vowels (such as ‘aa’, ‘ey’, ‘oo’) or consonants (such as ‘p’, ‘m’, ‘k’), and a word is usually made up of one or more syllables with one vowel per syllable. In most languages, a word has one primary stressed syllable. This is the syllable with the longest sound durations, and it carries a pitch movement when the word is emphasized.

The engine’s first task is to check a pronunciation dictionary to see whether the word already has a known pronunciation. The pronunciation dictionary contains more than 100,000 entries per language. The development team at rSpeak Technologies is dedicated to achieving the highest quality standards in the industry. However, it’s no mystery that there can sometimes be a glitch here and there in the dictionary, and these need correcting.
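The lookup step can be pictured as a simple table query. The sketch below is illustrative only: the dictionary contents, the phoneme symbols, and the `lookup` function are assumptions made for this example and do not reflect rSpeak's actual data or API.

```python
# Hypothetical toy pronunciation dictionary: word -> list of phoneme symbols.
# The symbols are illustrative, not rSpeak's actual transcription set.
PRONUNCIATIONS = {
    "book": ["b", "uh", "k"],
    "listen": ["l", "ih", "s", "ah", "n"],
    "speech": ["s", "p", "iy", "ch"],
}


def lookup(word):
    """Return the stored phoneme list for a word, or None if it is unknown."""
    # Lowercase first so that "Book" and "book" share one entry.
    return PRONUNCIATIONS.get(word.lower())


print(lookup("Book"))
print(lookup("rspeak"))
```

A word that returns `None` here is one the engine has to handle in a later step.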

For some words, called heteronyms, there can be two or more distinct pronunciations associated with the same spelling. The rSpeak TTS system has rules that look at the words surrounding a heteronym in order to determine the most likely pronunciation. For instance, in the sentence ‘He likes to read’ the word ‘read’ should be pronounced with an ‘ee’ sound, but in ‘He read the book’ it should be pronounced with a short ‘eh’ sound.
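As a rough illustration of such a context rule, the sketch below picks a pronunciation for ‘read’ by looking at the word just before it. The rule set, the cue words, the phoneme symbols, and the function name are all assumptions for this example; a production system would use much richer context than a single preceding word.

```python
# Hypothetical heteronym table: each spelling maps to its candidate pronunciations.
HETERONYMS = {
    "read": {"base": ["r", "iy", "d"], "past": ["r", "eh", "d"]},
}

# Assumed cue words: an infinitive marker or modal verb directly before the
# heteronym suggests the base form ("to read", "will read").
BASE_FORM_CUES = {"to", "will", "can", "should", "must"}


def pronounce_heteronym(words, index):
    """Choose a pronunciation for words[index] from the preceding word."""
    variants = HETERONYMS[words[index].lower()]
    previous = words[index - 1].lower() if index > 0 else ""
    if previous in BASE_FORM_CUES:
        return variants["base"]
    # Toy default: without a cue word, fall back to the past-tense reading.
    return variants["past"]


print(pronounce_heteronym("He likes to read".split(), 3))
print(pronounce_heteronym("He read the book".split(), 1))
```

The default branch is deliberately naive: a sentence such as ‘I read every day’ would be mispronounced by this sketch, which is exactly the kind of case that calls for more context.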

The second step is to figure out pronunciations for words that are not in the pronunciation dictionary. In certain cases, the TTS system can find one or more base words that are in the dictionary by stripping a suffix or prefix from the word, and then derive the pronunciation of the entire word via a set of rules. In other cases, a machine-learning letter-to-sound module predicts the pronunciation. This module is trained to learn the most likely letter-to-sound conversions for a given language, using the existing pronunciation dictionary for that language to estimate the probability of each conversion.
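Both fallbacks can be sketched in a few lines. Everything below is a simplified assumption rather than rSpeak's method: the toy dictionary, the affix tables, and the phoneme symbols are invented, and the letter model assumes exactly one phoneme per letter, whereas real letter-to-sound modules align letters and sounds many-to-many and take neighbouring letters into account.

```python
from collections import Counter, defaultdict

# Hypothetical toy dictionary and affix rules (illustrative symbols only).
DICTIONARY = {
    "sad": ["s", "ae", "d"],
    "kind": ["k", "ay", "n", "d"],
    "pan": ["p", "ae", "n"],
    "tip": ["t", "ih", "p"],
}
SUFFIXES = {"ness": ["n", "ah", "s"], "ly": ["l", "iy"]}
PREFIXES = {"un": ["ah", "n"]}


def derive_from_base(word):
    """Strip one known suffix or prefix and reuse the base word's phonemes."""
    for suffix, phones in SUFFIXES.items():
        if word.endswith(suffix) and word[: -len(suffix)] in DICTIONARY:
            return DICTIONARY[word[: -len(suffix)]] + phones
    for prefix, phones in PREFIXES.items():
        if word.startswith(prefix) and word[len(prefix):] in DICTIONARY:
            return phones + DICTIONARY[word[len(prefix):]]
    return None


def train_letter_model(dictionary):
    """Estimate P(phoneme | letter) from entries with one phoneme per letter."""
    counts = defaultdict(Counter)
    for word, phones in dictionary.items():
        if len(word) == len(phones):  # crude one-to-one alignment assumption
            for letter, phone in zip(word, phones):
                counts[letter][phone] += 1
    return {
        letter: {p: c / sum(seen.values()) for p, c in seen.items()}
        for letter, seen in counts.items()
    }


def predict(word, model):
    """Pick the most probable phoneme per letter; unseen letters are skipped."""
    return [max(model[ch], key=model[ch].get) for ch in word if ch in model]


def pronounce(word, model):
    """Try the dictionary, then affix stripping, then the letter model."""
    word = word.lower()
    return DICTIONARY.get(word) or derive_from_base(word) or predict(word, model)


model = train_letter_model(DICTIONARY)
print(pronounce("sadness", model))  # derived from the base word "sad"
print(pronounce("tan", model))      # predicted letter by letter
```

The order of the fallbacks mirrors the text: a stored pronunciation wins, a derived one comes next, and the statistical prediction is used only when neither is available.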

Nevertheless, any TTS software may still pronounce certain words in a way that is not quite right. It can pick the wrong pronunciation for a heteronym, and the pronunciation dictionary itself can contain a transcription error. A TTS engine can also derive a wrong pronunciation from a base word, or predict a wrong transcription in the letter-to-sound module.

rSpeak Technologies is unique in the market because we enable rSpeak TTS customers to effortlessly share requests for transcription corrections. Our dedicated team of world-class linguists is able to quickly implement pronunciation correction rules for a specific customer or for our TTS engine in general. The team is strongly committed to updating rSpeak TTS frequently and taking corrections into account, thus constantly increasing the quality and accuracy of the best-in-class rSpeak text-to-speech voices.

Try rSpeak TTS for yourself and test our engine’s pronunciation!