Research & Development

rSpeak R&D Projects

rSpeak Technologies employs a group of talented Text-to-Speech developers to create high-quality rSpeak TTS voices in a variety of languages. We currently offer high-quality unit selection TTS voices in Swedish, Dutch, German, French, US/UK/AU English, and Spanish. Other languages are in development as well.

rSpeak Technologies is committed to Research and Development (R&D) projects to improve the quality of the TTS voices and to develop new TTS voices for different languages and different application areas. We are interested in partnering with academic institutions around the world to advance the technology and push the state of the art. We encourage internships for graduate and undergraduate students at our offices in Uppsala, Sweden, and Huis ter Heide, the Netherlands. The projects encompass a variety of relevant topics such as computational linguistics, deep learning, phonetics, prosody, and signal processing. Below you can find a selection of topics with some questions we are trying to answer. We encourage interested parties to contact us by sending an email to Esther Judd-Klabbers:

Statistical Parametric Speech Synthesis using Deep Neural Networks

We are investigating statistical parametric speech synthesis (SPSS) as a method for generating speech. The aspects of SPSS we are currently interested in include:

  • Which linguistic features are relevant for training optimal acoustic models?
  • How can we build a compact representation of the trained acoustic model that loads quickly and predicts quickly during synthesis?
  • How can we improve the vocoder quality to generate natural-sounding, comprehensible speech?
  • Can we develop a hybrid synthesis in which the predicted acoustic parameters are used in the current unit selection synthesis to select more appropriate units?
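
The hybrid idea in the last question can be illustrated with a minimal sketch. It assumes that an acoustic model has already predicted a few parameters for a target (here a pitch value and a duration) and that each candidate unit in the database stores the same parameters. The function names, the weights, and the example values are hypothetical and chosen only for illustration; this is not a description of the rSpeak engine. A complete unit selection system would also include a join cost and a search over whole unit sequences, both of which are left out here.

```python
import math

def target_cost(predicted, candidate, weights):
    """Weighted Euclidean distance between predicted and candidate parameters."""
    return math.sqrt(
        sum(w * (p - c) ** 2 for p, c, w in zip(predicted, candidate, weights))
    )

def select_unit(predicted, candidates, weights):
    """Return the candidate unit with the lowest target cost."""
    return min(
        candidates,
        key=lambda unit: target_cost(predicted, unit["features"], weights),
    )

# Hypothetical predicted parameters for one target: (F0 in Hz, duration in ms)
predicted = (120.0, 80.0)
# Hypothetical weights: pitch mismatch counts twice as much as duration mismatch
weights = (1.0, 0.5)

# Hypothetical candidate units from the unit selection database
candidates = [
    {"id": "unit_a", "features": (180.0, 80.0)},
    {"id": "unit_b", "features": (122.0, 85.0)},
    {"id": "unit_c", "features": (119.0, 140.0)},
]

best = select_unit(predicted, candidates, weights)
print(best["id"])  # prints "unit_b"
```

In this toy setting unit_b wins because it is close to the prediction in both pitch and duration, whereas unit_a misses on pitch and unit_c on duration.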

Prosody Prediction

To build high-quality acoustic models in SPSS and to improve our unit selection synthesis, we must improve prosody prediction. On the one hand, we need better NLP classification algorithms to predict from text where to insert phrase breaks and which words to accent or emphasize. On the other hand, we need to develop classification algorithms based on acoustic features to mark up our acoustic databases with these phrase breaks and accents. We need to build language-specific models and, in some cases, even speaker-specific models.

Voice Adaptation

The use of statistical parametric synthesis allows for more precise control over the generated speech output. Some interesting topics are:

  • Description of the amount of emphasis to allow for more expressive speech
  • Modeling the prediction and realization of prosody for different speaking styles (conversational, reading, teaching)
  • Development of child voices
  • Voice conversion

Multilingual Linguistic Processing

rSpeak develops TTS for many languages and we keep expanding our language offering. For each new language, we strive to create a high-quality linguistic frontend that predicts from input text which sounds or phonemes should be pronounced, where the syllable stress falls, which words are accented, and where phrase boundaries occur. The frontend also performs tokenization, text normalization (expanding abbreviations and processing number series), and heteronym disambiguation.
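
As a rough illustration of the tokenization and normalization steps, the sketch below expands a few abbreviations and reads number series digit by digit. The lookup tables, the function names, and the choice of English are assumptions made for this example only. A production frontend is language-specific and must also decide from context how to expand ambiguous tokens (for instance "St." as "Street" or "Saint"), which this sketch does not attempt.

```python
import re

# Hypothetical, tiny English lookup tables; real frontends are language-specific.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "approx.": "approximately"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def tokenize(text):
    """Split the input on whitespace (a deliberately naive tokenizer)."""
    return text.split()

def expand_token(token):
    """Expand a known abbreviation or a digit series; return other tokens unchanged."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if re.fullmatch(r"[0-9]+", token):
        # Read a number series digit by digit, e.g. "42" -> "four two".
        return " ".join(DIGITS[int(d)] for d in token)
    return token

def normalize(text):
    """Tokenize the text and expand each token in turn."""
    return " ".join(expand_token(t) for t in tokenize(text))

print(normalize("Dr. Smith lives at 42 Main St."))
# prints "Doctor Smith lives at four two Main Street"
```

Reading digits one by one suits phone numbers or codes; whether "42" should instead be read as a cardinal number depends on context and language, which is part of what makes normalization difficult.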