PhD Position on “Expressive speech synthesis based on deep learning”

Location: INRIA Nancy Grand Est research center — LORIA Laboratory, Nancy, France

Project-team: MULTISPEECH

Scientific Context

Over the last decades, text-to-speech synthesis (TTS) has reached good quality and intelligibility, and is now commonly used in information delivery services, for example in call center automation, navigation systems, and voice assistants. In the past, the main goal when developing TTS systems was to achieve high intelligibility. The speech style was then typically a “reading style”, which resulted from the style of the speech data used to develop TTS systems (reading of a large set of sentences). Although a reading style is acceptable for occasional interactions, TTS systems would benefit from more variability and expressivity in the generated synthetic speech, for example for lengthy interactions between machines and humans, or for entertainment applications. This is the goal of recent and emerging research on expressive speech synthesis. In contrast to neutral speech, which is typically read speech that conveys no particular emotion, expressive speech can be defined as speech that carries an emotion, that is spoken as in spontaneous speech, or that places emphasis on some words.

Missions:

Deep learning approaches lead to good speech synthesis quality; however, the main scientific and technological barrier remains the need for a speech corpus that matches both the speaker and the target style, here expressive speech. This thesis aims at investigating approaches to overcome this barrier. More precisely, the objective is to propose and investigate approaches that enable expressive speech synthesis for a given speaker's voice, using the neutral speech data of that speaker (or the corresponding neutral speech model) together with expressive speech data from other speakers. This will avoid the lengthy and costly recording of ad hoc expressive speech corpora (e.g., emotional speech data from the target voice speaker).

Recall that three main steps are involved in parametric speech synthesis: the generation of sequences of basic units (phonemes, pauses, etc.) from the source text; the generation of prosody parameters (durations of sounds, pitch values, etc.); and finally the generation of acoustic parameters, which leads to the synthetic speech signal. All three levels are involved in expressive speech synthesis: alteration of pronunciations and presence of pauses, modification of prosodic correlates, and modification of the spectral characteristics.
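For readers less familiar with this pipeline, the three steps can be pictured with a minimal Python sketch. This is only an illustration under strong assumptions, not the system used in the team: the lexicon, the function names (`text_to_units`, `predict_prosody`, `predict_acoustics`), the `expressive` flag and all numeric constants are hypothetical placeholders standing in for trained models.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy lexicon; a real front end would rely on a pronunciation
# dictionary and grapheme-to-phoneme conversion.
TOY_LEXICON: Dict[str, List[str]] = {
    "hello": ["h", "@", "l", "oU"],
    "world": ["w", "3`", "l", "d"],
}
PAUSE = "<pause>"


@dataclass
class ProsodicUnit:
    symbol: str         # phoneme or pause symbol
    duration_ms: float  # duration assigned to the unit
    f0_hz: float        # pitch value (0.0 for pauses)


def text_to_units(text: str) -> List[str]:
    """Step 1: map the source text to a sequence of basic units."""
    units: List[str] = []
    for word in text.lower().split():
        units.extend(TOY_LEXICON.get(word, ["<unk>"]))
        units.append(PAUSE)  # toy rule: one pause after every word
    return units


def predict_prosody(units: List[str], expressive: bool = False) -> List[ProsodicUnit]:
    """Step 2: attach duration and pitch to each unit.

    The constants are placeholders; in practice a trained model
    would predict these values from linguistic features.
    """
    scale = 1.2 if expressive else 1.0  # assumed, arbitrary style effect
    prosody: List[ProsodicUnit] = []
    for symbol in units:
        if symbol == PAUSE:
            prosody.append(ProsodicUnit(symbol, 200.0 * scale, 0.0))
        else:
            prosody.append(ProsodicUnit(symbol, 80.0 * scale, 120.0 * scale))
    return prosody


def predict_acoustics(prosody: List[ProsodicUnit],
                      frame_ms: float = 5.0) -> List[Tuple[str, float]]:
    """Step 3: expand each unit into per-frame acoustic parameters.

    Each frame here only carries (symbol, f0); a real system would
    also predict spectral parameters and pass them to a vocoder.
    """
    frames: List[Tuple[str, float]] = []
    for unit in prosody:
        n_frames = max(1, round(unit.duration_ms / frame_ms))
        frames.extend([(unit.symbol, unit.f0_hz)] * n_frames)
    return frames


units = text_to_units("hello world")
neutral_frames = predict_acoustics(predict_prosody(units))
expressive_frames = predict_acoustics(predict_prosody(units, expressive=True))
print(len(units), len(neutral_frames), len(expressive_frames))
```

In this toy version the expressive style only changes the prosody step, and the change then propagates to the acoustic frames; the posting's point is that, in a real system, all three levels may need to change.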

The thesis will essentially focus on the last two points, i.e., the correct prediction of prosody and spectral characteristics in order to produce expressive speech through deep learning-based approaches. Aspects to be investigated include the combined use of only neutral speech data from the target voice speaker and expressive speech from other speakers, either in the training process or in an adaptation process, as well as data augmentation processes.

The baseline experiments will rely on neutral speech corpora and expressive speech corpora previously collected for speech synthesis in the Multispeech team. Further experiments will consider using other expressive speech data, possibly extracted from audiobooks.

Skills and profile:

  • Master's degree in natural language processing or in computer science
  • Background in statistics and in deep learning
  • Experience with deep learning tools
  • Good computer skills (preferably in Python)
  • Experience in speech synthesis is a plus

Bibliography:

  • M. Schröder. Emotional speech synthesis: A review. Proc. EUROSPEECH, 2001.
  • M. Schröder. Expressive speech synthesis: Past, present, and possible futures. Affective information processing, pp. 111–126, 2009.
  • A. Iida, N. Campbell, F. Higuchi and M. Yasumura. A corpus-based speech synthesis system with emotion. Speech Communication, vol. 40, n. 1, pp. 161–187, 2003.
  • J.F. Pitrelli, R. Bakis, E.M. Eide, R. Fernandez, W. Hamza and M.A. Picheny. The IBM expressive text-to-speech synthesis system for American English. IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, n. 4, pp. 1099–1108, 2006.
  • D. Jiang, W. Zhang, L. Shen and L. Cai. Prosody analysis and modeling for emotional speech synthesis. Proc. ICASSP, 2005.
  • Z. Wu, P. Swietojanski, C. Veaux, S. Renals, S. King. A study of speaker adaptation for DNN-based speech synthesis. Proc. INTERSPEECH, pp. 879–883, 2015.

Additional information:

Deadline to apply: May 1st, 2018

Candidates are required to provide the following documents in a single PDF or ZIP file:

  • CV
  • A cover/motivation letter describing their interest in the topic
  • Degree certificates and transcripts for the Bachelor's and Master's degrees (or for the last 5 years)
  • Master's thesis (or equivalent) if it is already completed; otherwise, a description of the work in progress
  • The candidate's publications (or web links to them), if any (none are expected)
  • In addition, one recommendation letter from the person who supervises or supervised the Master's thesis (or research project or internship) should be sent directly by its author to the prospective PhD advisor.
