[Phd position] Audio-visual expressive speech synthesis in an interaction context.


Location: LORIA, Nancy, France

Research team: MULTISPEECH (https://team.inria.fr/multispeech/)

Scientific Context:

Speech is inherently multimodal: it is carried both by the acoustic signal and by facial and body movements. This multimodal dimension matters all the more when speech conveys an emotion or places emphasis on a word or part of the discourse; this is what we call the expressivity of speech. During a verbal interaction between two people, this expressivity becomes a central element, because it accompanies the discourse and makes the act of communication more effective. The acoustic signal is thus naturally accompanied by facial movements (mimics) and hand or arm gestures when one of the interlocutors is speaking; these are known as co-verbal gestures. Likewise, the person who listens follows the discourse with regulating co-verbal gestures, such as nods or facial expressions, to signal agreement or incomprehension, or simply to indicate that he or she is following the interlocutor. Finally, both speaker and listener produce so-called extra-communicative gestures, such as movements of the eyelids, head, or body, which are not directly linked to the verbal interaction but to the person himself or herself (comfort movements).

In recent years, significant advances have been made in the field of voice assistants and conversational avatars, particularly in conducting a dialogue between a human and a conversational entity. Speech synthesis and recognition technologies have therefore received particular attention in the development of this type of interaction, the voice interface being the input (and output) point. While recognizing the vocal request and returning the (generally vocal) answer through intelligible, high-quality speech synthesis are necessary, they are not sufficient to achieve a realistic social interaction. Thus, for a visible conversational entity (a 3D avatar or a robot), the acoustic signal must be both expressive (to transmit an emotion or an emphasis) and accompanied by the different types of gestures activated during a verbal interaction. Much work exists on generating body gestures (torso, arms, head) to accompany speech that is spoken or listened to [Alexanderson et al. 2020, Yoon et al. 2019, Yunus et al. 2020, Wu et al. 2021]. However, in the context of engagement in a verbal interaction, the generation of expressive speech simultaneously in the acoustic and visual domains (especially the face) is not yet mastered. It is in this context that the thesis is situated: the synthesis of audio-visual expressive speech in an interaction context.


The objective of the thesis is twofold. First, it is important to develop expressive speech synthesis systems (audio and audio-visual) capable of finely decoupling the elements that contribute to the generation of a signal. Many elements, such as prosody, semantic content, and the intrinsic characteristics of an emotion, are entangled in the generation process. Dissociating the contributions of elements such as language, emotion, and speaker during neural network training allows better control over these elements and also facilitates the transfer of information or the adaptation of the network to different tasks [Kulkarni et al. 2020]. Indeed, variational auto-encoder (VAE) [Blei et al. 2017] and conditioning approaches have been used to leverage corpora of limited size [Dahmani et al. 2019], which is notably the case for audio-visual corpora of emotion or interaction. Extending these approaches with attention mechanisms or Glow-like normalizing flows [Kingma et al. 2016] should improve the handling of these dimensions. Corpora for expressive audio-visual synthesis already exist in the team.
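To make the conditioning idea concrete, here is a minimal NumPy sketch of the two mechanisms mentioned above: the reparameterization trick used to sample a VAE latent code, and conditioning by concatenating a one-hot emotion label to that code so the decoder can be steered independently of the linguistic content. The dimensions, the number of emotion classes, and the zero-valued encoder outputs are purely illustrative assumptions, not the thesis architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps (the VAE reparameterization trick),
    # keeping the sampling step differentiable in a real training setup.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def condition(latent, emotion_id, num_emotions=6):
    # Concatenate a one-hot emotion label to the latent code, so the
    # decoder receives the emotion as an explicit, controllable factor.
    one_hot = np.zeros(num_emotions)
    one_hot[emotion_id] = 1.0
    return np.concatenate([latent, one_hot])

# Dummy encoder outputs for a 16-dimensional latent space.
mu, log_var = np.zeros(16), np.zeros(16)
z = reparameterize(mu, log_var)
decoder_input = condition(z, emotion_id=3)
print(decoder_input.shape)  # (22,)
```

At synthesis time, swapping the one-hot label while keeping the same latent code is what lets such a model re-render the same utterance with a different emotion.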

Moreover, the act of interaction requires taking specific elements or information into account. During the utterance of a discourse, (audio) speech and co-verbal gestures can be linked to the lexical content itself or to the emphasis placed on a particular word (focus). When generating a regulating gesture (with or without an acoustic part), it becomes necessary to take into account, this time, the linguistic and prosodic cues perceived by the listener. The second objective of the thesis will therefore be to propose a credible and realistic gestural response. A speech recognition system (already available) could then be used.
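As a toy illustration of driving a listener's regulating gesture from the speaker's prosody, the sketch below triggers a candidate nod when speech resumes after a sufficiently long pause in the speaker's energy envelope. This is only a simple heuristic under assumed frame rates and thresholds, not the method to be developed in the thesis, where learned models would replace such rules.

```python
import numpy as np

def backchannel_triggers(energy, frame_rate=100, pause_ms=400, threshold=0.05):
    """Return frame indices where the speaker resumes after a long pause,
    i.e. candidate moments for a listener nod (illustrative heuristic)."""
    min_frames = int(pause_ms * frame_rate / 1000)
    triggers, run = [], 0
    for i, e in enumerate(energy):
        if e < threshold:
            run += 1          # count consecutive low-energy (silent) frames
        else:
            if run >= min_frames:
                triggers.append(i)  # speech resumes after a long pause
            run = 0
    return triggers

# Toy energy envelope at 100 frames/s: 1 s of speech, a 500 ms pause, speech again.
env = np.concatenate([np.full(100, 0.5), np.zeros(50), np.full(100, 0.5)])
print(backchannel_triggers(env))  # [150]
```

A learned model would instead predict gesture timing and form from richer linguistic and prosodic features, but the input/output contract is the same: speaker-side cues in, listener-side gesture events out.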

Skills and profile:

Master's degree in computer science

Background in deep learning

Good programming skills (preferably in Python)

Experience in speech synthesis and/or 3D image processing is a plus


Alexanderson, S., Székely, É., Henter, G. E., Kucherenko, T., & Beskow, J. (2020). Generating coherent spontaneous speech and gesture from text. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (pp. 1-3).

Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518), 859-877.

Dahmani, S., Colotte, V., Girard, V., & Ouni, S. (2019). Conditional Variational Auto-Encoder for Text-Driven Expressive Audiovisual Speech Synthesis. In Interspeech 2019, Graz, Austria.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improving variational inference with inverse autoregressive flow. In 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Kulkarni, A., Colotte, V., & Jouvet, D. (2020). Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis. In Interspeech 2020, Shanghai, China (virtual).

Yoon, Y., Ko, W.-R., Jang, M., Lee, J., Kim, J., & Lee, G. (2019). Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, pp. 4303-4309.

Yunus, F., Clavel, C., & Pelachaud, C. (2020). Sequence-to-Sequence Predictive Model: From Prosody To Communicative Gestures. Workshop sur les Affects, Compagnons artificiels et Interactions.

Wu, B., Liu, C., Ishi, C. T., & Ishiguro, H. (2021). Modeling the Conditional Distribution of Co-Speech Upper Body Gesture Jointly Using Conditional-GAN and Unrolled-GAN. Electronics, 10(3), 228.

Additional information:

Supervision and contact:

  • Vincent Colotte (Vincent.Colotte@univ-lorraine.fr)
  • Slim Ouni (Slim.Ouni@univ-lorraine.fr)

Duration: 3 years

Starting date: autumn 2021
