[PhD position] Acoustic to Articulatory Inversion by using dynamic MRI images

Supervisors:

Scientific challenge

Articulatory synthesis mimics the speech production process by first generating the shape of the vocal tract from the sequence of phonemes to be pronounced [6], and then the acoustic signal by solving the aeroacoustic equations [1, 2]. Compared to other approaches to speech synthesis, which already offer a very high level of quality, its main interest is that it controls the whole production process, not just the acoustic signal.

The objective of this PhD is to achieve the inverse transformation, known as acoustic-to-articulatory inversion, i.e. to recover the geometric shape of the vocal tract from the acoustic signal. A simple voice recording would then suffice to follow the dynamics of the different articulators during the production of a sentence.

Beyond its interest as a scientific challenge, acoustic-to-articulatory inversion has many potential applications. On its own, it can be used as a diagnostic tool to evaluate articulatory gestures in an educational or medical context.

Almost all current inversion work relies on data from EMA (ElectroMagnetic Articulography), which gives the positions of a few sensors glued to the tongue and other easily accessible articulators. As for the inversion techniques themselves, deep learning is widely used because it allows EMA corpora to be exploited efficiently. Currently, the LSTM (Long Short-Term Memory) approach and its bidirectional variant give the best results [3].

Despite their very good geometric accuracy, and because EMA data can only cover the part of the vocal tract closest to the mouth, current approaches cannot retrieve the complete geometry of the vocal tract, even though it is known, for example, that the larynx plays a determining role in the acoustics of the vocal tract. In practice, this considerably limits the usefulness of existing inversion techniques, since their results cannot be used to reconstruct the speech signal.

The objective of this project is to remove this limitation; its originality is to recover the complete geometry of the vocal tract using dynamic MRI data that we can acquire in Nancy at the IADI laboratory [5].

Description of work

The first objective is to invert the acoustic signal so as to recover the temporal evolution of the mid-sagittal slice. Dynamic MRI provides two-dimensional images of very good quality in the mid-sagittal plane at 50 Hz, and the speech signal acquired with an optical microphone can be denoised very efficiently with the algorithms developed in the MultiSpeech team (examples available at https://artspeech.loria.fr/resources/). We plan to use corpora already acquired or currently being acquired. These corpora represent a very large volume of data (several hundred thousand images), so it is necessary to preprocess them in order to identify the contours of the articulators involved in speech production (mandible, tongue, lips, velum, larynx, epiglottis). Last year we developed an approach for tracking articulator contours in MRI images that gives very good results [4]. Each articulator is tracked independently of the others, so that the individual behavior of an articulator can still be analyzed, e.g. in case one of them fails. The automatically tracked contours can therefore be used to train the inversion.

Initially, the goal is to perform the inversion using the LSTM approach on data from a small number of speakers for whom sufficient data exist. This approach will have to be adapted to the nature of the data, and it should make it possible to identify the contribution of each articulator.
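As a purely illustrative sketch (not the team's actual implementation), the bidirectional LSTM mapping from acoustic feature frames to articulator contour coordinates can be outlined in plain NumPy. The feature dimension, hidden size, number of articulators and contour points, and the random initialisation below are all assumptions made for the example.

```python
import numpy as np

# Hypothetical dimensions (not from the announcement): 40 acoustic
# features per frame, 64 hidden units, and 6 articulators x 30 contour
# points x 2 coordinates = 360 outputs per frame.
N_FEAT, N_HID, N_OUT = 40, 64, 6 * 30 * 2

rng = np.random.default_rng(0)

def lstm_params(n_in, n_hid):
    """Randomly initialised LSTM weights (input, forget, cell, output gates stacked)."""
    return {
        "W": rng.standard_normal((4 * n_hid, n_in)) * 0.1,
        "U": rng.standard_normal((4 * n_hid, n_hid)) * 0.1,
        "b": np.zeros(4 * n_hid),
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_run(params, xs):
    """Run one LSTM direction over a (T, n_in) sequence; return (T, n_hid) hidden states."""
    n_hid = params["b"].size // 4
    h, c, out = np.zeros(n_hid), np.zeros(n_hid), []
    for x in xs:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # updated cell state
        h = sigmoid(o) * np.tanh(c)                   # hidden state / output
        out.append(h)
    return np.stack(out)

def bilstm_invert(xs, fwd, bwd, W_out):
    """Bidirectional pass + linear read-out to contour coordinates."""
    h_f = lstm_run(fwd, xs)
    h_b = lstm_run(bwd, xs[::-1])[::-1]  # backward pass, re-aligned in time
    return np.concatenate([h_f, h_b], axis=1) @ W_out.T

# Dummy acoustic sequence: 50 frames of 40-dim features (e.g. MFCC-like vectors).
T = 50
xs = rng.standard_normal((T, N_FEAT))
fwd, bwd = lstm_params(N_FEAT, N_HID), lstm_params(N_FEAT, N_HID)
W_out = rng.standard_normal((N_OUT, 2 * N_HID)) * 0.1

contours = bilstm_invert(xs, fwd, bwd, W_out)
print(contours.shape)  # (50, 360): one contour-coordinate vector per acoustic frame
```

In practice such a model would be trained by gradient descent in a deep learning framework; the point here is only the shape of the mapping: one vector of contour coordinates per acoustic frame, with the bidirectional pass letting each prediction depend on both past and future acoustic context.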

In itself, a successful inversion recovering the shape of the vocal tract in the mid-sagittal plane would be a remarkable achievement, since current results only cover a very small part of the vocal tract (a few points on its front part). However, it is important to be able to transpose this result to any subject, which raises the question of speaker adaptation; this is the second objective.


The PhD student will be able to use the databases already acquired within the ANR ArtSpeech project (about 10 minutes of speech for 10 speakers) and the much larger databases being acquired within the ANR Full3DTalkingHead project (about 2 hours of speech for 2 speakers). The PhD student will, of course, also be able to acquire complementary data using the MRI system available at the IADI laboratory (40% of the MRI time is reserved for research).

The scientific environment of the two teams is highly complementary, with very strong competence in all fields of MRI and anatomy at the IADI laboratory and in deep learning in the MultiSpeech team at Loria. The two teams are geographically close (1.5 km apart). The PhD student will have an office in both laboratories and the technical means (computer, access to the computing clusters) to work in very good conditions. A follow-up meeting will take place every week, and each of the two teams organizes a weekly scientific seminar. The PhD student will also have the opportunity to attend one or two summer schools and conferences in MRI and automatic speech processing, and will receive assistance in writing conference or journal papers.


Candidate profile

Master’s degree in computer science or applied mathematics.

Your application, including all attachments, must be in English and submitted electronically at


Log into Inria’s recruitment system (link above) in order to apply for this position.

Please include:

  1. Motivated letter of application (max. one page)
  2. Your motivation for applying for the specific PhD project
  3. Curriculum vitae including information about your education, experience, language skills and other skills relevant for the position
  4. Publication list (if possible)
  5. Reference letters (if available)

The deadline for applications is Friday 13 May 2022, 23:59 GMT +2.
