[PhD position] Acoustic to Articulatory Inversion by using dynamic MRI images

Scientific challenge

Articulatory synthesis mimics the speech production process by first generating the shape of the vocal tract from the sequence of phonemes to be pronounced, then the acoustic signal by solving the aeroacoustic equations [1, 2]. Compared with other approaches to speech synthesis, which offer a very high level of quality, its main advantage is control over the whole production process, beyond the acoustic signal alone.

The objective of this PhD is to achieve the inverse transformation, called acoustic-to-articulatory inversion, that is, to recover the geometric shape of the vocal tract from the acoustic signal. A simple voice recording would then make it possible to follow the dynamics of the different articulators during the production of a sentence.

Beyond its interest as a scientific challenge, acoustic-to-articulatory inversion has many potential applications. On its own, it can be used as a diagnostic tool to evaluate articulatory gestures in an educational or medical context.

Combined with articulatory synthesis tools, it can also be used to provide audiovisual feedback for remediation (e.g. to help a hearing-impaired person produce the correct articulation of a phoneme), for learning (e.g. to master phonetic contrasts in a foreign language), or for improving singing techniques in a professional context.

Interdisciplinary character and teams involved

Our research in articulatory synthesis has made spectacular progress in recent years in the prediction of the instantaneous geometry of the vocal tract [12], the acquisition and processing of dynamic MRI (Magnetic Resonance Imaging) at a framerate of 50 images per second [11], the automatic tracking of the articulatory contour in these images [10], and the numerical simulations of the aeroacoustics of the vocal tract [1].

These advances would not have been possible without a strong interdisciplinary approach at the national level, but above all at the local level for the medical imaging (dynamic MRI acquisition) and anatomy aspects. The IADI laboratory (47 people including all the professors in radiology at the Nancy CHRU and 14 PhD students at the beginning of 2021) brings its skills in the development of MRI protocols in order to acquire images of excellent quality and in anatomy in order to identify anatomical structures in a relevant manner. Loria (MultiSpeech team, 50 people including 25 PhD students in 2021) brings its skills in machine learning, denoising, phonetics and vocal tract modeling. Finally, the two teams have jointly developed an approach for tracking articulator contours in dynamic images that is one of the most complete and efficient in the world and that makes it possible to process databases of more than 100,000 images and make them available to the scientific community.

This collaboration has essentially taken the form of a CPER Digital Humanities project (purchase of the real-time acquisition system) and two ANR projects (ArtSpeech from 2016 to 2019 and Full3DTalkingHead from 2020 to 2024), both of which focus on articulatory synthesis, the first specifically targeting acoustics and the second aiming to produce natural and well-coordinated articulatory gestures between the vocal tract and the face. This cooperation, initiated by Pierre-André Vuissoz and Yves Laprie and currently directly involving Karyna Isaieva (postdoc), Justine Leclere (hospital practitioner in dentistry and PhD student) and Vinicius Ribeiro (PhD student), has resulted in 22 publications since 2016, including 3 international journal articles in 2020-2021. Our consortium is one of the most advanced teams in the world in the use of dynamic MRI data, and probably the most advanced in the prediction of the complete shape of the vocal tract.

State of the art and innovative character

Almost all current work on inversion is based on data from EMA (ElectroMagnetic Articulography), which gives the positions of a few sensors glued to the tongue and other easily accessible articulators. From the point of view of the inversion techniques themselves, deep learning is widely used because it allows EMA data corpora to be exploited efficiently. Currently, the LSTM (Long Short-Term Memory) approach and its bidirectional variant give the best results [3].

Despite their very good geometric accuracy, the current approaches cannot retrieve the complete geometry of the vocal tract, because EMA data only cover the part of the vocal tract closest to the mouth, whereas it is known, for example, that the larynx plays a determining role in the acoustics of the vocal tract. In practice, this considerably limits the interest of existing inversion techniques, since the results cannot be used to reconstruct the speech signal.

The objective of this project is to remove this limitation. Its originality lies in recovering the complete geometry of the vocal tract using dynamic MRI data that we can acquire in Nancy at the IADI laboratory.

This approach will open a truly operational bridge between articulatory gestures and acoustics in both directions (physical numerical simulations for the direct transformation, and inversion for the reverse one). Another innovative aspect of the inversion that we propose is the identification of the role of each articulator, in order to take into account a possible perturbation affecting a specific articulator.

Description of work

The first objective is the inversion of the acoustic signal to recover the temporal evolution of the medio-sagittal slice. Dynamic MRI provides very good quality two-dimensional images in the medio-sagittal plane at 50 Hz, and the speech signal acquired with an optical microphone can be very efficiently denoised with the algorithms developed in the MultiSpeech team (examples available on https://artspeech.loria.fr/resources/). We plan to use corpora already acquired or in the process of being acquired. These corpora represent a very large volume of data (several hundreds of thousands of images), and it is therefore necessary to preprocess them in order to identify the contours of the articulators involved in speech production (mandible, tongue, lips, velum, larynx, epiglottis). Last year we developed an approach for tracking the contours of articulators in MRI images that gives very good results [10]. Each articulator is tracked independently of the others in order to keep the possibility of analyzing the individual behavior of an articulator, e.g. in case one of them fails. The automatically tracked contours can therefore be used to train the inversion.
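To train an inversion model, each tracked contour has to be paired with the acoustic features extracted around the same instant. The following is only a minimal sketch of such a pairing, not the procedure used by the teams: the 50 Hz image rate comes from the text, whereas the 10 ms acoustic hop and the function names are assumptions chosen for illustration.

```python
def nearest_acoustic_frame(image_index, mri_rate_hz=50, hop_ms=10):
    """Index of the acoustic frame closest in time to a given MRI image.

    mri_rate_hz=50 is the image rate stated in the text;
    hop_ms=10 is an assumed acoustic analysis hop, not taken from the source.
    """
    image_time_ms = image_index * 1000 / mri_rate_hz
    return round(image_time_ms / hop_ms)


def pair_frames(n_images, n_acoustic_frames, mri_rate_hz=50, hop_ms=10):
    """Pair every MRI image with its nearest acoustic frame.

    Indices are clipped so that the last images never point past
    the end of the acoustic feature sequence.
    """
    pairs = []
    for k in range(n_images):
        j = nearest_acoustic_frame(k, mri_rate_hz, hop_ms)
        pairs.append((k, min(j, n_acoustic_frames - 1)))
    return pairs


# Hypothetical example: 5 images and 8 acoustic frames.
pairs = pair_frames(n_images=5, n_acoustic_frames=8)
```

With these assumed values, one image spans 20 ms, so every image corresponds to two acoustic frames; a real pipeline would also have to account for any offset between the audio and image clocks.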

Initially, the goal is to perform the inversion using the LSTM approach on data from a small number of speakers for whom sufficient data exist. This approach will have to be adapted to the nature of the data and extended so that it can identify the contribution of each articulator.
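As an illustration of what a bidirectional LSTM computes in this setting, the sketch below runs an untrained forward pass that maps a sequence of acoustic frames to one vector of contour coordinates per frame. It is a simplified toy in plain Python, under stated assumptions: the feature and output sizes are hypothetical, the weights are random, biases are omitted for brevity, and it does not reproduce the architecture of [3] or any model used by the teams.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LSTMCell:
    """One LSTM cell: input, forget and output gates plus a candidate state."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}].
        self.weights = {g: rand_matrix(n_hidden, n_in + n_hidden, rng) for g in self.GATES}

    def step(self, x, h, c):
        z = x + h  # list concatenation of the input frame and previous hidden state
        pre = {g: matvec(self.weights[g], z) for g in self.GATES}
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def run(self, frames):
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        outputs = []
        for x in frames:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def bilstm_invert(frames, fwd, bwd, w_out):
    """Return one vector of contour coordinates per acoustic frame."""
    h_fwd = fwd.run(frames)
    h_bwd = bwd.run(frames[::-1])[::-1]  # reverse again to restore time order
    return [matvec(w_out, hf + hb) for hf, hb in zip(h_fwd, h_bwd)]


rng = random.Random(0)
n_acoustic, n_hidden, n_coords = 13, 8, 6  # hypothetical sizes
fwd = LSTMCell(n_acoustic, n_hidden, rng)
bwd = LSTMCell(n_acoustic, n_hidden, rng)
W_out = rand_matrix(n_coords, 2 * n_hidden, rng)
frames = [[rng.gauss(0, 1) for _ in range(n_acoustic)] for _ in range(10)]
contours = bilstm_invert(frames, fwd, bwd, W_out)
```

Because the backward pass reads the utterance from its end, the estimate for a given frame depends on both past and future acoustic context, which is what distinguishes the bidirectional variant from a plain LSTM. In practice one would train such a model with a deep learning framework rather than hand-written loops.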

In itself, inverting the acoustic signal to recover the shape of the vocal tract in the medio-sagittal plane will be a remarkable achievement, since the current results only cover a very small part of the vocal tract (a few points on its front part). However, it is important to be able to transpose this result to any subject. This raises the question of speaker adaptation, which is the second objective.

The most recent speaker adaptation techniques are based on the construction of embeddings, which are widely used in speaker recognition or identification, with the idea of “embedding” an individual in a continuous space in order to adapt the system to a new speaker [6, 7]. Here, both acoustic and anatomical data are available. In the context of this thesis, the objective is to construct anatomical embeddings because we wish to be able to study each articulator independently of the others, which requires fairly precise knowledge of its position and its immediate anatomical environment. Adapting to the speaker on the basis of only a few static MRI scans addresses a double constraint: the scarcity and cost of dynamic MRI on the one hand, and the impossibility of using MRI in some cases on the other, for example after the insertion of a cochlear implant whose compatibility with MRI is not guaranteed.

We have already addressed the issue of anatomical adaptation through the construction of dynamic atlases of consonant articulation [8], which is based on a fairly classical transformation in medical image processing [5]. It has the drawback of not identifying the remarkable anatomical landmarks as such, and the path we intend to follow will be inspired by the anatomical embeddings recently proposed for the processing of radiological images [4]. In spirit, the idea of these embeddings is quite close to LSTM (Long Short-Term Memory) networks, since they combine a global embedding and a local embedding.

Environment

The PhD student will be able to use the databases already acquired in the framework of the ANR ArtSpeech (about 10 minutes of speech for 10 speakers) and the much larger databases being acquired in the framework of the ANR Full3DTalkingHead (about 2 hours of speech for 2 speakers). The PhD student will of course also be able to acquire complementary data using the MRI system available in the IADI laboratory (40% of the MRI time reserved for research).

The two teams are highly complementary, with very strong expertise in all fields of MRI and anatomy at the IADI laboratory and in deep learning in the MultiSpeech team at Loria. The two teams are geographically close (1.5 km apart). The PhD student will have an office in both laboratories and the technical means (computer, access to the computing clusters) to work in very good conditions. A follow-up meeting will take place every week, and each of the two teams organizes a weekly scientific seminar. The PhD student will also have the opportunity to participate in one or two summer schools and conferences in MRI and automatic speech processing, and will be assisted in writing conference or journal papers.

References

  1. Benjamin Elie, and Yves Laprie, Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink. Speech Comm. 82, pp. 85-96 (2016). https://hal.archives-ouvertes.fr/hal-01199792v3

  2. Benjamin Elie, and Yves Laprie. Copy-synthesis of phrase-level utterances. EUSIPCO, Budapest 2016 https://hal.archives-ouvertes.fr/hal-01278462

  3. Maud Parrot, Juliette Millet, Ewan Dunbar. Independent and Automatic Evaluation of Speaker-Independent Acoustic-to-Articulatory Reconstruction. Interspeech 2020 – 21st Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai / Virtual, China. hal-03087264

  4. Ke Yan, Jinzheng Cai, Dakai Jin et al. Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images. arXiv:2012.02383 [cs.CV], 2020

  5. Rueckert D, Sonoda LI, Hayes C, Hill DL, Leach MO, Hawkes DJ. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans Med Imaging. 1999 Aug;18(8):712-21. doi: 10.1109/42.796284.

  6. David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” pp. 999–1003, Interspeech, 2017, https://www.isca-speech.org/archive/Interspeech_2017/pdfs/0620.PDF

  7. David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333. https://ieeexplore.ieee.org/document/8461375

  8. Ioannis Douros, Ajinkya Kulkarni, Chrysanthi Dourou, Yu Xie, Jacques Felblinger, Karyna Isaieva, Pierre-André Vuissoz and Yves Laprie. Using Silence MR Image to Synthesise Dynamic MRI Vocal Tract Data of CV. INTERSPEECH 2020, Oct 2020, Shanghai / Virtual, China. hal-03090808

  9. Slim Ouni. Tongue Gestures Awareness and Pronunciation Training. 12th Annual Conference of the International Speech Communication Association – Interspeech 2011, Aug 2011, Florence, Italy. inria-00602418

  10. Karyna Isaieva, Yves Laprie, Nicolas Turpault, Alexis Houssard, Jacques Felblinger & Pierre-André Vuissoz (2020), Automatic Tongue Delineation from MRI Images with a Convolutional Neural Network Approach, Applied Artificial Intelligence, 34:14, 1115-1123, https://hal.archives-ouvertes.fr/hal-02962336

  11. Karyna Isaieva, Y. Laprie, J. Leclère, Ioannis K. Douros, Jacques Felblinger & Pierre-André Vuissoz. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data 8, 258 (2021). https://doi.org/10.1038/s41597-021-01041-3

  12. Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie. Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated. INTERSPEECH 2021, Aug 2021, Brno, Czech Republic. hal-03360113
