[PhD position] Acoustic-to-articulatory inversion of speech based on dynamic MRI images
MultiSpeech – Inria and IADI-INSERM U1254 – Nancy, France
Pierre-André Vuissoz (email@example.com) and Yves Laprie (Yves.Laprie@loria.fr)
Articulatory synthesis mimics the speech production process by first generating the shape of the vocal tract from a sequence of phonemes to be pronounced and then the acoustic signal by solving the aeroacoustic equations [7, 8]. Compared to other approaches to speech synthesis that offer a very high level of quality, its main advantage is that it gives control over the whole production process, beyond the acoustic signal alone.
Articulatory acoustic inversion is the reverse process that consists in recovering the shape of the vocal tract from the acoustic signal. By associating articulatory acoustic inversion with the rapidly progressing tools of articulatory synthesis, it will be possible to study speech production in depth, to address the question of speech production disorders, to explain the difficulties encountered by the hearing impaired during the acquisition of oral language, and to provide audiovisual articulatory feedback for remediation.
Almost all current work in inversion is based on the use of data from EMA (ElectroMagnetic Articulography), which gives the position of sensors glued onto the tongue and other easily accessible articulators. From the point of view of the inversion techniques themselves, deep learning is widely used because it allows EMA data corpora to be efficiently exploited. Currently, the LSTM (Long Short-Term Memory) approach and its bidirectional variant give the best results.
Despite their very good geometric accuracy, current approaches cannot recover the complete geometry of the vocal tract, because EMA data cover only the part of the vocal tract closest to the mouth. Yet it is well known, for example, that the larynx plays a decisive role in the acoustics of the vocal tract. In practice, this considerably limits the interest of inversion techniques since the results cannot be used to reconstruct the speech signal.
The objective of this thesis is to remove this limitation and to recover the full geometry of the vocal tract. To do so, we propose to use dynamic MRI data that we can acquire in Nancy at the IADI laboratory.
Acoustic-to-articulatory inversion of speech requires solving two problems. The first is the inversion itself. Dynamic MRI provides very good quality two-dimensional images in the midsagittal plane at 50 Hz, and the speech signal is acquired with an optical microphone and denoised very efficiently (examples available at https://artspeech.loria.fr/resources). We plan to use corpora already acquired or in the process of being acquired. These corpora represent a very large volume of data (several hundred thousand images), and it is therefore necessary to preprocess them in order to identify the contours of the articulators involved in speech production (mandible, tongue, lips, velum, larynx, epiglottis). Last year we developed an approach to contour tracking of articulators in MRI images that gives very good results. The automatically tracked contours can therefore be used to perform the inversion. In order to use inversion to analyze the individual behavior of an articulator that may be malfunctioning, each articulator is tracked independently of the others.
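To give an idea of what pairing the two data streams involves, here is a minimal sketch of how each 50 Hz MRI frame could be matched with the stretch of audio recorded during it. The 16 kHz audio sampling rate and the function names are assumptions made for illustration only; they do not describe the actual preprocessing pipeline.

```python
def audio_span_for_frame(frame_index, frame_rate_hz=50, sample_rate_hz=16000):
    """Return the (start, end) audio sample indices covering one MRI frame.

    The 16 kHz default is an assumed sampling rate, not a project setting.
    Integer arithmetic avoids drift when the ratio is not a whole number.
    """
    start = frame_index * sample_rate_hz // frame_rate_hz
    end = (frame_index + 1) * sample_rate_hz // frame_rate_hz
    return start, end


def pair_frames_with_audio(n_frames, samples, frame_rate_hz=50, sample_rate_hz=16000):
    """Pair each MRI frame index with its slice of audio samples.

    Frames whose audio span runs past the end of the recording are dropped.
    """
    pairs = []
    for i in range(n_frames):
        start, end = audio_span_for_frame(i, frame_rate_hz, sample_rate_hz)
        if end > len(samples):
            break
        pairs.append((i, samples[start:end]))
    return pairs


if __name__ == "__main__":
    one_second_of_audio = [0.0] * 16000  # placeholder signal
    pairs = pair_frames_with_audio(60, one_second_of_audio)
    print(len(pairs), len(pairs[0][1]))  # 50 320
```

Under these assumptions each image corresponds to 20 ms of speech, i.e. 320 audio samples, so one second of audio covers exactly 50 frames.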
As a first step, the objective is to perform the inversion, probably using the LSTM approach, on data from a small number of speakers for which sufficient data exist. This approach will have to be adapted to the nature of the data and must make it possible to identify the contribution of each articulator.
In itself, achieving inversion for a few subjects will already be a remarkable success since the current results only partially cover the vocal tract (a few points on the front part of the vocal tract). However, it is important to be able to transpose this result to any subject, which raises the question of speaker adaptation.
The most recent speaker adaptation techniques are based on the construction of embeddings, which are widely used in speaker recognition or identification, with the idea of “embedding” an individual in a continuous space in order to adapt the system to a new speaker [4, 5]. Here, both acoustic and anatomical data are available. In the context of this thesis, the primary objective is to construct anatomical embeddings because we wish to be able to study the gestures of one articulator separately from the others, which requires fairly precise knowledge of its position and its immediate anatomical environment. Adapting to a speaker on the basis of only a few static MRIs addresses a double constraint: the scarcity and cost of dynamic MRI on the one hand, and the impossibility of using MRI in some cases on the other hand, for example after the fitting of a cochlear implant whose compatibility with MRI is not guaranteed.
We have already addressed the issue of anatomical adaptation through the construction of dynamic atlases of consonant articulation, which relies in particular on a transformation that is fairly classical in medical imaging. Its drawback is that it does not identify the salient anatomical landmarks as such. The path we intend to follow is therefore inspired by the anatomical embeddings recently proposed for the processing of radiological images. In spirit, the idea of these embeddings is quite close to LSTM (Long Short-Term Memory) networks since they combine a global and a local embedding.
This PhD position is a joint INSERM-Inria project:
- Multispeech, INRIA Nancy Grand-Est, https://team.inria.fr/multispeech/fr/
- IADI, INSERM U1254, Nancy http://www.iadi.fr/
This doctoral project will be conducted jointly by the IADI laboratory (INSERM U1254) and Loria (Inria MultiSpeech team), which have been collaborating for several years on vocal tract imaging and the study of speech production. This project will rely in particular on the real-time two-dimensional MRI acquisition system (at 50 images per second) which the IADI laboratory has acquired in the framework of a regional collaboration with Loria. This system is unique in France and makes it possible to image the vocal tract at a frequency of 50 Hz in any direction.
Master in computer science or applied mathematics.
A good knowledge of medical imaging and/or speech processing is a plus.
- Maud Parrot, Juliette Millet, Ewan Dunbar. Independent and Automatic Evaluation of Speaker-Independent Acoustic-to-Articulatory Reconstruction. Interspeech 2020 – 21st Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai / Virtual, China. ⟨hal-03087264⟩
- Ke Yan, Jinzheng Cai, Dakai Jin, Shun Miao, Adam P. Harrison, Dazhou Guo, Youbao Tang, Jing Xiao, Jingjing Lu, Le Lu. Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images. arXiv:2012.02383 [cs.CV], 2020
- Rueckert D, Sonoda LI, Hayes C, Hill DL, Leach MO, Hawkes DJ. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans Med Imaging. 1999 Aug;18(8):712-21. doi: 10.1109/42.796284.
- David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003. https://www.isca-speech.org/archive/Interspeech_2017/pdfs/0620.PDF
- David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
- Karyna Isaieva, Yves Laprie, Freddy Odille, Ioannis Douros, Jacques Felblinger, et al. Measurement of Tongue Tip Velocity from Real-Time MRI and Phase-Contrast Cine-MRI in Consonant Production. Journal of Imaging, MDPI, 2020, 6 (5), pp.31. ⟨10.3390/jimaging6050031⟩. ⟨hal-02923466⟩
- Benjamin Elie, and Yves Laprie, Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink. Speech Comm. 82, pp. 85-96 (2016). https://hal.archives-ouvertes.fr/hal-01199792v3
- Benjamin Elie, and Yves Laprie. Copy-synthesis of phrase-level utterances. EUSIPCO, Budapest 2016 https://hal.archives-ouvertes.fr/hal-01278462
- Ioannis Douros, Ajinkya Kulkarni, Chrysanthi Dourou, Yu Xie, Jacques Felblinger, Karyna Isaieva, Pierre-André Vuissoz and Yves Laprie. Using Silence MR Image to Synthesise Dynamic MRI Vocal Tract Data of CV. INTERSPEECH 2020, Oct 2020, Shanghai / Virtual, China. ⟨hal-03090808⟩