[PhD Thesis 2020] Deep supervision of the vocal tract shape for articulatory synthesis of speech
Supervisor: Yves Laprie (MultiSpeech team)
Contact: Yves.Laprie@loria.fr

Application

Deadline: 20 May 2020

Context

Articulatory synthesis mimics the process of speech production by first generating the vocal tract shape from the sequence of phonemes to be pronounced, and then the acoustic signal by solving aero-acoustic equations. Compared to other synthesis approaches that offer a high level of quality, the strength of articulatory synthesis is above all that it controls the entire process of speech production, rather than offering only superficial control at the level of the acoustic signal. It is thus possible to investigate speech production in depth: to explain the origin of expressiveness in speech, to address speech production disorders, and to identify the difficulties faced by learners of a foreign language, among other applications.

The challenge is to generate the articulator positions without constructing an explicit geometric model, by learning them instead from a vast corpus of MRI films of continuous speech. The speech articulators (jaw, tongue, lips, larynx, soft palate and epiglottis) modify the shape of the vocal tract, and therefore its acoustic properties, in particular its resonances. The expected result is thus to control the temporal evolution of the vocal tract in a realistic and effective way, so as to enable the use of acoustic simulations and thus achieve articulatory synthesis. In addition to the shape of the vocal tract, these simulations require a signal source, i.e. the vibration of the vocal folds or turbulence noise inside the vocal tract. This information will either be measured on natural speech or generated automatically.

The interest of articulatory synthesis is that it makes it possible to explain the articulatory origin of phonetic contrasts, to change the movement of articulators (or even to block one of them), to modify the control parameters of the vocal folds, to adapt realistically to a new speaker by modifying the size and shape of the articulators, and finally to access physical quantities (e.g. pressure) inside the vocal tract without introducing sensors into it.

The generation of the geometric shape of the vocal tract at each time point of the synthesis is a critical step, since it is the input of the acoustic simulations. Most often, the vocal tract shape is determined with an articulatory model [1,2] that gives the shape of the tract with a small number of parameters. Each parameter corresponds to a deformation mode of the articulator considered; for instance the tongue, which is the most deformable articulator, requires at least six parameters. An articulatory model is generally constructed from about 100 static MRI images of the vocal tract, which entails two weaknesses: on the one hand, there is no guarantee that the model can produce all the vocal tract shapes of natural speech; on the other hand, this raises the question of how the vocal tract shape anticipates the phonemes to be articulated, i.e. coarticulation.
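To make the notion of deformation modes concrete, here is a minimal sketch in Python/NumPy of how such a model is typically obtained by principal component analysis of contour data, in the spirit of [1,2]. The data below are random placeholders standing in for real MRI contours, and all sizes are illustrative assumptions.

    import numpy as np

    # Placeholder for real data: one row per static MRI image, each row a
    # flattened articulator contour of 60 (x, y) points -> shape (100, 120).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 120))

    mean = X.mean(axis=0)
    # PCA via SVD: rows of Vt are deformation modes, ordered by variance.
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    n_params = 6                      # e.g. ~6 modes for the tongue
    modes = Vt[:n_params]             # (6, 120)

    def contour(params):
        """Reconstruct a contour from a small articulatory parameter vector."""
        return (mean + params @ modes).reshape(-1, 2)

    print(contour(np.zeros(n_params)).shape)  # (60, 2): the mean shape

With real contours, each parameter sweeps one dominant deformation pattern of the articulator, which is exactly what the small parameter sets of [1,2] encode.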

Description of work

The objective of the PhD is to exploit machine learning techniques, particularly deep learning, to supervise the vocal tract shape, i.e. to compute the geometrical position and shape of each articulator for the sequence of phonemes to be articulated.

This transformation will be learned from a corpus of MRI films of the vocal tract. Recently, the IADI laboratory (INSERM U1254) at Nancy Hospital was equipped with a two-dimensional real-time MRI acquisition system for the vocal tract (50 images per second) as part of a regional collaboration with Loria, and a database of several hours of speech from several speakers is now available.

The quality of these images of the mid-sagittal shape of the vocal tract is very good, and we recently carried out preliminary experiments tracking the tongue contour with good precision. Building on this preliminary work, inspired by [3,4,5,6], we want to track the contours of all the speech articulators, i.e. the mandible, tongue, lips, larynx, epiglottis and velum. Unlike other works, we want to track each articulator independently of the others, because speech involves complex compensatory and coordinating gestures that would be hidden if the vocal tract were processed as a single piece [7].
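As one plausible starting point for such tracking (an illustrative sketch, not the architecture prescribed by the thesis), the following PyTorch code defines a miniature U-Net-style network [5] that predicts one soft mask per articulator, so each articulator is segmented independently of the others. Image size, channel counts and the number of levels are arbitrary assumptions.

    import torch
    import torch.nn as nn

    class SmallUNet(nn.Module):
        """Miniature U-Net-style network: one soft mask per articulator
        (jaw, tongue, lips, larynx, epiglottis, velum)."""
        def __init__(self, n_articulators=6):
            super().__init__()
            def block(cin, cout):
                return nn.Sequential(
                    nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())
            self.enc1, self.enc2 = block(1, 16), block(16, 32)
            self.pool = nn.MaxPool2d(2)
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
            self.dec = block(32, 16)            # 32 = 16 skip + 16 upsampled
            self.head = nn.Conv2d(16, n_articulators, 1)

        def forward(self, x):                   # x: (B, 1, H, W) MRI frame
            e1 = self.enc1(x)
            e2 = self.enc2(self.pool(e1))
            d = self.dec(torch.cat([e1, self.up(e2)], dim=1))
            return torch.sigmoid(self.head(d))  # (B, 6, H, W) soft masks

    frame = torch.randn(1, 1, 64, 64)           # placeholder rtMRI frame
    print(SmallUNet()(frame).shape)             # torch.Size([1, 6, 64, 64])

Contours can then be extracted from each mask separately, which preserves the compensatory gestures that a single whole-tract mask would hide [7].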

The first part of the work will consist of tracking the contours of all the articulators to obtain a complete geometric description of the vocal tract. One aspect to address is guaranteeing the geometrical and temporal consistency of all the contours, in order to prevent tracking from producing unrealistic shapes and gestures. The work will also consist of improving the training strategy to increase robustness against intra- and inter-speaker variability and against situations where several articulators are in contact.
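One simple way to express such consistency during training, sketched below under assumed tensor shapes (this is not the project's actual criterion), is to add two regularisation terms to the tracking loss: a temporal term penalising frame-to-frame jumps and a curvature term penalising jagged, anatomically implausible outlines.

    import torch

    def consistency_loss(contours, w_time=1.0, w_shape=1.0):
        """contours: (T, K, P, 2) tracked points -- T frames, K articulators,
        P points per contour. Returns a scalar regularisation term."""
        # Temporal smoothness: articulators move little between 20 ms frames.
        temporal = (contours[1:] - contours[:-1]).pow(2).mean()
        # Second difference along each contour ~ discrete curvature.
        curvature = (contours[..., 2:, :] - 2 * contours[..., 1:-1, :]
                     + contours[..., :-2, :]).pow(2).mean()
        return w_time * temporal + w_shape * curvature

    tracked = torch.randn(50, 6, 40, 2)   # 1 s at 50 frames/s, 6 articulators
    print(consistency_loss(tracked))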

The second and most important part of the work will be devoted to controlling the shape of the vocal tract. The idea is to develop a deep learning approach that determines the position of the articulators according to the phonemes to be articulated. The constraint is to identify the role of each articulator in sufficient detail to control its impact on the overall shape of the vocal tract, and to study coordination and compensation strategies between the articulators. Another question to be studied concerns the existence of a repertoire of articulatory gestures identified as such.
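A minimal sketch of what such a model could look like follows; all names, sizes and the choice of a bidirectional GRU are illustrative assumptions, not the thesis design. Frame-level phoneme labels go in, and separate output heads keep each articulator's trajectory identifiable, while the bidirectional recurrence gives the network access to upcoming phonemes, the context needed to model coarticulation.

    import torch
    import torch.nn as nn

    class PhonemeToArticulators(nn.Module):
        """Hypothetical sketch: frame-level phoneme labels in, one parameter
        trajectory per articulator out. Separate heads keep each
        articulator's contribution identifiable."""
        def __init__(self, n_phonemes=40, emb=64, hidden=128, n_params=6,
                     articulators=("jaw", "tongue", "lips", "larynx",
                                   "epiglottis", "velum")):
            super().__init__()
            self.embed = nn.Embedding(n_phonemes, emb)
            # A bidirectional RNN also sees upcoming phonemes, so the learned
            # mapping can anticipate future targets (coarticulation).
            self.rnn = nn.GRU(emb, hidden, batch_first=True,
                              bidirectional=True)
            self.heads = nn.ModuleDict(
                {a: nn.Linear(2 * hidden, n_params) for a in articulators})

        def forward(self, phoneme_ids):         # (B, T) frame-level labels
            h, _ = self.rnn(self.embed(phoneme_ids))
            return {a: head(h) for a, head in self.heads.items()}

    out = PhonemeToArticulators()(torch.randint(0, 40, (2, 50)))
    print(out["tongue"].shape)                  # torch.Size([2, 50, 6])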

The second input of the acoustic simulations is the glottal opening and vocal fold activity [8]. These two data streams will be fed into digital acoustic simulations [9] to verify the quality of the speech produced and to study the articulatory factors of expressive speech.
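To illustrate how a vocal tract shape determines the acoustics (a drastically simplified stand-in, not a reimplementation of the time-domain simulations of [9]), the sketch below computes the transfer function of a chain of lossless cylindrical tube sections with an ideal open termination at the lips; the tube geometry and constants are assumptions. A uniform 17 cm tube should show resonances near 500, 1500 and 2500 Hz, the classic schwa-like pattern.

    import numpy as np

    RHO, C = 1.2, 350.0  # air density (kg/m^3), speed of sound (m/s)

    def transfer_function(areas, lengths, freqs):
        """|U_lips / U_glottis| for a chain of lossless cylindrical tube
        sections, with an ideal open termination (P = 0) at the lips."""
        H = np.empty_like(freqs)
        for i, f in enumerate(freqs):
            k = 2 * np.pi * f / C            # wavenumber
            M = np.eye(2, dtype=complex)     # chain matrix, glottis -> lips
            for A, L in zip(areas, lengths):
                Z = RHO * C / A              # characteristic impedance
                M = M @ np.array([[np.cos(k * L), 1j * Z * np.sin(k * L)],
                                  [1j * np.sin(k * L) / Z, np.cos(k * L)]])
            H[i] = 1.0 / abs(M[1, 1])        # U_g = D * U_l when P_l = 0
        return H

    # Uniform 17 cm tube of 4 cm^2 cross-section, cut into 40 sections.
    areas, lengths = np.full(40, 4e-4), np.full(40, 0.17 / 40)
    freqs = np.linspace(50, 4000, 790)
    H = transfer_function(areas, lengths, freqs)
    peaks = freqs[1:-1][(H[1:-1] > H[:-2]) & (H[1:-1] > H[2:])]
    print(peaks)    # ~515, 1544, 2574 Hz with C = 350 m/s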

References

[1] B. J. Kröger, V. Graf-Borttscheller, A. Lowit (2008). Two- and Three-Dimensional Visual Articulatory Models for Pronunciation Training and for Treatment of Speech Disorders. Proc. of Interspeech 2008, Brisbane, Australia.

[2] Y. Laprie, J. Busset (2011). Construction and evaluation of an articulatory model of the vocal tract. Proc. of the 19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain.

[3] A. Jaumard-Hakoun, K. Xu, P. Roussel, G. Dreyfus, M. Stone and B. Denby (2015). Tongue contour extraction from ultrasound images based on deep neural network. Proc. of the International Congress of Phonetic Sciences, Glasgow.

[4] I. Fasel and J. Berry (2010). Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech. Proc. of the 20th ICPR, Istanbul.

[5] O. Ronneberger, P. Fischer and T. Brox (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9351, pp. 234-241.

[6] G. Litjens, T. Kooi et al. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42, pp. 60-88.

[7] S. Silva and A. Teixeira (2016). Quantitative systematic analysis of vocal tract data. Computer Speech & Language, 36, pp. 307-329. doi:10.1016/j.csl.2015.05.004

[8] Y. Laprie, B. Elie, A. Amelot and S. Maeda (2019). Glottal Opening Measurements in VCV and VCCV Sequences. Proceedings of the 23rd International Congress on Acoustics, Aachen, Germany, pp. 1810-1815.

[9] B. Elie, Y. Laprie (2016). Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink. Speech Communication, 82, pp. 85-96.

Required skills

deep learning, computer science, speech processing, applied mathematics
