[PhD Position] Coordination between articulatory gestures and vocal fold control for articulatory speech synthesis

CNRS/Loria Nancy, France

Yves.Laprie@loria.fr


Context

The production of speech requires a source signal – the vibration of the vocal folds or a turbulent noise somewhere in the vocal tract – and a system of resonant cavities – the vocal tract. The articulators of speech – the jaw, tongue, lips, larynx, soft palate and epiglottis – are used to modify the shape of the vocal tract, and thus its acoustic properties, including resonances.

Articulatory synthesis mimics this process by first generating the shape of the vocal tract from the sequence of phonemes to be pronounced, and then the acoustic signal by solving the aeroacoustic equations [2, 3]. Unlike other approaches to speech synthesis, which already offer very high quality, its main interest is to give control over the whole production process, beyond the acoustic signal alone. It becomes possible to explain the articulatory origin of phonetic contrasts, to act on the movements of the articulators (or even to block some of them), to modify the control parameters of the vocal folds, to adapt to a new speaker by modifying the size and shape of the articulators, and finally to access physical quantities (for example the pressure at any point of the vocal tract) without having to introduce sensors.

We have developed an approach to articulatory synthesis that generates the shape of the vocal tract with an articulatory model and then uses our aeroacoustic simulation to synthesize the speech signal.

However, the quality of the acoustic signal depends on how closely the geometrical shape of the artificial vocal tract matches the one realized by a human speaker, and on the coordination between the source parameters and those describing the vocal tract shape. So far, we have used an empirical approach that requires manual adjustments and is not optimal.

The goal of this thesis is to optimize the control of the geometrical shape at each instant of the synthesis and to develop an optimal coordination strategy between the source and the vocal tract.

Description of work

Two data streams feed the numerical simulation of the aeroacoustics in the vocal tract.

The first stream concerns the source that excites the vocal tract. The glottal opening determines voicing and the vibration frequency of the vocal folds; it also conditions the existence of a possible noise source created by a strong constriction inside the vocal tract, for example between the teeth and the tip of the tongue for the /s/ sound.
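As a purely illustrative aside, aeroacoustic models of frication commonly switch such a noise source on when the Reynolds number of the flow through a constriction exceeds a critical value. The Python sketch below shows this kind of test; the function names are ours, the physical constants are approximate, and the threshold re_crit is an assumed order of magnitude rather than a value used in the project.

```python
# Toy criterion for frication noise (illustrative, not the project's model):
# compare the Reynolds number at a constriction to a critical value.

AIR_DENSITY = 1.2        # kg/m^3, approximate for warm air
AIR_VISCOSITY = 1.9e-5   # Pa.s, dynamic viscosity of air, approximate

def reynolds_number(volume_velocity, area, height):
    """Re at a constriction: volume_velocity in m^3/s, cross-sectional
    area in m^2, constriction height in m (characteristic length)."""
    particle_velocity = volume_velocity / area
    return AIR_DENSITY * particle_velocity * height / AIR_VISCOSITY

def is_noise_source(volume_velocity, area, height, re_crit=1800.0):
    """Hypothetical switch: inject turbulence noise when Re > re_crit
    (re_crit is an assumed order of magnitude)."""
    return reynolds_number(volume_velocity, area, height) > re_crit

# Example: 300 cm^3/s of airflow through a 0.1 cm^2 constriction, 2 mm high
print(is_noise_source(volume_velocity=3e-4, area=1e-5, height=2e-3))  # True
```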

The second is the temporal evolution of the vocal tract geometry. Since the three-dimensional shape is difficult to obtain, it is approximated by the geometry of the vocal tract in the mid-sagittal plane, which corresponds to the contours of the different articulators. These contours can be predicted from the phonemes to be articulated or extracted from MRI images. Uncertainty arises from the MRI acquisition itself, which is not instantaneous: an “image” takes 20 ms of total acquisition time and actually corresponds to a thick slice (8 mm) with an integration effect across its thickness. The loss of the third dimension (perpendicular to the mid-sagittal plane) adds further uncertainty to the acoustic properties of the vocal tract.


The coordination of these two data streams plays a decisive role in the quality of the speech produced and will therefore be the subject of the first part of the work. For this purpose, we have at our disposal glottal opening data obtained by EPGG (electro-photo-glottography) at the LPP laboratory in Paris, with which we collaborate in this project [1]. The work will consist of developing a first glottal control scenario driven by the sequence of phonemes to be articulated, and then optimizing it with machine learning that exploits the EPGG data together with real-time MRI data providing the geometry of the vocal tract.
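A minimal sketch of what such a first rule-based scenario could look like, before any learning: each phoneme receives a target glottal opening (wide for voiceless consonants, nearly closed for voiced sounds) and the trajectory is interpolated between targets. Everything here is an assumption made for illustration, including the phoneme inventory, target values, durations, and function names; the actual scenario will be built and then optimized against the EPGG data.

```python
import numpy as np

# Illustrative rule-based glottal control: one opening target per phoneme,
# reached at the phoneme's temporal midpoint, linear interpolation between.

OPENING_TARGETS = {  # relative glottal opening: 0 = closed, 1 = fully abducted
    "a": 0.05, "i": 0.05, "u": 0.05,   # vowels: folds vibrate, glottis nearly closed
    "z": 0.15, "v": 0.15,              # voiced fricatives: slight opening
    "s": 0.90, "f": 0.90,              # voiceless fricatives: wide abduction
    "p": 0.80, "t": 0.80, "k": 0.80,   # voiceless stops
}

def glottal_opening_trajectory(phonemes, durations_ms, step_ms=5.0):
    """Piecewise-linear opening trajectory over the utterance."""
    midpoints, targets, t = [], [], 0.0
    for ph, dur in zip(phonemes, durations_ms):
        midpoints.append(t + dur / 2.0)   # target reached mid-phoneme
        targets.append(OPENING_TARGETS[ph])
        t += dur
    times = np.arange(0.0, t, step_ms)
    return times, np.interp(times, midpoints, targets)

# Example: /asa/ -- the opening should peak during the voiceless /s/
times, opening = glottal_opening_trajectory(["a", "s", "a"], [120, 150, 120])
print(times[opening.argmax()], opening.max())  # peak near the middle of /s/
```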

The second part of the work will concern the optimization of the vocal tract geometry, which directly influences the acoustic characteristics of the synthesized signal, both in terms of quality and intelligibility. Real-time MRI data provide grayscale images whose contours are extracted with automatic tracking techniques. The performance of these techniques has improved considerably in recent years thanks to deep learning, and the extracted contours are generally of very good quality [4]. There remain two weak points, however. The first is related to the MRI acquisition technique, which requires 20 ms to acquire an image corresponding to an 8 mm thick slice, so that the image does not correspond exactly to the instantaneous shape of the vocal tract. The second is the transition from the mid-sagittal slice to the volume, which relies on a simplistic transformation to recover the third dimension.
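To make this “simplistic transformation” concrete: a classic choice in the literature, of the kind analyzed in [5], is a power-law (alpha-beta) mapping from mid-sagittal cross-distances to cross-sectional areas. The sketch below is only an illustration; the coefficient values are placeholders, and in reality alpha and beta vary along the vocal tract and across speakers, which is precisely why this mapping is a weak point.

```python
import numpy as np

# Classic power-law sagittal-to-area mapping A = alpha * d**beta
# (coefficients are illustrative placeholders; cf. [5]).

def sagittal_to_area(cross_distances_cm, alpha=1.5, beta=1.4):
    """Map mid-sagittal cross-distances d (cm) to areas A (cm^2)."""
    d = np.asarray(cross_distances_cm, dtype=float)
    return alpha * np.power(d, beta)

# Example: a crude 8-section area function from mid-sagittal distances (cm)
distances = [0.4, 0.9, 1.5, 2.0, 1.8, 1.2, 0.7, 0.5]
print(np.round(sagittal_to_area(distances), 2))
```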

Several levels of optimization from coarse to fine will be studied:

– position of critical articulators (e.g. the position of the tongue tip relative to the teeth for the fricative /s/) to ensure the essential acoustic properties,

– the area function (the cross-sectional area transverse to wave propagation along the vocal tract),

– acoustic targets, through fine modification of the area function (see the sketch after this list).
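As a toy illustration of the link between the area function and acoustic targets such as formants, the code below models the vocal tract as a concatenation of lossless tubes, multiplies their chain (ABCD) matrices, and reads resonances off the peaks of the volume-velocity transfer function. This is a textbook simplification under idealized boundary conditions, not the project's aeroacoustic simulation [2].

```python
import numpy as np

# Lossless concatenated-tube model: one chain (ABCD) matrix per section,
# ideal open termination (zero pressure) at the lips.

C_SOUND = 350.0   # speed of sound in warm, humid air (m/s), approximate
RHO = 1.14        # air density (kg/m^3), approximate

def transfer_function(areas_cm2, lengths_cm, freqs_hz):
    """|U_lips / U_glottis| of the tube concatenation at each frequency."""
    gain = np.empty_like(freqs_hz)
    for n, f in enumerate(freqs_hz):
        k = 2.0 * np.pi * f / C_SOUND          # wavenumber
        m = np.eye(2, dtype=complex)
        for a_cm2, l_cm in zip(areas_cm2, lengths_cm):
            a, l = a_cm2 * 1e-4, l_cm * 1e-2   # convert to SI units
            z = RHO * C_SOUND / a              # characteristic impedance
            m = m @ np.array([[np.cos(k * l), 1j * z * np.sin(k * l)],
                              [1j * np.sin(k * l) / z, np.cos(k * l)]])
        gain[n] = 1.0 / abs(m[1, 1])  # U_lips/U_glottis = 1/D when p_lips = 0
    return gain

# Example: a uniform 17.5 cm tube (8 sections of 4 cm^2)
freqs = np.arange(50.0, 4000.0, 10.0)
gain = transfer_function([4.0] * 8, [17.5 / 8] * 8, freqs)
peaks = freqs[1:-1][(gain[1:-1] > gain[:-2]) & (gain[1:-1] > gain[2:])]
print(peaks)  # near the quarter-wave resonances 500, 1500, 2500, 3500 Hz
```

For a uniform tube the peaks fall near the odd quarter-wavelength resonances; any fine modification of the areas shifts them, which is exactly the handle the third optimization level would exploit.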

These different optimizations, which will make extensive use of deep learning, will exploit phonetic knowledge, dynamic MRI images acquired in planes perpendicular to the mid-sagittal plane, and dynamic MRI image data supplemented by the acoustic signal.

The solutions explored so far, notably for the prediction of the area function [5], remain largely insufficient. This PhD project should lead to significant progress in articulatory synthesis, and the development of optimal control strategies for articulatory synthesis would be a remarkable achievement at the international level.

Environment

This thesis is part of an ANR project; it will be conducted jointly by the Loria laboratory (Inria MultiSpeech team) and the IADI laboratory (INSERM U1254), which have been working together for several years on speech production and vocal tract imaging.

In particular, this collaboration will allow us to use the IADI laboratory's real-time two-dimensional MRI acquisition system. This system, unique in France, can image the vocal tract at 50 frames per second in any plane orientation, which is particularly interesting for recovering the area function.

References

  1. Benjamin Elie, Angelique Amelot, Yves Laprie, Shinji Maeda. Glottal Opening Measurements in VCV and VCCV Sequences. ICA 2019 – 23rd International Congress on Acoustics, Aachen, Germany, September 2019. ⟨hal-02180626⟩
  2. Benjamin Elie and Yves Laprie. Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink. Speech Communication, 82, pp. 85–96 (2016). https://hal.archives-ouvertes.fr/hal-01199792v3
  3. Benjamin Elie and Yves Laprie. Copy-synthesis of phrase-level utterances. EUSIPCO, Budapest, 2016. https://hal.archives-ouvertes.fr/hal-01278462
  4. Karyna Isaieva, Yves Laprie, Freddy Odille, Ioannis Douros, Jacques Felblinger, et al. Measurement of Tongue Tip Velocity from Real-Time MRI and Phase-Contrast Cine-MRI in Consonant Production. Journal of Imaging, 6(5), 31 (2020). ⟨10.3390/jimaging6050031⟩ ⟨hal-02923466⟩
  5. Richard S. McGowan and Michel T.-T. Jackson. Analyses of vocal tract cross-distance to area mapping: An investigation of a set of vowel images. Journal of the Acoustical Society of America, 131, pp. 424–434 (2012). https://doi.org/10.1121/1.3665988


Application

Master's degree in computer science or applied mathematics

Good knowledge of applied mathematics or physics is mandatory, since the project requires interacting with numerical simulations of aeroacoustics.

Send a CV to Yves.Laprie@loria.fr
