Vinicius Ribeiro (Multispeech) will defend his thesis, entitled “Deep Supervision of the Vocal Tract Shape for Articulatory Synthesis of Speech”, on Tuesday, December 5th at 2 pm in room A008.
Speech is a dynamic and non-stationary process that requires the interaction of several vocal tract articulators. The context in which a phoneme is articulated strongly influences its production, a phenomenon known as coarticulation. Articulatory speech synthesis and its counterpart, acoustic-to-articulatory inversion, hold many potential applications, such as L2 learning and speech therapy design. Moreover, these models are helpful for speech synthesis and automatic speech recognition since they create a link to the speech production process.
Modeling speech articulation presents challenges such as coarticulation, non-uniqueness, and speaker normalization. Historically, research focused on geometrical, mathematical, and statistical models to describe speech dynamics. Nevertheless, developing such models faces the difficulty of obtaining relevant articulatory data from actual speakers. Since the vocal tract is not observable from the outside, various invasive and non-invasive methods have been used to collect these data, including flesh-point tracking and medical imaging. The first attempts to extract articulatory data used X-rays, but this approach was abandoned because of the exposure to ionizing radiation. Electromagnetic articulography then grew rapidly in popularity thanks to its high sampling rate and its low cost compared to the alternatives. More recently, real-time magnetic resonance imaging (RT-MRI) has become the preferred acquisition method because it makes the vocal tract visible from the glottis to the lips.
This thesis explores the synthesis of speech articulation gestures corresponding to a sequence of phonemes. The primary objective is to design a model that predicts the temporal evolution of the vocal tract shape for each phoneme in the input sequence. Nevertheless, developing a realistic temporal model of the vocal tract is challenging. Therefore, we split the problem into three contributions.
The first is to obtain the vocal tract profile from the RT-MRI films by developing a robust method for segmenting the vocal tract articulators. The second contribution is to build an articulatory model that predicts the vocal tract shape for any phonetic input in French. The challenges are learning coarticulation and enforcing the places of articulation and the articulatory gestures that lead to the expected acoustics. The third contribution is the evaluation of the predicted shapes. We propose to quantify phonetic information with the aid of phoneme recognition. We measure the phonetic information retained by the mid-sagittal contours, and that reproduced by the vocal tract shape synthesizer, using the phoneme error rate and the recognizer’s internal representations.
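The phoneme error rate mentioned above is the standard edit-distance metric from speech recognition: the minimum number of substitutions, insertions, and deletions needed to turn the recognized phoneme sequence into the reference one, normalized by the reference length. A minimal sketch of this computation (an illustration, not the thesis's actual evaluation code) could look like:

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences,
    normalized by the length of the reference sequence."""
    n, m = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[m] / n

# One deletion against a four-phoneme reference: PER = 1/4
per = phoneme_error_rate(["a", "b", "a", "k"], ["a", "b", "k"])
```

In practice, toolkits aggregate the edit operations over a whole test set before normalizing, but the per-utterance computation is the same dynamic program shown here.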
This thesis points to significant directions in speech articulation synthesis. We observe that modeling the vocal tract shape directly, without an intermediate articulatory model, leads to the best and most natural results. Nevertheless, using an intermediate articulatory model makes it possible to introduce relevant phonetic knowledge into the model. Finally, we open a new direction for evaluating articulatory models through their phonetic representations.
The jury is composed of:
- Damien Lolive — ENSSAT Lannion, Université de Rennes, France
- Antoine Serrurier — RWTH Aachen University, Germany
- Anne Boyer — Université de Lorraine, France
- Eduardo Valle — University of Campinas, Brazil
- Alice Turk — University of Edinburgh, Scotland
- Pierre-André Vuissoz — Université de Lorraine, France
- Yves Laprie — Université de Lorraine, France