Thesis defense: Nasser-Eddine Monir (Multispeech team)
On 22 May 2026, Nasser-Eddine Monir (Multispeech) will defend his thesis, entitled
"Phoneme-Level Evaluation and Training Losses for Multichannel Speech Enhancement"
Supervisors:
- Romain SERIZEL (Director), Full Professor, Loria, Université de Lorraine
- Paul MAGRON (Co-director), Research Scientist, Inria Centre at Université de Lorraine
Reviewers:
- Simone GRAETZER, Senior Research Scientist, University of Salford, UK
- Richard MARXER, Full Professor, LIS, Université de Toulon, France
Examiners:
- Joël DUCOURNEAU (President), Full Professor, LEMTA, Université de Lorraine
- Tobias MAY, Associate Professor, CAHR, Danmarks Tekniske Universitet, Denmark
- Dorothée ARZOUNIAN, Research Scientist, Institut Pasteur, France
Speech communication in complex acoustic environments remains a significant challenge, particularly for hearing-impaired individuals and automatic speech recognition (ASR) systems. While deep learning has substantially advanced multichannel speech enhancement, most existing frameworks rely on global, utterance-level optimization criteria. This thesis addresses the limitations of such approaches by explicitly accounting for the structured and non-uniform nature of speech signals across time, frequency, and phoneme categories.
The first part of this work introduces a phoneme-level evaluation framework to characterize enhancement performance beyond traditional metrics such as the signal-to-distortion ratio. By analyzing performance across phonetic classes, we demonstrate that global metrics are often dominated by high-energy, stationary segments (such as vowels), effectively masking significant degradation in perceptually critical but low-energy transient units (such as plosives and fricatives). This analysis is further extended to investigate speaker-dependent variability, revealing that speaker gender significantly influences enhancement behavior at the phoneme scale.
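The idea of scoring enhancement per phonetic class rather than per utterance can be sketched as follows. This is an illustrative sketch only, not the thesis's actual pipeline: the `si_sdr` helper, the `(start, end, class)` alignment format, and the toy signals are assumptions made for the example.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR (in dB) between a reference and an estimate."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled projection onto the reference
    noise = estimate - target           # residual distortion
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def phoneme_level_si_sdr(reference, estimate, alignment, sr=16000):
    """Average SI-SDR per phoneme class, given a list of
    (start_sec, end_sec, phoneme_class) alignment entries."""
    scores = {}
    for start, end, cls in alignment:
        seg = slice(int(start * sr), int(end * sr))
        scores.setdefault(cls, []).append(si_sdr(reference[seg], estimate[seg]))
    return {cls: float(np.mean(vals)) for cls, vals in scores.items()}

# Toy example: a clean tone, a lightly noisy "enhanced" version,
# and a made-up two-segment alignment.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 220 * t)
enhanced = clean + 0.1 * rng.standard_normal(clean.shape)
alignment = [(0.0, 0.5, "vowel"), (0.5, 1.0, "plosive")]
scores = phoneme_level_si_sdr(clean, enhanced, alignment)
```

Grouping scores by class in this way is what exposes the effect described above: a single utterance-level SI-SDR would be dominated by the long, high-energy segments, while the per-class dictionary keeps low-energy units visible.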
Building on these findings, the second part of the thesis focuses on the design of phoneme-aware training objectives. We first explore frequency-weighted loss functions that emphasize spectral regions and time–frequency bins dominated by interference. Results show that adaptive weighting leads to improved preservation of spectral cues, particularly for consonants. Finally, we propose a structured gated weighting framework that integrates speech presence, local speech-noise competition, and transient spectral structure into the optimization process. Evaluation across signal-level, phonetic, and ASR-based metrics confirms that these loss functions lead to an enhancement behavior that is more closely aligned with improved speech recognition performance and spectral reconstruction.
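A minimal sketch of a loss that up-weights time-frequency bins dominated by interference might look like the following. The framing parameters, the `beta` weighting strength, and the L1 spectral distance are all assumptions for illustration; the thesis's actual gated weighting scheme is richer (it also gates on speech presence and transient structure).

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude STFT via framed real FFT with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def interference_weighted_loss(clean, noise, estimate, beta=2.0, eps=1e-8):
    """Spectral L1 loss where each time-frequency bin is up-weighted by
    how strongly the interference dominates the target in that bin."""
    S, N, Y = (stft_mag(s) for s in (clean, noise, estimate))
    weight = 1.0 + beta * N / (S + N + eps)   # ranges over [1, 1 + beta]
    return float(np.mean(weight * np.abs(Y - S)))

# Toy usage: a perfect estimate incurs zero loss; adding the noise back does not.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
noise = 0.5 * rng.standard_normal(4096)
loss_perfect = interference_weighted_loss(clean, noise, clean)
loss_noisy = interference_weighted_loss(clean, noise, clean + noise)
```

The design point the example illustrates: a uniform spectral loss treats all bins equally, so errors in quiet, noise-dominated regions barely move the objective; the multiplicative weight makes exactly those bins more expensive to get wrong.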
Overall, this thesis demonstrates that incorporating phonetic and spectro-temporal structure into both evaluation and training is essential for developing speech enhancement systems that better preserve the information relevant to human and automatic speech recognition.
Keywords: Multichannel speech enhancement, speech intelligibility, hearing aids, phoneme-based evaluation, loss functions.

