Sandipana Dowerah (Multispeech) will defend her thesis, entitled “Deep Learning-based Multichannel Speech Enhancement for Far-field Speaker Verification”, on Tuesday, May 30th, at 2pm in room C005.
Smart applications such as speaker verification, which confirms a user’s identity from their voice characteristics, have become essential for services like personal assistants and online banking. However, far-field (distant) speaker verification is often degraded by surrounding noise, which can severely distort the speech signal. Moreover, speech signals propagating over long distances are reflected by objects in the surrounding area, creating reverberation that further degrades signal quality. This PhD thesis explores deep learning-based multichannel speech enhancement techniques to improve the performance of speaker verification systems in realistic conditions. Multichannel speech enhancement aims to restore distorted speech using multiple microphones, and it has become crucial to many smart devices, which offer flexible and convenient platforms for speech applications.
Three novel approaches are proposed to improve the robustness of speaker verification systems in noisy and reverberant conditions. First, we integrate a deep neural network architecture with signal-processing techniques for speech enhancement, used as a pre-processing step for an x-vector-based speaker verification system. We examine the importance of applying such pre-processing during the enrollment phase, which has been largely overlooked in the literature. Experimental evaluation shows that pre-processing improves speaker verification performance when the enrollment files are processed in the same way as the test data and when test and enrollment occur within similar signal-to-noise ratio ranges. We then propose novel score-based diffusion probabilistic models for multichannel speech enhancement as a front-end to an ECAPA-TDNN speaker verification system, with particular emphasis on multichannel techniques: the time-frequency masks and multichannel filters are computed using diffusion probabilistic models. Since training the speech enhancement module in isolation often introduces artefacts and distortions that create a mismatch with the verification back-end, we propose joint optimization of both modules, which helps retain speaker information. Finally, we extend these approaches by jointly optimizing speech enhancement and speaker verification with and without a knowledge distillation loss. The knowledge distillation loss minimizes the distance between the speaker embeddings obtained from the proposed system and those obtained from clean speech signals, further improving speaker verification performance under different noise conditions.
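As a rough illustration of the knowledge distillation idea described above, the sketch below computes a cosine-distance loss between a speaker embedding extracted from enhanced speech (the “student”) and one extracted from the corresponding clean speech (the “teacher”). The function name and the choice of cosine distance are illustrative assumptions for this announcement, not the exact loss formulation used in the thesis.

```python
import numpy as np

def kd_embedding_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Hypothetical knowledge-distillation loss between speaker embeddings.

    student_emb: embedding from the enhanced (noisy, processed) signal.
    teacher_emb: embedding from the matching clean signal.
    Returns 1 - cosine similarity: 0 when the embeddings are perfectly
    aligned, approaching 2 when they point in opposite directions.
    """
    s = student_emb / np.linalg.norm(student_emb)
    t = teacher_emb / np.linalg.norm(teacher_emb)
    return 1.0 - float(np.dot(s, t))

# Identical embeddings incur (near-)zero distillation loss.
emb = np.array([0.5, 1.0, -0.25])
print(kd_embedding_loss(emb, emb))  # ~0.0
```

During joint training, such a term would typically be added, with a weighting factor, to the speaker verification loss, so that the enhancement front-end is pushed toward outputs whose embeddings resemble those of clean speech.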
Jury:
Sylvain Meignier, Professor, LIUM, Le Mans Université
Sylvain Marchand, Professor, IUT de La Rochelle
Nancy Bertin, Researcher, Oracle
Frédéric Sur, Professor, Loria, Université de Lorraine
Romain Serizel, Associate Professor (Maître de conférences), Université de Lorraine
Denis Jouvet, Former Senior Researcher (Directeur de Recherche), Inria Nancy – Grand Est