[PhD Thesis MULTISPEECH] Generalized Multichannel Speech Enhancement with End-to-end Deep Learning

Context

Speech technologies have undergone spectacular transformations in recent years. The multiplication of devices equipped with one or more microphones in our daily environment (hearing aids, smartphones, home assistants, robots, teleconferencing systems…) has created growing needs for systems that process and recognize speech in challenging real-world conditions, and it generates ever-increasing public and industrial interest [1]. From a scientific standpoint, the field has been revolutionized by deep learning methodologies, which now pervade all classical audio signal-processing subtasks: source separation, de-noising, de-reverberation, echo cancelation, de-clipping, etc.

Amongst these tasks, single-channel speech separation, i.e., recovering the voices of two or more speakers from a mono signal, is perhaps the one for which the most dramatic performance leaps have recently been obtained. Since 2018, the state of the art for this task has been led by the TasNet framework [2], which casts it as an end-to-end supervised learning problem using a three-stage architecture. The first stage transforms the input time-domain signal into a 2D representation akin to a spectrogram through a set of learned filters (a filterbank); the second stage estimates a mask assigning each entry of this representation to a distinct source via a convolutional or recurrent neural network; and the third stage maps the masked representation back to the time domain through another learned filterbank. While this approach yields impressive performance in controlled settings, recent studies have shown that it degrades in the presence of noise, reverberation, or speakers with similar voices [3,4,5]. Moreover, its performance in multichannel settings is far less impressive, and it has not yet been applied to echo cancelation or de-clipping. This can be attributed to the generic black-box architecture of the neural network, which prevents it from estimating and applying the long time-invariant filters required for multichannel separation, de-reverberation, or echo cancelation.
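
To make this three-stage pipeline concrete, the following minimal PyTorch sketch shows a learned analysis filterbank (encoder), a mask-estimation network, and a learned synthesis filterbank (decoder). Module sizes and the small convolutional mask estimator are illustrative assumptions chosen for readability, not the configuration used in [2].

import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Minimal sketch of a TasNet-style encoder/masker/decoder pipeline.
    Sizes and the masker network are illustrative, not those of [2]."""

    def __init__(self, n_src=2, n_filters=256, kernel_size=16, stride=8):
        super().__init__()
        self.n_src = n_src
        # Stage 1: learned analysis filterbank (waveform -> 2D representation).
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Stage 2: mask estimator; a small conv stack stands in for the
        # temporal convolutional or recurrent network used in practice.
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_src * n_filters, 3, padding=1),
        )
        # Stage 3: learned synthesis filterbank (masked representation -> waveform).
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, wav):
        # wav: (batch, 1, time)
        tf = torch.relu(self.encoder(wav))                 # (batch, filters, frames)
        masks = torch.sigmoid(self.masker(tf))             # (batch, n_src*filters, frames)
        masks = masks.view(wav.size(0), self.n_src, -1, tf.size(-1))
        masked = masks * tf.unsqueeze(1)                   # one masked copy per source
        # Decode each source back to the time domain.
        srcs = [self.decoder(masked[:, s]) for s in range(self.n_src)]
        return torch.stack(srcs, dim=1)                    # (batch, n_src, 1, time)

if __name__ == "__main__":
    model = TinyTasNet()
    mixture = torch.randn(4, 1, 16000)   # 1 s of audio at 16 kHz
    print(model(mixture).shape)          # torch.Size([4, 2, 1, 16000])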

Mission

The goal of this PhD thesis is to overcome the current limitations of end-to-end speech separation methods by designing a unified modular framework for multichannel speech enhancement. Current methods only operate on single-channel input, and considerable improvement could be gained by leveraging the spatial features captured by a microphone array. However, this comes with the challenge of modeling real-world conditions in which the microphone array calibration may be only partially known, while the spatial properties of desired sources and interferers are completely unknown, potentially time-varying, and distorted by the effects of early acoustic reflections, near-field propagation, or scattered and non-isotropic radiation. To tackle this, a new family of neural network architectures is needed that leverages acoustic signal processing expertise and incorporates trainable submodules dedicated to spatial processing, array calibration, source separation, de-reverberation, de-noising, and echo cancelation. To do so, we will take inspiration from the recent attempt by our team in [6], but generalize it to other tasks and replace its conventional local Gaussian models and spectrograms with end-to-end learned counterparts.

Main Activities

This ambitious mission will be articulated around the following research tasks:

  1. Investigation of different filterbank designs, including parameterized filters, multiple time-scale representations, multi-directional spatial filtering, and 3D representations.
  2. Incorporation of de-reverberation and echo cancelation by generalizing entry-wise masking to local deconvolution (see the sketch after this list).
  3. Incorporation of unfolded iterations between spatial and spectral filtering.
  4. Investigation of joint non-linear spectro-spatial filtering.
  5. Development of conditional architectures integrating microphone-array calibration, possibly known source locations, or loudspeaker outputs for echo cancelation.
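
As a purely illustrative reading of task 2, the sketch below contrasts entry-wise masking with one possible form of local deconvolution, in which each entry of the learned representation is replaced by a short per-entry FIR filter applied across past frames. The helper functions and tensor layout are hypothetical assumptions, not the formulation to be developed in the thesis.

import torch
import torch.nn.functional as F

def entrywise_mask(tf, mask):
    """Entry-wise masking: one scalar gain per (filter, frame) entry."""
    return mask * tf  # both (batch, filters, frames)

def local_deconvolution(tf, taps):
    """Hypothetical local deconvolution: each output entry is a weighted sum of
    the current and K-1 previous frames of the same filter channel, with weights
    predicted per entry. With K = 1 this reduces to entry-wise masking.

    tf:   (batch, filters, frames)
    taps: (batch, filters, frames, K) filter coefficients, e.g. from a network.
    """
    K = taps.size(-1)
    # Gather the K most recent frames for every position (causal padding).
    padded = F.pad(tf, (K - 1, 0))             # (batch, filters, frames + K - 1)
    frames = padded.unfold(2, K, 1)            # (batch, filters, frames, K)
    # Flip so that taps[..., 0] weights the current frame, taps[..., k] the k-th past frame.
    return (taps * frames.flip(-1)).sum(-1)    # FIR filtering along frames

if __name__ == "__main__":
    tf = torch.randn(2, 256, 100)
    mask = torch.rand(2, 256, 100)
    taps = torch.randn(2, 256, 100, 5)
    print(entrywise_mask(tf, mask).shape)       # torch.Size([2, 256, 100])
    print(local_deconvolution(tf, taps).shape)  # torch.Size([2, 256, 100])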

The PhD candidate will benefit from the strong expertise of the MULTISPEECH team in deep learning, audio, speech, and acoustic signal processing. He/she will be able to build on the Python library Asteroid [7] for learning-based audio source separation research, developed within the team (see the sketch below). He/she will be co-supervised by Antoine Deleforge and Emmanuel Vincent.
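
As an indication of the intended tooling, the following short sketch shows a typical Asteroid workflow, assuming the ConvTasNet model and the permutation-invariant SI-SDR loss wrapper exposed by the library; batch and signal sizes are arbitrary examples.

import torch
from asteroid.models import ConvTasNet
from asteroid.losses import PITLossWrapper, pairwise_neg_sisdr

# A 2-speaker Conv-TasNet with Asteroid's default learned filterbank.
model = ConvTasNet(n_src=2)
# Permutation-invariant negative SI-SDR training objective.
loss_func = PITLossWrapper(pairwise_neg_sisdr, pit_from="pw_mtx")

mixture = torch.randn(4, 16000)      # batch of 1 s mixtures at 16 kHz
targets = torch.randn(4, 2, 16000)   # the two clean reference sources
estimates = model(mixture)           # (batch, n_src, time)
loss = loss_func(estimates, targets)
loss.backward()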

Skills Required

  • Master's degree in computer science, machine learning, or signal processing
  • Strong programming skills in Python
  • Experience with a deep learning library (PyTorch, TensorFlow, Keras, etc.)
  • Experience in audio signal processing

Contact to Apply

Antoine (dot) Deleforge (at) Inria (dot) fr

Bibliography

[1] Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot (Eds.). Audio source separation and speech enhancement. Wiley, 2018.

[2] Yi Luo and Nima Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[3] Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux. “WHAM!: Extending speech separation to noisy environments,” arXiv preprint arXiv:1907.01160 (2019).

[4] Jens Heitkaemper, Darius Jakobeit, Christoph Boeddeker, Lukas Drude, and Reinhold Haeb-Umbach, “Demystifying TasNet: A dissecting approach,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[5] David Ditter and Timo Gerkmann, “Influence of Speaker-Specific Parameters on Speech Separation Systems,” in Proc. Interspeech, 2019.

[6] Guillaume Carbajal, Romain Serizel, Emmanuel Vincent, and Éric Humbert, “Joint DNN-based multichannel reduction of acoustic echo, reverberation and noise”, submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing (2019). https://hal.inria.fr/hal-02372579v2

[7] Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olivera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, Ariel Frank, Antoine Deleforge, and Emmanuel Vincent, “Asteroid: the PyTorch-based audio source separation toolkit for researchers,” submitted to INTERSPEECH, 2020.
