PhD proposal: Deep-learning-based noise reduction approach for ad-hoc microphone arrays

Position type: PhD position

Research theme: Perception, Cognition, Interaction

Project-team: MULTISPEECH

Supervision and contact: Romain Serizel (romain.serizel@loria.fr) and Emmanuel Vincent (emmanuel.vincent@inria.fr)

Keywords: Speech enhancement, signal processing, deep learning, microphone arrays

Project description

Scientific Context: Speech is one of the most intuitive means of communication between humans. Since the early 2010s, with the emergence of reliable end-user voice applications, speech has even become one of the preferred ways of interacting with mobile devices and, increasingly, with the home. However, most applications based on speech communication rely on the assumption that a “clean” version of the speech is available. This is rarely true in real-life scenarios, where speech is generally corrupted by noise that can severely degrade communication. One solution to this noise problem is to apply so-called speech enhancement techniques, which aim at extracting the speech component from a noisy speech mixture. In particular, multichannel approaches have attracted a lot of attention over the years, mainly because they outperform single-channel approaches in many respects. Yet traditional microphone arrays have limitations, in particular due to space constraints, and ad-hoc microphone arrays composed of a set of wireless microphone nodes have recently proven to be a viable alternative.

Missions: The goal of this thesis is to generalize the recent improvements in speech enhancement obtained with deep learning techniques [1] to the case of ad-hoc microphone arrays. Current techniques are mostly limited to a single channel [2, 3] or rely at some point on standard beamforming techniques [4, 5] or averaging [6] in order to produce a single-channel input to the deep network. These approaches therefore depend on centralized processing at some stage and on assumptions about the microphone array topology. Their extension to ad-hoc arrays, where the array topology is unconstrained and can vary over time and where distributed processing is usually preferred, is consequently not obvious. Reformulating multichannel speech enhancement as a deep learning problem that takes multichannel audio as input, and proposing distributed and online learning methods, should extend the applicability of deep-learning-based speech enhancement to ad-hoc arrays and improve performance compared to state-of-the-art approaches [7].
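For illustration, current single-channel techniques such as [2] typically train a network to predict a time-frequency mask from noisy magnitude-spectrogram frames. The sketch below (in Python with PyTorch) shows this baseline setting that the thesis aims to go beyond; all layer sizes, shapes, and variable names are illustrative assumptions, not part of the project.

    # Minimal sketch of single-channel mask-based speech enhancement
    # in the spirit of [2]: a feed-forward network maps noisy magnitude-
    # spectrogram frames to a time-frequency mask. All sizes and names
    # below are illustrative assumptions.
    import torch
    import torch.nn as nn

    n_freq = 257  # e.g. 512-point STFT -> 257 frequency bins

    net = nn.Sequential(
        nn.Linear(n_freq, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, n_freq), nn.Sigmoid(),  # mask values in [0, 1]
    )

    # Placeholder data: noisy magnitude frames and ideal-ratio-mask targets
    noisy = torch.rand(64, n_freq)
    irm_target = torch.rand(64, n_freq)

    # One supervised training step (mean squared error on the mask)
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(noisy), irm_target)
    loss.backward()
    opt.step()

    # Enhancement: apply the estimated mask to the noisy magnitudes
    # (the mixture phase would be reused for resynthesis)
    with torch.no_grad():
        enhanced = net(noisy) * noisy

Extending such a model to ad-hoc arrays would require handling a varying, unconstrained number of input channels and training in a distributed, online fashion, which is precisely the open problem described above.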

Skills and profile:

  • MSc in computer science, machine learning, or signal processing;
  • experience with the Python programming language;
  • experience with deep learning toolkits is a plus.

The following documents are required to apply:

  • CV;
  • a motivation letter;
  • your degree certificates and transcripts for Bachelor and Master (or for the last 5 years if not applicable);
  • your Master's thesis (or equivalent) if it is already completed, or a description of the work in progress otherwise;
  • all your publications, if any (it is not expected that you have any);
  • at least one recommendation letter from the person who supervises or supervised your Master's thesis (or research project or internship); you may also send up to two additional recommendation letters.
    Recommendation letters should be sent directly by their authors to the prospective PhD advisor.

All the documents should be sent as at most two PDF files: one containing the publications, if any, and the other containing all the remaining documents. These two files should be sent to your prospective PhD advisor.


Bibliography

[1] Deng, L., & Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.

[2] Wang, Y., Narayanan, A., & Wang, D. (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 1849-1858.

[3] Chen, J., Wang, Y., & Wang, D. (2015). Noise perturbation improves supervised speech separation. In International Conference on Latent Variable Analysis and Signal Separation (pp. 83-90).

[4] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation (pp. 91-99).

[5] Pfeifenberger, L., Schrank, T., Zöhrer, M., Hagmüller, M., & Pernkopf, F. (2015). Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 452-459).

[6] Nugraha, A. A., Liutkus, A., & Vincent, E. (2016). Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(10), 1652-1664.

[7] Markovich-Golan, S., Bertrand, A., Moonen, M., & Gannot, S. (2015). Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Processing, 107, 4-20.
