[PhD position F/M] Non-Gaussian models for deep learning-based audio signal processing

To apply, please follow the instructions here.

Location: Inria Nancy – Grand Est, team MULTISPEECH
Supervisors: Emmanuel Vincent (Senior Researcher, Inria) and Paul Magron (Researcher, Inria).

Context

Audio signal processing and machine listening systems have made considerable progress in recent years, notably thanks to the advent of deep learning. Such systems usually process a time-frequency representation of the data, such as a magnitude spectrogram, and model its structure using a deep neural network (DNN).
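
As an illustration, here is a minimal sketch of this typical front-end in Python, assuming PyTorch; the waveform, sampling rate, and STFT parameters are purely illustrative:

    import torch

    # Illustrative waveform: 1 second of noise standing in for an audio
    # signal sampled at 16 kHz.
    signal = torch.randn(16000)

    # Complex STFT, then magnitude; the phase is discarded at this point.
    spec = torch.stft(signal, n_fft=1024, hop_length=256,
                      window=torch.hann_window(1024), return_complex=True)
    magnitude = spec.abs()    # shape: (513 frequency bins, 63 frames)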

Generally speaking, these systems implicitly rely on the local Gaussian model [1], an elementary statistical model for the data. Although convenient to manipulate, this model builds upon several hypotheses which are limiting in practice: (i) circular symmetry, which boils down to discarding the phase information (i.e., the argument of the complex-valued time-frequency coefficients); (ii) independence of the coefficients, which ignores the inherent structure of audio signals (temporal dynamics, frequency dependencies); and (iii) Gaussian density, which is not observed in practice.
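
For reference, the model can be stated compactly; in the minimal LaTeX formulation below, the notation is assumed, with x_{ft} the complex time-frequency coefficient in frequency bin f and frame t, and v_{ft} its variance:

    % Local Gaussian model [1]: each coefficient x_{ft} is an independent,
    % zero-mean, circularly-symmetric complex Gaussian variable.
    \[
      x_{ft} \sim \mathcal{N}_c(0, v_{ft}),
      \qquad
      p(x_{ft}) = \frac{1}{\pi v_{ft}} \exp\left( -\frac{|x_{ft}|^2}{v_{ft}} \right).
    \]
    % The density depends on x_{ft} only through |x_{ft}|, so the phase is
    % uniformly distributed and carries no information (circular symmetry).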

Statistical audio signal modeling is an active research field. However, recent advances in this field are usually not leveraged in deep learning-based approaches, so their potential remains under-exploited. Besides, some of these advances are not yet mature enough to be fully deployed.

Therefore, the objective of this PhD is to design advanced statistical signal models for audio that overcome the limitations of the local Gaussian model, and to combine them with DNN-based spectrogram modeling. The developed approaches will be applied to audio source separation and speech enhancement.

Main activities

The main objectives of the PhD student will be:

  1. To develop structured statistical models for audio signals which alleviate the limitations of the local Gaussian model. In particular, the PhD student will focus on designing models that leverage properties originating from signal analysis, such as temporal continuity [2] or consistency of the time-frequency representation [3], in order to favor interpretable and meaningful models. For instance, alpha-stable distributions have been exploited in audio for their robustness [4]. Anisotropic models are an interesting research direction, since they overcome the circular symmetry assumption while enabling an interpretable parametrization of the statistical moments [5]. Finally, a careful design of the covariance matrix allows time and frequency dependencies to be incorporated explicitly [6].
  2. To combine these statistical models with DNNs. This raises several technical difficulties regarding the design of, e.g., the neural architecture, the loss function, and the inference algorithm. The student will exploit and adapt the formalism developed in Bayesian deep learning, notably the variational autoencoding framework [7], as well as the inference procedures developed in DNN-free non-Gaussian models [8]; a toy sketch is given after this list.
  3. To validate these methods experimentally on realistic sound datasets. To that end, the PhD student will use public datasets such as LibriMix (speech) and MUSDB (music), which are reference datasets for source separation and speech enhancement.
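
As a concrete illustration of objective 2, the sketch below shows one simple way to pair a DNN with the baseline local Gaussian model in PyTorch: a hypothetical recurrent network predicts the variances v_ft of a source from a magnitude spectrogram, and is trained with the corresponding negative log-likelihood. All names, dimensions, and architecture choices here are assumptions made for the sake of the example, not the method to be developed.

    import torch
    import torch.nn as nn

    class VarianceNet(nn.Module):
        """Hypothetical DNN mapping a magnitude spectrogram to variances v_ft."""
        def __init__(self, n_freq=513, n_hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_freq, n_hidden, batch_first=True)
            self.out = nn.Linear(n_hidden, n_freq)

        def forward(self, mag):              # mag: (batch, frames, freq)
            h, _ = self.rnn(mag)
            return torch.exp(self.out(h))    # exp ensures positive variances

    def gaussian_nll(x, v, eps=1e-8):
        # Negative log-likelihood of the local Gaussian model, up to an
        # additive constant: log v_ft + |x_ft|^2 / v_ft, averaged over bins.
        return torch.mean(torch.log(v + eps) + x.abs() ** 2 / (v + eps))

    # Toy usage: random complex tensors stand in for STFT coefficients.
    net = VarianceNet()
    x = torch.randn(4, 63, 513, dtype=torch.complex64)
    loss = gaussian_nll(x, net(x.abs()))
    loss.backward()

Replacing gaussian_nll with the likelihood of, e.g., an anisotropic or alpha-stable model is precisely where the difficulties mentioned above (loss design, inference) arise.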

The PhD student will disseminate their research results in international peer-reviewed journals and conferences. In order to promote reproducible research, these publications will be self-archived at each step of the publication lifecycle and made accessible through open-access repositories (e.g., arXiv, HAL). The code will be integrated into Asteroid, the reference software for source separation and speech enhancement developed by Multispeech.

Required skills

  • Master's or engineering degree in computer science, data science, signal processing, or machine learning.
  • Professional capacity in English (spoken, read, and written).
  • Programming experience in Python and in a deep learning framework (e.g., PyTorch).
  • Previous experience with and/or interest in speech and audio processing is a plus.

Working environment

The PhD student will join the Multispeech team of Inria, which is the largest French research group in the field of speech processing. They will benefit from the research environment and the expertise in audio signal processing and machine learning of the team, which includes many researchers, PhD students, post-docs, and software engineers working in this field.

Bibliography

[1] E. Vincent, M. Jafari, S. Abdallah, M. Plumbley, M. Davies, Probabilistic modeling paradigms for audio source separation, in Machine Audition: Principles, Algorithms and Systems, pp. 162–185, 2010.

[2] T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.

[3] J. Le Roux, N. Ono, S. Sagayama, Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction, Proc. SAPA, 2008.

[4] S. Leglaive, U. Şimşekli, A. Liutkus, R. Badeau and G. Richard, Alpha-stable multichannel audio source separation, Proc. IEEE ICASSP, 2017.

[5] P. Magron, R. Badeau, B. David, Phase-dependent anisotropic Gaussian model for audio source separation, Proc. IEEE ICASSP, 2017.

[6] M. Pariente, Implicit and explicit phase modeling in deep learning-based source separation, PhD thesis, Université de Lorraine, 2021.

[7] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, X. Alameda-Pineda, Dynamical variational autoencoders: A comprehensive review, Foundations and Trends in Machine Learning, vol. 15, no. 1–2, 2021.

[8] P. Magron, T. Virtanen, Complex ISNMF: a phase-aware model for monaural audio source separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 20–31, 2019.