PhD proposal: Deep learning for sound scene analysis in real environments

Position type: PhD position

Research theme: Perception, Cognition, Interaction

Project-team: MULTISPEECH

Supervision and contact: Romain Serizel and Emmanuel Vincent

Keywords: deep learning, environmental sound analysis

Project description

Scientific Context: We are constantly surrounded by a complex audio stream that carries information about our environment. Hearing is a privileged way to detect and identify events that may require quick action (an ambulance siren, a baby crying, etc.). Indeed, audition offers several advantages over vision: it allows for omnidirectional detection, up to a few tens of meters and independently of lighting conditions. For these reasons, automatic audio analysis has become increasingly popular over the past five years [1]. Yet most work has focused on controlled scenarios, and deploying automatic audio analysis systems in the real world still raises several issues: the variability of the sounds associated with each event, signal degradation due to acoustic propagation in far-field conditions or to overlapping events, and constraints on the location and quality of the microphones. Current approaches do not fully address these problems and therefore quickly become unusable in real conditions.

Missions: The goal of this PhD is to design an automatic sound scene analysis system based on deep learning [2] that is robust to the variability and degradations induced by real conditions. A first research axis consists in starting from an initial system trained, for example, on Audio Set [3], and simulating degradations in order to increase the variability and the amount of training data. We recently proposed an algorithm to automatically optimize this process, which could be applied to sound scene analysis [4]. A second research axis is to exploit multiple microphones distributed over the environment, forming a wireless ad-hoc sensor network. Such networks have been widely studied from a signal processing perspective [5]. We propose to exploit them within a deep learning framework in order to perform multi-view learning [6]. The goal is then to design an algorithm that allows each node of the network to refine its perception of the sound scene and to track moving sources based on the information exchanged with neighboring nodes. The resulting system will be evaluated on real urban sound scenes.
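To give a flavor of the first research axis, the sketch below illustrates what "simulating degradations" of training data might look like in its simplest form. This is a minimal, hypothetical illustration (not the method of [4]): it degrades an audio clip with additive noise scaled to the clip's energy and a random gain change mimicking varying source-to-microphone distance. The function name and parameters are assumptions for the example.

```python
import numpy as np

def augment_with_degradations(clip, noise_level=0.05, gain_db_range=(-6, 6), rng=None):
    """Return a degraded copy of an audio clip (1-D float array).

    Simulates two simple real-world degradations:
    - additive background noise at a level relative to the clip's RMS energy,
    - a random gain change mimicking variable microphone distance/quality.
    """
    rng = np.random.default_rng(rng)
    # Additive white noise scaled to the clip's RMS energy.
    rms = np.sqrt(np.mean(clip ** 2)) + 1e-12
    noisy = clip + noise_level * rms * rng.standard_normal(clip.shape)
    # Random gain in dB to mimic varying source-microphone distance.
    gain_db = rng.uniform(*gain_db_range)
    return noisy * 10.0 ** (gain_db / 20.0)

# Example: augment a synthetic 1-second "event" at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
augmented = augment_with_degradations(clip, rng=0)
print(augmented.shape)  # (16000,)
```

In practice, realistic degradations would also include reverberation (convolution with measured room impulse responses) and overlapping events, and the augmentation parameters themselves can be optimized automatically, as in [4].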

Skills and profile:

  • MSc in computer science, machine learning, or signal processing;
  • experience with the Python programming language;
  • experience with deep learning toolkits is a plus.

The required documents for applying are the following:

  • a CV;
  • a motivation letter;
  • your degree certificates and transcripts for your Bachelor's and Master's degrees (or for the last 5 years, if not applicable);
  • your Master's thesis (or equivalent) if it is already completed, or a description of the work in progress otherwise;
  • all your publications, if any (it is not expected that you have any);
  • at least one recommendation letter from the person who supervises (or supervised) your Master's thesis (or research project or internship); you can also send at most two other recommendation letters.
    The recommendation letter(s) should be sent directly by their author to the prospective PhD advisor.

All the documents should be sent as at most two PDF files: one file should contain the publications, if any; the other should contain all the other documents. These two files should be sent to your prospective PhD advisor.



[2] Deng, L., & Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.

[3] Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., … & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP.

[4] Sivasankaran, S., Vincent, E., & Illina, I. (2017). Discriminative importance weighting of augmented training data for acoustic model training. In Proc. ICASSP.

[5] Bertrand, A. (2011). Applications and trends in wireless acoustic sensor networks: a signal processing perspective. In Proc. SCVT.

[6] Wang, W., Arora, R., Livescu, K., & Bilmes, J. A. (2015). On deep multi-view representation learning. In Proc. ICML.


