[Post-doc/engineer] Real-time embedded machine listening


Loria / Inria Nancy – Grand Est research center


This position fits within the scope of the ANR project “LEAUDS”, which involves the Multispeech research team at Inria Nancy – Grand Est (https://team.inria.fr/multispeech/), the Synalp research team at Loria in Nancy (https://synalp.loria.fr/), the machine learning team at the Laboratoire d’Informatique, de Traitement de l’Information et des Systèmes in Rouen (https://www.litislab.fr/), and the company Netatmo in Paris (https://www.netatmo.com/).

Scientific context

As humans, we constantly rely on the sounds around us to get information about our environment (birds singing, a car passing by, the constant hum from a highway nearby…) and to get feedback about our actions (the noise of a door closing, the beeps from an ATM keyboard…). Being able to interpret environmental audio could be of great importance in numerous applications from video indexing and classification to noise pollution analysis in smart cities and speech communication in real environments. Therefore, environmental sound analysis has become increasingly popular, as indicated by the success of the DCASE Challenge series [1]. State-of-the-art approaches are able to classify coarse-grained acoustic scenes (in a bar, in a street, on a bus…) [2] or daily activities (cooking, talking…) [3], or to identify sequences of sound events [4] to some extent.

Many of these approaches could have practical applications if they were implemented in embedded devices. However, models tend to become increasingly complex and are far from meeting the computational and latency requirements of embedded targets. Some recent works have addressed model complexity, for example in Task 1 of the DCASE Challenge [1], but this is not the only problem faced when considering implementation on embedded devices.


A first step is to simplify the machine listening algorithms, for example by taking into account the specificities of the signals and the task, by introducing regularizations [5], or by applying task and model factorization [6]. Alternative approaches include pruning techniques [7] or knowledge distillation [8] to further reduce the complexity of the models used at inference. Most current machine listening approaches operate offline and assume that the whole signal is available, potentially introducing high processing latencies. A second step is therefore to systematically study to what extent this latency can be reduced to meet target criteria [9]. A final step towards implementation is to reduce the numerical complexity and to ensure adequacy with the numerical precision used on the targeted hardware [8, 10]. All the steps above must be addressed while ensuring that the performance of the considered machine listening algorithm is minimally affected.
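To make two of these compression steps concrete, the sketch below illustrates unstructured magnitude pruning and symmetric 8-bit linear quantization on a toy weight matrix. This is a minimal NumPy illustration of the general techniques cited above ([7], [8]), not code from the project; the function names and the 70% sparsity level are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense-layer weights, standing in for a trained machine listening model.
W = rng.standard_normal((64, 32)).astype(np.float32)

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

def quantize_int8(w):
    """Symmetric linear quantization of weights to signed 8-bit integers."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

W_pruned = magnitude_prune(W, sparsity=0.7)   # keep ~30% of the weights
q, scale = quantize_int8(W_pruned)            # 4x smaller than float32
W_restored = q.astype(np.float32) * scale     # dequantize for error check

print("non-zero fraction:", np.count_nonzero(W_pruned) / W.size)
print("max dequantization error:", np.max(np.abs(W_restored - W_pruned)))
```

In a real deployment the pruned, quantized weights would be fine-tuned to recover accuracy, and the integer arithmetic would be executed natively on the target hardware; the dequantization step here only serves to measure the rounding error introduced.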


Skills

  • Strong background in embedded software, machine learning, or audio signal processing
  • Excellent programming skills
  • Excellent English writing and speaking skills


Supervisors

Romain Serizel (https://members.loria.fr/RSerizel/) and Emmanuel Vincent (https://members.loria.fr/EVincent/)


References

[1] http://dcase.community/

[2] Mesaros, A., Heittola, T., & Virtanen, T. (2016). TUT database for acoustic scene classification and sound event detection. In 2016 24th European Signal Processing Conference (EUSIPCO).

[3] Dekkers, G., Lauwereins, S., Thoen, B., Adhana, M. W., Brouckxon, H., Van den Bergh, B., … & Karsmakers, P. (2017). The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In 2017 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).

[4] Turpault, N., Serizel, R., Salamon, J., & Shah, A. P. (2019). Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In 2019 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).

[5] Aydore, S., Thirion, B., & Varoquaux, G. (2019). Feature grouping as a stochastic regularizer for high-dimensional structured data. In 36th International Conference on Machine Learning (ICML).

[6] Sun, X., Gao, Z. F., Lu, Z. Y., Li, J., & Yan, Y. (2020). A model compression method with matrix product operators for speech enhancement. IEEE/ACM Transactions on Audio, Speech and Language Processing.

[7] He, Y., Zhang, X., & Sun, J. (2017). Channel pruning for accelerating very deep neural networks. In IEEE/CVF International Conference on Computer Vision (ICCV).

[8] Cerutti, G., Prasad, R., Brutti, A., & Farella, E. (2020). Compact recurrent neural networks for acoustic event detection on low-energy low-complexity platforms. IEEE Journal of Selected Topics in Signal Processing.

[9] Delebecque, L., Furnon, N., & Serizel, R. (2022). Towards an efficient computation of masks for multichannel speech enhancement. Pre-print: https://hal.archives-ouvertes.fr/hal-03604983

[10] Gontier, F., Lavandier, C., Aumond, P., Lagrange, M., & Petiot, J. F. (2019). Estimation of the perceived time of presence of sources in urban acoustic environments using deep learning techniques. Acta Acustica united with Acustica.