Salient sound extraction using deep neural networks predicting complex masks

Bibliographic Details
Published in	Signal Processing Algorithms, Architectures, Arrangements, and Applications Conference proceedings, pp. 166-171
Main Authors	Grzywalski, Tomasz; Botteldooren, Dick; Song, Yanjue; Madhu, Nilesh
Format	Conference Proceeding
Language	English
Published	Division of Signal Processing and Electronic Systems, Poznan University of Technology (DSPES PUT), 25.09.2024
Subjects	Artificial neural networks; deep neural networks; Image analysis; Location awareness; Predictive models; Recording; Research initiatives; salient event detection; salient sound extraction; Signal processing algorithms; Speech enhancement; Time-frequency analysis; Training
ISSN	2326-0319
DOI	10.23919/SPA61993.2024.10715626

More Information
Summary:	This work addresses the problem of extracting sounds that are unexpected in an audio stream and stand out because of their spectrotemporal characteristics. In human auditory scene analysis, such sounds are referred to as (sensory) salient. Previous research initiatives are mostly limited to detecting the presence of salient sounds and localizing them in time within the signal. Other approaches aim at developing classifiers that detect fixed, predetermined categories of salient sounds. In contrast, this work aims at developing a solution capable of suppressing all background (non-salient) sounds in an audio stream while preserving the salient sounds with as little distortion as possible. An additional requirement is that the algorithm should not be limited to any particular category of salient sound events. This challenging task is addressed in two steps, both of which are novel contributions of this work. In the first step, a large-scale dataset of clean background samples and clean salient sound samples is created by automatically processing a publicly available resource of field recordings. In the second step, a deep neural network (U-Net) trained to predict the complex ideal ratio mask, a method typically used for speech enhancement, is adopted and evaluated in the context of salient sound extraction. The results of the conducted experiments indicate that the proposed solution is potentially highly effective and point to directions for future research.
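
Illustration:	The complex ideal ratio mask named in the summary can be sketched in a few lines of NumPy. This is a minimal sketch of the general technique, not the authors' implementation. The synthetic tone burst and noise, the STFT settings (frame_len, hop), the regularisation term eps, and the helper names stft and complex_ideal_ratio_mask are all assumptions made for illustration. The mask here is computed as an oracle from known clean components; in the paper a U-Net is trained to predict such a mask from the mixture alone.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Simple STFT: Hann-windowed frames, one-sided FFT. Returns (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def complex_ideal_ratio_mask(salient_spec, mixture_spec, eps=1e-12):
    """Complex mask M with M * mixture_spec ~= salient_spec.

    This is the complex division salient / mixture, written with the
    conjugate so a small eps (an assumption) can guard against empty bins.
    """
    denom = np.abs(mixture_spec) ** 2 + eps
    return salient_spec * np.conj(mixture_spec) / denom


# Hypothetical signals: a short 1 kHz tone burst (salient) in white noise (background).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
salient = np.zeros_like(t)
salient[6000:8000] = np.sin(2 * np.pi * 1000.0 * t[6000:8000])
background = 0.3 * rng.standard_normal(t.shape)
mixture = salient + background

S = stft(salient)
Y = stft(mixture)
M = complex_ideal_ratio_mask(S, Y)

# Applying the oracle mask to the mixture recovers the salient spectrogram,
# including its phase, which a real-valued magnitude mask cannot do.
estimate = M * Y
max_err = np.max(np.abs(estimate - S))
```

Because the mask is complex-valued, it corrects both magnitude and phase of each time-frequency bin. Practical systems often compress the mask values before using them as a training target; that step is omitted here.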