Cross-modal variational inference for bijective signal-symbol translation
| Main Authors | , , , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 10.02.2020 |
| Subjects | |
| DOI | 10.48550/arxiv.2002.03862 |
| Summary: | Extraction of symbolic information from signals is an active field of research enabling numerous applications, especially in the Musical Information Retrieval domain. This complex task, which is also related to other topics such as pitch extraction or instrument recognition, is a demanding subject that gave birth to numerous approaches, mostly based on advanced signal-processing algorithms. However, these techniques are often non-generic: they allow the extraction of definite physical properties of the signal (pitch, octave), but not arbitrary vocabularies or more general annotations. On top of that, these techniques are one-sided, meaning that they can extract symbolic data from an audio signal but cannot perform the reverse process of symbol-to-signal generation. In this paper, we propose a bijective approach for signal/symbol translation by turning this problem into a density estimation task over the signal and symbolic domains, considered as related random variables. We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match through an additive constraint, allowing both models to learn and generate separately while enabling signal-to-symbol and symbol-to-signal inference. In this article, we test our models on pitch, octave and dynamics symbols, which constitute a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather light during training and generation, while allowing several interesting creative uses that we outline at the end of the article. |
|---|---|
| DOI: | 10.48550/arxiv.2002.03862 |
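The summary describes the core mechanism: two variational auto-encoders, one per domain, trained jointly while an additive constraint pulls their latent representations together, so that encoding in one domain and decoding with the other performs signal-to-symbol or symbol-to-signal translation. The sketch below illustrates that kind of training objective; the network sizes, the Gaussian likelihoods, and the L2 latent-matching penalty with weight `gamma` are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DomainVAE(nn.Module):
    """A small Gaussian VAE used for either the signal or the symbol domain."""

    def __init__(self, input_dim, latent_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, input_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar


def neg_elbo(x, recon, mu, logvar):
    """Negative ELBO: Gaussian reconstruction error plus KL to a standard normal prior."""
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl


signal_vae = DomainVAE(input_dim=512)   # e.g. one spectral frame (assumed size)
symbol_vae = DomainVAE(input_dim=32)    # e.g. concatenated pitch/octave/dynamics one-hots
optimizer = torch.optim.Adam(
    list(signal_vae.parameters()) + list(symbol_vae.parameters()), lr=1e-3)


def training_step(x_signal, x_symbol, gamma=1.0):
    recon_sig, mu_sig, lv_sig = signal_vae(x_signal)
    recon_sym, mu_sym, lv_sym = symbol_vae(x_symbol)
    # Additive constraint: pull the two posterior means together so the latent
    # spaces line up and either decoder can read the other encoder's code.
    match = F.mse_loss(mu_sig, mu_sym, reduction="sum")
    loss = (neg_elbo(x_signal, recon_sig, mu_sig, lv_sig)
            + neg_elbo(x_symbol, recon_sym, mu_sym, lv_sym)
            + gamma * match)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Cross-modal inference then amounts to encoding in one domain and decoding in the other:
#   z, _ = signal_vae.encode(x_signal); symbols = symbol_vae.dec(z)  # transcription
#   z, _ = symbol_vae.encode(x_symbol); audio = signal_vae.dec(z)    # generation
```

Matching the posterior means with an L2 term is the simplest reading of the "additive constraint"; a symmetric KL divergence between the two approximate posteriors would serve the same purpose and stays closer to a fully probabilistic formulation.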