OFAI

2013–2016

Automatic Segmentation, Labelling, and Characterisation of Audio Streams


A project sponsored by the Austrian Federal Ministry for Transport, Innovation and Technology (bmvit) and managed by the Austrian Science Fund (FWF) under the research line "Translational Research"
Project Number: TRP 307-N23

The goal of this project is to develop technologies for the automatic segmentation and interpretation of audio files and audio streams originating from different media sources: music repositories, (Web and terrestrial) radio streams, TV broadcasts, etc. A specific focus is on streams in which music plays an important role.

Specifically, the technologies to be developed should address the following tasks:

  • automatic segmentation (with or without meta-information) of audio streams into coherent or otherwise meaningful units or segments (based on general sound or rhythm similarity or homogeneity, on specific types of content and characteristics, on repeated occurrences of subsections, etc.);
  • the automatic categorisation of such audio segments into classes, and the association of segments and classes with meta-data derived from various sources (including the Web);
  • the automatic characterisation of audio segments and sound objects in terms of concepts intuitively understandable to humans.

To this end, we plan to develop and/or improve and optimise computational methods that:

  • analyse audio streams, identify specific kinds of audio content (e.g., music, singing, speech, applause, commercials, ...), detect boundaries and transitions between songs, and classify musical and other segments into appropriate categories;
  • combine information from various sources (the audio signal itself, databases, the Internet) in order to refine the segmentation and gain meta-information;
  • automatically discover and optimise audio features that improve segmentation and classification;
  • learn to derive comprehensible descriptions of audio contents from such audio features (via machine learning).
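To illustrate the kind of boundary detection described above, here is a minimal, purely hypothetical sketch (not the project's actual system): it measures novelty as the change in the mean of a per-frame feature between the windows before and after each frame, and places a boundary at each local novelty peak above a threshold. The feature sequence, window size, and threshold are all illustrative assumptions.

```python
# Hypothetical sketch of homogeneity-based stream segmentation:
# novelty at frame t = |mean(features before t) - mean(features after t)|;
# boundaries are placed at local novelty maxima above a threshold.

def novelty_curve(features, w=5):
    """Novelty curve over a 1-D per-frame feature sequence."""
    n = len(features)
    curve = [0.0] * n
    for t in range(w, n - w):
        left = sum(features[t - w:t]) / w    # mean of the w frames before t
        right = sum(features[t:t + w]) / w   # mean of the w frames from t on
        curve[t] = abs(right - left)
    return curve

def detect_boundaries(features, w=5, threshold=0.5):
    """Frame indices that are local maxima of the novelty curve above threshold."""
    curve = novelty_curve(features, w)
    return [t for t in range(1, len(curve) - 1)
            if curve[t] > threshold
            and curve[t] >= curve[t - 1] and curve[t] >= curve[t + 1]]

# Synthetic feature sequence: two homogeneous segments, change at frame 20.
features = [0.0] * 20 + [1.0] * 20
print(detect_boundaries(features))  # → [20]
```

Real systems in this area replace the 1-D toy feature with spectrogram-based features and the fixed threshold with learned models, but the segment-boundary logic follows the same shape.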

The research is motivated by a large class of challenging applications in the media world that require efficient and robust audio segmentation and classification. Application scenarios include audio streaming services and Web stream analysis, automatic media monitoring, content- and descriptor-based search in large multimedia (audio) databases, and artistic applications.

Publications

  • Jan Schlüter and Thomas Grill: Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015. (PDF, BibTeX, code)
  • Thomas Grill and Jan Schlüter: Music Boundary Detection Using Neural Networks on Combined Features and Two-Level Annotations. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015. (PDF, BibTeX, www)
  • Bernhard Lehner and Gerhard Widmer: Monaural Blind Source Separation in the Context of Vocal Detection. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015. (PDF)
  • Christian Dittmar, Bernhard Lehner, Thomas Prätzlich, Meinard Müller, and Gerhard Widmer: Cross-version Singing Voice Detection in Classical Opera Recordings. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015. (PDF)
  • Hamid Eghbal-zadeh, Bernhard Lehner, Markus Schedl, and Gerhard Widmer: I-Vectors for Timbre-based Music Similarity and Music Artist Classification. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015. (PDF)
  • Bernhard Lehner, Gerhard Widmer, and Reinhard Sonnleitner: Improving Voice Activity Detection in Movies. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany, 2015. (PDF)
  • Thomas Grill and Jan Schlüter: Music Boundary Detection Using Neural Networks on Spectrograms and Self-Similarity Lag Matrices. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO 2015), Nice, France, 2015. (PDF, BibTeX, www)
  • Bernhard Lehner, Gerhard Widmer, and Sebastian Böck: A Low-latency, Real-time-capable Singing Voice Detection Method with LSTM Recurrent Neural Networks. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO 2015), Nice, France, 2015. (PDF)
  • Karen Ullrich, Jan Schlüter, and Thomas Grill: Boundary Detection in Music Structure Analysis using Convolutional Neural Networks. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), Taipei, Taiwan, 2014. (PDF, BibTeX, www)
  • Bernhard Lehner, Gerhard Widmer, and Reinhard Sonnleitner: On the Reduction of False Positives in Singing Voice Detection. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, 2014. (PDF)
  • Jan Schlüter and Sebastian Böck: Improved Musical Onset Detection with Convolutional Neural Networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, 2014. (PDF, BibTeX, www)
  • Bernhard Lehner, Reinhard Sonnleitner, and Gerhard Widmer: Towards Light-weight, Real-time-capable Singing Voice Detection. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), Curitiba, Brazil, 2013. (PDF)
  • Jan Schlüter: Learning Binary Codes for Efficient Large-Scale Music Similarity Search. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), Curitiba, Brazil, 2013. (PDF, BibTeX)

Software / Demos

  • Learned speech and music features: We learned spectro-temporal features for independently detecting the presence of speech and music in radio broadcasts.
  • Annotation visualisation tool: We developed a prototypical web-based tool to visualise and compare different annotations or segmentations by humans or our algorithms.
  • RadioAnalyzer: We developed a software component for segmenting radio broadcasts and licensed it to the Danish company RadioAnalyzer.
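The learned features themselves are not reproduced here, but the frame-wise speech/music decision they feed can be sketched with simple stand-in descriptors. The sketch below is entirely hypothetical: it uses log energy and zero-crossing rate in place of the learned spectro-temporal features, and an illustrative hand-set linear detector in place of a trained model.

```python
import math

def frame_energy_and_zcr(frame):
    """Two simple per-frame descriptors: log energy and zero-crossing rate
    (stand-ins for the learned spectro-temporal features, which are not public)."""
    energy = math.log(sum(s * s for s in frame) + 1e-9)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / max(len(frame) - 1, 1)
    return energy, zcr

def classify_frame(frame, w_energy=0.0, w_zcr=10.0, bias=-2.0):
    """Hypothetical linear detector: a high zero-crossing rate suggests noisy,
    speech-like content; a low one suggests tonal, music-like content.
    Weights are illustrative, not trained."""
    energy, zcr = frame_energy_and_zcr(frame)
    score = w_energy * energy + w_zcr * zcr + bias
    return "speech-like" if score > 0 else "music-like"

# A noisy frame (alternating signs, high ZCR) vs. a slow sine (low ZCR).
noisy = [(-1) ** i * 0.5 for i in range(400)]
tonal = [math.sin(2 * math.pi * 3 * i / 400) for i in range(400)]
print(classify_frame(noisy), classify_frame(tonal))  # → speech-like music-like
```

In the actual work, such hand-crafted descriptors are replaced by features learned from data, and the linear rule by a trained classifier operating on spectrogram excerpts.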

Additional sponsoring

We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research.