OFAI-TR-2014-06

On inter-rater agreement in audio music similarity

Arthur Flexer

One of the central tasks in the annual MIREX evaluation campaign is the "Audio Music Similarity and Retrieval (AMS)" task. Songs that algorithms rank as highly similar are assessed by human graders, who rate the similarity according to their own subjective judgment. By analyzing results from the AMS tasks of the years 2006 to 2013 we demonstrate that: (i) due to low inter-rater agreement there exists an upper bound of performance in terms of subjective gradings; (ii) this upper bound was already reached by participating algorithms in 2009 and has not been surpassed since. Based on this sobering result we discuss ways to improve future evaluations of audio music similarity.

Keywords: music information retrieval, audio similarity, evaluation, rater agreement

Citation: Flexer A.: On inter-rater agreement in audio music similarity, Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 2014.