
Raw Data

There are several ways to encode digitized music. This thesis focuses on Pulse Code Modulation (PCM) audio data, to which any digital music representation can be converted. For example, MP3 files can easily be decoded to PCM using almost any of the currently available audio players.
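
How the PCM data is obtained is not important for the remainder of the thesis. As a purely illustrative sketch (the file names are placeholders, and any decoder producing PCM would do equally well), an MP3 file can be decoded by invoking the freely available ffmpeg tool from Python:

    import subprocess

    # Decode an MP3 file to 16-bit PCM samples in a WAV container.
    # "song.mp3" and "song.wav" are placeholder file names.
    subprocess.run(
        ["ffmpeg", "-i", "song.mp3",
         "-acodec", "pcm_s16le",   # 16-bit signed PCM
         "-ar", "44100",           # 44.1 kHz sampling frequency
         "-ac", "2",               # two channels (stereo)
         "song.wav"],
        check=True,
    )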

The PCM data is a discrete representation of a continuous acoustic wave. The amplitude is usually represented by 16 bits, which allows 2^16 = 65,536 distinct amplitude levels. The time axis is usually sampled 44,100 times per second, i.e., one amplitude value approximately every 23 microseconds. The sampling frequency is measured in Hertz (Hz), cycles per second. The amplitude values themselves are dimensionless, although they correspond to sound pressure levels; the actual level depends on the sound system used to create the physical acoustic waves.
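
As a minimal sketch of these numbers, the following Python snippet (using SciPy; the file name is again a placeholder) reads a PCM WAV file and prints the quantities just discussed:

    from scipy.io import wavfile

    fs, pcm = wavfile.read("song.wav")   # sampling frequency and raw samples

    print("sampling frequency:", fs, "Hz")             # typically 44100
    print("sample period:", 1e6 / fs, "microseconds")  # ~22.7 us at 44.1 kHz
    print("sample type:", pcm.dtype)                   # int16, i.e. 2**16 = 65,536 levels
    print("channels:", 1 if pcm.ndim == 1 else pcm.shape[1])
    print("duration:", pcm.shape[0] / fs, "seconds")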

Figure 1 illustrates 44kHz PCM data at different magnification levels. The song Freak on a Leash by Korn is rather aggressive, loud, and perhaps even a little noisy. At least one electric guitar, a strange-sounding male voice, and drums create a freaky sound experience. In contrast, Für Elise by Beethoven is a rather peaceful, classical piano solo. These two pieces of music will be used throughout the thesis to illustrate the different steps of the feature extraction process. Over a one-minute interval, the envelope of the acoustic wave in Für Elise shows several modulations and seldom reaches the highest levels, while Freak on a Leash stays constantly around the maximum amplitude level. There is also a clear difference at the scale of only 20 milliseconds (ms), where Freak on a Leash has a more jagged structure than Für Elise.
Figure 1: The 44kHz PCM data of two very different pieces of music at different time scales. The titles of the subplots indicate their time intervals. The second minute was chosen since the first minute includes fade-in effects. Starting with an interval of 60 seconds each subsequent subplot magnifies the first fifth of the previous interval. This is indicated by the dotted lines. For each of the two pieces of music, the amplitude values are relative to the highest amplitude within the one-minute intervals.
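
The zoom scheme of the figure is straightforward to reproduce: starting with 60 seconds and repeatedly keeping the first fifth yields intervals of 60 s, 12 s, 2.4 s, 0.48 s, 96 ms, and finally about 19 ms, i.e. the roughly 20 ms of the last subplot. The following sketch (illustrative only, not the original plotting code) draws such a sequence of subplots for a mono signal x sampled at 44kHz:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_zoom_levels(x, fs=44100, start_s=60, first_s=60, levels=6):
        """Each subplot magnifies the first fifth of the previous interval."""
        minute = np.asarray(x, dtype=float)[start_s * fs:(start_s + first_s) * fs]
        minute = minute / np.abs(minute).max()   # relative to the 1-minute maximum
        fig, axes = plt.subplots(levels, 1, figsize=(6, 10))
        length_s = float(first_s)
        for ax in axes:
            n = int(length_s * fs)
            ax.plot(np.arange(n) / fs, minute[:n], linewidth=0.5)
            ax.set_title(f"{length_s:g} s" if length_s >= 1 else f"{length_s * 1000:g} ms")
            ax.set_ylim(-1, 1)
            length_s /= 5                        # zoom into the first fifth
        fig.tight_layout()
        plt.show()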

Besides finding a representation that corresponds to our perception, it is also necessary to reduce the amount of data. A 5-second stereo sound sampled at 44kHz with 16-bit values amounts to just over 7 megabits (about 0.88 megabytes) of data. For this reason the music is down-sampled to 11kHz, the two stereo channels are mixed down to one (mono), and only a fraction of every song is used for further processing. Specifically, only every third 6-second sequence is processed further, starting after the first 12 seconds (fade-in) and ending before the last 12 seconds (fade-out). Additionally, zeros at the beginning and end of the music are truncated. Together, these steps reduce the data by a factor of about 24: 4 from down-sampling, 2 from the stereo-to-mono mix, and 3 from keeping every third sequence.
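
A sketch of this preprocessing, assuming a 16-bit PCM array as input (the function and variable names are illustrative, not taken from the thesis):

    import numpy as np
    from scipy.signal import resample_poly

    def preprocess(pcm, fs=44100):
        """Return every third 6-second mono sequence at ~11kHz."""
        x = pcm.astype(float)
        if x.ndim == 2:                     # mix the two stereo channels to mono
            x = x.mean(axis=1)
        x = np.trim_zeros(x)                # truncate zeros at beginning and end
        x = resample_poly(x, up=1, down=4)  # down-sample 44.1kHz -> 11.025kHz
        fs //= 4

        seq = 6 * fs                        # samples per 6-second sequence
        skip = 12 * fs                      # skip 12 s fade-in and fade-out
        usable = x[skip:len(x) - skip]
        n_seq = len(usable) // seq
        # keep every third 6-second sequence for further processing
        return [usable[i * seq:(i + 1) * seq] for i in range(0, n_seq, 3)], fs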

Changing stereo to mono has no significant impact on the perception of the music genre. Using only a small fraction of an entire piece of music should be sufficient, since humans are able to recognize a genre within seconds. Often it is possible to recognize the genre from a single 6-second sequence; a more accurate classification is possible if a few sequences spread throughout the piece are listened to. However, the first and last seconds of a piece usually contain fade-in and fade-out effects, which do not help in determining the genre. Nor should the down-sampling affect the ability to recognize the genre. In simple experiments using average computer speakers it was hardly possible to hear the difference between 44kHz and 11kHz for most songs, while the genres remained clearly recognizable. Figure 2 depicts the effect of down-sampling. Notice that some of the fine details are lost; however, the signals still look alike. It is important to mention that down-sampling to 11kHz means that only frequencies up to 5.5kHz are preserved, since the highest representable (Nyquist) frequency is half the sampling rate. This is far below the roughly 16kHz an average human can hear; however, 5.5kHz is sufficient to cover the spectrum we use in speech and almost all frequencies used in music. Furthermore, very high frequencies are usually not perceived as pleasant and thus do not play a significant role in music.
Figure 2: The effects of down-sampling on a 20ms sequence of Freak on a Leash. This sequence is the same as the one in the last subplot of Figure 1.
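
In the spirit of Figure 2, the following sketch (again illustrative rather than the original code) overlays the same 20ms excerpt before and after down-sampling:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.signal import resample_poly

    def compare_sampling_rates(x, fs=44100, start_s=60, dur_s=0.02):
        """Plot a ~20ms excerpt at 44kHz and down-sampled to 11kHz."""
        seg = np.asarray(x, dtype=float)[int(start_s * fs):int((start_s + dur_s) * fs)]
        seg11 = resample_poly(seg, up=1, down=4)
        plt.plot(np.arange(len(seg)) / fs * 1000, seg, label="44kHz")
        plt.plot(np.arange(len(seg11)) / (fs / 4) * 1000, seg11, label="11kHz")
        plt.xlabel("time (ms)")
        plt.ylabel("amplitude")
        plt.legend()
        plt.show()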