Music Genre Detection With Deep Learning

How a TensorFlow model can classify audio files with just a few lines of code.

Marc Saint-Félix
Towards Data Science


Whatever your level of musical knowledge, it is very hard to describe what a music genre is. Why does jazz sound like jazz? How can you tell country from disco? Since there is no systematic definition of a genre, it is impossible to classify them programmatically with mere “if/else” statements. This is where Deep Learning comes into play. We will see that each genre has its own acoustic signature that a model can learn from. Let’s dig in!


Introduction

The objective of this project is to classify audio files (in “.wav” format) into 10 musical genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock. To do so, we will use TensorFlow 2/Keras as our Deep Learning framework and Librosa as our audio pre-processing library. Librosa is a Python package for music and audio analysis that provides the building blocks necessary to create Music Information Retrieval (MIR) systems.

To train our model, we will use the GTZAN dataset which consists of a collection of 10 genres with 100 audio files each, all having a length of 30 seconds. The dataset is available here.

The preprocessing, training and inference steps are all performed in a Jupyter Notebook running a Conda/Python 3 environment, organized into modular blocks and functions for clarity and portability. You will find the link to my GitHub repository for this project in the conclusion of this post.

Sounds and audio signal processing

Once sampled, a soundwave resides on your hard drive as a series of float values. Its size depends on the sampling rate (usually 44100 Hz or 22050 Hz) and the duration of the recording. These time series can easily be loaded into NumPy arrays to play around with. Let’s visualize one audio file from the GTZAN dataset using Librosa:

song, sr = librosa.load('GTZAN_Dataset/genres_original/rock/rock.00000.wav')
librosa.display.waveshow(song)
Representation of sound amplitude variations over time (30 seconds) for a rock song

It is common practice to use the Short-Time Fourier Transform (STFT) to get a better understanding of the qualitative behavior of audio signals. It is a mathematical tool that lets us analyse how the frequency spectrum varies over time. Here is how it works.

Firstly, we define a frame_size (usually a window of 2048 samples). Over each frame, a frequency vector (of n frequency bins) is calculated by applying the Discrete Fourier Transform (DFT). The frequency vector is the instantaneous representation of our audio signal in the frequency domain. You can see it as a description of the sound in terms of energy distribution across all frequency bins at a given time (i.e. a given frame).

Secondly, as the music plays, frames advance (the next frame starts one hop_length after the previous one) and the energy distribution changes with them. Therefore, we get the visual representation of the STFT (the spectrogram) by applying the DFT to all frames successively and representing them side by side on a heatmap, as follows:

def plot_spectrogram(Y, sr, hop_length, y_axis="linear"):
    plt.figure(figsize=(10, 5))
    librosa.display.specshow(Y, sr=sr, hop_length=hop_length, x_axis="time", y_axis=y_axis)
    plt.colorbar(format="%+2.f")

# Squared magnitude of the STFT, converted to decibels for readability.
Y_log = librosa.power_to_db(np.abs(librosa.stft(song, hop_length=1024))**2)
plot_spectrogram(Y_log, sr, 1024)
Log-amplitude variations over time (30 seconds) for all frequency bins

The above graph is our rock song. This heatmap shows how each frequency bin evolves over time: the redder the color, the more energy, the louder the sound. Note how the higher frequencies (around the 10 kHz bins) remain blue from 0 to 30 seconds, as the signal carries less energy in those bins. Also, evenly spaced vertical red lines appear throughout the file: that is all the instruments playing together on the beat.

Ever wondered what Rock’n’Roll really is?

Well now at least you know what it looks like!

But our spectrogram still looks somewhat fuzzy, and we are still a few tweaking steps away from making it fit for ingestion by a Deep Learning model. As you may know, we humans don’t hear sounds linearly: the perceived pitch difference between A1 (55 Hz) and A2 (110 Hz) is the same as between A4 (440 Hz) and A5 (880 Hz). Both intervals are octaves, yet the first spans only 55 Hz while the second spans 440 Hz. The Mel scale defines frequency bands that are evenly distributed with respect to perceived pitch. Mel filter banks are calculated so that they are more discriminative for lower frequencies and less so for higher ones, just like human ears. The harmonic structure is smoothed by applying these triangular filters as weighted averages over the spectrum, as follows:

Triangular Mel filter banks
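
If you want to see these filters for yourself, Librosa can build the Mel filter bank matrix directly. Here is a minimal sketch (n_mels=10 is an arbitrary choice of mine to keep the plot readable):

import librosa
import matplotlib.pyplot as plt

# Each row of the matrix is one triangular filter over the 1025 STFT frequency bins.
mel_filters = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=10)
print(mel_filters.shape)  # (10, 1025)

plt.figure(figsize=(10, 4))
for filt in mel_filters:
    plt.plot(filt)  # filters get wider as frequency increases
plt.xlabel("Frequency bin")
plt.ylabel("Weight")
plt.show()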

After Mel scaling, the final step consists of applying the Discrete Cosine Transform (DCT), which generates real-valued coefficients only. As a best practice, we mostly retain the first 13 coefficients (n_mfcc=13), called Mel-Frequency Cepstral Coefficients (MFCCs). They describe the broadest aspects of the spectral shape, while higher-order coefficients are less useful to train on as they tend to describe more noise-like information.

MFCCs are useful descriptors of the larger structures of the spectrum that are easily readable and “interpretable” by a Deep Learning model. Let’s extract the MFCCs of our rock song:

mfcc_song = librosa.feature.mfcc(y=song, sr=sr, n_mfcc=13)
plt.figure(figsize=(15, 8))
librosa.display.specshow(mfcc_song, x_axis="time", sr=sr)
plt.colorbar(format="%+2.f")
plt.show()
13 band MFCC for a 30 second rock song

Now that looks quite neat! So, what exactly do we have here? The shape of mfcc_song is (13, 1293), i.e. 13 rows corresponding to each coefficient over 1293 frames (of 2048 samples each) covering the 30-second clip. This is the acoustic signature of one rock song. The GTZAN dataset has 99 more songs to train on in order to learn the rock genre. Let’s see how to do that.

Preparing the training, validation and test datasets

After browsing through all the songs, I noticed that their lengths vary slightly around 30 seconds. To make sure we process files of equal duration, we crop them all at 29 seconds. We also set the sampling rate sr = 22050 Hz and calculate the number of samples per slice.

# Sampling rate.
sr = 22050
# Let's make sure all files have the same number of samples: pick a duration right under 30 seconds.
TOTAL_SAMPLES = 29 * sr
# The dataset contains 999 usable files (1000 minus 1 defective). Let's make it bigger.
# X amount of slices => X times more training examples.
NUM_SLICES = 10
SAMPLES_PER_SLICE = int(TOTAL_SAMPLES / NUM_SLICES)

Also, we need to generate more training examples, since 1000 songs is not all that much. Slicing each song into 10 sub-parts seems like a reasonable trade-off: each slice gives us a new training example while remaining long enough to retain the song’s acoustic signature.

Our dataset consists of 10 folders named after their genre, each containing 100 songs. The labeling process is therefore straightforward: folder name → label. We can now formulate this classification as a supervised problem, as each song has a genre label assigned to it.

All MFCCs and their associated labels are stored in a separate json file, built in a for loop as follows:

# Let's browse each file, slice it, and generate the 13-band MFCC for each slice.
mydict = {"labels": [], "mfcc": []}

for i, (dirpath, dirnames, filenames) in enumerate(os.walk(source_path)):
    for file in filenames:
        song, sr = librosa.load(os.path.join(dirpath, file), duration=29)

        for s in range(NUM_SLICES):
            start_sample = SAMPLES_PER_SLICE * s
            end_sample = start_sample + SAMPLES_PER_SLICE
            mfcc = librosa.feature.mfcc(y=song[start_sample:end_sample], sr=sr, n_mfcc=13)
            mfcc = mfcc.T
            # i == 0 is the root folder, so the genre folders start at i == 1, hence i-1.
            mydict["labels"].append(i-1)
            mydict["mfcc"].append(mfcc.tolist())

# Let's write the dictionary to a json file.
with open(json_path, 'w') as f:
    json.dump(mydict, f)

Finally, validation and test datasets are generated via Scikit-Learn train_test_split() with a 20% ratio.
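
Here is a minimal sketch of what that might look like, assuming the json file written above is reloaded into NumPy arrays X and y; splitting off a second 20% for validation and adding a channel dimension for the CNN are assumptions of mine, not code from the original notebook:

import json
import numpy as np
from sklearn.model_selection import train_test_split

with open(json_path, 'r') as f:
    mydict = json.load(f)

X = np.array(mydict["mfcc"])      # shape: (num_slices, frames_per_slice, 13)
y = np.array(mydict["labels"])

# Hold out 20% for testing, then 20% of the remainder for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

# The CNN expects a channel dimension: (frames_per_slice, 13, 1).
X_train, X_val, X_test = X_train[..., np.newaxis], X_val[..., np.newaxis], X_test[..., np.newaxis]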

Model architecture, training and predicting

We will use TensorFlow 2/Keras to design our Convolutional Neural Network (CNN), with 3 convolutional layers followed by a fully connected layer and a softmax output over 10 classes (one per genre). The overall architecture is as follows:
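
The exact layer sizes are in the notebook on GitHub; the sketch below is only an illustrative Keras model of that general shape (3 convolution blocks followed by a dense softmax head over 10 classes), with filter counts, dropout rate and learning rate chosen as plausible assumptions:

from tensorflow import keras

def build_model(input_shape):
    model = keras.Sequential([
        # 3 convolution blocks: convolution + max pooling + batch normalization
        keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
        keras.layers.BatchNormalization(),

        keras.layers.Conv2D(32, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
        keras.layers.BatchNormalization(),

        keras.layers.Conv2D(32, (2, 2), activation='relu'),
        keras.layers.MaxPooling2D((2, 2), strides=(2, 2), padding='same'),
        keras.layers.BatchNormalization(),

        # Fully connected head with a softmax over the 10 genres
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model(X_train.shape[1:])  # e.g. (frames_per_slice, 13, 1)
model.summary()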

Training over 30 epochs with batch_size=32 takes a few minutes on an Intel dual-core i5 CPU and converges to 77% accuracy on the validation dataset. Let’s plot the metrics curves with Matplotlib:
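
Assuming the model and splits sketched above, the training call and the accuracy/loss curves could be produced along these lines (the history object returned by model.fit() holds the per-epoch metrics):

import matplotlib.pyplot as plt

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=32,
                    epochs=30)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

# Accuracy curves
ax1.plot(history.history['accuracy'], label='train')
ax1.plot(history.history['val_accuracy'], label='validation')
ax1.set_ylabel('Accuracy')
ax1.legend()

# Loss curves
ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='validation')
ax2.set_ylabel('Loss')
ax2.set_xlabel('Epoch')
ax2.legend()

plt.show()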

Conclusion

I tested other model architectures on the same data with the following results:

  • Multi Layer Perceptron → 76% train acc and 59% val acc
  • Recurrent Neural Network (LSTM type) → 90% train acc and 76% val acc

The CNN architecture presented in this post ended up achieving the best performance on data it had never seen before (77%). In order to reduce variance, various techniques can be applied to fine-tune the model, such as regularization, data augmentation, more dropout layers, etc. Let me know if you can come up with a better score and what approach you implemented to achieve it.
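
As an illustration of that last point, here is one possible tweak: an L2 weight penalty and a stronger dropout rate on the dense head of the CNN sketched earlier (the values are arbitrary starting points, not tuned settings):

from tensorflow import keras

regularized_head = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(0.001)),  # L2 penalty on the weights
    keras.layers.Dropout(0.5),  # drop half the activations during training
    keras.layers.Dense(10, activation='softmax'),
])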

Feel free to check out the model benchmark on my GitHub repository. The CNN model was also submitted to my Kaggle profile. Thank you for reading!
