A framework and “hub” for music cover identification, enabling researchers to compare various CSI methods and run experiments through a user-friendly Gradio interface.
- Setup Instructions
- Project Overview
- Technology Stack
- Models Implemented
- Available Datasets
- Gradio App Usage
- Experiments and Tests
- Future Challenges
- Performance Metrics
- Bibliography Review
- Unit Tests
- Clone or download this repository:

  ```bash
  git clone https://github.com/cncPomper/CoverDetectionHub.git --recurse-submodules
  cd CoverDetectionHub
  ```

  Note: `--recurse-submodules` is very important, as our hub compares various models that are stored in submodules.
- Create and activate a virtual environment (optional but recommended):

  ```bash
  # Create a virtual environment (Linux/macOS)
  python -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download or place model checkpoints:
  - Certain models (ByteCover, CoverHunter, Lyricover) require large checkpoint files that are not included in this repo. Here you will find checkpoints to download.
  - Update or create `configs/paths.yaml` to point to where you store these checkpoints.
- Prepare datasets:
  - Datasets are available to download here.
  - Update or create `configs/paths.yaml` to provide paths for the datasets; a minimal example is sketched below.
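A minimal sketch of what `configs/paths.yaml` might look like. Only `bytecover_checkpoint_path`, `covers80_data_dir`, and `covers80but10_data_dir` are mentioned elsewhere in this README; the remaining keys and all values are illustrative assumptions, so adjust them to the actual config schema:

```yaml
# Illustrative configs/paths.yaml (values are placeholders)
bytecover_checkpoint_path: /data/checkpoints/bytecover/bytecover.pt
coverhunter_checkpoint_dir: /data/checkpoints/coverhunter/      # assumed key name
lyricover_checkpoint_path: /data/checkpoints/lyricover/lyricover.pt  # assumed key name

covers80_data_dir: /data/datasets/covers80/
covers80but10_data_dir: /data/datasets/covers80but10/
injected_abracadabra_data_dir: /data/datasets/injected_abracadabra/  # assumed key name
```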
The config has been tested on Linux and Windows machines with CUDA. Please note that you may need to install the software listed under "Needed to run" in the Technology Stack section. The checkpoints are stored here.
This project is part of a Music Information Retrieval (MIR) course. We developed a hub for cover detection, providing:
- A unified interface to compare different cover detection models.
- Experiments for evaluating effectiveness on known cover-song datasets.
- A fast, user-friendly GUI using Gradio.
Initially, we planned to use the Da-TACOS dataset. However, we pivoted to using SHS100k for training models due to practical constraints. For experiments, we focus on Covers80 and synthetic datasets derived from it.
Main technologies in use:
- Python: Our technology stack is based on Python, given its strong ecosystem for working with data.
- PyTorch: Deep learning library.
- Gradio: The user interface is implemented with the Gradio library, a very convenient tool for fast prototyping.
- Numpy: Library for numerical operations.
- Librosa: Used for audio loading and some feature extraction (MFCC, spectral centroid) in the simpler comparison methods.
- venv (or another tool): For making the project easily portable.
- OpenAI Whisper: Used by Lyricover to transcribe lyrics and measure similarity in lyric space.
Other programs that are needed to run:
We currently have four main cover-detection models:

- ByteCover:
  - A neural-based approach, leveraging specialized embeddings for audio.
  - Checkpoints are loaded from `bytecover_checkpoint_path`.
- CoverHunter:
  - Another deep learning–based model, configured via a YAML file and a checkpoint directory.
  - Paths in `paths.yaml` guide where to load the model weights.
- Lyricover:
  - Inspired by the paper "The Words Remain the Same: Cover Detection with Lyrics Transcription".
  - It is our own implementation, joining text extraction using OpenAI Whisper with an n-gram representation and spectral features.
  - The result is obtained via a simple neural network that joins the predictions.
- Re-move:
  - From the paper "Less is more: Faster and better music version identification with embedding distillation" (https://arxiv.org/pdf/2010.03284).
  - It was trained on CREMA features ("convolutional and recurrent estimators for music analysis").
  - The model is lightweight and fast to train.

Each of these models outputs a similarity score for a given pair of audio files. A threshold then decides if two songs are considered "covers," as sketched below.
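A minimal sketch of the score-then-threshold decision, assuming a hypothetical pairwise similarity function (the actual wrapper names in the hub may differ):

```python
from typing import Callable

def is_cover(
    similarity_fn: Callable[[str, str], float],
    audio_a: str,
    audio_b: str,
    threshold: float = 0.75,
) -> tuple[float, bool]:
    """Return the similarity score and the thresholded cover/non-cover decision."""
    score = similarity_fn(audio_a, audio_b)  # model-specific similarity score
    return score, score >= threshold

# Example usage with any model wrapper exposing a pairwise similarity function:
# score, verdict = is_cover(bytecover_similarity, "a.mp3", "b.mp3", threshold=0.8)
```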
Apart from the main deep-learning models, we also included two simpler methods for demonstration and baseline comparison:
- MFCC (Mel-Frequency Cepstral Coefficients)
- Spectral Centroid
These can be used to compare two audio files based on feature similarity (e.g., cosine similarity).
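As an illustration of the baseline idea (not the hub's exact code), a minimal MFCC comparison with librosa and cosine similarity might look like this:

```python
import librosa
import numpy as np

def mfcc_similarity(path_a: str, path_b: str, n_mfcc: int = 20) -> float:
    """Cosine similarity between time-averaged MFCC vectors of two recordings."""
    def embed(path: str) -> np.ndarray:
        y, sr = librosa.load(path, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return mfcc.mean(axis=1)                                # time-averaged vector

    a, b = embed(path_a), embed(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The spectral-centroid baseline works the same way, swapping `librosa.feature.mfcc` for `librosa.feature.spectral_centroid`.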
The hub includes references to the following datasets:
- Covers80: A standard collection used in many cover-song identification studies.
- Covers80but10: A smaller variant with only 10 songs (for quick testing).
- Injected Abracadabra: A synthetic dataset where a portion of "Abracadabra" by Steve Miller Band is injected into other audio samples, as described in Batlle-Roca et al.
Note: The actual training of ByteCover, CoverHunter, Remove and Lyricover was performed on SHS100k, using a university server with GPU machines. The Covers80-related datasets are primarily for testing and demonstration.
As described in this document, we decided to mainly use the SHS100k dataset for training. We based our data on the metadata file from the bytecover repository.
The dataset is organised into 9998 cliques (groups of different performances of a single work; we consider all performances from one particular clique to be covers of each other). Each clique contains several samples. For each sample, the metadata provides the song title, its YouTube video ID, and its SecondHandSongs ID. We managed to obtain approximately 78k samples, totalling about 300 GB (some videos had been deleted, roughly 30 GB worth).
For further processing, the main identifier of each individual sample is its YouTube video ID.
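A minimal sketch of grouping such metadata into cliques, assuming it is available as tab-separated rows with clique ID and YouTube ID columns; the exact file format and column names in the bytecover repository may differ:

```python
import csv
from collections import defaultdict

def load_cliques(metadata_path: str) -> dict[str, list[str]]:
    """Group YouTube video IDs by clique ID; all IDs within a clique are mutual covers."""
    cliques: dict[str, list[str]] = defaultdict(list)
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Column names are assumptions; adapt them to the actual metadata header.
            cliques[row["clique_id"]].append(row["youtube_id"])
    return dict(cliques)
```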
After installing dependencies and ensuring your checkpoints/datasets exist:
Launch the Gradio interface:
python gradio_app.py
A browser tab should open with two tabs:
- Cover Song Identification
- Upload two audio files (e.g., .mp3, formats from python-soundfile are supported), select a model (ByteCover, CoverHunter, Lyricover, Remove, MFCC, or Spectral Centroid), and set a threshold.
- The interface will compute a similarity score and return whether it considers them covers.
- Model Testing
- Choose a CSI model (ByteCover, CoverHunter, Remove, MFCC, Spectral Centroid).
- Pick a dataset (Covers80, Covers80but10, or Injected Abracadabra).
- Select a threshold. The system then computes evaluation metrics (mAP, Precision@10, MR1, etc.) on that dataset, printing a summary.
- Synthetic Injection (“Injected Abracadabra”)
- Based on the partial injection method from Batlle-Roca et al.
- We inject a segment of an original piece (Abracadabra) into other tracks to create partial covers, then measure how well each model detects the covers.
- Covers80 evaluation: the samples were compared on a pairwise basis, i.e., pairs of tracks were selected either from the same groups (in which case they were considered covers) or from different groups (in which case they were classified as non-covers), as sketched below.
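A minimal sketch of this pairwise setup, assuming a list of `(audio_path, group_label)` tuples such as the one returned by `gather_covers80_dataset_files` (the exact signatures in the hub may differ):

```python
import itertools
import random

def make_pairs(files: list[tuple[str, str]], n_negative: int = 500, seed: int = 0):
    """Build (path_a, path_b, is_cover) pairs from (audio_path, group_label) tuples."""
    rng = random.Random(seed)
    pairs = [
        (a, b, True)
        for (a, ga), (b, gb) in itertools.combinations(files, 2)
        if ga == gb  # tracks from the same group are covers of each other
    ]
    negatives = 0
    while negatives < n_negative:
        (a, ga), (b, gb) = rng.sample(files, 2)
        if ga != gb:  # tracks from different groups are non-covers
            pairs.append((a, b, False))
            negatives += 1
    return pairs
```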
We maintain a `tests/` directory with `pytest` test files. Below is a summary of the test coverage:

- `test_paths_exist`: Verifies that all paths in `configs/paths.yaml` exist and match their expected type (directory or file).
- `test_gather_covers80_dataset_files`: Ensures `gather_covers80_dataset_files` returns a non-empty list of `(audio_path, label)` tuples.
- `test_covers80_random_subfolders_have_two_mp3s`: Randomly checks subfolders in `covers80_data_dir` for at least two `.mp3` files.
- `test_gather_covers80but10_dataset_files`: Similar to the above but for the dataset variant.
- `test_covers80but10_random_subfolders_have_two_mp3s`: Verifies subfolders in `covers80but10_data_dir` contain at least two `.mp3` files.
- `test_compute_metrics_for_ranking_no_relevant`: Tests metrics when no relevant items are in the ranking.
- `test_compute_metrics_for_ranking_all_relevant`: Tests metrics when all items in the ranking are relevant.
- `test_compute_metrics_for_ranking_mixed`: Validates metrics for mixed relevant and irrelevant items.
- `test_compute_mean_metrics_for_rankings_empty`: Ensures mean metrics for an empty ranking set return 0.
- `test_compute_mean_metrics_for_rankings_single`: Verifies mean metrics match individual metrics for a single ranking.
- `test_compute_mean_metrics_for_rankings_multiple`: Validates mean metrics across multiple rankings against manual calculations.
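For illustration, a hedged sketch of how a test like `test_paths_exist` could be written; the actual test in `tests/` may be structured differently, and the directory/file heuristic below is an assumption:

```python
# Illustrative version of test_paths_exist (not necessarily the repo's exact test).
from pathlib import Path
import yaml

def test_paths_exist():
    config = yaml.safe_load(Path("configs/paths.yaml").read_text())
    for key, value in config.items():
        path = Path(value)
        assert path.exists(), f"{key} points to a non-existent path: {value}"
        # Assumed heuristic: keys ending in "_dir" are directories, others are files.
        if key.endswith("_dir"):
            assert path.is_dir(), f"{key} should be a directory: {value}"
        else:
            assert path.is_file(), f"{key} should be a file: {value}"
```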
Paper | Notes |
---|---|
CoverHunter: Cover Song Identification with Refined Attention and Alignments | The paper proposes a new method for the CSI task called CoverHunter. It explores features more deeply with refined attention and alignments. It has 3 crucial modules: 1) a convolution-augmented transformer; 2) an attention-based time pooling module; 3) a novel training scheme. The authors report excellent results, beating all existing methods at the time, tested on benchmarks like Da-TACOS and SHS100K. In general, they propose a state-of-the-art model, which is definitely one of the current best, which is why it is worth including in our project. A PyTorch implementation of this method can be found in the CoverHunter repository, along with checkpoints of pretrained models. |
BYTECOVER: COVER SONG IDENTIFICATION VIA MULTI-LOSS TRAINING | This paper from 2021 introduces a new feature learning method for the cover song identification task. It is built on a classical ResNet model with improvements designed for CSI. In a set of experiments, the authors demonstrate its effectiveness and efficiency. They evaluate the method on multiple datasets, including Da-TACOS. The bytecover repository shares an implementation of this method with the best-trained model checkpoints. Thoughts: There is no transformer in the method, which may imply worse results than CoverHunter. |
ESSENTIA: AN AUDIO ANALYSIS LIBRARY FOR MUSIC INFORMATION RETRIEVAL | This paper describes a framework for multiple MIR applications. The tool consists of a number of reconfigurable modules that come in handy for researchers. For our case, an interesting approach is to use the harmonic pitch class profile and the chroma features of audio signals to calculate the similarity of two tracks. This model is very basic and well known; therefore, it will serve as a reference. The metric used in this model is obtained from a binary cross-similarity matrix, which can finally be converted into a numeric value using the Smith-Waterman sequence alignment algorithm. We decided to reproduce the experiments for embeddings using MFCC and spectral centroid, but using the librosa library. |
THE WORDS REMAIN THE SAME: COVER DETECTION WITH LYRICS TRANSCRIPTION | In this paper, the authors propose a different approach, combining a lyrics-recognition-based system with a classic tonal-based system, using datasets like Da-TACOS and DALI for cover detection. They fuse several components: 1) a state-of-the-art lyrics recognition framework from MIREX 2020, based on a Time Delay Neural Network (TDNN) trained on the English tracks of the DALI dataset, with Singing Voice Separation (SVS) applied as a background-music preprocessing step and Mel-Frequency Cepstral Coefficients (MFCC) used in the complete framework; 2) string matching to calculate the similarity between pairs of transcripts; 3) tonal-based cover detection using Re-MOVE, trained on part of the Da-TACOS dataset. Another interesting, more classic option is HPCP features for cover detection. We implemented a related joint approach, LyriCover, which combines lyrics retrieval with classic tonal methods. |
Dataset | Details |
---|---|
Da-TACOS – Dataset for Cover Song Identification and Understanding | Two subsets: 1. Benchmark Subset (15,000 songs); 2. Cover Analysis Subset (10,000 songs). Thoughts: This dataset has become a classic benchmark for testing CSI systems. Moreover, the authors of the paper, along with the dataset, also provided a framework for feature extraction and benchmarking: acoss (Audio Cover Song Suite). acoss includes a standard feature extraction framework with audio features for the CSI task and open-source implementations of seven CSI algorithms. It was designed to facilitate future work in this line of research. Although the dataset is relatively new (2019), both repositories have not been updated in 5 years, and considering how rapidly the MIR domain develops, 5 years is a lot. That is why our project can be an attempt to create a refreshed and modern version of this framework, including state-of-the-art methods and, hopefully, additional datasets to test them. |
Covers80 | Thoughts: We will not use the Covers80 dataset as the primary dataset because it is relatively small and old (2007). Additionally, the audio files are of low quality (32 kbps, 16 kHz mono). The dataset was assembled somewhat randomly and may not provide sufficient diversity or representativeness. However, it has become a CSI benchmark, which is why, if we have enough time, we will try to include it in our project. The dataset appeared in the paper THE 2007 LABROSA COVER SONG DETECTION SYSTEM. |
SHS100K | Thoughts: This dataset served as our primary dataset for training purposes. |
ZAIKS dataset | A friendly organization in Poland that will provide a music dataset for testing purposes; these will probably be Polish songs and their famous cover versions. |
The selection of metrics is based on the MIREX Cover Song Identification contest.
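For reference, a minimal sketch of these ranking metrics (average precision, Precision@10, and rank of the first correct cover) computed per query; the hub's own `compute_metrics_for_ranking` helpers may differ in detail:

```python
import numpy as np

def ranking_metrics(relevant: list[bool], k: int = 10) -> dict[str, float]:
    """Compute AP, Precision@k and first-correct-cover rank for one ranked result list."""
    rel = np.asarray(relevant, dtype=bool)
    hits = np.flatnonzero(rel)
    if hits.size == 0:
        return {"AP": 0.0, f"P@{k}": 0.0, "R1": float("inf")}
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return {
        "AP": float(precisions[rel].mean()),  # average precision over relevant ranks
        f"P@{k}": float(rel[:k].mean()),      # precision among the top-k results
        "R1": float(hits[0] + 1),             # 1-based rank of the first correct cover
    }

# mAP, mP@10 and MR1 are then the means of these values over all query tracks.
```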
Dataset / Model | Mean Average Precision (mAP) | Precision at 10 (P@10) / mP@10 | Mean Rank of First Correct Cover (MR1 / mMR1) |
---|---|---|---|
Model: re-move | 0.86600 | 0.87500 | 1.00000 |
Model: CoverHunter | 0.78400 | 0.90000 | 1.00000 |
Model: Lyricover | 0.82029 | 0.90000 | 1.00000 |
Model: ByteCover | 0.51435 | 0.70000 | 1.00000 |
MFCC | 0.24015 | 0.30000 | 3.00000 |
Spectral Centroid | 0.24159 | 0.30000 | 3.00000 |
Dataset / Model | Mean Average Precision (mAP) | Precision at 10 (P@10) / mP@10 | Mean Rank of First Correct Cover (MR1 / mMR1) |
---|---|---|---|
Model: re-move | 0.83020 | 0.09695 | 8.04268 |
Model: Lyricover | 0.83425 | 0.09939 | 7.41463 |
Model: CoverHunter | 0.74161 | 0.09268 | 9.52439 |
Model: ByteCover | 0.52172 | 0.07927 | 14.78049 |
MFCC | 0.24015 | 0.30000 | 3.00000 |
Spectral Centroid | 0.04352 | 0.00793 | 76.80488 |
- Gathering literature
- preparing design proposal
- tools selection
- selection of datasets
- Preparing the environment
- choice for the models
- initial dataset preprocessing
- Implementation of the first functional prototype
- including training at least one model
- minimal GUI
- First results evaluation
- implementing improvements
- training
- adding subsequent models
- scraping SHS100K dataset
- Automated tests design
- training of remaining models
- Evaluation of the results
- improving GUI
- re-training the models if necessary
- Working on the final presentation
- tests
- gathering final results, Christmas chill (optional)
- Final results evaluation
- preparation of the paper (?)
🎉 Public project presentation 🎉
- obtaining the Da-TACOS dataset and possibly performing training and evaluation on it
- improving the LyriCover model with more sophisticated audio feature extraction; training on a larger subset of SHS100k after improving the performance of the model
- possible co-operation with ZAIKS to form a new dataset and deploy the solution
- performing more experiments, similar to "Injected Abracadabra" or others found in the literature
- augmentations are a very interesting field of experiments
- currently, the Re-move model uses the `essentia` package, which is unavailable on Windows. This makes the whole app runnable only on Unix operating systems. It would be advisable to implement the necessary methods from this package ourselves so that the app is runnable on all systems.
The logo has been designed using the DALL-E model.