A framework and “hub” for music cover identification, enabling researchers to compare various CSI methods and run experiments through a user-friendly Gradio interface.
- Setup Instructions
- Project Overview
- Technology Stack
- Models Implemented
- Available Datasets
- Gradio App Usage
- Experiments and Tests
- Future Challenges
- Performance Metrics
- Bibliography Review
- Unit Tests
- Clone or download this repository:

  ```bash
  git clone https://github.com/cncPomper/CoverDetectionHub.git --recurse-submodules
  cd CoverDetectionHub
  ```

  Note: `--recurse-submodules` is very important, as our hub compares various models that are stored in submodules.
- Create and activate a virtual environment (optional but recommended):

  ```bash
  # Create a virtual environment (Linux/macOS)
  python -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download or place model checkpoints:
  - Certain models (ByteCover, CoverHunter, Lyricover) require large checkpoint files that are not included in this repo. Here you will find checkpoints to download.
  - Update or create `configs/paths.yaml` to point to where you store these checkpoints.
- Prepare datasets:
  - Datasets are available to download here.
  - Update or create `configs/paths.yaml` to provide paths for the datasets; a minimal example is sketched below.
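A minimal sketch of what `configs/paths.yaml` might look like. Only `bytecover_checkpoint_path`, `covers80_data_dir`, and `covers80but10_data_dir` are mentioned elsewhere in this README; the remaining keys and all values are illustrative assumptions, so adjust them to the actual config schema:

```yaml
# Illustrative configs/paths.yaml (values are placeholders)
bytecover_checkpoint_path: /data/checkpoints/bytecover/bytecover.pt
coverhunter_checkpoint_dir: /data/checkpoints/coverhunter/      # assumed key name
lyricover_checkpoint_path: /data/checkpoints/lyricover/lyricover.pt  # assumed key name

covers80_data_dir: /data/datasets/covers80/
covers80but10_data_dir: /data/datasets/covers80but10/
injected_abracadabra_data_dir: /data/datasets/injected_abracadabra/  # assumed key name
```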
The config has been tested on Linux and Windows machines with CUDA. Please note that you may need to install the software listed under "Needed to run" in the Technology Stack section. The checkpoints are stored here.
This project is part of a Music Information Retrieval (MIR) course. We developed a hub for cover detection, providing:
- A unified interface to compare different cover detection models.
- Experiments for evaluating effectiveness on known cover-song datasets.
- A fast, user-friendly GUI using Gradio.
Initially, we planned to use the Da-TACOS dataset. However, we pivoted to using SHS100k for training models due to practical constraints. For experiments, we focus on Covers80 and synthetic datasets derived from it.
Main technologies in use:
- Python: Our technology stack is based on Python, given its strong ecosystem for working with data.
- PyTorch: Deep learning library.
- Gradio: The user interface is implemented with the Gradio library, a very convenient tool for fast prototyping.
- Numpy: Library for numerical operations.
- Librosa: Used for audio loading and some feature extraction (MFCC, spectral centroid) in the simpler comparison methods.
- venv (or another tool): For making the project easily portable.
- OpenAI Whisper: Used by Lyricover to transcribe lyrics and measure similarity in lyric space.
Other programs that are needed to run:
We currently have four main cover-detection models:

- ByteCover:
  - A neural-based approach, leveraging specialized embeddings for audio.
  - Checkpoints are loaded from `bytecover_checkpoint_path`.
- CoverHunter:
  - Another deep learning–based model, configured via a YAML file and a checkpoint directory.
  - Paths in `paths.yaml` guide where to load the model weights.
- Lyricover:
  - Inspired by the paper "The Words Remain the Same: Cover Detection with Lyrics Transcription".
  - It is our own implementation, joining text extraction using OpenAI Whisper with an n-gram representation and spectral features.
  - The result is obtained via a simple neural network that joins the predictions.
- Re-move:
  - From the paper "Less is more: Faster and better music version identification with embedding distillation" (https://arxiv.org/pdf/2010.03284).
  - It was trained on CREMA features ("convolutional and recurrent estimators for music analysis").
  - The model is lightweight and fast to train.

Each of these models outputs a similarity score for a given pair of audio files. A threshold then decides if two songs are considered "covers," as sketched below.
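A minimal sketch of the score-then-threshold decision, assuming a hypothetical pairwise similarity function (the actual wrapper names in the hub may differ):

```python
from typing import Callable

def is_cover(
    similarity_fn: Callable[[str, str], float],
    audio_a: str,
    audio_b: str,
    threshold: float = 0.75,
) -> tuple[float, bool]:
    """Return the similarity score and the thresholded cover/non-cover decision."""
    score = similarity_fn(audio_a, audio_b)  # model-specific similarity score
    return score, score >= threshold

# Example usage with any model wrapper exposing a pairwise similarity function:
# score, verdict = is_cover(bytecover_similarity, "a.mp3", "b.mp3", threshold=0.8)
```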
Apart from the main deep-learning models, we also included two simpler methods for demonstration and baseline comparison:
- MFCC (Mel-Frequency Cepstral Coefficients)
- Spectral Centroid
These can be used to compare two audio files based on feature similarity (e.g., cosine similarity).
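As an illustration of the baseline idea (not the hub's exact code), a minimal MFCC comparison with librosa and cosine similarity might look like this:

```python
import librosa
import numpy as np

def mfcc_similarity(path_a: str, path_b: str, n_mfcc: int = 20) -> float:
    """Cosine similarity between time-averaged MFCC vectors of two recordings."""
    def embed(path: str) -> np.ndarray:
        y, sr = librosa.load(path, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return mfcc.mean(axis=1)                                # time-averaged vector

    a, b = embed(path_a), embed(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The spectral-centroid baseline works the same way, swapping `librosa.feature.mfcc` for `librosa.feature.spectral_centroid`.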
The hub includes references to the following datasets:
- Covers80: A standard collection used in many cover-song identification studies.
- Covers80but10: A smaller variant with only 10 songs (for quick testing).
- Injected Abracadabra: A synthetic dataset where a portion of "Abracadabra" by Steve Miller Band is injected into other audio samples, as described in Batlle-Roca et al.
Note: The actual training of ByteCover, CoverHunter, Remove and Lyricover was performed on SHS100k, using a university server with GPU machines. The Covers80-related datasets are primarily for testing and demonstration.
As described in this document, we decided to mainly use the SHS100k dataset for training. We based our data on the metadata file from the bytecover repository.
The dataset is organised into 9998 cliques (groups of different performances of a single work; we consider all performances from one particular clique to be covers of each other). Each clique contains several samples. For each sample, the metadata provides the song title, its YouTube video ID, and its SecondHandSongs ID. We managed to obtain approximately 78k samples, totalling about 300 GB (some videos had been deleted, roughly 30 GB worth).
For further processing, the main identifier of each individual sample is its YouTube video ID.
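A minimal sketch of grouping such metadata into cliques, assuming it is available as tab-separated rows with clique ID and YouTube ID columns; the exact file format and column names in the bytecover repository may differ:

```python
import csv
from collections import defaultdict

def load_cliques(metadata_path: str) -> dict[str, list[str]]:
    """Group YouTube video IDs by clique ID; all IDs within a clique are mutual covers."""
    cliques: dict[str, list[str]] = defaultdict(list)
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Column names are assumptions; adapt them to the actual metadata header.
            cliques[row["clique_id"]].append(row["youtube_id"])
    return dict(cliques)
```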
After installing dependencies and ensuring your checkpoints/datasets exist:
Launch the Gradio interface:
python gradio_app.py
A browser tab should open with two tabs:
- Cover Song Identification
- Upload two audio files (e.g., .mp3, formats from python-soundfile are supported), select a model (ByteCover, CoverHunter, Lyricover, Remove, MFCC, or Spectral Centroid), and set a threshold.
- The interface will compute a similarity score and return whether it considers them covers.
- Model Testing
- Choose a CSI model (ByteCover, CoverHunter, Remove, MFCC, Spectral Centroid).
- Pick a dataset (Covers80, Covers80but10, or Injected Abracadabra).
- Select a threshold. The system then computes evaluation metrics (mAP, Precision@10, MR1, etc.) on that dataset, printing a summary.
- Synthetic Injection (“Injected Abracadabra”)
- Based on the partial injection method from Batlle-Roca et al.
- We inject a segment of an original piece (Abracadabra) into other tracks to create partial covers, then measure how well each model detects the covers.
- Covers80 evaluation: the samples were compared on a pairwise basis, i.e., pairs of tracks were selected either from the same groups (in which case they were considered covers) or from different groups (in which case they were classified as non-covers), as sketched below.
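A minimal sketch of this pairwise setup, assuming a list of `(audio_path, group_label)` tuples such as the one returned by `gather_covers80_dataset_files` (the exact signatures in the hub may differ):

```python
import itertools
import random

def make_pairs(files: list[tuple[str, str]], n_negative: int = 500, seed: int = 0):
    """Build (path_a, path_b, is_cover) pairs from (audio_path, group_label) tuples."""
    rng = random.Random(seed)
    pairs = [
        (a, b, True)
        for (a, ga), (b, gb) in itertools.combinations(files, 2)
        if ga == gb  # tracks from the same group are covers of each other
    ]
    negatives = 0
    while negatives < n_negative:
        (a, ga), (b, gb) = rng.sample(files, 2)
        if ga != gb:  # tracks from different groups are non-covers
            pairs.append((a, b, False))
            negatives += 1
    return pairs
```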
We maintain a `tests/` directory with `pytest` test files. Below is a summary of the test coverage:

- `test_paths_exist`: Verifies that all paths in `configs/paths.yaml` exist and match their expected type (directory or file).
- `test_gather_covers80_dataset_files`: Ensures `gather_covers80_dataset_files` returns a non-empty list of `(audio_path, label)` tuples.
- `test_covers80_random_subfolders_have_two_mp3s`: Randomly checks subfolders in `covers80_data_dir` for at least two `.mp3` files.
- `test_gather_covers80but10_dataset_files`: Similar to the above but for the dataset variant.
- `test_covers80but10_random_subfolders_have_two_mp3s`: Verifies subfolders in `covers80but10_data_dir` contain at least two `.mp3` files.
- `test_compute_metrics_for_ranking_no_relevant`: Tests metrics when no relevant items are in the ranking.
- `test_compute_metrics_for_ranking_all_relevant`: Tests metrics when all items in the ranking are relevant.
- `test_compute_metrics_for_ranking_mixed`: Validates metrics for mixed relevant and irrelevant items.
- `test_compute_mean_metrics_for_rankings_empty`: Ensures mean metrics for an empty ranking set return 0.
- `test_compute_mean_metrics_for_rankings_single`: Verifies mean metrics match individual metrics for a single ranking.
- `test_compute_mean_metrics_for_rankings_multiple`: Validates mean metrics across multiple rankings against manual calculations.
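For illustration, a hedged sketch of how a test like `test_paths_exist` could be written; the actual test in `tests/` may be structured differently, and the directory/file heuristic below is an assumption:

```python
# Illustrative version of test_paths_exist (not necessarily the repo's exact test).
from pathlib import Path
import yaml

def test_paths_exist():
    config = yaml.safe_load(Path("configs/paths.yaml").read_text())
    for key, value in config.items():
        path = Path(value)
        assert path.exists(), f"{key} points to a non-existent path: {value}"
        # Assumed heuristic: keys ending in "_dir" are directories, others are files.
        if key.endswith("_dir"):
            assert path.is_dir(), f"{key} should be a directory: {value}"
        else:
            assert path.is_file(), f"{key} should be a file: {value}"
```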
Paper | Notes |
---|---|
CoverHunter: Cover Song Identification with Refined Attention and Alignments | The paper proposes a new method for the CSI task called CoverHunter. It explores features more deeply with refined attention and alignments. It has 3 crucial modules: 1) a convolution-augmented transformer; 2) an attention-based time pooling module; 3) a novel training scheme. The authors report excellent results, beating all existing methods at the time, tested on benchmarks like Da-TACOS and SHS100K. In general, they propose a state-of-the-art model, which is definitely one of the current best, which is why it is worth including in our project. A PyTorch implementation of this method can be found in the CoverHunter repository, along with checkpoints of pretrained models. |
BYTECOVER: COVER SONG IDENTIFICATION VIA MULTI-LOSS TRAINING | This paper from 2021 introduces a new feature learning method for the cover song identification task. It is built on a classical ResNet model with improvements designed for CSI. In a set of experiments, the authors demonstrate its effectiveness and efficiency. They evaluate the method on multiple datasets, including Da-TACOS. The bytecover repository shares an implementation of this method with the best-trained model checkpoints. Thoughts: There is no transformer in the method, which may imply worse results than CoverHunter. |
ESSENTIA: AN AUDIO ANALYSIS LIBRARY FOR MUSIC INFORMATION RETRIEVAL | This paper describes a framework for multiple MIR applications. The tool consists of a number of reconfigurable modules that come in handy for researchers. For our case, an interesting approach is to use the harmonic pitch class profile and the chroma features of audio signals to calculate the similarity of two tracks. This model is very basic and well known; therefore, it will serve as a reference. The metric used in this model is obtained from a binary cross-similarity matrix, which can finally be converted into a numeric value using the Smith-Waterman sequence alignment algorithm. We decided to reproduce the experiments for embeddings using MFCC and spectral centroid, but using the librosa library. |
THE WORDS REMAIN THE SAME: COVER DETECTION WITH LYRICS TRANSCRIPTION | In this paper, the authors propose a different approach, combining a lyrics-recognition-based system with a classic tonal-based system, using datasets like Da-TACOS and DALI for cover detection. They fuse several components: 1) a state-of-the-art lyrics recognition framework from MIREX 2020, based on a Time Delay Neural Network (TDNN) trained on the English tracks of the DALI dataset, with Singing Voice Separation (SVS) applied as a background-music preprocessing step and Mel-Frequency Cepstral Coefficients (MFCC) used in the complete framework; 2) string matching to calculate the similarity between pairs of transcripts; 3) tonal-based cover detection using Re-MOVE, trained on part of the Da-TACOS dataset. Another interesting, more classic option is HPCP features for cover detection. We implemented a related joint approach, LyriCover, which combines lyrics retrieval with classic tonal methods. |
Dataset | Details |
---|---|
Da-TACOS – Dataset for Cover Song Identification and Understanding | Two subsets: 1. Benchmark Subset (15,000 songs); 2. Cover Analysis Subset (10,000 songs). Thoughts: This dataset has become a classic benchmark for testing CSI systems. Moreover, the authors of the paper, along with the dataset, also provided a framework for feature extraction and benchmarking: acoss (Audio Cover Song Suite). acoss includes a standard feature extraction framework with audio features for the CSI task and open-source implementations of seven CSI algorithms. It was designed to facilitate future work in this line of research. Although the dataset is relatively new (2019), both repositories have not been updated in 5 years, and considering how rapidly the MIR domain develops, 5 years is a lot. That is why our project can be an attempt to create a refreshed and modern version of this framework, including state-of-the-art methods and, hopefully, additional datasets to test them. |
Covers80 | Thoughts: We will not use the Covers80 dataset as the primary dataset because it is relatively small and old (2007). Additionally, the audio files are of low quality (32 kbps, 16 kHz mono). The dataset was assembled somewhat randomly and may not provide sufficient diversity or representativeness. However, it has become a CSI benchmark, which is why, if we have enough time, we will try to include it in our project. The dataset appeared in the paper THE 2007 LABROSA COVER SONG DETECTION SYSTEM. |
SHS100K | Thoughts: This dataset served as our primary dataset for training purposes. |
ZAIKS dataset | A friendly organization in Poland that will provide a music dataset for testing purposes; these will probably be Polish songs and their famous cover versions. |
The selection of metrics is based on the MIREX Cover Song Identification contest.
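For reference, a minimal sketch of these ranking metrics (average precision, Precision@10, and rank of the first correct cover) computed per query; the hub's own `compute_metrics_for_ranking` helpers may differ in detail:

```python
import numpy as np

def ranking_metrics(relevant: list[bool], k: int = 10) -> dict[str, float]:
    """Compute AP, Precision@k and first-correct-cover rank for one ranked result list."""
    rel = np.asarray(relevant, dtype=bool)
    hits = np.flatnonzero(rel)
    if hits.size == 0:
        return {"AP": 0.0, f"P@{k}": 0.0, "R1": float("inf")}
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return {
        "AP": float(precisions[rel].mean()),  # average precision over relevant ranks
        f"P@{k}": float(rel[:k].mean()),      # precision among the top-k results
        "R1": float(hits[0] + 1),             # 1-based rank of the first correct cover
    }

# mAP, mP@10 and MR1 are then the means of these values over all query tracks.
```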
Dataset / Model | Mean Average Precision (mAP) | Precision at 10 (P@10) / mP@10 | Mean Rank of First Correct Cover (MR1 / mMR1) |
---|---|---|---|
Model: re-move | 0.86600 | 0.87500 | 1.00000 |
Model: CoverHunter | 0.78400 | 0.90000 | 1.00000 |
Model: Lyricover | 0.82029 | 0.90000 | 1.00000 |
Model: ByteCover | 0.51435 | 0.70000 | 1.00000 |
MFCC | 0.24015 | 0.30000 | 3.00000 |
Spectral Centroid | 0.24159 | 0.30000 | 3.00000 |
Dataset / Model | Mean Average Precision (mAP) | Precision at 10 (P@10) / mP@10 | Mean Rank of First Correct Cover (MR1 / mMR1) |
---|---|---|---|
Model: re-move | 0.83020 | 0.09695 | 8.04268 |
Model: Lyricover | 0.83425 | 0.09939 | 7.41463 |
Model: CoverHunter | 0.74161 | 0.09268 | 9.52439 |
Model: ByteCover | 0.52172 | 0.07927 | 14.78049 |
MFCC | 0.24015 | 0.30000 | 3.00000 |
Spectral Centroid | 0.04352 | 0.00793 | 76.80488 |
- Gathering literature
- preparing design proposal
- tools selection
- selection of datasets
- Preparing the environment
- choice for the models
- initial dataset preprocessing
- Implementation of the first functional prototype
- including training at least one model
- minimal GUI
- First results evaluation
- implementing improvements
- training
- adding subsequent models
- scraping SHS100K dataset
- Automated tests design
- training of remaining models
- Evaluation of the results
- improving GUI
- re-training the models if necessary
- Working on the final presentation
- tests
- gathering final results, Christmas chill (optional)
- Final results evaluation
- preparation of the paper (?)
🎉 Public project presentation 🎉
- obtaining the Da-TACOS dataset and possibly performing training and evaluation on it
- improving the LyriCover model with more sophisticated audio feature extraction; training on a larger subset of SHS100k after improving the performance of the model
- possible co-operation with ZAIKS to form a new dataset and deploy the solution
- performing more experiments, similar to "Injected Abracadabra" or others found in the literature
- augmentations are a very interesting field of experiments
- currently, the Re-move model uses the `essentia` package, which is unavailable on Windows. This makes the whole app runnable only on Unix operating systems. It would be advisable to implement the necessary methods from this package ourselves so that the app is runnable on all systems.
The logo has been designed using the DALL-E model.