(English|Chinese)
Whispering supports pre-training and fine-tuning of all Whisper models open-sourced by OpenAI on Hugging Face. It uses the UIO method for data loading, which greatly alleviates the IO bottleneck in large-scale training. The framework has been verified on datasets of tens of thousands of hours, with stable and efficient training.
- Supports multiple tasks (speech recognition, speech translation, VAD, etc.) in multiple languages simultaneously
- Supports two training data formats: raw and shard
- Supports two batch types: static and dynamic
- Supports data augmentation methods such as spec_aug and shuffle
- Supports cer, wer, bleu and other metrics for selecting the best model
- Mandatory requirements: torch>=1.13.0 transformers>=4.28.0
conda create -n whispering python==3.10
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
conda activate whispering
pip install -r requirements.txt
Please download the pre-trained model from openai/whisper
mkdir pretrain_model/ && cd pretrain_model/
git clone https://huggingface.co/openai/whisper-base
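Once the checkpoint is cloned, it can be loaded locally with the transformers API for a quick sanity check (a minimal sketch; the path follows the clone command above):

```python
# Minimal sanity check that the downloaded checkpoint loads (transformers>=4.28.0).
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path = "pretrain_model/whisper-base"  # path created by the clone above
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

print(model.config.decoder_start_token_id)  # 50258, see the note below on token ids
```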
Note: In the config.json provided by the official model, both bos_token_id and eos_token_id are set to 50257, which might be a bug. Therefore, when padding labels, the decoder_start_token_id (50258) is used to strip the first token from the labels, instead of the bos_token_id used in the official tutorial.
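As an illustration of the behavior described in this note (a hypothetical collator sketch, not the framework's actual code), the leading token is stripped from the padded labels when it equals decoder_start_token_id:

```python
# Illustrative label padding following the note above; this mirrors the official
# tutorial's collator but checks decoder_start_token_id (50258) instead of bos_token_id.
def pad_labels(label_features, tokenizer, decoder_start_token_id=50258):
    batch = tokenizer.pad(label_features, return_tensors="pt")
    # Padding positions are replaced with -100 so the loss ignores them.
    labels = batch["input_ids"].masked_fill(batch["attention_mask"].ne(1), -100)
    # Drop the leading <|startoftranscript|> token; the model re-adds it by shifting.
    if (labels[:, 0] == decoder_start_token_id).all().cpu().item():
        labels = labels[:, 1:]
    return labels
```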
First, prepare the text and wav.scp files, then use the provided script to automatically convert them to the raw or shard training data format
- Create the train/dev/test folders
cd examples/aishell/s0
bash run.sh --stage -1 --stop_stage -1
- Manually generate text and wav.scp files and place them under the train/dev/test folders
- Example text and wav.scp for single-language, single-task data
==> text <==
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后
==> wav.scp <==
BAC009S0002W0122 /data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /data_aishell/wav/train/S0002/BAC009S0002W0124.wav
- Example text and wav.scp for multi-language, multi-task data. Explanation of the fields in text (a script that builds such lines is sketched after the example below):
Not all fields are required; the minimal input is key {}, i.e. training without annotation, which is equivalent to setting sentence to <|nospeech|>
The sentences field is optional (it is used for training with timestamps); multiple timestamped segments can be added to the sentences list
==> text <==
BAC009S0002W0122 {"key": "BAC009S0002W0122", "language": "chinese", "task": "transcribe", "sentence": "而对楼市成交抑制作用最大的限购", "sentences": [{"start": 0, "end": 6.0, "text": "而对楼市成交抑制作用最大的限购"}]}
BAC009S0002W0123 {"key": "BAC009S0002W0123", "language": "chinese", "task": "transcribe", "sentence": "也成为地方政府的眼中钉", "sentences": [{"start": 0, "end": 3.87, "text": "也成为地方政府的眼中钉"}]}
BAC009S0002W0124 {"key": "BAC009S0002W0124", "language": "chinese", "task": "transcribe", "sentence": "自六月底呼和浩特市率先宣布取消限购后", "sentences": [{"start": 0, "end": 5.41, "text": "自六月底呼和浩特市率先宣布取消限购后"}]}
==> wav.scp <==
BAC009S0002W0122 /data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /data_aishell/wav/train/S0002/BAC009S0002W0124.wav
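For reference, a line of the multi-task text file can be produced from a plain transcript like this (illustrative script; field names are taken from the example above):

```python
import json

# Build one line of the multi-task text file shown above (illustrative only).
def make_text_line(key, sentence, start, end, language="chinese", task="transcribe"):
    entry = {
        "key": key,
        "language": language,
        "task": task,
        "sentence": sentence,
        "sentences": [{"start": start, "end": end, "text": sentence}],
    }
    return f"{key} {json.dumps(entry, ensure_ascii=False)}"

print(make_text_line("BAC009S0002W0122", "而对楼市成交抑制作用最大的限购", 0, 6.0))
```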
- Generate the training data list (data.list)
# Make sure examples/aishell/s0/data has the following files
data/
├── dev
│   ├── text
│   └── wav.scp
├── test
│   ├── text
│   └── wav.scp
└── train
    ├── text
    └── wav.scp
# Generate training data in raw/shard format; shard is recommended for large data volumes
bash run.sh --stage 0 --stop_stage 0 --data_type shard
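run.sh stage 0 performs the actual conversion. For orientation only, a raw-format data.list typically pairs each utterance's wav path and transcript in one JSON object per line; the sketch below assumes a wenet-style layout, so check the generated file for the exact fields:

```python
import json

# Hypothetical sketch of combining wav.scp and text into a raw-format data.list,
# assuming a wenet-style layout; the run.sh stage 0 script is the authoritative tool.
def make_raw_list(wav_scp, text_file, out_file):
    with open(wav_scp, encoding="utf-8") as f:
        wavs = dict(line.strip().split(maxsplit=1) for line in f)
    with open(text_file, encoding="utf-8") as fin, open(out_file, "w", encoding="utf-8") as fout:
        for line in fin:
            key, txt = line.strip().split(maxsplit=1)
            fout.write(json.dumps({"key": key, "wav": wavs[key], "txt": txt},
                                  ensure_ascii=False) + "\n")

make_raw_list("data/train/wav.scp", "data/train/text", "data/train/data.list")
```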
Training phase
bash run.sh --stage 1 --stop_stage 1
Log monitoring
# View training log
tail -f finetuned_model/whispering/train_log/log_2024-03-28_11-40-25.txt
# View tensorboard
tensorboard --host 0.0.0.0 --port 6006 --logdir finetuned_model/whispering/tensorboard/
Testing phase
bash run.sh --stage 2 --stop_stage 2
# View test results
tail finetuned_model/whispering/test_cer.txt
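For reference, the metrics used for model selection (cer, wer, bleu) can also be computed offline with common libraries; jiwer here is an assumption for illustration, not a stated dependency of this repo:

```python
# Offline CER/WER computation with jiwer (illustration only; not necessarily the
# implementation used inside Whispering). CER suits Chinese, WER space-delimited text.
import jiwer

ref = "而对楼市成交抑制作用最大的限购"
hyp = "而对楼市成交抑制作用最大的现购"
print("CER:", jiwer.cer(ref, hyp))   # 1 substitution over 15 characters ≈ 0.067

print("WER:", jiwer.wer("the limit on purchases", "the limits on purchase"))  # 0.5
```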
If you encounter problems during use, you can open an Issue directly on the GitHub page. We welcome speech enthusiasts to communicate and discuss.
- The dataloader and trainer largely refer to the implementation of wenet
- The tokenizer part refers to the implementation of Whisper-Finetune