(English|Chinese)
Whispering supports pre-training and fine-tuning of all Whisper models open-sourced by OpenAI on Hugging Face. It uses the UIO method for data loading, which greatly alleviates the IO bottleneck in large-scale training. The framework has been verified on datasets of tens of thousands of hours, with stable and efficient training.
- Supports multiple tasks (speech recognition, speech translation, VAD, etc.) in multiple languages simultaneously
- Supports two training data formats: raw and shard
- Supports two batch types: static and dynamic
- Supports data augmentation methods such as spec_aug and shuffle
- Supports cer, wer, bleu and other metrics for selecting the best model
- Mandatory requirements: torch>=1.13.0 transformers>=4.28.0
conda create -n whispering python==3.10
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
conda activate whispering
pip install -r requirements.txt
Please download the pre-trained model from openai/whisper
mkdir pretrain_model/ && cd pretrain_model/
git clone https://huggingface.co/openai/whisper-base
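Once the checkpoint is cloned, it can be loaded locally with the transformers API for a quick sanity check (a minimal sketch; the path follows the clone command above):

```python
# Minimal sanity check that the downloaded checkpoint loads (transformers>=4.28.0).
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path = "pretrain_model/whisper-base"  # path created by the clone above
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

print(model.config.decoder_start_token_id)  # 50258, see the note below on token ids
```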
Note: In the config.json provided by the official model, both bos_token_id and eos_token_id are set to 50257, which might be a bug. Therefore, when padding labels, the decoder_start_token_id (50258) is used to strip the first token from the labels, instead of the bos_token_id used in the official tutorial.
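As an illustration of the behavior described in this note (a hypothetical collator sketch, not the framework's actual code), the leading token is stripped from the padded labels when it equals decoder_start_token_id:

```python
# Illustrative label padding following the note above; this mirrors the official
# tutorial's collator but checks decoder_start_token_id (50258) instead of bos_token_id.
def pad_labels(label_features, tokenizer, decoder_start_token_id=50258):
    batch = tokenizer.pad(label_features, return_tensors="pt")
    # Padding positions are replaced with -100 so the loss ignores them.
    labels = batch["input_ids"].masked_fill(batch["attention_mask"].ne(1), -100)
    # Drop the leading <|startoftranscript|> token; the model re-adds it by shifting.
    if (labels[:, 0] == decoder_start_token_id).all().cpu().item():
        labels = labels[:, 1:]
    return labels
```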
First, prepare the text and wav.scp files, then use the provided script to automatically convert them to the raw or shard training data format
- Create the train/dev/test folders
cd examples/aishell/s0
bash run.sh --stage -1 --stop_stage -1
- Manually generate text and wav.scp files and place them under the train/dev/test folders
- Example text and wav.scp for single-language, single-task data
==> text <==
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后
==> wav.scp <==
BAC009S0002W0122 /data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /data_aishell/wav/train/S0002/BAC009S0002W0124.wav
- Example text and wav.scp for multi-language, multi-task data. Explanation of the fields in text (a script that builds such lines is sketched after the example below):
Not all fields are required; the minimal input is key {}, i.e. training without annotation, which is equivalent to setting sentence to <|nospeech|>
The sentences field is optional (it is used for training with timestamps); multiple timestamped segments can be added to the sentences list
==> text <==
BAC009S0002W0122 {"key": "BAC009S0002W0122", "language": "chinese", "task": "transcribe", "sentence": "而对楼市成交抑制作用最大的限购", "sentences": [{"start": 0, "end": 6.0, "text": "而对楼市成交抑制作用最大的限购"}]}
BAC009S0002W0123 {"key": "BAC009S0002W0123", "language": "chinese", "task": "transcribe", "sentence": "也成为地方政府的眼中钉", "sentences": [{"start": 0, "end": 3.87, "text": "也成为地方政府的眼中钉"}]}
BAC009S0002W0124 {"key": "BAC009S0002W0124", "language": "chinese", "task": "transcribe", "sentence": "自六月底呼和浩特市率先宣布取消限购后", "sentences": [{"start": 0, "end": 5.41, "text": "自六月底呼和浩特市率先宣布取消限购后"}]}
==> wav.scp <==
BAC009S0002W0122 /data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /data_aishell/wav/train/S0002/BAC009S0002W0124.wav
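For reference, a line of the multi-task text file can be produced from a plain transcript like this (illustrative script; field names are taken from the example above):

```python
import json

# Build one line of the multi-task text file shown above (illustrative only).
def make_text_line(key, sentence, start, end, language="chinese", task="transcribe"):
    entry = {
        "key": key,
        "language": language,
        "task": task,
        "sentence": sentence,
        "sentences": [{"start": start, "end": end, "text": sentence}],
    }
    return f"{key} {json.dumps(entry, ensure_ascii=False)}"

print(make_text_line("BAC009S0002W0122", "而对楼市成交抑制作用最大的限购", 0, 6.0))
```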
- Generate the training data list (data.list)
# Make sure examples/aishell/s0/data has the following files
data/
├── dev
│   ├── text
│   └── wav.scp
├── test
│   ├── text
│   └── wav.scp
└── train
    ├── text
    └── wav.scp
# Generate training data in raw/shard format; shard is recommended for large data volumes
bash run.sh --stage 0 --stop_stage 0 --data_type shard
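run.sh stage 0 performs the actual conversion. For orientation only, a raw-format data.list typically pairs each utterance's wav path and transcript in one JSON object per line; the sketch below assumes a wenet-style layout, so check the generated file for the exact fields:

```python
import json

# Hypothetical sketch of combining wav.scp and text into a raw-format data.list,
# assuming a wenet-style layout; the run.sh stage 0 script is the authoritative tool.
def make_raw_list(wav_scp, text_file, out_file):
    with open(wav_scp, encoding="utf-8") as f:
        wavs = dict(line.strip().split(maxsplit=1) for line in f)
    with open(text_file, encoding="utf-8") as fin, open(out_file, "w", encoding="utf-8") as fout:
        for line in fin:
            key, txt = line.strip().split(maxsplit=1)
            fout.write(json.dumps({"key": key, "wav": wavs[key], "txt": txt},
                                  ensure_ascii=False) + "\n")

make_raw_list("data/train/wav.scp", "data/train/text", "data/train/data.list")
```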
Training phase
bash run.sh --stage 1 --stop_stage 1
Log monitoring
# View training log
tail -f finetuned_model/whispering/train_log/log_2024-03-28_11-40-25.txt
# View tensorboard
tensorboard --host 0.0.0.0 --port 6006 --logdir finetuned_model/whispering/tensorboard/
Testing phase
bash run.sh --stage 2 --stop_stage 2
# View test results
tail finetuned_model/whispering/test_cer.txt
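For reference, the metrics used for model selection (cer, wer, bleu) can also be computed offline with common libraries; jiwer here is an assumption for illustration, not a stated dependency of this repo:

```python
# Offline CER/WER computation with jiwer (illustration only; not necessarily the
# implementation used inside Whispering). CER suits Chinese, WER space-delimited text.
import jiwer

ref = "而对楼市成交抑制作用最大的限购"
hyp = "而对楼市成交抑制作用最大的现购"
print("CER:", jiwer.cer(ref, hyp))   # 1 substitution over 15 characters ≈ 0.067

print("WER:", jiwer.wer("the limit on purchases", "the limits on purchase"))  # 0.5
```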
If you encounter problems during use, you can open an Issue directly on the GitHub page. We welcome speech enthusiasts to communicate and discuss.
- The dataloader and trainer largely refer to the implementation of wenet
- The tokenizer part refers to the implementation of Whisper-Finetune