(English|Chinese)

Whispering: A dynamic multi-language, multi-task Whisper model training framework

Whispering supports pre-training and fine-tuning of all Whisper models open-sourced by OpenAI on Hugging Face. It uses the UIO method for data loading, which greatly alleviates the IO bottleneck in large-scale training. The framework has been verified on datasets of tens of thousands of hours, with stable and efficient training.

Core Features

  • Supports multiple tasks (speech recognition, speech translation, VAD, etc.) in multiple languages simultaneously
  • Supports two training data formats: raw and shard
  • Supports two batch types: static and dynamic
  • Supports data augmentation methods such as spec_aug and shuffle
  • Supports metrics such as CER, WER, and BLEU for selecting the best model
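As a rough sketch of the spec_aug idea (time and frequency masking on a log-mel spectrogram; an illustration, not this framework's exact implementation):

```python
import numpy as np

def spec_augment(mel, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=50, rng=None):
    """Zero out random frequency bands and time spans of a
    (frames, mels) log-mel spectrogram, SpecAugment-style."""
    rng = rng or np.random.default_rng()
    mel = mel.copy()
    frames, mels = mel.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, mels - w)))
        mel[:, f0:f0 + w] = 0.0          # mask a frequency band
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, frames - w)))
        mel[t0:t0 + w, :] = 0.0          # mask a time span
    return mel
```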

Environment Setup

  • Mandatory requirements: torch>=1.13.0, transformers>=4.28.0
conda create -n whispering python=3.10
conda activate whispering
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

pip install -r requirements.txt

Model Download

Please download the pre-trained model from openai/whisper

mkdir pretrain_model/ && cd pretrain_model/

git clone https://huggingface.co/openai/whisper-base

Note: In the config.json shipped with the official model, both bos_token_id and eos_token_id are set to 50257, which appears to be a bug.

Therefore, when padding, this framework strips the first token from labels by matching decoder_start_token_id (50258), instead of bos_token_id as in the official tutorial.
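That label-stripping step can be sketched as follows (a minimal illustration assuming batched labels as a 2-D tensor; `strip_leading_start_token` is a hypothetical helper name, not this repo's API):

```python
import torch

def strip_leading_start_token(labels: torch.Tensor,
                              decoder_start_token_id: int = 50258) -> torch.Tensor:
    """Drop the leading <|startoftranscript|> token (50258) when every
    sequence in the batch begins with it; the model re-prepends it when
    shifting labels right. Matching on bos_token_id would misfire here,
    since the official config sets it to 50257 (same as eos_token_id)."""
    if (labels[:, 0] == decoder_start_token_id).all():
        labels = labels[:, 1:]
    return labels
```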

Data Preparation

First, prepare the text and wav.scp files, then use the provided script to convert them automatically to the raw or shard training data format

  1. Create the train, dev, and test folders
cd examples/aishell/s0

bash run.sh --stage -1 --stop_stage -1
  2. Manually generate the text and wav.scp files and place them under the train, dev, and test folders
  • Single-language single-task text and wav.scp example
==> text <==
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后

==> wav.scp <==
BAC009S0002W0122 /data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /data_aishell/wav/train/S0002/BAC009S0002W0124.wav
  • Multi-language multi-task text and wav.scp example. Explanation of the parameters in text:

    None of the parameters is mandatory; the minimal input is key {}, i.e., training without annotation, which is equivalent to setting sentence to <|nospeech|>

    The sentences field is also optional (it is used for training with timestamps); multiple timestamped segments can be added to the sentences list

==> text <==
BAC009S0002W0122 {"key": "BAC009S0002W0122", "language": "chinese", "task": "transcribe", "sentence": "而对楼市成交抑制作用最大的限购", "sentences": [{"start": 0, "end": 6.0, "text": "而对楼市成交抑制作用最大的限购"}]}
BAC009S0002W0123 {"key": "BAC009S0002W0123", "language": "chinese", "task": "transcribe", "sentence": "也成为地方政府的眼中钉", "sentences": [{"start": 0, "end": 3.87, "text": "也成为地方政府的眼中钉"}]}
BAC009S0002W0124 {"key": "BAC009S0002W0124", "language": "chinese", "task": "transcribe", "sentence": "自六月底呼和浩特市率先宣布取消限购后", "sentences": [{"start": 0, "end": 5.41, "text": "自六月底呼和浩特市率先宣布取消限购后"}]}

==> wav.scp <==
BAC009S0002W0122 /data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /data_aishell/wav/train/S0002/BAC009S0002W0124.wav
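To make the format concrete, one line of the multi-task text file can be split into its key and JSON annotation like this (an illustrative helper, not part of the framework):

```python
import json

def parse_text_line(line: str):
    """Split one text line into (key, annotation dict).
    A bare 'key {}' entry means unannotated audio, which the
    framework treats as <|nospeech|>."""
    key, _, payload = line.strip().partition(" ")
    return key, (json.loads(payload) if payload else {})
```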
  3. Generate the data.list training file
# Make sure examples/aishell/s0/data has the following files
data/
├── dev
│   ├── text
│   └── wav.scp
├── test
│   ├── text
│   └── wav.scp
└── train
    ├── text
    └── wav.scp

# Generate raw/shard format training data; shard is recommended for large data volumes
bash run.sh --stage 0 --stop_stage 0 --data_type shard
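For reference, joining text and wav.scp into a raw-format data.list could be sketched as below. The one-JSON-object-per-line schema with key/wav/txt fields follows wenet's raw format and is an assumption, not verified against this repository's script:

```python
import json

def build_raw_data_list(text_path: str, scp_path: str, out_path: str) -> int:
    """Join text and wav.scp by utterance key and write one JSON object
    per line (wenet-style raw format, assumed). Returns the line count."""
    def load_kv(path):
        with open(path, encoding="utf-8") as f:
            return dict(line.strip().split(maxsplit=1)
                        for line in f if line.strip())
    texts, wavs = load_kv(text_path), load_kv(scp_path)
    n = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for key, wav in wavs.items():
            if key in texts:  # keep only utterances present in both files
                out.write(json.dumps({"key": key, "wav": wav,
                                      "txt": texts[key]},
                                     ensure_ascii=False) + "\n")
                n += 1
    return n
```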

Quick Start

Training phase

bash run.sh --stage 1 --stop_stage 1

Log monitoring

# View training log
tail -f finetuned_model/whispering/train_log/log_2024-03-28_11-40-25.txt

# View tensorboard
tensorboard --host 0.0.0.0 --port 6006 --logdir finetuned_model/whispering/tensorboard/

Testing phase

bash run.sh --stage 2 --stop_stage 2

# View test results
tail finetuned_model/whispering/test_cer.txt
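For reference, CER is the Levenshtein edit distance between the hypothesis and reference characters, divided by the reference length. A minimal sketch (not the framework's scoring code):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits divided by reference length."""
    return edit_distance(ref, hyp) / max(1, len(ref))
```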

Contact Us

If you run into problems, you can raise an Issue directly on the GitHub page. We welcome speech enthusiasts to communicate and discuss.

Acknowledgments

  1. The dataloader and trainer are largely based on the implementation in wenet

  2. The tokenizer is based on the implementation in Whisper-Finetune