# Choosing a dataset

As with any machine learning problem, our model is only as good as the data that we train it on. Speech recognition
datasets vary considerably in how they are curated and the domains that they cover. To pick the right dataset, we need
to match our criteria with the features that a dataset offers.

Before we pick a dataset, we first need to understand the key defining features.

## Features of speech datasets

### 1. Number of hours
Simply put, the number of training hours indicates how large the dataset is. It’s analogous to the number of training
examples in an NLP dataset. However, bigger datasets aren’t necessarily better. If we want a model that generalises well,
we want a **diverse** dataset with lots of different speakers, domains and speaking styles.

### 2. Domain
The domain refers to where the data was sourced from, whether it be audiobooks, podcasts, YouTube or financial meetings.
Each domain has a different distribution of data. For example, audiobooks are recorded in high-quality studio conditions
(with no background noise), with text taken from written literature, whereas YouTube audio likely contains more
background noise and a more informal style of speech.

We need to match our domain to the conditions we anticipate at inference time. For instance, if we train our model on
audiobooks, we can’t expect it to perform well in noisy environments.

### 3. Speaking style
The speaking style falls into one of two categories:

* Narrated: read from a script
* Spontaneous: un-scripted, conversational speech

The audio and text data reflect the style of speaking. Since narrated text is scripted, it tends to be spoken articulately
and without any errors:

```
“Consider the task of training a model on a speech recognition dataset”
```

Whereas for spontaneous speech, we can expect a more colloquial style of speech, with the inclusion of repetitions,
hesitations and false-starts:

```
“Let’s uhh let's take a look at how you'd go about training a model on uhm a sp- speech recognition dataset”
```

### 4. Transcription style
The transcription style refers to whether the target text has punctuation, casing or both. If we want a system to generate
fully formatted text that could be used for a publication or meeting transcription, we require training data with punctuation
and casing. If we just require the spoken words in an un-formatted structure, neither punctuation nor casing is necessary.
In this case, we can either pick a dataset without punctuation or casing, or pick one that has punctuation and casing and
then subsequently remove them from the target text through pre-processing.

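As a rough sketch of what that pre-processing might look like, the helper below lowercases the target text and strips
punctuation (the `normalise_transcription` function is our own illustration, not part of any library):

```python
import re


def normalise_transcription(text: str) -> str:
    """Lowercase the target text and strip punctuation."""
    text = text.lower()
    # remove everything except word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", "", text)
    # collapse any repeated whitespace
    return " ".join(text.split())


print(normalise_transcription("Consider the task, shall we?"))
# consider the task shall we
```
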
## A summary of datasets on the Hub

Here is a summary of the most popular English speech recognition datasets on the Hugging Face Hub:

| Dataset | Train Hours | Domain | Speaking Style | Casing | Punctuation | License | Recommended Use |
|---------|-------------|--------|----------------|--------|-------------|---------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | Audiobook | Narrated | ❌ | ❌ | CC-BY-4.0 | Academic benchmarks |
| [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) | 3000 | Wikipedia | Narrated | ✅ | ✅ | CC0-1.0 | Non-native speakers |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | European Parliament | Oratory | ❌ | ✅ | CC0 | Non-native speakers |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | TED talks | Oratory | ❌ | ❌ | CC-BY-NC-ND 3.0 | Technical topics |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 10000 | Audiobook, podcast, YouTube | Narrated, spontaneous | ❌ | ✅ | apache-2.0 | Robustness over multiple domains |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) | 5000 | Financial meetings | Oratory, spontaneous | ✅ | ✅ | User Agreement | Fully formatted transcriptions |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22) | 119 | Financial meetings | Oratory, spontaneous | ✅ | ✅ | CC-BY-SA-4.0 | Diversity of accents |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | 100 | Meetings | Spontaneous | ✅ | ✅ | CC-BY-4.0 | Noisy speech conditions |

This table serves as a reference for selecting a dataset based on your criteria. Below is an equivalent table for
multilingual speech recognition. Note that we omit the train hours column, since this varies depending on the language
for each dataset, and replace it with the number of languages per dataset:

| Dataset | Languages | Domain | Speaking Style | Casing | Punctuation | License | Recommended Usage |
|---------|-----------|--------|----------------|--------|-------------|---------|-------------------|
| [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 6 | Audiobooks | Narrated | ❌ | ❌ | CC-BY-4.0 | Academic benchmarks |
| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 108 | Wikipedia text & crowd-sourced speech | Narrated | ✅ | ✅ | CC0-1.0 | Diverse speaker set |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 15 | European Parliament recordings | Spontaneous | ❌ | ✅ | CC0 | European languages |
| [FLEURS](https://huggingface.co/datasets/google/fleurs) | 101 | Wikipedia text & read speech | Narrated | ❌ | ❌ | CC-BY-4.0 | Multilingual evaluation |

For a detailed breakdown of the audio datasets covered in both tables, refer to the blog post [A Complete Guide to Audio Datasets](https://huggingface.co/blog/audio-datasets#a-tour-of-audio-datasets-on-the-hub).
While there are over 180 speech recognition datasets on the Hub, there may not be one that matches your needs. In this
case, it's also possible to use your own audio data with 🤗 Datasets. To create a custom audio dataset, refer to the
guide [Create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset). When creating a custom audio
dataset, consider sharing the final dataset on the Hub so that others in the community can benefit from your efforts.
The audio community is inclusive and wide-ranging, and others will appreciate your work as you do theirs.

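As a quick sketch, if your recordings follow the `AudioFolder` layout described in that guide (a directory of audio
files alongside a `metadata.csv` with a `file_name` column), loading and sharing the dataset could look like the
following (the paths and repository name below are placeholders):

```python
from datasets import load_dataset

# expects my_audio_data/metadata.csv with a file_name column,
# plus a column holding the transcriptions
dataset = load_dataset("audiofolder", data_dir="my_audio_data")

# optionally, share the dataset on the Hub for others to use
dataset.push_to_hub("your-username/your-dataset-name")
```
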
Alright! Now that we've gone through all the criteria for selecting an ASR dataset, let's pick one for the purpose of this tutorial.
We know that Whisper already does a pretty good job at transcribing data in high-resource languages (such as English and Spanish), so
we'll focus on low-resource multilingual transcription. We want to retain Whisper's ability to predict punctuation and casing,
so it seems from the second table that Common Voice 13 is a great candidate dataset!

## Common Voice 13

Common Voice 13 is a crowd-sourced dataset where speakers record text from Wikipedia in various languages. It forms part of
the Common Voice series, a collection of datasets released by the Mozilla Foundation. At the time of writing,
Common Voice 13 is the latest edition of the dataset, with the most languages and hours per language of any release to date.

We can get the full list of languages for the Common Voice 13 dataset by checking out the dataset page on the Hub:
[mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).
The first time you view this page, you'll be asked to accept the terms of use. After that, you'll be given full access to the dataset.

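Note that to download the data programmatically later on, you'll also need to authenticate your machine with your
Hugging Face account, since the Common Voice datasets are gated. One way of doing this from a notebook is with
`notebook_login()`:

```python
from huggingface_hub import notebook_login

# from a terminal, run `huggingface-cli login` instead
notebook_login()
```
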
Once we've accepted the terms of use, we'll be presented with the dataset preview. The dataset preview
shows us the first 100 samples of the dataset for each language. What's more, it's loaded up with audio samples ready for us
to listen to in real time. For this Unit, we'll select [_Dhivehi_](https://en.wikipedia.org/wiki/Maldivian_language)
(or _Maldivian_), an Indo-Aryan language spoken in the South Asian island country of the Maldives. While we're selecting
Dhivehi for this tutorial, the steps covered here apply to any one of the 108 languages in the Common Voice 13 dataset, and
more generally to any one of the 180+ audio datasets on the Hugging Face Hub, so there's no restriction on language or dialect.

We can select the Dhivehi subset of Common Voice 13 by setting the subset to `dv` using the dropdown menu (`dv` being the language
identifier code for Dhivehi):

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/cv_13_dv_selection.png" alt="Selecting the Dhivehi split from the Dataset's Preview">
</div>

If we hit the play button on the first sample, we can listen to the audio and see the corresponding text. Have a scroll
through the samples for the train and test sets to get a better feel for the audio and text data that we're dealing with.
You can tell from the intonation and style that the recordings are taken from narrated speech. You'll also likely notice
the large variation in speakers and recording quality, a common trait of crowd-sourced data.

The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any
dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether
it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can
start using it.

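For instance, here's a minimal sketch of loading the Dhivehi train split with 🤗 Datasets and inspecting the first
sample, assuming you've accepted the terms of use and logged in (we'll load the data properly when we come to fine-tuning):

```python
from datasets import load_dataset

common_voice = load_dataset(
    "mozilla-foundation/common_voice_13_0", "dv", split="train"
)

sample = common_voice[0]
print(sample["sentence"])  # the target transcription
print(sample["audio"]["sampling_rate"])  # e.g. 48000 for Common Voice
```
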
Now, I personally don't speak Dhivehi, and I expect the vast majority of readers don't either! To know if our fine-tuned model
is any good, we'll need a rigorous way of _evaluating_ it on unseen data and measuring its transcription accuracy.
We'll cover exactly this in the next section!

# Build a demo with Gradio

Now that we've fine-tuned a Whisper model for Dhivehi speech recognition, let's go ahead and build a [Gradio](https://gradio.app)
demo to showcase it to the community!

The first thing to do is load up the fine-tuned checkpoint using the `pipeline()` function, which should be familiar by now from
the section on [pre-trained models](asr_models.mdx). You can change the `model_id` to the namespace of your fine-tuned
model on the Hugging Face Hub, or one of the pre-trained [Whisper models](https://huggingface.co/models?sort=downloads&search=openai%2Fwhisper-)
to perform zero-shot speech recognition:

```python
from transformers import pipeline

model_id = "sanchit-gandhi/whisper-small-dv"  # update with your model id
pipe = pipeline("automatic-speech-recognition", model=model_id)
```

Next, we'll define a function that takes the filepath of an audio input and passes it through the pipeline. Here,
the pipeline automatically takes care of loading the audio file, resampling it to the correct sampling rate, and running
inference with the model. We can then simply return the transcribed text as the output of the function. To ensure our
model can handle audio inputs of arbitrary length, we'll enable *chunking* as described in the section
on [pre-trained models](asr_models.mdx):

```python
def transcribe_speech(filepath):
    output = pipe(
        filepath,
        max_new_tokens=256,
        generate_kwargs={
            "task": "transcribe",
            "language": "sinhalese",
        },  # update with the language you've fine-tuned on
        chunk_length_s=30,
        batch_size=8,
    )
    return output["text"]
```

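Before wiring the function into a demo, we can sanity-check it on a local audio file (the filename here is just a
placeholder for any audio file on your machine):

```python
# transcribe a local audio file of your choosing
print(transcribe_speech("example.wav"))
```
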
We'll use the Gradio [blocks](https://gradio.app/docs/#blocks) feature to launch two tabs on our demo: one for microphone
transcription, and the other for file upload:

```python
import gradio as gr

demo = gr.Blocks()

mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(),
)

file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["upload"], type="filepath"),
    outputs=gr.Textbox(),
)
```

Finally, we launch the Gradio demo using the two blocks that we've just defined:

```python
with demo:
    gr.TabbedInterface(
        [mic_transcribe, file_transcribe],
        ["Transcribe Microphone", "Transcribe Audio File"],
    )

demo.launch(debug=True)
```

This will launch a Gradio demo similar to the one running on the Hugging Face Space:

<iframe src="https://course-demos-whisper-small.hf.space" frameBorder="0" height="450" title="Gradio app" class="container p-0 flex-grow space-iframe" allow="accelerometer; ambient-light-sensor; autoplay; battery; camera; document-domain; encrypted-media; fullscreen; geolocation; gyroscope; layout-animations; legacy-image-formats; magnetometer; microphone; midi; oversized-images; payment; picture-in-picture; publickey-credentials-get; sync-xhr; usb; vr ; wake-lock; xr-spatial-tracking" sandbox="allow-forms allow-modals allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-downloads"></iframe>

Should you wish to host your demo on the Hugging Face Hub, you can use this Space as a template for your fine-tuned model.

Click the link to duplicate the template demo to your account: https://huggingface.co/spaces/course-demos/whisper-small?duplicate=true

We recommend giving your Space a similar name to your fine-tuned model (e.g. whisper-small-dv-demo) and setting the visibility to "Public".

Once you've duplicated the Space to your account, click "Files and versions" -> "app.py" -> "edit". Then change the
model identifier to your fine-tuned model (line 6). Scroll to the bottom of the page and click "Commit changes to main".
The demo will reboot, this time using your fine-tuned model. You can share this demo with your friends and family so that
they can use the model that you've trained!

Check out our video tutorial to get a better understanding of how to duplicate the Space 👉️ [YouTube Video](https://www.youtube.com/watch?v=VQYuvl6-9VE)

We look forward to seeing your demos on the Hub!