Changed Korean chapter4 toctree #169

Open · wants to merge 1 commit into base: main
13 changes: 13 additions & 0 deletions chapters/ko/_toctree.yml
@@ -51,6 +51,19 @@
    quiz: 3
  - local: chapter3/supplemental_reading
    title: 보충자료 및 리소스

- title: 4단원. 음악장르 분류기
  sections:
  - local: chapter4/introduction
    title: 이 코스에서 배울것과 만들어볼것
  - local: chapter4/classification_models
    title: 오디오 분류를 위한 사전학습 모델
  - local: chapter4/fine-tuning
    title: 음악 분류를 위한 파인튜닝 모델
  - local: chapter4/demo
    title: Gradio를 활용한 데모 만들기
  - local: chapter4/hands_on
    title: 실습

- title: 코스 이벤트
  sections:
320 changes: 320 additions & 0 deletions chapters/ko/chapter4/classification_models.mdx
@@ -0,0 +1,320 @@
# Pre-trained models and datasets for audio classification

The Hugging Face Hub is home to over 500 pre-trained models for audio classification. In this section, we'll go through
some of the most common audio classification tasks and suggest appropriate pre-trained models for each. Using the `pipeline()`
class, switching between models and tasks is straightforward - once you know how to use `pipeline()` for one model, you'll
be able to use it for any model on the Hub with no code changes! This makes experimenting with the `pipeline()` class extremely
fast, allowing you to quickly select the best pre-trained model for your needs.

Before we jump into the various audio classification problems, let's quickly recap the transformer architectures typically
used. The standard audio classification architecture is motivated by the nature of the task; we want to transform a sequence
of audio inputs (i.e. our input audio array) into a single class label prediction. Encoder-only models first map the input
audio sequence into a sequence of hidden-state representations by passing the inputs through a transformer block. The
sequence of hidden-state representations is then mapped to a class label output by taking the mean over the hidden-states,
and passing the resulting vector through a linear classification layer. Hence, there is a preference for _encoder-only_
models for audio classification.
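To make that pooling step concrete, here is a minimal sketch in plain PyTorch with toy dimensions (the shapes are illustrative and not taken from any particular checkpoint):

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen only for illustration
batch_size, sequence_length, hidden_size, num_labels = 2, 499, 768, 14

# Stand-in for the hidden-state sequence produced by a transformer encoder
hidden_states = torch.randn(batch_size, sequence_length, hidden_size)

# Mean-pool over the time dimension to get one vector per audio clip
pooled = hidden_states.mean(dim=1)  # (batch_size, hidden_size)

# A single linear layer maps the pooled vector to class logits
classification_head = nn.Linear(hidden_size, num_labels)
logits = classification_head(pooled)  # (batch_size, num_labels)

# Softmax gives class probabilities; argmax gives the predicted class label
probabilities = logits.softmax(dim=-1)
predicted_class = probabilities.argmax(dim=-1)
```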

Decoder-only models introduce unnecessary complexity to the task, since they assume that the outputs can also be a _sequence_
of predictions (rather than a single class label prediction), and so generate multiple outputs. Therefore, they have slower
inference speed and tend not to be used. Encoder-decoder models are largely omitted for the same reason. These architecture
choices are analogous to those in NLP, where encoder-only models such as [BERT](https://huggingface.co/blog/bert-101)
are favoured for sequence classification tasks, and decoder-only models such as GPT reserved for sequence generation tasks.

Now that we've recapped the standard transformer architecture for audio classification, let's jump into the different
subsets of audio classification and cover the most popular models!

## 🤗 Transformers Installation

At the time of writing, the latest updates required for the audio classification pipeline are only on the `main` version of
the 🤗 Transformers repository, rather than the latest PyPI version. To make sure we have these updates locally, we'll
install Transformers from the `main` branch with the following command:

```
pip install git+https://github.com/huggingface/transformers
```
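
To sanity-check that the development install was picked up, you can print the installed version (the exact version string will vary; a `.dev0` suffix or anything newer than the latest PyPI release indicates you're on `main`):

```
python -c "import transformers; print(transformers.__version__)"
```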

## Keyword Spotting

Keyword spotting (KWS) is the task of identifying a keyword in a spoken utterance. The set of possible keywords forms the
set of predicted class labels. Hence, to use a pre-trained keyword spotting model, you should ensure that your keywords
match those that the model was pre-trained on. Below, we'll introduce two datasets and models for keyword spotting.

### Minds-14

Let's go ahead and use the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset that you have explored
in the previous unit. If you recall, MINDS-14 contains recordings of people asking an e-banking system questions in several
languages and dialects, and has the `intent_class` for each recording. We can classify the recordings by intent of the call.

```python
from datasets import load_dataset

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
```

We'll load the checkpoint [`"anton-l/xtreme_s_xlsr_300m_minds14"`](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14),
which is an XLS-R model fine-tuned on MINDS-14 for approximately 50 epochs. It achieves 90% accuracy over all languages
from MINDS-14 on the evaluation set.

```python
from transformers import pipeline

classifier = pipeline(
"audio-classification",
model="anton-l/xtreme_s_xlsr_300m_minds14",
)
```

Finally, we can pass a sample to the classification pipeline to make a prediction:
```python
classifier(minds[0]["audio"])
```
**Output:**
```
[
{"score": 0.9631525278091431, "label": "pay_bill"},
{"score": 0.02819698303937912, "label": "freeze"},
{"score": 0.0032787492964416742, "label": "card_issues"},
{"score": 0.0019414445850998163, "label": "abroad"},
{"score": 0.0008378693601116538, "label": "high_value_payment"},
]
```

Great! We've identified that the intent of the call was paying a bill, with probability 96%. You can imagine this kind of
keyword spotting system being used as the first stage of an automated call centre, where we want to categorise incoming
customer calls based on their query and offer them contextualised support accordingly.
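
As a purely illustrative sketch of that routing idea (the confidence threshold and the fallback behaviour below are invented, not part of the course):

```python
predictions = classifier(minds[0]["audio"])
top_prediction = predictions[0]  # the pipeline returns labels sorted by score, highest first

# Hypothetical routing rule: only trust the classifier above a confidence threshold
if top_prediction["score"] > 0.8:
    print(f"Routing call to the '{top_prediction['label']}' queue")
else:
    print("Low confidence - routing call to a human agent")
```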

### Speech Commands

Speech Commands is a dataset of spoken words designed to evaluate audio classification models on simple command words.
The dataset consists of 15 classes of keywords, a class for silence, and an unknown class to cover false positives.
The 15 keywords are single words that would typically be used in on-device settings to control basic tasks or launch
other processes.

A similar model is running continuously on your mobile phone. Here, instead of having single command words, we have
'wake words' specific to your device, such as "Hey Google" or "Hey Siri". When the audio classification model detects
these wake words, it triggers your phone to start listening to the microphone and transcribe your speech using a speech
recognition model.

The audio classification model is much smaller and lighter than the speech recognition model, often only several million
parameters compared to several hundred million for speech recognition. Thus, it can be run continuously on your device
without draining your battery! Only when the wake word is detected is the larger speech recognition model launched, and
afterwards it is shut down again. We'll cover transformer models for speech recognition in the next Unit, so by the end
of the course you should have the tools you need to build your own voice activated assistant!
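
A rough sketch of how such an always-on loop might be structured is shown below; `record_microphone_chunk` and `run_speech_recognition` are hypothetical placeholders rather than real library functions, and the wake word and threshold are arbitrary choices:

```python
WAKE_WORD = "marvin"  # assumption: a class the keyword-spotting model was trained on


def wait_for_wake_word(classifier, record_microphone_chunk, run_speech_recognition):
    """Continuously screen short audio chunks with the small classifier,
    and only launch the large speech recognition model on a confident match."""
    while True:
        chunk = record_microphone_chunk()  # placeholder: returns a short audio array
        prediction = classifier(chunk)[0]  # top prediction from the small model
        if prediction["label"] == WAKE_WORD and prediction["score"] > 0.9:
            print("Wake word detected - handing over to speech recognition")
            return run_speech_recognition()  # placeholder: the heavyweight ASR model
```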

As with any dataset on the Hugging Face Hub, we can get a feel for the kind of audio data it contains without downloading
it or committing it to memory. After heading to the [Speech Commands' dataset card](https://huggingface.co/datasets/speech_commands)
on the Hub, we can use the Dataset Viewer to scroll through the first 100 samples of the dataset, listening to the audio
files and checking any other metadata information:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/speech_commands.png" alt="Diagram of datasets viewer.">
</div>

The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any
dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether
it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can start
using it.

Let's do exactly that and load a sample of the Speech Commands dataset using streaming mode:

```python
speech_commands = load_dataset(
"speech_commands", "v0.02", split="validation", streaming=True
)
sample = next(iter(speech_commands))
```

We'll load an official [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)
checkpoint fine-tuned on the Speech Commands dataset, under the namespace [`"MIT/ast-finetuned-speech-commands-v2"`](https://huggingface.co/MIT/ast-finetuned-speech-commands-v2):

```python
classifier = pipeline(
"audio-classification", model="MIT/ast-finetuned-speech-commands-v2"
)
classifier(sample["audio"].copy())
```
**Output:**
```
[{'score': 0.9999892711639404, 'label': 'backward'},
{'score': 1.7504888774055871e-06, 'label': 'happy'},
{'score': 6.703040185129794e-07, 'label': 'follow'},
{'score': 5.805884484288981e-07, 'label': 'stop'},
{'score': 5.614546694232558e-07, 'label': 'up'}]
```

Cool! Looks like the example contains the word "backward" with high probability. We can take a listen to the sample
and verify this is correct:
```python
from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])
```
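
Alternatively, assuming the streamed dataset exposes its features (which it should for Speech Commands), you can map the sample's integer `label` back to its string name and compare it against the prediction:

```python
# Map the integer class id in the sample back to its human-readable name
label_name = speech_commands.features["label"].int2str(sample["label"])
print(label_name)
```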

Now, you might be wondering how we've selected these pre-trained models to show you in these audio classification examples.
The truth is, finding pre-trained models for your dataset and task is very straightforward! The first thing we need to do
is head to the Hugging Face Hub and click on the "Models" tab: https://huggingface.co/models

This is going to bring up all the models on the Hugging Face Hub, sorted by downloads in the past 30 days:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/all_models.png">
</div>

You'll notice on the left-hand side that we have a selection of tabs that we can use to filter models by task, library,
dataset, etc. Scroll down and select the task "Audio Classification" from the list of audio tasks:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/by_audio_classification.png">
</div>

We're now presented with the subset of 500+ audio classification models on the Hub. To further refine this selection, we
can filter models by dataset. Click on the tab "Datasets", and in the search box type "speech_commands". As you begin typing,
you'll see the selection for `speech_commands` appear underneath the search tab. You can click this button to filter all
audio classification models to those fine-tuned on the Speech Commands dataset:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/by_speech_commands.png">
</div>

Great! We see that we have 6 pre-trained models available to us for this specific dataset and task. You'll recognise the
first of these models as the Audio Spectrogram Transformer checkpoint that we used in the previous example. This process
of filtering models on the Hub is exactly how we went about selecting the checkpoint to show you!
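
The same filtering can also be done programmatically. Here is a sketch, assuming the `huggingface_hub` client library is installed and that models fine-tuned on Speech Commands carry the `dataset:speech_commands` tag:

```python
from huggingface_hub import HfApi

api = HfApi()

# Filter by the task tag and the dataset tag, mirroring the Hub UI filters
models = api.list_models(filter=["audio-classification", "dataset:speech_commands"])
for model in models:
    print(model.id)
```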

## Language Identification

Language identification (LID) is the task of identifying the language spoken in an audio sample from a list of candidate
languages. LID can form an important part of many speech pipelines. For example, given an audio sample in an unknown language,
an LID model can be used to categorise the language(s) spoken in the audio sample, and then select an appropriate speech
recognition model trained on that language to transcribe the audio.

### FLEURS

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition
systems in 102 languages, including many that are classified as 'low-resource'. Take a look at the FLEURS dataset
card on the Hub and explore the different languages that are present: [google/fleurs](https://huggingface.co/datasets/google/fleurs).
Can you find your native tongue here? If not, what's the most closely related language?

Let's load up a sample from the validation split of the FLEURS dataset using streaming mode:

```python
fleurs = load_dataset("google/fleurs", "all", split="validation", streaming=True)
sample = next(iter(fleurs))
```

Great! Now we can load our audio classification model. For this, we'll use a version of [Whisper](https://arxiv.org/pdf/2212.04356.pdf)
fine-tuned on the FLEURS dataset, which is currently the most performant LID model on the Hub:

```python
classifier = pipeline(
"audio-classification", model="sanchit-gandhi/whisper-medium-fleurs-lang-id"
)
```

We can then pass the audio through our classifier and generate a prediction:
```python
classifier(sample["audio"])
```
**Output:**
```
[{'score': 0.9999330043792725, 'label': 'Afrikaans'},
{'score': 7.093023668858223e-06, 'label': 'Northern-Sotho'},
{'score': 4.269149485480739e-06, 'label': 'Icelandic'},
{'score': 3.2661141631251667e-06, 'label': 'Danish'},
{'score': 3.2580724109720904e-06, 'label': 'Cantonese Chinese'}]
```

We can see that the model predicted the audio was in Afrikaans with extremely high probability (near 1). The FLEURS dataset
contains audio data from a wide range of languages - we can see that possible class labels include Northern-Sotho, Icelandic,
Danish and Cantonese Chinese amongst others. You can find the full list of languages on the dataset card here: [google/fleurs](https://huggingface.co/datasets/google/fleurs).
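
To close the loop on the motivation above (picking a speech recognition model based on the detected language), here is a hedged sketch; the Whisper checkpoint and the naive label-to-language mapping are illustrative assumptions, not recommendations from the course:

```python
from transformers import pipeline

# Re-use the LID `classifier` defined above, and load a multilingual ASR model
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")


def transcribe_unknown_language(audio):
    # Step 1: identify the spoken language (copy the input so the pipeline doesn't mutate it)
    predicted_language = classifier(audio.copy())[0]["label"]
    # Step 2: hint the detected language to Whisper for transcription
    # (the simple .lower() mapping only covers labels Whisper knows by that name)
    return asr(audio, generate_kwargs={"language": predicted_language.lower()})
```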

Over to you! What other checkpoints can you find for FLEURS LID on the Hub? What transformer models are they using under-the-hood?

## Zero-Shot Audio Classification

In the traditional paradigm for audio classification, the model predicts a class label from a _pre-defined_ set of
possible classes. This poses a barrier to using pre-trained models for audio classification, since the label set of the
pre-trained model must match that of the downstream task. For the previous example of LID, the model must predict one of
the 102 language classes on which it was trained. If the downstream task actually requires 110 languages, the model would
not be able to predict 8 of the 110 languages, and so would require re-training to achieve full coverage. This limits the
effectiveness of transfer learning for audio classification tasks.

Zero-shot audio classification is a method for taking a pre-trained audio classification model trained on a set of labelled
examples and enabling it to classify new examples from previously unseen classes. Let's take a look at how we
can achieve this!

Currently, 🤗 Transformers supports one kind of model for zero-shot audio classification: the [CLAP model](https://huggingface.co/docs/transformers/model_doc/clap).
CLAP is a transformer-based model that takes both audio and text as inputs, and computes the _similarity_ between the two.
If we pass a text input that strongly correlates with an audio input, we'll get a high similarity score. Conversely, passing
a text input that is completely unrelated to the audio input will return a low similarity.

We can use this similarity prediction for zero-shot audio classification by passing one audio input to the model and
multiple candidate labels. The model will return a similarity score for each of the candidate labels, and we can pick the
one that has the highest score as our prediction.

Let's take an example where we use one audio input from the [Environmental Sound Classification (ESC-50)](https://huggingface.co/datasets/ashraq/esc50)
dataset:

```python
dataset = load_dataset("ashraq/esc50", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]["array"]
```

We then define our candidate labels, which form the set of possible classification labels. The model will return a
classification probability for each of the labels we define. This means we need to know _a-priori_ the set of possible
labels in our classification problem, such that the correct label is contained within the set and is thus assigned a
valid probability score. Note that we can either pass the full set of labels to the model, or a hand-selected subset
that we believe contains the correct label. Passing the full set of labels is going to be more exhaustive, but comes
at the expense of lower classification accuracy since the classification space is larger (provided the correct label is in
our chosen subset of labels):

```python
candidate_labels = ["Sound of a dog", "Sound of vacuum cleaner"]
```

We can run both through the model to find the candidate label that is _most similar_ to the audio input:

```python
classifier = pipeline(
task="zero-shot-audio-classification", model="laion/clap-htsat-unfused"
)
classifier(audio_sample, candidate_labels=candidate_labels)
```
**Output:**
```
[{'score': 0.9997242093086243, 'label': 'Sound of a dog'}, {'score': 0.0002758323971647769, 'label': 'Sound of vacuum cleaner'}]
```
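
Under the hood, the zero-shot pipeline compares embeddings of the audio and of each candidate label in a shared space. Below is a minimal sketch of doing that comparison manually with the `ClapModel` and `ClapProcessor` classes; it assumes `audio_sample` is already at the 48 kHz rate this checkpoint's feature extractor expects (the pipeline handles resampling for you):

```python
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Embed the candidate labels and the audio into the same representation space
text_inputs = processor(text=candidate_labels, return_tensors="pt", padding=True)
audio_inputs = processor(audios=audio_sample, sampling_rate=48000, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    audio_embeds = model.get_audio_features(**audio_inputs)

# Cosine similarity between the audio embedding and each candidate label embedding
similarity = torch.nn.functional.cosine_similarity(audio_embeds, text_embeds)
print(dict(zip(candidate_labels, similarity.tolist())))
```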

Alright! The model seems pretty confident we have the sound of a dog - it predicts it with 99.97% probability, so we'll
take that as our prediction. Let's confirm whether we were right by listening to the audio sample (don't turn up your
volume too high or else you might get a jump!):

```python
Audio(audio_sample, rate=16000)
```

Perfect! We have the sound of a dog barking 🐕, which aligns with the model's prediction. Have a play with different audio
samples and different candidate labels - can you define a set of labels that give good generalisation across the ESC
dataset? Hint: think about where you could find information on the possible sounds in ESC and construct your labels accordingly!
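
As one way to act on that hint (assuming, as appears to be the case for `ashraq/esc50`, that each example carries a human-readable `category` column), you could harvest the category names from a slice of the stream and turn them into candidate labels:

```python
# Collect the distinct sound categories from the first few hundred streamed examples
categories = set()
for example in dataset.take(200):
    categories.add(example["category"])

# Turn each category into a natural-language candidate label for the zero-shot classifier
esc_candidate_labels = [f"Sound of a {category.replace('_', ' ')}" for category in sorted(categories)]
print(esc_candidate_labels[:5])

# These labels can then be passed straight to the zero-shot classifier:
# classifier(audio_sample, candidate_labels=esc_candidate_labels)
```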

You might be wondering why we don't use the zero-shot audio classification pipeline for **all** audio classification tasks.
It seems as though we can make predictions for any audio classification problem by defining appropriate class labels _a-priori_,
thus bypassing the constraint that our classification task needs to match the labels that the model was pre-trained on.
This comes down to the nature of the CLAP model used in the zero-shot pipeline: CLAP is pre-trained on _generic_ audio
classification data, similar to the environmental sounds in the ESC dataset, rather than specifically speech data, like
we had in the LID task. If you gave it speech in English and speech in Spanish, CLAP would know that both examples were
speech data 🗣️, but it wouldn't be able to differentiate between the languages in the same way a dedicated LID model
can.

## What next?

We've covered a number of different audio classification tasks and presented the most relevant datasets and models that
you can download from the Hugging Face Hub and use in just a few lines of code with the `pipeline()` class. These tasks
included keyword spotting, language identification and zero-shot audio classification.

But what if we want to do something **new**? We've worked extensively on speech processing tasks, but this is only one
aspect of audio classification. Another popular field of audio processing involves **music**. While music has inherently
different features to speech, many of the same principles that we've learnt about already can be applied to music.

In the following section, we'll go through a step-by-step guide on how you can fine-tune a transformer model with 🤗
Transformers on the task of music classification. By the end of it, you'll have a fine-tuned checkpoint that you can plug
into the `pipeline()` class, enabling you to classify songs in exactly the same way that we've classified speech here!