End-to-End Speech Command Recognition with Capsule Network

INTERSPEECH 2018 paper: link

We apply a capsule network to capture the spatial relationships and pose information of speech spectrogram features along both the frequency and time axes. On the one-second speech commands dataset, our proposed end-to-end SR system with capsule networks achieves better results than baseline CNN models on both clean and noise-added test sets.

  • 20 JAN 2019: Other baseline keyword spotting (KWS) models are also provided in the CNN code.

Getting Started

The code is written for Python 2 (2.7.12).

Prerequisites

You should be able to import the following libraries:

tqdm, numpy(1.14.1), termcolor, scipy, sklearn, scikits
tensorflow(1.6.0), keras(2.1.4)

pip install tqdm
pip install numpy
pip install termcolor
pip install scipy
pip install sklearn
pip install scikit-learn
pip install tensorflow-gpu==1.6.0
pip install keras==2.1.4
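
After installing, a quick check that everything imports can save a failed run later. The snippet below is a minimal sketch, not part of the repository: `check_imports` and `REQUIRED` are hypothetical names, and the module names assume the usual import name for each package (for example, scikit-learn is imported as `sklearn`).

```python
import importlib

# Import names (not pip package names) for the libraries listed above.
# Assumption: each package is imported under its usual module name.
REQUIRED = ["tqdm", "numpy", "termcolor", "scipy", "sklearn", "tensorflow", "keras"]


def check_imports(names):
    """Return {module_name: version string, or None if the import failed}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a marker.
            found[name] = getattr(module, "__version__", "unknown")
    return found
```

Calling `check_imports(REQUIRED)` returns a dictionary in which any entry set to `None` marks a library that still needs to be installed. The reported versions can then be compared with the ones pinned above (tensorflow 1.6.0, keras 2.1.4, numpy 1.14.1).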

Speech Feature Generation

Dataset

We use the 'Google Speech Command Dataset'. See the blog and the Download Link.

  • Download the dataset from the link above and unzip it. (In our case, we unzip it into a folder named 'Google_Speech_Command'.)

Adding noise

To add noise to the original dataset, we use MATLAB with voicebox, a MATLAB library. We ran the MATLAB code on a local Windows machine and uploaded the results to a Linux server.

  1. Unzip the downloaded Google speech command dataset.

  2. Create a new folder named 'Google_Speech_Command' and move the command folders into it. The folder structure will then look like this:

speech_commands_v0.01.tar
|-- [_background_noise_]
|-- Google_Speech_Command
|   |-- bed
|   |-- bird
 :      :
|   '-- zero
|-- testing_list
|-- validation_list
'-- etc.
  3. Change 'data_path' in the MATLAB code and run it. It will create a new folder and save the noise-added audio files there.
noise_wave_generate.m
  4. You can also change 'SNR' in the code to generate noisy audio files at the level you want.
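
For readers without MATLAB, the idea behind mixing noise at a target SNR can be sketched in Python with numpy. This is an illustration only, not a port of noise_wave_generate.m: the function name `add_noise_at_snr` is hypothetical, and it scales the noise by a plain average-power ratio, whereas the voicebox routine used by the MATLAB script may measure signal level differently.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so that the mixture has the requested SNR (dB).

    Assumption: SNR is defined on average power over the whole clip.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or crop the noise so that it covers the whole clean signal.
    if len(noise) < len(clean):
        repeats = int(np.ceil(len(clean) / float(len(noise))))
        noise = np.tile(noise, repeats)
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if clean_power == 0 or noise_power == 0:
        return clean.copy()  # nothing to scale against

    # SNR_dB = 10 * log10(P_clean / (gain^2 * P_noise)); solve for gain.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

With `snr_db=5` this corresponds to the SNR5 condition that appears in the feature folder names.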

Feature Generation

Extract speech features from the raw audio files and save them as .npy files. Adjust the '--noise_name' argument as needed.

cd core
python feature_generation.py
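
The folder names below suggest that the saved features are filterbank ("fbank") features. As a rough illustration of what such a feature is, here is a numpy-only sketch of log mel filterbank extraction. It is not the code in feature_generation.py: the function names and the parameter values (25 ms window, 10 ms hop, 40 filters, 512-point FFT) are assumptions chosen for the example.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge
            bank[i, k] = (k - left) / float(center - left)
        for k in range(center, right):     # falling edge
            bank[i, k] = (right - k) / float(right - center)
    return bank


def log_fbank(signal, sample_rate=16000, win_ms=25, hop_ms=10,
              n_filters=40, n_fft=512):
    """Log mel filterbank energies, shape (n_frames, n_filters)."""
    signal = np.asarray(signal, dtype=np.float64)
    win = int(sample_rate * win_ms / 1000.0)
    hop = int(sample_rate * hop_ms / 1000.0)
    if len(signal) < win:                  # pad very short clips to one frame
        signal = np.pad(signal, (0, win - len(signal)), mode="constant")
    n_frames = 1 + (len(signal) - win) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    energies = np.dot(power, mel_filterbank(n_filters, n_fft, sample_rate).T)
    return np.log(np.maximum(energies, 1e-10))  # floor avoids log(0)
```

With these example settings, a one-second clip sampled at 16 kHz yields a 98 x 40 feature matrix, which could be stored with `np.save`.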

Data folder structure

feature_saved
|-- TEST
|   |-- fbank
|   |   |-- clean
|   |   '-- [noise names]_SNR5
|   '-- label
|-- TRAIN
|   |-- fbank
|   |   |-- clean
|   |   '-- [noise names]_SNR5
|   '-- label
'-- VALID
    |-- fbank
    |   |-- clean
    |   '-- [noise names]_SNR5
    '-- label
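
The layout above can be turned into paths with a small helper. This is a sketch under assumptions: `feature_dir`, `label_dir`, and `list_npy` are hypothetical names that do not come from the repository, and the noise name passed in is whatever was used for '--noise_name'.

```python
import os

SPLITS = ("TRAIN", "VALID", "TEST")


def feature_dir(root, split, noise_name=None, snr=5):
    """Return the fbank directory for one split, following the tree above.

    noise_name=None selects the 'clean' folder; otherwise the folder is
    '<noise_name>_SNR<snr>'.
    """
    if split not in SPLITS:
        raise ValueError("split must be one of {}, got {!r}".format(SPLITS, split))
    if noise_name is None:
        condition = "clean"
    else:
        condition = "{}_SNR{}".format(noise_name, snr)
    return os.path.join(root, split, "fbank", condition)


def label_dir(root, split):
    """Return the label directory for one split."""
    if split not in SPLITS:
        raise ValueError("split must be one of {}, got {!r}".format(SPLITS, split))
    return os.path.join(root, split, "label")


def list_npy(directory):
    """Sorted .npy file names in `directory`; empty list if it is missing."""
    if not os.path.isdir(directory):
        return []
    return sorted(f for f in os.listdir(directory) if f.endswith(".npy"))
```

For example, `list_npy(feature_dir("feature_saved", "TRAIN"))` lists the clean training features, and each file can then be read with `numpy.load`.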

Training & Testing

For training and testing, go into the 'CNN' or 'CapsNet' folder and run the code. You can switch modes with the '--is_training' argument.

Training

cd CapsNet
python main.py -m=CapsNet --is_training='TRAIN' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32  --primary_veclen=4 --digit_veclen=4

Testing

Note that you should set the '--keep' argument to the epoch number that you want to test.

cd CapsNet
python main.py -m=CapsNet --is_training='TEST' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32  --primary_veclen=4 --digit_veclen=4 --SNR=5 --keep=?
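
Since the training and testing commands differ only in a few flags, a small wrapper can keep them consistent. The sketch below only reuses the flags shown in the two commands above; `capsnet_command` is a hypothetical helper, not part of the repository, and the meaning of each flag is defined by main.py.

```python
def capsnet_command(mode, ex_name, keep=None, snr=None, d=0):
    """Build the argument list for CapsNet/main.py from the flags shown above.

    mode    -- 'TRAIN' or 'TEST' (value of --is_training)
    ex_name -- experiment name (value of -ex)
    keep    -- epoch to evaluate; required in TEST mode
    snr     -- optional value for --SNR
    d       -- passed through unchanged as -d
    """
    if mode not in ("TRAIN", "TEST"):
        raise ValueError("mode must be 'TRAIN' or 'TEST', got {!r}".format(mode))
    if mode == "TEST" and keep is None:
        raise ValueError("TEST mode needs keep=<epoch to evaluate>")
    args = [
        "python", "main.py", "-m=CapsNet",
        "--is_training={}".format(mode),
        "-ex={}".format(ex_name),
        "-d={}".format(d),
        "--kernel=19", "--primary_channel=32",
        "--primary_veclen=4", "--digit_veclen=4",
    ]
    if snr is not None:
        args.append("--SNR={}".format(snr))
    if keep is not None:
        args.append("--keep={}".format(keep))
    return args
```

The returned list can be passed to `subprocess.call(args, cwd="CapsNet")`. Using the same `ex_name` for both modes matches the commands above, where the test run refers to the experiment trained earlier.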

KWS Models Based on Various Neural Networks

KWS models based on various kinds of neural networks (NNs) are also provided in CNN/model.py.

1. Deep Neural Network (DNN) based KWS model from

  • G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks.” in ICASSP, vol. 14. Citeseer, 2014, pp. 4087–4091.
Include 'ref_2014icassp_dnn' in ex_name to use the DNN model. For example:
```
python main.py --model='CNN' --ex_name='ref_2014icassp_dnn512' --is_training='TRAIN' --model_size_info 512 512 512
```

2. CNN-based KWS model from

  • T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
Include 'ref_2015is_cnn' in ex_name to use the CNN model. For example:
```
python main.py --model='CNN' --ex_name='ref_2015is_cnn' --is_training='TRAIN' --model_size_info 21 8 94 1 1 2 3 6 4 94 1 1 1 1 32
```

3. Long Short-Term Memory (LSTM) based KWS model from

  • M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni, “Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 474–480.
Include 'ref_rnn' in ex_name to use the LSTM model. For example:
```
python main.py --model='CNN' --ex_name=ref_rnn_lstm --is_training='TRAIN' --model_size_info 64 32 0
```

4. Convolutional Recurrent Neural Network (CRNN) based KWS model from

  • S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv preprint arXiv:1703.05390, 2017.
Include 'ref_crnn' in ex_name to use the CRNN model. For example:
```
python main.py --model='CNN' --ex_name=ref_crnn --is_training='TRAIN' --model_size_info 32 20 5 8 2 2 32 1 64
```
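
The four tags above select a model by substring match on ex_name. A minimal sketch of how such a lookup could work is shown below; `EX_NAME_TAGS` and `model_family` are hypothetical names, and the actual selection logic in CNN/model.py may differ.

```python
# Tags listed in this README, paired with the model family each one selects.
EX_NAME_TAGS = [
    ("ref_2014icassp_dnn", "DNN"),
    ("ref_2015is_cnn", "CNN"),
    ("ref_crnn", "CRNN"),
    ("ref_rnn", "LSTM"),
]


def model_family(ex_name):
    """Return the model family whose tag occurs in ex_name, or None."""
    for tag, family in EX_NAME_TAGS:
        if tag in ex_name:
            return family
    return None
```

Checking an experiment name this way before launching a long run helps catch a mistyped tag early. Anything after the tag, such as the '512' in 'ref_2014icassp_dnn512', does not affect the match.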

Reference

The preprocessing source code is from https://github.com/zzw922cn/Automatic_Speech_Recognition.

The base capsule network Keras source code is from https://github.com/XifengGuo/CapsNet-Keras.

Authors

Jaesung Bae - Korea Advanced Institute of Science and Technology (KAIST)

contact: bjs2279@gmail.com