DialogCorpus

A large-scale dialog corpus for training the Next-Gen Dialog System.

How to Use?

First, clone the repository.

# download
git clone https://github.com/qywu/DialogCorpus.git
cd DialogCorpus

You can download and process each dataset manually. For example, for daily_dialog:

# download data for daily_dialog
python daily_dialog/download_data.py
# process the data
python daily_dialog/process_data.py
# the processed data is stored as {folder_name}/data/{folder_name}.json
vi daily_dialog/data/daily_dialog.json
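
If you would rather inspect the processed file from Python than open it in an editor, something like the following may help. It is only a sketch: the JSON schema is not documented here, so it makes no assumption about the contents and reports just the top-level type and size. The helper name `summarize_json` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level structure.

    Hypothetical helper: it does not assume any particular schema.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    summary = {"type": type(data).__name__, "size": None}
    if isinstance(data, (dict, list)):
        summary["size"] = len(data)
    return summary


if __name__ == "__main__":
    # Path taken from the commands above.
    path = Path("daily_dialog/data/daily_dialog.json")
    if path.exists():
        print(summarize_json(path))
    else:
        print(f"{path} not found; run the download and process steps first")
```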

Or you can download, process, and join all the datasets with a single command.

python prepare_all_data.py \
       --download \
       --process \
       --join
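
To run the manual steps for several datasets without `prepare_all_data.py`, a small loop like the one below could work. It is a sketch under assumptions: it assumes every dataset folder ships `download_data.py` and `process_data.py` the way `daily_dialog` does, the folder list and the helpers `build_commands` and `prepare` are hypothetical, and it does not reproduce the `--join` step, whose behavior is not described here.

```python
import subprocess
import sys

# Hypothetical list: only daily_dialog is confirmed by the commands above.
DATASET_FOLDERS = ["daily_dialog"]


def build_commands(folder):
    """Return the download and process commands for one dataset folder."""
    return [
        [sys.executable, f"{folder}/download_data.py"],
        [sys.executable, f"{folder}/process_data.py"],
    ]


def prepare(folders, dry_run=True):
    """Print each command and, unless dry_run is True, execute it."""
    for folder in folders:
        for cmd in build_commands(folder):
            print(" ".join(cmd))
            if not dry_run:
                # check=True stops at the first failing step.
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False from the repository root to execute.
    prepare(DATASET_FOLDERS, dry_run=True)
```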

Detailed Dialog Processing for each dataset:

  • Daily Dialog

    • Removed tokenization spaces before punctuation
  • Persona Chat

    • Used Hugging Face's version [link]
    • Restored the casing of lowercased utterances
    • Removed tokenization spaces before punctuation
  • Cornell Movie Corpus

    • Ignored UTF-8 decoding errors
    • Extracted Names
  • Task Master

    • Nothing specific
  • CCPE

    • Nothing specific
  • Frames

    • Nothing specific
  • Chit-Chat Challenge

    • Nothing specific
  • Self-dialogue

    • Nothing specific
  • Schema Dialog

    • Nothing specific
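
Two of the processing steps listed above, removing tokenization spaces and ignoring UTF-8 errors, can be illustrated with a short sketch. This is not the code from the repository's `process_data.py` scripts; the function names and the exact punctuation and contraction rules are assumptions chosen for illustration.

```python
import re


def remove_tokenization_spaces(text):
    """Drop the space a tokenizer inserted before punctuation and contractions.

    Assumed rule set; the repository's actual rules may differ.
    """
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text


def decode_ignoring_errors(raw):
    """Decode bytes as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(remove_tokenization_spaces("Hello , how are you ? I 'm fine ."))
    # prints: Hello, how are you? I'm fine.
    print(decode_ignoring_errors(b"abc\xffdef"))
    # prints: abcdef
```

Dropping invalid bytes keeps processing from failing on a malformed line, at the cost of silently losing the affected characters.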

Links