DialogCorpus

A large-scale dialog corpus for training next-generation dialog systems.

How to Use?

First, clone the repository.

# download
git clone https://github.com/qywu/DialogCorpus.git
cd DialogCorpus

You can download and process each dataset manually. For example, for daily_dialog:

# download data for daily_dialog
python daily_dialog/download_data.py
# process the data
python daily_dialog/process_data.py
# the processed data is stored as {folder_name}.json in the dataset's data folder
vi daily_dialog/data/daily_dialog.json
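
The processed file can also be inspected from Python instead of an editor. The sketch below assumes only that the file is valid JSON. The record layout is not documented here, so it reports the top-level type and item count without assuming any field names. `summarize_json` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of top-level items) for a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Scalars have no length, so count them as a single item.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    target = Path("daily_dialog/data/daily_dialog.json")
    if target.exists():
        print(summarize_json(target))
    else:
        print(f"{target} not found; run the download and process steps first")
```

Run it from the repository root after the process step has finished.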

Alternatively, download, process, and join all datasets with a single command.

python prepare_all_data.py \
       --download \
       --process \
       --join
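
To prepare only a subset of datasets, the per-dataset scripts can be driven from Python. This is a minimal sketch, assuming every dataset folder follows the daily_dialog layout, with `download_data.py` and `process_data.py` inside the folder. That layout is only shown for daily_dialog above, so treat it as an assumption for other folders. `build_commands` and `run_datasets` are hypothetical helpers, and the join step is not covered.

```python
import subprocess
import sys


def build_commands(folders):
    """Build the download and process commands for each dataset folder."""
    commands = []
    for folder in folders:
        commands.append([sys.executable, f"{folder}/download_data.py"])
        commands.append([sys.executable, f"{folder}/process_data.py"])
    return commands


def run_datasets(folders):
    """Run each command in order from the repository root, stopping on the first failure."""
    for command in build_commands(folders):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_datasets to execute.
    for command in build_commands(["daily_dialog"]):
        print(" ".join(command))
```

Using `check=True` stops at the first failing script, so a failed download is not followed by a processing run on missing data.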

Detailed processing for each dataset:

Daily Dialog
- Removed the spaces that tokenization left before punctuation
Persona Chat
- Used Hugging Face's version [link]
- Restored the casing of lowercased utterances
- Removed the spaces that tokenization left before punctuation
Cornell Movie Corpus
- Ignored UTF-8 decoding errors
- Extracted names
Taskmaster
- Nothing specific
CCPE
- Nothing specific
Frames
- Nothing specific
Chit-Chat Challenge
- Nothing specific
Self-dialogue
- Nothing specific
Schema Dialog
- Nothing specific
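
Two of the steps listed above, removing the spaces before punctuation and ignoring UTF-8 errors, can be sketched as follows. This is an illustration of the technique, not the repository's implementation. The regular expression and both function names are assumptions, and the rule is deliberately simple (for example, it would also turn "3 . 5" into "3. 5").

```python
import re

# Whitespace before closing punctuation, as left by whitespace tokenization.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.!?;:])")


def remove_punctuation_spaces(text):
    """Join closing punctuation to the preceding word: 'Hi , Jim !' -> 'Hi, Jim!'."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text).strip()


def decode_ignoring_errors(raw):
    """Decode bytes as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(remove_punctuation_spaces("Say , Jim , how about a beer ?"))
    print(decode_ignoring_errors(b"caf\xe9 ok"))
```

Dropping invalid bytes loses the affected characters, so this suits cases where a few damaged characters matter less than a failed run.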

Links

Daily Dialog [link]
Conversational flow in Oxford-style debates [link]
Persona-chat [Google Drive]
