A PyTorch implementation of "Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study"
After checking out the repository, be sure to initialize the included git submodules:

```shell
git submodule update --init --recursive
```
This project requires PyTorch, which can be installed by following the directions on its project page.
This project also uses the fairseq NLP library, which is included as a submodule in this repository. To prepare the library for use, install it along with its dependencies:

```shell
cd fairseq
pip install -r requirements.txt
python setup.py build develop
```
All OpenNMT scripts have been grouped under the `opennmt-scripts` folder.
The first step is to prepare the source and target pairs of training and validation data. Extract the original `lang-8-en-1.0.zip` under the `corpus` folder. Then create another folder, `lang-8-opennmt`, under the `corpus` folder to store the re-formatted corpus.
To split the Lang-8 learner training data, use the following command:

```shell
python transform-lang8.py -src_dir <dataset-src> -out_dir <corpus-dir>
```

e.g.

```shell
python transform-lang8.py -src_dir ../corpus/lang-8-en-1.0 -out_dir ../corpus/lang-8-opennmt
```
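Conceptually, this step turns each Lang-8 entry into parallel source/target sentence pairs for seq2seq training. The sketch below illustrates that idea only; the field layout (learner sentences at index 4, per-sentence correction lists at index 5) and the sample entry are assumptions about the corpus format, not taken from `transform-lang8.py` itself:

```python
# Hypothetical sketch of extracting (source, target) pairs from a
# Lang-8 entry. Each line of the corpus is assumed to be a JSON array
# whose 5th element holds the learner's sentences and whose 6th holds
# a list of corrections for each sentence; this layout is an
# assumption for illustration, not the script's actual implementation.
import json

sample_line = json.dumps([
    "journal-1", "user-1", "English", "Japanese",
    ["I has a pen .", "She go to school ."],
    [["I have a pen ."], []],  # second sentence received no correction
])

def extract_pairs(line):
    entry = json.loads(line)
    sentences, corrections = entry[4], entry[5]
    pairs = []
    for src, tgts in zip(sentences, corrections):
        for tgt in tgts:  # keep only sentences that were corrected
            pairs.append((src, tgt))
    return pairs

print(extract_pairs(sample_line))
```

Uncorrected sentences are dropped here; a real pipeline might instead keep them as identity pairs so the model also learns to leave correct sentences unchanged.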
Once the data has been extracted from the dataset, use OpenNMT to prepare the training and validation data and create the vocabulary:

```shell
preprocess-lang8.bat
```
To train the error-correcting model, run the following command:

```shell
train.bat
```
Note that this script may need to be adjusted based on the GPU and memory resources available for training.
To test the model, run the following command to try to correct a test list of sentences:

```shell
translate.bat
```
After the sentences have been translated, the source and target sentences may be compared side by side using the following command:

```shell
python compare.py
```
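The idea behind the comparison is simply to line up each source sentence with the model's output. A minimal sketch of that idea, assuming the two files are line-aligned (the file names are placeholders created for this demo, not `compare.py`'s actual arguments):

```python
# Minimal side-by-side comparison of source sentences and model
# corrections. Assumes the two input files are line-aligned; the
# paths below are throwaway demo files, not the real script inputs.
import os
import tempfile

def compare(src_path, out_path):
    rows = []
    with open(src_path, encoding="utf8") as src, \
         open(out_path, encoding="utf8") as out:
        for s, o in zip(src, out):
            s, o = s.strip(), o.strip()
            marker = " " if s == o else "*"  # flag changed sentences
            rows.append(f"{marker} SRC: {s}")
            rows.append(f"{marker} OUT: {o}")
    return rows

# Demo with a toy pair of line-aligned files.
d = tempfile.mkdtemp()
src_path = os.path.join(d, "src.txt")
out_path = os.path.join(d, "out.txt")
with open(src_path, "w", encoding="utf8") as f:
    f.write("I has a pen .\n")
with open(out_path, "w", encoding="utf8") as f:
    f.write("I have a pen .\n")
print("\n".join(compare(src_path, out_path)))
```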
If `preprocess.py` fails with a `TypeError`, then you may need to patch OpenNMT-py. Update `OpenNMT-py\onmt\inputters\dataset_base.py` with the following code:

```python
def __reduce_ex__(self, proto):
    "This is a hack. Something is broken with torch pickle."
    return super(DatasetBase, self).__reduce_ex__(proto)
```
If `TypeError: __init__() got an unexpected keyword argument 'dtype'` occurs, the pytorch/text package installed by pip may be out of date. Update it using:

```shell
pip install git+https://github.com/pytorch/text
```
If `RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS` occurs during training, try installing a PyTorch build for CUDA 9.2 using conda (e.g. `conda install pytorch cudatoolkit=9.2 -c pytorch`) instead of the default CUDA 9.0 build.
All fairseq scripts have been grouped under the `fairseq-scripts` folder.
The first step is to prepare the source and target pairs of training and validation data. Extract the original `lang-8-en-1.0.zip` under the `corpus` folder. Then create another folder, `lang-8-fairseq`, under the `corpus` folder to store the re-formatted corpus.
To split the Lang-8 learner training data, use the following command:

```shell
python transform-lang8.py -src_dir <dataset-src> -out_dir <corpus-dir>
```

e.g.

```shell
python transform-lang8.py -src_dir ../corpus/lang-8-en-1.0 -out_dir ../corpus/lang-8-fairseq
```
Once the data has been extracted from the dataset, use fairseq to prepare the training and validation data and create the vocabulary:

```shell
preprocess-lang8.bat
```
To train the error-correcting model, run the following command:

```shell
train-lang8-cnn.bat
```
Note that this script may need to be adjusted based on the GPU and memory resources available for training.
To test the model, run the following command to try to correct a test list of sentences:

```shell
translate-lang8-cnn.bat
```
If the error `AttributeError: function 'bleu_zero_init' not found` occurs on Windows, modify the affected functions to have `__declspec(dllexport)` and then build again. See Issue 292.
If a `UnicodeDecodeError: 'charmap' codec can't decode byte` error occurs, modify `fairseq/tokenizer.py` to include `encoding='utf8'` in all `open` calls.
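This error happens because Python's default text encoding on many Windows setups is cp1252 (the "charmap" codec), which cannot decode arbitrary UTF-8 bytes. A small illustration of the fix, using a throwaway file rather than fairseq's actual sources:

```python
# Demonstrates the fix described above: pass encoding='utf8' to open()
# instead of relying on the platform default (cp1252 on many Windows
# machines). The file here is a throwaway created for this demo, not
# a fairseq source file.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("naïve sentence\n")  # contains a non-ASCII character

# Without encoding='utf8', the next call could raise UnicodeDecodeError
# under a cp1252 default; with it, decoding is platform-independent.
with open(path, encoding="utf8") as f:
    text = f.read().strip()
print(text)
```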
When trying the built-in examples from `fairseq/examples/translation/prepare-[dataset].sh`, the scripts may need their .py paths changed from `$BPEROOT/[script].py` to `$BPEROOT/subword_nmt/[script].py`.