Retraining CluProcessor

Mihai Surdeanu edited this page Sep 24, 2017 · 13 revisions

CluProcessor is the lab's internal suite of NLP tools, which includes only tools licensed under the Apache license. All of its components (with the exception of the tokenizer and lemmatizer) are largely language- and domain-independent, and can be trained on other domains relatively quickly. Follow the instructions below to retrain the components in CluProcessor.

Retraining the part-of-speech (POS) tagger

TODO

Retraining the maltparser models

First, download maltparser from here: http://www.maltparser.org/download.html

We are currently using version 1.9.0. If you change the version, copy the corresponding appdata/ directory from the new malt distribution to this location in processors: modelsmain/src/main/resources/appdata/. After copying, replace every occurrence of @version@ with the actual version number (e.g., 1.9.0). This affects the files appdata/options.xml and appdata/release.properties.
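The copy-and-substitute step can be sketched as a small shell function. This is only an illustration: the function name update_malt_appdata is hypothetical, and it assumes that appdata/ sits at the top level of the unpacked malt distribution and that the version string contains no characters special to sed (such as / or &).

```shell
# Hypothetical helper, not part of processors or maltparser.
# Usage: update_malt_appdata <malt distribution dir> <version> <destination appdata dir>
update_malt_appdata() {
  malt_dir="$1"
  version="$2"
  dest="$3"

  # Copy the contents of the distribution's appdata/ into the destination
  # (assumption: appdata/ is at the top level of the distribution).
  mkdir -p "$dest"
  cp -R "$malt_dir/appdata/." "$dest/"

  # Fill in the @version@ placeholder in the two affected files.
  # "-i.bak" is accepted by both GNU and BSD sed; the backup is removed afterwards.
  for f in "$dest/options.xml" "$dest/release.properties"; do
    sed -i.bak "s/@version@/$version/g" "$f" && rm -f "$f.bak"
  done
}
```

For example, from the root of the processors checkout, with the distribution unpacked next to it: update_malt_appdata maltparser-1.9.0 1.9.0 modelsmain/src/main/resources/appdata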

Use the following commands to train the forward (i.e., left-to-right) model:

mkdir -p output

java -jar maltparser-1.9.0/maltparser-1.9.0.jar -w output -c en-forward-nivre -i <COMBINED TRAIN FILE FROM WSJ AND GENIA> -a nivreeager -m learn -l liblinear -llo -s_4_-c_0.1 -d POSTAG -s Input[0] -T 1000 -F NivreEager.xml

where:

  • The combined train file is available on our servers at: corpora/processors/deps/combined/wsjtrain-wsjdev-geniatrain-geniadev.conllx
  • NivreEager.xml is the file located at appdata/features/liblinear/conllx/NivreEager.xml
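Before starting a long training run, it may be worth checking that the training file is well-formed CoNLL-X, which uses ten tab-separated fields per token and a blank line between sentences. The function below is a minimal sketch of such a check. The name check_conllx is hypothetical, and the check is not part of maltparser or processors.

```shell
# Hypothetical helper, not part of processors or maltparser.
# Succeeds only if every non-blank line of the file has exactly 10
# tab-separated fields (the CoNLL-X token format); blank lines, which
# separate sentences, are ignored. Offending lines are reported on stdout.
check_conllx() {
  awk -F '\t' '
    NF > 0 && NF != 10 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Run it on the combined train file first, e.g. check_conllx wsjtrain-wsjdev-geniatrain-geniadev.conllx, and start the java command above only if it exits successfully.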