Retraining CluProcessor
CluProcessor is the lab's internal suite of NLP tools, which includes only tools licensed under the Apache license. All these components (with the exception of the tokenizer and lemmatizer) are largely language and domain independent, and can be trained on other domains relatively quickly. Please follow these instructions to re-train the components in CluProcessor.
Use the following command line to retrain the POS tagger:
sbt 'run-main org.clulab.processors.clu.sequences.PartOfSpeechTagger -train <YOUR_TRAIN_FILE> -model <FILE_WHERE_YOU_WANT_TO_SAVE_YOUR_MODEL> -test <YOUR_TEST_FILE>'
The model file is simply a text file in which the classifier saves the statistics it learned from the data. The training file and the testing file share the same format: one word per line, where each line contains the word itself and its POS tag, separated by a tab, with an empty line between sentences. For example, the beginning of the training file from the Penn Treebank looks like this:
Pierre	NNP
Vinken	NNP
,	,
61	CD
years	NNS
old	JJ
,	,
will	MD
join	VB
the	DT
board	NN
as	IN
a	DT
nonexecutive	JJ
director	NN
Nov.	NNP
29	CD
.	.

Mr.	NNP
Vinken	NNP
is	VBZ
chairman	NN
of	IN
Elsevier	NNP
N.V.	NNP
,	,
the	DT
Dutch	NNP
publishing	VBG
group	NN
.	.
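The format described above can be sanity-checked before training. The following is a minimal sketch, assuming a POSIX shell with awk available; `check_pos_format` is a hypothetical helper name and is not part of CluProcessor.

```shell
# check_pos_format FILE
# Hypothetical helper (not part of CluProcessor): reports every line that is
# neither empty nor exactly "word<TAB>tag", and fails if any such line exists.
check_pos_format() {
  awk -F '\t' '
    $0 == "" { next }                      # empty line: sentence boundary
    NF != 2 || $1 == "" || $2 == "" {      # otherwise require word<TAB>tag
      printf "line %d: expected word<TAB>tag\n", NR
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Running it on both the training and the testing file should catch space-separated columns or missing tags before the tagger is launched.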
The features for the POS tagger are implemented in this file: main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala, in the method featureExtractor(). Most of these features are language independent. The only exception is FeatureExtractor.lemma(), which currently relies on the English lemmatizer in CluProcessor.
To retrain the dependency parser, first download maltparser from here: http://www.maltparser.org/download.html
We are currently using version 1.9.0. If you change the version number, copy the corresponding appdata/ directory from the malt distribution again to this location in processors: modelsmain/src/main/resources/appdata/. When copying over an appdata/ directory from a newer malt version, make sure to replace @version@ with the actual version number (e.g., 1.9.0). This affects the files appdata/options.xml and appdata/release.properties.
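The placeholder substitution can be scripted. This is a minimal sketch, assuming `sed` is available and that the placeholder appears literally as `@version@` in both files; `set_malt_version` is a hypothetical helper name, and the directory and version are passed in by the caller.

```shell
# set_malt_version APPDATA_DIR VERSION
# Hypothetical helper: replaces the @version@ placeholder in options.xml and
# release.properties inside APPDATA_DIR. Both arguments come from the caller.
set_malt_version() {
  appdata_dir="$1"
  version="$2"
  for f in "$appdata_dir/options.xml" "$appdata_dir/release.properties"; do
    # -i.bak is accepted by both GNU and BSD sed; the backup is removed after.
    sed -i.bak "s/@version@/$version/g" "$f" && rm -f "$f.bak" || return 1
  done
}
```

For example, `set_malt_version modelsmain/src/main/resources/appdata 1.9.0` would update the copy inside processors.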
Use the following commands to train the forward, i.e., left-to-right model:
mkdir -p output
java -jar maltparser-1.9.0/maltparser-1.9.0.jar -w output -c en-forward-nivre -i <COMBINED TRAIN FILE FROM WSJ AND GENIA> -a nivreeager -m learn -l liblinear -llo -s_4_-c_0.1 -d POSTAG -s Input[0] -T 1000 -F NivreEager.xml
where:
- The combined train file is available on our servers at: corpora/processors/deps/combined/wsjtrain-wsjdev-geniatrain-geniadev.conllx
- The NivreEager.xml is the one located under appdata/features/liblinear/conllx/NivreEager.xml