-
Notifications
You must be signed in to change notification settings - Fork 0
Our experiments with cross-lingual parsing. A croSSSynt project. Starting with the SFNW paper (Slavic Forest, Norwegian Wood).
ptakopysk/crosssynt
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
To use the "Slavic Forest, Norwegian Wood" cross-lingual parsing system, you need to fulfill several prerequisities: You need to have the following software: - Bash - Python 3 - Perl - Treex (http://ufal.mff.cuni.cz/treex) - UDPipe (bundled in tools/) - word2vec (bundled in tools/) - Moses (install it in mosesdecoder/ !!!) You need to have the following data: - UD treebank for the source language - train is sufficient - UD treebank for the target language - dev for evaluation - train for training the tagger and the supervised upper bound (or you may provide a trained target tagger and omit the supervised evaluation) - parallel data (separate plaintext file for each of the languages, one sentence per line, with an implicit sentence-level alignment, i.e. sentences on the same line correspond to each other) Small sample data for Czech and Polish are provided in the sample/ directory. To run the pipeline on the sample data, run: ./sample.sh This will print a lot of debug info to the error output, and eventually the evaluation scores to standard output. Look into the sample.sh script to see what needs to be done to run the pipeline on some data: - put the necessary treebanks into the treebanks/ directory - put the necessary parallel data into the para/ directory - create the script to run from the all-SSS-TTT.sh template by substituting SSS for the source language and TTT for the target language (can be done by running ./new.sh [srclang] [tgtlang]) - go to the run/ directory and run the script The source code of the template file is commented, modify it or remove parts of it according to your needs. If you have an SGE cluster, use the all-SSS-TTT.shc template instead, generate the run script with ./new_shc.sh, and submit it by going to run/ and calling qsub scriptname.shc (you may need to change some of the setting in the shc template). Of course, you need to get the treebank and parallel data somewhere. If you are lucky, you can get these automatically: - for treebanks, simply run: ./get_UD-1.4_treebanks.sh - for parallel data, try to run e.g.: ./try_get_para_SRC_TGT.sh cs pl Authors: Rudolf Rosa, Daniel Zeman, David Mareček and Zdeněk Žabokrtský <{rosa,zeman,marecek,zabokrtsky}@ufal.mff.cuni.cz> Institute of Formal and Applied Linguistics, Charles University Licence: GPL v2 Year: 2017 Versions v1.0 corresponds to our TLT 2018 paper: Rosa Rudolf, Žabokrtský Zdeněk: Error Analysis of Cross-lingual Tagging and Parsing. In: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Copyright © Univerzita Karlova, Praha, Czechia, ISBN 978-80-88132-04-2, pp. 106-118, 2017 v2.0 corrseponds to my 2018 dissertation: Rosa Rudolf: Discovering the structure of natural language sentences by semi-supervised methods. Ph.D. thesis, Charles University, Faculty of Mathematics and Physics, Praha, Czechia, 186 pp., Jun 2018
About
Our experiments with cross-lingual parsing. A croSSSynt project. Starting with the SFNW paper (Slavic Forest, Norwegian Wood).
Topics
Resources
Stars
Watchers
Forks
Packages 0
No packages published