incorporate Bloomfield's texts #24

Open
dwhieb opened this issue Feb 10, 2021 · 2 comments
Labels
corpora Changes to the corpora settings

Comments


dwhieb commented Feb 10, 2021

Add Bloomfield's texts as an additional corpus.

dwhieb added the "enhancement" (User-facing features or improvements) and "corpora" (Changes to the corpora settings) labels, then removed the "enhancement" label, Feb 10, 2021

dwhieb commented Mar 17, 2021

@katieschmirler informs me these are ready for import into Korp!


aarppe commented Jul 30, 2024

@fbanados A Korp version of the Bloomfield texts can be found here: altlab/crk/generated/bloomfield_fst+cg+gloss.vrt

This is created with the following invocation:

cat corpora/bloomfield.korp-vrt |
  gawk '{ if(match($0, ".+~$")!=0) sub("~$",""); print; }' |
  bin/fst-cg-analyze-vrt.sh analyser-gt-strict.hfstol /Users/arppe/gt/lang-crk/src/cg3/disambiguator.cg3 analyser-gt-relaxed.hfstol /Users/arppe/gt/lang-crk/src/cg3/functions.cg3 generator-gt-strict.hfstol |
  bin/vrt2korp.sh > generated/bloomfield_fst+cg+gloss.vrt
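
For anyone reproducing this, the gawk stage appears to do only one thing: it drops a single trailing "~" from any line that ends in one (and has at least one character before it) and passes every other line through unchanged. A minimal sketch of that step in isolation, assuming plain POSIX awk behaves the same as gawk here; the function name `strip_trailing_tilde` and the sample input are made up for illustration:

```shell
# Hypothetical helper mirroring the gawk stage of the pipeline:
# remove one trailing "~" from lines that have at least one character before it.
strip_trailing_tilde() {
  awk '/.+~$/ { sub(/~$/, "") } { print }'
}

# Illustrative input only: a line ending in "~", a bare "~", and a plain line.
printf 'abc~\n~\nplain\n' | strip_trailing_tilde
```

A line consisting of only "~" is left alone, because the original condition `.+~$` requires something before the tilde.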

This is largely the same as the Ahenakew-Wolfart (A-W) corpus, except that it has only three levels: <corpus>, <subcorpus> (2 values), and <text> (the tens of individual texts). The lang field is defined at the corpus level, whereas in the A-W corpus it is defined for each text; this may need changing.
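
To make the difference in where lang lives easier to check on import, here is a rough sketch that counts which structural level carries a lang attribute. The element names come from the description above; the attribute names, ids, values, and both function names are hypothetical, and the sample is not taken from the actual file:

```shell
# Hypothetical three-level VRT fragment; attribute names and values are illustrative only.
sample_vrt() {
  cat <<'EOF'
<corpus id="bloomfield" lang="crk">
<subcorpus id="sub1">
<text id="t1">
token1
</text>
<text id="t2">
token2
</text>
</subcorpus>
</corpus>
EOF
}

# Count opening tags of the given element that carry a lang attribute.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the function usable under "set -e".
count_lang() {
  grep -c "^<$1 [^>]*lang=" || true
}

sample_vrt | count_lang corpus
sample_vrt | count_lang text
```

Running `count_lang text` against the real generated file (by piping it in instead of `sample_vrt`) would show whether any per-text lang attributes exist, which is the case the A-W settings seem to expect.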

Eventually I would probably want to make more use of the underlying XML sources (e.g. the word-specific as well as sentence-specific translations, which would add fields to the linguistic analyses), but incorporating this version could be a good start.

2 participants