This is the sister repo of Somali_NLP. The data here have been collected from a range of different places, and we have sometimes gone to great lengths to clean them up. For project aims and milestones, please see the Somali NLP repo linked above.
If you’ve any ideas about how to clean up or improve the data, please get in touch. I can be reached on Twitter.
Here you’ll find three CSV files of about 8 MB each (large, but not that huge!). Between them, these three files contain the entire Somali Wikipedia corpus. Each file has two columns: one for the title of each article and one for the article’s text.
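Since each file pairs an article title with its text, the three files can be read into a single list with Python's standard `csv` module. This is only a minimal sketch: the column names `title` and `text` are assumptions (check the header row of the actual files), and the in-memory sample row is made up purely for illustration.

```python
import csv
import io

# Assumed column names; adjust them to match the header row of the real files.
TITLE_COL = "title"
TEXT_COL = "text"

# Article texts can be long, so raise the csv module's per-field size limit.
csv.field_size_limit(10**9)


def read_articles(stream):
    """Yield (title, text) pairs from an open CSV stream that has a header row."""
    for row in csv.DictReader(stream):
        yield row[TITLE_COL], row[TEXT_COL]


def load_corpus(paths):
    """Read several CSV files and return one combined list of (title, text) pairs."""
    articles = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            articles.extend(read_articles(f))
    return articles


# Made-up, in-memory sample standing in for one of the CSV files.
sample = io.StringIO("title,text\nSoomaaliya,Soomaaliya waa dal.\n")
sample_articles = list(read_articles(sample))
```

To load the real data, pass the paths of the three CSV files in this repo to `load_corpus`.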
We took this corpus by Masaryk University, which supposedly comprises over 80 million tokens (individual words). We cleaned up these tokens, removed the XML data around them, and removed duplicates. We then sorted the tokens into grammatical categories (is a word a verb, an adjective, a noun, etc.). These categories still need a LOT of work because many tokens are still uncategorized, but the foundation is there.
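The cleanup steps described above (stripping XML and removing duplicates) could look roughly like the sketch below. It is not the script we actually used: it assumes the raw corpus has one token or XML tag per line, the function name `clean_tokens` is hypothetical, and the sample lines are invented. Sorting into grammatical categories is not covered here.

```python
import re

# Matches any XML-style tag, e.g. <doc id="1">, </doc> or <g/>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_tokens(lines):
    """Strip XML tags, drop empty lines, and remove duplicates (first-seen order is kept)."""
    seen = set()
    unique = []
    for line in lines:
        token = TAG_RE.sub("", line).strip()
        if token and token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


# Invented sample input: one token or tag per line.
raw_lines = ['<doc id="1">', "buug", "<g/>", "buug", "qoray", "</doc>"]
tokens = clean_tokens(raw_lines)
```

Deduplication here is case-sensitive; lowercasing tokens first would be a further design choice to weigh against losing proper-noun capitalization.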
The Hadrawi data were contributed by Mohamed Ainab. They can be found in the original repo here.