GitHub - owaiswiz/webcorpus-trigram-json-model: Processed ngrams JSON model from web corpus of over 330 million words

A trigram model in JSON, built from a scraped corpus of human-written articles and blog posts. The corpus contains over 330 million words (over 2 gigabytes in size).

The model file itself is only 60 MB, because statistically insignificant trigrams (frequency < 5) were filtered out.

The JSON file contains a single object of simple key-value pairs: each key is a trigram (its three words joined by single spaces) and each value is that trigram's frequency in the scraped corpus.
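As an illustration of the layout described above, here is a minimal Python sketch for loading the model and looking up a trigram. The function names `load_trigram_model` and `trigram_frequency` are hypothetical helpers, not part of this repository. The sketch assumes only what is stated above: one JSON object whose keys are three space-joined words and whose values are frequencies.

```python
import json


def load_trigram_model(path):
    """Load the JSON object mapping 'w1 w2 w3' -> frequency into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def trigram_frequency(model, w1, w2, w3):
    """Return the stored frequency of a trigram, or 0 if it is absent.

    Absent can mean the trigram never occurred, or that it occurred
    fewer than 5 times and was filtered out of the file.
    """
    key = " ".join((w1, w2, w3))  # keys are words joined by a single space
    return model.get(key, 0)
```

For example, `model = load_trigram_model("web-corpus-trigrams.json")` followed by `trigram_frequency(model, "one", "of", "the")` returns that trigram's count. Whether the keys are lowercased or keep punctuation is not stated here, so it is worth inspecting a few entries before relying on exact matches.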

A use case: Detecting Spun Content using n-gram analysis
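One way such a check might look is sketched below. This is not necessarily the method used in the linked write-up; it is a simple heuristic that measures what fraction of a text's trigrams are missing from the model. The helper names, the lowercasing, and the word-splitting regex are all assumptions and may not match how the corpus was tokenized.

```python
import re


def trigrams(text):
    """Split text into lowercase words and return its space-joined trigrams."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenization
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def unseen_trigram_ratio(text, model):
    """Fraction of the text's trigrams that do not appear in the model.

    `model` is the dict loaded from the JSON file. A high ratio suggests
    unusual phrasing, which may (or may not) indicate spun content.
    """
    grams = trigrams(text)
    if not grams:
        return 0.0  # fewer than three words: nothing to score
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)
```

Because trigrams with frequency < 5 were dropped from the file, natural but rare phrasing also counts as unseen. Any threshold on this ratio would need to be calibrated against known human-written text.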

Repository files: README.md, web-corpus-trigrams.json
