Processed ngrams JSON model from web corpus of over 330 million words

owaiswiz/webcorpus-trigram-json-model

A trigram model in JSON, built by scraping human-written articles and blog posts. The source corpus consists of over 330 million words (over 2 gigabytes in size).

The file itself is only 60 MB because statistically insignificant terms (those with a frequency below 5) were filtered out.

The JSON file contains a single object of simple key-value pairs, where each key is a trigram (its three words joined by single spaces) and each value is the number of times that trigram occurs in the scraped corpus.
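
As a rough illustration of this format, the sketch below loads such a file and looks up one trigram. The function names and the sample counts are hypothetical, and the file path is whatever you saved the model as; whether keys are lowercased or otherwise normalized is not stated here, so the lookup uses the words exactly as given.

```python
import json

def load_trigram_model(path):
    """Load the trigram-to-frequency mapping from a JSON file on disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def trigram_frequency(model, w1, w2, w3):
    """Return the stored frequency of a trigram, or 0 if it is absent.

    Absent trigrams either never occurred in the corpus or were
    filtered out for having a frequency below 5.
    """
    return model.get(f"{w1} {w2} {w3}", 0)

# Tiny in-memory stand-in for the real file; the counts are made up.
sample_model = json.loads('{"one of the": 120, "as well as": 95}')
```

For the real model, `load_trigram_model` would be called with the path to the downloaded JSON file in place of the in-memory sample.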

A use case: detecting spun content using n-gram analysis
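
One plausible heuristic for this use case, offered only as a sketch and not necessarily the method the author had in mind, is to measure what share of a text's trigrams are missing from the model. Spun text tends to contain unusual word sequences, so a high share may be a warning sign. The tokenizer, the function names, and the sample data below are all assumptions.

```python
import re

def trigrams(text):
    """Split text into lowercase words and return its space-joined trigrams.

    Lowercasing and the simple word pattern are assumptions; they should
    match however the model's keys were actually normalized.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def unseen_trigram_ratio(text, model):
    """Return the fraction of the text's trigrams that are not in the model."""
    grams = trigrams(text)
    if not grams:
        return 0.0  # fewer than three words: nothing to measure
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)

# Made-up model entries for demonstration only.
demo_model = {"the cat sat": 10, "cat sat on": 7}
demo_ratio = unseen_trigram_ratio("The cat sat on mat", demo_model)
```

Any threshold for flagging a text would need to be calibrated on known human-written and known spun samples, since legitimate text also contains trigrams that fell below the frequency cutoff.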
