Parts_of_speech_tagging

Structure of this repo

bc.test - Test data set
bc.test.tiny - Small test data set for unit tests
bc.train - Training data set
pos_solver.py - Main code to perform part-of-speech tagging
pos_scorer.py - Code to evaluate the performance of the solver

Aim: find the part of speech of each word in a sentence. The observed variables are the words; the hidden states are the parts of speech.

Training the model: While training on a large set of labeled data, we build the following dictionaries, which are used to calculate the emission and transition probabilities of the Bayes net.

word_frequency: how many times each word appears in the training data
pos_frequency: how many times each part of speech appears in the training data
word_pos_frequency: how many times each combination of word and part of speech appears in the training data
transition_frequency: how many times each pair of parts of speech appears one after the other in the training data
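The four dictionaries above can be sketched as follows. This is a minimal illustration, not the code in pos_solver.py: the function name `count_frequencies` and the input format (a list of `(words, tags)` pairs per sentence) are assumptions, since the layout of bc.train is not described here.

```python
from collections import Counter

def count_frequencies(tagged_sentences):
    """Build the four frequency tables from (words, tags) sentence pairs.

    The input format is an assumption made for this sketch.
    """
    word_frequency = Counter()
    pos_frequency = Counter()
    word_pos_frequency = Counter()
    transition_frequency = Counter()
    for words, tags in tagged_sentences:
        word_frequency.update(words)
        pos_frequency.update(tags)
        # Keys are (word, part of speech) pairs.
        word_pos_frequency.update(zip(words, tags))
        # Keys are (previous part of speech, current part of speech) pairs.
        transition_frequency.update(zip(tags, tags[1:]))
    return word_frequency, pos_frequency, word_pos_frequency, transition_frequency

# Toy data, invented for illustration only.
toy_data = [
    (("the", "dog", "barks"), ("det", "noun", "verb")),
    (("the", "cat", "sleeps"), ("det", "noun", "verb")),
]
word_frequency, pos_frequency, word_pos_frequency, transition_frequency = (
    count_frequencies(toy_data)
)
```

Dividing a `word_pos_frequency` count by the matching `pos_frequency` count gives an emission estimate P(w | s); dividing a `transition_frequency` count by the count of the first part of speech gives a transition estimate.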

Simplified Bayes net:

We assign each word the part of speech that maximizes P(part of speech | word): P(s | w) = P(s, w) / P(w) = (frequency of the word with that part of speech in the training set) / (frequency of the word in the training set).

If the word does not appear in the training set, we assign it "noun".

To calculate the posterior, we multiply the emission probabilities P(w | s) over all words and their assigned labels and take the logarithm of the product.
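The simplified model can be sketched in a few lines. The function names and toy counts below are hypothetical, and the `unseen` floor used inside the logarithm is an assumption borrowed from the 10**-10 value used elsewhere in this README; the actual code in pos_solver.py may differ.

```python
import math

def simplified_tag(words, tags, word_frequency, word_pos_frequency):
    """Tag each word independently with the s that maximizes P(s | w)."""
    result = []
    for w in words:
        count_w = word_frequency.get(w, 0)
        if count_w == 0:
            result.append("noun")  # unseen word: default to "noun"
        else:
            result.append(
                max(tags, key=lambda s: word_pos_frequency.get((w, s), 0) / count_w)
            )
    return result

def simplified_log_posterior(words, labels, pos_frequency, word_pos_frequency,
                             unseen=1e-10):
    """Log of the product of emissions P(w | s), computed as a sum of logs."""
    total = 0.0
    for w, s in zip(words, labels):
        count_s = pos_frequency.get(s, 0)
        emission = word_pos_frequency.get((w, s), 0) / count_s if count_s else 0.0
        # Assumption: zero probabilities are floored so the log is defined.
        total += math.log(emission if emission > 0 else unseen)
    return total

# Toy counts, invented for illustration only.
tags = ["det", "noun", "verb"]
word_frequency = {"the": 2, "dog": 1, "barks": 1}
pos_frequency = {"det": 2, "noun": 2, "verb": 2}
word_pos_frequency = {("the", "det"): 2, ("dog", "noun"): 1, ("barks", "verb"): 1}

predicted = simplified_tag(["the", "dog", "zebra"], tags,
                           word_frequency, word_pos_frequency)
score = simplified_log_posterior(["the", "dog"], ["det", "noun"],
                                 pos_frequency, word_pos_frequency)
```

Summing logarithms is equivalent to taking the logarithm of the product, and avoids underflow on long sentences.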

For the Bayes net with transitions between consecutive parts of speech (a hidden Markov model), we use the Viterbi algorithm.

In the v-table, the initial probabilities are calculated by multiplying the emission probability P(w | s) by the probability that a sentence starts with that part of speech.

The probabilities at the later time steps are calculated by multiplying the emission probability P(w | s) by the maximum, over the previous parts of speech, of the product of P(Si | Si-1) and vi(t-1).

For backtracking, we keep a "which" table that stores, for each part of speech at each time step, the previous part of speech that gave the maximum product of P(Si | Si-1) and vi(t-1).

If the word is not present in the training set, we give it a very small probability of 10**-10 in the v-table.
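The v-table and "which" table described above can be sketched as follows. This is an illustration under stated assumptions, not the code in pos_solver.py: the probability tables are passed in as plain dictionaries with hypothetical names, the sketch adds log probabilities instead of multiplying probabilities (same maximizer, fewer underflow problems), and the 10**-10 floor is applied to every missing or zero entry rather than only to unseen words.

```python
import math

def viterbi(words, tags, initial, transition, emission, unseen=1e-10):
    """Most likely tag sequence under an HMM, computed in log space.

    initial[s] = P(sentence starts with s)
    transition[(prev, s)] = P(s | prev)
    emission[(w, s)] = P(w | s)
    Missing or zero entries fall back to `unseen` (an assumption of this sketch).
    """
    if not words:
        return []

    def log_p(table, key):
        p = table.get(key, 0.0)
        return math.log(p if p > 0 else unseen)

    # v[t][s]: best log probability of any tag path ending in s at position t.
    v = [{s: log_p(initial, s) + log_p(emission, (words[0], s)) for s in tags}]
    # which[t - 1][s]: previous tag that achieved v[t][s].
    which = []
    for t in range(1, len(words)):
        v.append({})
        which.append({})
        for s in tags:
            best_prev = max(
                tags, key=lambda p: v[t - 1][p] + log_p(transition, (p, s))
            )
            which[-1][s] = best_prev
            v[t][s] = (v[t - 1][best_prev]
                       + log_p(transition, (best_prev, s))
                       + log_p(emission, (words[t], s)))

    # Backtrack from the best final tag through the "which" table.
    path = [max(tags, key=lambda s: v[-1][s])]
    for back in reversed(which):
        path.append(back[path[-1]])
    return path[::-1]

# Toy probabilities, invented for illustration only.
tags = ["det", "noun", "verb"]
initial = {"det": 1.0}
transition = {("det", "noun"): 1.0, ("noun", "verb"): 1.0}
emission = {("the", "det"): 1.0, ("dog", "noun"): 0.5, ("barks", "verb"): 0.5}

seen_path = viterbi(["the", "dog", "barks"], tags, initial, transition, emission)
unseen_path = viterbi(["the", "zebra", "barks"], tags, initial, transition, emission)
```

In the second call, "zebra" has the same floored emission for every tag, so the transition probabilities alone decide its part of speech.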

Complex Bayes net:

We use an MCMC algorithm (Gibbs sampling) for this Bayes net to estimate the most probable sequence of parts of speech.

We take a sequence of all nouns as the initial sample.

After that, we generate 100 samples using Gibbs sampling and assign each word the part of speech that occurs most often for it across the samples.
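The sampling loop above can be sketched generically. The structure of the complex Bayes net is not described in this README, so the sketch takes the joint log probability as a caller-supplied function; `gibbs_tag`, `log_joint`, and the toy emission table are all hypothetical names, and no burn-in period is used because none is mentioned.

```python
import math
import random
from collections import Counter

def gibbs_tag(words, tags, log_joint, n_samples=100, seed=0):
    """Gibbs sampling sketch: resample one tag at a time, then vote per word.

    log_joint(words, tag_sequence) must return the log of the joint
    probability under whatever model is being sampled (an assumption here).
    """
    rng = random.Random(seed)
    current = ["noun"] * len(words)  # initial sample: all nouns
    votes = [Counter() for _ in words]
    for _ in range(n_samples):
        for i in range(len(words)):
            # Score every candidate tag at position i, all other tags held fixed.
            scores = []
            for s in tags:
                current[i] = s
                scores.append(log_joint(words, current))
            # Turn log scores into sampling weights, shifted by the max for stability.
            top = max(scores)
            weights = [math.exp(x - top) for x in scores]
            current[i] = rng.choices(tags, weights=weights, k=1)[0]
        # Record the completed sample.
        for i, s in enumerate(current):
            votes[i][s] += 1
    # Assign each word the tag it received most often across the samples.
    return [v.most_common(1)[0][0] for v in votes]

# Toy model, invented for illustration: emissions only, sharply peaked.
toy_emission = {("the", "det"): 1.0, ("dog", "noun"): 1.0, ("barks", "verb"): 1.0}

def toy_log_joint(words, tag_sequence):
    return sum(math.log(toy_emission.get((w, s), 1e-10))
               for w, s in zip(words, tag_sequence))

sampled = gibbs_tag(["the", "dog", "barks"], ["det", "noun", "verb"], toy_log_joint)
```

Rescoring the whole sequence for every candidate tag is simple but slow; a real implementation would normally evaluate only the factors that involve the tag being resampled.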

hitheshbusetty/Parts_of_speech_tagging
