arthurdeschamps/TextRank

TextRank

A simple implementation of the TextRank algorithm, tested on research-paper abstracts with human-annotated key phrases.

Algorithms

PageRank

PageRank is an algorithm created by Lawrence Page and Sergey Brin to rank web pages according to their importance, calculated using the "links to" and "linked by" relationships induced by the hyperlink system. Its implementation is based on a Markov model, whose limiting distribution corresponds to the so-called scores.
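As an illustration, a minimal power-iteration version of PageRank can be sketched in a few lines of Python. The three-page graph, the iteration count, and the damping factor d = 0.85 are illustrative choices for the example (and the sketch ignores dangling nodes):

```python
def pagerank(graph, d=0.85, iterations=50):
    """graph: dict mapping each node to the list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    # Start from the uniform distribution.
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each page that links to `node` passes on an equal share
            # of its own score to every page it links to.
            incoming = sum(
                scores[src] / len(targets)
                for src, targets in graph.items()
                if node in targets
            )
            new_scores[node] = (1 - d) / n + d * incoming
        scores = new_scores
    return scores

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Since every node here has outgoing links, the scores stay normalized to 1 and converge to the limiting distribution of the underlying Markov chain.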

TextRank

In 2004, the CS department of the University of North Texas came up with a new algorithm, TextRank, based on PageRank. The idea is the same, except that the nodes now represent text entities (whatever they might be) instead of web pages, and the edges represent relationships between these entities. The goal of TextRank is to extract the most important keywords or key phrases from text excerpts.
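As a hedged sketch of the idea for keywords: build an undirected co-occurrence graph over words and run a PageRank-style iteration on it. The window size, damping factor, and token list below are illustrative choices, not values taken from this repository:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iterations=30):
    # Connect words that co-occur within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # PageRank-style iteration on the undirected graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - d) + d * sum(scores[u] / len(neighbors[u])
                                 for u in neighbors[w])
            for w in neighbors
        }
    # Return words sorted from highest to lowest score.
    return sorted(scores, key=scores.get, reverse=True)

tokens = ["deep", "learning", "model", "learning", "deep", "network"]
top = textrank_keywords(tokens)
```

Words that co-occur with many distinct words end up with the highest scores, mirroring how heavily linked pages rank highest in PageRank.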

TF-IDF

Term Frequency - Inverse Document Frequency is a statistical method for computing the most important and relevant terms in a set of documents. It provides a good baseline for comparison with other methods, in our case TextRank.
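For reference, a compact TF-IDF computation over a toy corpus might look like this. It uses the classic weighting tf = count / document length and idf = ln(N / df); real implementations vary in their smoothing choices:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per doc."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        result.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return result

corpus = [
    ["textrank", "extracts", "keywords"],
    ["tfidf", "extracts", "terms"],
]
scores = tfidf(corpus)
```

A term that appears in every document (here, "extracts") gets an idf of ln(1) = 0, so only document-specific terms receive positive scores.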

We enhanced the base algorithm by adding key-phrase concatenation whenever candidate phrases are adjacent in the text, and syntactic filtering using POS tagging.

Implementation (Pseudo-code)

Pre-processing

This step is common to both algorithms. It filters out the types of words (verbs, determiners, etc.) that the user doesn't deem relevant or useful and creates all possible n-grams, where n goes from 1 to the number set by the user.

*(pseudo-code figure: pre-processing)*
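The pre-processing step described above could be sketched as follows. The hand-written tag lookup and tag names stand in for a real POS tagger, which the actual implementation presumably uses:

```python
def preprocess(tagged_tokens, allowed_tags, n):
    """tagged_tokens: list of (word, POS tag) pairs.
    Returns all 1..n-grams whose words all pass the syntactic filter."""
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tagged_tokens) - size + 1):
            window = tagged_tokens[i:i + size]
            # Keep the n-gram only if every word's tag is allowed.
            if all(tag in allowed_tags for _, tag in window):
                grams.append(" ".join(word for word, _ in window))
    return grams

tagged = [("the", "DET"), ("neural", "ADJ"), ("network", "NOUN"),
          ("learns", "VERB"), ("representations", "NOUN")]
candidates = preprocess(tagged, {"NOUN", "ADJ"}, n=2)
```

With nouns and adjectives allowed and n = 2, the candidates are the three filtered unigrams plus the bigram "neural network"; determiners and verbs never enter a candidate.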

TextRank

*(pseudo-code figure: TextRank)*

TF-IDF

*(pseudo-code figure: TF-IDF)*

Post-processing

This step is common to both algorithms. It uses the collected terms to create larger key phrases if the terms were adjacent in the original document.

*(pseudo-code figure: post-processing)*
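The post-processing step might be sketched like this; the variable names and the example document are illustrative:

```python
def merge_adjacent(tokens, selected):
    """Merge selected terms into longer key phrases whenever they
    appear next to each other in the original document."""
    phrases, current = [], []
    for token in tokens:
        if token in selected:
            current.append(token)          # extend the running phrase
        elif current:
            phrases.append(" ".join(current))  # flush a finished phrase
            current = []
    if current:
        phrases.append(" ".join(current))
    # Deduplicate while preserving first-occurrence order.
    return list(dict.fromkeys(phrases))

doc = ["graph", "based", "ranking", "improves", "keyword", "extraction"]
phrases = merge_adjacent(doc, {"graph", "ranking", "keyword", "extraction"})
```

Here "keyword" and "extraction" are adjacent in the document, so they merge into the key phrase "keyword extraction", while "graph" and "ranking" stay separate.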

Set-up

The following parameters can be tuned:

  • n: maximum length of the grams (1-grams, ..., n-grams) we will use as nodes of the graph.
  • k: number of key-phrases to predict per document.
  • Syntactic filters: which word types to consider as key-phrase candidates.

We used 2 different metrics to evaluate the models:

  • F1-macro score
  • Mean Reciprocal Rank (MRR)*

*Since the final scores of key phrases are not well defined in the original paper, the order of the predictions was chosen randomly.
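For clarity, MRR takes, for each document, the reciprocal of the rank of the first predicted key phrase that matches a gold phrase (0 if none match), then averages over documents. A small illustrative computation with made-up predictions:

```python
def mean_reciprocal_rank(predictions, gold):
    """predictions: list of ranked prediction lists, one per document.
    gold: list of gold key-phrase sets, one per document."""
    total = 0.0
    for preds, answers in zip(predictions, gold):
        for rank, phrase in enumerate(preds, start=1):
            if phrase in answers:
                total += 1.0 / rank   # reciprocal rank of first hit
                break                  # only the first hit counts
    return total / len(predictions)

preds = [["text rank", "graph"], ["tf idf", "term weight"]]
gold = [{"graph"}, {"term weight"}]
mrr = mean_reciprocal_rank(preds, gold)
```

In both documents the first correct phrase sits at rank 2, so the MRR is (1/2 + 1/2) / 2 = 0.5.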

Results

The same results were achieved with several different configurations, so we list only the most computationally efficient ones:

| n | k | Synt. filters | F1-macro | MRR |
|---|---|---------------|----------|-------|
| 1 | 3 | Noun/Adj      | 0.065    | 0.163 |
| 1 | 1 | Noun/Verb     | 0.063    | 0.157 |

TF-IDF results for comparison:

| k | F1-macro | MRR   |
|---|----------|-------|
| 1 | 0.031    | 0.109 |
| 3 | 0.083    | 0.195 |
| 5 | 0.095    | 0.209 |
