A simple implementation of the TextRank algorithm, tested on research-paper abstracts with human-annotated key phrases.
PageRank is an algorithm created by Lawrence Page and Sergey Brin to rank web pages according to their importance, computed from the "links to" and "is linked by" relationships induced by hyperlinks. Its implementation is based on a Markov model, whose limiting (stationary) distribution yields the so-called scores.
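As a minimal sketch of that idea, the stationary distribution can be approximated by power iteration over the link graph (plain NumPy, illustrative only):

```python
import numpy as np

def pagerank(adj: np.ndarray, damping: float = 0.85, tol: float = 1e-8) -> np.ndarray:
    """PageRank scores for a (row -> column) link matrix via power iteration."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Transition matrix of the random surfer; dangling pages jump uniformly.
    M = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n).T
    scores = np.full(n, 1.0 / n)
    while True:
        new = (1 - damping) / n + damping * (M @ scores)
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new

# Toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
adj = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [1., 0., 0.]])
print(pagerank(adj))  # page 2 ends up with the highest score
```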
In 2004, the CS department of the University of North Texas introduced a new algorithm, TextRank, based on PageRank. The idea is the same, but the nodes now represent text entities (whatever they might be) instead of web pages, and the edges represent relationships between these entities. The goal of TextRank is to extract the most important keywords or key phrases from text excerpts.
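A toy illustration of that idea, assuming the `networkx` library: words become nodes, co-occurrence within a small sliding window creates edges, and running PageRank on this graph scores the words (syntactic filtering, described below, is omitted here):

```python
import networkx as nx

def textrank_keywords(tokens: list[str], window: int = 2, top_k: int = 3) -> list[str]:
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        # Connect each token to its neighbours inside the co-occurrence window.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)  # the same iteration PageRank runs on web pages
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = "compatibility of systems of linear constraints".split()
print(textrank_keywords(tokens))
```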
Term Frequency - Inverse Document Frequency (TF-IDF) is a statistical method for computing the most important and relevant terms in a set of documents. It is a good baseline against which to compare other methods, in our case TextRank.
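For illustration, a minimal TF-IDF baseline can be put together with scikit-learn's `TfidfVectorizer` (a sketch, not the exact setup used here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "textrank extracts key phrases from paper abstracts",
    "tf idf weighs terms by frequency and rarity across documents",
]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)   # documents x vocabulary matrix
terms = vectorizer.get_feature_names_out()

# Top 3 terms of the first document, ranked by TF-IDF weight.
row = weights[0].toarray().ravel()
print([terms[i] for i in row.argsort()[::-1][:3]])
```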
We enhanced the base algorithm by concatenating key phrases whenever they are adjacent in the text and by adding syntactic filtering using POS tagging.
This step is common to both algorithms. It filters out the word types (verbs, determiners, etc.) that the user doesn't deem relevant or useful and creates all possible n-grams, where n goes from 1 to the number set by the user.
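A possible sketch of this step using NLTK (assumes the `punkt` and `averaged_perceptron_tagger` data are downloaded; the POS whitelist is illustrative):

```python
import nltk

def candidates(text: str, n: int = 2,
               keep: frozenset = frozenset({"NN", "NNS", "JJ"})) -> list[tuple[str, ...]]:
    tokens = nltk.word_tokenize(text)
    # Syntactic filter: keep only nouns and adjectives in this example.
    words = [w.lower() for w, tag in nltk.pos_tag(tokens) if tag in keep]
    # All 1-grams up to n-grams over the surviving words.
    return [tuple(words[i : i + size])
            for size in range(1, n + 1)
            for i in range(len(words) - size + 1)]

print(candidates("Compatibility of systems of linear constraints", n=2))
```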
This step is common to both algorithms. It uses the collected terms to build larger key phrases whenever the terms were adjacent in the original document.
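A sketch of how adjacent ranked terms might be merged back into phrases (unigram case, illustrative names):

```python
def merge_adjacent(tokens: list[str], keywords: set[str]) -> list[str]:
    phrases, current = [], []
    for token in tokens:
        if token in keywords:
            current.append(token)          # keyword: extend the running phrase
        else:
            if current:                    # non-keyword: close the phrase
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = "systems of linear constraints over natural numbers".split()
print(merge_adjacent(tokens, {"linear", "constraints", "natural", "numbers"}))
# ['linear constraints', 'natural numbers']
```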
The following parameters can be tuned (a configuration sketch follows the list):
- n: maximum n-gram length (1-grams, ..., n-grams); the grams are used as nodes of the graph.
- k: number of key phrases to predict per document.
- Syntactic filters: which word types (POS tags) to consider as key-phrase candidates.
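For illustration, these parameters could be grouped in a small configuration object (names are hypothetical, not taken from the actual code):

```python
from dataclasses import dataclass

@dataclass
class TextRankConfig:
    n: int = 1    # maximum n-gram length (graph nodes)
    k: int = 3    # key phrases predicted per document
    syntactic_filters: frozenset = frozenset({"NN", "NNS", "JJ"})  # kept POS tags
```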
We used two different metrics to evaluate the models (both sketched after this list):
- F1-macro score
- Mean Reciprocal Rank (MRR)*

*Since the final scores of the key phrases are not well defined in the original paper, the predictions' order was chosen randomly.
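A sketch of how both metrics can be computed from per-document predictions; here F1-macro is taken as the per-document F1 averaged over documents, which is an assumption about the exact setup:

```python
def f1(pred: set[str], gold: set[str]) -> float:
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def mrr(preds: list[list[str]], golds: list[set[str]]) -> float:
    # Reciprocal rank of the first correct prediction per document, 0 if none.
    rr = (next((1 / (i + 1) for i, p in enumerate(pred) if p in gold), 0.0)
          for pred, gold in zip(preds, golds))
    return sum(rr) / len(preds)

preds = [["graph", "ranking"], ["tf", "idf"]]
golds = [{"ranking"}, {"idf", "weights"}]
print(mrr(preds, golds))  # (1/2 + 1/2) / 2 = 0.5
```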
Several configurations achieved the same results, so we only list the most computationally efficient ones:
| n | k | Synt. filters | F1-macro | MRR   |
|---|---|---------------|----------|-------|
| 1 | 3 | Noun/Adj      | 0.065    | 0.163 |
| 1 | 1 | Noun/Verb     | 0.063    | 0.157 |
TF-IDF results for comparison:
| k | F1-macro | MRR   |
|---|----------|-------|
| 1 | 0.031    | 0.109 |
| 3 | 0.083    | 0.195 |
| 5 | 0.095    | 0.209 |