This release adds new algorithms for keyword extraction, adds some example notebooks and fixes some bugs.
Addition of an edge weighting option in the graph-of-words. The weight of an edge is the number of times its two tokens co-occur.
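To make the weighting concrete, here is a minimal, library-independent sketch of how co-occurrence counts could be accumulated. The function name `build_weighted_gow` and the exact window convention (each token is linked to the `window_size - 1` tokens that follow it) are assumptions for illustration and may differ from what gowpy does internally.

```python
from collections import Counter


def build_weighted_gow(tokens, window_size=4):
    """Undirected graph-of-words as {(token_a, token_b): co-occurrence count}."""
    weights = Counter()
    for i, source in enumerate(tokens):
        # Assumed convention: pair each token with the next (window_size - 1) tokens.
        for target in tokens[i + 1 : i + window_size]:
            if source != target:
                # Sort the pair so (a, b) and (b, a) count as the same undirected edge.
                weights[tuple(sorted((source, target)))] += 1
    return weights


tokens = "graph of words graph of words model".split()
edges = build_weighted_gow(tokens, window_size=2)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

With `window_size=2` only adjacent tokens are linked, so the edge between "graph" and "of" gets weight 2 because the pair appears twice in the text.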
k-core approaches select cohesive keywords: the selected keywords correspond to a cohesive subgraph of the graph-of-words. In other words, selection operates at the granularity of cohesive subgraphs, so nodes are selected in whole batches rather than one at a time. A key property is that the number of selected keywords adapts automatically to the document.
Three selection methods based on the k-core decomposition of the graph-of-words are now available.
The 'maximum' method simply selects the main core (the k-core with maximum k). This is the default method. It can be viewed as too restrictive. The two other selection methods alleviate this limitation.
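As an illustration of the 'maximum' method, the sketch below computes core numbers by repeatedly peeling the node of minimum remaining degree, then keeps the nodes of the main core. It is a simple quadratic-time version written for readability, not the library's implementation; the helper `core_numbers` and the toy graph are hypothetical.

```python
def core_numbers(adjacency):
    """Return {node: core number} by peeling minimum-degree nodes one at a time."""
    degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Pick the remaining node with the smallest current degree (ties by name).
        node = min(remaining, key=lambda n: (degrees[n], n))
        # The core level never decreases while peeling.
        k = max(k, degrees[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degrees[neighbor] -= 1
    return core


# Toy graph: a, b, c, d form a clique; e hangs on a and b; f hangs on e.
adjacency = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c", "d", "e"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
    "e": ["a", "b", "f"],
    "f": ["e"],
}
core = core_numbers(adjacency)
k_max = max(core.values())
main_core = sorted(node for node, c in core.items() if c == k_max)
print(main_core)  # ['a', 'b', 'c', 'd']
```

Here the main core is the 3-core, which excludes "e" even though it is well connected; this is the kind of restrictiveness the other two methods try to relax.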
On the one hand, the 'density' method goes down the hierarchy of k-cores to select the one retaining the cohesiveness from the perspective of the density of the k-core. The most appropriate k-core is selected via the elbow method.
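The following sketch shows one way a density-based selection could work. It assumes core numbers are already computed, measures the density of each k-core as 2m / (n(n-1)), and takes as elbow the point farthest from the straight line joining the first and last points of the density curve. This elbow criterion is an assumption; the release notes do not say which variant gowpy uses. All names and the toy graph are hypothetical.

```python
# Toy graph: a, b, c, d form a clique; e hangs on a and b; f and g hang on e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("e", "a"), ("e", "b"), ("f", "e"), ("g", "e")]
# Core numbers of the toy graph, assumed precomputed by a k-core decomposition.
core = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 2, "f": 1, "g": 1}


def kcore_density(edges, core, k):
    """Density of the subgraph induced by nodes whose core number is >= k."""
    nodes = {node for node, c in core.items() if c >= k}
    m = sum(1 for u, v in edges if u in nodes and v in nodes)
    n = len(nodes)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def elbow(points):
    """Return the x of the point farthest from the line through the end points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0

    def offset(point):
        # Proportional to the distance from the line (the common norm is omitted).
        return abs(dy * (point[0] - x0) - dx * (point[1] - y0))

    return max(points, key=offset)[0]


# Walk down the hierarchy from the main core to the 1-core.
points = [(k, kcore_density(edges, core, k)) for k in range(max(core.values()), 0, -1)]
best_k = elbow(points)
keywords = sorted(node for node, c in core.items() if c >= best_k)
print(best_k, keywords)
```

On this toy graph the densities are 1.0, 0.8 and about 0.48 for k = 3, 2, 1, so the sketch settles on the 2-core and keeps "e" in addition to the main core.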
On the other hand, the 'inflexion' method exploits the k-shell (the part of the k-core that does not survive in the (k+1)-core). It goes down the hierarchy of k-cores as long as the shells increase in size, and stops otherwise.
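A rough sketch of this stopping rule is given below. It assumes the core numbers are already known, that the k-shell is the set of nodes whose core number is exactly k, and that "increase" means a strict increase; these details and the names used are assumptions rather than the library's documented behavior.

```python
from collections import Counter

# Hypothetical core numbers for a small graph-of-words.
core = {"a": 3, "b": 3, "c": 3, "d": 3,
        "e": 2, "f": 2, "g": 2, "h": 2, "i": 2,
        "j": 1, "k": 1}


def inflexion_level(core):
    """Go down from the main core while the next shell is strictly larger."""
    # Size of each k-shell: number of nodes whose core number is exactly k.
    shell_sizes = Counter(core.values())
    k = max(shell_sizes)
    while k > 1 and shell_sizes[k - 1] > shell_sizes[k]:
        k -= 1
    return k


level = inflexion_level(core)
keywords = sorted(node for node, c in core.items() if c >= level)
print(level, keywords)
```

In this example the 2-shell (5 nodes) is larger than the 3-shell (4 nodes), so the walk continues to k = 2; the 1-shell (2 nodes) is smaller, so it stops there and the 2-core is selected.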
Example using the 'density' method:
from gowpy.summarization.unsupervised import KcoreKeywordExtractor

extractor_kw = KcoreKeywordExtractor(
    directed=False,
    weighted=True,
    window_size=4,
    # Parameter to set the selection method
    selection_method='density',
)
The CoreRank method extracts keywords from a text document at the node-level of a graph-of-words representation. Each node/token in the graph-of-words is associated with a score, namely the sum of the core numbers of its neighbors. Then each node is ranked in decreasing order of score.
This extractor supports both the selection of an automatically adaptive number of keywords and the selection of a given number or proportion of keywords.
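The scoring rule described above can be sketched in a few lines, independently of the library. The toy graph, its precomputed core numbers, and the helper name `corerank_scores` are hypothetical; ties are broken alphabetically here, which is an arbitrary choice.

```python
# Toy graph: a, b, c, d form a clique; e hangs on a and b; f hangs on e.
adjacency = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c", "d", "e"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
    "e": ["a", "b", "f"],
    "f": ["e"],
}
# Core numbers of the toy graph, assumed precomputed by a k-core decomposition.
core = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 2, "f": 1}


def corerank_scores(adjacency, core):
    """Score each node by the sum of the core numbers of its neighbors."""
    return {node: sum(core[neighbor] for neighbor in neighbors)
            for node, neighbors in adjacency.items()}


scores = corerank_scores(adjacency, core)
# Rank by decreasing score, then alphabetically to make ties deterministic.
ranking = sorted(scores, key=lambda node: (-scores[node], node))
top = ranking[:3]
print(top)  # ['a', 'b', 'c']
```

Unlike the k-core methods, the ranking is per node, which is what makes it possible to ask for a fixed number or proportion of keywords.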
Example usage:
from gowpy.summarization.unsupervised import CoreRankKeywordExtractor
extractor_kw_cr = CoreRankKeywordExtractor(directed=False, weighted=True, window_size=4)
preprocessed_text = "..." # preprocessed text in which to extract keywords
extractor_kw_cr.extract(preprocessed_text, n=5)
- Implementation of the generalized core algorithm for weighted graphs (i.e. the k-core algorithm for weighted graphs), from "Generalized Cores" by V. Batagelj and M. Zaveršnik (2002)
- The GoWMiner can now be used to incrementally load results of more than one subgraph mining process.
- Fix of a bug in the computation of the sparse matrix in the GoWVectorizer vectorizer.
- Addition of example notebooks
- Update of the documentation
- Public release of the first version of the library