Fangzhou Xie (fangzhou[dot]xie[at]nyu[dot]edu)
The wigpy package computes time-series indices from texts using the Wasserstein Index Generation (WIG) model and the pruned Wasserstein Index Generation (pWIG) model. The former is described in "Wasserstein Index Generation Model: Automatic generation of time-series index with application to Economic Policy Uncertainty" (published version and arXiv); the latter is described in "Pruned Wasserstein Index Generation Model and wigpy Package" (arXiv), an extension of the original WIG model that handles large-vocabulary instances.
Please cite one of the following papers if you include this package in your work. Use this one if you use the original WIG model:
@article{xie_wig_2020,
  title = {Wasserstein {Index} {Generation} {Model}: {Automatic} generation of time-series index with application to {Economic} {Policy} {Uncertainty}},
  volume = {186},
  issn = {0165-1765},
  shorttitle = {Wasserstein {Index} {Generation} {Model}},
  url = {http://www.sciencedirect.com/science/article/pii/S0165176519304410},
  doi = {10.1016/j.econlet.2019.108874},
  language = {en},
  urldate = {2019-12-10},
  journal = {Economics Letters},
  author = {Xie, Fangzhou},
  month = jan,
  year = {2020},
  keywords = {Economic Policy Uncertainty Index (EPU), Singular Value Decomposition (SVD), Wasserstein Dictionary Learning (WDL), Wasserstein Index Generation Model (WIG)},
  pages = {108874},
}
Or use this one if you choose the pruned WIG variant:
@misc{xie_prunedwig_2020,
  title = {Pruned Wasserstein Index Generation Model and wigpy Package},
  author = {Fangzhou Xie},
  year = {2020},
  eprint = {2004.00999},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}
- python 3.7.6
- pytorch 1.3.1
- scikit-learn 0.22.1
- numpy 1.18.1
- pandas 0.25.3
- spacy 2.2.3
- gensim 3.8.1
Note: This package was developed under Ubuntu 18.04.3 LTS and has not been tested on macOS or Windows. macOS will most likely work, but I doubt Windows will. I only listed the package versions that I am using; other (earlier) versions may work as well.
To install pytorch and spacy with GPU (CUDA) support, please consult their official websites for installation instructions. The other packages are available through both the conda and pip channels.
It is recommended to install Anaconda or Miniconda to get the whole suite of Python and its related scientific libraries. If that is your setup, you can install the packages with:
conda install numpy pandas gensim scikit-learn
Otherwise, you can install them with pip:
pip install numpy pandas gensim scikit-learn
You can install wigpy itself with pip:
pip install wigpy
The main model class is WIG; simply pass your time-stamped sentences to it.
from wigpy import WIG
sentences = [('2018-01-01', 'This is the first sentence.'),
             ('2020-02-14', 'I have another sentence.')]
wig = WIG(sentences)
Note that the input sentences here is a list of (date, sentence) pairs (time-associated texts are required for the WIG model to generate time-series indices). Each time stamp must be in the “%Y-%m-%d” format, so make sure to convert your dates to this format before calling the WIG class.
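If your dates are Python date or datetime objects, a small conversion like the following (a usage sketch, not part of wigpy) produces the expected strings:
import datetime

# convert datetime.date objects to '%Y-%m-%d' strings before building the pairs
raw = [(datetime.date(2018, 1, 1), 'This is the first sentence.'),
       (datetime.date(2020, 2, 14), 'I have another sentence.')]
sentences = [(d.strftime('%Y-%m-%d'), text) for d, text in raw]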
You may well have multiple sentences for one time label. In other words, it is fine to pass a whole document as the “date-doc” pair, like:
sentences = [('2018-01-01', 'I have one sentence here. And another.'),
             ('2020-02-14', 'There is something else.')]
Under the hood, parsing is carried out by spacy, which prefers the GPU for acceleration; if a GPU (CUDA) is not available, it falls back to the CPU. In the data processing pipeline, the full document is parsed into sentences, and you can set remove_stop and/or remove_punct to remove stop words and/or punctuation marks from the sentences.
What is more, you can also pass documents that share the same time stamp. Example:
sentences = [('2020-02-14', 'A sentence for this date.'),
             ('2020-02-14', 'Another sentence for today.')]
This also works, so you don’t have to merge documents beforehand yourself. In many applications, texts are observed several times a day; you can pack them all in one list, pass it to WIG, and it will take care of everything else.
You could try to play with the provided test dataset by calling the
readdata
function.
from wigpy import readdata
This example dataset is the one I used for both WIG papers; it contains news headlines collected from The New York Times from 1980 to 2018. The function readdata returns a list of tuples, where each tuple is a (date, headline) pair. That is exactly what we need to feed into the WIG model.
from wigpy import WIG, readdata

data = readdata()
wig = WIG(data)
# train and evaluate the model, return evaluation loss
# cross-validation is done here automatically
loss = wig.train()
# save the generated index
wig.generateindex(output_file='index.tsv')
If you accept all parameters at their default values, then you are done! The output index.tsv is a tab-separated file under the subfolder results. Because this package is a new implementation of the WIG model, the computed results differ slightly from those in the original WIG paper, but the default values are mainly the ones chosen by cross-validation in the paper.
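To inspect the generated index, you can read it back with pandas; the path results/index.tsv below is an assumption based on the default output location described above:
import pandas as pd

# read the tab-separated index produced by generateindex()
index = pd.read_csv('results/index.tsv', sep='\t')
print(index.head())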
Parameters:
======
dataset : list, of (date, doc) pairs
train_test_ratio: list, of floats sum to 1, how to split dataset
emsize : int, dimension of word embedding (Word2Vec)
batch_size : int, size of a batch
num_topics : int, K topics to cluster
reg : float, entropic regularization term in Sinkhorn
epochs : int, epochs to train
lr : float, learning rate for optimizer
wdecay : float, L-2 regularization term used by some optimizers
log_interval : int, print log once every k steps
seed : int, pseudo-random seed for pytorch
prune_topk : int, max no of tokens to use for pruning vocabulary
l1_reg : float, L1 penalty for pruning
n_clusters : int, KMeans clusters to group tokens
opt : str, which optimizer to use, default to 'adam'
ckpt_path : str, checkpoint when training model
numItermax : int, max steps to run Sinkhorn, default 1000
dtype : torch.dtype, default torch.float32
spacy_model : str, spacy language model name
Default: nlp = spacy.load('en_core_web_sm', disable=["tagger"])
metric : str, 'sqeuclidean' or 'euclidean'
merge_entity : bool, merge entity detected by spacy model, default True
remove_stop : bool, whether to remove stop words, default False
remove_punct : bool, whether to remove punctuation, default True
interval : str, index frequency: 'M' (month), 'Y' (year), or 'D' (day)
visualize_every : int, not implemented
loss_per_batch : bool, if print loss per batch
**kwargs : additional keyword arguments passed to gensim's Word2Vec
The hyperparameters should be self-evident. Any parameter for gensim.models.Word2Vec (Word2Vec) may also be passed into the WIG model. Just remember that passing emsize is equivalent to passing size for the embedding dimension, as the latter is the name defined by the Word2Vec model.
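For example, a customized run might look like the following sketch; the parameter values are illustrative only, and window is an extra keyword argument forwarded to Word2Vec:
from wigpy import WIG, readdata

data = readdata()
wig = WIG(data,
          emsize=100,      # embedding dimension (same role as Word2Vec's size)
          num_topics=4,    # number of topics to cluster
          batch_size=32,
          reg=0.1,         # entropic regularization for Sinkhorn
          epochs=10,
          lr=1e-3,
          opt='adam',
          interval='M',    # monthly index
          window=5)        # forwarded to gensim.models.Word2Vec
loss = wig.train()
wig.generateindex(output_file='custom_index.tsv')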
For optimizers, the default is Adam, as used by the original WIG algorithm. There are several others to choose from:
# excerpt from the training code (optim refers to torch.optim)
if self.opt == 'adam':
    optimizer = optim.Adam([self.R, self.A], lr=self.lr,
                           weight_decay=self.wdecay)
elif self.opt == 'adagrad':
    optimizer = optim.Adagrad([self.R, self.A], lr=self.lr,
                              weight_decay=self.wdecay)
elif self.opt == 'adadelta':
    optimizer = optim.Adadelta([self.R, self.A], lr=self.lr,
                               weight_decay=self.wdecay)
elif self.opt == 'rmsprop':
    optimizer = optim.RMSprop([self.R, self.A], lr=self.lr,
                              weight_decay=self.wdecay)
elif self.opt == 'asgd':
    optimizer = optim.ASGD([self.R, self.A], lr=self.lr,
                           weight_decay=self.wdecay, t0=0, lambd=0.)
else:
    print('Optimizer not supported. Defaulting to vanilla SGD...')
    optimizer = optim.SGD([self.R, self.A], lr=self.lr)
Some of the optimizers accept a weight_decay parameter, which applies L2 regularization during the optimization step.
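For instance, to switch to Adagrad with a small L2 penalty (the values here are illustrative):
# choose a different optimizer and add L2 regularization via weight decay
wig = WIG(data, opt='adagrad', wdecay=1e-5)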
Preprocessing is performed by spacy, and there are 4 parameters to control its behavior. spacy_model is the language model used by spacy; when working with other languages, or if you want to use the large version of the English model, be sure to pass its name via this argument. merge_entity is a Boolean value: spacy will recognize entities and merge each of them into a single token, e.g. ‘quantum mechanics’. The default value is True, and you can turn it off with False. remove_stop and remove_punct are Boolean values that control whether to remove stop words and punctuation, respectively. The default behavior is to remove punctuation but not stop words.
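For example, the sketch below uses the large English model (assuming it has been installed with python -m spacy download en_core_web_lg) and removes both stop words and punctuation:
# preprocessing options passed to the WIG constructor (illustrative choices)
wig = WIG(data,
          spacy_model='en_core_web_lg',  # larger English model
          merge_entity=True,             # merge named entities into single tokens
          remove_stop=True,              # drop stop words
          remove_punct=True)             # drop punctuation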
To use the original algorithm, pass prune_topk=0 to the WIG model; otherwise, choose any integer larger than 0 as the dimension of the vocabulary, e.g. prune_topk=1000 to keep at most 1000 words. Those words, in this case 1000 of them, are called “base tokens”, and all other words will be approximated by these “base tokens”.
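For example (the values here are illustrative):
# original WIG: no vocabulary pruning
wig_full = WIG(data, prune_topk=0)

# pruned WIG: at most 1000 "base tokens" drawn from 20 KMeans clusters
wig_pruned = WIG(data, prune_topk=1000, n_clusters=20, l1_reg=0.01)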
The original WIG algorithm relies largely on Wasserstein Dictionary Learning (here): WIG uses WDL to cluster documents and then uses SVD to produce a one-dimensional time-series index. I implemented the code in Python, with pytorch computing the gradients; readers interested in the details should refer to the WIG paper.
However, this model leverages the Wasserstein distance, which is notoriously expensive to compute. Even with Sinkhorn regularization, the Optimal-Transport-based calculations are still slow. Further, the WIG model requires a full $V \times V$ word-distance matrix, where $V$ is the dimension of the vocabulary. Obviously, when the dataset and vocabulary are large, memory becomes an issue, especially if we still want to use the GPU for acceleration (VRAM is typically smaller than RAM). Thus, I propose this modified version, pruned WIG, which shrinks the vocabulary to a smaller dimension.
First, we need to identify the “base tokens”. The idea is to choose a subset of the vocabulary, of length $K$, to represent the whole vocabulary, of length $V$, where $K \ll V$. Now the question becomes: how should we choose the “base tokens” wisely? I first run unsupervised $k$-means clustering on the word vectors (given by Word2Vec) to form $n$ “word-clusters”, and then pick the most frequent tokens within each cluster.
Formally speaking, we set up the following minimization problem:

$$\min_{S} \sum_{i=1}^{n} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $S = \{S_1, \dots, S_n\}$ partitions the word vectors and $\mu_i$ is the mean of the vectors in $S_i$, and we could find $n$ clusters of words.
The number of tokens picked from each cluster is given by $K/n$, where $K$ is the length of the “base tokens” and $n$ is the number of clusters. In this way, we are picking tokens from the clusters equally.
Now that we have the “base tokens” at hand, the next step is to represent the whole vocabulary by these “base tokens”. This step is conducted by using LASSO regression. Denote the matrix whose columns are the word vectors of the base tokens by $B$ and the word vector of any other token $j$ by $v_j$; we have

$$\min_{w_j} \; \lVert v_j - B w_j \rVert_2^2 + \lambda \lVert w_j \rVert_1,$$

and for each token $j$, we obtain a weight vector $w_j$ of length $K$ to represent that token in the $K$-dimensional space. The original WIG model calculates a $V$-dimensional word-frequency vector for the transportation computation; now we use the $w_j$ to represent each token instead. Finally, we can compute the $K \times K$ pair-wise word-distance matrix for the Sinkhorn computation.
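For intuition, here is a minimal sketch of this pruning idea using scikit-learn (KMeans plus Lasso). It is not the package's internal implementation; the function prune_vocabulary, its signature, and the default values are all hypothetical:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def prune_vocabulary(vectors, counts, prune_topk=1000, n_clusters=10, l1_reg=0.01):
    """vectors: (V, d) word vectors; counts: (V,) token frequencies."""
    # 1) cluster the word vectors
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)
    # 2) keep the most frequent tokens of each cluster as "base tokens"
    per_cluster = prune_topk // n_clusters
    base_idx = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        base_idx.extend(members[np.argsort(-counts[members])][:per_cluster])
    base_idx = np.array(sorted(base_idx))
    B = vectors[base_idx]                      # (K, d) base-token vectors
    base_set = set(base_idx.tolist())
    # 3) LASSO: approximate v_j by B^T w_j with an L1 penalty,
    #    giving a K-dimensional weight vector per token
    weights = np.zeros((vectors.shape[0], base_idx.shape[0]))
    for j in range(vectors.shape[0]):
        if j in base_set:
            weights[j, np.searchsorted(base_idx, j)] = 1.0  # base token represents itself
            continue
        lasso = Lasso(alpha=l1_reg, fit_intercept=False)
        lasso.fit(B.T, vectors[j])             # d samples, K features
        weights[j] = lasso.coef_
    return base_idx, weights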
This package is released under the MIT license.