Machine Learning Models
BIDMach supports supervised and unsupervised models. It includes a general regression class with linear, logistic and SVM subclasses, and a factor model class with LDA, NMF and SFA (often implemented with ALS, or Alternating Least Squares). There are also random forest and k-means models. We have started a group of causal estimators in a separate package. Most models can be combined with mixins (e.g. L1 and L2 regularization). There are typically several options for updaters, although some models use multiplicative updates and require matching updaters.
All models work with either sparse or dense input matrices (on CPU or GPU). For regression models which have target matrices (indicating category membership) as well as data matrices, the target matrices can also be either sparse or dense.
Regression models include Generalized Linear Model (GLM) and Factorization Machine. Linear Regression, Logistic Regression and Support Vector Machines are implemented under GLM with different link functions and loss gradients. The Factorization Machine models interactions between features.
Factor models include Latent Dirichlet Allocation (LDA), Alternating Least Squares (ALS) and Non-negative Matrix Factorization (NMF). A factor model takes a sparse matrix (m-by-n) as input and decomposes it into two low-dimensional matrices (m-by-k and k-by-n), where k is usually much smaller than m and n. Dimension reduction is typically used to improve prediction, but the factors can also be used to capture topic or cluster structure in the dataset.
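For example, a topic model factors an m-by-n word-document count matrix <math>X</math> into an m-by-k word-topic matrix <math>W</math> and a k-by-n topic-document matrix <math>H</math>, so that <math>X \approx WH</math>.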
Usage of the models is described below. Import the following package before using the models:
import BIDMach.models._
The Generalized Linear Model (GLM) is a generalization of linear regression. Linear regression, logistic regression and SVM are implemented under GLM with different link functions and likelihood functions. We will be adding other standard GLM models soon. You should be able to deploy your own GLM models with your own link object that extends BIDMach.models.GLM.GLMLink.
The following code creates a learner and the associated options for GLM.
val (nn, opts) = GLM.learner(a, c, i)
Input a is an m-by-n data matrix. It can be either dense or sparse. m is the number of features and n is the number of examples. c is a k-by-n target matrix, where k is the number of categories. c(i,j)=1 indicates that example j belongs to category i. For details on how to create matrices, please refer to Data Wrangling or Creating Test Matrices. The choice of GLM model is set by the third argument i. Currently:
0 = linear regression
1 = logistic regression
2 = logistic regression predictor with likelihood (not log likelihood) optimization
3 = Support Vector Machine
The learner builds multiple binary predictors, one per category, i.e. an OAA or One-Against-All multiclass classifier. To see all the parameters of the learner:
opts.what
To train the model:
nn.train
GLM produces a weight vector for each of the k categories. To get the results of the learner:
nn.modelmat: the feature-weight matrix
To save the model on disk, please see Saving your results. If you have some data to predict, you can construct a predictor for it like this:
val (nn, nopts) = GLM.predictor(model, atest, ctest, i)
Here model is the model produced in the training step, atest is the test data, and again i is the choice of GLM model. ctest is a matrix that will hold the predictions. (You can go here to see how to create some test matrices.) You can compute predictions with
nn.predict
For linear regression, please refer to the usage of GLM above, setting the third argument i=0.
For logistic regression, please refer to the usage of GLM above, setting the third argument i=1 or 2. i=1 produces standard logistic regression, which maximizes the total log likelihood across all input instances. A logistic model outputs a probability <math>p</math> for each test instance, and the probability of correctness is <math>p</math> if the label is 1, and <math>1-p</math> if it is zero. Standard logistic regression optimizes the sum of the logs of these probabilities. A weakness of this formulation is that it attributes a lot of weight to incorrectly labeled instances, trying to avoid large negative log likelihoods. In effect it tries hard not to make bad labels worse.
Alternatively, setting i=2 optimizes the average likelihood across instances. This directly optimizes the accuracy of the prediction (i.e. the probability of a correct prediction). It tries to make good labels better. This option often leads to more accurate models (since it is directly optimizing accuracy). It seems to be a little trickier to tune, but you should find the effort worthwhile if you care about absolute accuracy. Whereas standard logistic regression weights instances according to their prediction error (so bad instances have higher weight), this version weights instances "on the fence" more heavily. Thus it resembles an SVM, and is optimizing accuracy on borderline instances.
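As an illustrative comparison (assuming a data matrix a and a 0/1 target matrix c as in the GLM learner call above), the two variants differ only in the link argument:
> val (nn1, opts1) = GLM.learner(a, c, 1)  // standard logistic regression (log likelihood)
> nn1.train
> val (nn2, opts2) = GLM.learner(a, c, 2)  // direct likelihood (accuracy) optimization
> nn2.train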
Support vector machine learning is very similar. Start by defining a learner with data and target matrices:
> val (nn, nopts) = GLM.SVMlearner(datamat, labelmat)
then train it with
> nn.train
The algorithm uses the Pegasos SGD (Stochastic Gradient) algorithm [1]. This algorithm minimizes the SVM objective function using gradient steps of the form:
<math> \nabla = \lambda w - \mathbb{1}[y_i \langle w, x_i \rangle < 1] \: y_i x_i </math>
where <math>w</math> is the model weight vector, <math>x_i</math> is the ith instance, <math>y_i</math> is the label of the ith instance, and <math>\mathbb{1}</math> is an indicator function which is 1 when its argument is true and 0 otherwise. <math>\lambda</math> is the regularization parameter of the SVM (see [1]). The regularization is implemented using BIDMach's L2Regularizer mixin. <math>\lambda</math> is equivalent to the L2 regularizer weight, which can be set with:
> nopts.reg2weight
The section below on Tuning explains how to set the other important parameters.
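For intuition only, here is a minimal sketch of one such gradient step written in plain Scala. It is not BIDMach's internal code: pegasosStep, w, x, y, lambda and lrate are hypothetical names, and labels are assumed to be in {-1,+1} as in the Pegasos paper [1].
def pegasosStep(w: Array[Float], x: Array[Float], y: Float,
                lambda: Float, lrate: Float): Array[Float] = {
  // signed margin y * <w, x> for this single instance
  val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
  // indicator [ y * <w, x> < 1 ]: only margin violators contribute a loss gradient
  val active = if (margin < 1f) 1f else 0f
  w.indices.map { i =>
    val grad = lambda * w(i) - active * y * x(i)  // lambda*w - 1[.]*y*x
    w(i) - lrate * grad                           // gradient descent step
  }.toArray
}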
[1] Shalev-Shwartz, S., Singer, Y., Srebro, N., "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM", ICML 2007.
The GLM models are homogeneous, and do not include a constant term. To model a constant term, you need to introduce a constant feature (usually fixed at 1) into the input data source. Make sure you do this after any normalization of the input features.
The standard datasources include this as an option (Options.addConstFeat). When this option is set, a constant feature is added as the last feature on each instance (so the number of features increments by 1).
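For example (a hypothetical session, reusing the data matrix a and targets c from the GLM learner call above):
> val (nn, opts) = GLM.learner(a, c, 1)
> opts.addConstFeat = true
> nn.train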
Regression models share the following options:
var targets:FMat = null
var targmap:FMat = null
var rmask:FMat = null
and in addition, the GLM model has an option:
var links:IMat = null
which specifies the link number (0 for linear, 1 for logistic, etc.) for each target. The links argument must be set. The others are optional. They are not needed if the input includes two datasources (as in GLM.learner(dat,cats,1)), since the targets are given explicitly.
A GLM learner will also typically inherit options from the Learner object (e.g. the number of passes over the dataset), from regularizer mixins (L1 and possibly L2), and from an updater such as ADAGRAD (e.g. the learning rate). Those options are described in detail in the corresponding sections of the Docs. Regularization weights, learning rate and number of passes over the data are all settable in the opts object.
Back to the Regression-specific fields: targets is used to define targets for datasets from which they have not been extracted as separate datasources. For instance, from a set of log files for social media data, you might define targets such as emoticons (:-) which exist as features in the dataset. targets is a k x n matrix where k is the number of targets and n is the number of features. targets will be a zero-one-valued matrix with ones in position (i,j) where j is a feature which defines target i.
targmap supports multimodel inference over the same set of targets. It will typically be a stack of identity matrices replicated p times, where the learner trains p models per target with different parameters. See BIDMach/Experiments.scala for examples of its usage. For simple cases, you can leave it null.
rmask forces certain elements of the prediction matrix to zero. It is a zero-one-valued matrix. This is essential when you use targets to define targets from input features. At a minimum, the target input features need to be forced to zero in rmask.
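As an illustrative sketch (not code from the BIDMach sources): suppose there are nfeats input features and feature 7 is the emoticon feature we want to predict. Assuming BIDMat's zeros/ones/iones constructors and element assignment, and an opts object obtained from a GLM learner constructor, the option matrices could be set up roughly like this:
> opts.targets = zeros(1, nfeats)
> opts.targets(0, 7) = 1f                // target 0 is defined by feature 7
> opts.rmask = ones(1, nfeats)
> opts.rmask(0, 7) = 0f                  // force the target feature itself to zero in predictions
> opts.links = iones(1, 1)               // logistic link (1) for the single target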
You can see a full list of a GLM learner's options with the what method, like this:
> val (mm,opts) = GLM.learner(dat,cats,1)
> opts.what
Option Name     Type     Value
-----------     ----     -----
addConstFeat    boolean  false
batchSize       int      10000
dim             int      256
doubleScore     boolean  false
epsilon         float    1.0E-5
evalStep        int      11
featType        int      1
initsumsq       float    1.0E-5
links           IMat     1,1,1,1,1,1,1,1,1,1,...
lrate           FMat     1
mask            FMat     null
npasses         int      1
nzPerColumn     int      0
pstep           float    0.01
putBack         int      -1
r1nmats         int      1
reg1weight      FMat     1.0000e-07
resFile         String   null
rmask           FMat     null
sample          float    1.0
sizeMargin      float    3.0
startBlock      int      8000
targets         FMat     null
targmap         FMat     null
texp            FMat     0.50000
useGPU          boolean  true
vexp            FMat     0.50000
waitsteps       int      2
We won't go over all the options here; they are discussed in the sections on DataSource, Learner, Mixins and Updaters. There are many, but typically tuning a model will focus on a few: lrate, which is the learning rate; the regularizer weights (reg1weight and reg2weight); and the updater parameters texp and vexp, which are discussed in their respective sections.
Many of the parameters above have FMat type but a single value. That means all models share the same parameter value. But you can tailor the parameters for each target (and use multiple models per target using targmap) by passing in a k-element FMat for any such parameter, where k is the total number of targets. Thus you can tune parameters in parallel with a single (or a few) passes over a large dataset.
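For example, to give each of three targets its own learning rate (an illustrative sketch; the values are arbitrary):
> val lr = zeros(3, 1)
> lr(0) = 0.03f; lr(1) = 0.1f; lr(2) = 0.3f
> opts.lrate = lr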
Factorization machines are described in [2]. They use dimension reduction to model interactions in prediction tasks, and have been shown to work well on collaborative filtering and click prediction. An output <math>\hat{y}</math> is modeled as
<math>\hat{y} = w_0 \: + \: \sum\limits_{i=1}^n \; w_i x_i \: + \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \; \langle \mathbf{\text{v}}_i, \mathbf{\text{v}}_j \rangle \; x_i x_j</math>
where each <math>\mathbf{\text{v}}_i</math> is a k-dimensional vector with k < n. <math>\hat{y}</math> can be thought of as an approximation to a second-order function <math>y(x)</math> using a dimension-reduced interaction matrix. The second summation in the formula above is effectively symmetric and, except for the missing diagonal terms, positive definite. It's been noted that Factorization Machines can approximate any second-order function of the features, if k is large enough, by varying the diagonal weights in the first sum. But this is not necessarily an efficient approximation. In our implementation we generalize the above formula to this one:
<math> \hat{y} = w_0 \: + \sum\limits_{i=1}^n \; w_i x_i \: + \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \; \langle \mathbf{\text{v}}_i, \mathbf{\text{v}}_j \rangle \; x_i x_j \; - \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \langle \mathbf{\text{u}}_i, \mathbf{\text{u}}_j \rangle \; x_i x_j</math>
which adds a second sum of negative definite terms. We assume the <math>\mathbf{\text{v}}_i</math> now have dimension <math>k_1</math> and the <math>\mathbf{\text{u}}_i</math> have dimension <math>k_2</math>. This generally gives better predictions for a given total number <math>k=k_1 + k_2</math> of dimensions than assuming all positive eigenvalues as in the original formulation. We found that <math>k_1 > k_2</math> by a factor of 4 or more works best. Either this is a property of the full second-order matrix, or the positive subspace is more important for high-value predictions.
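To make the formula concrete, here is a small illustrative computation of <math>\hat{y}</math> in plain Scala (not BIDMach's implementation; fmPredict, v and u are hypothetical names, with v(i) of length <math>k_1</math> and u(i) of length <math>k_2</math>):
def fmPredict(x: Array[Float], w0: Float, w: Array[Float],
              v: Array[Array[Float]], u: Array[Array[Float]]): Float = {
  val n = x.length
  // linear part: w0 + sum_i w_i x_i
  var y = w0 + (0 until n).map(i => w(i) * x(i)).sum
  // pairwise part: (<v_i, v_j> - <u_i, u_j>) x_i x_j over i < j
  for (i <- 0 until n; j <- i + 1 until n) {
    val pos = v(i).zip(v(j)).map { case (a, b) => a * b }.sum
    val neg = u(i).zip(u(j)).map { case (a, b) => a * b }.sum
    y += (pos - neg) * x(i) * x(j)
  }
  y
}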
Factorization machines are integrated with BIDMach's GLM framework, so you can create factorization machines for either linear, logistic or SVM link functions. To create a Factorization machine model, do:
> val (mm, opts) = FM.learner(datamat, labels, d)
where d is the link number (0 for linear, 1 or 2 for logistic, 3 for SVM). The options for setting the positive and negative subspace dimensions are:
> opts.dim1 = 224
> opts.dim2 = 32
Otherwise, the options for running and tuning GLM models apply.
Note that factorization machines currently only support single-target predictions. This will probably change soon.
[2] Rendle, S. "Factorization machines." In Proc. 10th IEEE Int. Conf. on Data Mining, 2010.
We currently have two models for LDA: BIDMach.models.LDA uses the Variational Bayes method for model inference and parameter estimation, and BIDMach.models.LDAgibbs uses Gibbs sampling. In addition, the Variational Bayes model supports both batch and minibatch updates, so effectively there are 3 LDA algorithms available. The usage of the algorithms is the same, with a few exceptions in setting the model options. The following code creates a learner and the associated options for LDA.
val (nn, opts) = LDA.learner(a, k)
or
val (nn, opts) = LDA.learnBatch(a, k)
or
val (nn, opts) = LDAgibbs.learner(a, k)
The required input a is an m-by-n sparse matrix (word-doc counts). m is the number of features and n is the number of examples. For a topic model, m is the number of words and n is the number of documents. The input k is optional and is the dimension of the model.
To see all the parameters of the learner:
opts.what
To train the model:
nn.train
To get the results of the learner:
nn.modelmats(0): the topic-word matrix, one topic per row.
nn.datamats(1): the factorized topic-document matrix, one document per column.
You will probably want to wrap a call to FMat(...) around each of these expressions in order to get the result as a float matrix.
Hint: Make sure that opts.putBack is set to 1 if you want to store the topic-document matrix in nn.datamats(1).
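For example:
> nn.train
> val topicWord = FMat(nn.modelmats(0))  // topic-word matrix as a dense float matrix
> val topicDoc = FMat(nn.datamats(1))    // topic-document matrix (requires opts.putBack = 1)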
You can go to Saving your results to see how to save your results.
Here are the key parameters for LDA and LDAgibbs. Default values are noted in parentheses.
dim(256): Model dimension
uiter(5): Number of iterations on one block of data
alpha(0.001f): Dirichlet document-topic prior
beta(0.0001f): Dirichlet word-topic prior
Other key parameters inherited from the learner, datasource and updater:
batchSize: the number of samples processed in a minibatch
power(0.3f): the exponent of the moving average of online learning
npasses(2): number of complete passes over the dataset
Specialized parameters for LDA
exppsi(true): Apply exp(psi(X)) if true, otherwise just use X
Specialized parameters for LDAgibbs
m(100): number of samples (per word) taken for each pass through the data
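For example, a typical minibatch Variational Bayes run might look like this (the parameter values are illustrative, not recommendations):
> val (nn, opts) = LDA.learner(a, 256)
> opts.uiter = 8
> opts.alpha = 0.01f
> opts.batchSize = 20000
> opts.npasses = 4
> nn.train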
Sparse Factor Analysis is a popular model for collaborative filtering. It is often referred to as ALS (Alternating Least Squares), which is an algorithm rather than a model, and ALS is actually a very slow way to implement SFA. We instead use a gradient method with a CG solver (which should become preconditioned CG soon). Construction of SFA models is similar to LDA. For example, you can create an SFA learner with
val (nn, opts) = SFA.learner(a, k)
The results are also in the same format as in LDA.
The usage of the NMF model is the same as LDA. For example, you can create an NMF learner with
val (nn, opts) = NMF.learner(a, k)
The main options are:
dim(256): Model dimension
uiter(5): Number of iterations on one block of data
uprior(0.01f): Prior on the user (datasource) factor
mprior(0.0001f): Prior on the model
NMFeps(1e-9): A safety floor constant
Other key parameters inherited from the learner, datasource and updater:
batchSize: the number of samples processed in a minibatch
power(0.3f): the exponent of the moving average of online learning
npasses(2): number of complete passes over the dataset
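For example (illustrative values only):
> val (nn, opts) = NMF.learner(a, 256)
> opts.uiter = 8
> opts.uprior = 0.1f
> opts.mprior = 0.001f
> nn.train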
The goal of Independent Component Analysis (ICA) is to find a linear representation of non-Gaussian data where the underlying components are as independent as possible. Let S be an n x k matrix of source signals, where each column is an n-dimensional vector giving the values of the n sources at a certain time index. We assume the "data mixing" process can be represented by an n x n, invertible mixing matrix A, so that we observe the mixtures X = AS. Thus, the input to the model should be an n x k matrix X, and the ICA model should recover W = A^{-1} so that the original sources can be recovered by left-multiplying the input, i.e., WX = WAS = S.
Rather surprisingly, it turns out that so long as at most one of the sources follows a Gaussian distribution, it is possible to recover the sources subject to ambiguity among permutations and scale, which for many datasets is not an issue.
BIDMach includes an implementation of FastICA, which is the current state-of-the-art algorithm for ICA. To use ICA, we need an input matrix X. (See below for important facts regarding zero-mean and whitened data.) If X is an n x k matrix, then its dimension is n (so set opts.dim = n). This is important, since the dimension is determined by the data, and unlike, for instance, LDA, it does not make sense to have a fixed default.
Here is some sample usage with a single matrix as input. In reality, large-scale data may require a FilesDS.
class xopts extends Learner.Options with MatDS.Opts with ICA.Opts with ADAGrad.Opts
val opts = new xopts
opts.dim = n                               // The number of rows of x
val output = loadFMat("ica_output.txt")    // A file containing our output
val nn = new Learner(new MatDS(Array(output), opts), new ICA(opts), null, new ADAGrad(opts), opts)
nn.train
Our current model returns the following:
modelmats(0)  // Our estimated W = A^{-1}
modelmats(1)  // Our calculated data mean vector
Thus, our predicted source can be computed by
modelmats(0) * (output - modelmats(1))
Even if output (note: despite the name, this is the data, i.e. the input to the ICA model; it is called output because it is the output of the mixing process, (mixing matrix) * (source matrix) = output matrix) is already zero-mean, we are fine, because modelmats(1) should simply become the zero vector.
Right now, our version does not support efficient whitening. It is best if the data is already pre-whitened somehow. If not, one can use eigendecompositions, but this is an expensive operation.
We are currently extending ICA to include a faster and more reliable whitening procedure.
For an excellent survey of ICA, read the following reference:
Aapo Hyvärinen and Erkki Oja, "Independent Component Analysis: A Tutorial", Neural Networks, Vol. 13, No. 4-5 (2000), pp. 411-430.
Coming Soon
The usage of KMeans is similar to LDA. For example, you can create a KMeans learner with
val (nn, opts) = KMeans.learner(a, k)
where k is the number of clusters. If you have n data points in d dimensions, then the data matrix a should be a d-by-n matrix.
KMeans is currently a simple implementation that runs for a fixed number of iterations selected by opts.npasses.
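For example (illustrative values only):
> val (nn, opts) = KMeans.learner(a, 256)
> opts.npasses = 10
> nn.train
> val centroids = FMat(nn.modelmat)      // the learned cluster centers as a dense matrix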
Result matrices are almost always stored as GMat (useGPU=true) or FMat (useGPU=false), and have the general (Mat) type. It's easier to work with them as FMats (and this facilitates saving). The result of a training session is stored in the Learner's modelmat or modelmats field (for models with several components). You can extract and convert them to FMat like this:
> val (nn,opts) = LDA.learner(a,256)
> nn.train
> ...
> val m = FMat(nn.modelmat)
After that, you can use saveAs("filename.mat",m,"model") to save your results in a Matlab-compatible HDF5-format mat file, or use saveFMat("filename.fmat.lz4",m) to save in BIDMat's own format using lz4 compression. You will find the latter to be significantly faster. We strongly recommend you use a similar file naming convention (".mat" suffix for Matlab-compatible files, and "name.mattype.compresstype" for BIDMat-format files). HDF5 can store multiple arrays in each file, and requires a string name for each (in this case, we named m as "model" in the HDF5 file). BIDMat requires matrices to each be saved in their own file. There is type metadata in the file, but it is easier to track matrix types and contents using filenames. (See here for more details.)
You can create random dense matrices with either:
> val a=rand(n,m)   // Creating a random FMat
> val a=grand(n,m)  // Creating a random GMat
By convention, BIDMach's datasources assume that the row index is the feature index and the column index is the instance index. You may want to use the GMat version if you're in a hurry. It's about 30x faster on our test machine.
To create an array of random binary labels with probability p, just generate a uniform (0,1) array as above, and then do:
> val labels = a < p
which will create an array with the same type as a (assumed to be FMat or GMat) containing 0s and 1s. If you need integer labels apply IMat or GIMat to the output, e.g.
> val labels = IMat(a < p)
To generate random integers in some range 0..k-1, you can do
> val randints = IMat(floor(k * 0.99999f * rand(n,m)))
where the 0.99999f is to guard against the rare event that rand returns 1f.
You can create random sparse matrices with the sprand function. It takes two dimension arguments and then a density argument.
val a=sprand(n,m,0.2)  // Creating a random SMat with 20% non-zero values
This returns a sparse matrix with approximately 20% non-zeros, where the non-zero values are distributed uniformly in the range (0,1).
Since power-law matrices are so common in data analysis, there are convenience functions for generating random power-law matrices. The simplest is powrand, which takes row and column dimension arguments and then a density argument which indicates the expected number of items per column:
> val a = powrand(10000,1000,100)
> mean(sum(a))
99.5
which could represent a set of documents over a dictionary of 10000 terms, with 1000 documents of average length 100.
This function generates data by columns. For each column, it samples from a power-law distribution with exponent -1 to determine the number of elements in that column. Using this number, it repeatedly samples from a second power-law distribution with exponent -1 to obtain the final coefficient samples.
powrand uses the general form of sprand, which is:
sprand(nrows, ncols, rowGenFunction, colGenFunction)
This routine samples from the second function (colGenFunction) to get the number of elements in each column and uses this as a count to repeatedly sample from the row generator. For basic power-law data, both functions are specializations of
simplePowerLaw(range:Int)
which produces power-law distributed integers with exponent -1 in the range 0,...,range-1.
For more general power-law data, you can specialize the Pareto random generator whose interface is:
paretoGen(low:Int, high:Int, alpha:Double)
where low and high give the range of output values, and alpha is the shape parameter.
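For example, a hypothetical call (the parameter values here are purely illustrative) might look like:
> val gen = paretoGen(1, 10000, -1.1)
> val a = sprand(10000, 1000, gen, gen)  // power-law row values and column counts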