
Data Wrangling



ASCII Numeric Data

Dense Data

The goal of data wrangling is to get data from outside BIDMach into a numeric form that it can process. In the simplest case, you may have some tabular numerical data in a text file that looks like this:

1.2 4.3 5.4 0.2 0.1
0.2 7.2 8.0 1.5 3.2
...

Assuming this is a regular table (same number of elements per row), you can read it straight into BIDMach with

val a = loadFMat("<filename>.txt")

It's important that the filename ends in ".txt": that's the clue to the loader that this is a text file. The file should contain numerical elements only (no letters or other symbols). The elements can be separated by spaces, commas or tabs (or a combination).

You can also read integer data using "loadIMat", assuming the filename ends with ".txt". For double precision data the function is "loadDMat".
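
As a quick sanity check after loading, a minimal sketch (the file names here are hypothetical):

val a = loadFMat("table.txt")     // single-precision FMat from a space/comma/tab-separated table
val i = loadIMat("counts.txt")    // integer IMat
val d = loadDMat("table.txt")     // double-precision DMat
a.nrows                           // confirm the table has the expected number of rows
a.ncols                           // and columns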

Sparse Matrices

Sparse matrices come in a variety of forms, which will often require some custom code. But one common format is row/column/value triples, which will look like this:

1 0 0.75
1 1 0.33
2 1 1.57
5 1 0.02
...

The data are of mixed types: two integers and one float or double per line. But the double reader can handle all of them, so you can load this with "loadDMat". Note that loadFMat will sometimes work, but the float type has only 24 bits of integer precision, so it won't represent indices larger than about 16 million exactly. After loading the triples with loadDMat, you can convert to a sparse matrix using "cols2sparse" like this:

val a = loadDMat("<filename>")
val s = cols2sparse(a, [fliprc], [issorted], [ibase])

where the arguments in [] are optional. [fliprc] is a Boolean specifying whether the indices are (row, column) or flipped (column, row). [issorted] is a Boolean specifying whether the data are already sorted in column-major order, i.e. lexicographically by (column, row). [ibase] is an integer in {0,1} specifying whether the input indices are zero-based (the default) or one-based (zero-based indices start at zero, one-based at one). For the example above (which must be zero-based), the data can be converted as

val s = cols2sparse(a)

since the parameters default to fliprc=false, issorted=true, ibase=0.
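
If, say, the triples were instead stored as (column, row, value), unsorted, with one-based indices, the optional arguments come into play. A sketch using the positional arguments described above (the file name is hypothetical):

val a = loadDMat("triples.txt")
val s = cols2sparse(a, true, false, 1)   // fliprc=true, issorted=false, ibase=1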

Checking Sparse Matrices

The complexities of generating large data often lead to "glitches": indices that are apparently sorted may contain out-of-order elements, or indices may be negative or otherwise out of range. You can check the integrity of a sparse matrix with

s.check

which checks index order and range. It will throw an exception if any problem is found, so that it can be used in the middle of an application program.

Table Parser (Tparse)

BIDMach includes a parser for common tabular data formats (tab- or comma-separated). The data should be in ASCII format, divided into rows, and with the same number and types of elements in each row. Empty fields are fine, but there should be exactly k-1 separators on each line of a k-column table. The data format is specified in a schema file. The schema file for the sparse data file from the last section looks like this:

int rows
int cols
float values

To parse this file, we use the "tparse" binary which will be in BIDMach/bin. tparse is invoked like this:

tparse -i <infile> -f <format_file> -o <output_prefix>

where <output_prefix> is a partial file name for the collection of output files. The actual file names are formed by appending each field name and field type to the prefix. e.g. if the prefix is "c:/data/nytimes." with the above schema, the output files will be:

c:/data/nytimes.rows.imat
c:/data/nytimes.cols.imat
c:/data/nytimes.values.fmat

Adding a "-c" option causes the output files to be compressed using gzip, generating files:

c:/data/nytimes.rows.imat.gz
c:/data/nytimes.cols.imat.gz
c:/data/nytimes.values.fmat.gz

Both the compressed and uncompressed files are in BIDMat's binary format and can be loaded with "loadIMat" and "loadFMat". These loads are much faster (at least an order of magnitude) than loading the ASCII data.
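
The three outputs can then be reassembled into a sparse matrix inside BIDMach. A rough sketch, assuming a Matlab-style sparse(rows, cols, values) constructor is available (the exact name and signature may differ):

val rr = loadIMat("c:/data/nytimes.rows.imat")     // row indices
val cc = loadIMat("c:/data/nytimes.cols.imat")     // column indices
val vv = loadFMat("c:/data/nytimes.values.fmat")   // values
val s  = sparse(rr, cc, vv)                        // assumed Matlab-like triples-to-sparse constructor
s.check                                            // verify index order and range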

Word Fields

There are two field types for string data. The first is "word" and the second is "string". The word field type produces a single numeric token for each field, using a dictionary; e.g. the field "the white house" is encoded as one token whose dictionary entry is the literal string "the white house". The word type produces two output matrices: an IMat containing the indices of the tokens, and an SBMat (sparse byte matrix) which is the dictionary. The IMat has the same number of rows (and one column) as the input table.

SBMats are a fast way to encode a lot of string data. An SBMat does not store explicit row indices: its elements are implicitly packed into the lowest row indices, i.e. the j-th element in a column is in row j. Each column encodes a single string token, and the column number is the index of the token.

Let's look at an example. Here is the schema:

word sometext

and here is a datafile:

hot
hot 
hot or
not

Running tparse on this data/format file produces two matrices:

data.sometext.imat
data.sometext.sbmat

and if we load them:

val a = loadIMat("data.sometext.imat")
val b = loadSBMat("data.sometext.sbmat")

we see

> a.t
0 0 1 2
> b
hot,hot or,not
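
To map these indices back to strings, one option is to convert the dictionary to a CSMat (introduced in the String Fields section below) and index it with the IMat; since the word indices are zero-based, no padding entry is needed. A hedged sketch, assuming a CSMat can be indexed directly by an IMat:

val cs = CSMat(b)     // the dictionary as a string matrix: hot, "hot or", not
cs(a)                 // one string per input row: hot, hot, "hot or", not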

String Fields

The string field type allows a field from a table to be split into individual tokens using sub-separator characters. These sub-separators must be distinct from the field separator(s), and are specified in the format file on a separate line. So to parse the above data file as string data, we create a schema file that looks like this:

string text
<space>
where the space is literally a space character (you may need to use a low-level editor to make sure this whitespace isn't removed). The field separator can be altered with the "-d" option to tparse; it defaults to a tab character. In this example there is only one field, so the main separator isn't needed.

Running tparse on the new schema and the old input file produces two output files:

data.text.smat
data.text.sbmat

where the data file is now an smat. The columns of this smat correspond to rows of the input file, and each entry in a column is a token from that input row; if you like, this file is a transpose of the input. The value of each entry is the dictionary position of that token, as a one-based index, so that the zero value can represent the absence of a token and keep the matrix sparse. Thus if we do:

> val a=loadSMat("data.text.smat")
> val b=loadSBMat("data.text.sbmat")
> full(a)
1 1 3
0 2 0
> b
hot,or,not

from which we can see that column 0 contains the token "hot", column 1 contains the tokens "hot" and "or", and column 2 contains "not". It follows that the number of rows in a is the maximum number of tokens in any input row.

If we want to reconstruct the input, we can't directly use the SBMat, which can only represent a linear array of strings (since its rows encode characters within strings). But there is a related type called CSMat (C-string matrix) which can hold 2D arrays of string values. If we construct a CSMat from b, prefaced by an empty value "", the indices of a will index it directly, with zeros indexing a null string. In other words

> val cs = csrow("") \ CSMat(b)
,hot,or,not
> cs(full(a)).t
hot
hot    or
not

Why is string data stored in this apparently-transposed format? It goes back to the difference between databases and scientific computing. The former typically stores instances as rows, while the latter has mostly used compressed sparse column (CSC) format (or at least Matlab and many numerical libraries have). With CSC format, one can easily access an instance (a single column), or a range of instances, if the instances are stored as columns. But there is no way to access a row or range of rows without scanning the entire array. For the same reason, SBMat stores the letters of a token as a column, which allows fast lookup.

Flex Tokenizer

Many data sources produce semi-structured data such as HTML, XML or JSON. It's possible to use parsers for each of these, but this is often challenging because of:

  • The need for accurate schemas specifying the types of objects.
  • The difficulty of recovering from format errors in badly-formed inputs.
  • The need to know and extract the data fields of interest at parse time.
Most of the datasets we have worked with use "shallow" formats, e.g. Wikipedia, TREC, Twitter etc., and it turns out to be very burdensome to use schemas with these sources. They also contain a lot of non-compliant data, and it's easy to spend most of your time trying to recover from format errors.

Instead we have come to rely on separate tokenization functions in C++ which produce binary files, followed by "extraction" from the binary data in BIDMach. This workflow is quite effective even for very large datasets: e.g. we manage about 10TB of Twitter data on a single machine this way. The tokenizers use Flex, the Fast Lexical Analyzer, to break the data into tokens, build a dictionary from them, and create an output file containing binary indices for tokens into the dictionary. Each tokenizer produces three output files, such as:

twitter.imat
twitter.dict.sbmat
twitter.dict.imat

The first of these contains indices into the dictionary and interpreted entities. The second two form the dictionary, with the strings in the SBMat and the counts of tokens in the "dict" IMat. The tokenizers include an individual Flex source file for that particular data format (such as xmltweet.flex), and common driver code in newparse.cpp and utils.cpp. As well as splitting the input into tokens which reference dictionary entries, the tokenizer recognizes and decodes common entities such as integers, floating-point numbers and dates. Each non-entity token appears in the output as a positive 31-bit integer, which is its index into the dictionary. Entity tokens are truncated to 31 bits and stored with the sign bit set (i.e. as negative integers). Actually the representation is up to the user: you could store non-entity and entity tokens with the full 32 bits if you are confident of being able to differentiate them from context, or you could add a special integer (one which could not be a dictionary index) before each non-entity token.

The sample tokenizers (Twitter, Wikipedia and TREC) use the sign-bit approach. Integer and date entities lose their leading bit, while floating-point numbers lose their trailing bit. This is a compromise which works well with a pretty wide variety of input data, but obviously has limitations. A tokenizer can be run like this:

> xmlwiki -i data.txt -o mydata.

and for a data file like:

<text> the first 3 digits of pi are 3.14 and the first 6 are 3.14159</text>

If we load

> val a = loadIMat("mydata.imat").t
1,2,3,-2147483645,4,5,6,7,-1608221983,8,2,3,-2147483642,7,-1608218648,9
> val d = loadSBMat("mydata.dict.sbmat")
<text>,the,first,digits,of,pi,are,and,</text>

we see that all the positive indices are one-based indices into the dictionary, and that the negative entities directly hold values that can be extracted with some bit-fiddling. The floating-point values have to be recovered from the integer fields by masking and shifting, which can be done in Java. To convert them to floats you need to "cast" the bits, which is possible but awkward in Java. The routine edu.berkeley.bid.UTILS.memcpy(len, ptr1, off1, ptr2, off2) copies and casts an Array[Int] ptr1 with a word offset of off1 into an Array[Float] ptr2 with a word offset of off2.
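
For individual values, the bit manipulation can also be done directly in Scala. A small sketch using the encoding of the sample tokenizers above (the sign bit marks an entity; integers and dates keep their low 31 bits, floats lose their trailing mantissa bit), with the values taken from the example output:

val vInt = -2147483645                           // the encoded integer entity "3" from the example
val i    = vInt & 0x7fffffff                     // mask off the sign flag -> 3
val vFlt = -1608221983                           // the encoded float entity "3.14" from the example
val bits = (vFlt & 0x7fffffff) << 1              // drop the sign flag and restore the bit position
val x    = java.lang.Float.intBitsToFloat(bits)  // ~3.14 (the trailing mantissa bit is lost)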

Variable Types: Continuous, Discrete, Categorical

Continuous Data

Continuous (real-valued) data usually require minimal pre-processing. It may be valuable to normalize the data, i.e. subtract the population mean for each feature and divide by its standard deviation. But regression, SVMs and random forests are insensitive to feature offset and (relatively) insensitive to scale. If the input data are sparse, it's important to make sure any value transformation preserves sparseness: offsets by the mean don't do this, but division by the standard deviation does.
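
As a rough sketch of what this looks like in BIDMat, assuming features are stored as rows and instances as columns, that mean(a, 2) reduces along the second dimension as in Matlab, and that operators broadcast a column vector across the columns of a matrix:

val m  = mean(a, 2)                      // per-feature mean (column vector)
val sd = sqrt(mean(a *@ a, 2) - m *@ m)  // per-feature standard deviation
val z  = (a - m) / sd                    // full normalization: fine for dense data, destroys sparsity
val zs = a / sd                          // scale-only version: zeros stay zero, sparsity is preserved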

Discrete Data

Discrete data often derive from counts (e.g. word instances in a document), and can benefit from a few transformations. e.g. one of the best transformations for count data used for SVM inference is a square root, which normalizes the relative error of the counts. Log transformations, e.g. TF-IDF, may also help.
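
A minimal sketch, assuming counts is a matrix of raw term counts and that sqrt applies element-wise (and, for a sparse matrix, only to the stored nonzeros, so zeros stay zero):

val damped = sqrt(counts)    // square-root damping of the raw counts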

Categorical Data

Categorical data include many types of unordered labels, e.g. document topic labels. Given a field that encodes a category, the tokenizers will produce an integer index for that field that points to a text representation of the category in a dictionary. This representation can't be used directly in most learning algorithms. Instead we need to transform the index value i into a one-hot vector [0, 0, ..., 1, ..., 0] with a 1 in the i-th position.
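
A rough sketch of this expansion, assuming zero-based category indices and a Matlab-style sparse(rows, cols, vals, nrows, ncols) constructor (the file name, icol range constructor and maxi scalar extraction here are illustrative assumptions, not a fixed API):

val cats   = loadIMat("data.category.imat")    // hypothetical tparse output: one category index per instance
val n      = cats.length                       // number of instances
val ncats  = maxi(cats)(0) + 1                 // number of distinct categories (assumed scalar extraction)
val onehot = sparse(cats, icol(0->n), ones(n,1), ncats, n)   // ncats x n, a single 1 per column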
