Indexing with Pignlproc and Hadoop
This page explains the steps required to generate an index in the format (URI,{(token_1, count_1),...,(token_N, count_N)}) from an XML dump of Wikipedia using Apache Pig and Hadoop. Some of this code is still in the testing phase, but each of the steps described here has been tested successfully using clients running Ubuntu 10.04 and 12.04, and a Hadoop cluster of five machines running Ubuntu 10.04, as well as in local and pseudo-distributed modes on a single machine. Please keep in mind that this code changes frequently, so expect some of these instructions to change as well.
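For reference, a single entry in the finished index might look like the following (the URI, tokens, and counts are purely illustrative):
http://en.wikipedia.org/wiki/Berlin  {(berlin,12),(city,7),(capital,3)}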
Requirements:
- Apache Pig version >= 0.8.1 (version 0.10.0 preferred)
- Hadoop >= 0.20.x (tested with 0.20.2) - if you only want to test the scripts in local mode, you won't need to install Hadoop, as Pig comes with a bundled Hadoop installation for local mode.
- A stopword list with one word per line, such as the one used by DBpedia-Spotlight (see the example below)
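The stoplist is just a plain-text file with one token per line; its first few lines might look like this (contents are illustrative):
a
an
and
the
of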
1. Clone this fork of pignlproc:
git clone git@github.com:chrishokamp/pignlproc /your/local/dir
2. From the top dir of pignlproc, build the project by running:
mvn assembly:assembly
This will compile the project and run the tests. If you want to skip the tests, run:
mvn assembly:assembly -DskipTests=true
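A successful build produces the jar used in the commands below; you can confirm it is there with:
ls target/pignlproc-0.1.0-SNAPSHOT.jar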
3. Set JAVA_HOME to the location of your JDK, e.g.:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
Add Apache Pig to your PATH, e.g.:
export PATH=/home/chris/programs/pig-0.10.0/bin:$PATH
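To confirm that Pig is picked up from your PATH, you can run:
pig -version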
4. You're now ready to test in local mode. There are eight parameters that you'll need to supply to Pig - these are required for both local and distributed modes:
$DIR - the directory where the output files should be stored
$STOPLIST_PATH - the path to the directory containing the stoplist in HDFS, or on your local filesystem
$STOPLIST_NAME - the filename of the stoplist
$INPUT - the location of the wikipedia XML dump (see below for more info)
$MIN_COUNT - the minimum count for a token to be included in the index (e.g. '2' means that only tokens with counts >= 2 will be included)
$PIGNLPROC_JAR - the location of the pignlproc jar
$LANG - the language of the Wikidump (only English (en) is currently supported)
$MAX_SPAN_LENGTH - the maximum length for a paragraph span
To test in local mode, run examples/nerd-stats/indexer_small_cluster.pig using parameters similar to the following:
From the top dir of pignlproc:
pig -x local \
-p DIR=/tmp/output \
-p STOPLIST_PATH=/home/chris/projects/pignlproc/resources/ \
-p STOPLIST_NAME=stopwords.en.list \
-p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
-p LANG=en \
-p INPUT=src/test/resources/enwiki-pages-articles-sample.xml \
-p MIN_COUNT=1 \
examples/nerd-stats/indexer_small_cluster.pig
Note: change $DIR if necessary to point to an output dir that works for you, and $STOPLIST_PATH to point to the directory containing your stoplist. When the script finishes, check $DIR to confirm the output.
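With the example parameters above, you can inspect the result like this (the exact subdirectory layout depends on the script):
ls -R /tmp/output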
5. If you want to run indexing on an actual Hadoop cluster, you'll first need to put your wikidump and stoplist into the Hadoop Distributed File System (HDFS):
hadoop fs -put /location/of/enwiki-latest-pages-articles.xml /location/on/hdfs/enwiki-latest-pages-articles.xml
hadoop fs -put /location/of/stopwords.en.list /location/on/hdfs/stopwords.en.list
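You can verify that both files landed in HDFS with, for example:
hadoop fs -ls /location/on/hdfs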
Note: Although Pig supports automatic loading of files with .bz2 compression, this feature is not currently implemented in the custom loader in pignlproc. Thus, the uncompressed version of the XML dump is currently required. This inconvenience will be resolved in the near future.
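If you downloaded the compressed dump, you can decompress it locally before uploading, for example with bzip2 (-k keeps the original archive; the filename shown is the standard dump name):
bzip2 -dk enwiki-latest-pages-articles.xml.bz2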
You should also define your output dir ($DIR) as a directory in HDFS. Note that all of the parameters containing paths must now point to locations in HDFS.
6. There are currently two possibilities for indexing:
1) Run indexing using a script written for smaller clusters which counts tokens inside a UDF: indexer_small_cluster.pig
2) Run indexing using a script written to take maximum advantage of parallelization, but requiring much more intermediate storage: indexer_lucene.pig
Testing has shown that option (1) is much more efficient than (2) on small/mid-sized clusters (up to 45 mappers and 15 reducers), but these scripts have not yet been tested on a very large cluster.
Make sure that $HADOOP_HOME is set to the location of your local Hadoop installation.
echo $HADOOP_HOME
/local/hadoop-0.20.2
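If it is not set, export it to point at your installation (the path here is just an example):
export HADOOP_HOME=/local/hadoop-0.20.2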
From your client machine you can now run the script using something like the following:
pig \
-p DIR=/user/hadoop/output \
-p STOPLIST_PATH=/user/hadoop/input \
-p STOPLIST_NAME=stopwords.en.list \
-p INPUT=/user/hadoop/wikidump/enwiki-latest-pages-articles.xml \
-p MIN_COUNT=1 \
-p PIGNLPROC_JAR=pignlproc-0.1.0-SNAPSHOT.jar \
-p LANG=en \
indexer_small_cluster.pig
Replace 'indexer_small_cluster.pig' with 'indexer_lucene.pig' to try the script that maximizes parallelization for large clusters.
Notes:
1. The speed of indexing obviously depends upon the size of your cluster. Constraints such as the size of hadoop.tmp.dir may also affect performance. This code has been tested on up to 1/3 of the full English Wikipedia with (relatively) good performance on a five-node cluster.
2. You only need the pignlproc JAR and the example scripts to use this code with Hadoop. If you built the project on a different machine, just copy examples/nerd-stats/indexer_small_cluster.pig, examples/nerd-stats/indexer_lucene.pig, and target/pignlproc-0.1.0-SNAPSHOT.jar to a client machine configured for your cluster (you'll still need Apache Pig on that machine).
3. Once indexing has finished, get the files back onto your local machine with:
hadoop fs -get /path/to/hadoop/output /path/to/local
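Since the output will typically be split across several part-* files, hadoop fs -getmerge is a convenient alternative that concatenates them into a single local file (the target filename here is just an example):
hadoop fs -getmerge /path/to/hadoop/output /path/to/local/index.tsv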