Supplemental information on ArchiveSpark for course CS4984/CS5984: Big Data Text Summarization, Fall 2018, Virginia Tech.
Things you will learn about: GitHub, Docker, Zeppelin, ArchiveSpark, Spark.
ArchiveSpark serves as the first (but not the only) component in your project pipeline, handling web archive data extraction. In this tutorial, you will learn to deploy a test environment for ArchiveSpark, test code locally, and execute code on the DLRL cluster. You will also find further information about Spark programming and NLP processing with Spark.
If you encounter any question or issue, please check the relevant documentation first. For further questions, you can create an issue on this GitHub page.
Contents:
- ArchiveSpark
- Docker: Your Local Test Environment
- Zeppelin is Your Playground
- Real Job on Cluster
- Best Practice
- Work with PySpark
- Spark and NLP
"An Apache Spark framework for easy data processing, extraction as well as derivation for archival collections." - helgeho ArchiveSpark Official GitHub page
In this class, we will utilize ArchiveSpark to process our web archive collections. We can leverage the power of ArchiveSpark in various ways: content extraction, word count, clustering (LDA), etc.
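For example, once page text has been extracted, a word count over the whole collection takes only a few lines of Spark. The snippet below is a minimal sketch, not part of the course code: the input path is hypothetical, and `sc` is the SparkContext that Zeppelin or spark-shell provides.

```scala
// Count word frequencies in already-extracted text (hypothetical path).
val text = sc.textFile("/share_dir/extracted_text.txt")
val counts = text
  .flatMap(_.toLowerCase.split("\\W+"))   // split lines into lowercase tokens
  .filter(_.nonEmpty)                      // drop empty tokens
  .map(word => (word, 1))
  .reduceByKey(_ + _)                      // sum counts per word
  .sortBy(_._2, ascending = false)         // most frequent first
counts.take(20).foreach(println)
```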
In the following sections, you will find information about local usage and testing with our Docker image, as well as instructions for running ArchiveSpark jobs on the DLRL cluster.
Spark SQL, DataFrames and Datasets Guide
We provide a Docker image that contains a full development environment with ArchiveSpark. Check the following links for detailed information about Docker.
Be aware that Docker runs inside a virtual machine on macOS and Windows; you can configure the resource allocation (CPU/memory) to speed your tasks up or down. On Linux systems, Docker works as a native application.
Install the Docker CE version in your local environment: Linux MacOS
- Install Docker Toolbox from here
- Disable the Hyper-V feature in your Windows system; tutorial here
- Open Docker Quickstart Terminal (it will run an automatic setup)
- In Docker Quickstart Terminal, start the container:
docker run -d -p 8082:8080 --rm -v ~/docker/cs5984/share_dir:/share_dir -v ~/docker/cs5984/logs:/logs -v ~/docker/cs5984/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name cs5984 nytfox/fall18_cs4984-cs5984
- Check the container IP address:
docker-machine ip default
- Open your browser at:
yourDockerIP:8082
Check Docker command line basics for various docker operations in the command line.
- Get an account for Docker
- Log in to Docker, either through the application or the command line
- Pull the container image from Docker Hub:
docker pull nytfox/fall18_cs4984-cs5984:latest
- Start the container:
docker run -d -p 8082:8080 --rm -v ~/docker/cs5984/share_dir:/share_dir -v ~/docker/cs5984/logs:/logs -v ~/docker/cs5984/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name cs5984 nytfox/fall18_cs4984-cs5984
- Access the Zeppelin website through the following URL in your browser (the service might take several minutes to boot up):
http://localhost:8082
Think of the Docker container as a Linux subsystem inside your current OS; you can access the subsystem through the following commands:
`docker ps`
`docker exec -it your_docker_id bash`
In the `docker run` command, the `-p` option maps the Zeppelin service port inside the subsystem (8080) to a port on your local system (8082). The `-v` options bind-mount three directories from your system (`~/docker/cs5984/share_dir`, `~/docker/cs5984/logs`, `~/docker/cs5984/notebook`) into the container as `/share_dir`, `/logs`, and `/notebook`. The `-e` options point Zeppelin at those mounted directories, so the notebook and the log are automatically saved on your host.
Please refer to the Docker documentation for any other details of the run command as needed.
- Important: All changes you make inside the Docker subsystem will not be saved unless you commit them. Refer to Docker commit and make sure to commit your changes if you make significant changes to the Docker environment.
- Your notebook (code) and the log will be saved in the bind-mounted directories.
- You can mess with your Docker container in whatever way you want: install applications, change files, etc. Refer to Linux commands.
- If you think you broke the container, stop it and restart it. Every time you restart the container, it will start from its initial state:
docker stop your_container_id
Zeppelin is a notebook environment (similar to Jupyter Notebook) built on top of Spark, where you can run and test your code entirely within Zeppelin. We have integrated ArchiveSpark into our Docker environment so that you can play around with it. The primary language is Scala. (Python is also available if needed.)
Refer to the Zeppelin official website for detailed documentation.
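As a quick sanity check before writing anything substantial, you can run a trivial Scala paragraph in Zeppelin to confirm the Spark interpreter is wired up. This is just an illustrative throwaway snippet; `sc` is provided by the Spark interpreter.

```scala
// A throwaway Zeppelin paragraph: confirm Spark works in this notebook.
println(s"Spark version: ${sc.version}")
val nums = sc.parallelize(1 to 100)
println(s"Sum of 1..100 = ${nums.sum()}")
```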
We have prepared a Zeppelin-based sample notebook; the notebook source file is available in this repository:
/sample_notebooks/ArchiveSpark_HtmlText_extraction.json
You can download the notebook and import it into Zeppelin through `import note`.
The ArchiveSpark GitHub page also provides some good Documentation and Recipes.
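For orientation, here is roughly what an HTML text extraction looks like in ArchiveSpark. This is a hedged sketch, not the course script: the `WarcCdxHdfsSpec` loader, field names, and paths follow the public ArchiveSpark recipes and may not match your collection format exactly; check the bundled `ArchiveSpark_HtmlText_extraction` example and the official recipes for the exact classes to use.

```scala
// Rough sketch of HTML text extraction with ArchiveSpark
// (paths and the WarcCdxHdfsSpec loader are assumptions; adapt to your collection).
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc.specs.WarcCdxHdfsSpec

val records = ArchiveSpark.load(WarcCdxHdfsSpec("/data/cdx/*.cdx", "/data/warc"))
val pages = records.filter(r => r.mime == "text/html" && r.status == 200) // successful HTML captures only
val enriched = pages.enrich(HtmlText)              // attach the extracted page text
enriched.saveAsJson("/share_dir/html_text.json.gz")
```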
Other than running code in Zeppelin, you can also run your code through `spark-shell` within Docker. (This is recommended before you run any code on the DLRL cluster.) We have prepared one example, `ArchiveSpark_HtmlText_extraction.scala`, in `share_dir`.
- Package (copy) your code into one Scala script: `ArchiveSpark_HtmlText_extraction.scala`
- Copy/move your script to `~/docker/cs5984/share_dir/`
- Access the Docker shell:
docker ps
docker exec -it your_docker_id bash
- Run spark-shell to execute your script:
/archive_spark/spark-2.2.1-bin-hadoop2.7/bin/spark-shell -i /share_dir/ArchiveSpark_HtmlText_extraction.scala --files /archive_spark/archivespark_dlrl/libs/en-sent.bin --jars /archive_spark/archivespark_dlrl/libs/archivespark-assembly-2.7.6.jar,/archive_spark/archivespark_dlrl/libs/archivespark-assembly-2.7.6-deps.jar,/archive_spark/archivespark_dlrl/libs/stanford-corenlp-3.5.1.jar,/archive_spark/archivespark_dlrl/libs/opennlp-tools-1.9.0.jar
The `-i` option points to the path of your script. The `--files` and `--jars` options load all the dependencies your script needs; you can add more dependencies as your code requires.
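One thing to keep in mind when packaging a notebook into a script: code run with `-i` executes inside the interactive shell, which already provides `sc` and `spark`, and the shell stays open after the script finishes unless you exit explicitly. A hypothetical skeleton (the path is a placeholder):

```scala
// Hypothetical skeleton of a script run via `spark-shell -i`:
// `sc` and `spark` already exist, so no SparkContext/SparkSession setup is needed.
import org.apache.spark.rdd.RDD

val input: RDD[String] = sc.textFile("/share_dir/sample_input.txt") // placeholder path
println(s"Read ${input.count()} lines")

// ... your extraction / processing logic goes here ...

sys.exit(0) // leave the interactive shell once the script is done
```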
After testing and validating your code, you can package it into one Scala script file and run it on the DLRL cluster with the following commands:
- Enable the Java 8 environment:
export JAVA_HOME=/usr/java/jdk1.8.0_171/
- Execute the Scala script:
spark2-shell -i /your/script.scala --files /home/public/cs4984_cs5984_f18/unlabeled/lib/en-sent.bin --jars /home/public/cs4984_cs5984_f18/unlabeled/lib/archivespark-assembly-2.7.6.jar,/home/public/cs4984_cs5984_f18/unlabeled/lib/archivespark-assembly-2.7.6-deps.jar,/home/public/cs4984_cs5984_f18/unlabeled/lib/stanford-corenlp-3.5.1.jar,/home/public/cs4984_cs5984_f18/unlabeled/lib/opennlp-tools-1.9.0.jar
Before you run code on the DLRL cluster, here is the recommended procedure for preparing it:
- If your dataset is small or the processing is not heavy: get the result from your local Zeppelin environment.
- If your dataset is big or the processing is heavy: sample your dataset first for fast testing (see the sketch after this list).
- Package your script and test it with spark-shell in Docker.
- Upload your script to the DLRL cluster and run it.
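For the sampling step mentioned above, Spark's built-in `sample` transformation is usually enough. A minimal sketch, with a hypothetical input path and an arbitrary 1% fraction:

```scala
// Down-sample a large RDD so test runs finish quickly (path and fraction are placeholders).
val records = sc.textFile("/share_dir/extracted_text.json")
val sampled = records.sample(withReplacement = false, fraction = 0.01, seed = 42L)
println(s"Testing on ${sampled.count()} of ${records.count()} records")
```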
If you want to work with Python on Spark (PySpark), see the sample code we provide in Zeppelin: SampleCode_PySpark
A cool thing: you can exchange variables between Spark and PySpark in Zeppelin.
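This exchange goes through the ZeppelinContext object `z`, which Zeppelin exposes in both the Scala and PySpark interpreters. A small sketch (the variable name and value are made up):

```scala
// In a %spark (Scala) paragraph: publish a value to the ZeppelinContext.
val topTerm = "hurricane"          // made-up example value
z.put("topTerm", topTerm)

// In a separate %pyspark paragraph you would read it back with, e.g.:
//   top_term = z.get("topTerm")
//   print(top_term)
```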
Spark provides packages for NLP-related tasks; check the following resources (a small LDA sketch follows the list):
- MLlib package for Spark with Scala
- PySpark MLlib package for Spark with Python
- SparkNLP package for Scala and Python
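As a small illustration of the MLlib route (the LDA clustering mentioned earlier), here is a hedged sketch using the spark.ml API available in the Spark 2.x bundled with this setup; the column names, vocabulary size, number of topics, and toy documents are all made up, and `spark` is the SparkSession provided by spark-shell/Zeppelin.

```scala
// Toy LDA topic clustering with Spark ML (illustrative only).
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.LDA

// Tiny made-up corpus; in practice this would be your extracted page text.
val docs = spark.createDataFrame(Seq(
  (0L, "the storm caused flooding along the coast"),
  (1L, "officials ordered residents to evacuate the area"),
  (2L, "rainfall totals broke records across the region")
)).toDF("id", "text")

val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val cvModel = new CountVectorizer()
  .setInputCol("words").setOutputCol("features")
  .setVocabSize(1000)
  .fit(words)
val vectors = cvModel.transform(words)

val lda = new LDA().setK(2).setMaxIter(10)   // 2 topics, just for the toy data
val model = lda.fit(vectors)
model.describeTopics(3).show(false)          // top terms per topic (as vocabulary indices)
```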