Supplemental information on ArchiveSpark for course CS4984/CS5984: Big Data Text Summarization, Fall 2018, Virginia Tech.
Things you will learn about: GitHub, Docker, Zeppelin, ArchiveSpark, Spark.
ArchiveSpark serves as the first (but not the only) component in your project pipeline, handling web archive data extraction. In this tutorial, you will learn to deploy a test environment for ArchiveSpark, test code locally, and execute code on the DLRL cluster. You will also find further information about Spark programming and NLP processing with Spark.
If you encounter any question or issue, please check the relevant documentation first. For further questions, you can create an issue on this GitHub page.
Contents:
- ArchiveSpark
- Docker: Your Local Test Environment
- Zeppelin is Your Playground
- Real Job on Cluster
- Best Practice
- Work with PySpark
- Spark and NLP
"An Apache Spark framework for easy data processing, extraction as well as derivation for archival collections." - helgeho ArchiveSpark Official GitHub page
In this class, we will utilize ArchiveSpark to process our web archive collections. We can leverage the power of ArchiveSpark in various ways: content extraction, word count, clustering (LDA), etc.
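For example, once page text has been extracted, a word count over the whole collection takes only a few lines of Spark. The snippet below is a minimal sketch, not part of the course code: the input path is hypothetical, and `sc` is the SparkContext that Zeppelin or spark-shell provides.

```scala
// Count word frequencies in already-extracted text (hypothetical path).
val text = sc.textFile("/share_dir/extracted_text.txt")
val counts = text
  .flatMap(_.toLowerCase.split("\\W+"))   // split lines into lowercase tokens
  .filter(_.nonEmpty)                      // drop empty tokens
  .map(word => (word, 1))
  .reduceByKey(_ + _)                      // sum counts per word
  .sortBy(_._2, ascending = false)         // most frequent first
counts.take(20).foreach(println)
```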
In the following sections, you will find information about local usage and testing with our Docker image, as well as instructions for running ArchiveSpark jobs on the DLRL cluster.
Spark SQL, DataFrames and Datasets Guide
We provide a Docker image that contains a full development environment with ArchiveSpark. Check the following links for detailed information about Docker.
Be aware that Docker runs inside a virtual machine on macOS and Windows; you can configure the resource allocation (CPU/memory) to speed your tasks up or down. On Linux systems, Docker works as a native application.
Install the Docker CE version in your local environment: Linux MacOS
- Install Docker Toolbox from here
- Disable the Hyper-V feature in your Windows system; tutorial here
- Open Docker Quickstart Terminal (it will run an automatic setup)
- In Docker Quickstart Terminal, start the container:
docker run -d -p 8082:8080 --rm -v ~/docker/cs5984/share_dir:/share_dir -v ~/docker/cs5984/logs:/logs -v ~/docker/cs5984/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name cs5984 nytfox/fall18_cs4984-cs5984
- Check the container IP address:
docker-machine ip default
- Open your browser at:
yourDockerIP:8082
Check Docker command line basics for various docker operations in the command line.
- Get an account for Docker
- Log in to Docker, either through the application or the command line
- Pull the container image from Docker Hub:
docker pull nytfox/fall18_cs4984-cs5984:latest
- Start the container:
docker run -d -p 8082:8080 --rm -v ~/docker/cs5984/share_dir:/share_dir -v ~/docker/cs5984/logs:/logs -v ~/docker/cs5984/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name cs5984 nytfox/fall18_cs4984-cs5984
- Access the Zeppelin website through the following URL in your browser (the service might take several minutes to boot up):
http://localhost:8082
Think of the Docker container as a Linux subsystem inside your current OS; you can access the subsystem through the following commands:
`docker ps`
`docker exec -it your_docker_id bash`
In the `docker run` command, the `-p` option maps the Zeppelin service port inside the subsystem (8080) to a port on your local system (8082). The `-v` options bind-mount three directories from your system (`~/docker/cs5984/share_dir`, `~/docker/cs5984/logs`, `~/docker/cs5984/notebook`) into the container as `/share_dir`, `/logs`, and `/notebook`. The `-e` options point Zeppelin at those mounted directories, so the notebook and the log are automatically saved on your host.
Please refer to the Docker documentation for any other details of the run command as needed.
- Important: All changes you make inside the Docker subsystem will not be saved unless you commit them. Refer to Docker commit and make sure to commit your changes if you make significant changes to the Docker environment.
- Your notebook (code) and the log will be saved in the bind-mounted directories.
- You can mess with your Docker container in whatever way you want: install applications, change files, etc. Refer to Linux commands.
- If you think you broke the container, stop it and restart it. Every time you restart the container, it will start from its initial state:
docker stop your_container_id
Zeppelin is a notebook environment (similar to Jupyter Notebook) built on top of Spark, where you can run and test your code entirely within Zeppelin. We have integrated ArchiveSpark into our Docker environment so that you can play around with it. The primary language is Scala. (Python is also available if needed.)
Refer to the Zeppelin official website for detailed documentation.
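As a quick sanity check before writing anything substantial, you can run a trivial Scala paragraph in Zeppelin to confirm the Spark interpreter is wired up. This is just an illustrative throwaway snippet; `sc` is provided by the Spark interpreter.

```scala
// A throwaway Zeppelin paragraph: confirm Spark works in this notebook.
println(s"Spark version: ${sc.version}")
val nums = sc.parallelize(1 to 100)
println(s"Sum of 1..100 = ${nums.sum()}")
```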
We have prepared a Zeppelin-based sample notebook; the notebook source file is available in this repository:
/sample_notebooks/ArchiveSpark_HtmlText_extraction.json
You can download the notebook and import it into Zeppelin through `import note`.
The ArchiveSpark GitHub page also provides some good Documentation and Recipes.
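For orientation, here is roughly what an HTML text extraction looks like in ArchiveSpark. This is a hedged sketch, not the course script: the `WarcCdxHdfsSpec` loader, field names, and paths follow the public ArchiveSpark recipes and may not match your collection format exactly; check the bundled `ArchiveSpark_HtmlText_extraction` example and the official recipes for the exact classes to use.

```scala
// Rough sketch of HTML text extraction with ArchiveSpark
// (paths and the WarcCdxHdfsSpec loader are assumptions; adapt to your collection).
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc.specs.WarcCdxHdfsSpec

val records = ArchiveSpark.load(WarcCdxHdfsSpec("/data/cdx/*.cdx", "/data/warc"))
val pages = records.filter(r => r.mime == "text/html" && r.status == 200) // successful HTML captures only
val enriched = pages.enrich(HtmlText)              // attach the extracted page text
enriched.saveAsJson("/share_dir/html_text.json.gz")
```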
Other than running code in Zeppelin, you can also run your code through `spark-shell` within Docker. (This is recommended before you run any code on the DLRL cluster.) We have prepared one example, `ArchiveSpark_HtmlText_extraction.scala`, in `share_dir`.
- Package (copy) your code into one Scala script: `ArchiveSpark_HtmlText_extraction.scala`
- Copy/move your script to `~/docker/cs5984/share_dir/`
- Access the Docker shell:
docker ps
docker exec -it your_docker_id bash
- Run spark-shell to execute your script:
/archive_spark/spark-2.2.1-bin-hadoop2.7/bin/spark-shell -i /share_dir/ArchiveSpark_HtmlText_extraction.scala --files /archive_spark/archivespark_dlrl/libs/en-sent.bin --jars /archive_spark/archivespark_dlrl/libs/archivespark-assembly-2.7.6.jar,/archive_spark/archivespark_dlrl/libs/archivespark-assembly-2.7.6-deps.jar,/archive_spark/archivespark_dlrl/libs/stanford-corenlp-3.5.1.jar,/archive_spark/archivespark_dlrl/libs/opennlp-tools-1.9.0.jar
The `-i` option points to the path of your script. The `--files` and `--jars` options load all the dependencies your script needs; you can add more dependencies as your code requires.
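One thing to keep in mind when packaging a notebook into a script: code run with `-i` executes inside the interactive shell, which already provides `sc` and `spark`, and the shell stays open after the script finishes unless you exit explicitly. A hypothetical skeleton (the path is a placeholder):

```scala
// Hypothetical skeleton of a script run via `spark-shell -i`:
// `sc` and `spark` already exist, so no SparkContext/SparkSession setup is needed.
import org.apache.spark.rdd.RDD

val input: RDD[String] = sc.textFile("/share_dir/sample_input.txt") // placeholder path
println(s"Read ${input.count()} lines")

// ... your extraction / processing logic goes here ...

sys.exit(0) // leave the interactive shell once the script is done
```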
After testing and validating your code, you can package it into one Scala script file and run it on the DLRL cluster with the following commands:
- Enable the Java 8 environment:
export JAVA_HOME=/usr/java/jdk1.8.0_171/
- Execute the Scala script:
spark2-shell -i /your/script.scala --files /home/public/cs4984_cs5984_f18/unlabeled/lib/en-sent.bin --jars /home/public/cs4984_cs5984_f18/unlabeled/lib/archivespark-assembly-2.7.6.jar,/home/public/cs4984_cs5984_f18/unlabeled/lib/archivespark-assembly-2.7.6-deps.jar,/home/public/cs4984_cs5984_f18/unlabeled/lib/stanford-corenlp-3.5.1.jar,/home/public/cs4984_cs5984_f18/unlabeled/lib/opennlp-tools-1.9.0.jar
Before you run code on the DLRL cluster, here is the recommended procedure for preparing it:
- If your dataset is small or the processing is not heavy: get the result from your local Zeppelin environment.
- If your dataset is big or the processing is heavy: sample your dataset first for fast testing (see the sketch after this list).
- Package your script and test it with spark-shell in Docker.
- Upload your script to the DLRL cluster and run it.
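For the sampling step mentioned above, Spark's built-in `sample` transformation is usually enough. A minimal sketch, with a hypothetical input path and an arbitrary 1% fraction:

```scala
// Down-sample a large RDD so test runs finish quickly (path and fraction are placeholders).
val records = sc.textFile("/share_dir/extracted_text.json")
val sampled = records.sample(withReplacement = false, fraction = 0.01, seed = 42L)
println(s"Testing on ${sampled.count()} of ${records.count()} records")
```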
If you want to work with Python on Spark (PySpark), see the sample code we provide in Zeppelin: SampleCode_PySpark
A cool thing: you can exchange variables between Spark and PySpark in Zeppelin.
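This exchange goes through the ZeppelinContext object `z`, which Zeppelin exposes in both the Scala and PySpark interpreters. A small sketch (the variable name and value are made up):

```scala
// In a %spark (Scala) paragraph: publish a value to the ZeppelinContext.
val topTerm = "hurricane"          // made-up example value
z.put("topTerm", topTerm)

// In a separate %pyspark paragraph you would read it back with, e.g.:
//   top_term = z.get("topTerm")
//   print(top_term)
```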
Spark provides packages for NLP-related tasks; check the following resources (a small LDA sketch follows the list):
- MLlib package for Spark with Scala
- PySpark MLlib package for Spark with Python
- SparkNLP package for Scala and Python
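As a small illustration of the MLlib route (the LDA clustering mentioned earlier), here is a hedged sketch using the spark.ml API available in the Spark 2.x bundled with this setup; the column names, vocabulary size, number of topics, and toy documents are all made up, and `spark` is the SparkSession provided by spark-shell/Zeppelin.

```scala
// Toy LDA topic clustering with Spark ML (illustrative only).
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.LDA

// Tiny made-up corpus; in practice this would be your extracted page text.
val docs = spark.createDataFrame(Seq(
  (0L, "the storm caused flooding along the coast"),
  (1L, "officials ordered residents to evacuate the area"),
  (2L, "rainfall totals broke records across the region")
)).toDF("id", "text")

val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val cvModel = new CountVectorizer()
  .setInputCol("words").setOutputCol("features")
  .setVocabSize(1000)
  .fit(words)
val vectors = cvModel.transform(words)

val lda = new LDA().setK(2).setMaxIter(10)   // 2 topics, just for the toy data
val model = lda.fit(vectors)
model.describeTopics(3).show(false)          // top terms per topic (as vocabulary indices)
```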