This repo contains code for J.P. Morgan Big Data Workshop by Cambridge Spark, cloned from here.
There are two parts: an Introduction to Apache Spark (Spark.ipynb), which uses the war-and-peace.txt file, and the HackerNews Analysis (HackerNews\ Challenge.ipynb), which uses HNStories.json scraped from the website.
The code was modified from the original repo. Use load_libraries.ipynb to check that all dependencies have been installed properly.
Run git clone or download this repo to get most of the files.
This repo uses git lfs to store HNStories.json because the file is larger than 100MB, so install git lfs in advance through here or from its website. If you only clone or download this repo, HNStories.json will contain only a pointer to the real file stored on the git lfs server.
After installing git lfs, navigate to the project root folder using either the command prompt on Windows or a terminal on *NIX, then run these commands:
$ git lfs install
$ git lfs pull
# this downloads the real data file (~400MB), which will take some time
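To confirm that the pull actually replaced the pointer with the real data, a quick check along the following lines may help. This is a minimal sketch, not part of the workshop code: the helper name is_lfs_pointer is hypothetical, and it assumes HNStories.json sits in the project root. It relies on the fact that Git LFS pointer files begin with a fixed version line.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts like a Git LFS pointer.

    Hypothetical helper for illustration; only the first few bytes are read,
    so it is cheap even on the full ~400MB data file.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


if __name__ == "__main__":
    data_file = Path("HNStories.json")  # assumed location: project root
    if not data_file.exists():
        print(f"{data_file} not found; run this from the project root")
    elif is_lfs_pointer(data_file):
        print(f"{data_file} is still an LFS pointer; run `git lfs pull`")
    else:
        size_mb = data_file.stat().st_size / 1e6
        print(f"{data_file} looks like real data ({size_mb:.0f} MB)")
```

If the script reports a pointer, rerun the two git lfs commands above before opening the HackerNews notebook.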
The easiest way is to run the whole workshop on a Databricks cluster, which can be accessed through a web browser. Sign up for the community edition and upload all notebooks as well as the data (war-and-peace.txt and HNStories.json).
To run the notebooks locally instead, you will need to install Anaconda for Python 3.6 together with Java and the pyspark library.
Install Anaconda (Python 3.6) from: https://www.anaconda.com/download/.
This includes Python 3.6 and the necessary libraries we will be using: numpy, matplotlib.
Install Java 8 or higher from https://java.com/en/download/help/index_installing.xml.
Only after completing the steps above, open a terminal (or the command prompt on Windows) and run the following command to install pyspark:
pip install pyspark
Open a terminal in your project directory and run the following command:
jupyter notebook load_libraries.ipynb
Execute the first cell to make sure you have all of the required libraries.
Alternatively, start Jupyter Notebook by following the instructions here and use it to open the file load_libraries.ipynb. Execute the first cell to make sure you have all of the required libraries.
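If the notebook cell fails and it is unclear which dependency is at fault, a standalone check like the one below can narrow it down. This is only a sketch under assumptions: the list of libraries is taken from this README rather than from load_libraries.ipynb itself, and find_missing is a hypothetical helper name.

```python
import importlib.util
import shutil

# Libraries named in this README; the notebook's actual list may differ.
REQUIRED_LIBRARIES = ["numpy", "matplotlib", "pyspark"]


def find_missing(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_LIBRARIES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All listed libraries were found.")

    # pyspark needs a Java runtime; this only checks that `java` is on PATH,
    # not which version it is.
    if shutil.which("java") is None:
        print("No `java` executable found on PATH.")
```

The check uses importlib.util.find_spec so that nothing is actually imported; it reports whether each package can be located, which is usually enough to spot a skipped installation step.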