This Ansible script deploys a server with a collection of Python big data and scientific computing tools and libraries, preconfigured to run on a local Spark cluster.
Included packages:
- Base Python
  - Anaconda Python Distribution 2.7
- Development tool
  - Jupyter
- Data visualization
  - Bokeh
  - Pygal
  - Seaborn
- Text and Sentiment
  - Gensim
  - vaderSentiment
- Machine learning
  - Gensim
  - Tensorflow
- Distributed processing
  - Spark 1.3.1
  - Luigi
  - mrjob
- Data access
  - Python HDFS
- Web scraper
  - Scrapy
  - Geojson
  - Geocoder
- Javascript visualization libraries
  - D3.js
  - DC.js
  - NVD3.js
  - Dimple.js
  - Crossfilter.js
Set up a server or VM with CentOS 7.
Ensure the FQDN is configured correctly. Spark requires the host system's hostname to be resolvable. The quickest fix is to add an entry in /etc/hosts that lists the hostname alongside the existing names: localhost.localdomain localhost pydatalab.server.local pydatalab
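As a quick sanity check before running the playbook, a small shell helper can confirm that a name is present in a hosts file. This is only a sketch: the function name `has_hosts_entry` is hypothetical and not part of this project, and it checks the file contents only, not actual DNS resolution.

```shell
# Hypothetical helper (not part of the playbook).
# has_hosts_entry FILE NAME: succeed if NAME appears as a hostname or alias in FILE.
has_hosts_entry() {
  awk -v name="$2" '
    { sub(/#.*/, "") }                                   # drop comments
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }  # field 1 is the address
    END { exit !found }
  ' "$1"
}

if has_hosts_entry /etc/hosts pydatalab.server.local; then
  echo "pydatalab.server.local is listed in /etc/hosts"
else
  echo "pydatalab.server.local is missing from /etc/hosts"
fi
```

On the server itself, `getent hosts pydatalab.server.local` gives a more direct answer, since it goes through the system resolver.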
Create an Ansible hosts inventory (assuming the server hostname is pydatalab.server.local):
[master]
pydatalab.server.local
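The inventory can also be created from the command line. The sketch below writes a file named `hosts` in the current directory (the name used by the `ansible-playbook` command in the next step) and, if Ansible is installed, runs an optional connectivity check with the `ping` module. It assumes SSH access to the host is already set up.

```shell
# Write a one-host inventory with a [master] group.
cat > hosts <<'EOF'
[master]
pydatalab.server.local
EOF

# Optional connectivity check; skipped when Ansible is not installed on this machine.
if command -v ansible >/dev/null 2>&1; then
  ansible -i hosts master -m ping || echo "ping failed: check SSH access to the host"
fi
```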
Run Ansible:
ansible-playbook -i hosts playbook.yml
Jupyter should be running at pydatalab.server.local:8888
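To confirm that Jupyter is reachable once the playbook finishes, something like the following polling loop may help. It is a sketch rather than part of the project: `wait_for_http` is a hypothetical helper name, it assumes `curl` is installed, and the retry count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll URL until it answers without an HTTP error.
# wait_for_http URL [TRIES] [DELAY_SECONDS]
wait_for_http() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP errors (>= 400) as failure, 5 second cap per attempt
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (uncomment to run against the deployed server):
# wait_for_http http://pydatalab.server.local:8888/ && echo "Jupyter is up"
```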
Default login for pydatalab user is
This Ansible script detects whether it is being installed on Hortonworks Data Platform (HDP). If so, it creates a Jupyter kernel with the environment variables required to use Spark on HDP.
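How the playbook performs this detection is not described here. One plausible approach, shown purely as an assumption, is to check for the directory where HDP normally installs its stack (`/usr/hdp`). The helper name `is_hdp_host` is hypothetical.

```shell
# Hypothetical helper: succeed when an HDP stack directory exists.
# $1 is the directory to probe; the default /usr/hdp is an assumption
# about the install layout, not something taken from the playbook.
is_hdp_host() {
  [ -d "${1:-/usr/hdp}" ]
}

if is_hdp_host; then
  echo "HDP detected"
else
  echo "HDP not detected"
fi
```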
This script has been tested on:
- CentOS 7.1
- Red Hat Enterprise Linux 7.1