•• Pipeline of different micro-services, all communicating through Kafka.
•• Languages and technologies used: Scala | Spark-Streaming | HDFS | Kafka | Zeppelin | Python
•• Developed features:
• Scrape the web and collect movie data (basic info, reviews, ratings, etc.).
• Sentiment analysis of the movie reviews.
• Data processing (save data to HDFS).
• Data visualization in a Zeppelin notebook (data read from HDFS).
Install Kafka and the Python packages TextBlob, tmdbsimple, beautifulsoup4 and requests (json is part of the Python standard library).
https://www.apache.org/dyn/closer.cgi?path=/kafka/0.11.0.0/kafka_2.11-0.11.0.0.tgz
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic movie_data
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic movie_popularity
bin/kafka-topics.sh --list --zookeeper localhost:2181
- This command should display the list of topics ('movie_data' and 'movie_popularity' in this case).
python collect-data.py
python analyse-data.py
cd hdfs_utils
sbt run
- Download Zeppelin
zeppelin-0.7.2-bin-all/bin/zeppelin.sh start
- Go to localhost:8080 in your web browser
Collects data about movies and publishes it to a Kafka topic called 'movie_data'.
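The publishing step described here can be sketched roughly as follows. This is a minimal sketch, not the code from 'collect-data.py': it assumes the kafka-python client (this README does not name the Kafka library), a broker on localhost:9092, and a hypothetical message layout with `title` and `reviews` fields. The helper names `encode_movie`, `publish_movies` and `make_producer` are illustrative only.

```python
import json

TOPIC = "movie_data"  # topic name taken from this README


def encode_movie(movie):
    """Serialize one movie record (a dict) to UTF-8 encoded JSON bytes."""
    return json.dumps(movie, sort_keys=True).encode("utf-8")


def publish_movies(producer, movies, topic=TOPIC):
    """Send each movie as one message and return the number of messages sent.

    `producer` is any object with send(topic, value) and flush(),
    for example kafka.KafkaProducer from the kafka-python package.
    """
    count = 0
    for movie in movies:
        producer.send(topic, encode_movie(movie))
        count += 1
    producer.flush()  # block until buffered messages have been handed to the broker
    return count


def make_producer(bootstrap_servers="localhost:9092"):
    """Build a real producer (assumption: kafka-python is installed)."""
    from kafka import KafkaProducer  # lazy import so the helpers above work without it
    return KafkaProducer(bootstrap_servers=bootstrap_servers)
```

With a broker running, `publish_movies(make_producer(), [{"title": "Example", "reviews": ["Great film"]}])` would publish one message. Passing the producer in as an argument keeps the serialization logic testable without a live Kafka instance.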
Used by 'collect-data.py' to collect the movie reviews.
Collects data with the IMDbPY API; because of a connection problem, it was replaced by the tmdbsimple API.
Uses a consumer to listen to the Kafka topic 'movie_data'. For each message (the data for a single movie), it analyses the reviews and assigns a score to each one. The results are then published to a new topic called 'movie_popularity'.
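The consume, score and republish loop described above might look roughly like this. It is a sketch under assumptions rather than the actual script: it assumes the kafka-python client, a broker on localhost:9092, JSON messages with hypothetical `title` and `reviews` fields, and TextBlob's polarity as the per-review score. The mean score used as `popularity` is an illustrative choice, not necessarily the original formula.

```python
import json

IN_TOPIC = "movie_data"         # topic names taken from this README
OUT_TOPIC = "movie_popularity"


def textblob_polarity(text):
    """Score one review with TextBlob; polarity lies in [-1.0, 1.0]."""
    from textblob import TextBlob  # lazy import: only needed for real scoring
    return TextBlob(text).sentiment.polarity


def analyse_movie(raw_message, scorer=textblob_polarity):
    """Decode one 'movie_data' message and attach a score to every review."""
    movie = json.loads(raw_message.decode("utf-8"))
    scores = [scorer(review) for review in movie.get("reviews", [])]
    return {
        "title": movie.get("title"),
        "scores": scores,
        # Mean score as a simple popularity figure (assumption, see lead-in)
        "popularity": sum(scores) / len(scores) if scores else None,
    }


def run(bootstrap_servers="localhost:9092"):
    """Consume, analyse and republish (assumption: kafka-python is installed)."""
    from kafka import KafkaConsumer, KafkaProducer
    consumer = KafkaConsumer(IN_TOPIC, bootstrap_servers=bootstrap_servers)
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for message in consumer:  # blocks, yielding one record per movie
        result = analyse_movie(message.value)
        producer.send(OUT_TOPIC, json.dumps(result).encode("utf-8"))
```

The `scorer` parameter lets a different sentiment function be swapped in, and makes `analyse_movie` usable without TextBlob or Kafka installed.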