bachtn/movie-analytics

movie-analytics

Overview

A pipeline of microservices that all communicate through Kafka.

Languages and technologies used: Scala | Spark Streaming | HDFS | Kafka | Zeppelin | Python

Developed features:

  • Scrape the web and collect movie data (basic info, reviews, ratings, etc.).
  • Sentiment analysis of the movie reviews.
  • Data processing (save data to HDFS).
  • Data visualization in a Zeppelin notebook (data read from HDFS).

Requirements:

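Install Kafka and the Python packages TextBlob, tmdbsimple, beautifulsoup4, and requests. The json module ships with the Python standard library and needs no separate install.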

How to use:

1 - Download Kafka

https://www.apache.org/dyn/closer.cgi?path=/kafka/0.11.0.0/kafka_2.11-0.11.0.0.tgz

2 - Start the ZooKeeper server

bin/zookeeper-server-start.sh config/zookeeper.properties

3 - Start the Kafka server

bin/kafka-server-start.sh config/server.properties

4 - Create the needed topics ('movie_data' and 'movie_popularity')

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic movie_data
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic movie_popularity

5 - Test that the topics were created

bin/kafka-topics.sh --list --zookeeper localhost:2181

  • This command should display the list of topics ('movie_data' and 'movie_popularity' in your case).
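The check in step 5 can also be done programmatically. The sketch below assumes the listing is plain text with one topic name per line. The helper `missing_topics` is hypothetical and not part of this repository.

```python
# Topic names taken from step 4 of this README.
REQUIRED_TOPICS = ("movie_data", "movie_popularity")


def missing_topics(listing, required=REQUIRED_TOPICS):
    """Return the required topics that do not appear in the listing text.

    `listing` is assumed to hold one topic name per line, as captured
    from the output of the list command in step 5.
    """
    present = {line.strip() for line in listing.splitlines() if line.strip()}
    return [topic for topic in required if topic not in present]


if __name__ == "__main__":
    sample = "movie_data\nmovie_popularity\n"
    print(missing_topics(sample))
```

An empty result means both topics exist; any names returned still need to be created with the commands from step 4.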

6 - Collect movie data:

python collect-data.py

7 - Analyse movie reviews:

python analyse-data.py

8 - Save data to HDFS

cd hdfs_utils
sbt run

9 - Launch the Zeppelin notebook

  • Download Zeppelin

zeppelin-0.7.2-bin-all/bin/zeppelin.sh start

  • Go to localhost:8080 in your web browser

Steps 6 and 7 can be run at the same time.
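One way to run steps 6 and 7 together is to start both scripts from a small launcher. This is a minimal sketch, assuming it is run from the directory that contains both scripts. The helper `run_concurrently` is hypothetical and not part of this repository; opening two terminals works just as well.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, wait for all of them, and
    return their exit codes in the same order as `commands`."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Script names come from steps 6 and 7 of this README.
    exit_codes = run_concurrently([
        [sys.executable, "collect-data.py"],
        [sys.executable, "analyse-data.py"],
    ])
    print(exit_codes)
```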

Files

collect-data.py :

Collects data about movies and publishes it to a Kafka topic called 'movie_data'.
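As a rough illustration of the publishing step, a movie record might be encoded as JSON bytes before it is sent to 'movie_data'. The field names (`title`, `rating`, `reviews`) and the helpers `encode_movie` and `decode_movie` are hypothetical; the actual message layout used by collect-data.py is not described in this README.

```python
import json

TOPIC = "movie_data"  # topic name from this README


def encode_movie(movie):
    """Serialize one movie record (a dict) to UTF-8 encoded JSON bytes."""
    return json.dumps(movie, ensure_ascii=False, sort_keys=True).encode("utf-8")


def decode_movie(payload):
    """Inverse of encode_movie: turn message bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical record layout, for illustration only.
    movie = {"title": "Example Movie", "rating": 7.5, "reviews": ["Great film."]}
    payload = encode_movie(movie)
    # Assumption: with a client such as kafka-python, the bytes could be
    # published roughly as producer.send(TOPIC, payload).
    print(payload)
```

Keeping the encoding in one function means the consumer side can rely on a single, matching decoder.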

reviewCollector.py :

Used by 'collect-data.py' to collect the movie reviews.

imdbpy.py : (no longer used in the project)

Collected data with the IMDbPY API. Because of a connection problem, it was replaced with the tmdbsimple API.

analyse-data.py :

Uses a consumer to listen to the Kafka topic 'movie_data'. For each message (the data for a single movie), it analyses the reviews and assigns a score to each one. The results are then published to a new topic called 'movie_popularity'.
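The per-message processing described above can be sketched as follows. This is not the repository's code: the message fields (`title`, `reviews`), the output fields, and the mean used as an aggregate are assumptions, and `toy_score` is a deliberately simple word-list scorer standing in for a real sentiment model so that the example has no third-party dependencies.

```python
import json

# Toy word lists standing in for a real sentiment model (assumption).
POSITIVE = {"good", "great", "excellent", "love", "enjoyable"}
NEGATIVE = {"bad", "boring", "terrible", "hate", "awful"}


def toy_score(text):
    """Score a review in [-1, 1]: (positive - negative) / matched words.

    Returns 0.0 when no word from either list appears.
    """
    words = [word.strip(".,!?;:").lower() for word in text.split()]
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    matched = positive + negative
    return 0.0 if matched == 0 else (positive - negative) / matched


def analyse_message(payload, score=toy_score):
    """Decode one 'movie_data' message, score each review, and return
    the bytes of a result message intended for 'movie_popularity'."""
    movie = json.loads(payload.decode("utf-8"))
    scores = [score(review) for review in movie.get("reviews", [])]
    result = {
        "title": movie.get("title"),
        "review_scores": scores,
        # Hypothetical aggregate: mean review score, None without reviews.
        "popularity": sum(scores) / len(scores) if scores else None,
    }
    return json.dumps(result).encode("utf-8")
```

Because the scorer is passed in as a parameter, a TextBlob-based function such as `lambda text: TextBlob(text).sentiment.polarity` could be supplied instead of `toy_score` without changing the rest of the function.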
