big_data_analysis_hadoop_stack

A full-fledged data analysis project using the Hadoop stack.

Steps performed in the project:

  1. Acquire the top 200,000 posts by view count.
  2. Using Pig or MapReduce, extract, transform, and load the data as applicable.
  3. Using Hive Query Language (HiveQL), compute:
     I. The top 10 posts by score
     II. The top 10 users by post score
     III. The number of distinct users who used the word “Hadoop” in one of their posts
  4. Using MapReduce, calculate the per-user TF-IDF and find the 10 most used words, excluding stop words.
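
The per-user TF-IDF in step 4 can be illustrated with a minimal, single-process Python sketch. It is not the project's MapReduce job; it only mirrors the same phases (count terms per user, count how many users used each term, then combine). It assumes each user's combined posts form one "document", that the top words are ranked by TF-IDF score, and that a tiny hard-coded stop-word list is enough for illustration. The names `tokenize`, `per_user_tf_idf`, `top_words`, and `STOP_WORDS` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def per_user_tf_idf(posts):
    """posts: iterable of (user_id, text) pairs. Returns {user: {word: score}}."""
    # Phase 1 (map + reduce by user): term counts per user.
    counts = defaultdict(Counter)
    for user, text in posts:
        counts[user].update(tokenize(text))

    # Phase 2 (reduce by word): number of users who used each word.
    doc_freq = Counter()
    for words in counts.values():
        doc_freq.update(words.keys())

    # Phase 3: tf = count / total words of the user, idf = ln(users / doc_freq).
    n_users = len(counts)
    scores = {}
    for user, words in counts.items():
        total = sum(words.values())
        scores[user] = {
            w: (c / total) * math.log(n_users / doc_freq[w])
            for w, c in words.items()
        }
    return scores


def top_words(scores, user, k=10):
    """Return the k highest-scoring words for a user (ties broken alphabetically)."""
    ranked = sorted(scores[user].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]
```

Calling `top_words(per_user_tf_idf(posts), user)` on a list of `(user_id, text)` pairs returns up to 10 words for that user. Words used by every user get a score of zero under this IDF definition, so they fall to the bottom of the ranking.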

Refer to "Documentation" for a step-by-step guide.