Project Summary

Task 1: Data Exploration

Used test_review.json to compute the following:
- Total number of reviews
- Number of reviews in 2018
- Number of distinct users who wrote reviews
- Top 10 users who wrote the largest numbers of reviews and the number of reviews they wrote
- Number of distinct businesses that have been reviewed
- Top 10 businesses that had the largest numbers of reviews and the number of reviews they had
Wrote the results to a JSON file.
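The six statistics above can be sketched with the PySpark RDD API roughly as follows. This is a minimal illustration, not the contents of task1.py. It assumes one JSON object per line with Yelp-style fields `user_id`, `business_id`, and `date`. The output keys and the tie-breaking rule (count descending, then ID ascending) are also hypothetical.

```python
import json
from operator import add


def parse_review(line):
    """Parse one line of the input file (assumed: one JSON object per line)."""
    return json.loads(line)


def is_from_year(review, year):
    """Assumes a 'date' string that begins with the year, e.g. '2018-05-01 ...'."""
    return review["date"].startswith(str(year))


def rank_key(pair):
    """Order (key, count) pairs by count descending, then key ascending (assumed rule)."""
    key, count = pair
    return (-count, key)


def explore(reviews):
    """Compute the six statistics from an RDD of parsed review dicts."""
    user_counts = reviews.map(lambda r: (r["user_id"], 1)).reduceByKey(add)
    business_counts = reviews.map(lambda r: (r["business_id"], 1)).reduceByKey(add)
    return {
        # Output key names are hypothetical.
        "n_review": reviews.count(),
        "n_review_2018": reviews.filter(lambda r: is_from_year(r, 2018)).count(),
        "n_user": user_counts.count(),
        "top10_user": [list(p) for p in user_counts.takeOrdered(10, key=rank_key)],
        "n_business": business_counts.count(),
        "top10_business": [list(p) for p in business_counts.takeOrdered(10, key=rank_key)],
    }


def main(input_path, output_path):
    # Imported here so the helper functions above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    reviews = sc.textFile(input_path).map(parse_review).cache()
    with open(output_path, "w") as out:
        json.dump(explore(reviews), out)
```

Caching the parsed RDD avoids re-reading and re-parsing the file for each of the actions in `explore`.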

Task 2: Partition

Reported the number of partitions of the RDD used for Task 1 Question F (the last item in the Task 1 list, the top 10 businesses by review count) and the number of items in each partition.
Used a custom partition function to improve the performance of map and reduce tasks.
Compared execution time under the system default partitioning and the custom partitioning.
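One way to inspect partitions and compare default against custom partitioning is sketched below. It is an illustration under assumptions, not the code in task2.py. The partition function `business_partitioner`, the result keys, and the input `pairs` (an RDD of `(business_id, 1)` tuples) are hypothetical.

```python
import time
from operator import add


def business_partitioner(business_id):
    """Hypothetical partition function: sum of the key's character codes.

    PySpark's partitionBy takes this value modulo the number of partitions.
    """
    return sum(ord(ch) for ch in business_id)


def rank_key(pair):
    """Order (key, count) pairs by count descending, then key ascending."""
    key, count = pair
    return (-count, key)


def describe(pairs):
    """Return the partition count and the number of items in each partition."""
    return {
        "n_partition": pairs.getNumPartitions(),
        "n_items": pairs.glom().map(len).collect(),
    }


def timed_top10(pairs, **shuffle_args):
    """Run the top-10 count job and return (result, elapsed seconds)."""
    start = time.perf_counter()
    top10 = pairs.reduceByKey(add, **shuffle_args).takeOrdered(10, key=rank_key)
    return top10, time.perf_counter() - start


def compare(pairs, n_partitions):
    """Compare default partitioning with the custom partition function."""
    default_stats = describe(pairs)
    _, default_stats["exe_time"] = timed_top10(pairs)

    custom = pairs.partitionBy(n_partitions, business_partitioner)
    custom_stats = describe(custom)
    # Passing the same partitioner lets reduceByKey reuse the existing layout
    # instead of shuffling the data a second time.
    _, custom_stats["exe_time"] = timed_top10(
        custom, numPartitions=n_partitions, partitionFunc=business_partitioner
    )
    return {"default": default_stats, "customized": custom_stats}
```

The timings are wall-clock measurements of a single run, so they vary with cluster load and data size.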

Task 3: Exploration on Multiple Datasets

Explored review information (test_review.json) and business information (business.json) together.
Answered questions such as the average stars for each city.
Compared the execution time of two methods for printing the top 10 cities with the highest average stars.
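The join and the timing comparison might look like the sketch below. The summary does not say which two methods were compared. This example assumes they are sorting in plain Python after collecting the results, and sorting inside Spark before taking the first 10. The field names `business_id`, `stars`, and `city` are assumed from Yelp-style data, and all function names are hypothetical.

```python
import time


def add_star(acc, stars):
    """Fold one star rating into a (total, count) accumulator."""
    total, count = acc
    return (total + stars, count + 1)


def merge_acc(a, b):
    """Combine two (total, count) accumulators from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


def average(acc):
    total, count = acc
    return total / count


def rank_key(city_avg):
    """Order by average stars descending, then city name ascending (assumed rule)."""
    city, avg = city_avg
    return (-avg, city)


def city_averages(reviews, businesses):
    """Join two RDDs of parsed dicts on business_id and average stars per city."""
    stars_by_business = reviews.map(lambda r: (r["business_id"], r["stars"]))
    city_by_business = businesses.map(lambda b: (b["business_id"], b["city"]))
    return (
        stars_by_business.join(city_by_business)  # (business_id, (stars, city))
        .map(lambda kv: (kv[1][1], kv[1][0]))     # (city, stars)
        .aggregateByKey((0.0, 0), add_star, merge_acc)
        .mapValues(average)
    )


def top10_sorted_in_python(averages):
    """Assumed method 1: collect every city, then sort on the driver."""
    start = time.perf_counter()
    top10 = sorted(averages.collect(), key=rank_key)[:10]
    return top10, time.perf_counter() - start


def top10_sorted_in_spark(averages):
    """Assumed method 2: sort inside Spark, then take the first 10."""
    start = time.perf_counter()
    top10 = averages.sortBy(rank_key).take(10)
    return top10, time.perf_counter() - start
```

Keeping a `(total, count)` pair per city avoids averaging partial averages, which would give wrong results when partitions hold different numbers of reviews.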

All three scripts read input in the provided format and write results in the specified output format. The code was written to work well on large datasets, as required.

Final Results

Grade: 100%

drewm8080/data_mining_spark_rdds