This repository contains the project for the Big Data course.
It contains two jobs, each implemented in two versions: one non-optimized and one optimized.
- The first job calculates the average number of tracks per artist (TracksForArtist) across all playlists.
- The second job, given a target song, finds the most similar track (the one that appears most often in the same playlists as the target) and the number of playlists the two share.
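To make the two computations concrete, here is a minimal plain-Python sketch on toy data. It is not the repository's Spark code: the data layout, the function names, and the sample playlists are all hypothetical, and the first function assumes one possible reading of the metric (for each playlist, tracks divided by distinct artists, then averaged over playlists).

```python
# Hypothetical toy data: playlist id -> list of (track, artist) pairs.
playlists = {
    "p1": [("t1", "a1"), ("t2", "a1"), ("t3", "a2")],
    "p2": [("t1", "a1"), ("t3", "a2")],
    "p3": [("t1", "a1"), ("t3", "a2"), ("t4", "a1"), ("t5", "a2")],
}


def average_tracks_per_artist(playlists):
    """Job 1 (assumed reading): mean over playlists of tracks / distinct artists."""
    ratios = []
    for tracks in playlists.values():
        artists = {artist for _, artist in tracks}
        if artists:  # skip empty playlists to avoid dividing by zero
            ratios.append(len(tracks) / len(artists))
    return sum(ratios) / len(ratios) if ratios else 0.0


def most_similar_track(playlists, target):
    """Job 2: track sharing the most playlists with `target`, plus that count."""
    shared = {}
    for tracks in playlists.values():
        names = {track for track, _ in tracks}  # count each track once per playlist
        if target in names:
            for other in names - {target}:
                shared[other] = shared.get(other, 0) + 1
    if not shared:
        return None  # target absent, or it never co-occurs with another track
    # Highest count wins; ties are broken alphabetically so the result is stable.
    return min(shared.items(), key=lambda kv: (-kv[1], kv[0]))


print(average_tracks_per_artist(playlists))
print(most_similar_track(playlists, "t1"))
```

In a Spark version the same logic would typically be expressed with grouping and join operations over the playlist records rather than Python loops; how the optimized variants differ is not shown here.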
The results of the jobs, for both the optimized and the non-optimized versions, are stored in the report/results folder.
The job statistics can be viewed in the report/stats folder.
Download dataset:
#!/bin/bash
curl -L -o ~/Downloads/spotify-millions-playlist.zip \
  https://www.kaggle.com/api/v1/datasets/download/adityak80/spotify-millions-playlist
Additional readmes:
- Initial Setup (instructions to set up the environment)
- AWS CLI cheatsheet (a collection of the most common AWS CLI commands)
- AWS Workflow (a checklist of the steps needed to set up the AWS environment and use it to deploy Spark jobs)
- Exam Project (instructions for the project, which is mandatory for the exam)