Having fun with IMDb data files
- From https://datasets.imdbws.com download title.basics.tsv.gz and
title.ratings.tsv.gz (these datasets contain movie titles and movie ratings):
$ wget https://datasets.imdbws.com/title.basics.tsv.gz
$ wget https://datasets.imdbws.com/title.ratings.tsv.gz
If you don't have wget, you can try curl instead:
$ curl -O https://datasets.imdbws.com/title.basics.tsv.gz
$ curl -O https://datasets.imdbws.com/title.ratings.tsv.gz
- Extract these files:
$ gunzip title.basics.tsv.gz
$ gunzip title.ratings.tsv.gz
- We explore and discuss these files.
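One way to start exploring is to look at the header and the first few rows of a file. The following is a minimal sketch, not part of the original exercise: the helper name peek is made up for illustration, and it assumes the files are UTF-8 encoded, tab-separated, and do not use CSV-style quoting (hence csv.QUOTE_NONE, so that quote characters inside titles are kept as plain text).

```python
import csv
from itertools import islice


def peek(path, n=5):
    """Return the header and the first n data rows of a tab-separated file.

    Hypothetical helper for illustration; only the first n + 1 lines
    of the file are actually read.
    """
    with open(path, newline='', encoding='utf-8') as f:
        # Assumption: fields are separated by tabs and quotes are ordinary text.
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = list(islice(reader, n))
    return header, rows
```

Calling peek('title.basics.tsv') returns the list of column names and a list of the first five rows, which can then be printed and discussed.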
Our challenge:
- Find all movies which contain the word "python".
- Discuss problems of the following solution:
# This script will find all movies which
# contain the word "python".
titles = []
with open('title.basics.tsv', 'r') as f:
    for line in f.read().splitlines():
        # skip the header line
        if not 'primaryTitle' in line:
            s = line.split('\t')
            # the third column holds the primary title
            titles.append(s[2])
for title in titles:
    if 'python' in title.lower():
        print(title)
Optional take-home exercises:
- Find the 20 movies with the highest ratings (use only those with many votes).
- Find the 10 most popular comedies.
- Write your code so that it can, in principle, work with datasets of any size.
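As a possible starting point for the last exercise, the sketch below processes a file one row at a time instead of loading it into memory first. It is only one option among several: the function names iter_rows and titles_containing are made up for illustration, and the code assumes a UTF-8, tab-separated file without CSV-style quoting whose header contains a primaryTitle column.

```python
import csv


def iter_rows(path):
    """Yield one row at a time as a dict keyed by the header columns."""
    with open(path, newline='', encoding='utf-8') as f:
        # Assumption: tab-separated fields, quotes treated as ordinary text.
        reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def titles_containing(path, word):
    """Yield primaryTitle values that contain `word`, ignoring case."""
    word = word.lower()
    for row in iter_rows(path):
        # Rows with a missing title column are treated as empty titles.
        title = row.get('primaryTitle') or ''
        if word in title.lower():
            yield title
```

Because both functions are generators, a loop such as "for title in titles_containing('title.basics.tsv', 'python'): print(title)" only ever holds one row in memory, regardless of the size of the file.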