GitHub - a-jacewicz/youtube-viral-video-forecasting

Youtube Viral Video Forecasting

This repository contains the code written by Aleksandra Jacewicz, Alyssa Rodriguez, Kamillah Ismail, and Trina Dang during our time as AI Fellows working with Google. In this project we pre-processed real-world YouTube data, engineered features that are informative and realistic to use in practice, and built supervised, regression-based models that predict how many views a video labeled as viral by YouTube will get.

Data Processing & Feature Engineering

We pre-processed around 50k data instances as part of this project. In doing so, we replaced missing values in ‘dislikes’ and ‘comments’ with averages (computed per channel when possible), combined the data tables into a single representative dataset, and removed exact duplicate rows (≈ 160). We also extracted information from the time-based columns (‘publishedAt’, ‘trending_date’) and took the difference between them so that numeric columns could be expressed on a per-day basis, and we one-hot encoded the relevant categorical columns (e.g. categoryId -> Film & Animation, News & Politics, etc.).
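The steps above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the use of pandas, the grouping column (channelTitle), the global-mean fallback, the one-day minimum, and the names days_to_trend and views_per_day are all assumptions, and the timestamps are assumed to be ISO 8601 strings.

```python
import pandas as pd

# Toy rows standing in for the real data; every value here is made up.
df = pd.DataFrame({
    "video_id": ["a", "a", "b", "c", "d"],
    "channelTitle": ["X", "X", "X", "Y", "Z"],
    "categoryId": [1, 1, 1, 25, 25],
    "publishedAt": ["2021-01-01T00:00:00Z"] * 5,
    "trending_date": [
        "2021-01-03T00:00:00Z", "2021-01-03T00:00:00Z",
        "2021-01-02T00:00:00Z", "2021-01-05T00:00:00Z",
        "2021-01-02T00:00:00Z",
    ],
    "view_count": [300, 300, 100, 800, 50],
    "dislikes": [10.0, 10.0, None, None, 4.0],
})

# 1. Remove exact duplicate rows (the first two rows are identical).
df = df.drop_duplicates().reset_index(drop=True)

# 2. Fill missing dislikes with the channel mean when the channel has one,
#    otherwise fall back to the overall mean (assumed fallback).
channel_mean = df.groupby("channelTitle")["dislikes"].transform("mean")
df["dislikes"] = df["dislikes"].fillna(channel_mean).fillna(df["dislikes"].mean())

# 3. Days between publication and trending, then a per-day view rate.
#    Clipping to at least one day avoids division by zero (assumed choice).
published = pd.to_datetime(df["publishedAt"])
trending = pd.to_datetime(df["trending_date"])
df["days_to_trend"] = (trending - published).dt.days.clip(lower=1)
df["views_per_day"] = df["view_count"] / df["days_to_trend"]

# 4. Map category ids to names, then one-hot encode the names.
category_names = {1: "Film & Animation", 25: "News & Politics"}
df["category"] = df["categoryId"].map(category_names)
df = pd.get_dummies(df, columns=["category"], prefix="category")

print(df[["video_id", "dislikes", "days_to_trend", "views_per_day"]])
```

Computing the channel mean with transform keeps the result aligned with the original rows, so it can be passed straight to fillna without a merge.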

We also processed the text-based data so that natural language processing could be applied to it. This included cleaning, tokenization, building a vocabulary, training embeddings, and producing vectorized representations.
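As a rough sketch of the cleaning, tokenization, vocabulary, and vectorization steps, the snippet below uses only the Python standard library. The cleaning rules, the special tokens `<pad>` and `<unk>`, the function names, and the fixed length of 5 are illustrative assumptions. Embedding training is left out because the README does not say which library or method was used.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase, drop URLs, and replace other non-alphanumerics with spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenize(text):
    """Split cleaned text on whitespace."""
    return clean(text).split()


def build_vocab(token_lists, min_count=1):
    """Assign integer ids to tokens, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncating or padding to exactly max_len."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Made-up titles, used only to exercise the functions.
titles = ["Official Trailer #2 (2021)", "Trailer reaction!! https://example.com"]
tokens = [tokenize(t) for t in titles]
vocab = build_vocab(tokens)
vectors = [encode(t, vocab, max_len=5) for t in tokens]
print(vectors)
```

Fixed-length id sequences like these are the usual input to an embedding layer, which is where trained embeddings would plug in.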

Models and Results

We built several models over the course of this project, with each member of our team taking on a different model. We've uploaded our final project presentation to this repository (Google 3E Final Presentation - Youtube Viral Video Forecasting.pdf), so feel free to look at it to find out more! As a general overview:

Aleksandra worked on building a neural network based model, and was able to get an MSE (mean squared error) of 0.1069. The code used while doing so can be found at Model_Development/Neural_Network.ipynb.

Alyssa worked on building a decision tree model, and was able to get an MSE of 0.3107.
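For reference, MSE is the average of the squared differences between the true and predicted values, MSE = (1/n) Σ (y_i − ŷ_i)², so lower is better. The snippet below is a small standard-library sketch of that formula on made-up numbers. The README does not state how the target was scaled or which split was used, so the values here are not meant to mirror the reported results.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, only to show the calculation.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
mse = mean_squared_error(y_true, y_pred)  # (0.25 + 0.0 + 1.0) / 3
print(mse)
```

Because the errors are squared, a few large misses raise the score much more than many small ones.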

Dataset

We used the Youtube Trending Video Dataset, specifically its US_category_id.json and US_Youtube_trending_data.csv files, to train our models.
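One way the two files could be linked is to turn the category JSON into an id-to-name lookup and apply it to each row's categoryId. The sketch below assumes the JSON follows the YouTube Data API layout (an "items" list whose entries carry "id" and "snippet.title"). The inline sample is a trimmed stand-in rather than the real file, and the function name and "Unknown" fallback are hypothetical.

```python
import json

# Trimmed stand-in for US_category_id.json (assumed structure).
sample_json = """
{"items": [
  {"id": "1", "snippet": {"title": "Film & Animation"}},
  {"id": "25", "snippet": {"title": "News & Politics"}}
]}
"""


def category_lookup(raw):
    """Build a {category id: category name} dict from the JSON text."""
    data = json.loads(raw)
    return {int(item["id"]): item["snippet"]["title"] for item in data["items"]}


id_to_name = category_lookup(sample_json)

# Made-up rows standing in for records read from the CSV file.
rows = [
    {"video_id": "a", "categoryId": 25},
    {"video_id": "b", "categoryId": 99},
]
for row in rows:
    # Ids missing from the lookup get a placeholder name.
    row["category"] = id_to_name.get(row["categoryId"], "Unknown")

print(rows)
```

The ids are converted to integers here because they appear as strings in the JSON while the CSV column is numeric.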

The US_Youtube_trending_data.csv file contains around 50k snapshots of information about viral YouTube videos over time. Its columns include:

video_id
title
publishedAt
channelId
channelTitle
categoryId
trending_date
tags
view_count
likes
dislikes
comment_count
thumbnail_link
comments_disabled
ratings_disabled
description

Repository contents:

Model_Development
Google 3E Final Presentation - Youtube Viral Video Forecasting.pdf
README.md
data_preprocessing.ipynb
