Last Updated: December 4, 2016
Lead Maintainers: Rafael Zamora, Justin Murphy
The goal of this project is to analyze Twitter data related to the 2016 United States presidential race. We hope to discover classes of tweets by applying clustering techniques to two particular data points: the sentiment of each tweet and how much it references either the Republican or the Democratic candidate. The clusters will then be used to analyze tweeting behavior over the last few weeks of the election. We hope to see how specific events during the race influenced sentiment on Twitter towards either candidate.
Data was gathered from roughly 3 weeks prior to the election through 1 week after the election. The data was pulled from Twitter using Python with the following parameters:
- Start Date: 2016-10-16
- End Date: 2016-11-14
- Keywords: @hillaryclinton OR #hillaryclinton OR Hillary Clinton OR Hillary OR @RealDonaldTrump OR #donaldtrump OR Donald Trump OR Trump
The following values were gathered from each tweet:
- Author-ID
- Date with Time
- Text
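As a rough illustration of how these three values might be laid out in a CSV file, here is a minimal sketch using only the Python standard library. The column names (`author_id`, `date`, `text`) and the sample rows are assumptions made for this example; they are not taken from PullTwitterData.py or from the project's data.

```python
import csv
import io

# Hypothetical column names for the three gathered values.
FIELDS = ["author_id", "date", "text"]

# Made-up rows for illustration only; not real tweets from the data set.
rows = [
    {"author_id": "1001", "date": "2016-10-16 09:30:00",
     "text": "Example tweet mentioning Hillary Clinton"},
    {"author_id": "1002", "date": "2016-11-09 00:05:00",
     "text": "Example tweet, with a comma, mentioning Trump"},
]

# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the rows back to confirm the text field survives quoting.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
```

Using `csv.DictWriter` rather than joining strings by hand keeps tweet text containing commas or quotes intact.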
The following is an example of a tweet and the values produced through processing:
- Tweet text: Donald Trump Angry at Mike Pence For Doing Great Job At Vice Presidential Debate.
- Noun Phrases: [ 'donald trump angry', 'mike pence', 'job', 'vice presidential debate' ]
- Sentiment Value: 0.803
- Clinton Reference Value: 0.263
- Trump Reference Value: 0.800
- Candidate Reference Value: -0.537*
*-1 = Trump, 1 = Clinton
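The example values above are consistent with the Candidate Reference Value being the Clinton Reference Value minus the Trump Reference Value. This is inferred from the single example shown here and may not match the formula actually used in ProcessTwitterData.py; the function name below is hypothetical.

```python
# Values copied from the example tweet above.
clinton_reference = 0.263
trump_reference = 0.800


def candidate_reference(clinton: float, trump: float) -> float:
    """Assumed formula: Clinton reference minus Trump reference.

    Negative results lean towards Trump, positive towards Clinton.
    """
    return clinton - trump


value = candidate_reference(clinton_reference, trump_reference)
print(round(value, 3))  # prints -0.537
```

If both reference values lie between 0 and 1, this difference stays between -1 and 1, which matches the scale noted above.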
scikit-learn's Birch clustering algorithm was used to cluster the processed data. The following are graph examples of processed and clustered data:
Requires Python 3.5 and R.
Requires the following Python Packages:
- GOT3 (modified version is included in /src/)
- TextBlob
- scikit-learn
To install, download or clone the repository and install the required packages.
The /src/ folder includes all scripts used for this project. The following are short descriptions of each script:
- PullTwitterData.py - Used to pull and write data to CSV
- ProcessTwitterData.py - Used to process and run sentiment analysis on pulled data
- ClusterTwitterData.py - Used to run Birch clustering on processed data
- GraphTwitterData.R - Used to export PNG graphs of processed and clustered data
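For readers unfamiliar with the clustering step, the following is a minimal sketch of Birch clustering with scikit-learn on two features per tweet: sentiment and candidate reference value. It is not the contents of ClusterTwitterData.py. The sample points, `threshold=0.1`, and `n_clusters=2` are assumptions chosen only to make the example small.

```python
import numpy as np
from sklearn.cluster import Birch

# Made-up points: column 0 is sentiment, column 1 is candidate reference
# value (-1 = Trump, 1 = Clinton). Not taken from the project's data.
points = np.array([
    [0.80, -0.54], [0.75, -0.50], [0.85, -0.60], [0.70, -0.45],
    [-0.60, 0.55], [-0.65, 0.60], [-0.55, 0.50], [-0.70, 0.65],
])

# threshold and n_clusters are illustrative assumptions, not the
# settings used for the project's results.
model = Birch(threshold=0.1, n_clusters=2)
labels = model.fit_predict(points)

# Summarize each cluster by its size and mean position.
for label in sorted(set(labels)):
    members = points[labels == label]
    print(label, len(members), members.mean(axis=0))
```

With two well-separated groups like these, each group should receive its own label; on real data the threshold and number of clusters would need tuning.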
The /doc/ folder contains an R Notebook used for analyzing data and results. It also contains a /figures/ folder, which includes graphs of all processed and clustered data.
The /data/ folder contains pre-processed and processed Twitter data, while the final clustered data can be found in the results folder.
Cluster sizes and centroid coordinates can be found in results.txt.
A graph of the total number of tweets per day can be found in TweetsPerDay.png.
This project is licensed under the MIT License.
See CITATION for how to cite this project.