pyYouTubeAnalysis

pyYouTubeAnalysis pulls data from the YouTube API and analyzes it using statistics and Natural Language Processing (NLP). It includes NLP implementations of text cleaning aimed at social media noise, key-phrase extraction using NLTK, and Named-Entity Recognition (NER) on a list of strings. It also generates plots, word clouds, and a PDF analysis report automatically.

Setup

  1. Use pip
pip install pyYouTubeAnalysis

and run

python -m spacy download en_core_web_sm

  2. Or clone the project from GitHub and run the following for setup.

git clone git@github.com:jsingh811/pyYouTubeAnalysis.git
cd pyYouTubeAnalysis
pip install -e .
python -m spacy download en_core_web_sm

Demos

To see YouTube data extraction examples, see the section YouTube Data Fetching.

To see NER extraction examples, see the section Extracting Locations.

To see Key-phrase extraction examples, see the section Extracting Keyphrases from Text.

To see data cleaning examples for removing emojis and URLs from text, see the section Removing Emojis and URLs from Text.

To see report generation with statistical and NLP analysis, see the section Report Generation.

YouTube Data Fetching

Command Line Usage

python run_crawl.py -t "<YouTube API key (39 chars long)>" -k "travel vlog" -sd "2020-01-01T00:00:00Z" -ed "2020-01-02T00:00:00Z" -climit 5 -path "/Users/abc/Documents"

Input Arguments

path (-path): Path to the directory you want to save the data in
keyword (-k): Keyword to search videos for
start-date (-sd): Starting publish date of video to search. Format YYYY-MM-DDThh:mm:ssZ
end-date (-ed): Ending publish date of video to search. Format YYYY-MM-DDThh:mm:ssZ
token (-t): YouTube API access token
comments (-c): Whether you want to fetch comment text for the videos
comment-limit (-climit): Per video comment limit to fetch
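
The -sd and -ed values must follow the YYYY-MM-DDThh:mm:ssZ format. A minimal sketch, using only the standard library, for building and checking such strings before passing them in. The helper names to_api_timestamp and is_api_timestamp are hypothetical and not part of pyYouTubeAnalysis.

```python
from datetime import datetime, timezone

API_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches YYYY-MM-DDThh:mm:ssZ


def to_api_timestamp(dt: datetime) -> str:
    """Format a datetime as a timestamp string (hypothetical helper).

    Timezone-aware values are converted to UTC first; naive values are
    assumed to already be in UTC.
    """
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.strftime(API_FORMAT)


def is_api_timestamp(text: str) -> bool:
    """Return True if text parses in the expected format (hypothetical helper)."""
    try:
        datetime.strptime(text, API_FORMAT)
    except ValueError:
        return False
    return True


start_date = to_api_timestamp(datetime(2020, 1, 1))
print(start_date)
```

Checking the strings up front is one way to catch a malformed date before any API request is made.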

Import and Use

import json
from pyYouTubeAnalysis import (run_crawl, crawler)

keyword = "travel"
start_date = "2020-01-01T00:00:00Z"
end_date = "2020-01-02T00:00:00Z"
comment_limit = 5
api_token = "<YouTube API key (39 chars long)>"
path = "/Users/abc/Documents"
api = crawler.YouTubeCrawler(key=api_token)

# Fetch data from the API
videos, comments = run_crawl.get_videos_and_comments(
    api,
    keyword=keyword,
    start_date=start_date,
    end_date=end_date,
    comment_limit=comment_limit,
)

# Save the fetched data on disk; colons are stripped from the dates in the file names
file_prefix = "_".join(
    [keyword, start_date.replace(":", ""), end_date.replace(":", "")]
)
with open("/".join([path, file_prefix + "_video_details.json"]), "w") as f:
    json.dump(videos, f, indent=2)
with open("/".join([path, file_prefix + "_comment_details.json"]), "w") as f:
    json.dump(comments, f, indent=2)

Sample output

The generated ...video_details.json file contains a list of dictionaries, in the format shown in this file.

The generated ...comment_details.json file contains a list of dictionaries, in the format shown in this file.
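
Since both files are JSON lists of dictionaries, they can be read back with the standard library. A minimal sketch: the helper names load_records and key_counts are hypothetical, and no field names are assumed, so the snippet only reports whichever keys it finds.

```python
import json
from collections import Counter


def load_records(filepath):
    """Load a JSON file expected to hold a list of dictionaries (hypothetical helper)."""
    with open(filepath, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of dictionaries")
    return records


def key_counts(records):
    """Count how often each key appears across the records."""
    return Counter(key for record in records for key in record)
```

Calling load_records on the path of a generated file and passing the result to key_counts gives a quick overview of which fields are present and how consistently they appear.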

Extracting Locations

The following examples extract locations from the comments file generated above.

Command Line Usage

Assuming you are in the parent folder pyYouTubeAnalysis after cloning and setting up the project, the following sample command can be used. Please alter -filepath accordingly.

python extract_locations.py -filepath "/Users/abc/Documents/travel_comment_details.json"

Import and Use

from pyYouTubeAnalysis import extract_locations

filepath = "/Users/abc/Documents/travel_comment_details.json"

comments = extract_locations.read_comment_text(filepath)
locations = extract_locations.extract_locations(comments)

Sample output

The locations_....json file generated by the command line usage example, or the variable locations in the import and use example, is a dictionary with location names as keys and their occurrence counts as values, in the format shown in this file.
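
Because the result maps location names to counts, ranking the most frequently mentioned places takes only a few lines. A small sketch, where the helper name top_locations is hypothetical and the counts are made up for illustration:

```python
from collections import Counter


def top_locations(locations, n=3):
    """Return the n most frequent (name, count) pairs from a name -> count dict."""
    return Counter(locations).most_common(n)


# Made-up counts, for illustration only
locations = {"Miami": 4, "Paris": 7, "Tokyo": 2, "Lima": 1}
print(top_locations(locations, n=2))  # [('Paris', 7), ('Miami', 4)]
```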

Extracting Keyphrases from Text

Import and Use

from pyYouTubeAnalysis.phrases import KeyPhraseGenerator

documents = [
            """Did you know about this conference in Miami? It is about Natural
            Language Processing techniques as applied to messy data.""",
            "I really enjoyed the chocolate cheesecake yesterday!"
]

kp = KeyPhraseGenerator()

phrases = kp.extract_keyphrases(documents)

Removing Emojis and URLs from Text

Import and Use

from pyYouTubeAnalysis import cleaner

document = " emoji was here -> 😃 , and url was here -> https://github.com"

# remove emoji
emoji_removed = cleaner.remove_emojis(document)

# remove url
url_removed = cleaner.remove_urls(document)
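
To illustrate the general idea behind this kind of cleaning, here is a regex-based sketch using only the standard library. It is not the implementation behind cleaner.remove_emojis or cleaner.remove_urls, whose internals are not shown here. The names strip_urls and strip_emojis and the chosen emoji ranges are assumptions.

```python
import re

# Matches http(s) URLs up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")

# A few common emoji code point ranges; not an exhaustive list
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"  # miscellaneous symbols and dingbats
    "]+"
)


def strip_urls(text):
    """Remove http(s) URLs from text (illustrative helper)."""
    return URL_PATTERN.sub("", text)


def strip_emojis(text):
    """Remove characters in the emoji ranges above (illustrative helper)."""
    return EMOJI_PATTERN.sub("", text)


print(strip_urls("url was here -> https://github.com"))
print(strip_emojis("emoji was here -> \U0001F603"))
```

Fixed code point ranges are easy to read but will miss newer emoji, so a maintained emoji list may be preferable for production use.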

Report Generation

This functionality crawls YouTube and gathers statistics plots, word clouds, and location analysis into a single PDF. The files generated as part of this process can be found in this folder.

Command Line Usage

Assuming you are in the parent folder pyYouTubeAnalysis after cloning and setting up the project, the following sample command can be used. Please alter -path accordingly.

python report.py -path "/Users/abc/Documents" -k "travel vlog" -sd "2020-01-01T00:00:00Z" -ed "2021-03-31T00:00:00Z" -analysis "monthly,yearly" -t "<YouTube API key (39 chars long)>"

Import and Use

from pyYouTubeAnalysis.report import ReportGenerator
from pyYouTubeAnalysis import run_crawl, crawler

keyword = "travel vlog"
start_date = "2020-01-01T00:00:00Z"
end_date = "2021-03-31T00:00:00Z"
analysis_type = ["yearly", "monthly"]
api_token = "<YouTube API key (39 chars long)>"
path = "/Users/abc/Documents"

rgen = ReportGenerator(path, keyword, start_date, end_date, analysis_type)

api = crawler.YouTubeCrawler(key=api_token)
# Fetch data from the api
[videos, comments] = run_crawl.get_videos_and_comments(
    api, keyword=keyword, start_date=start_date, end_date=end_date, comment_limit=10
)
print("\nFetched data\n")
rgen.get_and_plot_stats(videos)
rgen.plot_trending_tags(videos)
rgen.plot_comment_locations(comments)
print("\nFetched plots\n")
output_path = rgen.export_to_pdf()
print("\nGenerated pdf here {}\n".format(output_path))
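
The monthly and yearly analysis types presumably group videos by publish date. As a rough illustration of that idea, and not the ReportGenerator implementation, the sketch below counts timestamps per month or per year. The function name count_by_period and the sample timestamps are hypothetical, and the timestamps are assumed to use the same YYYY-MM-DDThh:mm:ssZ format as the start and end dates.

```python
from collections import Counter
from datetime import datetime

API_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
PERIOD_FORMATS = {"monthly": "%Y-%m", "yearly": "%Y"}


def count_by_period(timestamps, analysis_type):
    """Count timestamps per month or per year (hypothetical helper)."""
    label_format = PERIOD_FORMATS[analysis_type]
    return Counter(
        datetime.strptime(ts, API_FORMAT).strftime(label_format)
        for ts in timestamps
    )


# Made-up publish times, for illustration only
published = ["2020-01-05T10:00:00Z", "2020-01-20T08:30:00Z", "2021-03-01T00:00:00Z"]
print(count_by_period(published, "monthly"))
print(count_by_period(published, "yearly"))
```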

Citation

Kindly cite as follows.

APA

Singh, J. (2021). Social Media Analysis using Natural Language Processing Techniques. In Proceedings of the 20th Python in Science Conference (pp. 52-58).

BibTeX

@InProceedings{ jyotika_singh-proc-scipy-2021,
  author    = { {J}yotika {S}ingh },
  title     = { {S}ocial {M}edia {A}nalysis using {N}atural {L}anguage {P}rocessing {T}echniques },
  booktitle = { {P}roceedings of the 20th {P}ython in {S}cience {C}onference },
  pages     = { 74 - 80 },
  year      = { 2021 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
  doi       = { 10.25080/majora-1b6fd038-009 }
}

Please cite this software as follows.

APA

Singh, J. (2021). jsingh811/pyYouTubeAnalysis: YouTube Data Requests and Natural Language Processing on Text (v1.1) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5044556

BibTeX

@software{https://doi.org/10.5281/zenodo.5044556,
  doi = {10.5281/ZENODO.5044556},
  url = {https://zenodo.org/record/5044556},
  author = {Singh,  Jyotika},
  title = {jsingh811/pyYouTubeAnalysis: YouTube Data Requests and Natural Language Processing on Text},
  publisher = {Zenodo},
  year = {2021},
  copyright = {Open Access}
}