Scripts for selecting and preprocessing image and video ad data in the AWS Rekognition pipeline as well as querying text ad data

Wesleyan-Media-Project/image-video-data-preparation

CREATIVE --- image-video-data-preparation

Welcome! This repository contains scripts for selecting and preprocessing ad image and video data for the AWS Rekognition pipeline as well as querying text ad data.

This repository is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.

To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Processing step in our pipeline.

A picture of the pipeline diagram

1. Overview

This repository contains scripts for preprocessing video and image data: deduplication, ad content filtering (for image data), and video trimming (to economize computational resources). It also provides scripts for retrieving metadata for video and image ads from your Google BigQuery table, and text ads from your MySQL table. Both are set up during the data collection step for Facebook and Google.

2. Setup

2.1 Install Relevant Software

Before running any of the code in this repo, make sure you have Python installed on your system. You can install Python from the official Python website. In addition, install Jupyter Notebook with the following command in your terminal:

 pip install jupyter

From here, you should be able to run Jupyter Notebook by entering this command in your terminal:

jupyter notebook

2.2 Install Dependencies

Prior to running the scripts in this repo, please install the following dependency:

pip install pandas

2.3 Run the Scripts

The scripts are intended to be run in numbered order: 01-get-checksum-for-deduplication.ipynb should be run before 02-filter-data-for-audiovisual-analysis.ipynb.

Note that before running 01-get-checksum-for-deduplication.ipynb, you will have to edit the lines video_source_path = 'my-video-dir' and image_source_path = 'my-image-dir' to point to your data directories, and adjust the filetype argument of def search_files(directory, filetype=None): to the file type you want to target.
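As an illustration of what these settings control, here is a minimal sketch of a file search with the same signature. The body is an assumption written for this README, not the notebook's actual code; in particular, treating filetype as a file extension is a guess.

```python
import os

# Placeholder values from the notebook; replace with your own data directories.
video_source_path = "my-video-dir"
image_source_path = "my-image-dir"


def search_files(directory, filetype=None):
    """Return sorted paths of all files under `directory`.

    Assumption: `filetype` is an extension such as "mp4" or ".mp4" and is
    matched case-insensitively. The notebook's function may differ.
    """
    suffix = None
    if filetype is not None:
        suffix = "." + filetype.lower().lstrip(".")
    matches = []
    # os.walk silently yields nothing if the directory does not exist.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if suffix is None or name.lower().endswith(suffix):
                matches.append(os.path.join(root, name))
    return sorted(matches)


print(len(search_files(video_source_path, filetype="mp4")), "video files found")
```

If the count printed is 0, the directory path or the file type most likely does not match your data.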

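To run the trim-video.py script, which you can do after running the previous two notebooks, first make ffmpeg available on your PATH. The following bash command does this for an installation under /software/ffmpeg; adjust the directories if ffmpeg is installed elsewhere on your system: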

 export PATH=/software/ffmpeg:/software/ffmpeg/bin:$PATH

In addition, make sure that the lines referencing data directories match your own directories, specifically video_dir = "my-video-directory" and truncated_video_dir = "my-trimmed-video-directory".
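Before starting a long trimming run, it can help to confirm that both requirements are met. The following sketch is not part of the repo; check_trim_setup is a hypothetical helper that reports whether ffmpeg is reachable on PATH and whether the two directories exist.

```python
import os
import shutil

# Placeholder values from trim-video.py; replace with your own directories.
video_dir = "my-video-directory"
truncated_video_dir = "my-trimmed-video-directory"


def check_trim_setup(video_dir, truncated_video_dir):
    """Return a list of readable problems; an empty list means ready to run."""
    problems = []
    # shutil.which searches PATH the same way the shell does.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if not os.path.isdir(video_dir):
        problems.append(f"input directory does not exist: {video_dir}")
    if not os.path.isdir(truncated_video_dir):
        problems.append(f"output directory does not exist: {truncated_video_dir}")
    return problems


for problem in check_trim_setup(video_dir, truncated_video_dir):
    print("Setup problem:", problem)
```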

NOTE: The query scripts in select-ad-metadata differ from the other scripts in this repo in that they are, or contain, SQL queries. They require you to have Google BigQuery and/or a local MySQL database set up. For setting up Google BigQuery, see our google_ads_archive repo.

3. Results Storage

When you run 01-get-checksum-for-deduplication.ipynb and 02-filter-data-for-audiovisual-analysis.ipynb, the results are saved in an output folder. The data will be in csv format.

The data created by 01-get-checksum-for-deduplication.ipynb (saved as outfile.csv, google2022_video_info.csv, or google2022_image_info.csv) contains the following fields:

  • filepath: the file path to get to the file being referenced
  • filename: the file name of the file being referenced
  • checksum: the checksum computed for the file being referenced
  • filesize (image information table only, not the video information table): the size of the file being referenced
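The fields above can be produced with a sketch along these lines. It is an illustration under assumptions, not the notebook's code: SHA-256 is chosen here only as an example hash (the notebook may use another), filepath is taken to be the full path, and file_checksum, describe_files, and write_rows are hypothetical names.

```python
import csv
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are never fully loaded into memory.

    Assumption: SHA-256 is used for illustration only.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_files(paths, include_filesize=False):
    """Build one row per file with the fields listed above."""
    rows = []
    for path in paths:
        row = {
            "filepath": path,
            "filename": os.path.basename(path),
            "checksum": file_checksum(path),
        }
        if include_filesize:  # image table only
            row["filesize"] = os.path.getsize(path)
        rows.append(row)
    return rows


def write_rows(rows, outfile):
    """Write the rows to a CSV file with a header line."""
    if not rows:
        return
    with open(outfile, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Two files with identical bytes receive the same checksum regardless of their names, which is what makes the checksum usable as a deduplication key.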

02-filter-data-for-audiovisual-analysis.ipynb refines the data created by 01-get-checksum-for-deduplication.ipynb. The data returned is largely similar to that of 01-get-checksum-for-deduplication.ipynb, but with deduplication, the extraction of ad_ids when relevant, and the exclusion of screenshot images.

The data returned by 02-filter-data-for-audiovisual-analysis.ipynb contains the following fields:

  • filepath: the file path to get to the file being referenced
  • filename: the file name of the file being referenced
  • checksum: the checksum computed for the file being referenced
  • filesize (image information table only, not the video information table): the size of the file being referenced
  • ad_id: the ad ID extracted from image files whose names follow the ad ID, underscore, file type structure
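The deduplication and ad_id extraction steps can be sketched as follows. This is a hedged illustration, not the notebook's implementation: deduplicate and extract_ad_id are hypothetical names, the file names in the example are invented, and the assumed naming pattern is "<ad_id>_<filetype>.<extension>". Screenshot exclusion is omitted because this README does not describe how screenshots are identified.

```python
import os


def deduplicate(rows):
    """Keep only the first row seen for each checksum."""
    seen = set()
    unique = []
    for row in rows:
        if row["checksum"] not in seen:
            seen.add(row["checksum"])
            unique.append(row)
    return unique


def extract_ad_id(filename):
    """Return the text before the last underscore of the file stem, or None.

    Assumption: names look like "<ad_id>_<filetype>.<extension>". Splitting
    on the last underscore keeps ad IDs that themselves contain underscores.
    """
    stem, _ext = os.path.splitext(filename)
    ad_id, sep, _filetype = stem.rpartition("_")
    return ad_id if sep and ad_id else None


# Invented example rows; real rows come from the checksum table.
rows = [
    {"filename": "CR1_image.png", "checksum": "aa"},
    {"filename": "CR2_image.png", "checksum": "aa"},  # same bytes as CR1
    {"filename": "other.png", "checksum": "bb"},
]
for row in deduplicate(rows):
    row["ad_id"] = extract_ad_id(row["filename"])
    print(row["filename"], row["ad_id"])
```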

The trim-video.py script produces truncated videos (each 2 minutes long) in the truncated_video_dir folder, which is the local path you set for saving trimmed video files.
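For readers unfamiliar with ffmpeg, the sketch below shows one way such a trim could be invoked from Python. It is not the contents of trim-video.py: build_trim_command and trim_videos are hypothetical names, and copying streams without re-encoding (-c copy) is an assumption made to keep the example fast.

```python
import os
import subprocess


def build_trim_command(src, dst, max_seconds=120):
    """Build an ffmpeg command that keeps at most `max_seconds` of `src`.

    -y overwrites an existing output, -t limits the output duration, and
    -c copy copies the streams as-is (an assumption; the script may re-encode).
    """
    return ["ffmpeg", "-y", "-i", src, "-t", str(max_seconds), "-c", "copy", dst]


def trim_videos(video_dir, truncated_video_dir, max_seconds=120):
    """Trim every file in `video_dir` into `truncated_video_dir`."""
    os.makedirs(truncated_video_dir, exist_ok=True)
    for name in sorted(os.listdir(video_dir)):
        src = os.path.join(video_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(truncated_video_dir, name)
        # check=True raises CalledProcessError if ffmpeg exits with an error.
        subprocess.run(build_trim_command(src, dst, max_seconds), check=True)


print(" ".join(build_trim_command("in.mp4", "out.mp4")))
```

With -t, a video shorter than the limit is written at its full length, so only longer videos are actually shortened.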

The select-ad-metadata folder contains three database query scripts. get_fb_metadata_and_text_ads.R and get_google_text_ads.R select and save query results tables into .csv files. The data returned by get_google_metadata.sql is a result table on Google BigQuery, which can be exported and saved into a .csv file.

Together, the three query scripts in the select-ad-metadata folder select and save text ads and metadata for all ads (including video and image ads) from Google and Facebook. For Google ads, text ads and ad metadata are queried separately: text ads from MySQL and metadata from Google BigQuery. For Facebook ads, both are queried together from MySQL.

For Google ads:

  • select-ad-metadata/get_google_metadata.sql is a SQL query selecting metadata fields for image and video data and as such it returns a result table.
  • select-ad-metadata/get_google_text_ads.R selects text ads data fields from MySQL and saves them into a csv file.

For Facebook ads:

  • select-ad-metadata/get_fb_metadata_and_text_ads.R selects text ads as well as metadata information for all media types and saves them into a csv file.

4. Thank You

We would like to thank our supporters!


This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

National Science Foundation Logo

The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.

CREATIVE Logo

Wesleyan Media Project logo

privacy-tech-lab logo
