🏍️ A clustering tool providing exact and near de-duplication of images using vector embeddings.

HQarroum/piaggio


Piaggio

A clustering tool for de-duplicating near-identical images in videos using vector embeddings and segmentation clusters.


🔖 Features

  • 📹 Scene Detection - Uses scene detection to extract transition frames from videos.
  • 🤖 Semantic Fingerprinting - Uses vector embeddings to perform semantic de-duplication of images.
  • ⬛ Technical Frames Detection - Filters out black and white technical frames.
  • 🖼️ Image Deduplication - Semantically de-duplicates images in addition to videos.
  • 📈 Plotting - Plots and visualizes the image clusters.
  • 🦎 Local-first - Runs entirely locally, on GPU or CPU.
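
To make the technical frames filter more concrete, here is a minimal sketch of one way such a check could work. It assumes that "black and white technical frames" means frames that are almost uniformly black or almost uniformly white. The function name `is_technical_frame` and all thresholds are hypothetical and are not taken from Piaggio's source.

```python
from statistics import mean, pstdev


def is_technical_frame(pixels, dark=16, bright=239, max_spread=8.0):
    """Return True if a grayscale frame is almost uniformly black or white.

    `pixels` is a flat sequence of 0-255 luminance values. The thresholds
    are illustrative assumptions, not Piaggio's actual settings.
    """
    avg = mean(pixels)
    spread = pstdev(pixels)  # population standard deviation of luminance
    nearly_uniform = spread <= max_spread
    return nearly_uniform and (avg <= dark or avg >= bright)


black = [2, 0, 3, 1] * 16          # fade-to-black frame
white = [254, 255, 253, 255] * 16  # flash or blank white frame
scene = [10, 200, 90, 250] * 16    # regular frame with real content

print(is_technical_frame(black), is_technical_frame(white), is_technical_frame(scene))
# prints: True True False
```

Checking the spread as well as the mean avoids discarding a frame that merely averages out dark or bright while still containing detail.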

🚀 Installation

Using pip

pip install -r requirements.txt

Using uv

uv sync

This application requires ffmpeg/mkvmerge for video splitting support.

What's this ❓

Piaggio is a semantic image clustering tool that you run from the command line to de-duplicate near-identical images from videos or from a collection of images. It uses vector embeddings to perform semantic de-duplication of images and PySceneDetect to extract transition frames from videos.

Intended use cases include keyframe extraction from videos (e.g., for thumbnail generation) and semantic de-duplication of images in a dataset, where images are clustered not only by pixel resemblance but also by semantic content.
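
As an illustration of what comparing images by semantic content means in practice, the sketch below computes the cosine distance between embedding vectors. The three-dimensional vectors are made up for readability (real CLIP embeddings have hundreds of dimensions), and the helper `cosine_distance` is a hypothetical name, not part of Piaggio.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy embeddings: frame_a and frame_b depict nearly the same content,
# frame_c depicts something unrelated.
frame_a = [0.9, 0.1, 0.0]
frame_b = [0.88, 0.12, 0.01]
frame_c = [0.0, 0.2, 0.95]

print(cosine_distance(frame_a, frame_b) < 0.2)  # prints: True
print(cosine_distance(frame_a, frame_c) < 0.2)  # prints: False
```

The 0.2 threshold mirrors the documented default for `--epsilon`; two images whose embeddings fall within that distance are candidates for the same cluster.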

📚 Usage

Extracting keyframes from a local video

uv run src/main.py \
  -v path/to/video.mp4 \
  -o path/to/output/directory

Workflow

graph LR
	A[Video] --> B(Scene Detection)
	B --> C(Semantic Fingerprinting)
	C --> D(Technical Frames Filtering)
	D --> E(Clustering)
	E --> F(Deduplication)

Extracting keyframes from a YouTube video

Install yt-dlp locally to download videos from YouTube.

ℹ️ This is provided only as an example for research purposes; use it responsibly and in accordance with YouTube's terms of service.

# Download video and encode as MP4.
yt-dlp \
  -S res,ext:mp4:m4a \
  --recode mp4 \
  'https://www.youtube.com/watch?v=<video-id>'

# Extract keyframes.
uv run src/main.py \
  -v path/to/video.mp4 \
  -o path/to/output/directory

In this example, we run Piaggio on this NetworkChuck video, which is 1.4 GB in size, 34 minutes long, and contains 62,836 frames in total. After clustering, Piaggio reduced these to only 22 images. Below are some of the extracted keyframes from the semantic cluster.



Deduplicating images from a local directory

uv run src/main.py \
  -d path/to/images/directory \
  -o path/to/output/directory

Workflow

graph LR
  A[Images] --> B(Semantic Fingerprinting)
  B --> C(Clustering)
  C --> D(Deduplication)
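
The clustering and de-duplication stages of this workflow can be sketched in plain Python. This is a minimal, unoptimized DBSCAN written for illustration and is not Piaggio's implementation. The file names and embeddings are made up, and two behaviors are assumptions: the first image of each cluster is kept as its representative, and noise points (label -1) are kept as unique images.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def dbscan(points, eps, min_samples, dist):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # The point itself counts towards min_samples.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def deduplicate(names, labels):
    """Keep one image per cluster, plus every noise image (assumption)."""
    kept, seen = [], set()
    for name, label in zip(names, labels):
        if label == -1:
            kept.append(name)
        elif label not in seen:
            seen.add(label)
            kept.append(name)
    return kept


# Hypothetical embeddings: three near-duplicates, two near-duplicates, one unique.
embeddings = {
    "a1.png": [1.0, 0.0, 0.0],
    "a2.png": [0.99, 0.05, 0.0],
    "a3.png": [0.98, 0.0, 0.06],
    "b1.png": [0.0, 1.0, 0.0],
    "b2.png": [0.03, 0.99, 0.0],
    "c1.png": [0.0, 0.0, 1.0],
}
names = list(embeddings)
# min_samples=2 only because the toy dataset is tiny; the documented default is 5.
labels = dbscan([embeddings[n] for n in names], eps=0.2, min_samples=2, dist=cosine_distance)

print(labels)                      # prints: [0, 0, 0, 1, 1, -1]
print(deduplicate(names, labels))  # prints: ['a1.png', 'b1.png', 'c1.png']
```

A smaller `eps` splits clusters more finely and keeps more images, while a larger `min_samples` turns small groups of duplicates into noise.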

Plot the clusters

uv run src/main.py \
  -d path/to/images/directory \
  -o path/to/output/directory \
  --plot

Plot the images in the clusters

uv run src/main.py \
  -d path/to/images/directory \
  -o path/to/output/directory \
  --plot-images

📟 Options

  • -v or --video - Path to the video file to process.
  • -d or --directory - Path to the images directory to process.
  • -o or --output - Path to the output directory where to store the results.
  • -m or --model - Name of the CLIP embedding model to use for semantic de-duplication (default: ViT-B/32).
  • -e or --epsilon - The epsilon value to use for the DBSCAN clustering algorithm (default: 0.2).
  • -s or --min-samples - The minimum number of samples to use for the DBSCAN clustering algorithm (default: 5).
  • -t or --metric - The metric to use for the DBSCAN clustering algorithm (default: cosine).
  • -p or --plot - Whether to plot the clusters or not (default: False).
  • -i or --plot-images - Whether to plot the images in the clusters or not (default: False).
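
For readers who want to script around these options, the list above can be mirrored with `argparse` as follows. This is a sketch reconstructed from the documented flags and defaults, not a copy of `src/main.py`. The program name `piaggio-sketch` is made up, and whether any option is required or mutually exclusive is not stated here, so none is enforced.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="piaggio-sketch")
    parser.add_argument("-v", "--video", help="path to the video file to process")
    parser.add_argument("-d", "--directory", help="path to the images directory to process")
    parser.add_argument("-o", "--output", help="path to the output directory")
    parser.add_argument("-m", "--model", default="ViT-B/32", help="CLIP embedding model name")
    parser.add_argument("-e", "--epsilon", type=float, default=0.2, help="DBSCAN epsilon")
    parser.add_argument("-s", "--min-samples", type=int, default=5, help="DBSCAN minimum samples")
    parser.add_argument("-t", "--metric", default="cosine", help="DBSCAN distance metric")
    parser.add_argument("-p", "--plot", action="store_true", help="plot the clusters")
    parser.add_argument("-i", "--plot-images", action="store_true", help="plot the images in the clusters")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["-d", "images", "-o", "out", "--plot"])
print(args.epsilon, args.min_samples, args.metric, args.plot, args.plot_images)
# prints: 0.2 5 cosine True False
```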
