A clustering tool for de-duplicating near-exact images in videos using vector embeddings and scene segmentation.
- Scene Detection - Uses scene detection to extract transition frames from videos.
- Semantic Fingerprinting - Uses vector embeddings to perform semantic de-duplication of images.
- Technical Frames Detection - Filters out black and white technical frames.
- Image Deduplication - Semantically de-duplicates standalone images in addition to videos.
- Plotting - Plots and visualizes the image clusters.
- Local-first - Runs entirely locally, on GPU or CPU.
Using pip:

```shell
pip install -r requirements.txt
```
Using uv:

```shell
uv sync
```
This application requires ffmpeg/mkvmerge for video splitting support.
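Both are typically available from the system package manager; for example (package names assumed to match your platform):

```shell
# Debian/Ubuntu (mkvmerge ships in the mkvtoolnix package)
sudo apt-get install ffmpeg mkvtoolnix

# macOS (Homebrew)
brew install ffmpeg mkvtoolnix
```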
Piaggio is a semantic image clustering tool that you can run from the command-line to de-duplicate near exact images from videos or a collection of images. It uses vector embeddings to perform semantic de-duplication of images and PySceneDetect to extract transition frames from videos.
Intended use-cases include keyframe extraction from videos (e.g. for thumbnail generation) and semantic de-duplication of images in a dataset, clustering them based not only on pixel resemblance but also on semantic content.
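The core idea can be sketched in a few lines: embed each image, then keep an image only if its cosine distance to every already-kept image exceeds a threshold. Below is a minimal greedy sketch of that idea, with random-looking vectors standing in for real CLIP embeddings; it is an illustration, not Piaggio's actual implementation:

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, epsilon: float = 0.2) -> list[int]:
    """Keep an image only if its cosine distance to every
    already-kept image exceeds epsilon."""
    # L2-normalize so that a dot product equals cosine similarity.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, v in enumerate(e):
        if all(1.0 - float(v @ e[j]) > epsilon for j in kept):
            kept.append(i)
    return kept

# Two near-duplicate vectors and one distinct vector: the
# near-duplicate (index 1) is dropped, indices 0 and 2 survive.
vecs = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(deduplicate(vecs))
```

With a smaller epsilon more near-duplicates survive; with a larger one, more images collapse into a single representative.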
```shell
uv run src/main.py \
  -v path/to/video.mp4 \
  -o path/to/output/directory
```
```mermaid
graph LR
    A[Video] --> B(Scene Detection)
    B --> C(Semantic Fingerprinting)
    C --> D(Technical Frames Filtering)
    D --> E(Clustering)
    E --> F(Deduplication)
```
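As a rough illustration of the first stage, a scene cut can be approximated by thresholding the mean absolute difference between consecutive frames. Piaggio itself relies on PySceneDetect for this; the function and threshold below are illustrative only:

```python
import numpy as np

def scene_cuts(frames: list[np.ndarray], threshold: float = 30.0) -> list[int]:
    """Flag frame i as a cut when it differs strongly from frame i-1."""
    cuts = []
    for i in range(1, len(frames)):
        # Mean absolute per-pixel difference between consecutive frames.
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

# Two identical dark frames followed by a bright frame:
# a single cut is detected at index 2.
frames = [np.zeros((4, 4)), np.zeros((4, 4)), np.full((4, 4), 255.0)]
print(scene_cuts(frames))  # [2]
```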
Install yt-dlp locally to download videos from YouTube.
ℹ️ This is only provided as an example for research purposes; use responsibly according to YouTube's terms of service.
```shell
# Download video and encode as MP4.
yt-dlp \
  -S res,ext:mp4:m4a \
  --recode mp4 \
  'https://www.youtube.com/watch?v=<video-id>'
```
```shell
# Extract keyframes.
uv run src/main.py \
  -v path/to/video.mp4 \
  -o path/to/output/directory
```
In this example, we're trying this NetworkChuck video, which is 1.4 GB in size, 34 minutes long, and contains 62,836 frames in total. Piaggio reduced it to only 22 images after clustering. Below are some of the extracted keyframes from the semantic cluster.
*(Extracted keyframe images from the semantic cluster.)*
```shell
uv run src/main.py \
  -d path/to/images/directory \
  -o path/to/output/directory
```
```mermaid
graph LR
    A[Images] --> B(Semantic Fingerprinting)
    B --> C(Clustering)
    C --> D(Deduplication)
```
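For the clustering stage, a toy DBSCAN over a precomputed distance matrix shows the role of the epsilon and min-samples options. This is a simplified sketch for intuition; the real tool presumably uses a library implementation:

```python
import numpy as np

def dbscan(dist: np.ndarray, eps: float = 0.2, min_samples: int = 5) -> np.ndarray:
    """Toy DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = np.full(n, -1)
    cluster = -1
    for i in range(n):
        if labels[i] != -1:
            continue  # already assigned to a cluster
        neighbors = list(np.where(dist[i] <= eps)[0])
        if len(neighbors) < min_samples:
            continue  # not a core point; stays noise unless reached later
        cluster += 1
        labels[i] = cluster
        queue = neighbors
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                nj = list(np.where(dist[j] <= eps)[0])
                if len(nj) >= min_samples:
                    queue.extend(nj)  # j is also a core point; keep expanding
    return labels

# Three points within eps of each other form one cluster;
# the far-away point (5.0) is labeled noise (-1).
pts = np.array([0.0, 0.05, 0.1, 5.0])
dist = np.abs(pts[:, None] - pts[None, :])
print(dbscan(dist, eps=0.2, min_samples=2))
```

Raising `eps` merges more images into the same cluster; raising `min_samples` demotes small, isolated groups to noise.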
```shell
uv run src/main.py \
  -d path/to/images/directory \
  -o path/to/output/directory \
  --plot
```
```shell
uv run src/main.py \
  -d path/to/images/directory \
  -o path/to/output/directory \
  --plot-images
```
- `-v` or `--video` - Path to the video file to process.
- `-d` or `--directory` - Path to the images directory to process.
- `-o` or `--output` - Path to the output directory where the results are stored.
- `-m` or `--model` - Name of the CLIP embedding model to use for semantic de-duplication (default: `ViT-B/32`).
- `-e` or `--epsilon` - The epsilon value to use for the DBSCAN clustering algorithm (default: `0.2`).
- `-s` or `--min-samples` - The minimum number of samples to use for the DBSCAN clustering algorithm (default: `5`).
- `-t` or `--metric` - The metric to use for the DBSCAN clustering algorithm (default: `cosine`).
- `-p` or `--plot` - Whether to plot the clusters (default: `False`).
- `-i` or `--plot-images` - Whether to plot the images in the clusters (default: `False`).