Example
This is a full-scale example of how to use vitrivr-engine to index a collection of images and videos, and it serves as a starting point for advanced users of vitrivr-engine. Previous knowledge of multimedia retrieval and vitrivr-engine is beneficial; however, the aim is that even novices can use vitrivr-engine with only this tutorial.
This tutorial is aimed at users, e.g. people with a multimedia collection who want to index it. It has three goals:
- A quick reference for vitrivr-engine ingestion and retrieval
- Thoughts and design choices for schema, ingestion and retrieval
- Real-world example, in contrast to other documentation in this wiki, which is more abstract
Having a multimedia collection (videos and images, for the sake of this tutorial) is great; however, the means to explore and search within (large) collections are still rather limited. vitrivr-engine, a general-purpose content-based multimedia retrieval engine, provides ingestion (i.e. analysing the content and storing this information for efficient use) and retrieval (i.e. using the previously gathered information to find items in the collection), which can improve the understanding and usability of the collection.
While not a requirement, reading and following the Getting Started guide is beneficial. Additionally, the introduction of the Documentation wiki page is also helpful.
Technical requirements are as follows:
- JDK 21 or higher, e.g. OpenJDK
- CottontailDB at least v0.16.5
- The example collection, consisting of CC-0 videos and images. This is arguably a small collection; a real-world multimedia collection would be significantly larger.
In case no release exists, building vitrivr-engine from source is required.
- Start CottontailDB on the default port `1865`
- Build vitrivr-engine (from the root of the repository):

  Unix:

  ```shell
  ./gradlew distZip
  ```

  Windows:

  ```shell
  .\gradlew.bat distZip
  ```

- Unzip the distribution, e.g.

  ```shell
  unzip -d ../instance/ vitrivr-engine-module-server/build/distribution/vitrivr-engine-server-0.0.1-SNAPSHOT.zip
  ```

- Prepare the media data in a folder called `example/media`
By now, you should have the following folder structure:
```
+ vitrivr-engine/
+ instance/
|  + vitrivr-engine-server-0.0.1-SNAPSHOT/
|     + bin/
|     + lib/
+ example/
|  + media/
|  |  + images/
|  |  + videos/
|  - README.md
+ cottontaildb/
```

The `cottontaildb` folder is optional and might contain either the DBMS or the repository. We will not delve deeper into the CottontailDB setup.
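The layout above can also be prepared with a few lines of script. The following is a minimal sketch (paths taken from the tree above; `exist_ok=True` makes re-runs harmless) — it only creates the empty folders, the media files still have to be placed there manually:

```python
import os

# Create the expected folder layout (paths from the tree above).
# Run this from the directory that contains vitrivr-engine/.
for d in ("instance", "example/media/images", "example/media/videos"):
    os.makedirs(d, exist_ok=True)
```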
Since we have images and videos with a rather diverse set of styles, we aim at extracting as much content-based information as possible. Therefore, we set up the schema accordingly.
The schema fields in detail:

| Field | Type | Description | Module |
|---|---|---|---|
| `averagecolor` | Vector, length: 3 | The most basic feature, included for completeness' sake | core |
| `clip` | Vector, length: 512 | CLIP-based dense embedding, enables textual / concept search | fes |
| `file` | Structural | Metadata for the file | core |
| `whisper` | Textual | ASR: OpenAI Whisper deep-learning-based subtitle analysis | fes |
| `ocr` | Textual | OCR: text recognition for both images and videos; for videos only on key frames | fes |
| `dino` | Vector, length: 384 | DINO-based dense embedding, predominantly for query-by-example | fes |
| `time` | Structural | Temporal metadata for time-based media (e.g. video, audio) | core |
| `video` | Structural | Metadata for videos, e.g. resolution, FPS, ... | core |
The fes module depends on the Feature Extraction Server (FES), a microservice for extraction and queries using pre-trained deep learning models. There is a list of available tasks, and the README explains the setup.
For the sake of this tutorial, we assume that a FES instance is running on the same machine, reachable at http://127.0.0.1:8888 (the default port, following the FES instructions).
This is the schema we use:
```json
{
  "schemas": [
    {
      "name": "example",
      "connection": {
        "database": "CottontailConnectionProvider",
        "parameters": {
          "Host": "127.0.0.1",
          "port": "1865"
        }
      },
      "fields": [
        {
          "name": "averagecolor",
          "factory": "AverageColor"
        },
        {
          "name": "file",
          "factory": "FileSourceMetadata"
        },
        {
          "name": "clip",
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://127.0.0.1:8888",
            "model": "open-clip-vit-b32",
            "length": "512"
          }
        },
        {
          "name": "dino",
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://127.0.0.1:8888/",
            "model": "dino-v2-vits14",
            "length": "384"
          }
        },
        {
          "name": "whisper",
          "factory": "ASR",
          "parameters": {
            "host": "http://127.0.0.1:8888/",
            "model": "whisper"
          }
        },
        {
          "name": "ocr",
          "factory": "OCR",
          "parameters": {
            "host": "http://127.0.0.1:8888/",
            "model": "tesseract"
          }
        },
        {
          "name": "time",
          "factory": "TemporalMetadata"
        },
        {
          "name": "video",
          "factory": "VideoSourceMetadata"
        }
      ],
      "resolvers": {
        "disk": {
          "factory": "DiskResolver",
          "parameters": {
            "location": "./example/thumbs"
          }
        }
      },
      "exporters": [
        {
          "name": "thumbnail",
          "factory": "ThumbnailExporter",
          "resolverName": "disk",
          "parameters": {
            "maxSideResolution": "300",
            "mimeType": "JPG"
          }
        }
      ],
      "extractionPipelines": []
    }
  ]
}
```
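Before starting the engine, it can be useful to sanity-check the config file with a quick script. The following is only a sketch, not part of vitrivr-engine: it inlines a trimmed copy of the schema (two fields) to stay self-contained; in practice you would read the actual config file instead.

```python
import json

# Trimmed copy of the schema above (two fields) for illustration;
# in practice, load the real config file instead of this literal.
config = json.loads("""
{
  "schemas": [
    {
      "name": "example",
      "fields": [
        {"name": "averagecolor", "factory": "AverageColor"},
        {"name": "clip", "factory": "DenseEmbedding",
         "parameters": {"host": "http://127.0.0.1:8888",
                        "model": "open-clip-vit-b32", "length": "512"}}
      ]
    }
  ]
}
""")

schema = config["schemas"][0]
fields = {f["name"]: f["factory"] for f in schema["fields"]}
print(schema["name"], fields)
```

A check like this catches JSON syntax errors (e.g. trailing commas) before they surface as engine startup failures.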
To simplify the pipelines, it is beneficial to separate them by media type. Since this tutorial's collection contains images and videos, we define two separate pipelines. Even with a shared schema, not all media types can be analysed for all the fields we have defined. For instance, images have no audio track, so we won't extract ASR from them.
The basic idea behind the image pipeline is the assumption that the FES (feature-extraction-server) microservice handling CLIP, OCR, and DINO can process multiple requests, which may take some time. In the meantime, vitrivr-engine can extract metadata information.
```mermaid
%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#2D373C',
      'primaryTextColor': '#D2EBE9',
      'primaryBorderColor': '#A5D7D2',
      'lineColor': '#D20537',
      'secondaryColor': '#2D373C',
      'edgeLabelBackground': '#000'
    }
  }
}%%
flowchart LR
  direction LR
  e[enumerator] --> d[decoder]
  d --> a[averagecolor]
  d --> c[clip]
  d --> i[dino]
  d --> o[ocr]
  d --> t[thumbnails]
  t --> f[filter]
  a --> f[filter]
  f -->|combine| m[file]
  m --> p[persistence]
  c --> p
  i --> p
  o --> p
  p -->|combine| q[end]
```
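To make the dependency structure of the flowchart explicit, it can be modelled as a small DAG. The stage names below are taken from the diagram; the snippet is only a toy model of the ordering constraints, not the vitrivr-engine API:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Edges from the flowchart above: each stage maps to the stages it depends on.
deps = {
    "decoder": {"enumerator"},
    "averagecolor": {"decoder"},
    "clip": {"decoder"},
    "dino": {"decoder"},
    "ocr": {"decoder"},
    "thumbnails": {"decoder"},
    "filter": {"thumbnails", "averagecolor"},
    "file": {"filter"},
    "persistence": {"file", "clip", "dino", "ocr"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The topological order confirms the intuition above: the decoder feeds all analysers, and persistence can only run once the (potentially slow) FES-backed stages and the metadata branch have both delivered their results.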
Found an issue in the wiki? Post it!
Have a question? Ask it!
Disclaimer: Please keep in mind, vitrivr and vitrivr-engine are predominantly research prototypes.