Code repository for Aereo, an experimental bird’s eye view of the digital collections from the State Library of New South Wales by Mauricio Giraldo Arteaga for the 2019 DX Lab Fellowship.
Used to create object predictions and obtain color palettes from the images:
https://github.com/mgiraldo/image-utils
Used to create the category pixels (when thumbnails are off) and atlases (when thumbnails are on):
https://github.com/mgiraldo/aereo-pixels
To download these files you need to know how to use Amazon S3. The files are located in this bucket:
https://dxlab-fellowship-2019.s3.amazonaws.com/
All files are in their own subfolder in that bucket and mapped in this CSV (85 MB). The CSV has three columns: `id`, `filename`, and `access_pid`. It is the one CSV among those listed below under “File ID to URL mapping” that contains every file.
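As a minimal sketch, the mapping could be loaded into a dictionary once the CSV has been downloaded. This assumes the file starts with a header row naming the three columns; the function name `load_mapping` and the local path you pass in are hypothetical and not part of the repository.

```python
import csv


def load_mapping(path):
    """Return a dict keyed by file id holding (filename, access_pid) tuples.

    Assumption: the CSV begins with a header row "id,filename,access_pid".
    """
    mapping = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            mapping[row["id"]] = (row["filename"], row["access_pid"])
    return mapping
```

Keying by `id` is only one option; if you mostly work with the full colour files, keying by `access_pid` may be more convenient.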
Folder | Description | File count | Size |
---|---|---|---|
csv/ | File ID to URL mapping for categories/full set (CSV) | 22 | 48.1 MB |
colors_output/ | Colour summarizing (full version) | 2,212,318 | 64.3 GB |
colors_minimal/ | Colour summarizing (compact version) | 2,231,480 | 1.3 GB |
predictions/ | Image predictions (4,096 word values, gzipped) | 2,231,222 | 33.8 GB |
similarities/ | Image similarity intermediate data | 81 | 2.9 GB |
150_150/ | Image thumbnails (150x150 pixels) | 2,231,496 | 9.8 GB |
32_32/ | Image thumbnails (32x32 pixels) | 2,238,557 | 3.5 GB |
This is the colour information extracted from every image, such as the histogram and the colour palette. There are two versions: full and compact. The compact version is the one used by Aereo and only includes:
- the five most prominent colours (palette)
- the percentage for each colour
- a text name for each colour
The two groups of files are not consistently named (because long story 😳) and follow this structure:
- Folder: `colors_output`
- Type: JSON
- Naming convention: based on the file `access_pid` (e.g.: `[BUCKET]/colors_output/110000148.json`).

- Folder: `colors_minimal`
- Type: JSON
- Naming convention: based on the file `id` (e.g.: `[BUCKET]/colors_minimal/0A3Z4x84Z0wq.json`).
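Because the two colour folders are keyed differently, a small helper can keep the two conventions apart. This is only a sketch: the function names are hypothetical, and `BUCKET` is assumed to be the bucket URL given earlier without its trailing slash.

```python
# Assumption: [BUCKET] is the bucket URL listed earlier in this README.
BUCKET = "https://dxlab-fellowship-2019.s3.amazonaws.com"


def full_colours_url(access_pid):
    """URL of the full colour summary, which is named after the access_pid."""
    return f"{BUCKET}/colors_output/{access_pid}.json"


def compact_colours_url(file_id):
    """URL of the compact colour summary, which is named after the file id."""
    return f"{BUCKET}/colors_minimal/{file_id}.json"


print(full_colours_url("110000148"))
print(compact_colours_url("0A3Z4x84Z0wq"))
```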
- Folder: `predictions`
- Type: Gzipped JSON
- Naming convention: based on the file `id` (e.g.: `[BUCKET]/predictions/0A3Z4x84Z0wq.json.gz`).
These are the object recognition predictions for every image. Each file holds 4,096 newline-separated values from 0 to 1. The files are saved with a `.json.gz` extension, but the content has no real JSON structure.
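Since the content is plain newline-separated numbers rather than JSON, a JSON parser is not the right tool for reading it. A minimal sketch of reading one downloaded file, assuming every non-empty line is a decimal number that `float()` can parse (the function name `read_predictions` is hypothetical):

```python
import gzip


def read_predictions(path):
    """Read one gzipped predictions file into a list of floats."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Skip blank lines, such as a trailing newline at the end of the file.
        return [float(line) for line in handle if line.strip()]
```

For a file from the bucket, the returned list should contain 4,096 values.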
- Folder: `similarities`
- Type: multiple, see below
These are the similarity calculations for each category of files in Aereo. The internal names of the categories are:
- `archTechDrawings`
- `newspapers`
- `coin`
- `drawings`
- `ephemera`
- `journals`
- `manuscripts`
- `manuscriptMaps`
- `maps`
- `medals`
- `negatives`
- `objects`
- `paintings`
- `photographs`
- `pictures`
- `posters`
- `prints`
- `stamps`
For each category there are four files: three Python pickle files and one text file. The pickle files hold the intermediate steps for converting the 4,096 object recognition values above into the square grid that is used in Aereo. The code for this process is available in this repository, where this file is the one outputting the pickle and text files.
The process includes:
- converting the 4,096 values down to the 300 most informative values via Principal Component Analysis (PCA). Filename: `[CATEGORY]_pca.p` (e.g.: `[BUCKET]/similarities/prints_pca.p`);
- shaping those 300 values into a three-dimensional space using t-distributed Stochastic Neighbor Embedding (t-SNE). I later replaced this with Uniform Manifold Approximation and Projection (UMAP), which is much faster (but kept the original t-SNE anyway). Filename: `[CATEGORY]_[tsne or umap].p` (e.g.: `[BUCKET]/similarities/prints_umap.p`);
- converting the three-dimensional space into a two-dimensional grid using RasterFairy and producing a newline-separated list of `x y` coordinates for each file in the category, starting with a `COLUMNS ROWS` line. The `photographs` category wouldn't work in RasterFairy so it was gridded using Lagrangian Gradient Descent. Filename: `[CATEGORY].txt` (e.g.: `[BUCKET]/similarities/prints.txt`)
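The text file produced by the last step can be read with a few lines of Python. This is a sketch under assumptions: values on each line are separated by whitespace, coordinates are whole numbers (possibly written as `3.0`), and the function name `read_grid` is hypothetical.

```python
def read_grid(path):
    """Return ((columns, rows), [(x, y), ...]) from a [CATEGORY].txt file."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.split() for line in handle if line.strip()]
    # The first line is the "COLUMNS ROWS" header.
    columns, rows = (int(float(value)) for value in lines[0])
    # Every following line is one "x y" grid position.
    cells = [(int(float(x)), int(float(y))) for x, y in lines[1:]]
    return (columns, rows), cells
```

The position of each `(x, y)` pair in the returned list follows the line order of the file, which is what ties a coordinate back to a file in the category.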
There are two folders, one with 150x150 pixel thumbnails (folder `150_150`) and one with 32x32 pixel thumbnails (folder `32_32`). Both have the same naming convention but the 32x32 has PNG files and the 150x150 has JPEG files (I have no good explanation for this difference 😳): `[first_four_characters_of_filename]/filename.[png or jpg]` (e.g.: `[BUCKET]/32_32/1000/10000130.png` and `[BUCKET]/150_150/1000/10000130.jpg`).
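The thumbnail naming convention can be expressed as a small helper. This is a sketch: `thumbnail_url` and `THUMBNAILS` are hypothetical names, `BUCKET` is assumed to be the bucket URL given earlier, and the filename is assumed to be passed without an extension.

```python
# Assumption: [BUCKET] is the bucket URL listed earlier in this README.
BUCKET = "https://dxlab-fellowship-2019.s3.amazonaws.com"

# Folder name and file extension for each thumbnail size.
THUMBNAILS = {32: ("32_32", "png"), 150: ("150_150", "jpg")}


def thumbnail_url(filename, size=150):
    """Build the thumbnail URL for a filename (given without extension)."""
    folder, extension = THUMBNAILS[size]
    # Thumbnails are grouped by the first four characters of the filename.
    return f"{BUCKET}/{folder}/{filename[:4]}/{filename}.{extension}"


print(thumbnail_url("10000130", 32))
print(thumbnail_url("10000130", 150))
```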
See `.env.example` for the necessary environment variables and rename it to `.env.local` so the project runs properly on localhost.
yarn install
yarn serve
yarn build