How should the training data be served? #3

thompsonmj · 2024-07-12T18:50:50Z

The full-sized images used for training come from separate sources: the EOL images are on Hugging Face, while the iNat21 and BIOSCAN image sets are available through their own distribution sources.

However, the images were resized to 224x224 prior to training. For this project, the nearest-neighbor images must be presented in the same format they had during training, so that users get an accurate representation of what the model 'knows'. The best approach is probably to put the full set of webdataset-formatted TAR files into a private Hugging Face space used strictly for serving a handful of images at a time for this project, not as a redistribution method. That setup should provide random access to individual images by filename.
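To make "random access to individual images by filename" concrete, here is a minimal sketch that uses only the Python standard library. It assumes the shards are uncompressed TAR files available on a local path; `build_index` and `read_member` are hypothetical helper names, not part of webdataset or Hugging Face tooling. The idea is to scan each shard once, record where every member's bytes start, and then seek straight to an image on request.

```python
import tarfile


def build_index(tar_path):
    """Map each regular member's name to (data offset, size) in an uncompressed TAR."""
    index = {}
    # Mode "r:" opens the archive without compression, so offsets are real file offsets.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Return the raw bytes of one member without scanning the rest of the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

If the shards are served over HTTP instead, the same offsets could in principle drive byte-range requests, but that depends on the hosting setup and is not shown here.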

Would it also be useful to see the nearest-neighbor images at full original resolution? That would be possible with ratarmount, which could FUSE-mount the contents of each dataset archive (.tar.gz for EOL and iNat, .zip for BIOSCAN) on the API server filesystem, enabling random access to individual full-sized images as well.
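Once an archive is mounted (for example with `ratarmount <archive> <mountpoint>`), its contents look like ordinary files, so the API server only needs a guarded file read. The sketch below assumes such a mount already exists; `read_mounted_image` is a hypothetical helper, and the mount path and image names are placeholders.

```python
from pathlib import Path


def read_mounted_image(mount_root, relative_name):
    """Read one image from a mounted archive, refusing paths outside the mount."""
    root = Path(mount_root).resolve()
    target = (root / relative_name).resolve()
    # Reject names such as "../../etc/passwd" that would escape the mount point.
    if root not in target.parents:
        raise ValueError(f"{relative_name!r} is outside {mount_root!r}")
    return target.read_bytes()
```

The path check matters because the image name would typically come from a request or a database key rather than from trusted code.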

egrace479 · 2024-07-25T15:26:35Z

The Hugging Face solution seems reasonable since our webdataset has about 200 TAR files and Hugging Face supports the webdataset format.

We'll likely want the keys in the vector database to include the TAR file so it knows how to access the images.
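One way such keys could look is sketched below. The `"<tar file>/<image name>"` layout is an assumption made for illustration, not a format settled in this thread, and `make_key` and `parse_key` are hypothetical helpers.

```python
def make_key(shard_name, member_name):
    """Build a lookup key of the assumed form '<tar file>/<image name>'."""
    if not shard_name or "/" in shard_name:
        raise ValueError("shard name must be non-empty and contain no '/'")
    if not member_name:
        raise ValueError("member name must be non-empty")
    return f"{shard_name}/{member_name}"


def parse_key(key):
    """Split a key back into (tar file, image name) at the first '/'."""
    shard_name, sep, member_name = key.partition("/")
    if not sep or not shard_name or not member_name:
        raise ValueError(f"malformed key: {key!r}")
    return shard_name, member_name
```

Splitting at the first `/` keeps image names that contain their own subdirectories intact, at the cost of requiring shard names without slashes.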

egrace479 added the "question" (Further information is requested) label on Jul 25, 2024
