How should the training data be served? #3

Open
thompsonmj opened this issue Jul 12, 2024 · 1 comment
Labels
question Further information is requested

Comments

@thompsonmj
Contributor

The full-sized images used for training come from three sources: the EOL images hosted on Hugging Face, and the iNat21 and BIOSCAN image sets, which are available through their own distribution sources.

However, the images were resized to 224x224 before training. For this project, the nearest-neighbor images must be presented in the same form they had during training, so that users get an accurate representation of what the model 'knows'. This would probably best be done by putting the full set of webdataset-formatted TAR files into a private Hugging Face space used strictly for serving a handful of images at a time for this project, not as a redistribution method. This should provide random access to individual images by filename.
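To make the "random access by filename" idea concrete, here is a minimal sketch using only the Python standard library. It assumes the shards are uncompressed TAR files available as local files; the function names `build_index` and `read_member` are hypothetical and not part of this project. Each shard is scanned once to record where every member's bytes start, and a single image is then read with one seek.

```python
import tarfile


def build_index(tar_path):
    """Scan an uncompressed TAR once and map member name -> (offset, size)."""
    index = {}
    # "r:" opens the archive without transparent decompression.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is the byte position where the file's data begins.
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Return the raw bytes of one member without scanning the archive again."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

The same (offset, size) pairs could in principle be used with ranged reads against remote storage, if the host supports them; that part is an assumption and is not shown here.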

Would it also be useful to see the nearest-neighbor images at full original resolution? That would be possible with ratarmount, which could FUSE-mount the contents of each dataset archive (.tar.gz for EOL and iNat, .zip for BIOSCAN) on the API server's filesystem, enabling random access to individual full-sized images as well.

@egrace479 added the question (Further information is requested) label Jul 25, 2024
@egrace479
Member

The Hugging Face solution seems reasonable since our webdataset has about 200 TAR files and Hugging Face supports the webdataset format.

We'll likely want to have the TAR file/ for the keys in the vector database so it knows how to access the images.
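One way this could look, as a rough sketch only: each vector's key carries the TAR file name plus the name of the image inside it. The comment above does not say what is paired with the TAR file, so the `shard/member` layout, the `ImageRef` type, and the example names below are all hypothetical.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    shard: str   # TAR file name (hypothetical example: "shard-000012.tar")
    member: str  # image name inside that TAR (hypothetical)


def to_key(ref):
    """Encode a reference as a single string key for the vector database."""
    return f"{ref.shard}/{ref.member}"


def from_key(key):
    """Split on the first slash only, so member names may contain slashes."""
    shard, member = key.split("/", 1)
    return ImageRef(shard, member)
```

With a key like this, the API server can go from a nearest-neighbor hit straight to the right TAR file and image without a separate lookup table.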
