Zero-Shot Coreset Selection

(Brent A. Griffin*, Jacob Marks, Jason J. Corso) @ Voxel51

* Corresponding author

Zero-Shot Coreset Selection (ZCore) is a coreset selection method for unlabeled data. Deep learning methods rely on massive datasets, resulting in substantial costs for storage, annotation, and model training. Coreset selection aims to select a subset of the data so that models can be trained at lower cost while ideally performing on par with training on the full dataset. Although the majority of real-world data are unlabeled, previous state-of-the-art coreset methods cannot select from unlabeled data. ZCore addresses this by performing coreset selection without labels and without training on the candidate data. Instead, ZCore uses existing foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. On ImageNet, the ZCore coreset achieves higher accuracy than previous label-based coresets at a 90% prune rate, while removing annotation requirements for 1.15 million images.

Figure: Zero-Shot Coreset Selection overview.
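To make the idea of scoring by coverage and redundancy more concrete, here is a deliberately simplified toy sketch. It is not the scoring implemented in zeroshot_coreset_selection.py: the function name, the sampling scheme, and the penalty formula are all assumptions chosen only for illustration. Each randomly sampled point in the embedding space gives credit to its nearest example (coverage), and that example's closest neighbors lose some score (redundancy).

```python
import math
import random


def toy_importance_scores(embeddings, num_samples=200, num_neighbors=2, seed=0):
    """Toy coverage/redundancy score (hypothetical, NOT the ZCore algorithm)."""
    rng = random.Random(seed)
    n = len(embeddings)
    dim = len(embeddings[0])
    # Bounding box of the embeddings, used to sample points uniformly.
    lows = [min(e[d] for e in embeddings) for d in range(dim)]
    highs = [max(e[d] for e in embeddings) for d in range(dim)]
    scores = [0.0] * n
    for _ in range(num_samples):
        point = [rng.uniform(lows[d], highs[d]) for d in range(dim)]
        # Coverage: the example nearest to the sampled point earns credit.
        nearest = min(range(n), key=lambda i: math.dist(embeddings[i], point))
        scores[nearest] += 1.0
        # Redundancy: the examples closest to the credited one are penalized,
        # more strongly the closer they are.
        others = sorted(
            (i for i in range(n) if i != nearest),
            key=lambda i: math.dist(embeddings[i], embeddings[nearest]),
        )
        for i in others[:num_neighbors]:
            scores[i] -= 1.0 / (1.0 + math.dist(embeddings[i], embeddings[nearest]))
    return scores


# Three duplicate points and one isolated point in a 2-D "embedding space".
demo_points = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (10.0, 10.0)]
demo_scores = toy_importance_scores(demo_points)
```

In this toy setting the isolated point ends up with a higher score than the duplicates, which is the qualitative behavior the description above suggests: examples that cover otherwise empty regions matter more than examples that repeat what is already there.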

Using ZCore

We provide example ZCore commands for coreset selection and subsequent model training on the EuroSAT10 dataset from our paper. See Repeat Trials to repeat experiments over multiple trials and Datasets for full ImageNet, CIFAR, or EuroSAT setup.

Step 1. Dataset. Download and unzip eurosat10.zip in ./data.

Step 2. Zero-Shot Coreset Selection

python zeroshot_coreset_selection.py --dataset eurosat10 --data_dir ./data --results_dir ./results --embedding clip resnet18 --num_workers 10

This step requires FiftyOne to generate embeddings (pip install fiftyone).

Step 3. Train Coreset Model

python train_coreset_model.py --prune_rate 0.7 --dataset eurosat10 --data_dir ./data --score_file ./results/eurosat10/zcore-eurosat10-clip-resnet18-1000Ks-2sd-ri-1000nn-4ex-0/score.npy
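As a rough illustration of how a prune rate and a score file could translate into a coreset, the sketch below keeps the highest-scoring fraction of examples. It assumes that score.npy holds one score per training example and that higher scores mark more important examples; neither assumption is confirmed here, and coreset_indices is a hypothetical helper, not a function from this repository.

```python
def coreset_indices(scores, prune_rate):
    """Return sorted indices of the examples kept after pruning.

    Assumption: higher score = more important, so the lowest-scoring
    `prune_rate` fraction of examples is dropped.
    """
    if not 0.0 <= prune_rate < 1.0:
        raise ValueError("prune_rate must be in [0, 1)")
    num_keep = round(len(scores) * (1.0 - prune_rate))
    # Rank example indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Ten made-up scores; a 0.7 prune rate keeps the top 3 examples.
demo_scores = [0.1, 0.9, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.0]
kept = coreset_indices(demo_scores, prune_rate=0.7)
```

With a real score file, the list could come from something like numpy.load on the score.npy path passed to --score_file, again assuming it is a one-dimensional array.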

Repeat Trials

We provide example scripts to repeat ZCore experiments over multiple trials in ./repeat-trial-scripts.

Repeat ZCore Selections for EuroSAT10

chmod +x ./repeat-trial-scripts/eurosat10-score-x5.sh
./repeat-trial-scripts/eurosat10-score-x5.sh

Repeat Coreset Model Training for EuroSAT10

chmod +x ./repeat-trial-scripts/eurosat10-train-x5.sh
./repeat-trial-scripts/eurosat10-train-x5.sh

We provide example repeat-trial results in ./results/example/eurosat10. To tabulate these trials, run:

python process_repeat_trials.py --base_score_dir ./results/example/eurosat10/zcore-eurosat10-clip-resnet18-1000Ks-2sd-ri-1000nn-4ex

to generate the following table:

Setting p30-s51 p50-s51 p70-s51 p80-s51 p90-s51 

Trial Results
0       93.80   91.93   86.10   80.98   63.63   
1       93.39   91.26   85.74   78.88   65.58   
2       93.63   91.21   87.91   79.84   66.70   
3       93.90   92.38   86.91   79.86   65.16   
4       94.06   92.26   86.47   80.20   67.75   

Aggregate Results
Mean    93.76   91.81   86.63   79.95   65.76   
StdDev  0.230   0.491   0.750   0.677   1.398   
Overall Mean: 83.58 
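The aggregate rows can be checked by hand from the trial rows. The sketch below is not taken from process_repeat_trials.py; it recomputes the statistics with the Python standard library and shows that the StdDev row is consistent with the population standard deviation (dividing by the number of trials, not by one less), and that the overall mean is the mean of the per-setting means.

```python
from statistics import fmean, pstdev

settings = ["p30-s51", "p50-s51", "p70-s51", "p80-s51", "p90-s51"]

# Trial results copied from the table above (one row per trial).
trials = [
    [93.80, 91.93, 86.10, 80.98, 63.63],
    [93.39, 91.26, 85.74, 78.88, 65.58],
    [93.63, 91.21, 87.91, 79.84, 66.70],
    [93.90, 92.38, 86.91, 79.86, 65.16],
    [94.06, 92.26, 86.47, 80.20, 67.75],
]

# Transpose so that each tuple holds all trials for one setting.
columns = list(zip(*trials))

means = [fmean(col) for col in columns]
stds = [pstdev(col) for col in columns]  # population standard deviation
overall_mean = fmean(means)

for name, mean, std in zip(settings, means, stds):
    print(f"{name}: mean={mean:.2f} std={std:.3f}")
print(f"Overall Mean: {overall_mean:.2f}")
```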

Datasets

ImageNet can be downloaded here and subsequently reformatted using:

cd ./ILSVRC/Data/CLS-LOC/val/
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
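After reformatting, it can be worth confirming that the validation images ended up in per-class subfolders. The following is a minimal sketch, assuming the script leaves one folder per class containing .JPEG files; count_images_per_class is a hypothetical helper and the demo runs on a small fake directory tree instead of the real dataset. For the standard ILSVRC2012 validation set, one would expect 1,000 class folders with 50 images each and no loose images left at the top level.

```python
import tempfile
from pathlib import Path


def count_images_per_class(val_dir, suffix=".JPEG"):
    """Count images in each class subfolder and loose images at the top level."""
    val_dir = Path(val_dir)
    counts = {}
    for class_dir in sorted(p for p in val_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir() if f.suffix == suffix
        )
    # Images still sitting directly in val_dir were not moved into a class folder.
    loose = sum(1 for f in val_dir.iterdir() if f.is_file() and f.suffix == suffix)
    return counts, loose


# Demo on a tiny fake tree: two class folders with three images each,
# plus one stray image that was never moved.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for wnid in ("n01440764", "n01443537"):
        (root / wnid).mkdir()
        for k in range(3):
            (root / wnid / f"img_{k}.JPEG").touch()
    (root / "stray.JPEG").touch()
    demo_counts, demo_loose = count_images_per_class(root)
```

On the real data, the call would point at ./ILSVRC/Data/CLS-LOC/val/ and a nonzero loose count would suggest the reformatting step did not complete.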

CIFAR10 and CIFAR100 can be downloaded here.

EuroSAT80, EuroSAT40, EuroSAT20, and EuroSAT10 can be downloaded here.

Citation

If you find this code useful, please consider citing our paper:

@article{griffin24zcore,
  title={Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data},
  author={Griffin, Brent A and Marks, Jacob and Corso, Jason J},
  journal={arXiv preprint arXiv:2411.15349},
  year={2024}
}

You may also want to check out our open-source toolkit, FiftyOne, which provides a powerful interface for exploring, analyzing, and visualizing datasets for computer vision and machine learning.