VizWiz VQA: Course Project for Multi-Modal Machine Learning

Running Instructions

  1. Download data (a quick sanity check is sketched after this step):

Download the skill data (run from the repository root):

cd data/skill
bash download_data.sh

Download the VQA data (again from the repository root):

cd data/VQA
bash download_data.sh
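
A minimal sanity-check sketch in Python, run from the repository root. It only assumes the two data directories above exist and contain something besides the download scripts; the exact file names fetched are up to each download_data.sh:

# Hedged sketch: verifies each data directory is non-empty after download.
from pathlib import Path

for d in ("data/skill", "data/VQA"):
    files = [p for p in Path(d).iterdir() if p.name != "download_data.sh"]
    print(f"{d}: {len(files)} file(s) present")
    assert files, f"no data found in {d}; re-run its download_data.sh"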
  2. Run model (SkillCLIP) variants:

With all components (skill embeddings, object tags, and scene text):

python -m src.main_model.clip_late_fusion -t -de "cuda:0" -exp skill_aware_clip

Without skill embeddings:

python -m src.main_model.clip_late_fusion -t -de "cuda:0" -exp skill_unaware_clip

Without object tags:

python -m src.main_model.clip_late_fusion -t -de "cuda:0" -exp skill_aware_clip_nobj -nobj

Without scene text:

python -m src.main_model.clip_late_fusion -t -de "cuda:0" -exp skill_aware_clip_nsctxt -nsctxt

With multi-task training:

python -m src.main_model.clip_multitasking -t -de "cuda:0" -exp skill_aware_clip_multitasking -pred_file pred.json
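
The multi-task run writes its answers to pred.json via -pred_file. A minimal scoring sketch for that output, assuming pred.json maps image names to predicted answers and that VizWiz-style annotations (10 crowd answers per question) are available locally; the annotation path and both JSON layouts are assumptions, not necessarily this repository's actual format:

import json

# Hedged sketch: scores pred.json against VizWiz-style crowd annotations.
def vqa_accuracy(pred, human_answers):
    # Standard VQA metric: full credit when at least 3 of the 10 crowd
    # answers match the prediction, partial credit below that.
    return min(sum(a == pred for a in human_answers) / 3.0, 1.0)

preds = json.load(open("pred.json"))                       # assumed layout: {"image_name": "answer", ...}
annots = json.load(open("data/VQA/val_annotations.json"))  # hypothetical path and layout

scores = [vqa_accuracy(preds[q["image"]], [a["answer"] for a in q["answers"]])
          for q in annots if q["image"] in preds]
print(f"VQA accuracy: {sum(scores) / max(len(scores), 1):.4f} on {len(scores)} questions")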

Interesting Object Detections

Keys of a keyboard are detected as microwaves with relatively high confidence scores:

  1. Path: val_objects_detected/VizWiz_val_00001474_objects.png
     Potential reason: the image is heavily zoomed in, which may be out of distribution for the detector.

Illustrative Examples

Here are some illustrative examples from our error analysis (FusionCLIP refers to the SkillCLIP model without the skill embeddings):

[Error-analysis table, rows 1-5: comparison between our model (SkillCLIP) and FusionCLIP]

Some more examples:

[Qualitative examples 1 and 2]