
Commit

tested codebase

Rahul Thapa committed Oct 13, 2024
1 parent 6ed24a3 commit db139b4
Showing 8 changed files with 137 additions and 116 deletions.
179 changes: 114 additions & 65 deletions LICENSE

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions README.md
@@ -1,9 +1,10 @@
<div align="center">
<img src="assets/dragonfly_icon.png" alt="Dragonfly" style="width: 150px; display: block; margin-left: auto; margin-right: auto;" />
<h1>Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model</h1>
<h1>Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models</h1>
</div>

## 🔥 News
- **Note**: We have updated our codebase and arXiv paper with an improved version of the Dragonfly architecture. If you still want to use the old version of the code, it is available in the [github branch](link).
- [Our paper](https://arxiv.org/abs/2406.00977) is out on arXiv.
- Our model checkpoints are out on Hugging Face 🤗 🚀:
- General: [`togethercomputer/Llama-3.1-8B-Dragonfly-v1`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-v1)
@@ -14,7 +15,7 @@

![Dragonfly framework](assets/model_overview.png)

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, existing vision transformers (ViTs) often struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we go beyond recent high-resolution and multi-crop techniques by not only preserving the native resolution but also zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. On average, Dragonfly ranks at the top across ten general-domain benchmarks, outperforming models that are significantly larger or trained on much larger datasets. Our biomedical version, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6\% accuracy on SLAKE (compared to 84.8\% for Med-Gemini), a 67.1\% token F1 score on Path-VQA (compared to 62.7\% for Med-PaLM M), and attains state-of-the-art results across the majority of performance metrics. Overall, our work establishes a new paradigm for extracting high-resolution fine-grained features from images, significantly enhancing the capabilities of VLMs in both general and specialized domains.
Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we go beyond recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical version, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and attains state-of-the-art results across the majority of performance metrics. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
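
As a rough illustration of the token-aggregation idea described above (not the repository's actual implementation), the sketch below mean-pools the visual tokens of each zoomed-in sub-crop before concatenating them with the native-resolution tokens; all tensor shapes and variable names are hypothetical.

```python
import torch

# Hypothetical shapes: each of the N zoomed-in sub-crops yields T visual tokens of width D.
batch, num_sub_crops, tokens_per_crop, hidden_dim = 1, 24, 576, 1024
sub_crop_tokens = torch.randn(batch, num_sub_crops, tokens_per_crop, hidden_dim)

# Mean-pool over each sub-crop's tokens so every zoomed-in crop contributes a compact
# summary instead of hundreds of tokens, keeping the visual sequence length manageable.
pooled_sub_crops = sub_crop_tokens.mean(dim=2)            # (batch, num_sub_crops, hidden_dim)

# Native-resolution tokens are kept as-is and concatenated with the pooled summaries
# before being projected into the language model's embedding space.
native_tokens = torch.randn(batch, tokens_per_crop, hidden_dim)
visual_sequence = torch.cat([native_tokens, pooled_sub_crops], dim=1)  # (batch, 600, hidden_dim)
```

Mean pooling is parameter-free, which is why the abstract describes it as a simple yet effective way to manage the increased token count; the exact pooling layout used by Dragonfly is defined in the paper and code.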


# 📖 Table of Contents
@@ -26,7 +27,6 @@ Recent advances in vision-language models (VLMs) have demonstrated the advantage
6. [BibTeX](#bibtex)
7. [Licence](#license)


<a name="installation"/>

## 💿 Installation
@@ -51,7 +51,7 @@ pip install --upgrade -e .

## 🏁 Checkpoint

*Note: These models are released under [Llama 3 Community License Agreement](LICENSE)*
*Note: These models are released under [Llama 3.1 Community License Agreement](LICENSE)*

We release two Hugging Face model checkpoints: [`togethercomputer/Llama-3.1-8B-Dragonfly-v1`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-v1) and [`togethercomputer/Llama-3.1-8B-Dragonfly-Med-v1`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-Med-v1). Please follow the script [`test_dragonfly.py`](test_dragonfly.py) for more details. We provide a brief description of how to use them below.

@@ -65,7 +65,7 @@ We provide two test examples inside [`test_images`](test_images).

Question: What is so funny about this image?

![Skateboard](test_images/monalisa_dog.jpg)
![Monalisa Dog](test_images/monalisa_dog.jpg)

Load necessary packages
```python
@@ -99,7 +99,7 @@ image = image.convert("RGB")
images = [image]
# images = [None] # if you do not want to pass any images

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nSummarize the visual content of the image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
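
# --- Hypothetical continuation (illustration only): the remainder of this snippet is
# collapsed in the diff. Assuming `torch` and `model` come from the collapsed portion
# above and follow the standard Hugging Face generate()/batch_decode API, decoding
# might look roughly like this:
with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.batch_decode(generation_output, skip_special_tokens=True)[0]
print(response)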
Binary file added assets/dragonfly_icon.png
1 change: 0 additions & 1 deletion environment.yml
@@ -3,7 +3,6 @@ channels:
- defaults
dependencies:
- python=3.10
- conda-forge::openjdk
- pip
- pip:
- -r requirements.txt
45 changes: 8 additions & 37 deletions requirements.txt
@@ -1,40 +1,11 @@
accelerate>=0.19.0
braceexpand>=0.1.7
einops>=0.6.1
einops_exts>=0.0.4
fastapi>=0.95.2
gradio>=3.33.1
huggingface_hub>=0.13.3
importlib_metadata>=6.6.0
inflection>=0.5.1
markdown2>=2.4.8
more_itertools>=9.1.0
nltk>=3.8.1
numpy>=1.23.5
open_clip_torch>=2.16.0
openai>=1.1.1
opencv_python_headless>=4.5.5.64
Pillow>=9.5.0
pycocoevalcap>=1
pycocotools>=2.0.6
Requests>=2.31.0
scipy>=1.10.1
timm>=0.9.2
tqdm>=4.65.0
transformers==4.35.1
uvicorn>=0.22.0
webdataset>=0.2.48
natsort>=8.4.0
peft>=0.4.0
ijson>=3.2.3
yajl>=0.3.5
deepspeed>=0.10.0
wandb>=0.15.8
trl>=0.5.0
cffi>=1.15.1
pyyaml>=6.0.1
pytest>=7.4.2
prettytable>=3.9.0
torch
transformers
numpy
Pillow
datasets
wandb
tqdm
accelerate
deepspeed
packaging
ninja
12 changes: 7 additions & 5 deletions test_dragonfly.py
@@ -8,7 +8,6 @@
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed


def format_text(text, system_prompt=""):
instruction = f"{system_prompt} {text}" if system_prompt else text
prompt = f"<|start_header_id|>user<|end_header_id|>\n\n" f"{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
@@ -24,16 +23,19 @@ def format_text(text, system_prompt=""):
torch.backends.cuda.enable_flash_sdp(True)

# set your model name and image path
pretrained_model_name_or_path = "togethercomputer/Llama-3.1-8B-Dragonfly-v1"
image_path = "./test_images/monalisa_dog.jpg"
question = "What is so funny about this image?"
# pretrained_model_name_or_path = "togethercomputer/Llama-3.1-8B-Dragonfly-v1"
# image_path = "./test_images/monalisa_dog.jpg"
# question = "What is so funny about this image?"

pretrained_model_name_or_path = "togethercomputer/Llama-3.1-8B-Dragonfly-Med-v1"
image_path = "./test_images/ROCO_04197.jpg"
question = "Provide a brief description of the given image."

# parameters
device = "cuda:0"
seed = 42
temperature = 0


def main():
random_seed(seed)

2 changes: 1 addition & 1 deletion train_dragonfly_stage1.sh
@@ -15,7 +15,7 @@ accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_
--image_dir <"your_image_folder"> \
--together_hq_datasets <"your_datasets"> \
--logging_steps 1000 \
--max_seq_length 2048 \
--max_seq_length 4096 \
--checkpointing_steps 5000 \
--image_encoder_name_or_path openai/clip-vit-large-patch14-336 \
--text_pretrained_model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
2 changes: 1 addition & 1 deletion train_dragonfly_stage2.sh
@@ -18,7 +18,7 @@ accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_
--together_math_datasets <"your_math_datasets"> \
--text_dataset_prob 0.1 \
--logging_steps 100 \
--max_seq_length 2048 \
--max_seq_length 4096 \
--checkpointing_steps 10000 \
--save_hf_checkpoints \
--total_hf_checkpoint_limits 8 \
