
Commit

tested codebase

Rahul Thapa committed Oct 13, 2024
1 parent 6ed24a3 commit db139b4
Showing 8 changed files with 137 additions and 116 deletions.
179 changes: 114 additions & 65 deletions LICENSE

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions README.md
@@ -1,9 +1,10 @@
<div align="center">
<img src="assets/dragonfly_icon.png" alt="Dragonfly" style="width: 150px; display: block; margin-left: auto; margin-right: auto;" />
<h1>Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model</h1>
<h1>Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models</h1>
</div>

## 🔥 News
- **Note**: We have updated our codebase and arXiv paper with an improved version of the Dragonfly architecture. If you still want to use the old version of the code, it is available in the [github branch](link).
- [Our paper](https://arxiv.org/abs/2406.00977) is out on arXiv.
- Our model checkpoints are out on Hugging Face 🤗 🚀:
- General: [`togethercomputer/Llama-3.1-8B-Dragonfly-v1`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-v1)
@@ -14,7 +15,7 @@

![Dragonfly framework](assets/model_overview.png)

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, existing vision transformers (ViTs) often struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we go beyond recent high-resolution and multi-crop techniques by not only preserving the native resolution but also zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. On average, Dragonfly ranks at the top across ten general-domain benchmarks, outperforming models that are significantly larger or trained on much larger datasets. Our biomedical version, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6\% accuracy on SLAKE (compared to 84.8\% for Med-Gemini), a 67.1\% token F1 score on Path-VQA (compared to 62.7\% for Med-PaLM M), and attains state-of-the-art results across the majority of performance metrics. Overall, our work establishes a new paradigm for extracting high-resolution fine-grained features from images, significantly enhancing the capabilities of VLMs in both general and specialized domains.
Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we go beyond recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical version, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and attains state-of-the-art results across the majority of performance metrics. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
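
As a rough illustration of the token-aggregation idea described above (not the repository's actual implementation), the sketch below mean-pools the visual tokens of each zoomed-in sub-crop before concatenating them with the native-resolution tokens; all tensor shapes and variable names are hypothetical.

```python
import torch

# Hypothetical shapes: each of the N zoomed-in sub-crops yields T visual tokens of width D.
batch, num_sub_crops, tokens_per_crop, hidden_dim = 1, 24, 576, 1024
sub_crop_tokens = torch.randn(batch, num_sub_crops, tokens_per_crop, hidden_dim)

# Mean-pool over each sub-crop's tokens so every zoomed-in crop contributes a compact
# summary instead of hundreds of tokens, keeping the visual sequence length manageable.
pooled_sub_crops = sub_crop_tokens.mean(dim=2)            # (batch, num_sub_crops, hidden_dim)

# Native-resolution tokens are kept as-is and concatenated with the pooled summaries
# before being projected into the language model's embedding space.
native_tokens = torch.randn(batch, tokens_per_crop, hidden_dim)
visual_sequence = torch.cat([native_tokens, pooled_sub_crops], dim=1)  # (batch, 600, hidden_dim)
```

Mean pooling is parameter-free, which is why the abstract describes it as a simple yet effective way to manage the increased token count; the exact pooling layout used by Dragonfly is defined in the paper and code.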


# 📖 Table of Contents
@@ -26,7 +27,6 @@ Recent advances in vision-language models (VLMs) have demonstrated the advantage
6. [BibTeX](#bibtex)
7. [Licence](#license)


<a name="installation"/>

## 💿 Installation
@@ -51,7 +51,7 @@ pip install --upgrade -e .

## 🏁 Checkpoint

*Note: These models are released under [Llama 3 Community License Agreement](LICENSE)*
*Note: These models are released under [Llama 3.1 Community License Agreement](LICENSE)*

We release two Hugging Face model checkpoints: [`togethercomputer/Llama-3.1-8B-Dragonfly-v1`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-v1) and [`togethercomputer/Llama-3.1-8B-Dragonfly-Med-v1`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-Med-v1). Please follow the script [`test_dragonfly.py`](test_dragonfly.py) for more details. We provide a brief description of how to use them below.

@@ -65,7 +65,7 @@ We provide two test examples inside [`test_images`](test_images).

Question: What is so funny about this image?

![Skateboard](test_images/monalisa_dog.jpg)
![Monalisa Dog](test_images/monalisa_dog.jpg)

Load necessary packages
```python
@@ -99,7 +99,7 @@ image = image.convert("RGB")
images = [image]
# images = [None] # if you do not want to pass any images

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nSummarize the visual content of the image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
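
# --- Hypothetical continuation (illustration only): the remainder of this snippet is
# collapsed in the diff. Assuming `torch` and `model` come from the collapsed portion
# above and follow the standard Hugging Face generate()/batch_decode API, decoding
# might look roughly like this:
with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.batch_decode(generation_output, skip_special_tokens=True)[0]
print(response)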
Binary file added assets/dragonfly_icon.png
1 change: 0 additions & 1 deletion environment.yml
@@ -3,7 +3,6 @@ channels:
- defaults
dependencies:
- python=3.10
- conda-forge::openjdk
- pip
- pip:
- -r requirements.txt
45 changes: 8 additions & 37 deletions requirements.txt
@@ -1,40 +1,11 @@
accelerate>=0.19.0
braceexpand>=0.1.7
einops>=0.6.1
einops_exts>=0.0.4
fastapi>=0.95.2
gradio>=3.33.1
huggingface_hub>=0.13.3
importlib_metadata>=6.6.0
inflection>=0.5.1
markdown2>=2.4.8
more_itertools>=9.1.0
nltk>=3.8.1
numpy>=1.23.5
open_clip_torch>=2.16.0
openai>=1.1.1
opencv_python_headless>=4.5.5.64
Pillow>=9.5.0
pycocoevalcap>=1
pycocotools>=2.0.6
Requests>=2.31.0
scipy>=1.10.1
timm>=0.9.2
tqdm>=4.65.0
transformers==4.35.1
uvicorn>=0.22.0
webdataset>=0.2.48
natsort>=8.4.0
peft>=0.4.0
ijson>=3.2.3
yajl>=0.3.5
deepspeed>=0.10.0
wandb>=0.15.8
trl>=0.5.0
cffi>=1.15.1
pyyaml>=6.0.1
pytest>=7.4.2
prettytable>=3.9.0
torch
transformers
numpy
Pillow
datasets
wandb
tqdm
accelerate
deepspeed
packaging
ninja
12 changes: 7 additions & 5 deletions test_dragonfly.py
@@ -8,7 +8,6 @@
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed


def format_text(text, system_prompt=""):
instruction = f"{system_prompt} {text}" if system_prompt else text
prompt = f"<|start_header_id|>user<|end_header_id|>\n\n" f"{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
@@ -24,16 +23,19 @@ def format_text(text, system_prompt=""):
torch.backends.cuda.enable_flash_sdp(True)

# set your model name and image path
pretrained_model_name_or_path = "togethercomputer/Llama-3.1-8B-Dragonfly-v1"
image_path = "./test_images/monalisa_dog.jpg"
question = "What is so funny about this image?"
# pretrained_model_name_or_path = "togethercomputer/Llama-3.1-8B-Dragonfly-v1"
# image_path = "./test_images/monalisa_dog.jpg"
# question = "What is so funny about this image?"

pretrained_model_name_or_path = "togethercomputer/Llama-3.1-8B-Dragonfly-Med-v1"
image_path = "./test_images/ROCO_04197.jpg"
question = "Provide a brief description of the given image."

# parameters
device = "cuda:0"
seed = 42
temperature = 0


def main():
random_seed(seed)

2 changes: 1 addition & 1 deletion train_dragonfly_stage1.sh
@@ -15,7 +15,7 @@ accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_
--image_dir <"your_image_folder"> \
--together_hq_datasets <"your_datasets"> \
--logging_steps 1000 \
--max_seq_length 2048 \
--max_seq_length 4096 \
--checkpointing_steps 5000 \
--image_encoder_name_or_path openai/clip-vit-large-patch14-336 \
--text_pretrained_model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
2 changes: 1 addition & 1 deletion train_dragonfly_stage2.sh
@@ -18,7 +18,7 @@ accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_
--together_math_datasets <"your_math_datasets"> \
--text_dataset_prob 0.1 \
--logging_steps 100 \
--max_seq_length 2048 \
--max_seq_length 4096 \
--checkpointing_steps 10000 \
--save_hf_checkpoints \
--total_hf_checkpoint_limits 8 \
