Gamified crowd-sourcing of high-quality data for visual fine-tuning
Join our Telegram
GAP-VQA Dataset 🤗 | MiniCPM-Llama3-V-2.5-8B fine-tuned on GAP 🤗 | GAP Technical Report
- [2024.10.10] Our paper "Gamified Adversarial Prompting (GAP) - A Framework for Crowd-Sourcing High-Quality Data for Visual Fine-Tuning" is now available on arXiv.
- [2024.10.05] We open-source the GAP-VQA dataset and fine-tuned models! Try them now on Hugging Face!
- [2024.09.16] We've reached over 50,000 participants in our GAP platform!
- Introduction
- Results and Achievements
- User Participation
- Usage and Implementation
- Future Work
- Contributing and Community
- License and Citation
Gamified Adversarial Prompting (GAP) is a groundbreaking framework that revolutionizes the collection of high-quality data for visual instruction tuning of large multimodal models. By transforming data collection into an engaging game, GAP motivates participants to uncover and challenge the limitations of AI models, leading to unprecedented improvements in model performance.
- Gamified Data Collection: An immersive platform where participants earn rewards by identifying AI model weaknesses.
- Highly Scalable Framework: Rapidly engaged over 50,000 participants within weeks, demonstrating exceptional growth potential.
- Dramatic Model Improvements: Achieved a remarkable increase in GPT score for MiniCPM-Llama3-V-2.5-8B from 0.147 to 0.477.
- Universal Cross-Model Benefits: Demonstrated significant improvements across various model architectures, indicating broad applicability.
- Cutting-edge Web3 Integration: Leveraging blockchain technology for transparent rewards and true contributor ownership.
At the heart of our framework is an engaging game that challenges players to outsmart AI models:
- Players are presented with a series of images, each for a maximum of 120 seconds.
- For each image, the player's goal is to formulate a question that the AI model will answer incorrectly.
- Players can ask multiple questions per image, strategically probing for model weaknesses.
- Points are awarded based on the player's ability to identify questions that reveal gaps in the AI's knowledge.
- The game includes both "tainted" (simple) and "untainted" (complex) images, creating a varied and challenging experience.
This gamified approach not only makes data collection enjoyable but also naturally guides players towards uncovering genuine model limitations, resulting in an exceptionally high-quality dataset.
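For readers who prefer code, here is a minimal sketch of the session flow described above. All names (`player`, `model`, `judge`, `ask_question`, `is_correct`) are hypothetical placeholders rather than the platform's actual API, and the scoring is simplified.

```python
import time

IMAGES_PER_SESSION = 10          # a session presents 10 images (see participation stats below)
MAX_SECONDS_PER_IMAGE = 120      # each image is shown for at most 120 seconds

def play_session(images, model, player, judge):
    """Toy sketch of one GAP session: the player tries to stump the model on each image."""
    score = 0
    collected = []                                        # candidate GAP-VQA records
    for image in images[:IMAGES_PER_SESSION]:
        deadline = time.time() + MAX_SECONDS_PER_IMAGE
        while time.time() < deadline:
            question = player.ask_question(image)         # player probes for a weakness
            if question is None:                          # player moves on to the next image
                break
            answer = model.answer(image, question)
            stumped = not judge.is_correct(image, question, answer)
            if stumped:
                score += 1                                # reward for exposing a model gap
            collected.append((image, question, answer, stumped))
    return score, collected
```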
Our GAP framework has produced remarkable improvements across multiple models and datasets, showcasing its power and versatility:
Model | Pre-fine-tuning GPT Score | Post-fine-tuning GPT Score | Improvement |
---|---|---|---|
GPT-4V (Benchmark) | 0.637 | - | - |
MiniCPM-Llama3-V-2.5-8B | 0.147 | 0.477 | +0.330 |
Qwen2-VL-2B | 0.169 | 0.285 | +0.116 |
Qwen2-VL-7B | 0.207 | 0.250 | +0.043 |
The fine-tuned MiniCPM-Llama3-V-2.5-8B model more than tripled its GPT score (a 0.330 absolute gain, roughly 3.2× the pre-fine-tuning value), dramatically closing the gap with the GPT-4V benchmark. This leap in performance demonstrates the quality of the GAP-VQA dataset.
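For clarity, the arithmetic behind these headline numbers (all values taken from the table above):

```python
# GPT scores for MiniCPM-Llama3-V-2.5-8B before/after fine-tuning, and the GPT-4V benchmark
pre, post, gpt4v = 0.147, 0.477, 0.637

absolute_gain = post - pre                 # 0.330
relative = post / pre                      # ~3.24x the pre-fine-tuning score
gap_closed = (post - pre) / (gpt4v - pre)  # ~67% of the gap to GPT-4V closed

print(f"+{absolute_gain:.3f} absolute, {relative:.2f}x relative, {gap_closed:.0%} of the gap to GPT-4V closed")
```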
Performance of MiniCPM-Llama3-V-2.5-8B before and after fine-tuning on GAP-VQA:
Dataset | Pre-fine-tuning | Post-fine-tuning |
---|---|---|
LLaVA Bench | 87.9 | 82.2 |
OCRBench | 72.4 | 73.1 |
MME | 2025.61 | 2040.54 |
RealWorldQA | 0.634 | 0.609 |
MM-Vet | 51.422 | 51.789 |
MMBench | 0.752 | 0.7422 |
HallusionBench | 59.93 | 60.25 |
TextVQA | 76.63 | 76.966 |
MMMU val | 0.474 | 0.486 |
DocVQA | 84.47 | 84.33 |
The GAP-fine-tuned model improves on the majority of these benchmarks (OCRBench, MME, MM-Vet, HallusionBench, TextVQA, and MMMU), with small regressions on a few others, indicating that fine-tuning on GAP-VQA strengthens general visual understanding and reasoning capabilities.
Performance of Qwen2-VL-7B before and after fine-tuning on GAP-VQA:
Dataset | Pre-fine-tuning | Post-fine-tuning |
---|---|---|
LLaVA Bench | 76.7 | 83.6 |
OCRBench | 86.1 | 86.7 |
MME | 2318.98 | 2332.71 |
RealWorldQA | 0.699 | 0.690 |
MM-Vet | 62.889 | 64.954 |
MMBench | 0.808 | 0.815 |
HallusionBench | 68.769 | 68.769 |
TextVQA | 84.428 | 84.084 |
MMMU val | 0.524 | 0.527 |
DocVQA | 93.866 | 94.038 |
Performance of Qwen2-VL-2B before and after fine-tuning on GAP-VQA:
Dataset | Pre-fine-tuning | Post-fine-tuning |
---|---|---|
LLaVA Bench | 52.6 | 57.9 |
OCRBench | 81.2 | 81.4 |
MME | 1881.92 | 1962.75 |
RealWorldQA | 0.626 | 0.6156 |
MM-Vet | 51.146 | 52.43 |
MMBench | 0.729 | 0.732 |
HallusionBench | 61.619 | 62.99 |
TextVQA | 79.824 | 80.074 |
MMMU val | 0.414 | 0.448 |
DocVQA | 89.26 | 89.36 |
These results demonstrate the remarkable versatility of the GAP framework. Not only does it dramatically improve the performance of the model it was initially designed for (MiniCPM-Llama3-V-2.5-8B), but it also yields significant enhancements in different model architectures and sizes. This cross-model benefit is a testament to the high quality and broad applicability of the GAP-VQA dataset.
The GAP platform has attracted a broad user base with a committed core of weekly-active participants:
Sessions per week | Share of users |
---|---|
0 sessions | 69.68% |
1 session | 27.75% |
2+ sessions | 2.68% |
Our participation data reveals a highly engaged user base:
- An impressive 30.32% of users actively participate at least once per week, demonstrating strong interest in the platform.
- A dedicated core of 2.68% of users show exceptional engagement by participating in multiple sessions weekly.
- The substantial 27.75% of users who engage in one session per week represent a large pool of casual participants, indicating the game's broad appeal.
Figure: distribution of the number of images (1-10) a user interacts with in a session. The distribution is heavily concentrated at the 10-image mark (a proportion of roughly 0.7), with only small fractions at lower counts.
The above figure shows that when users do participate, they overwhelmingly interact with all 10 images in a session, as evidenced by the pronounced spike at the 10-image mark. This suggests that the game design effectively encourages thorough engagement once a session begins. The active user base, while smaller, provides valuable data on the game's appeal and effectiveness. The high completion rate of sessions and the presence of repeat participants indicate strong engagement among active users, showcasing the game's potential to retain and involve players consistently.
git clone https://github.com/fraction-ai/GAP.git
from peft import PeftModel
from transformers import AutoModel
# Define model type and path to the adapter
model_type = "openbmb/MiniCPM-Llama3-V-2_5"
path_to_adapter = "path_to_lora_checkpoint"
# Load the base pre-trained model
model = AutoModel.from_pretrained(
    model_type,
    trust_remote_code=True
)
# Load the LoRA adapter and move the model to the GPU
lora_model = PeftModel.from_pretrained(
    model,
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True
).eval().cuda()
# Your code here
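# Example inference (illustrative): the chat() call below follows the upstream
# MiniCPM-Llama3-V-2.5 usage; the image path, prompt, and sampling settings are placeholders.
from PIL import Image
from transformers import AutoTokenizer

# The tokenizer comes from the same base checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained(model_type, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")   # any local test image
msgs = [{"role": "user", "content": "What is unusual about this image?"}]

response = lora_model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)
print(response)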
The GAP-VQA dataset (3,683 question-image pairs) is available on Hugging Face.
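A minimal way to load it with the `datasets` library (the dataset id below is an assumption; use the exact name from the Hugging Face link above):

```python
from datasets import load_dataset

# Hypothetical repo id — replace with the id shown on the GAP-VQA Hugging Face page
gap_vqa = load_dataset("fraction-ai/GAP-VQA", split="train")

print(len(gap_vqa))   # ~3,683 question-image pairs
print(gap_vqa[0])     # one record: an image paired with an adversarial question and its answer
```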
LoRA fine-tuning scripts have been set up for MiniCPM-Llama3-V-2_5 and Qwen2-VL. Detailed instructions can be found inside each link.
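The repository's scripts define the exact hyperparameters; as a rough sketch of what LoRA fine-tuning with `peft` looks like (illustrative values and target modules only, not the settings used in our scripts):

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load the base model to attach LoRA adapters to
base_model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)

# Illustrative hyperparameters only — see the fine-tuning scripts for the values actually used
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only the LoRA adapter weights are trainable
```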
Instructions for benchmark evaluation of the LoRA fine-tuned models are provided in VLMEvalKit. You can also check out the GPT-4V Score calculation to compute the GPT scores for these models.
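As a rough illustration of the idea behind the GPT score (a GPT judge rates each model answer against a reference; the linked scripts define the actual prompt and rubric), one might write something like:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def gpt_score(question, reference_answer, model_answer, judge_model="gpt-4o"):
    """Illustrative only: ask a GPT judge for a 0-1 correctness score.
    The repository's GPT-4V Score scripts define the real prompt and rubric."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Rate the correctness of the model answer from 0 to 1. Reply with the number only."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(reply.choices[0].message.content.strip())
```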
Our future work will focus on enhancing the GAP framework through three key developments:
- AI-Assisted Question Generation: We plan to develop a visually fine-tuned language model capable of generating questions that the base LLM answers incorrectly. This will allow for more systematic identification of model blind spots and weaknesses.
- Advanced Probabilistic Modeling: We aim to create a sophisticated probabilistic model that incorporates factors such as player skill, image difficulty, response time, and fatigue. This will enable more accurate estimation of LMM capabilities while controlling for confounding variables (a toy sketch follows at the end of this section).
- Expanding GAP's Applicability: We will explore the application of the GAP framework beyond LMMs to other AI domains. This expansion will involve addressing domain-specific challenges and ethical considerations unique to each field.
These advancements aim to create a more scalable, efficient, and widely applicable approach for continuous improvement of AI systems through targeted human feedback.
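As a toy illustration of the second item above (not a commitment to this exact formulation), an IRT-style logistic model could relate these factors to the probability that a question stumps the model; all coefficients below are invented for illustration:

```python
import math

def p_stump(player_skill, image_difficulty, response_time_s, fatigue):
    """Toy IRT-style model of the probability that a player's question exposes a model gap.
    The coefficients are made up; a real model would be fit to GAP gameplay data."""
    logit = (
        1.2 * player_skill        # stronger players find gaps more often
        - 0.8 * image_difficulty  # some images offer fewer exploitable weaknesses
        - 0.01 * response_time_s  # very long deliberation may signal weaker probes
        - 0.5 * fatigue           # engagement decays late in a session
    )
    return 1.0 / (1.0 + math.exp(-logit))

print(round(p_stump(player_skill=1.0, image_difficulty=0.5, response_time_s=30, fatigue=0.2), 3))
```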
We welcome contributions! To contribute:
- Fork the repository.
- Create a new branch for your feature or fix.
- Make your changes and commit them.
- Push your changes to your fork.
- Open a Pull Request (PR) to the main repository.
Please ensure your code follows the standard guidelines and is well-tested.
You can join our Telegram for discussions, follow us on Twitter for updates, and check our Blog for in-depth content.
This project is licensed under the Apache 2.0 License. If you find our work helpful, please cite our paper:
@misc{yadav2024gamifiedcrowdsourcinghighqualitydata,
title={Gamified crowd-sourcing of high-quality data for visual fine-tuning},
author={Shashank Yadav and Rohan Tomar and Garvit Jain and Chirag Ahooja and Shubham Chaudhary and Charles Elkan},
year={2024},
eprint={2410.04038},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.04038},
}