This project aims to develop a series of strong, open-source fundamental image recognition models.
- Recognize Anything Plus Model (RAM++) [Paper]
RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
- Recognize Anything Model (RAM) [Paper][Demo]
RAM is an image tagging model, which can recognize any common category with high accuracy.
RAM is accepted at CVPR 2024 Multimodal Foundation Models Workshop.
- Tag2Text (ICLR 2024) [Paper] [Demo]
Tag2Text is a vision-language model guided by tagging, which can support tagging and comprehensive captioning simultaneously.
Tag2Text is accepted at ICLR 2024! See you in Vienna!
RAM++ outperforms existing SOTA fundamental image recognition models on common tag categories, uncommon tag categories, and human-object interaction phrases.
Comparison of zero-shot image recognition performance.
We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the Grounded-SAM project.
RAM++
RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
- For Common Predefined Categories. RAM++ exhibits exceptional image tagging capabilities with powerful zero-shot generalization, maintaining the same capabilities as RAM.
- For Diverse Open-set Categories. RAM++ achieves notable enhancements beyond CLIP and RAM.
(Green indicates fully supervised learning; the others indicate zero-shot performance.)
RAM++ demonstrates a significant improvement in open-set category recognition.
RAM
RAM is a strong image tagging model, which can recognize any common category with high accuracy.
- Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization;
- RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
- RAM even surpasses fully supervised methods (ML-Decoder).
- RAM exhibits competitive performance with the Google tagging API.
- Reproducible and affordable. RAM has a low reproduction cost, relying on an open-source and annotation-free dataset;
- Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.
(Green indicates fully supervised learning and blue indicates zero-shot performance.)
RAM significantly improves the tagging ability based on the Tag2Text framework.
- Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, achieving higher accuracy than Tag2Text.
- Scope. RAM upgrades the number of fixed tags from 3,400+ to 6,400+ (reduced to 4,500+ semantically distinct tags after merging synonyms), covering more valuable categories. Moreover, RAM is equipped with open-set capability, making it feasible to recognize tags not seen during training.
Tag2Text
Tag2Text is an efficient and controllable vision-language model with tagging guidance.
- Tagging. Tag2Text recognizes 3,400+ categories commonly used by humans, without manual annotations.
- Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
- Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.
Tag2Text generates more comprehensive captions with tagging guidance.
Tag2Text provides tags as additional visible alignment indicators.
These annotation files come from Tag2Text and RAM. Tag2Text automatically extracts image tags from image-text pairs. RAM further augments both tags and texts via an automatic data engine.
DataSet | Size | Images | Texts | Tags |
---|---|---|---|---|
COCO | 168 MB | 113K | 680K | 3.2M |
VG | 55 MB | 100K | 923K | 2.7M |
SBU | 234 MB | 849K | 1.7M | 7.6M |
CC3M | 766 MB | 2.8M | 5.6M | 28.2M |
CC3M-val | 3.5 MB | 12K | 26K | 132K |
CC12M to be released in the next update.
These tag description files come from RAM++ and were generated by calling the GPT API. You can also customize any tag categories with generate_tag_des_llm.py.
Tag Descriptions | Tag List |
---|---|
RAM Tag List | 4,585 |
OpenImages Uncommon | 200 |
Note: you need to create a 'pretrained' folder and download these checkpoints into it.
# | Name | Backbone | Data | Illustration | Checkpoint |
---|---|---|---|---|---|
1 | RAM++ (14M) | Swin-Large | COCO, VG, SBU, CC3M, CC3M-val, CC12M | Provide strong image tagging ability for any category. | Download link |
2 | RAM (14M) | Swin-Large | COCO, VG, SBU, CC3M, CC3M-val, CC12M | Provide strong image tagging ability for common category. | Download link |
3 | Tag2Text (14M) | Swin-Base | COCO, VG, SBU, CC3M, CC3M-val, CC12M | Support comprehensive captioning and tagging. | Download link |
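Before running inference, it can help to confirm that the checkpoints are where the commands in this README expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the file names are taken from the inference commands in this README.

```python
from pathlib import Path

# Checkpoint file names as they appear in the inference commands of this README.
EXPECTED_CHECKPOINTS = {
    "RAM++": "ram_plus_swin_large_14m.pth",
    "RAM": "ram_swin_large_14m.pth",
    "Tag2Text": "tag2text_swin_14m.pth",
}


def missing_checkpoints(folder="pretrained"):
    """Create `folder` if needed and return the models whose checkpoint is absent.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that they are complete or valid.
    """
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    return [name for name, filename in EXPECTED_CHECKPOINTS.items()
            if not (root / filename).is_file()]
```

Calling `missing_checkpoints()` from the repository root returns the list of models that still need to be downloaded; an empty list means all three files are present.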
- Create and activate a Conda environment:
conda create -n recognize-anything python=3.8 -y
conda activate recognize-anything
- Install recognize-anything as a package:
pip install git+https://github.com/xinyu1205/recognize-anything.git
- Or, for development, you may build from source:
git clone https://github.com/xinyu1205/recognize-anything.git
cd recognize-anything
pip install -e .
Then the RAM++, RAM, and Tag2Text models can be imported in other projects:
from ram.models import ram_plus, ram, tag2text
Get the English and Chinese outputs of the images:
python inference_ram_plus.py --image images/demo/demo1.jpg --pretrained pretrained/ram_plus_swin_large_14m.pth
The output will look like the following:
Image Tags: armchair | blanket | lamp | carpet | couch | dog | gray | green | hassock | home | lay | living room | picture frame | pillow | plant | room | wall lamp | sit | wood floor
图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 地毯 | 沙发 | 狗 | 灰色 | 绿色 | 坐垫/搁脚凳/草丛 | 家/住宅 | 躺 | 客厅 | 相框 | 枕头 | 植物 | 房间 | 壁灯 | 坐/放置/坐落 | 木地板
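If the printed tags need to be consumed by another program, the two output lines can be split into lists. The sketch below assumes only the `label: tag | tag | ...` layout shown in the sample output above; `parse_tag_line` is a hypothetical helper, not a function provided by the package.

```python
def parse_tag_line(line):
    """Split a line such as 'Image Tags: a | b | c' into a list of tag strings."""
    # Accept the ASCII colon and, as a precaution, the full-width colon.
    for separator in (":", "："):
        if separator in line:
            _, _, rest = line.partition(separator)
            break
    else:
        raise ValueError(f"no 'label: tags' separator in {line!r}")
    return [tag.strip() for tag in rest.split("|") if tag.strip()]


english = parse_tag_line("Image Tags: armchair | blanket | lamp | carpet")
chinese = parse_tag_line("图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 地毯")

# In the sample output above the two lines list the tags in the same order,
# so they can be paired by position.
pairs = list(zip(english, chinese))
```

Pairing by position relies on the ordering seen in the sample output; it is worth checking that both lists have the same length before zipping them.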
- Get the OpenImages-Uncommon categories of the image:
We have released the LLM tag descriptions of OpenImages-Uncommon categories in openimages_rare_200_llm_tag_descriptions.
python inference_ram_plus_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_plus_swin_large_14m.pth \
--llm_tag_des datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
The output will look like the following:
Image Tags: Close-up | Compact car | Go-kart | Horse racing | Sport utility vehicle | Touring car
- You can also customize any tag categories for recognition through tag descriptions:
Modify the categories, then call the GPT API to generate the corresponding tag descriptions:
python generate_tag_des_llm.py \
--openai_api_key 'your openai api key' \
--output_file_path datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
RAM Inference
Get the English and Chinese outputs of the images:
python inference_ram.py --image images/demo/demo1.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
The output will look like the following:
Image Tags: armchair | blanket | lamp | carpet | couch | dog | floor | furniture | gray | green | living room | picture frame | pillow | plant | room | sit | stool | wood floor
图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 地毯 | 沙发 | 狗 | 地板/地面 | 家具 | 灰色 | 绿色 | 客厅 | 相框 | 枕头 | 植物 | 房间 | 坐/放置/坐落 | 凳子 | 木地板
RAM Inference on Unseen Categories (Open-Set)
First, customize the recognition categories in build_openset_label_embedding, then get the tags of the image:
python inference_ram_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
The output will look like the following:
Image Tags: Black-and-white | Go-kart
Tag2Text Inference
Get the tagging and captioning results:
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth
Or get the tagging and specified captioning results (optional):
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth \
--specified-tags "cloud,sky"
We release two datasets: OpenImages-common (214 common tag classes) and OpenImages-rare (200 uncommon tag classes). Copy or sym-link test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs.
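The sym-link step can be scripted. This is a minimal sketch using only the Python standard library; `link_images` is a hypothetical helper, and the set of file extensions is an assumption about how the OpenImages test images are stored locally.

```python
import os
from pathlib import Path


def link_images(src_dir, dst_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Symlink every image in `src_dir` into `dst_dir`; return the number of new links."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    created = 0
    for path in sorted(src.iterdir()):
        if path.suffix.lower() not in extensions:
            continue  # skip non-image files
        target = dst / path.name
        if target.exists() or target.is_symlink():
            continue  # leave existing files and links untouched
        os.symlink(path.resolve(), target)
        created += 1
    return created
```

For example, `link_images("/path/to/openimages_v6/test", "datasets/openimages_common_214/imgs")` would populate the first dataset folder, where `/path/to/openimages_v6/test` is a placeholder for the local copy of the test images. Creating symlinks may require extra privileges on some platforms; copying the files is the fallback.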
To evaluate RAM++ on OpenImages-common:
python batch_inference.py \
--model-type ram_plus \
--checkpoint pretrained/ram_plus_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram_plus
To evaluate RAM++ open-set capability on OpenImages-rare:
python batch_inference.py \
--model-type ram_plus \
--checkpoint pretrained/ram_plus_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_plus_openset
To evaluate RAM on OpenImages-common:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram
To evaluate RAM open-set capability on OpenImages-rare:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_openset
To evaluate Tag2Text on OpenImages-common:
python batch_inference.py \
--model-type tag2text \
--checkpoint pretrained/tag2text_swin_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/tag2text
Please refer to batch_inference.py for more options. To get P/R in table 3 of RAM paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.
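To illustrate how a threshold trades precision against recall, here is a minimal sketch that thresholds per-tag confidence scores and counts hits over all (image, tag) pairs. It is not the evaluation code in batch_inference.py: the function name, the `>=` comparison, and the pooling over all pairs are assumptions, and the scores are made up for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Overall precision and recall of thresholded scores.

    scores: list of per-image lists of tag confidences in [0, 1]
    labels: same shape, 1 where the tag is present and 0 otherwise
    """
    tp = fp = fn = 0
    for image_scores, image_labels in zip(scores, labels):
        for score, label in zip(image_scores, image_labels):
            predicted = score >= threshold
            if predicted and label:
                tp += 1  # tag predicted and present
            elif predicted:
                fp += 1  # tag predicted but absent
            elif label:
                fn += 1  # tag present but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): 2 images, 3 candidate tags each.
scores = [[0.95, 0.70, 0.10], [0.90, 0.88, 0.40]]
labels = [[1, 1, 0], [1, 0, 0]]
high = precision_recall(scores, labels, threshold=0.86)
low = precision_recall(scores, labels, threshold=0.68)
```

On this toy data the lower threshold recovers one more correct tag, so recall rises while the single false positive remains; real score distributions will behave differently for each model, which is why the two models use different thresholds.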
To run batch inference on custom images, you can set up your own datasets following the layout of the two given datasets.
- Download RAM training datasets, where each JSON file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, including parsed tags and pseudo tags}.
- In ram/configs/pretrain.yaml, set 'train_file' to the paths of the JSON files.
- Prepare a pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
- Download the RAM++ frozen tag embedding file "ram_plus_tag_embedding_class_4585_des_51.pth", and place it at "ram/data/frozen_tag_embedding/ram_plus_tag_embedding_class_4585_des_51.pth".
- Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
--model-type ram_plus \
--config ram/configs/pretrain.yaml \
--output-dir outputs/ram_plus
- Fine-tune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
--model-type ram_plus \
--config ram/configs/finetune.yaml \
--checkpoint outputs/ram_plus/checkpoint_04.pth \
--output-dir outputs/ram_plus_ft
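When preparing custom training data, it may be useful to check annotation files against the key set described in the first step above before launching a multi-GPU job. The sketch below is a hypothetical validator, not part of the repository; the value types in the example entries (a string caption, a list of integer tag ids) are assumptions, since only the key names are given above.

```python
import json

# Keys listed above for RAM++ training annotations.
REQUIRED_KEYS = {"image_path", "caption", "union_label_id"}


def validate_annotations(items, required=REQUIRED_KEYS):
    """Return the indices of items that are not dicts or lack a required key."""
    return [index for index, item in enumerate(items)
            if not isinstance(item, dict) or not set(required).issubset(item)]


# Made-up entries for illustration only.
items = [
    {"image_path": "images/0001.jpg", "caption": "a dog on a couch",
     "union_label_id": [12, 345]},
    {"image_path": "images/0002.jpg", "caption": "a lamp"},  # tags missing
]
bad = validate_annotations(items)

# An annotation file is a JSON list of such dictionaries.
text = json.dumps(items)
```

Here `bad` contains the index of the second entry, which lacks `union_label_id`. The `required` argument can be changed to match the other key sets listed in this README.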
RAM
- Download RAM training datasets, where each JSON file contains a list. Each item in the list is a dictionary with four key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, including parsed tags and pseudo tags, 'parse_label_id': image tags parsed from caption}.
- In ram/configs/pretrain.yaml, set 'train_file' to the paths of the JSON files.
- Prepare a pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
- Download the RAM frozen tag embedding file "ram_tag_embedding_class_4585.pth", and place it at "ram/data/frozen_tag_embedding/ram_tag_embedding_class_4585.pth".
- Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
--model-type ram \
--config ram/configs/pretrain.yaml \
--output-dir outputs/ram
- Fine-tune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
--model-type ram \
--config ram/configs/finetune.yaml \
--checkpoint outputs/ram/checkpoint_04.pth \
--output-dir outputs/ram_ft
Tag2Text
- Download RAM training datasets, where each JSON file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'parse_label_id': image tags parsed from caption}.
- In ram/configs/pretrain_tag2text.yaml, set 'train_file' to the paths of the JSON files.
- Prepare a pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
- Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
--model-type tag2text \
--config ram/configs/pretrain_tag2text.yaml \
--output-dir outputs/tag2text
- Fine-tune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
--model-type tag2text \
--config ram/configs/finetune_tag2text.yaml \
--checkpoint outputs/tag2text/checkpoint_04.pth \
--output-dir outputs/tag2text_ft
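The pre-training and fine-tuning commands for the three models differ only in the model type, the config files, and the output directories. The sketch below collects that mapping in one place and rebuilds the commands shown above; `training_commands` is a hypothetical helper, and the `gpus` and `checkpoint` parameters are illustrative rather than options documented by the repository.

```python
import shlex

# (pre-training config, fine-tuning config) per model type, as used above.
CONFIGS = {
    "ram_plus": ("ram/configs/pretrain.yaml", "ram/configs/finetune.yaml"),
    "ram": ("ram/configs/pretrain.yaml", "ram/configs/finetune.yaml"),
    "tag2text": ("ram/configs/pretrain_tag2text.yaml",
                 "ram/configs/finetune_tag2text.yaml"),
}


def training_commands(model_type, gpus=8, checkpoint="checkpoint_04.pth"):
    """Return (pretrain_argv, finetune_argv) for one model type."""
    pretrain_config, finetune_config = CONFIGS[model_type]
    launcher = ["python", "-m", "torch.distributed.run",
                f"--nproc_per_node={gpus}"]
    output_dir = f"outputs/{model_type}"
    pretrain = launcher + ["pretrain.py", "--model-type", model_type,
                           "--config", pretrain_config,
                           "--output-dir", output_dir]
    finetune = launcher + ["finetune.py", "--model-type", model_type,
                           "--config", finetune_config,
                           "--checkpoint", f"{output_dir}/{checkpoint}",
                           "--output-dir", f"{output_dir}_ft"]
    return pretrain, finetune


# Print both commands for Tag2Text as shell-ready strings.
for argv in training_commands("tag2text"):
    print(shlex.join(argv))
```

The returned lists can be passed to `subprocess.run` directly, which avoids shell quoting issues. Whether the configs need adjusting when `gpus` differs from 8 is not covered here.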
If you find our work to be useful for your research, please consider citing.
@article{huang2023open,
title={Open-Set Image Tagging with Multi-Grained Text Supervision},
author={Huang, Xinyu and Huang, Yi-Jie and Zhang, Youcai and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Xie, Yanchun and Li, Yaqian and Zhang, Lei},
journal={arXiv e-prints},
pages={arXiv--2310},
year={2023}
}
@article{zhang2023recognize,
title={Recognize Anything: A Strong Image Tagging Model},
author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
journal={arXiv preprint arXiv:2306.03514},
year={2023}
}
@article{huang2023tag2text,
title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
journal={arXiv preprint arXiv:2303.05657},
year={2023}
}
This work is done with the help of the amazing code base of BLIP, thanks very much!
We want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.
We also want to thank Ask-Anything, Prompt-can-anything for combining RAM/Tag2Text, which greatly expands the application boundaries of RAM/Tag2Text.