Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey


🎇 Introduction

Multimodal chain-of-thought (MCoT) reasoning has garnered attention for its ability to enhance step-by-step reasoning in multimodal contexts, particularly within multimodal large language models (MLLMs). Current MCoT research explores various methodologies to address the challenges posed by images, videos, speech, audio, 3D data, and structured data, achieving success in fields such as robotics, healthcare, and autonomous driving. However, despite these advancements, the field lacks a comprehensive review that addresses the numerous remaining challenges.

To fill this gap, we present the first systematic survey of MCoT reasoning, elucidating the foundational concepts and definitions pertinent to this area. Our work includes a detailed taxonomy and an analysis of existing methodologies across different applications, as well as insights into current challenges and future research directions aimed at fostering the development of multimodal reasoning.


Updates

2025-03-18: We release the Awesome-MCoT repo and survey.


🎖 MCoT Datasets and Benchmarks

  • "MC" and "Open" refer to multiple-choice and open-ended answer formats.
  • "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.

Tab-1: Datasets for MCoT Training with Rationale.

| Datasets | Year | Task | Domain | Modality | Format | Samples |
|----------|------|------|--------|----------|--------|---------|
| ScienceQA | 2022 | VQA | Science | T, I | MC | 21K |
| A-OKVQA | 2022 | VQA | Common | T, I | MC | 25K |
| EgoCoT | 2023 | VideoQA | Common | T, V | Open | 200M |
| VideoCoT | 2024 | VideoQA | Human Action | T, V | Open | 22K |
| VideoEspresso | 2024 | VideoQA | Common | T, V | Open | 202,164 |
| EMMA-X | 2024 | Robot Manipulation | Indoor | T, V | Robot Actions | 60K |
| M3CoT | 2024 | VQA | Science, Math, Common | T, I | MC | 11.4K |
| MAVIS | 2024 | ScienceQA | Math | T, I | MC and Open | 834K |
| LLaVA-CoT-100k | 2024 | VQA | Common, Science | T, I | MC and Open | 100K |
| MAmmoTH-VL | 2024 | Diverse | Diverse | T, I | MC and Open | 12M |
| Mulberry-260k | 2024 | Diverse | Diverse | T, I | MC and Open | 260K |
| MM-Verify | 2025 | MathQA | Math | T, I | MC and Open | 59,772 |
| VisualPRM400K | 2025 | ScienceQA | Math, Science | T, I | MC and Open | 400K |
| R1-OneVision | 2025 | Diverse | Diverse | T, I | MC and Open | 155K |

Tab-2: Benchmarks for MCoT Evaluation without Rationale.

| Datasets | Year | Task | Domain | Modality | Format | Samples |
|----------|------|------|--------|----------|--------|---------|
| MMMU | 2023 | VQA | Arts, Science | T, I | MC and Open | 11.5K |
| SEED | 2023 | VQA | Common | T, I | MC | 19K |
| MathVista | 2023 | ScienceQA | Math | T, I | MC and Open | 6,141 |
| MathVerse | 2024 | ScienceQA | Math | T, I | MC and Open | 15K |
| Math-Vision | 2024 | ScienceQA | Math | T, I | MC and Open | 3,040 |
| MeViS | 2023 | Referring VOS | Common | T, V | Dense Mask | 2K |
| VSIBench | 2024 | VideoQA | Indoor | T, V | MC and Open | 5K |
| HallusionBench | 2024 | VQA | Common | T, I | Yes-No | 1,129 |
| AV-Odyssey | 2024 | AVQA | Common | T, V, A | MC | 4,555 |
| AVHBench | 2024 | AVQA | Common | T, V, A | Open | 5,816 |
| RefAVS-Bench | 2024 | Referring AVS | Common | T, V, A | Dense Mask | 4,770 |
| MMAU | 2024 | AQA | Common | T, A | MC | 10K |
| AVTrustBench | 2025 | AVQA | Common | T, V, A | MC and Open | 600K |
| MIG-Bench | 2025 | Multi-image Grounding | Common | T, I | BBox | 5.89K |
| MedAgentsBench | 2025 | MedicalQA | Medical | T, I | MC and Open | 862 |
| OSWorld | 2024 | Agent | Real Comp. Env. | T, I | Agent Action | 369 |
| AgentClinic | 2024 | MedicalQA | Medical | T, I | Open | 335 |

Tab-3: Benchmarks for MCoT Evaluation with Rationale.

| Datasets | Year | Task | Domain | Modality | Format | Samples |
|----------|------|------|--------|----------|--------|---------|
| CoMT | 2024 | VQA | Common | T, I | MC | 3,853 |
| OmniBench | 2024 | VideoQA | Common | T, I, A | MC | 1,142 |
| WorldQA | 2024 | VideoQA | Common | T, V, A | Open | 1,007 |
| MiCEval | 2024 | VQA | Common | T, I | Open | 643 |
| OlympiadBench | 2024 | ScienceQA | Math, Physics | T, I | Open | 8,476 |
| MME-CoT | 2025 | VQA | Science, Math, Common | T, I | MC and Open | 1,130 |
| EMMA | 2025 | VQA | Science | T, I | MC and Open | 2,788 |
| VisualProcessBench | 2025 | ScienceQA | Math, Science | T, I | MC and Open | 2,866 |

🎊 Multimodal Reasoning via RL

  • The following table summarizes the techniques used by MLLMs with RL for better long-MCoT reasoning, where "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.
  • In summary, RL can unlock complex reasoning and "aha-moment" behavior without SFT, demonstrating its potential to enhance model capabilities through iterative self-improvement and rule-based rewards, ultimately paving the way for more advanced and autonomous multimodal reasoning systems.
| Model | Foundational LLMs | Modality | Learning | Cold Start | Algorithm | Aha-moment |
|-------|-------------------|----------|----------|------------|-----------|------------|
| Deepseek-R1-Zero | Deepseek-V3 | T | RL | ✗ | GRPO | ✓ |
| Deepseek-R1 | Deepseek-V3 | T | SFT+RL | ✓ | GRPO | - |
| LLaVA-Reasoner | LLaMA3-LLaVA-NEXT-8B | T, I | SFT+RL | ✓ | DPO | - |
| Insight-V | Deepseek-V3 | T, I | SFT+RL | ✓ | DPO | - |
| Multimodal-Open-R1 | Qwen2-VL-7B-Instruct | T, I | RL | ✗ | GRPO | ✓ |
| R1-OneVision | Qwen2.5-VL-7B-Instruct | T, I | SFT | - | - | - |
| R1-V | Qwen2.5-VL | T, I | RL | ✗ | GRPO | ✓ |
| VLM-R1 | Qwen2.5-VL | T, I | RL | ✗ | GRPO | ✓ |
| LMM-R1 | Qwen2.5-VL-Instruct-3B | T, I | RL | ✗ | PPO | ✓ |
| Curr-ReFT | Qwen2.5-VL-3B | T, I | RL+SFT | ✗ | GRPO | - |
| Seg-Zero | Qwen2.5-VL-3B + SAM2 | T, I | RL | ✗ | GRPO | ✓ |
| MM-Eureka | InternVL2.5-Instruct-8B | T, I | SFT+RL | ✓ | RLOO | - |
| MM-Eureka-Zero | InternVL2.5-Pretrained-38B | T, I | RL | ✗ | GRPO | ✓ |
| VisualThinker-R1-Zero | Qwen2-VL-2B | T, I | RL | ✗ | GRPO | ✓ |
| Easy-R1 | Qwen2.5-VL | T, I | RL | ✗ | GRPO | - |
| Open-R1-Video | Qwen2-VL-7B | T, I, V | RL | ✗ | GRPO | ✓ |
| R1-Omni | HumanOmni-0.5B | T, I, V, A | SFT+RL | ✓ | GRPO | - |
| VisRL | Qwen2.5-VL-7B | T, I | SFT+RL | ✓ | DPO | - |
| R1-VL | Qwen2-VL-7B | T, I | RL | ✗ | StepGRPO | - |
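GRPO, the most common algorithm in the table above, replaces a learned value critic with advantages normalized within a group of responses sampled for the same prompt. The snippet below is a minimal sketch of that group-relative advantage computation only (the function name and the zero-variance fallback are illustrative assumptions, not from any specific codebase):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    reward by the mean and std of its own group, so no value critic
    is needed. `rewards` holds the scalar rewards of one group."""
    mean = statistics.mean(rewards)
    # Fall back to 1.0 when all rewards in the group are equal,
    # to avoid division by zero (an illustrative choice).
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

For example, a group with rewards `[1, 0, 0, 1]` yields advantages `[1.0, -1.0, -1.0, 1.0]`: correct responses are pushed up and incorrect ones down relative to their own group, which is what enables rule-based rewards to drive learning without a critic.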

✨ MCoT Over Various Modalities

MCoT Reasoning Over Image

2025

2024

2023


MCoT Reasoning Over Video

2025

2024

2023


MCoT Reasoning Over 3D

2025

2024

2023


MCoT Reasoning Over Audio and Speech

2025

2024


MCoT Reasoning Over Table and Chart

2025

2024

2023


Cross-modal CoT Reasoning

2025

2024


🔥 MCoT Methodologies

Rationale Construction

MCoT reasoning methodologies primarily concern the construction of rationales and can be categorized into three distinct types:

  1. Prompt-based MCoT reasoning employs carefully designed prompts, including instructions or in-context demonstrations, to guide models in generating rationales during inference, typically in zero-shot or few-shot settings.
  2. Plan-based MCoT reasoning enables models to dynamically explore and refine thoughts during the reasoning process.
  3. Learning-based MCoT reasoning embeds rationale construction within the training or fine-tuning process, requiring models to explicitly learn reasoning skills alongside multimodal inputs.
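As a toy illustration of the prompt-based category, a zero- or few-shot prompt can be assembled that asks the model to reason step by step before answering. The function below is a hypothetical sketch (real MCoT pipelines pass image features to an MLLM rather than a text caption):

```python
def build_mcot_prompt(question, image_caption, demonstrations=()):
    """Assemble a zero-/few-shot multimodal CoT prompt.

    `demonstrations` is an optional sequence of (question, rationale,
    answer) triples used as in-context examples; with none supplied,
    the prompt is zero-shot. Illustrative only."""
    parts = []
    for demo_q, demo_rationale, demo_a in demonstrations:
        parts.append(
            f"Question: {demo_q}\nRationale: {demo_rationale}\nAnswer: {demo_a}\n"
        )
    # The trigger phrase elicits step-by-step rationale generation.
    parts.append(
        f"Image: {image_caption}\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )
    return "\n".join(parts)
```

Adding (question, rationale, answer) triples turns the same template into a few-shot prompt, which is the main lever prompt-based methods tune at inference time.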

Structural Reasoning

Structural reasoning enhances the controllability and interpretability of the rationale generation process. The structured formats can be categorized into three types: asynchronous modality modeling, defined procedure staging, and autonomous procedure staging.
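Defined procedure staging can be illustrated with the four-stage tag format popularized by LLaVA-CoT (summary, caption, reasoning, conclusion). The parser below is a sketch of how such a fixed stage order might be enforced when checking a model's output; the tag names follow that format, but the function itself is hypothetical:

```python
import re

# Fixed stage order of the defined procedure (LLaVA-CoT-style tags).
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_rationale(text):
    """Extract the content of each predefined stage in order.
    Returns a dict of stage -> content, or None if any stage is
    missing or appears out of order."""
    out, pos = {}, 0
    for stage in STAGES:
        # Search only past the previous stage, so order is enforced.
        m = re.search(rf"<{stage}>(.*?)</{stage}>", text[pos:], re.DOTALL)
        if not m:
            return None
        out[stage] = m.group(1).strip()
        pos += m.end()
    return out
```

Rejecting outputs that skip or reorder stages is one simple way such staging makes the rationale process controllable and easy to inspect.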

Information Enhancing

Enhancing multimodal inputs facilitates comprehensive reasoning through the integration of expert tools and internal or external knowledge.
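For instance, expert tools (OCR, retrieval, detection) can each be invoked and their outputs folded into the reasoning context before rationale generation. The interface below is purely hypothetical, meant only to show the shape of such enhancement:

```python
def enhance_context(question, tools):
    """Run each registered expert tool on the question and append its
    evidence to the reasoning context. `tools` maps a tool name to a
    callable taking the question and returning a text observation
    (hypothetical interface)."""
    evidence = [f"[{name}] {tool(question)}" for name, tool in tools.items()]
    return "Question: " + question + "\n" + "\n".join(evidence)
```

A downstream model then reasons over the enriched context rather than the raw inputs alone, which is the common pattern across tool- and knowledge-based enhancement methods.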

Objective Granularity

Multimodal Rationale

The reasoning processes adopt either text-only or multimodal rationales.

Test-time Scaling


🎨 Applications with MCoT Reasoning

Embodied AI

Agentic System

Autonomous Driving

Medical and Healthcare

Social and Human

Multimodal Generation


🚀 Useful Links

Survey


❤️ Citation

We would be honored if this work could assist you, and greatly appreciate it if you could consider starring and citing it:

@article{wang2025multimodal,
  author    = {Wang, Yaoting and Wu, Shengqiong and Zhang, Yuecheng and Herzig, Roei and Yan, Shuicheng and Liu, Ziwei and Luo, Jiebo and Fei, Hao},
  title     = {Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey},
  year      = {2025},
}
