Multimodal chain-of-thought (MCoT) reasoning has garnered attention for its ability to enhance step-by-step reasoning in multimodal contexts, particularly within multimodal large language models (MLLMs). Current MCoT research explores various methodologies to address the challenges posed by images, videos, speech, audio, 3D data, and structured data, achieving success in fields such as robotics, healthcare, and autonomous driving. Despite these advances, however, the field still lacks a comprehensive review that addresses its numerous remaining challenges.
To fill this gap, we present the first systematic survey of MCoT reasoning, elucidating the foundational concepts and definitions pertinent to this area. Our work includes a detailed taxonomy and an analysis of existing methodologies across different applications, as well as insights into current challenges and future research directions aimed at fostering the development of multimodal reasoning.
2025-03-18: We release the Awesome-MCoT repo and survey.
- 🎖 MCoT Datasets and Benchmarks
- 🎊 Multimodal Reasoning via RL
- ✨ MCoT Over Various Modalities
- 🔥 MCoT Methodologies
- 🎨 Applications with MCoT Reasoning
- 🚀 Useful Links
- ❤️ Citation
- ⭐️ Star History
- "MC" and "Open" refer to multiple-choice and open-ended answer formats.
- "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.
**Datasets for MCoT training with rationale annotations:**

Datasets | Year | Task | Domain | Modality | Format | Samples |
---|---|---|---|---|---|---|
ScienceQA | 2022 | VQA | Science | T, I | MC | 21K |
A-OKVQA | 2022 | VQA | Common | T, I | MC | 25K |
EgoCoT | 2023 | VideoQA | Common | T, V | Open | 200M |
VideoCoT | 2024 | VideoQA | Human Action | T, V | Open | 22K |
VideoEspresso | 2024 | VideoQA | Common | T, V | Open | 202,164 |
EMMA-X | 2024 | Robot Manipulation | Indoor | T, V | Robot Actions | 60K |
M3CoT | 2024 | VQA | Science, Math, Common | T, I | MC | 11.4K |
MAVIS | 2024 | ScienceQA | Math | T, I | MC and Open | 834K |
LLaVA-CoT-100k | 2024 | VQA | Common, Science | T, I | MC and Open | 100K |
MAmmoTH-VL | 2024 | Diverse | Diverse | T, I | MC and Open | 12M |
Mulberry-260k | 2024 | Diverse | Diverse | T, I | MC and Open | 260K |
MM-Verify | 2025 | MathQA | Math | T, I | MC and Open | 59,772 |
VisualPRM400K | 2025 | ScienceQA | Math, Science | T, I | MC and Open | 400K |
R1-OneVision | 2025 | Diverse | Diverse | T, I | MC and Open | 155K |
**Benchmarks for MCoT evaluation:**

Benchmarks | Year | Task | Domain | Modality | Format | Samples |
---|---|---|---|---|---|---|
MMMU | 2023 | VQA | Arts, Science | T, I | MC and Open | 11.5K |
SEED | 2023 | VQA | Common | T, I | MC | 19K |
MathVista | 2023 | ScienceQA | Math | T, I | MC and Open | 6,141 |
MathVerse | 2024 | ScienceQA | Math | T, I | MC and Open | 15K |
Math-Vision | 2024 | ScienceQA | Math | T, I | MC and Open | 3,040 |
MeViS | 2023 | Referring VOS | Common | T, V | Dense Mask | 2K |
VSIBench | 2024 | VideoQA | Indoor | T, V | MC and Open | 5K |
HallusionBench | 2024 | VQA | Common | T, I | Yes-No | 1,129 |
AV-Odyssey | 2024 | AVQA | Common | T, V, A | MC | 4,555 |
AVHBench | 2024 | AVQA | Common | T, V, A | Open | 5,816 |
RefAVS-Bench | 2024 | Referring AVS | Common | T, V, A | Dense Mask | 4,770 |
MMAU | 2024 | AQA | Common | T, A | MC | 10K |
AVTrustBench | 2025 | AVQA | Common | T, V, A | MC and Open | 600K |
MIG-Bench | 2025 | Multi-image Grounding | Common | T, I | BBox | 5.89K |
MedAgentsBench | 2025 | MedicalQA | Medical | T, I | MC and Open | 862 |
OSWorld | 2024 | Agent | Real Comp. Env. | T, I | Agent Action | 369 |
AgentClinic | 2024 | MedicalQA | Medical | T, I | Open | 335 |
**Benchmarks focused on evaluating MCoT reasoning quality:**

Benchmarks | Year | Task | Domain | Modality | Format | Samples |
---|---|---|---|---|---|---|
CoMT | 2024 | VQA | Common | T, I | MC | 3,853 |
OmniBench | 2024 | VideoQA | Common | T, I, A | MC | 1,142 |
WorldQA | 2024 | VideoQA | Common | T, V, A | Open | 1,007 |
MiCEval | 2024 | VQA | Common | T, I | Open | 643 |
OlympiadBench | 2024 | ScienceQA | Math, Physics | T, I | Open | 8,476 |
MME-CoT | 2025 | VQA | Science, Math, Common | T, I | MC and Open | 1,130 |
EMMA | 2025 | VQA | Science | T, I | MC and Open | 2,788 |
VisualProcessBench | 2025 | ScienceQA | Math, Science | T, I | MC and Open | 2,866 |
- The following table summarizes the techniques used by MLLMs with RL for better long-MCoT reasoning, where "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.
- In summary, RL unlocks complex reasoning and "aha-moment" behavior without SFT, demonstrating its potential to enhance model capabilities through iterative self-improvement and rule-based approaches, ultimately paving the way for more advanced and autonomous multimodal reasoning systems. A minimal sketch of the group-relative reward idea behind GRPO follows the table.
Model | Foundational LLMs | Modality | Learning | Cold Start | Algorithm | Aha-moment |
---|---|---|---|---|---|---|
Deepseek-R1-Zero | Deepseek-V3 | T | RL | ❌ | GRPO | ✅ |
Deepseek-R1 | Deepseek-V3 | T | SFT+RL | ✅ | GRPO | - |
LLaVA-Reasoner | LLaMA3-LLaVA-NEXT-8B | T,I | SFT+RL | ✅ | DPO | - |
Insight-V | LLaMA3-LLaVA-NEXT-8B | T,I | SFT+RL | ✅ | DPO | - |
Multimodal-Open-R1 | Qwen2-VL-7B-Instruct | T,I | RL | ❌ | GRPO | ❌ |
R1-OneVision | Qwen2.5-VL-7B-Instruct | T,I | SFT | - | - | - |
R1-V | Qwen2.5-VL | T,I | RL | ❌ | GRPO | ❌ |
VLM-R1 | Qwen2.5-VL | T,I | RL | ❌ | GRPO | ❌ |
LMM-R1 | Qwen2.5-VL-Instruct-3B | T,I | RL | ❌ | PPO | ❌ |
Curr-ReFT | Qwen2.5-VL-3B | T,I | RL+SFT | ❌ | GRPO | - |
Seg-Zero | Qwen2.5-VL-3B + SAM2 | T,I | RL | ❌ | GRPO | ❌ |
MM-Eureka | InternVL2.5-Instruct-8B | T,I | SFT+RL | ✅ | RLOO | - |
MM-Eureka-Zero | InternVL2.5-Pretrained-38B | T,I | RL | ❌ | GRPO | ✅ |
VisualThinker-R1-Zero | Qwen2-VL-2B | T,I | RL | ❌ | GRPO | ✅ |
Easy-R1 | Qwen2.5-VL | T,I | RL | ❌ | GRPO | - |
Open-R1-Video | Qwen2-VL-7B | T,I,V | RL | ❌ | GRPO | ❌ |
R1-Omni | HumanOmni-0.5B | T,I,V,A | SFT+RL | ✅ | GRPO | - |
VisRL | Qwen2.5-VL-7B | T,I | SFT+RL | ✅ | DPO | - |
R1-VL | Qwen2-VL-7B | T,I | RL | ❌ | StepGRPO | - |
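Since most rows above use GRPO with rule-based rewards, here is a minimal, illustrative Python sketch of the core idea: sample a group of responses per prompt, score each with a verifiable reward, and normalize rewards within the group so that no learned critic is needed. The tag format and reward values below are assumptions for illustration, not any listed repo's implementation.

```python
# Illustrative sketch of GRPO-style group-relative advantages with a
# rule-based reward. Tag format and reward weights are assumptions.
import re
from statistics import mean, pstdev

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy verifiable reward: +1 for a correct answer, +0.1 for good format."""
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    correct = m is not None and m.group(1).strip() == gold_answer
    return (1.0 if correct else 0.0) + (0.1 if fmt_ok else 0.0)

def group_relative_advantages(responses, gold_answer):
    """GRPO's key trick: normalize rewards within a sampled group,
    replacing a learned value function (critic) entirely."""
    rewards = [rule_based_reward(r, gold_answer) for r in responses]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# One prompt, a sampled group of G=3 candidate responses.
group = [
    "<think>2 + 2 = 4</think> <answer>4</answer>",   # correct, well-formatted
    "<think>maybe 5?</think> <answer>5</answer>",    # wrong answer
    "4",                                             # right content, no tags
]
print(group_relative_advantages(group, "4"))  # highest advantage for the first
```

These advantages then weight the token-level policy-gradient update; the rule-based reward is what lets the "aha-moment" behavior emerge without SFT supervision.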
- MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
- Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
- CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
- RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
- RelationLMM: Large Multimodal Model as Open and Versatile Visual Relationship Generalist
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
- AR-MCTS: Progressive Multimodal Reasoning via Active Retrieval
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
- PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- Visual CoT: Advancing MLLMs with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension
- PS-CoT-Adapter: adapting plan-and-solve chain-of-thought for ScienceQA
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
- R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
- DCoT: Dual Chain-of-Thought Prompting for Large Multimodal Models
- Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
- Chain-of-Exemplar: Enhancing Distractor Generation for Multimodal Educational Question Generation
- A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought
- MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration
- Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
- Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
- KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
- CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
- Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
- The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models
- CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting
- DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
- Thinking Like an Expert: Multimodal Hypergraph-of-Thought (HoT) Reasoning to Boost Foundation Models
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
- Video-R1: Towards Super Reasoning Ability in Video Understanding
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
- Following Clues, Approaching the Truth: Explainable Micro-Video Rumor Detection via Chain-of-Thought Reasoning
- Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts
- Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning
- TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos
- CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
- DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework
- Large Vision-Language Models as Emotion Recognizers in Context Awareness
- Hallucination Mitigation Prompts Long-term Video Understanding
- Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
- AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
- CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
- L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
- 3D-PreMise: Can Large Language Models Generate 3D Shapes with Sharp Features and Parametric Control?
- R1-AQA: Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
- Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
- Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data
- CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
- Multimodal Graph Contrastive Learning and Prompt for ChartQA
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
- Chain of Empathy: Enhancing Empathetic Response of Large Language Models Based on Psychotherapy Models
- Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
- AVQA-CoT: When CoT Meets Question Answering in Audio-Visual Scenarios
MCoT reasoning methodologies primarily concern how rationales are constructed and fall into three types: prompt-based, plan-based, and learning-based.
- Prompt-based MCoT reasoning employs carefully designed prompts, including instructions or in-context demonstrations, to guide models in generating rationales during inference, typically in zero-shot or few-shot settings (a minimal prompt sketch follows this list).
- Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
- Plan-based MCoT reasoning enables models to dynamically explore and refine thoughts during the reasoning process.
- A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
- Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning
- Learning-based MCoT reasoning embeds rationale construction within the training or fine-tuning process, requiring models to explicitly learn reasoning skills alongside multimodal inputs.
- Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
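To make the prompt-based setting concrete, here is a minimal sketch of assembling a zero-shot MCoT request: the image is attached and the instruction elicits step-by-step reasoning before the answer. The message schema and model name are placeholder assumptions, not tied to any specific API from the survey.

```python
# Minimal sketch of zero-shot prompt-based MCoT over an image.
# The chat-message schema below is illustrative, not a real API contract.
import base64

def build_mcot_request(image_path: str, question: str) -> dict:
    """Assemble a zero-shot MCoT chat request for a generic multimodal model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    instruction = (
        "Look at the image and answer the question. Think step by step: "
        "first describe the relevant visual evidence, then reason over it, "
        "and only then state the final answer."
    )
    return {
        "model": "any-multimodal-chat-model",  # placeholder, not a real model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{instruction}\n\nQuestion: {question}"},
                {"type": "image", "data": image_b64},  # schema is illustrative
            ],
        }],
    }
```

A few-shot variant would simply prepend in-context demonstrations (image, question, rationale, answer) as prior turns.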
Structural reasoning aims to enhance the controllability and interpretability of the rationale generation process. Structured formats fall into three types: asynchronous modality modeling, defined procedure staging, and autonomous procedure staging; a staged-prompt sketch follows the examples below.
- Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
- Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
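As a concrete illustration of defined procedure staging, systems in the spirit of LLaVA-CoT force the rationale through fixed, tagged stages. The sketch below is a simplified rendering of that idea under assumed tag names, not an exact reproduction of any paper's format.

```python
# Sketch of "defined procedure staging": the rationale is forced through
# fixed stages via explicit tags (in the spirit of LLaVA-CoT).
# Tag names and parsing logic here are illustrative assumptions.
import re

STAGED_PROMPT = (
    "Answer using exactly these four stages:\n"
    "<SUMMARY> restate the problem </SUMMARY>\n"
    "<CAPTION> describe the relevant image content </CAPTION>\n"
    "<REASONING> reason step by step </REASONING>\n"
    "<CONCLUSION> final answer only </CONCLUSION>"
)

def parse_stages(response: str) -> dict:
    """Extract each tagged stage from a model response (empty string if absent)."""
    stages = {}
    for tag in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.S)
        stages[tag.lower()] = m.group(1).strip() if m else ""
    return stages
```

Fixed stages make the chain easy to verify and supervise; autonomous procedure staging instead lets the model decide which stages to invoke.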
Enhancing multimodal inputs facilitates comprehensive reasoning by integrating expert tools and internal or external knowledge; a tool-use sketch follows the examples below.
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
- AR-MCTS: Progressive Multimodal Reasoning via Active Retrieval
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
- Grounded Chain-of-Thought for Multimodal Large Language Models
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
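For intuition, here is a small sketch of tool-based input enhancement in the style of DetToolChain-like pipelines: a detector grounds the question in boxes before the MLLM reasons. The `detect_objects` helper is hypothetical and returns canned results; a real pipeline would call an actual detection model.

```python
# Sketch of enhancing multimodal inputs with an expert tool: run a
# (hypothetical) detector first, then feed its grounded outputs into the
# reasoning prompt so the rationale can cite concrete regions.
from typing import List, Tuple

def detect_objects(image_path: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Hypothetical stand-in for a real detector; returns canned (label, bbox) pairs."""
    return [("dog", (34, 50, 210, 300)), ("frisbee", (180, 40, 230, 90))]

def enhanced_prompt(image_path: str, question: str) -> str:
    """Prepend detector outputs so the rationale can cite grounded evidence."""
    evidence = "\n".join(
        f"- {label} at bbox {box}" for label, box in detect_objects(image_path)
    )
    return (
        f"Detected objects (from an external detector):\n{evidence}\n\n"
        f"Question: {question}\n"
        "Use the detections as grounded evidence and reason step by step."
    )

print(enhanced_prompt("example.jpg", "What is the dog about to catch?"))
```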
The reasoning process may adopt either text-only or multimodal rationales.
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts
- T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
- RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
- Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration
- SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
- Memory-Driven Multimodal Chain of Thought for Embodied Long-Horizon Task Planning
- OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
- ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
- SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World
- DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
- OpenManus: An open-source framework for building general AI agents
- Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
- PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving
- Learning Autonomous Driving Tasks via Human Feedbacks with Large Language Models
- Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
- Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles
- DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving
- MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
- Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning
- TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos
- Open Set Video HOI detection from Action-centric Chain-of-Look Prompting
- Chain-of-Exemplar: Enhancing Distractor Generation for Multimodal Educational Question Generation
- X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
- Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis
- Chain of Empathy: Enhancing Empathetic Response of Large Language Models Based on Psychotherapy Models
- Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
- L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
- 3D-PreMise: Can Large Language Models Generate 3D Shapes with Sharp Features and Parametric Control?
- From System 1 to System 2: A Survey of Reasoning Large Language Models
- Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
We would be honored if this work assists you; please consider starring the repo and citing it:
@article{wang2025multimodal,
  author = {Wang, Yaoting and Wu, Shengqiong and Zhang, Yuecheng and Herzig, Roei and Yan, Shuicheng and Liu, Ziwei and Luo, Jiebo and Fei, Hao},
  title  = {Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey},
  year   = {2025},
}