Logic-RL-Lite is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates whether pure reinforcement learning (RL), without supervised fine-tuning (SFT), can post-train base models for reasoning capabilities. It is a follow-up to the Logic-RL project.
It leverages the following key components:
- RL Framework: veRL
- RL Algorithms: REINFORCE++ and GRPO
- RL Dataset: Knights and Knaves (K&K) Logic Puzzle Dataset
- Base Models: Qwen2.5 (1.5B, 3B), Llama3.2 (3B)
Knights and Knaves (K&K) Logic Puzzle: Imagine there are two types of people: Knights and Knaves. Knights always tell the truth. Knaves always lie.
The K&K dataset is designed to test logical reasoning capabilities by presenting puzzles involving statements made by multiple "people," where the goal is to determine who is a knight and who is a knave based on the given clues.
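For a concrete sense of the task, the snippet below brute-forces a small two-person puzzle of this type. The puzzle instance and the helper name `solve_kk` are illustrative, not drawn from the dataset.

```python
from itertools import product

def solve_kk(statements):
    """Brute-force a Knights and Knaves puzzle.

    `statements` maps each person's name to a function that, given a candidate
    assignment (name -> True for knight, False for knave), returns whether that
    person's statement is true under it. A knight's statement must be true;
    a knave's statement must be false.
    """
    names = list(statements)
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[n] == statements[n](assignment) for n in names):
            solutions.append(assignment)
    return solutions

# Illustrative puzzle: A says "B is a knave"; B says "A and I are both knaves".
puzzle = {
    "A": lambda a: not a["B"],
    "B": lambda a: (not a["A"]) and (not a["B"]),
}
print(solve_kk(puzzle))  # [{'A': True, 'B': False}] -> A is a knight, B is a knave
```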
- Format Reward: Yes
- Answer Reward: Yes (a rule-based sketch of both rewards follows this list)
- Language Consistency Reward or Others: No
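A minimal sketch of how such rule-based rewards might be computed is shown below. The `<think>`/`<answer>` tag layout, the score values, and the function names are assumptions for illustration, not the exact reward implementation in this repository.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that follow the assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else -1.0

def answer_reward(response: str, ground_truth: dict) -> float:
    """Check the knight/knave assignment inside <answer>...</answer>
    against the ground truth, e.g. {"A": "knight", "B": "knave"}."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -2.0
    answer = match.group(1).lower()
    correct = all(
        re.search(rf"{name.lower()}\s+is\s+a\s+{role}", answer)
        for name, role in ground_truth.items()
    )
    return 2.0 if correct else -1.0

def total_reward(response: str, ground_truth: dict) -> float:
    """Combine format and answer rewards (weights are illustrative)."""
    return format_reward(response) + answer_reward(response, ground_truth)
```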
After configuring your WandB, GPUs, and other settings, execute the training:
```bash
bash run_rl_trainer_xxx.sh
```
For more detailed visualizations, refer to my WandB report:
Logic-RL-Lite Training Report
Note: The findings may be specific to this experimental setup.
- 1.5B Models and Smaller:
- Neither instruction-tuned nor pretrained models learn reasoning in this setup.
- 3B Models:
- Instruction-tuned models (e.g., Qwen2.5-3B) can learn reasoning.
- Pretrained models (e.g., Llama3.2-3B) struggle to learn reasoning.
- Hypothesis: Qwen2.5-3B-Pretrain is likely somewhat instruction-tuned, making it significantly "smarter" than Llama3.2-3B-Pretrain.
- 7B Models and Larger:
- Consistently learn reasoning.
- Self-reflection and rethinking behaviors appear at epoch 0 (or even step 0) in instruction-tuned base models.
- These behaviors likely stem from instruction tuning, rather than emergent properties of pure RL.
- See findings from OAT-ZERO and Logic-RL.
Table: Appearance of Self-Reflection and Verification Keywords During Training (Base Model = Qwen2.5-3B-Instruct)

| Keyword      | Epoch | Step |
|--------------|-------|------|
| rethink      | 0     | 4    |
| re-think     | N/A   | N/A  |
| think again  | N/A   | N/A  |
| retry        | N/A   | N/A  |
| re-try       | N/A   | N/A  |
| try again    | N/A   | N/A  |
| recheck      | 0     | 0    |
| re-check     | 0     | 14   |
| check again  | 0     | 52   |
| reevaluate   | 0     | 121  |
| re-evaluate  | 0     | 0    |
| double check | 0     | 1    |
| double-check | 0     | 7    |
| verify       | 0     | 1    |
| aha          | N/A   | N/A  |
| wait         | 0     | 63   |
Table: Appearance of Summarization Keywords During Training (Same Setup)

| Keyword   | Epoch | Step |
|-----------|-------|------|
| summarize | 0     | 1    |
| summary   | 0     | 0    |
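Counts like those above can be obtained by scanning rollouts saved at each training step. The sketch below shows one way to do that; the data layout (`{step: [rollout strings]}`) and the function name are assumptions for illustration.

```python
REFLECTION_KEYWORDS = [
    "rethink", "re-think", "think again", "retry", "re-try", "try again",
    "recheck", "re-check", "check again", "reevaluate", "re-evaluate",
    "double check", "double-check", "verify", "aha", "wait",
]

def first_keyword_appearance(rollouts_per_step):
    """Given {global_step: [rollout strings]}, return {keyword: first step it appears at}."""
    first_seen = {}
    for step in sorted(rollouts_per_step):
        text = " ".join(rollouts_per_step[step]).lower()
        for kw in REFLECTION_KEYWORDS:
            if kw not in first_seen and kw in text:
                first_seen[kw] = step
    return first_seen
```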
- While CoT length and mean reward both increase during training, longer CoT does not correlate with higher answer accuracy.
- This aligns with the superficial self-reflection findings from OAT-ZERO.
- Left Figure: Answer accuracy versus token-count distribution.
- Right Figure: Regression of accuracy on token count (a sketch of this analysis follows the list).
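The regression in the right figure can be reproduced with an ordinary least-squares fit of per-sample correctness on CoT token count. The sketch below uses `scipy.stats.linregress`; the numbers in it are purely illustrative, not measured results.

```python
import numpy as np
from scipy.stats import linregress

# token_counts[i]: length of the i-th sampled CoT (in tokens)
# correct[i]:      1 if the corresponding final answer was correct, else 0
token_counts = np.array([180, 420, 660, 900, 1150])   # illustrative numbers
correct      = np.array([1,   1,   0,   1,   0])

result = linregress(token_counts, correct)
print(f"slope={result.slope:.2e}, r={result.rvalue:.3f}, p={result.pvalue:.3f}")
# A slope near zero (or negative) with a weak r value indicates that longer
# chains of thought do not translate into higher accuracy.
```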
- Instruction-Tuned Model as Base Model:
- Rare occurrences of language mixing.
- Pretrained Model as Base Model:
- Language mixing is more prevalent.
| Output Type        | Only English | Only Chinese | Mixed (English & Chinese) |
|--------------------|--------------|--------------|---------------------------|
| `model_think`      | 98.71%       | 0.00%        | 0.82%                     |
| `model_answer_raw` | 99.44%       | 0.00%        | 0.00%                     |
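Percentages like these can be computed by classifying each output according to the scripts it contains. The sketch below uses a simple Unicode-range check; the category names and function are illustrative assumptions.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")   # common CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")

def classify_language(text: str) -> str:
    """Label a model output as 'only_english', 'only_chinese', 'mixed', or 'other'."""
    has_zh = bool(CJK.search(text))
    has_en = bool(LATIN.search(text))
    if has_en and has_zh:
        return "mixed"
    if has_en:
        return "only_english"
    if has_zh:
        return "only_chinese"
    return "other"
```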
- REINFORCE++ appears more stable than GRPO in these runs.
- Further experiments are needed to confirm this finding (see the simplified advantage-computation sketch after this list for the key algorithmic difference).
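One plausible factor behind the stability difference is how the two algorithms form advantages: GRPO normalizes rewards within the group of samples drawn for the same prompt, whereas REINFORCE++ normalizes returns across the whole batch without a per-prompt group baseline. The sketch below contrasts the two in a deliberately simplified form (KL shaping and token-level details are omitted); it is not the veRL implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Simplified GRPO: normalize rewards within each group of samples
    drawn for the same prompt. `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def reinforce_pp_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Simplified REINFORCE++: normalize returns across the whole batch
    (no per-prompt group baseline; KL shaping and per-token details omitted)."""
    flat = rewards.reshape(-1)
    return ((flat - flat.mean()) / (flat.std() + eps)).reshape(rewards.shape)

# Example: 2 prompts, 4 samples each, with rule-based rewards
rewards = np.array([[3.0, -1.0, -1.0, 3.0],
                    [-2.0, -2.0, -2.0, 3.0]])
print(grpo_advantages(rewards))
print(reinforce_pp_advantages(rewards))
```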
This project builds upon and references several open-source works:
- veRL Framework: Reinforcement learning framework.
- Logic-RL: Reproduction of R1-Zero on logic puzzles.
- OAT-ZERO: Insights on reasoning with pure RL.
- TinyZero: Implementation of reward models and Countdown task.
- DeepScaler: Iterative context scaling with GRPO.
- Knights and Knaves (K&K) Puzzle Dataset: Logical reasoning tasks for LLMs.