
Logic-RL-Lite: Lightweight Replication of DeepSeek-R1-Zero

Logic-RL-Lite is a lightweight replication study of the DeepSeek-R1-Zero framework. The project investigates pure reinforcement learning (RL), without supervised fine-tuning (SFT), as a way to post-train base models for reasoning capabilities. It is a follow-up to the Logic-RL project.

It leverages the following key components:

  1. RL Framework: veRL
  2. RL Algorithms: REINFORCE++ and GRPO
  3. RL Dataset: Knights and Knaves (K&K) Logic Puzzle Dataset
  4. Base Models: Qwen2.5 (1.5B, 3B), Llama3.2 (3B)

Dataset

Knights and Knaves (K&K) Logic Puzzle: there are two types of people, knights and knaves. Knights always tell the truth; knaves always lie.

The K&K dataset tests logical reasoning by presenting puzzles in which several "people" make statements, and the goal is to determine who is a knight and who is a knave from those statements alone. A worked micro-example is sketched below.
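As a concrete illustration, here is a hypothetical sketch (Python, not part of the dataset's tooling) that brute-forces a tiny two-person puzzle: enumerate every knight/knave assignment and keep the ones in which each statement is true exactly when its speaker is a knight.

```python
# Hypothetical illustration (not from the K&K dataset tooling): a brute-force
# solver for a tiny Knights-and-Knaves puzzle. Each person is a knight (True,
# always truthful) or a knave (False, always lying). An assignment is
# consistent iff every claim's truth value equals its speaker's type.
from itertools import product

# Puzzle: A says "B is a knave."  B says "A and I are the same type."
statements = {
    "A": lambda t: not t["B"],        # "B is a knave"
    "B": lambda t: t["A"] == t["B"],  # "A and I are the same type"
}

for values in product([True, False], repeat=len(statements)):
    types = dict(zip(statements, values))
    # Keep only assignments where each claim holds iff its speaker is a knight.
    if all(claim(types) == types[who] for who, claim in statements.items()):
        print({who: "knight" if t else "knave" for who, t in types.items()})
# Prints: {'A': 'knight', 'B': 'knave'} -- the unique consistent assignment.
```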


RL Reward Design

  1. Format Reward: Yes
  2. Answer Reward: Yes
  3. Language Consistency Reward or other auxiliary rewards: No

(See the sketch after this list for one way the two rewards could be combined.)
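A minimal sketch, assuming a Logic-RL-style `<think>`/`<answer>` output format; the tag pattern, score values, and the `compute_reward` helper below are illustrative assumptions, not this repository's exact reward code:

```python
import re

# Hypothetical rule-based reward, assuming Logic-RL-style <think>/<answer> tags
# and a ground-truth dict such as {"A": "knight", "B": "knave"}. Tag format and
# score values are illustrative assumptions, not this repo's actual code.
def compute_reward(response: str, ground_truth: dict[str, str]) -> float:
    # Format reward: the response must contain one well-formed
    # <think>...</think> block followed by one <answer>...</answer> block.
    pattern = r"<think>(.*?)</think>\s*<answer>(.*?)</answer>"
    match = re.search(pattern, response, re.DOTALL)
    if match is None:
        return -1.0  # malformed output: penalize and skip the answer check

    answer_text = match.group(2)
    # Answer reward: every person must be labeled correctly in the answer.
    correct = all(
        re.search(rf"\b{name}\b[^.]*\b{role}\b", answer_text, re.IGNORECASE)
        for name, role in ground_truth.items()
    )
    return 1.0 if correct else -0.5


# Example: well-formatted and fully correct earns the full reward.
resp = "<think>B lies, so...</think><answer>A is a knight. B is a knave.</answer>"
print(compute_reward(resp, {"A": "knight", "B": "knave"}))  # 1.0
```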

Training

After configuring your WandB, GPUs, and other settings, launch training:

bash run_rl_trainer_xxx.sh
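The exact launcher contents live in this repository's scripts; as a hedged illustration only, a minimal run_rl_trainer script using veRL's Hydra-style overrides might look like the sketch below. The dataset paths, batch size, model path, and advantage estimator here are placeholder assumptions, not this repo's actual settings.

```bash
#!/bin/bash
# Hypothetical sketch of a run_rl_trainer_*.sh launcher (illustrative only).
# Option names follow veRL's Hydra-style overrides; all values are placeholders.
set -x

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=data/kk/train.parquet \
    data.val_files=data/kk/test.parquet \
    data.train_batch_size=8 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.logger=['console','wandb'] \
    trainer.project_name=logic-rl-lite
```

In veRL, the advantage estimator setting (e.g., grpo or reinforce_plus_plus) is what selects between the two RL algorithms compared in this study.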

Key Findings

For more visual detail, refer to my WandB report:
Logic-RL-Lite Training Report

Note: these findings may be specific to this experimental setup.

1. Smallest Model Capable of Learning Reasoning

  • 1.5B Models and Smaller:
    • Neither instruction-tuned nor pretrained models learn reasoning.
  • 3B Models:
    • Instruction-tuned models (e.g., Qwen2.5-3B-Instruct) can learn reasoning.
    • Pretrained models (e.g., Llama3.2-3B) struggle to learn reasoning.
    • Hypothesis: Qwen2.5-3B-Pretrain has likely seen some instruction-style data during pretraining, making it significantly "smarter" than Llama3.2-3B-Pretrain.
  • 7B Models and Larger:
    • Consistently learn reasoning.

2. No "Aha Moment" During Pure RL

  • Self-reflection and rethinking behaviors appear at epoch 0 (or even step 0) in instruction-tuned base models.
  • These behaviors likely stem from instruction tuning rather than emerging as properties of pure RL.
  • See findings from OAT-ZERO and Logic-RL.

Table: First Appearance of Self-Reflection and Verification Keywords During Training (Base Model = Qwen2.5-3B-Instruct)

| Keyword      | Epoch | Step |
|--------------|-------|------|
| rethink      | 0     | 4    |
| re-think     | N/A   | N/A  |
| think again  | N/A   | N/A  |
| retry        | N/A   | N/A  |
| re-try       | N/A   | N/A  |
| try again    | N/A   | N/A  |
| recheck      | 0     | 0    |
| re-check     | 0     | 14   |
| check again  | 0     | 52   |
| reevaluate   | 0     | 121  |
| re-evaluate  | 0     | 0    |
| double check | 0     | 1    |
| double-check | 0     | 7    |
| verify       | 0     | 1    |
| aha          | N/A   | N/A  |
| wait         | 0     | 63   |

Table: First Appearance of Summarization Keywords During Training (Base Model = Qwen2.5-3B-Instruct)

| Keyword   | Epoch | Step |
|-----------|-------|------|
| summarize | 0     | 1    |
| summary   | 0     | 0    |
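A minimal sketch of how such keyword tracking could be implemented (the `track_keywords` helper and the keyword list here are illustrative, not this repo's logging code): scan each generated rollout and record the first (epoch, step) at which each keyword appears.

```python
# Hypothetical keyword tracker (illustrative; not this repo's logging code).
# Records the first (epoch, step) at which each self-reflection keyword
# appears in any generated rollout.
KEYWORDS = ["rethink", "recheck", "re-check", "double check", "verify", "wait", "aha"]

first_seen: dict[str, tuple[int, int]] = {}

def track_keywords(rollouts: list[str], epoch: int, step: int) -> None:
    """Call once per training step with the batch of generated texts."""
    for text in rollouts:
        lowered = text.lower()
        for kw in KEYWORDS:
            if kw not in first_seen and kw in lowered:
                first_seen[kw] = (epoch, step)

# Example usage inside a training loop:
track_keywords(["Wait, let me recheck statement 2..."], epoch=0, step=0)
print(first_seen)  # {'recheck': (0, 0), 'wait': (0, 0)}
```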

3. Longer Chain-of-Thought (CoT) ≠ Higher Accuracy

  • While CoT grows longer and mean rewards increase over training, longer CoT does not correlate with higher answer accuracy.
  • This aligns with superficial self-reflection findings from OAT-ZERO.

Figures (Base Model = Qwen2.5-3B-Instruct):

  • Left Figure: Answer accuracy versus token count distribution.
  • Right Figure: Regression analysis of accuracy against token count.
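A hedged sketch of the kind of analysis behind the right figure, regressing per-response correctness on response length; the column names and the toy data below are illustrative assumptions, not this repo's log schema or results:

```python
# Hypothetical analysis sketch: regress answer correctness on token count.
# Column names and the toy values are illustrative, not this repo's data.
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame({
    "num_tokens": [320, 410, 520, 640, 800, 950],
    "correct":    [1,   1,   0,   1,   0,   0],  # 1 = answer judged correct
})

fit = linregress(df["num_tokens"], df["correct"])
print(f"slope={fit.slope:.5f}, r={fit.rvalue:.3f}, p={fit.pvalue:.3f}")
# A slope near zero (or negative) with a weak fit indicates that longer
# CoT does not buy higher accuracy, matching the finding above.
```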

4. Language Mixing

  • Instruction-Tuned Model as Base Model:
    • Rare occurrences of language mixing.
  • Pretrained Model as Base Model:
    • Language mixing is more prevalent.

Table: Language Distribution in Model Outputs (Base Model = Qwen2.5-3B-Instruct)

| Output Type      | Only English | Only Chinese | Mixed (English & Chinese) |
|------------------|--------------|--------------|---------------------------|
| model_think      | 98.71%       | 0.00%        | 0.82%                     |
| model_answer_raw | 99.44%       | 0.00%        | 0.00%                     |
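A minimal sketch of one way outputs could be bucketed by language using Unicode ranges; this README does not specify the repo's actual classification method, so the `classify_language` helper below is an assumption:

```python
# Hypothetical language bucketing (illustrative; the repo's actual
# classification method is not specified in this README).
def classify_language(text: str) -> str:
    # CJK Unified Ideographs block covers common Chinese characters.
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_latin = any("a" <= ch.lower() <= "z" for ch in text)
    if has_cjk and has_latin:
        return "mixed"
    if has_cjk:
        return "only_chinese"
    if has_latin:
        return "only_english"
    return "other"

print(classify_language("A is a knight."))           # only_english
print(classify_language("A是骑士, so B must lie."))   # mixed
```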

5. Stability of RL Algorithms

  • REINFORCE++ appears more stable than GRPO.
  • Further experiments are needed to confirm this finding.

Acknowledgements

This project builds upon and references several open-source works, including veRL, Logic-RL, OAT-ZERO, and the Knights and Knaves (K&K) logic puzzle dataset.
