Logic-RL-Lite is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates whether pure reinforcement learning (RL), without supervised fine-tuning (SFT), can post-train base models for reasoning capabilities. It is a follow-up to the Logic-RL project.
It leverages the following key components:
- RL Framework: veRL
- RL Algorithms: REINFORCE++ and GRPO
- RL Dataset: Knights and Knaves (K&K) Logic Puzzle Dataset
- Base Models: Qwen2.5 (1.5B, 3B), Llama3.2 (3B)
Knights and Knaves (K&K) Logic Puzzle: Imagine there are two types of people: Knights and Knaves. Knights always tell the truth. Knaves always lie.
The K&K dataset is designed to test logical reasoning capabilities by presenting puzzles involving statements made by multiple "people," where the goal is to determine who is a knight and who is a knave based on the given clues.
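For a concrete sense of the task, the snippet below brute-forces a small two-person puzzle of this type. The puzzle instance and the helper name `solve_kk` are illustrative, not drawn from the dataset.

```python
from itertools import product

def solve_kk(statements):
    """Brute-force a Knights and Knaves puzzle.

    `statements` maps each person's name to a function that, given a candidate
    assignment (name -> True for knight, False for knave), returns whether that
    person's statement is true under it. A knight's statement must be true;
    a knave's statement must be false.
    """
    names = list(statements)
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[n] == statements[n](assignment) for n in names):
            solutions.append(assignment)
    return solutions

# Illustrative puzzle: A says "B is a knave"; B says "A and I are both knaves".
puzzle = {
    "A": lambda a: not a["B"],
    "B": lambda a: (not a["A"]) and (not a["B"]),
}
print(solve_kk(puzzle))  # [{'A': True, 'B': False}] -> A is a knight, B is a knave
```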
- Format Reward: Yes
- Answer Reward: Yes (a rule-based sketch of both rewards follows this list)
- Language Consistency Reward or Others: No
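A minimal sketch of how such rule-based rewards might be computed is shown below. The `<think>`/`<answer>` tag layout, the score values, and the function names are assumptions for illustration, not the exact reward implementation in this repository.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that follow the assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else -1.0

def answer_reward(response: str, ground_truth: dict) -> float:
    """Check the knight/knave assignment inside <answer>...</answer>
    against the ground truth, e.g. {"A": "knight", "B": "knave"}."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -2.0
    answer = match.group(1).lower()
    correct = all(
        re.search(rf"{name.lower()}\s+is\s+a\s+{role}", answer)
        for name, role in ground_truth.items()
    )
    return 2.0 if correct else -1.0

def total_reward(response: str, ground_truth: dict) -> float:
    """Combine format and answer rewards (weights are illustrative)."""
    return format_reward(response) + answer_reward(response, ground_truth)
```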
After configuring your WandB, GPUs, and other settings, execute the training:
```bash
bash run_rl_trainer_xxx.sh
```
For more detailed visualizations, refer to my WandB report:
Logic-RL-Lite Training Report
Note: The findings may be specific to this experimental setup.
- 1.5B Models and Smaller:
- Neither instruction-tuned nor pretrained models learn reasoning in this setup.
- 3B Models:
- Instruction-tuned models (e.g., Qwen2.5-3B) can learn reasoning.
- Pretrained models (e.g., Llama3.2-3B) struggle to learn reasoning.
- Hypothesis: Qwen2.5-3B-Pretrain is likely somewhat instruction-tuned, making it significantly "smarter" than Llama3.2-3B-Pretrain.
- 7B Models and Larger:
- Consistently learn reasoning.
- Self-reflection and rethinking behaviors appear at epoch 0 (or even step 0) in instruction-tuned base models.
- These behaviors likely stem from instruction tuning, rather than emergent properties of pure RL.
- See findings from OAT-ZERO and Logic-RL.
Table: Appearance of Self-Reflection and Verification Keywords During Training (Base Model = Qwen2.5-3B-Instruct)

| Keyword      | Epoch | Step |
|--------------|-------|------|
| rethink      | 0     | 4    |
| re-think     | N/A   | N/A  |
| think again  | N/A   | N/A  |
| retry        | N/A   | N/A  |
| re-try       | N/A   | N/A  |
| try again    | N/A   | N/A  |
| recheck      | 0     | 0    |
| re-check     | 0     | 14   |
| check again  | 0     | 52   |
| reevaluate   | 0     | 121  |
| re-evaluate  | 0     | 0    |
| double check | 0     | 1    |
| double-check | 0     | 7    |
| verify       | 0     | 1    |
| aha          | N/A   | N/A  |
| wait         | 0     | 63   |
Table: Appearance of Summarization Keywords During Training (Same Setup)

| Keyword   | Epoch | Step |
|-----------|-------|------|
| summarize | 0     | 1    |
| summary   | 0     | 0    |
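Counts like those above can be obtained by scanning rollouts saved at each training step. The sketch below shows one way to do that; the data layout (`{step: [rollout strings]}`) and the function name are assumptions for illustration.

```python
REFLECTION_KEYWORDS = [
    "rethink", "re-think", "think again", "retry", "re-try", "try again",
    "recheck", "re-check", "check again", "reevaluate", "re-evaluate",
    "double check", "double-check", "verify", "aha", "wait",
]

def first_keyword_appearance(rollouts_per_step):
    """Given {global_step: [rollout strings]}, return {keyword: first step it appears at}."""
    first_seen = {}
    for step in sorted(rollouts_per_step):
        text = " ".join(rollouts_per_step[step]).lower()
        for kw in REFLECTION_KEYWORDS:
            if kw not in first_seen and kw in text:
                first_seen[kw] = step
    return first_seen
```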
- While CoT length and mean reward both increase during training, longer CoT does not correlate with higher answer accuracy.
- This aligns with the superficial self-reflection findings from OAT-ZERO.
- Left Figure: Answer accuracy versus token-count distribution.
- Right Figure: Regression of accuracy on token count (a sketch of this analysis follows the list).
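The regression in the right figure can be reproduced with an ordinary least-squares fit of per-sample correctness on CoT token count. The sketch below uses `scipy.stats.linregress`; the numbers in it are purely illustrative, not measured results.

```python
import numpy as np
from scipy.stats import linregress

# token_counts[i]: length of the i-th sampled CoT (in tokens)
# correct[i]:      1 if the corresponding final answer was correct, else 0
token_counts = np.array([180, 420, 660, 900, 1150])   # illustrative numbers
correct      = np.array([1,   1,   0,   1,   0])

result = linregress(token_counts, correct)
print(f"slope={result.slope:.2e}, r={result.rvalue:.3f}, p={result.pvalue:.3f}")
# A slope near zero (or negative) with a weak r value indicates that longer
# chains of thought do not translate into higher accuracy.
```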
- Instruction-Tuned Model as Base Model:
- Rare occurrences of language mixing.
- Pretrained Model as Base Model:
- Language mixing is more prevalent.
| Output Type        | Only English | Only Chinese | Mixed (English & Chinese) |
|--------------------|--------------|--------------|---------------------------|
| `model_think`      | 98.71%       | 0.00%        | 0.82%                     |
| `model_answer_raw` | 99.44%       | 0.00%        | 0.00%                     |
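Percentages like these can be computed by classifying each output according to the scripts it contains. The sketch below uses a simple Unicode-range check; the category names and function are illustrative assumptions.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")   # common CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")

def classify_language(text: str) -> str:
    """Label a model output as 'only_english', 'only_chinese', 'mixed', or 'other'."""
    has_zh = bool(CJK.search(text))
    has_en = bool(LATIN.search(text))
    if has_en and has_zh:
        return "mixed"
    if has_en:
        return "only_english"
    if has_zh:
        return "only_chinese"
    return "other"
```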
- REINFORCE++ appears more stable than GRPO in these runs.
- Further experiments are needed to confirm this finding (see the simplified advantage-computation sketch after this list for the key algorithmic difference).
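One plausible factor behind the stability difference is how the two algorithms form advantages: GRPO normalizes rewards within the group of samples drawn for the same prompt, whereas REINFORCE++ normalizes returns across the whole batch without a per-prompt group baseline. The sketch below contrasts the two in a deliberately simplified form (KL shaping and token-level details are omitted); it is not the veRL implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Simplified GRPO: normalize rewards within each group of samples
    drawn for the same prompt. `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def reinforce_pp_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Simplified REINFORCE++: normalize returns across the whole batch
    (no per-prompt group baseline; KL shaping and per-token details omitted)."""
    flat = rewards.reshape(-1)
    return ((flat - flat.mean()) / (flat.std() + eps)).reshape(rewards.shape)

# Example: 2 prompts, 4 samples each, with rule-based rewards
rewards = np.array([[3.0, -1.0, -1.0, 3.0],
                    [-2.0, -2.0, -2.0, 3.0]])
print(grpo_advantages(rewards))
print(reinforce_pp_advantages(rewards))
```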
This project builds upon and references several open-source works:
- veRL Framework: Reinforcement learning framework.
- Logic-RL: Reproduction of R1-Zero on logic puzzles.
- OAT-ZERO: Insights on reasoning with pure RL.
- TinyZero: Implementation of reward models and Countdown task.
- DeepScaler: Iterative context scaling with GRPO.
- Knights and Knaves (K&K) Puzzle Dataset: Logical reasoning tasks for LLMs.