The reward function design #16

Yukang-Lin · 2025-03-15T09:30:55Z

Hi, thanks for your great work. I noticed that your paper describes a special design for GRPO: "We also employ two techniques to stabilize the RL training process: modified version of length reward [Yeo et al.] with weaker preference for short correct answers and importance sampling weight clipping [MiniMax et al.]."

However, I could not find any difference between your code and the official DeepScaleR project. Could you show me where these changes are implemented, or give me a hint? Thanks in advance!

