
update repro benchmarks and readme
Hritikbansal committed Jan 17, 2025
1 parent 8e39718 commit b22691e
Showing 2 changed files with 4 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -51,6 +51,8 @@ huggingface-cli login
- **HumanEval**: [Code generation and problem solving](https://github.com/openai/human-eval)
- **ZeroEval**: [Logical reasoning and problem solving](https://github.com/WildEval/ZeroEval)
- **MBPP**: [Python programming benchmark](https://github.com/google-research/google-research/tree/master/mbpp)
- **AIME24**: [Math reasoning dataset](https://huggingface.co/datasets/AI-MO/aimo-validation-aime)
- **AMC23**: [Math reasoning dataset](https://huggingface.co/datasets/AI-MO/aimo-validation-amc)
- **Arena-Hard-Auto** (Coming soon): [Automatic evaluation tool for instruction-tuned LLMs](https://github.com/lmarena/arena-hard-auto)
- **SWE-Bench** (Coming soon): [Evaluating large language models on real-world software issues](https://github.com/princeton-nlp/SWE-bench)
- **SafetyBench** (Coming soon): [Evaluating the safety of LLMs](https://github.com/thu-coai/SafetyBench)
2 changes: 2 additions & 0 deletions reproduced_benchmarks.md
@@ -51,3 +51,5 @@
| ZeroEval | Negin | meta-llama/Llama-3.1-8B-Instruct | crux | 40.75 | 39.88 |
| | | | math-l5 | 24.69 | 22.19 |
| | | | zebra | 11.70 | 12.8 |
| AMC23 | Hritik | Qwen/Qwen2.5-7B-Instruct | | 20/40 | 24/40 |
| AIME24 | Hritik | Qwen/Qwen2.5-7B-Instruct | | 4/30 | 3/30 |
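For easier comparison with the percentage scores in the other rows, the solved/total fractions for AMC23 and AIME24 can be converted to accuracies. A minimal sketch (the `accuracy` helper is illustrative and not part of the repo; score pairs are copied from the table above):

```python
# Convert "solved/total" benchmark scores to accuracy percentages.
# Score pairs are (reported, reproduced) from the table above.

def accuracy(solved: int, total: int) -> float:
    """Return percent accuracy, rounded to one decimal place."""
    return round(100 * solved / total, 1)

scores = {
    "AMC23": [(20, 40), (24, 40)],
    "AIME24": [(4, 30), (3, 30)],
}

for bench, (reported, reproduced) in scores.items():
    print(f"{bench}: reported {accuracy(*reported)}%, "
          f"reproduced {accuracy(*reproduced)}%")
# AMC23: reported 50.0%, reproduced 60.0%
# AIME24: reported 13.3%, reproduced 10.0%
```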
