diff --git a/README.md b/README.md
index 5991700..d47a862 100644
--- a/README.md
+++ b/README.md
@@ -51,6 +51,8 @@ huggingface-cli login
 - **HumanEval**: [Code generation and problem solving](https://github.com/openai/human-eval)
 - **ZeroEval**: [Logical reasoning and problem solving](https://github.com/WildEval/ZeroEval)
 - **MBPP**: [Python programming benchmark](https://github.com/google-research/google-research/tree/master/mbpp)
+- **AIME24**: [Math reasoning benchmark (AIME 2024)](https://huggingface.co/datasets/AI-MO/aimo-validation-aime)
+- **AMC23**: [Math reasoning benchmark (AMC 2023)](https://huggingface.co/datasets/AI-MO/aimo-validation-amc)
 - **Arena-Hard-Auto** (Coming soon): [Automatic evaluation tool for instruction-tuned LLMs](https://github.com/lmarena/arena-hard-auto)
 - **SWE-Bench** (Coming soon): [Evaluating large language models on real-world software issues](https://github.com/princeton-nlp/SWE-bench)
 - **SafetyBench** (Coming soon): [Evaluating the safety of LLMs](https://github.com/thu-coai/SafetyBench)
diff --git a/reproduced_benchmarks.md b/reproduced_benchmarks.md
index c7d9288..c5f4d23 100644
--- a/reproduced_benchmarks.md
+++ b/reproduced_benchmarks.md
@@ -51,3 +51,5 @@
 | ZeroEval | Negin | meta-llama/Llama-3.1-8B-Instruct |crux | 40.75 | 39.88 |
 | | |math-l5 | 24.69 | 22.19 |
 | | |zebra | 11.70 | 12.8
+| AMC23 | Hritik | Qwen/Qwen2.5-7B-Instruct | | 20/40 | 24/40 |
+| AIME24 | Hritik | Qwen/Qwen2.5-7B-Instruct | | 4/30 | 3/30 |
\ No newline at end of file