
update repro benchmarks and readme
Hritikbansal committed Jan 17, 2025
1 parent 8e39718 commit b22691e
Showing 2 changed files with 4 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -51,6 +51,8 @@ huggingface-cli login
- **HumanEval**: [Code generation and problem solving](https://github.com/openai/human-eval)
- **ZeroEval**: [Logical reasoning and problem solving](https://github.com/WildEval/ZeroEval)
- **MBPP**: [Python programming benchmark](https://github.com/google-research/google-research/tree/master/mbpp)
- **AIME24**: [Math reasoning dataset](https://huggingface.co/datasets/AI-MO/aimo-validation-aime)
- **AMC23**: [Math reasoning dataset](https://huggingface.co/datasets/AI-MO/aimo-validation-amc)
- **Arena-Hard-Auto** (Coming soon): [Automatic evaluation tool for instruction-tuned LLMs](https://github.com/lmarena/arena-hard-auto)
- **SWE-Bench** (Coming soon): [Evaluating large language models on real-world software issues](https://github.com/princeton-nlp/SWE-bench)
- **SafetyBench** (Coming soon): [Evaluating the safety of LLMs](https://github.com/thu-coai/SafetyBench)
2 changes: 2 additions & 0 deletions reproduced_benchmarks.md
@@ -51,3 +51,5 @@
| ZeroEval | Negin | meta-llama/Llama-3.1-8B-Instruct | crux | 40.75 | 39.88 |
| | | | math-l5 | 24.69 | 22.19 |
| | | | zebra | 11.70 | 12.8 |
| AMC23 | Hritik | Qwen/Qwen2.5-7B-Instruct | | 20/40 | 24/40 |
| AIME24 | Hritik | Qwen/Qwen2.5-7B-Instruct | | 4/30 | 3/30 |
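For easier comparison with the percentage scores in the other rows, the solved/total fractions for AMC23 and AIME24 can be converted to accuracies. A minimal sketch (the `accuracy` helper is illustrative and not part of the repo; score pairs are copied from the table above):

```python
# Convert "solved/total" benchmark scores to accuracy percentages.
# Score pairs are (reported, reproduced) from the table above.

def accuracy(solved: int, total: int) -> float:
    """Return percent accuracy, rounded to one decimal place."""
    return round(100 * solved / total, 1)

scores = {
    "AMC23": [(20, 40), (24, 40)],
    "AIME24": [(4, 30), (3, 30)],
}

for bench, (reported, reproduced) in scores.items():
    print(f"{bench}: reported {accuracy(*reported)}%, "
          f"reproduced {accuracy(*reproduced)}%")
# AMC23: reported 50.0%, reproduced 60.0%
# AIME24: reported 13.3%, reproduced 10.0%
```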
