LLM Benchmark Reproduction Results

Benchmark	Tester	Model	Metric/Score	Our Results	Reported Results	Reported Results Source
Alpaca Eval	Etash	Meta LLama 3 8b Instruct	win_rate	24.04	22.57
			standard_error	1.27	1.26
			avg_length	1937	1899
			length_controlled_winrate	24.24	22.92
HumanEval	Etash	DeepSeek Coder 6.7B Instruct	Python	0.798	78.9%
			Bash	0.354	36.7%
MBPP	Etash	DeepSeek Coder 6.7B Instruct	pass@1	0.64	65.4%
RepoBench	Negin	StarCoder	EM (cross_file_first)	28.0	28.0
			ES (cross_file_first)	67.37	69.6
			EM (cross_file_random)	37.08	37.3
			ES (cross_file_random)	71.28	73.69
			EM (in_file)	34.17	33.8
			ES (in_file)	70.46	72.37
			EM (weighted avg)	31.69	31.69
			ES (weighted avg)	69.09	71.2
		Codegen-350M-mono	EM (cross_file_first)	15.27	15.14
			ES (cross_file_first)	58.03	60.1
			EM (cross_file_random)	27.7	27.7
			ES (cross_file_random)	67.33	68.9
			EM (in_file)	25.11	25.2
			ES (in_file)	66.28	67.8
			EM (weighted avg)	22.12	20.7
			ES (weighted avg)	62.9	64.2
		Codegen-2B-mono	EM (cross_file_first)	22.12	22.1
			ES (cross_file_first)	62.9	64.9
			EM (cross_file_random)	34.18	34.4
			ES (cross_file_random)	71.12	72.6
			EM (in_file)	31.14	31.2
			ES (in_file)	69.61	70.93
			EM (weighted avg)	27.26	27.3
			ES (weighted avg)	66.57	68.3
		Codegen-6B-mono	EM (cross_file_first)	26.81	26.9
			ES (cross_file_first)	66.23	67.9
			EM (cross_file_random)	38.28	38.3
			ES (cross_file_random)	73.34	74.5
			EM (in_file)	34.9	34.96
			ES (in_file)	71.21	72.59
			EM (weighted avg)	31.56	31.67
			ES (weighted avg)	69.16	70.68
MTBench	Etash	stabilityai/stablelm-tuned-alpha-7b		1.0	2.75
		Nexusflow/Starling-LM-7B-beta		7.69	8.12
		mistralai/Mistral-7B-Instruct-v0.1		6.65	6.84
		databricks/dolly-v2-12b		1.087	3.28
WildBench	Etash	princeton-nlp/gemma-2-9b-it-SimPO		5.083	5.33
		meta-llama/Meta-Llama-3-8B-Instruct		3.27	2.92
IFEval	Negin	meta-llama/Llama-3.1-8B-Instruct		79.1	80.4
ZeroEval	Negin	meta-llama/Llama-3.1-8B-Instruct	crux	40.75	39.88
			math-l5	24.69	22.19
			zebra	11.70	12.8
MixEval	Negin	Meta-Llama-3-8B-Instruct	overall	73.0	75.0
			TriviaQA	67.5	71.7
			MMLU	71.3	71.9
			HellaSwag	66.3	65.7
AMC23	Ryan	Qwen/Qwen2.5-Math-1.5B-Instruct	accuracy	24/40	24/40	Qwen2.5-Math Paper Table 5
AIME24	Ryan	Qwen/Qwen2.5-Math-1.5B-Instruct	accuracy	3/30	3/30	Qwen2.5-Math Paper Table 5
		Qwen/Qwen2.5-32B-Instruct	accuracy	6/30	5/30	Sky-T1 Blog Post
		Qwen/QwQ-32B-Preview	accuracy	13/30	15/30	Sky-T1 Blog Post
		NovaSky-AI/Sky-T1-32B-Preview	accuracy	13/30	13/30	Sky-T1 Blog Post
MATH500	Ryan	Qwen/Qwen2.5-32B-Instruct	accuracy	78.6	76.2	Sky-T1 Blog Post
		NovaSky-AI/Sky-T1-32B-Preview	accuracy	84.0	82.4	Sky-T1 Blog Post
		Qwen/QwQ-32B-Preview	accuracy	83.6	85.4	Sky-T1 Blog Post
BigCodeBench	Marianna	Qwen/Qwen2.5-14B-Instruct	instruct (pass@1)	41.5	39.8
			complete (pass@1)	52.6	52.2
		meta-llama/Meta-Llama-3.1-8B-Instruct	instruct (pass@1)	30.7	32.8
			complete (pass@1)	41.9	40.5
		Qwen/Qwen2.5-7B-Instruct	instruct (pass@1)	35.2	37.6
			complete (pass@1)	46.7	46.1
LiveBench	Negin	Meta-Llama-3-8B-Instruct	global_average	26.78	26.74	https://livebench.ai/#/?q=3.1
HumanEvalPlus	Sedrick	mistralai/Mistral-7B-Instruct-v0.2	accuracy (pass@1)	27.44	36.0	EvalPlus Leaderboard
		meta-llama/Llama-3.1-8B-Instruct	accuracy (pass@1)	62.2	62.8	EvalPlus Leaderboard
		google/codegemma-7b-it	accuracy (pass@1)	36.6	51.8	EvalPlus Leaderboard
MBPPPlus	Sedrick	mistralai/Mistral-7B-Instruct-v0.2	accuracy (pass@1)	43.9	37.0	EvalPlus Leaderboard
		meta-llama/Llama-3.1-8B-Instruct	accuracy (pass@1)	58.7	55.6	EvalPlus Leaderboard
		google/codegemma-7b-it	accuracy (pass@1)	56.6	56.9	EvalPlus Leaderboard
LiveCodeBench	Negin	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	(pass@1)	37.9	37.6	DeepSeek-R1 Paper Table 5
GPQADiamond	Negin	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	accuracy	52	49.1	DeepSeek-R1 Paper Table 5
MultiPL-E.	Marianna	Qwen/CodeQwen1.5-7B-Chat	java (pass@1)	64.0	61.04	Big Code Leaderboard
			cpp (pass@1)	67.85	61.0	Big Code Leaderboard
			cs (pass@1)	52.6	52.2	Big Code Leaderboard
			php (pass@1)	64.7	69.22	Big Code Leaderboard
			sh (pass@1)	37.0	39.2	Qwen2.5-Coder Technical Report (Table 17)
			ts (pass@1)	73.0	71.7	Qwen2.5-Coder Technical Report (Table 17)
			js (pass@1)	69.5	75.2	Qwen2.5-Coder Technical Report (Table 17)
CRUXEval	Marianna	Qwen/Qwen2.5-Coder-7B-Instruct	Input-CoT (pass@1)	66.2	65.8	Qwen2.5-Coder Technical Report (Table 18)
			Output-CoT (pass@1)	66.7	65.9
		Qwen/Qwen2.5-Coder-3B-Instruct	Input-CoT (pass@1)	53.4	53.2
			Output-CoT (pass@1)	53.3	56.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reproduced_benchmarks.md

reproduced_benchmarks.md

LLM Benchmark Reproduction Results

Files

reproduced_benchmarks.md

Latest commit

History

reproduced_benchmarks.md

File metadata and controls

LLM Benchmark Reproduction Results