Actions: stanford-crfm/helm

Test

2,835 workflow runs

Fixes for BigCodeBench Evaluator (#3310)
Test #7972: Commit bfaa13a pushed by yifanmai
February 5, 2025 19:24 10m 10s main
Use original instance IDs in IFEval (#3275)
Test #7971: Commit 28a8853 pushed by yifanmai
February 5, 2025 19:22 11m 44s main
Adding IMDB_PTBR Scenario (#3284)
Test #7970: Commit 23167ba pushed by yifanmai
February 5, 2025 19:22 10m 12s main
Add Financial Phrasebank scenario
Test #7969: Pull request #3302 synchronize by yifanmai
February 5, 2025 19:16 11m 10s yifanmai/financial-phrasebank
Add o3-mini model (#3304)
Test #7968: Commit 34d1002 pushed by yifanmai
February 5, 2025 18:55 12m 6s main
Add ECHR Judgment Classification scenario (#3311)
Test #7967: Commit 78ec8cf pushed by yifanmai
February 5, 2025 17:46 10m 41s main
Audio HELM v0 fixes
Test #7966: Pull request #3314 opened by teetone
February 5, 2025 09:45 10m 50s audiohelm
Remove empty files in the Vocal Sound audio scenario
Test #7965: Pull request #3313 opened by ImKeTT
February 5, 2025 06:02 9m 37s ImKeTT:vocal-sound-fix
Add multiple annotators to Omni-MATH and rename shared modules (#3291)
Test #7964: Commit 5fb6ee8 pushed by yifanmai
February 5, 2025 01:34 10m 37s main
Add execution accuracy metric to Bird-SQL (#3312)
Test #7963: Commit 5a30836 pushed by yifanmai
February 5, 2025 01:33 10m 35s main
Add execution accuracy metric to Bird-SQL
Test #7962: Pull request #3312 opened by yifanmai
February 5, 2025 00:47 10m 20s yifanmai/fix-bird-metrics
Add Deepseek-R1 model (#3305)
Test #7961: Commit cd9e516 pushed by yifanmai
February 5, 2025 00:09 9m 57s main
Add ECHR Judgment Classification scenario
Test #7959: Pull request #3311 opened by yifanmai
February 4, 2025 22:59 10m 34s yifanmai/fix-echr-judge
Add QwQ model on Together AI (#3307)
Test #7958: Commit f62ea62 pushed by yifanmai
February 4, 2025 22:17 9m 50s main
Add QwQ model on Together AI
Test #7957: Pull request #3307 synchronize by yifanmai
February 4, 2025 22:16 10m 8s yifanmai/qwq
Fixes for BigCodeBench Evaluator
Test #7956: Pull request #3310 opened by yifanmai
February 4, 2025 22:15 10m 35s yifanmai/bigcodebench-evaluator
Add Spider 1.0 scenario (#3300)
Test #7955: Commit 5a50569 pushed by yifanmai
February 4, 2025 17:17 9m 53s main
Switch aggregation for tables benchmark from win rate to mean (#3309)
Test #7954: Commit 6fb429e pushed by yifanmai
February 4, 2025 17:17 10m 34s main
Switch aggregation for tables benchmark from win rate to mean
Test #7953: Pull request #3309 synchronize by yifanmai
February 4, 2025 01:46 9m 57s yifanmai/mean-tables
Switch aggregation for tables benchmark from win rate to mean
Test #7952: Pull request #3309 opened by yifanmai
February 4, 2025 01:22 10m 29s yifanmai/mean-tables
Add support to redact model outputs (#3301)
Test #7951: Commit 714a97d pushed by MiguelAFH
February 3, 2025 22:40 9m 59s main
Add support to redact model outputs
Test #7950: Pull request #3301 synchronize by MiguelAFH
February 3, 2025 22:27 9m 54s redact-output
Add Mistral Small 3 model (#3308)
Test #7949: Commit 2401e5e pushed by yifanmai
February 3, 2025 17:56 9m 44s main
Add Phi 3.5 models (#3306)
Test #7948: Commit 228e0f1 pushed by yifanmai
February 1, 2025 04:01 9m 39s main