@AkshitaB This is an attempt at merging my old workflow from the `run_lm_eval.py` script in catwalk with the new steps and functionality in this repo. Feel free to leave it in a branch for now if it's not suitable to be merged. From my viewpoint, the main advantages of running it this way are:
To run this in beaker requires a somewhat clunky gantry command, as well as uploading the config files (either to beaker or NFS) if those are used instead of direct parameters (which are also supported); a rough example of the command is sketched below. But that's easy enough once it's in your workflow. My gantry example uses my OLMoEvalLatest beaker image, which includes the OLMo code and which I try to keep up to date.
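The invocation I have in mind looks roughly like this. The workspace, cluster, image owner, config path, and entrypoint are all placeholders, and the exact flag spellings may differ depending on your gantry version:

```shell
# Illustrative sketch only: workspace/cluster/image names, the entrypoint, and the
# config path are placeholders, not the actual values used in this PR.
gantry run \
  --workspace ai2/my-workspace \
  --cluster ai2/some-cluster \
  --gpus 1 \
  --beaker-image my-user/OLMoEvalLatest \
  --env-secret GITHUB_TOKEN=GITHUB_TOKEN \
  --env-secret GDRIVE_SERVICE_ACCOUNT_JSON=GDRIVE_SERVICE_ACCOUNT_JSON \
  -- python run_lm_eval.py --config-file configs/example_config.jsonnet
```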
Since this is an internal repo, `GITHUB_TOKEN` is needed to run gantry. Also, `GDRIVE_SERVICE_ACCOUNT_JSON` is needed to upload to google sheets. (I tweaked the google sheet upload a bit, behind a `simple_pipeline` argument, to add the beaker ID, remove the tango stuff, and add an `all_metrics` column; see the example in my "OLMo-evals-testing" sheet, and the rough sketch of the upload below.)
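For context, this is roughly the shape of that upload. It's an illustrative sketch only (using gspread, with made-up function and field names), not the actual code in this PR:

```python
import json
import os

import gspread


def upload_simple_pipeline_row(sheet_title: str, beaker_experiment_id: str, metrics: dict) -> None:
    """Append one result row with the beaker ID and a JSON 'all_metrics' column.

    Hypothetical helper: the real simple_pipeline upload in the PR may be organized differently.
    """
    # Authenticate with the service-account JSON passed in via the environment.
    creds = json.loads(os.environ["GDRIVE_SERVICE_ACCOUNT_JSON"])
    client = gspread.service_account_from_dict(creds)
    worksheet = client.open(sheet_title).sheet1
    # One row per run: beaker ID first, then a primary metric, then everything
    # serialized into the extra "all_metrics" column.
    worksheet.append_row([
        beaker_experiment_id,
        metrics.get("primary_metric", ""),
        json.dumps(metrics),
    ])
```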
I'm totally open to suggestions for how to refactor this to accomplish the main advantages above without being quite as hacky. :)