# How to run lm-eval on a Megatron-DeepSpeed checkpoint using the original setup

A great portion of this eval harness feature is inherited from bigscience-workshop/Megatron-DeepSpeed#212, with code/doc changes (e.g., to support MoE models and the case without pipeline parallelism).

This particular setup uses the normal DeepSpeed checkpoint and requires no conversion to the Megatron-LM format.

## Prerequisites

1. Install software

On a login console with external network access, install the lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7`, which is needed to download some tasks. The package versions below are the ones we tested and confirmed to work:

```
pip install --upgrade pip  # may be needed first
pip install best-download==0.0.7 lm-eval==0.2.0 datasets==1.15.1 transformers==4.20.1 huggingface-hub==0.8.1
```
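To confirm the pinned versions were picked up, an optional quick check (not part of the original instructions):

```
pip list | grep -E "best-download|lm-eval|datasets|transformers|huggingface-hub"
```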
2. Pre-download needed datasets

Create a few symlinks to work around lm-eval harness' issues with the relative location of the data:

```
mkdir data
cd ../../tasks/eval_harness/
ln -s ../../examples_deepspeed/MoE/data/ data
cd ../../examples_deepspeed/MoE/
```

Then download the datasets for the tasks:

```
python ../../tasks/eval_harness/download.py --task_list hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext,logiqa,mathqa,mc_taco,mrpc,prost,pubmedqa,qnli,qqp,sciq,sst,wnli
```

Previously we set `export HF_DATASETS_OFFLINE=1` to use the datasets offline after the manual download above. However, this now seems to trigger errors during an online verification step for some of the datasets, so it is recommended to enable offline mode only when necessary.
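If you do want to force offline use after the download, a minimal sketch (assuming all datasets needed by your task list were fully cached by the step above):

```
# Only set this once all required datasets are cached locally;
# some datasets may otherwise fail on their online verification step.
export HF_DATASETS_OFFLINE=1
```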

3. Prepare the script

`ds_evalharness.sh` is the example script.

4. Edit:

```
PP_SIZE=1
TP_SIZE=1
NO_PP="true"
EP_PARALLEL_SIZE=1
NUM_NODE=1
NUM_GPU_PER_NODE=1
```

to match the eval topology.
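For example, a hypothetical single-node, 4-GPU topology for a checkpoint trained with tensor parallelism 2 and pipeline parallelism 2 might look like this (values are illustrative; adjust them to your checkpoint):

```
PP_SIZE=2            # typically must match the pipeline-parallel size of the checkpoint
TP_SIZE=2            # typically must match the tensor-parallel size of the checkpoint
NO_PP="false"        # pipeline parallelism is used
EP_PARALLEL_SIZE=1   # expert parallelism (MoE); 1 for dense models
NUM_NODE=1
NUM_GPU_PER_NODE=4   # PP_SIZE * TP_SIZE GPUs on one node
```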

5. Edit:

```
CHECKPOINT_PATH=
CONFIG_PATH=
RESULT_PATH=
```

to point to the checkpoint and DeepSpeed config you want to use, and to where the results should be saved.
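A hypothetical filled-in example (the paths are placeholders, not the repository's defaults):

```
CHECKPOINT_PATH=./checkpoints/gpt-1.3b-moe   # hypothetical checkpoint directory
CONFIG_PATH=./ds_config_gpt_1.3b.json        # hypothetical DeepSpeed config used for training
RESULT_PATH=./eval_results_1.3b.json         # where the harness results will be written
```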

6. Adjust the following to fit the chosen GPU. As of the last check, for a 1.3B model the settings are one of:

```
EVAL_MICRO_BATCH_SIZE=6  # 16GB GPU, 1.3B model
EVAL_MICRO_BATCH_SIZE=12 # 32GB GPU, 1.3B model
```

If you get an out-of-memory (OOM) error, lower it further.
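For example, on a GPU with less memory you might need something like the following (illustrative value only):

```
EVAL_MICRO_BATCH_SIZE=4  # hypothetical setting for a smaller GPU; keep lowering on OOM
```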

7. If not using the DeepSpeed path, disable it by removing:

```
    --deepspeed \
    --deepspeed_config ds_config.json \
```

If you did not disable it and the program crashes on checkpoint loading because it cannot find some key, disable DeepSpeed as explained above.

Note that MoE models and models without pipeline parallelism currently might not work without DeepSpeed.

## Running lm-eval on Mixtral

For Mixtral, the LM evaluation harness can be triggered directly from the generic run script. To run the tests, take a pre-trained model checkpoint and load it in the evaluation framework using the same training script, adding `HL_RUN_EVAL_HARNESS=1`, the checkpoint path `HL_CHECKPOINTS_DIR`, and the tag `HL_CHECKPOINT_LOAD_TAG` of the saved checkpoint:

```
HL_RUN_EVAL_HARNESS=1 \
HL_CHECKPOINTS_DIR=<dir> \
HL_CHECKPOINT_LOAD_TAG=global_step1000 \
HL_TRUST_REMOTE_CODE=1 \
HL_EVAL_TASKS='wikitext,webqs,winogrande' \
$MEGATRON_DEEPSPEED_ROOT/scripts/run_mixtral.sh
```

Standard model arguments for inference, such as the 3D parallelism configuration, batch size, etc., are also required. Specify `HL_EVAL_TASKS` to run the tests on a subset of the tasks. For tasks not included in the HuggingFace database, pass `HL_TRUST_REMOTE_CODE=1`. For some tasks a pre-downloaded dataset may be needed; the additional preparation steps can be found in the section above.
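For instance, to pre-download the datasets for the task subset used above before running on nodes without external network access, something like the following should work (the path to `download.py` relative to `$MEGATRON_DEEPSPEED_ROOT` is an assumption based on the layout described earlier):

```
# Hypothetical: fetch datasets for the tasks passed via HL_EVAL_TASKS
# before launching the evaluation run.
python $MEGATRON_DEEPSPEED_ROOT/tasks/eval_harness/download.py --task_list wikitext,webqs,winogrande
```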