A large portion of this eval harness feature is inherited from bigscience-workshop/Megatron-DeepSpeed#212, but with code/doc changes (e.g., to support the case without pipeline parallelism, and MoE models).
This particular setup uses the normal deepspeed checkpoint and requires no conversion to Megatron-LM.
- Install software
On a login console with external network access, get the lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7`, which is needed to download some tasks.
The package versions below are the ones we tested and confirmed to work (you may need `pip install --upgrade pip` first):
```bash
pip install best-download==0.0.7 lm-eval==0.2.0 datasets==1.15.1 transformers==4.20.1 huggingface-hub==0.8.1
```
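As an optional sanity check (not part of the original steps), you can confirm the pinned versions with pip:
```bash
# Optional: confirm the installed versions match the pins above.
pip show lm-eval datasets transformers huggingface-hub | grep -E '^(Name|Version)'
```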
- Pre-download needed datasets
Create some symlinks to work around lm-harness' issues with the relative location of the data directory:
```bash
mkdir data
cd ../../tasks/eval_harness/
ln -s ../../examples_deepspeed/MoE/data/ data
cd ../../examples_deepspeed/MoE/
```
Then download the datasets for the tasks:
```bash
python ../../tasks/eval_harness/download.py --task_list hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext,logiqa,mathqa,mc_taco,mrpc,prost,pubmedqa,qnli,qqp,sciq,sst,wnli
```
Previously we set `export HF_DATASETS_OFFLINE=1` to use the datasets offline after the manual download above. However, this can now trigger an online-verification error for some of the datasets, so it is recommended to enable offline mode only when necessary.
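If you do need offline mode (e.g., on compute nodes without external network access), a minimal sketch, assuming the datasets above are already cached locally:
```bash
# Enable offline mode only after the datasets have been pre-downloaded;
# unset it again if a task fails with a verification/download error.
export HF_DATASETS_OFFLINE=1
bash ds_evalharness.sh
unset HF_DATASETS_OFFLINE
```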
- Prepare the script
`ds_evalharness.sh` is the example script.
- Edit:
```bash
PP_SIZE=1
TP_SIZE=1
NO_PP="true"
EP_PARALLEL_SIZE=1
NUM_NODE=1
NUM_GPU_PER_NODE=1
```
to match the eval topology.
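For illustration only, an assumed single-node, 2-GPU tensor-parallel setup might look like the following (the parallelism sizes generally need to match how the checkpoint was saved):
```bash
# Hypothetical 2-GPU, tensor-parallel-only eval topology; adjust to your checkpoint.
PP_SIZE=1
TP_SIZE=2
NO_PP="true"
EP_PARALLEL_SIZE=1
NUM_NODE=1
NUM_GPU_PER_NODE=2
```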
Edit:
```bash
CHECKPOINT_PATH=
CONFIG_PATH=
RESULT_PATH=
```
to point to the checkpoint and DeepSpeed config you want to use, and to where the results should be saved.
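For example (illustrative values only; substitute your own checkpoint directory, DeepSpeed config, and output location):
```bash
# Hypothetical paths; adjust to your environment.
CHECKPOINT_PATH=/path/to/checkpoints/gpt-1.3B-moe
CONFIG_PATH=/path/to/ds_config_gpt_1.3B.json
RESULT_PATH=/path/to/eval_results_gpt_1.3B.log
```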
- Adjust the following to fit the chosen GPU. As of the last check, for a 1.3B model the settings are one of:
```bash
EVAL_MICRO_BATCH_SIZE=6  # 16GB GPU, 1.3B model
EVAL_MICRO_BATCH_SIZE=12 # 32GB GPU, 1.3B model
```
If you get OOM, lower it further.
- If not using the DeepSpeed path, disable it by removing:
```bash
--deepspeed \
--deepspeed_config ds_config.json \
```
If you don't disable it and the program crashes during checkpoint loading because it cannot find some key, disable DeepSpeed as explained above.
Note that MoE models, and models without pipeline parallelism, currently might not work without DeepSpeed.
For Mixtral, the LM evaluation harness can be triggered directly from the generic run script. To run the tests, use a pre-trained model checkpoint and load it in the evaluation framework using the same training script, adding `HL_RUN_EVAL_HARNESS=1`, the checkpoint path `HL_CHECKPOINTS_DIR`, and the tag `HL_CHECKPOINT_LOAD_TAG` of the saved checkpoint:
```bash
HL_RUN_EVAL_HARNESS=1 \
HL_CHECKPOINTS_DIR=<dir> \
HL_CHECKPOINT_LOAD_TAG=global_step1000 \
HL_TRUST_REMOTE_CODE=1 \
HL_EVAL_TASKS='wikitext,webqs,winogrande' \
$MEGATRON_DEEPSPEED_ROOT/scripts/run_mixtral.sh
```
Standard model arguments for inference, such as the 3D-parallelism config, batch size, etc., are also required. Specify `HL_EVAL_TASKS` to run the tests on a subset of the tasks. For tasks not included in the HuggingFace database, pass `HL_TRUST_REMOTE_CODE=1`. Some tasks may need a pre-downloaded dataset; the additional preparation steps can be found in the section above.
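As a sketch that combines the pieces above (pre-downloading a task subset, then running the Mixtral script on it), with placeholder paths and an assumed location of `download.py` under `$MEGATRON_DEEPSPEED_ROOT`:
```bash
# Pre-download the datasets once (see the section above).
python $MEGATRON_DEEPSPEED_ROOT/tasks/eval_harness/download.py --task_list hellaswag,piqa

# Then evaluate; HF_DATASETS_OFFLINE is optional and should be dropped
# if a task still needs online verification.
HF_DATASETS_OFFLINE=1 \
HL_RUN_EVAL_HARNESS=1 \
HL_CHECKPOINTS_DIR=/path/to/checkpoints \
HL_CHECKPOINT_LOAD_TAG=global_step1000 \
HL_TRUST_REMOTE_CODE=1 \
HL_EVAL_TASKS='hellaswag,piqa' \
$MEGATRON_DEEPSPEED_ROOT/scripts/run_mixtral.sh
```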