Better no_weights for pytorch and TGI backends (#91)
IlyasMoutawwakil authored Nov 30, 2023
1 parent f75d9f9 commit 30b43fc
Showing 11 changed files with 375 additions and 308 deletions.
84 changes: 38 additions & 46 deletions README.md
@@ -1,22 +1,22 @@
# Optimum-Benchmark

Optimum-Benchmark is a unified multi-backend utility for benchmarking `transformers`, `diffusers`, `peft` and `timm` models with [Optimum](https://github.com/huggingface/optimum)'s optimizations & quantization, for [inference](https://github.com/huggingface/optimum#accelerated-inference) & [training](https://github.com/huggingface/optimum#accelerated-training), on different backends & hardwares (OnnxRuntime, Intel Neural Compressor, OpenVINO, Habana Gaudi Processor (HPU), etc).
Optimum-Benchmark is a unified multi-backend utility for benchmarking `transformers`, `diffusers`, `peft` and `timm` models with [Optimum](https://github.com/huggingface/optimum)'s optimizations & quantization schemes, for [inference](https://github.com/huggingface/optimum#accelerated-inference) & [training](https://github.com/huggingface/optimum#accelerated-training), on different backends & hardwares (OnnxRuntime, Intel Neural Compressor, OpenVINO, Habana Gaudi Processor (HPU), AMD Instinct GPUs, etc).

The experiment management and tracking is handled using [hydra](https://hydra.cc/) which allows for simple cli with minimum configuration changes and maximum flexibility (inspired by [tune](https://github.com/huggingface/tune)).

## Motivation

- Many hardware vendors would want to know how their hardware performs compared to others on the same models.
- Many HF users would want to know how their chosen model performs in terms of latency, throughput, memory usage, energy consumption, etc.
- Optimum offers a lot of hardware and backend specific optimizations & quantization schemas that can be applied to models and improve their performance.
- Benchmarks depend heavily on many factors, like input/hardware/releases/etc, but most don't report these factors (e.g. comparing an A100 to an RTX 3090 on a singleton batch).
- Hardware vendors wanting to know how their hardware performs compared to others on the same models.
- HF users wanting to know how their chosen model performs in terms of latency, throughput, memory usage, energy consumption, etc.
- Experimenting with Optimum and Transformers' arsenal of optimizations & quantization schemes that can be applied to models and improve their computational/memory/energy efficiency.
- [...]

## Features

`optimum-benchmark` allows you to run benchmarks with no code and minimal user input, just specify:

- The device to use (e.g. `cuda`).
- The type of device (e.g. `cuda`).
- The launcher to use (e.g. `process`).
- The type of benchmark (e.g. `training`)
- The backend to run on (e.g. `onnxruntime`).
- The model name or path (e.g. `bert-base-uncased`)
@@ -26,39 +26,44 @@ Everything else is either optional or inferred from the model's name or path.
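
For instance, putting these pieces together could look like the following (a minimal sketch, using the hydra override syntax demonstrated later in this README; group overrides such as `backend=onnxruntime` assume a matching config exists in that group):

```bash
# Hypothetical end-to-end invocation: start from the pytorch_bert example config
# and pick the device, launcher, benchmark type, backend and model on the command line.
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
    device=cpu launcher=process benchmark=inference backend=onnxruntime model=bert-base-uncased
```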

### Supported Backends/Devices

- [x] Pytorch backend for CPU
- [x] Pytorch backend for CUDA
- [x] Pytorch backend for CPU (Intel, AMD, ARM, etc)
- [x] Pytorch backend for CUDA (NVIDIA and AMD GPUs)
- [ ] Pytorch backend for Habana Gaudi Processor (HPU)
- [x] OnnxRuntime backend for CPUExecutionProvider
- [x] OnnxRuntime backend for CUDAExecutionProvider
- [ ] OnnxRuntime backend for ROCMExecutionProvider
- [x] OnnxRuntime backend for TensorrtExecutionProvider
- [x] Intel Neural Compressor backend for CPU
- [x] OpenVINO backend for CPU

### Benchmark features

- [x] Latency and throughput tracking (default).
- [x] Peak memory tracking (`benchmark.memory=true`).
- [x] Energy and carbon emissions (`benchmark.energy=true`).
- [x] Warm up runs before inference (`benchmark.warmup_runs=20`).
- [x] Warm up steps during training (`benchmark.warmup_steps=20`).
- [x] Input shapes control (e.g. `benchmark.input_shapes.sequence_length=128`).
- [x] Dataset shapes control (e.g. `benchmark.dataset_shapes.dataset_size=1000`).
- [x] Forward and Generation pass control (e.g. for an LLM `benchmark.generate.max_new_tokens=100`, for a diffusion model `benchmark.forward.num_images_per_prompt=4`).
- [x] Memory tracking (`benchmark.memory=true`)
- [x] Latency and throughput tracking of forward pass (default)
- [x] Warm up runs before inference (`benchmark.warmup_runs=20`)
- [x] Warm up steps during training (`benchmark.warmup_steps=20`)
- [x] Energy and carbon emissions tracking (`benchmark.energy=true`)
- [x] Input shapes control (e.g. `benchmark.input_shapes.sequence_length=128`)
- [x] Dataset shapes control (e.g. `benchmark.dataset_shapes.dataset_size=1000`)
- [x] Latency and throughput tracking of generation pass (auto-enabled for generative models)
- [x] Prefill latency and Decoding throughput deduced from generation and forward pass (auto-enabled for generative models)
- [x] Forward and Generation pass control (e.g. for an LLM `benchmark.generate_kwargs.max_new_tokens=100`, for a diffusion model `benchmark.forward_kwargs.num_images_per_prompt=4`)
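
As a concrete illustration, several of the options above can be combined in a single run (a sketch; the `benchmark.generate_kwargs.*` options only apply to generative models, hence the `model=gpt2` override):

```bash
# Hypothetical run: track memory and energy, fix the input shapes,
# and cap the number of generated tokens.
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 \
    benchmark.memory=true benchmark.energy=true \
    benchmark.input_shapes.batch_size=1 benchmark.input_shapes.sequence_length=128 \
    benchmark.generate_kwargs.max_new_tokens=100
```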

### Backend features

- [x] Random weights initialization (`backend.no_weights=true` for fast model instantiation without downloading weights).
- [x] Onnxruntime Quantization and AutoQuantization (`backend.quantization=true` or `backend.auto_quantization=avx2`, etc).
- [x] Onnxruntime Calibration for Static Quantization (`backend.quantization_config.is_static=true`, etc).
- [x] Onnxruntime Optimization and AutoOptimization (`backend.optimization=true` or `backend.auto_optimization=O4`, etc).
- [x] PEFT training (`backend.peft_strategy=lora`, `backend.peft_config.task_type=CAUSAL_LM`, etc).
- [x] DDP training (`backend.use_ddp=true`, `backend.ddp_config.nproc_per_node=2`, etc).
- [x] BitsAndBytes quantization scheme (`backend.quantization_scheme=bnb`, `backend.quantization_config.load_in_4bit`, etc).
- [x] GPTQ quantization scheme (`backend.quantization_scheme=gptq`, `backend.quantization_config.bits=4`, etc).
- [x] Optimum's BetterTransformer (`backend.bettertransformer=true`).
- [x] Automatic Mixed Precision (`backend.amp_autocast=true`).
- [x] Dynamo/Inductor compiling (`backend.torch_compile=true`).
- [x] Random weights initialization (`backend.no_weights=true` for fast model instantiation without downloading weights)
- [x] Onnxruntime Quantization and AutoQuantization (`backend.quantization=true` or `backend.auto_quantization=avx2`, etc)
- [x] Onnxruntime Calibration for Static Quantization (`backend.quantization_config.is_static=true`, etc)
- [x] Onnxruntime Optimization and AutoOptimization (`backend.optimization=true` or `backend.auto_optimization=O4`, etc)
- [x] BitsAndBytes quantization scheme (`backend.quantization_scheme=bnb`, `backend.quantization_config.load_in_4bit`, etc)
- [x] GPTQ quantization scheme (`backend.quantization_scheme=gptq`, `backend.quantization_config.bits=4`, etc)
- [x] PEFT training (`backend.peft_strategy=lora`, `backend.peft_config.task_type=CAUSAL_LM`, etc)
- [x] Distributed inference/training (`launcher=torchrun`, `launcher.n_proc_per_node=2`, etc)
- [x] Transformers' Flash Attention V2 (`backend.use_flash_attention_v2=true`)
- [x] Optimum's BetterTransformer (`backend.to_bettertransformer=true`)
- [x] DeepSpeed-Inference support (`backend.deepspeed_inference=true`)
- [x] Dynamo/Inductor compiling (`backend.torch_compile=true`)
- [x] Automatic Mixed Precision (`backend.amp_autocast=true`)
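
For example, a few of these backend options can be stacked on top of an example config (a sketch; whether a given combination is supported depends on the backend, the model and the hardware):

```bash
# Hypothetical run: instantiate the model with random weights (no download)
# and quantize it with bitsandbytes in 4-bit.
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 \
    backend.no_weights=true \
    backend.quantization_scheme=bnb backend.quantization_config.load_in_4bit=true
```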

## Quickstart

@@ -73,9 +78,7 @@ python -m pip install git+https://github.com/huggingface/optimum-benchmark.git
or by cloning the repository and installing it in editable mode:

```bash
git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark

python -m pip install -e .
git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark && python -m pip install -e .
```

Depending on the backends you want to use, you might need to install some extra dependencies:
@@ -86,7 +89,7 @@ Depending on the backends you want to use, you might need to install some extra dependencies:
- Intel Neural Compressor: `pip install optimum-benchmark[neural-compressor]`
- Text Generation Inference: `pip install optimum-benchmark[text-generation-inference]`
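
If you need several backends in the same environment, the extras can be combined in a single install, as usual with pip (a sketch using only the extras listed above):

```bash
python -m pip install "optimum-benchmark[neural-compressor,text-generation-inference]"
```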

You can now run a benchmark using the command line by specifying the configuration directory and the configuration name. Both arguments are mandatory for `hydra`. `config-dir` is the directory where the configuration files are stored and `config-name` is the name of the configuration file without its `.yaml` extension.
You can now run a benchmark using the command line by specifying the configuration directory and the configuration name. Both arguments are mandatory for `hydra`. `--config-dir` is the directory where the configuration files are stored and `--config-name` is the name of the configuration file without its `.yaml` extension.

```bash
optimum-benchmark --config-dir examples/ --config-name pytorch_bert
@@ -103,12 +106,13 @@ The directory for storing these results can be changed by setting `hydra.run.dir`.
It's easy to override the default behavior of a benchmark from the command line.

```bash
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 device=cuda:1
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 device=cuda
```

## Multirun configuration sweeps

You can easily run configuration sweeps using the `-m` or `--multirun` option. By default, configurations will be executed serially but other kinds of executions are supported with hydra's launcher plugins : `hydra/launcher=submitit`, `hydra/launcher=rays`, `hydra/launcher=joblib`, etc.
You can easily run configuration sweeps using the `-m` or `--multirun` option. By default, configurations will be executed serially, but other kinds of execution are supported with hydra's launcher plugins: `hydra/launcher=submitit`, `hydra/launcher=rays`, etc.
Note that the hydra launcher (`hydra/launcher`) is different from our own `launcher`: `hydra/launcher` can only be used in `--multirun` mode and only handles the inter-run behavior (see the sketch after the examples below for how the two combine).

```bash
optimum-benchmark --config-dir examples --config-name pytorch_bert -m device=cpu,cuda
```

@@ -120,18 +124,6 @@ Also, for integer parameters like `batch_size`, one can specify a range of values:

```bash
optimum-benchmark --config-dir examples --config-name pytorch_bert -m device=cpu,cuda benchmark.input_shapes.batch_size='range(1,10,step=2)'
```
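
The `hydra/launcher` and `launcher` settings can also be combined in a single sweep. Below is a minimal sketch; it assumes the corresponding hydra plugin (here, hydra-joblib-launcher) is installed, and uses the `process` launcher mentioned earlier:

```bash
# Hypothetical sweep: hydra/launcher=joblib schedules the sweep's runs,
# while launcher=process controls how each individual benchmark is executed.
optimum-benchmark --config-dir examples --config-name pytorch_bert -m \
    hydra/launcher=joblib launcher=process device=cpu,cuda
```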

## Reporting benchmark results (WIP)

To aggregate the results of a benchmark (run(s) or sweep(s)), you can use the `optimum-report` command.

```bash
optimum-report --experiments {experiments_folder_1} {experiments_folder_2} --baseline {baseline_folder} --report-name {report_name}
```

This will create a report in the `reports` folder with the name `{report_name}`. The report will contain the results of the experiments in `{experiments_folder_1}` and `{experiments_folder_2}` compared to the results of the baseline in `{baseline_folder}` in the form of a `.csv` file, an `.svg` rich table and (a) `.png` plot(s).

You can also reuse some components of the reporting script for your use case (examples in [`examples/training-llamas`] and [`examples/running-llamas`]).

## Configurations structure

You can create custom configuration files following the [examples here](examples).
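
As a minimal sketch of what such a file looks like (modeled on the `examples/pytorch_bert.yaml` shown further down in this diff; the file name `examples/my_experiment.yaml` and the use of a shell heredoc are purely illustrative):

```bash
# Hypothetical custom experiment config, written from the shell for illustration.
cat > examples/my_experiment.yaml <<'EOF'
defaults:
  - backend: pytorch # default backend
  - launcher: inline # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: my_experiment
model: gpt2
device: cuda

hydra:
  run:
    dir: runs/${experiment_name}
  sweep:
    dir: sweeps/${experiment_name}
  job:
    chdir: true
    env_set:
      OVERRIDE_BENCHMARKS: 1
EOF

optimum-benchmark --config-dir examples/ --config-name my_experiment
```
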
@@ -158,6 +150,6 @@ Contributions are welcome! And we're happy to help you get started. Feel free to
Things that we'd like to see:

- More backends (Tensorflow, TFLite, Jax, etc).
- More hardware support (Habana Gaudi Processor (HPU), etc).
- More tests (right now we only have few tests per backend).
- Task evaluators for the most common tasks (would be great for output regression).
- More hardware support (Habana Gaudi Processor (HPU), RadeonOpenCompute (ROCm), etc).
@@ -1,20 +1,13 @@
defaults:
- backend: openvino # default backend
- launcher: inline # default launcher
- benchmark: inference # default benchmark
- experiment # inheriting experiment schema
- _self_ # for hydra 1.1 compatibility
- override hydra/job_logging: colorlog # colorful logging
- override hydra/hydra_logging: colorlog # colorful logging

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true

experiment_name: openvino_sdxl
experiment_name: openvino_diffusion
model: stabilityai/stable-diffusion-2-1
device: cpu

@@ -26,3 +19,13 @@ backend:
benchmark:
input_shapes:
batch_size: 1

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true
env_set:
OVERRIDE_BENCHMARKS: 1
10 changes: 6 additions & 4 deletions examples/pytorch_bert.yaml
@@ -1,11 +1,16 @@
defaults:
- backend: pytorch # default backend
- launcher: inline # default launcher
- benchmark: inference # default benchmark
- experiment # inheriting experiment schema
- _self_ # for hydra 1.1 compatibility
- override hydra/job_logging: colorlog # colorful logging
- override hydra/hydra_logging: colorlog # colorful logging

experiment_name: pytorch_bert
model: bert-base-uncased
device: cuda

hydra:
run:
dir: runs/${experiment_name}
@@ -14,9 +19,6 @@ hydra:
job:
chdir: true
env_set:
OVERRIDE_BENCHMARKS: 1
CUDA_VISIBLE_DEVICES: 0
CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: pytorch_bert
model: bert-base-uncased
device: cuda
@@ -1,35 +1,36 @@
defaults:
- backend: text-generation-inference # default backend
- benchmark: inference # default benchmark
- launcher: inline # default launcher
- experiment # inheriting experiment schema
- _self_ # for hydra 1.1 compatibility
- override hydra/job_logging: colorlog # colorful logging
- override hydra/hydra_logging: colorlog # colorful logging

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true
env_set:
CUDA_VISIBLE_DEVICES: 0,1
CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: text_generation_inference
experiment_name: tgi_llama
model: NousResearch/Llama-2-7b-hf
device: cuda

backend:
no_weights: true
initial_isolation_check: false
continous_isolation_check: false
torch_dtype: float16
continuous_isolation: false
# no_weights: true # work in progress

benchmark:
input_shapes:
batch_size: 32
sequence_length: 128

new_tokens: 1000

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true
env_set:
OVERRIDE_BENCHMARKS: 1
CUDA_VISIBLE_DEVICES: 0,1
CUDA_DEVICE_ORDER: PCI_BUS_ID
35 changes: 24 additions & 11 deletions optimum_benchmark/backends/base.py
@@ -18,17 +18,18 @@

import numpy as np
from optimum.exporters import TasksManager
from transformers import AutoConfig, AutoProcessor
from transformers import (
AutoConfig,
AutoProcessor,
GenerationConfig,
Pipeline,
PretrainedConfig,
PreTrainedModel,
TrainerState,
)
from transformers.utils import ModelOutput

if TYPE_CHECKING:
from transformers import (
Pipeline,
PretrainedConfig,
PreTrainedModel,
TrainerState,
)
from transformers.utils import ModelOutput

from .utils import PreTrainedProcessor

from ..task_utils import DIFFUSION_TASKS, TEXT_GENERATION_TASKS
@@ -50,8 +51,9 @@ class Backend(Generic[BackendConfigT], ABC):
config: BackendConfigT
isolation_thread: Optional[Process]
pretrained_model: Union["PreTrainedModel", "Pipeline"]
pretrained_processor: Optional["PreTrainedProcessor"]
pretrained_config: Optional["PretrainedConfig"]
pretrained_processor: Optional["PreTrainedProcessor"]
pretrained_generation_config: Optional["GenerationConfig"]
automodel_class: Callable[..., "PreTrainedModel"]

def __init__(self, model: str, task: str, device: str, hub_kwargs: Dict[str, Any]):
@@ -74,7 +76,7 @@ def __init__(self, model: str, task: str, device: str, hub_kwargs: Dict[str, Any]):

try:
# sometimes contains information about the model's
# input shapes that're not available in the config
# input shapes that are not available in the config
self.pretrained_processor = AutoProcessor.from_pretrained(
pretrained_model_name_or_path=self.model, **self.hub_kwargs
)
@@ -83,6 +85,17 @@ def __init__(self, model: str, task: str, device: str, hub_kwargs: Dict[str, Any]):
LOGGER.warning("Could not find the model's preprocessor")
self.pretrained_processor = None

if self.is_text_generation_model():
try:
self.pretrained_generation_config = GenerationConfig.from_pretrained(
pretrained_model_name=self.model, **self.hub_kwargs
)
except Exception:
LOGGER.warning("Could not find the model's generation config")
self.pretrained_generation_config = None
else:
self.pretrained_generation_config = None

self.automodel_class = TasksManager.get_model_class_for_task(
framework="pt", # TODO: make this configurable to add support for other frameworks
task=self.task,