Better no_weights for pytorch and TGI backends (#91)
IlyasMoutawwakil authored Nov 30, 2023
1 parent f75d9f9 commit 30b43fc
Showing 11 changed files with 375 additions and 308 deletions.
84 changes: 38 additions & 46 deletions README.md
@@ -1,22 +1,22 @@
# Optimum-Benchmark

Optimum-Benchmark is a unified multi-backend utility for benchmarking `transformers`, `diffusers`, `peft` and `timm` models with [Optimum](https://github.com/huggingface/optimum)'s optimizations & quantization, for [inference](https://github.com/huggingface/optimum#accelerated-inference) & [training](https://github.com/huggingface/optimum#accelerated-training), on different backends & hardwares (OnnxRuntime, Intel Neural Compressor, OpenVINO, Habana Gaudi Processor (HPU), etc).
Optimum-Benchmark is a unified multi-backend utility for benchmarking `transformers`, `diffusers`, `peft` and `timm` models with [Optimum](https://github.com/huggingface/optimum)'s optimizations & quantization schemes, for [inference](https://github.com/huggingface/optimum#accelerated-inference) & [training](https://github.com/huggingface/optimum#accelerated-training), on different backends & hardwares (OnnxRuntime, Intel Neural Compressor, OpenVINO, Habana Gaudi Processor (HPU), AMD Instinct GPUs, etc).

The experiment management and tracking is handled using [hydra](https://hydra.cc/) which allows for simple cli with minimum configuration changes and maximum flexibility (inspired by [tune](https://github.com/huggingface/tune)).

## Motivation

- Many hardware vendors would want to know how their hardware performs compared to others on the same models.
- Many HF users would want to know how their chosen model performs in terms of latency, throughput, memory usage, energy consumption, etc.
- Optimum offers a lot of hardware and backend specific optimizations & quantization schemas that can be applied to models and improve their performance.
- Benchmarks depend heavily on many factors, like input/hardware/releases/etc, but most don't report these factors (e.g. comparing an A100 to an RTX 3090 on a singleton batch).
- Hardware vendors wanting to know how their hardware performs compared to others on the same models.
- HF users wanting to know how their chosen model performs in terms of latency, throughput, memory usage, energy consumption, etc.
- Experimenting with Optimum and Transformers' arsenal of optimizations & quantization schemes that can be applied to models and improve their computational/memory/energy efficiency.
- [...]

## Features

`optimum-benchmark` allows you to run benchmarks with no code and minimal user input, just specify:

- The device to use (e.g. `cuda`).
- The type of device (e.g. `cuda`).
- The launcher to use (e.g. `process`).
- The type of benchmark (e.g. `training`)
- The backend to run on (e.g. `onnxruntime`).
- The model name or path (e.g. `bert-base-uncased`)
@@ -26,39 +26,44 @@ Everything else is either optional or inferred from the model's name or path.
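
For instance, putting these pieces together could look like the following (a minimal sketch, using the hydra override syntax demonstrated later in this README; group overrides such as `backend=onnxruntime` assume a matching config exists in that group):

```bash
# Hypothetical end-to-end invocation: start from the pytorch_bert example config
# and pick the device, launcher, benchmark type, backend and model on the command line.
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
    device=cpu launcher=process benchmark=inference backend=onnxruntime model=bert-base-uncased
```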

### Supported Backends/Devices

- [x] Pytorch backend for CPU
- [x] Pytorch backend for CUDA
- [x] Pytorch backend for CPU (Intel, AMD, ARM, etc)
- [x] Pytorch backend for CUDA (NVIDIA and AMD GPUs)
- [ ] Pytorch backend for Habana Gaudi Processor (HPU)
- [x] OnnxRuntime backend for CPUExecutionProvider
- [x] OnnxRuntime backend for CUDAExecutionProvider
- [ ] OnnxRuntime backend for ROCMExecutionProvider
- [x] OnnxRuntime backend for TensorrtExecutionProvider
- [x] Intel Neural Compressor backend for CPU
- [x] OpenVINO backend for CPU

### Benchmark features

- [x] Latency and throughput tracking (default).
- [x] Peak memory tracking (`benchmark.memory=true`).
- [x] Energy and carbon emissions (`benchmark.energy=true`).
- [x] Warm up runs before inference (`benchmark.warmup_runs=20`).
- [x] Warm up steps during training (`benchmark.warmup_steps=20`).
- [x] Input shapes control (e.g. `benchmark.input_shapes.sequence_length=128`).
- [x] Dataset shapes control (e.g. `benchmark.dataset_shapes.dataset_size=1000`).
- [x] Forward and Generation pass control (e.g. for an LLM `benchmark.generate.max_new_tokens=100`, for a diffusion model `benchmark.forward.num_images_per_prompt=4`).
- [x] Memory tracking (`benchmark.memory=true`)
- [x] Latency and throughput tracking of forward pass (default)
- [x] Warm up runs before inference (`benchmark.warmup_runs=20`)
- [x] Warm up steps during training (`benchmark.warmup_steps=20`)
- [x] Energy and carbon emissions tracking (`benchmark.energy=true`)
- [x] Input shapes control (e.g. `benchmark.input_shapes.sequence_length=128`)
- [x] Dataset shapes control (e.g. `benchmark.dataset_shapes.dataset_size=1000`)
- [x] Latency and throughput tracking of generation pass (auto-enabled for generative models)
- [x] Prefill latency and Decoding throughput deduced from generation and forward pass (auto-enabled for generative models)
- [x] Forward and Generation pass control (e.g. for an LLM `benchmark.generate_kwargs.max_new_tokens=100`, for a diffusion model `benchmark.forward_kwargs.num_images_per_prompt=4`)
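
As a concrete illustration, several of the options above can be combined in a single run (a sketch; the `benchmark.generate_kwargs.*` options only apply to generative models, hence the `model=gpt2` override):

```bash
# Hypothetical run: track memory and energy, fix the input shapes,
# and cap the number of generated tokens.
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 \
    benchmark.memory=true benchmark.energy=true \
    benchmark.input_shapes.batch_size=1 benchmark.input_shapes.sequence_length=128 \
    benchmark.generate_kwargs.max_new_tokens=100
```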

### Backend features

- [x] Random weights initialization (`backend.no_weights=true` for fast model instantiation without downloading weights).
- [x] Onnxruntime Quantization and AutoQuantization (`backend.quantization=true` or `backend.auto_quantization=avx2`, etc).
- [x] Onnxruntime Calibration for Static Quantization (`backend.quantization_config.is_static=true`, etc).
- [x] Onnxruntime Optimization and AutoOptimization (`backend.optimization=true` or `backend.auto_optimization=O4`, etc).
- [x] PEFT training (`backend.peft_strategy=lora`, `backend.peft_config.task_type=CAUSAL_LM`, etc).
- [x] DDP training (`backend.use_ddp=true`, `backend.ddp_config.nproc_per_node=2`, etc).
- [x] BitsAndBytes quantization scheme (`backend.quantization_scheme=bnb`, `backend.quantization_config.load_in_4bit`, etc).
- [x] GPTQ quantization scheme (`backend.quantization_scheme=gptq`, `backend.quantization_config.bits=4`, etc).
- [x] Optimum's BetterTransformer (`backend.bettertransformer=true`).
- [x] Automatic Mixed Precision (`backend.amp_autocast=true`).
- [x] Dynamo/Inductor compiling (`backend.torch_compile=true`).
- [x] Random weights initialization (`backend.no_weights=true` for fast model instantiation without downloading weights)
- [x] Onnxruntime Quantization and AutoQuantization (`backend.quantization=true` or `backend.auto_quantization=avx2`, etc)
- [x] Onnxruntime Calibration for Static Quantization (`backend.quantization_config.is_static=true`, etc)
- [x] Onnxruntime Optimization and AutoOptimization (`backend.optimization=true` or `backend.auto_optimization=O4`, etc)
- [x] BitsAndBytes quantization scheme (`backend.quantization_scheme=bnb`, `backend.quantization_config.load_in_4bit`, etc)
- [x] GPTQ quantization scheme (`backend.quantization_scheme=gptq`, `backend.quantization_config.bits=4`, etc)
- [x] PEFT training (`backend.peft_strategy=lora`, `backend.peft_config.task_type=CAUSAL_LM`, etc)
- [x] Distributed inference/training (`launcher=torchrun`, `launcher.n_proc_per_node=2`, etc)
- [x] Transformers' Flash Attention V2 (`backend.use_flash_attention_v2=true`)
- [x] Optimum's BetterTransformer (`backend.to_bettertransformer=true`)
- [x] DeepSpeed-Inference support (`backend.deepspeed_inference=true`)
- [x] Dynamo/Inductor compiling (`backend.torch_compile=true`)
- [x] Automatic Mixed Precision (`backend.amp_autocast=true`)
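
For example, a few of these backend options can be stacked on top of an example config (a sketch; whether a given combination is supported depends on the backend, the model and the hardware):

```bash
# Hypothetical run: instantiate the model with random weights (no download)
# and quantize it with bitsandbytes in 4-bit.
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 \
    backend.no_weights=true \
    backend.quantization_scheme=bnb backend.quantization_config.load_in_4bit=true
```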

## Quickstart

@@ -73,9 +78,7 @@ python -m pip install git+https://github.com/huggingface/optimum-benchmark.git
or by cloning the repository and installing it in editable mode:

```bash
git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark

python -m pip install -e .
git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark && python -m pip install -e .
```

Depending on the backends you want to use, you might need to install some extra dependencies:
@@ -86,7 +89,7 @@ Depending on the backends you want to use, you might need to install some extra dependencies:
- Intel Neural Compressor: `pip install optimum-benchmark[neural-compressor]`
- Text Generation Inference: `pip install optimum-benchmark[text-generation-inference]`
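
If you need several backends in the same environment, the extras can be combined in a single install, as usual with pip (a sketch using only the extras listed above):

```bash
python -m pip install "optimum-benchmark[neural-compressor,text-generation-inference]"
```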

You can now run a benchmark using the command line by specifying the configuration directory and the configuration name. Both arguments are mandatory for `hydra`. `config-dir` is the directory where the configuration files are stored and `config-name` is the name of the configuration file without its `.yaml` extension.
You can now run a benchmark using the command line by specifying the configuration directory and the configuration name. Both arguments are mandatory for `hydra`. `--config-dir` is the directory where the configuration files are stored and `--config-name` is the name of the configuration file without its `.yaml` extension.

```bash
optimum-benchmark --config-dir examples/ --config-name pytorch_bert
@@ -103,12 +106,13 @@ The directory for storing these results can be changed by setting `hydra.run.dir`.
It's easy to override the default behavior of a benchmark from the command line.

```bash
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 device=cuda:1
optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 device=cuda
```

## Multirun configuration sweeps

You can easily run configuration sweeps using the `-m` or `--multirun` option. By default, configurations will be executed serially but other kinds of executions are supported with hydra's launcher plugins : `hydra/launcher=submitit`, `hydra/launcher=rays`, `hydra/launcher=joblib`, etc.
You can easily run configuration sweeps using the `-m` or `--multirun` option. By default, configurations will be executed serially, but other kinds of execution are supported with hydra's launcher plugins: `hydra/launcher=submitit`, `hydra/launcher=rays`, etc.
Note that the hydra launcher (`hydra/launcher`) is different from our own `launcher`: `hydra/launcher` can only be used in `--multirun` mode and only handles the inter-run behavior (see the sketch after the examples below for how the two combine).

```bash
optimum-benchmark --config-dir examples --config-name pytorch_bert -m device=cpu,cuda
```

@@ -120,18 +124,6 @@ Also, for integer parameters like `batch_size`, one can specify a range of values:

```bash
optimum-benchmark --config-dir examples --config-name pytorch_bert -m device=cpu,cuda benchmark.input_shapes.batch_size='range(1,10,step=2)'
```
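
The `hydra/launcher` and `launcher` settings can also be combined in a single sweep. Below is a minimal sketch; it assumes the corresponding hydra plugin (here, hydra-joblib-launcher) is installed, and uses the `process` launcher mentioned earlier:

```bash
# Hypothetical sweep: hydra/launcher=joblib schedules the sweep's runs,
# while launcher=process controls how each individual benchmark is executed.
optimum-benchmark --config-dir examples --config-name pytorch_bert -m \
    hydra/launcher=joblib launcher=process device=cpu,cuda
```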

## Reporting benchmark results (WIP)

To aggregate the results of a benchmark (run(s) or sweep(s)), you can use the `optimum-report` command.

```bash
optimum-report --experiments {experiments_folder_1} {experiments_folder_2} --baseline {baseline_folder} --report-name {report_name}
```

This will create a report in the `reports` folder with the name `{report_name}`. The report will contain the results of the experiments in `{experiments_folder_1}` and `{experiments_folder_2}` compared to the results of the baseline in `{baseline_folder}` in the form of a `.csv` file, an `.svg` rich table and (a) `.png` plot(s).

You can also reuse some components of the reporting script for your use case (examples in [`examples/training-llamas`] and [`examples/running-llamas`]).

## Configurations structure

You can create custom configuration files following the [examples here](examples).
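
As a minimal sketch of what such a file looks like (modeled on the `examples/pytorch_bert.yaml` shown further down in this diff; the file name `examples/my_experiment.yaml` and the use of a shell heredoc are purely illustrative):

```bash
# Hypothetical custom experiment config, written from the shell for illustration.
cat > examples/my_experiment.yaml <<'EOF'
defaults:
  - backend: pytorch # default backend
  - launcher: inline # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: my_experiment
model: gpt2
device: cuda

hydra:
  run:
    dir: runs/${experiment_name}
  sweep:
    dir: sweeps/${experiment_name}
  job:
    chdir: true
    env_set:
      OVERRIDE_BENCHMARKS: 1
EOF

optimum-benchmark --config-dir examples/ --config-name my_experiment
```
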
@@ -158,6 +150,6 @@ Contributions are welcome! And we're happy to help you get started. Feel free to
Things that we'd like to see:

- More backends (Tensorflow, TFLite, Jax, etc).
- More hardware support (Habana Gaudi Processor (HPU), etc).
- More tests (right now we only have few tests per backend).
- Task evaluators for the most common tasks (would be great for output regression).
- More hardware support (Habana Gaudi Processor (HPU), RadeonOpenCompute (ROCm), etc).
@@ -1,20 +1,13 @@
defaults:
- backend: openvino # default backend
- launcher: inline # default launcher
- benchmark: inference # default benchmark
- experiment # inheriting experiment schema
- _self_ # for hydra 1.1 compatibility
- override hydra/job_logging: colorlog # colorful logging
- override hydra/hydra_logging: colorlog # colorful logging

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true

experiment_name: openvino_sdxl
experiment_name: openvino_diffusion
model: stabilityai/stable-diffusion-2-1
device: cpu

@@ -26,3 +19,13 @@ backend:
benchmark:
input_shapes:
batch_size: 1

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true
env_set:
OVERRIDE_BENCHMARKS: 1
10 changes: 6 additions & 4 deletions examples/pytorch_bert.yaml
@@ -1,11 +1,16 @@
defaults:
- backend: pytorch # default backend
- launcher: inline # default launcher
- benchmark: inference # default benchmark
- experiment # inheriting experiment schema
- _self_ # for hydra 1.1 compatibility
- override hydra/job_logging: colorlog # colorful logging
- override hydra/hydra_logging: colorlog # colorful logging

experiment_name: pytorch_bert
model: bert-base-uncased
device: cuda

hydra:
run:
dir: runs/${experiment_name}
@@ -14,9 +19,6 @@ hydra:
job:
chdir: true
env_set:
OVERRIDE_BENCHMARKS: 1
CUDA_VISIBLE_DEVICES: 0
CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: pytorch_bert
model: bert-base-uncased
device: cuda
@@ -1,35 +1,36 @@
defaults:
- backend: text-generation-inference # default backend
- benchmark: inference # default benchmark
- launcher: inline # default launcher
- experiment # inheriting experiment schema
- _self_ # for hydra 1.1 compatibility
- override hydra/job_logging: colorlog # colorful logging
- override hydra/hydra_logging: colorlog # colorful logging

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true
env_set:
CUDA_VISIBLE_DEVICES: 0,1
CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: text_generation_inference
experiment_name: tgi_llama
model: NousResearch/Llama-2-7b-hf
device: cuda

backend:
no_weights: true
initial_isolation_check: false
continous_isolation_check: false
torch_dtype: float16
continuous_isolation: false
# no_weights: true # work in progress

benchmark:
input_shapes:
batch_size: 32
sequence_length: 128

new_tokens: 1000

hydra:
run:
dir: runs/${experiment_name}
sweep:
dir: sweeps/${experiment_name}
job:
chdir: true
env_set:
OVERRIDE_BENCHMARKS: 1
CUDA_VISIBLE_DEVICES: 0,1
CUDA_DEVICE_ORDER: PCI_BUS_ID
35 changes: 24 additions & 11 deletions optimum_benchmark/backends/base.py
@@ -18,17 +18,18 @@

import numpy as np
from optimum.exporters import TasksManager
from transformers import AutoConfig, AutoProcessor
from transformers import (
AutoConfig,
AutoProcessor,
GenerationConfig,
Pipeline,
PretrainedConfig,
PreTrainedModel,
TrainerState,
)
from transformers.utils import ModelOutput

if TYPE_CHECKING:
from transformers import (
Pipeline,
PretrainedConfig,
PreTrainedModel,
TrainerState,
)
from transformers.utils import ModelOutput

from .utils import PreTrainedProcessor

from ..task_utils import DIFFUSION_TASKS, TEXT_GENERATION_TASKS
@@ -50,8 +51,9 @@ class Backend(Generic[BackendConfigT], ABC):
config: BackendConfigT
isolation_thread: Optional[Process]
pretrained_model: Union["PreTrainedModel", "Pipeline"]
pretrained_processor: Optional["PreTrainedProcessor"]
pretrained_config: Optional["PretrainedConfig"]
pretrained_processor: Optional["PreTrainedProcessor"]
pretrained_generation_config: Optional["GenerationConfig"]
automodel_class: Callable[..., "PreTrainedModel"]

def __init__(self, model: str, task: str, device: str, hub_kwargs: Dict[str, Any]):
@@ -74,7 +76,7 @@ def __init__(self, model: str, task: str, device: str, hub_kwargs: Dict[str, Any]):

try:
# sometimes contains information about the model's
# input shapes that're not available in the config
# input shapes that are not available in the config
self.pretrained_processor = AutoProcessor.from_pretrained(
pretrained_model_name_or_path=self.model, **self.hub_kwargs
)
@@ -83,6 +85,17 @@ def __init__(self, model: str, task: str, device: str, hub_kwargs: Dict[str, Any]):
LOGGER.warning("Could not find the model's preprocessor")
self.pretrained_processor = None

if self.is_text_generation_model():
try:
self.pretrained_generation_config = GenerationConfig.from_pretrained(
pretrained_model_name=self.model, **self.hub_kwargs
)
except Exception:
LOGGER.warning("Could not find the model's generation config")
self.pretrained_generation_config = None
else:
self.pretrained_generation_config = None

self.automodel_class = TasksManager.get_model_class_for_task(
framework="pt", # TODO: make this configurable to add support for other frameworks
task=self.task,