MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference.
MoE-Infinity is cost-effective yet fast:
- Offloading MoE experts to host memory, allowing memory-constrained GPUs to serve MoE models.
- Minimizing expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (the caching idea is sketched below).
- Supporting LLM acceleration techniques (such as FlashAttention).
- Supporting multi-GPU environments with numerous OS-level performance optimizations.
- Achieving SOTA latency performance when serving MoE models in resource-constrained GPU environments (in comparison with vLLM, HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp).
MoE-Infinity is easy-to-use:
- HuggingFace model compatible and HuggingFace programmer friendly.
- Supporting all available MoE checkpoints (including DeepSeek-V2, Google Switch Transformers, Meta NLLB-MoE, and Mixtral).
Note: the open-sourced MoE-Infinity has been redesigned to be friendly to HuggingFace users. This version differs from the one reported in the paper, which takes extreme performance as its top priority. As a result, distributed inference is currently not supported in this open-sourced version.
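To make the activation-aware caching idea concrete, here is a minimal, hypothetical sketch of activation-aware eviction. It is not MoE-Infinity's actual implementation, and every name in it (`ActivationAwareExpertCache`, `fetch`, `load_from_host`) is invented for illustration: experts activated most often in the trace stay resident on the GPU, while rarely activated experts are evicted first.

```python
from collections import Counter, OrderedDict

class ActivationAwareExpertCache:
    """Sketch: keep up to `capacity` experts on the GPU and evict the
    expert with the fewest traced activations first."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.activation_counts = Counter()  # per-expert activation trace
        self.resident = OrderedDict()       # expert_id -> weights "on GPU"

    def fetch(self, expert_id, load_from_host):
        # Record the activation, then return resident weights,
        # loading from host memory on a cache miss.
        self.activation_counts[expert_id] += 1
        if expert_id not in self.resident:
            if len(self.resident) >= self.capacity:
                # Evict the resident expert activated least often so far.
                victim = min(self.resident, key=self.activation_counts.__getitem__)
                del self.resident[victim]
            self.resident[expert_id] = load_from_host(expert_id)
        return self.resident[expert_id]

# Toy usage: 2 GPU slots; host "storage" is simulated with a dict.
host = {e: f"weights-{e}" for e in range(8)}
cache = ActivationAwareExpertCache(capacity=2)
for e in [0, 3, 0, 5, 0, 3]:  # router decisions for successive tokens
    cache.fetch(e, host.__getitem__)
print(list(cache.resident))   # the hot experts [0, 3] remain resident
```

In the real system, the activation trace also drives prefetching: experts predicted to be activated can be copied to the GPU ahead of time, hiding transfer latency.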
Per-token latency (seconds) on a single A5000 GPU (24 GB memory) for generation with a mixed dataset that includes LongBench, GSM8K, FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.
| | Switch-large-128 | NLLB-MoE-54B | Mixtral-8x7B | DeepSeek-V2-Lite |
|---|---|---|---|---|
| MoE-Infinity | 0.130 | 0.119 | 0.735 | 0.155 |
| Accelerate | 1.043 | 3.071 | 6.633 | 1.743 |
| DeepSpeed | 4.578 | 8.381 | 2.486 | 0.737 |
| Mixtral Offloading | X | X | 1.752 | X |
| Ollama | X | X | 0.903 | 1.250 |
| vLLM | X | X | 2.137 | 0.493 |
We recommend installing MoE-Infinity in a virtual environment. To install MoE-Infinity, you can either install it from PyPI or build it from source.
```bash
conda create -n moe-infinity python=3.9
conda activate moe-infinity
```

```bash
# installing from either PyPI or source pulls in requirements.txt automatically

# install the stable release
pip install moe-infinity

# install the nightly release
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
```
```bash
git clone https://github.com/EfficientMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .

# assumes conda; otherwise install libstdcxx-ng=12 (or gcc=12) via your system package manager
conda install -c conda-forge libstdcxx-ng=12
```
Install FlashAttention (>=2.5.2) for faster inference with the following command.
```bash
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
```
After installation, MoE-Infinity automatically integrates with FlashAttention to improve performance.
We provide a simple API for diverse setups, including single-GPU and multi-GPU environments. The following examples show how to use MoE-Infinity to run generation with a HuggingFace LLM.
- The `offload_path` must be unique for each MoE model. Reusing the same `offload_path` for different MoE models will result in unexpected behavior.
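One illustrative way to keep paths unique (a convention for this example, not part of the MoE-Infinity API) is to derive the offload directory from the checkpoint name:

```python
import os

# Hypothetical convention: one offload subdirectory per checkpoint, so two
# different MoE models never share the same offload_path.
user_home = os.path.expanduser("~")
checkpoint = "deepseek-ai/DeepSeek-V2-Lite-Chat"
offload_path = os.path.join(user_home, "moe-infinity", checkpoint.replace("/", "--"))
```

The full generation example below serves a single model, so a fixed path suffices: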
```python
import os

import torch
from transformers import AutoTokenizer

from moe_infinity import MoE

user_home = os.path.expanduser("~")

checkpoint = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    # 75% of the device memory is used for caching; lower this value
    # according to your device memory size if you hit OOM
    "device_memory_ratio": 0.75,
}
model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```
This command runs the script on selected GPUs.
```bash
CUDA_VISIBLE_DEVICES=0,1 python script.py
```
We provide a simple example of running inference on a HuggingFace LLM. The script downloads the model checkpoint and runs inference on the specified input text; the output is printed to the console.

```bash
CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite-Chat" --offload_dir <your local path on SSD>
```
Start the OpenAI-compatible server locally:

```bash
python -m moe_infinity.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V2-Lite-Chat --offload-dir ./offload_dir
```
Query the model via `/v1/completions`. (We currently support only the required fields, i.e., `"model"` and `"prompt"`.)
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
        "prompt": "Hello, my name is"
    }'
```
You can also use the `openai` Python package to query the model.

```bash
pip install openai
python tests/test_oai_completions.py
```
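For reference, a minimal completions query with the `openai` package (v1 client) might look like this; the `api_key` is a placeholder, assuming the local server does not validate it:

```python
from openai import OpenAI

# Point the client at the local MoE-Infinity server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

completion = client.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    prompt="Hello, my name is",
)
print(completion.choices[0].text)
```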
Query the model via `/v1/chat/completions`. (We currently support only the required fields, i.e., `"model"` and `"messages"`.)
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke"}
        ]
    }'
```
You can also use the `openai` Python package to query the model.

```bash
pip install openai
python tests/test_oai_chat_completions.py
```
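The equivalent chat query with the `openai` package (again with a placeholder `api_key`, assuming the local server does not validate it):

```python
from openai import OpenAI

# Point the client at the local MoE-Infinity server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

chat = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"},
    ],
)
print(chat.choices[0].message.content)
```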
We plan to release the following features in the coming months:
- Supporting vLLM as an alternative inference runtime (PyTorch is currently the default engine), including support for KV cache offloading.
- Supporting expert parallelism for distributed MoE inference.
- More (We welcome contributors to join us!)
If you use MoE-Infinity for your research, please cite our paper:
```bibtex
@misc{moe-infinity,
  author        = {Leyang Xue and
                   Yao Fu and
                   Zhan Lu and
                   Luo Mai and
                   Mahesh Marina},
  title         = {MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache},
  archivePrefix = {arXiv},
  eprint        = {2401.14361},
  year          = {2024}
}
```