
Make vLLM support --enable-adaptive-compute #4

Open · 3 tasks

GindaChen (Collaborator) opened this issue Jan 30, 2025 · 0 comments

User Flow

  1. The service provider starts a reasoning LLM with --enable-adaptive-compute:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--enable-adaptive-compute \
--adaptive-compute-method "prompting" \
--adaptive-compute-token-interval 32 \
--adaptive-compute-prompt "Do you think you can provide the answer now?"

The --adaptive-compute-method can be (1) a simple prompt, (2) logprob-based, etc. We will support only the simple prompt for now.

2a. The user sends a single request and it works out of the box (a minimal client sketch follows the command):

python vllm/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
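
A minimal sketch of what step 2a looks like from the client side, assuming the server from step 1 is listening on localhost:8000. It uses the standard OpenAI Python client; the reasoning_content handling is illustrative and depends on which reasoning parser is configured.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.8?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    # With adaptive compute enabled server-side, the stream should simply
    # end earlier; the client code does not change.
    piece = getattr(delta, "reasoning_content", None) or delta.content
    if piece:
        print(piece, end="", flush=True)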

2b. The user sends a batch of requests and it should also work out of the box (the example script does not exist yet; a hypothetical sketch follows):

python vllm/examples/online_serving/openai_chat_completion_with_reasoning_batch.py
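
Since the batch example does not exist yet, here is a hypothetical sketch of what it could look like: fire several chat requests concurrently at the same endpoint and let server-side adaptive compute cut each one short. The questions are placeholders.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    questions = ["Is 7919 prime?", "Solve x^2 - 5x + 6 = 0.", "What is 17 * 24?"]
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for q, a in zip(questions, answers):
        print(f"Q: {q}\nA: {a}\n")

asyncio.run(main())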
  3. Add a benchmark suite to measure the tokens we save (does not exist yet).
python vllm/benchmarks/reasoning/benchmark_adaptive_compute_batch.py
python vllm/benchmarks/reasoning/benchmark_adaptive_compute_online.py

The benchmark should include:

  • Variables: a set of (model, dataset) pairs and a threshold
  • For batch requests: x-axis is number of tokens, y-axis is accuracy
  • For online requests: x-axis is request rate, y-axis is deadline attainment
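
The batch benchmark boils down to producing one (tokens, accuracy) point per (model, dataset, threshold) setting. A hypothetical sketch of that aggregation (the RequestResult shape and the numbers below are placeholders, not the actual benchmark code):

from dataclasses import dataclass

@dataclass
class RequestResult:
    tokens_used: int   # total tokens decoded for this request
    correct: bool      # whether the final answer matched the reference

def aggregate(results: list[RequestResult]) -> tuple[int, float]:
    total_tokens = sum(r.tokens_used for r in results)
    accuracy = sum(r.correct for r in results) / len(results)
    return total_tokens, accuracy  # x-axis: tokens, y-axis: accuracy

# Example: three requests under one (model, dataset, threshold) setting.
print(aggregate([RequestResult(512, True), RequestResult(230, True), RequestResult(940, False)]))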

Goal

  • Make a vLLM OpenAI-compatible endpoint that supports --enable-adaptive-compute. This should require minimal user effort to support certaindex on CoT.
  • Ensure DeepSeek models work. Need to test: DeepSeek-R1-Distill-Qwen 7B/32B (with TP), and ideally the full R1 (quantized).

Non-Goals / Optional Optimizations

  • Gang scheduling. We need to focus on getting CoT support into the OpenAI endpoint in vLLM (within 1 week); there is not much performance gain we can claim from gang scheduling within a single request. One possible optimization is request prioritization: prioritize requests already in the CoT process instead of admitting more requests into the batch. This is nice to have, but not required.
  • Other reasoning algorithms (SC, MCTS, Rebase). We focus only on CoT for now. We already have code for the other algorithms; once vLLM buys into the idea, we will do the rest.

Changes

  • Modify the vLLM serving_chat.py to support the adaptive compute functionality.
  • Write examples and benchmarks to confirm the support.
  • Run the benchmarks on the models above and provide the results.

Modify the vLLM serving_chat.py

  • a. (P0) Pure client-side change. Everything stays on the client side. This is the fastest and most compatible option, at the cost of extra client-provider HTTP round trips (see the sketch after this list).
  • b. Add an additional web endpoint. We provide our own endpoint (/v1/chat/reasoning) so we have better control over prompt-vs-decode requests. Inside the endpoint, we need to manage the requests carefully.
  • c. Change the existing OpenAI endpoint. We engineer through serving_chat.py and send/process requests based on the request phase.
  • d. Change the engine. We add reasoning-request management inside the LLM engine.
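
A minimal sketch of option (a), assuming an unmodified vLLM OpenAI-compatible server and the prompting method with a 32-token interval. It uses the raw /v1/completions API to sidestep chat-template handling; the probe wording mirrors the flag proposed above, while the prompt framing, stop condition, and yes/no parsing are illustrative, not the actual implementation.

from openai import OpenAI

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
PROBE = "Do you think you can provide the answer now? Answer yes or no:"
INTERVAL = 32       # tokens decoded between probes
MAX_TOKENS = 2048   # hard cap on the CoT

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def adaptive_generate(question: str) -> str:
    text = f"Question: {question}\nLet's think step by step.\n"
    for _ in range(MAX_TOKENS // INTERVAL):
        # Decode the next chunk of the chain of thought.
        chunk = client.completions.create(
            model=MODEL, prompt=text, max_tokens=INTERVAL, temperature=0.6
        ).choices[0].text
        text += chunk
        # Probe with the adaptive-compute prompt; one cheap extra token.
        verdict = client.completions.create(
            model=MODEL, prompt=text + "\n" + PROBE, max_tokens=1, temperature=0.0
        ).choices[0].text
        if verdict.strip().lower().startswith("yes"):
            break
    # Ask for the final answer once the model says it is ready (or we hit the cap).
    final = client.completions.create(
        model=MODEL, prompt=text + "\nFinal answer:", max_tokens=64, temperature=0.0
    ).choices[0].text
    return final.strip()

print(adaptive_generate("What is 12 * 13?"))

With prefix caching enabled on the server, the repeatedly growing prompt should mostly hit cache, so the dominant overhead is the extra HTTP round trips noted in (a).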

Benchmark Work

  • Add light/medium/heavy thresholds as defaults.
  • Provide preset thresholds for standard (model, dataset) pairs.
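
A hypothetical sketch of how the presets could be keyed; the dataset name and all numeric values are placeholders to be filled in from the benchmark results above.

PRESET_THRESHOLDS = {
    # (model, dataset) -> light/medium/heavy stopping thresholds (placeholder values)
    ("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "gsm8k"): {
        "light": 0.5,
        "medium": 0.7,
        "heavy": 0.9,
    },
}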