User Flow
1. Service provider starts a reasoning LLM model with --enable-adaptive-compute:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--enable-adaptive-compute \
--adaptive-compute-method "prompting" \
--adaptive-compute-token-interval 32 \
--adaptive-compute-prompt "Do you think you can provide the answer now?"
The --adaptive-compute-method can be (1) a simple prompt, (2) logprob-based, etc. We will support only the simple prompt method for now.
2a. A user sends one request and it works out of the box (see the sketch after this list).
2b. A user sends a batch of requests and it should also work out of the box. (does not exist yet)
3. The benchmark should include:
For batch requests: x-axis is the number of tokens, y-axis is accuracy.
For online requests: x-axis is the request rate, y-axis is deadline attainment.
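As a sketch of step 2a: the client uses the standard OpenAI-compatible chat completions API with nothing adaptive-compute-specific on its side. The model name matches the serve command above; the base URL assumes vLLM's default port 8000.

```python
# Minimal sketch: a single request against the adaptive-compute-enabled server.
# Assumes the server was started with `vllm serve ... --enable-adaptive-compute`
# and is listening on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```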
Goal
Make a vLLM OpenAI-compatible endpoint that supports --enable-adaptive-compute. This should require minimal user effort to support certaindex on CoT.
Ensure DeepSeek models are working. Need to test: DeepSeek-R1-Distill-Qwen 7B/32B (with TP), and ideally the full R1 (quantized).
Non-Goals / Optional Optimizations
Gang scheduling. We need to focus on getting CoT supported with the OpenAI endpoint in vLLM (within 1 week). There is not much performance gain we can claim from gang scheduling within one request. One possible optimization is request prioritization: prioritize requests already in the CoT process instead of admitting more requests into the batch. This is nice to have, but not required.
Other reasoning algorithms (SC, MCTS, Rebase). We focus only on CoT for now. We already have the code for the other algorithms, and once vLLM buys into the idea, we will do the rest.
Changes
Modify the vLLM serving_chat.py to support the adaptive compute functionality.
Write examples and benchmarks to confirm the support.
Run benchmarks on the models and report the benchmark results.
Modify the vLLM serving_chat.py
a. (P0) Pure client-side change. Everything stays on the client side. This is the fastest and most compatible approach, at the cost of extra client-provider HTTP round trips (see the sketch after this list).
b. Add an additional web endpoint. We provide our own endpoint (/v1/chat/reasoning) so we have better control over prompt-or-decode requests. Inside the endpoint, we need to manage the requests carefully.
c. Change the existing OpenAI endpoint. We properly engineer serving_chat.py to send and process requests based on the request phase.
d. Change the engine. We add reasoning request management inside the LLM engine.
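A minimal sketch of option (a), assuming only the stock OpenAI-compatible completions endpoint and no server changes: the client decodes the CoT in fixed-size chunks and, every token interval, issues a short probe with the same text as --adaptive-compute-prompt to decide whether to stop early. The chunk size, probe wording, and the yes/no answer-detection heuristic below are illustrative placeholders, not a fixed design.

```python
# Client-side adaptive compute sketch (option a).
# Assumptions: a vLLM OpenAI-compatible server on port 8000, and the
# text-completions API for easy continuation of a partial CoT. The probe
# heuristic is a placeholder, not the final certaindex criterion.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
TOKEN_INTERVAL = 32            # mirrors --adaptive-compute-token-interval
PROBE = "\nDo you think you can provide the answer now? Answer yes or no.\n"
MAX_CHUNKS = 64

def solve(question: str) -> str:
    cot = ""
    for _ in range(MAX_CHUNKS):
        # Decode the next chunk of the chain of thought.
        chunk = client.completions.create(
            model=MODEL,
            prompt=question + cot,
            max_tokens=TOKEN_INTERVAL,
        ).choices[0].text
        cot += chunk

        # Probe: ask the model whether it is ready to answer.
        probe = client.completions.create(
            model=MODEL,
            prompt=question + cot + PROBE,
            max_tokens=4,
            temperature=0.0,
        ).choices[0].text
        if "yes" in probe.lower():
            break

    # Final answer conditioned on the (possibly truncated) CoT.
    final = client.completions.create(
        model=MODEL,
        prompt=question + cot + "\nFinal answer:",
        max_tokens=128,
    ).choices[0].text
    return final
```

Options (b)-(d) move this loop server-side, which removes the extra round trips at the cost of progressively deeper changes to serving_chat.py or the engine.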
Benchmark Work
Add light/medium/heavy thresholds as defaults.
We provide preset thresholds for standard (model, dataset) pairs (see the sketch below).
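A sketch of what the presets could look like; the structure, the dataset key, and the numbers are hypothetical placeholders to be replaced by values measured in the benchmarks above.

```python
# Hypothetical preset table keyed by (model, dataset); all values are
# placeholders to be filled in from the benchmark results, not measurements.
ADAPTIVE_COMPUTE_PRESETS = {
    ("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "gsm8k"): {
        "light":  {"token_interval": 64, "max_cot_tokens": 1024},
        "medium": {"token_interval": 32, "max_cot_tokens": 4096},
        "heavy":  {"token_interval": 16, "max_cot_tokens": 16384},
    },
}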