
Make vLLM support --enable-adaptive-compute #4

Open · 3 tasks

GindaChen (Collaborator) opened this issue Jan 30, 2025 · 0 comments

User Flow

  1. The service provider starts a reasoning LLM with --enable-adaptive-compute:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--enable-adaptive-compute \
--adaptive-compute-method "prompting" \
--adaptive-compute-token-interval 32 \
--adaptive-compute-prompt "Do you think you can provide the answer now?"

The --adaptive-compute-method can be (1) a simple prompt, (2) logprob-based, etc. We will support only the simple prompt for now.

2a. The user sends a single request and it works out of the box (a minimal client sketch follows the command):

python vllm/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
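
A minimal sketch of what step 2a looks like from the client side, assuming the server from step 1 is listening on localhost:8000. It uses the standard OpenAI Python client; the reasoning_content handling is illustrative and depends on which reasoning parser is configured.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.8?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    # With adaptive compute enabled server-side, the stream should simply
    # end earlier; the client code does not change.
    piece = getattr(delta, "reasoning_content", None) or delta.content
    if piece:
        print(piece, end="", flush=True)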

2b. The user sends a batch of requests and it should also work out of the box (the example script does not exist yet; a hypothetical sketch follows):

python vllm/examples/online_serving/openai_chat_completion_with_reasoning_batch.py
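
Since the batch example does not exist yet, here is a hypothetical sketch of what it could look like: fire several chat requests concurrently at the same endpoint and let server-side adaptive compute cut each one short. The questions are placeholders.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    questions = ["Is 7919 prime?", "Solve x^2 - 5x + 6 = 0.", "What is 17 * 24?"]
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for q, a in zip(questions, answers):
        print(f"Q: {q}\nA: {a}\n")

asyncio.run(main())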
  3. Add a benchmark suite to measure the tokens we save (does not exist yet).
python vllm/benchmarks/reasoning/benchmark_adaptive_compute_batch.py
python vllm/benchmarks/reasoning/benchmark_adaptive_compute_online.py

The benchmark should include:

  • Variables: a set of (model, dataset) pairs and a threshold
  • For batch requests: x-axis is number of tokens, y-axis is accuracy
  • For online requests: x-axis is request rate, y-axis is deadline attainment
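
The batch benchmark boils down to producing one (tokens, accuracy) point per (model, dataset, threshold) setting. A hypothetical sketch of that aggregation (the RequestResult shape and the numbers below are placeholders, not the actual benchmark code):

from dataclasses import dataclass

@dataclass
class RequestResult:
    tokens_used: int   # total tokens decoded for this request
    correct: bool      # whether the final answer matched the reference

def aggregate(results: list[RequestResult]) -> tuple[int, float]:
    total_tokens = sum(r.tokens_used for r in results)
    accuracy = sum(r.correct for r in results) / len(results)
    return total_tokens, accuracy  # x-axis: tokens, y-axis: accuracy

# Example: three requests under one (model, dataset, threshold) setting.
print(aggregate([RequestResult(512, True), RequestResult(230, True), RequestResult(940, False)]))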

Goal

  • Make a vLLM OpenAI-compatible endpoint that supports --enable-adaptive-compute. This should require minimal user effort to support certaindex on CoT.
  • Ensure DeepSeek models work. Need to test: DeepSeek-R1-Distill-Qwen 7B/32B (with TP), and ideally the full R1 (quantized).

Non-Goals / Optional Optimizations

  • Gang scheduling. We need to focus on getting CoT support into the OpenAI endpoint in vLLM (within 1 week); there is not much performance gain we can claim from gang scheduling within a single request. One possible optimization is request prioritization: prioritize requests already in the CoT process instead of admitting more requests into the batch. This is nice to have, but not required.
  • Other reasoning algorithms (SC, MCTS, Rebase). We focus only on CoT for now. We already have code for the other algorithms; once vLLM buys into the idea, we will do the rest.

Changes

  • Modify the vLLM serving_chat.py to support the adaptive compute functionality.
  • Write examples and benchmarks to confirm the support.
  • Run the benchmarks on the models above and provide the results.

Modify the vLLM serving_chat.py

  • a. (P0) Pure client-side change. Everything stays on the client side. This is the fastest and most compatible option, at the cost of extra client-provider HTTP round trips (see the sketch after this list).
  • b. Add an additional web endpoint. We provide our own endpoint (/v1/chat/reasoning) so we have better control over prompt-vs-decode requests. Inside the endpoint, we need to manage the requests carefully.
  • c. Change the existing OpenAI endpoint. We engineer through serving_chat.py and send/process requests based on the request phase.
  • d. Change the engine. We add reasoning-request management inside the LLM engine.
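
A minimal sketch of option (a), assuming an unmodified vLLM OpenAI-compatible server and the prompting method with a 32-token interval. It uses the raw /v1/completions API to sidestep chat-template handling; the probe wording mirrors the flag proposed above, while the prompt framing, stop condition, and yes/no parsing are illustrative, not the actual implementation.

from openai import OpenAI

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
PROBE = "Do you think you can provide the answer now? Answer yes or no:"
INTERVAL = 32       # tokens decoded between probes
MAX_TOKENS = 2048   # hard cap on the CoT

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def adaptive_generate(question: str) -> str:
    text = f"Question: {question}\nLet's think step by step.\n"
    for _ in range(MAX_TOKENS // INTERVAL):
        # Decode the next chunk of the chain of thought.
        chunk = client.completions.create(
            model=MODEL, prompt=text, max_tokens=INTERVAL, temperature=0.6
        ).choices[0].text
        text += chunk
        # Probe with the adaptive-compute prompt; one cheap extra token.
        verdict = client.completions.create(
            model=MODEL, prompt=text + "\n" + PROBE, max_tokens=1, temperature=0.0
        ).choices[0].text
        if verdict.strip().lower().startswith("yes"):
            break
    # Ask for the final answer once the model says it is ready (or we hit the cap).
    final = client.completions.create(
        model=MODEL, prompt=text + "\nFinal answer:", max_tokens=64, temperature=0.0
    ).choices[0].text
    return final.strip()

print(adaptive_generate("What is 12 * 13?"))

With prefix caching enabled on the server, the repeatedly growing prompt should mostly hit cache, so the dominant overhead is the extra HTTP round trips noted in (a).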

Benchmark Work

  • Add light/medium/heavy thresholds as defaults.
  • Provide preset thresholds for standard (model, dataset) pairs.
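
A hypothetical sketch of how the presets could be keyed; the dataset name and all numeric values are placeholders to be filled in from the benchmark results above.

PRESET_THRESHOLDS = {
    # (model, dataset) -> light/medium/heavy stopping thresholds (placeholder values)
    ("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "gsm8k"): {
        "light": 0.5,
        "medium": 0.7,
        "heavy": 0.9,
    },
}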