Adding basic kv-cache transfer to vllm v1 #1

mrn3088 · 2024-11-19T11:08:23Z

A working implementation of KV-cache transfer between a prefill engine and a decode engine in vLLM.
Tested only on examples/offline_inference.py.
To run it, use the following commands in two terminals:

Prefill Engine

vllm/examples# VLLM_PORT=47651 python3 offline_inference.py --dist-factor 2 --rank 0 --local-rank 0 --role prefill --max-tokens 1

Decode Engine

vllm/examples# VLLM_PORT=47651 python3 offline_inference.py --dist-factor 2 --rank 1 --local-rank 1 --role decode --max-tokens 16

The first process executes the model for one step (the prefill step), then sends the hidden_state and kv_cache to the second process, which completes the remaining decode steps.
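The hand-off only works if both sides agree on message order: the decode process has to receive in the same order the prefill process sends. Below is a minimal sketch of that ordering over a generic point-to-point channel. The names `prefill_handoff` and `decode_receive` and the `send`/`recv` callables are hypothetical and are not the PR's code; the PR calls `dist.send` directly, and a matching `torch.distributed` receive fills a pre-allocated tensor instead of returning a value.

```python
from collections import deque
from typing import Any, Callable, List, Sequence, Tuple


def prefill_handoff(hidden_states: Any, kv_caches: Sequence[Any],
                    send: Callable[[Any], None]) -> None:
    """Send the hidden states first, then one message per layer's KV cache."""
    send(hidden_states)
    for layer_cache in kv_caches:
        send(layer_cache)


def decode_receive(num_layers: int,
                   recv: Callable[[], Any]) -> Tuple[Any, List[Any]]:
    """Receive in exactly the order prefill_handoff sends."""
    hidden_states = recv()
    kv_caches = [recv() for _ in range(num_layers)]
    return hidden_states, kv_caches


# In-memory FIFO standing in for the rank 0 -> rank 1 link (an assumption
# made for illustration; strings stand in for tensors).
channel: deque = deque()
prefill_handoff("hidden", ["kv_layer_0", "kv_layer_1"], channel.append)
hidden, caches = decode_receive(2, channel.popleft)
```

The decode side needs to know the number of layers up front, since nothing in the message stream marks where the KV caches end.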

For now, at least it's working.

Decode engine output (seems correct):

Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.18it/s, est. speed input: 20.65 toks/s, output: 50.82 toks/s]
Prompt: 'Hello, my name is', Generated text: " Joel. I'm from Massachusetts and live in Melbourne, Australia.\nI'm"
Prompt: 'The president of the United States is', Generated text: ' about to be arrested in Europe for allegedly meddling in the 2016 election.\n\n'
Prompt: 'The capital of France is', Generated text: ' becoming a state of chaos with a significant urban and industrial boom. France’'
Prompt: 'The future of AI is', Generated text: ' not as simple as you think, and you have to understand it in order to'

Things to pay attention to:

The current implementation has not been tested on other examples, especially the online ones. I'll check this later today.
The current KV transfer is not efficient. Ideally it should do a single send/recv that transfers all of kv_cache along with hidden_state.
What about more engines? Currently TP (tensor parallelism) and PP (pipeline parallelism) are disabled since they aren't compatible yet.
Lots of hardcoded stuff...
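One way to get down to a single send/recv, as the efficiency note suggests, is to flatten every tensor into one contiguous buffer and split it again on the receiving side. The sketch below is an illustration and not part of this PR. It assumes all tensors share a dtype and device, and that the receiver already knows the shapes (for example from the model and cache configuration). `pack_tensors` and `unpack_tensors` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

import torch

Shape = Tuple[int, ...]


def pack_tensors(tensors: Sequence[torch.Tensor]) -> Tuple[torch.Tensor, List[Shape]]:
    """Flatten and concatenate tensors into one 1-D buffer (same dtype assumed)."""
    shapes = [tuple(t.shape) for t in tensors]
    flat = torch.cat([t.reshape(-1) for t in tensors])
    return flat, shapes


def unpack_tensors(flat: torch.Tensor, shapes: Sequence[Shape]) -> List[torch.Tensor]:
    """Split the flat buffer back into tensors with the given shapes."""
    tensors = []
    offset = 0
    for shape in shapes:
        numel = math.prod(shape)
        tensors.append(flat[offset:offset + numel].view(shape))
        offset += numel
    return tensors


# Toy stand-ins for hidden_states and a three-layer kv_caches list.
hidden_states = torch.arange(6, dtype=torch.float32).reshape(2, 3)
kv_caches = [torch.full((2, 4), float(i)) for i in range(3)]

flat, shapes = pack_tensors([hidden_states, *kv_caches])
restored = unpack_tensors(flat, shapes)
```

Only `flat` would cross the wire, in one `dist.send`/`dist.recv` pair. The cost is an extra copy on the sender to build the buffer, so whether this beats per-layer sends would need measuring.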



Jocn2020 · 2024-11-19T18:35:28Z

vllm/v1/worker/gpu_model_runner.py

+            if role == "prefill" and prefill_step:
+                dist.send(hidden_states, dst=1)
+                for i in range(len(self.kv_caches)):
+                    dist.send(self.kv_caches[i], dst=1)


Nice work on the rank hack!
A quick comment from me: currently you send the entire KV cache. The next step is to send only the KV cache for the block IDs of specific requests, which you can find in scheduler_output.scheduled_new_reqs or scheduler_output.scheduled_resumed_reqs.
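A rough sketch of the selection step described above: collect the block IDs of the newly scheduled and resumed requests so that only those blocks are transferred. The dataclasses here are hypothetical stand-ins that model only the two fields named in this comment plus an assumed per-request `block_ids` list; they are not vLLM's actual classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    """Hypothetical stand-in for one scheduled request."""
    req_id: str
    block_ids: List[int]  # assumed field: KV-cache blocks owned by the request


@dataclass
class SchedulerOutput:
    """Hypothetical stand-in holding only the two lists named in the comment."""
    scheduled_new_reqs: List[RequestData] = field(default_factory=list)
    scheduled_resumed_reqs: List[RequestData] = field(default_factory=list)


def blocks_to_transfer(scheduler_output: SchedulerOutput) -> List[int]:
    """Return the sorted, de-duplicated block IDs of new and resumed requests."""
    block_ids = set()
    for req in scheduler_output.scheduled_new_reqs:
        block_ids.update(req.block_ids)
    for req in scheduler_output.scheduled_resumed_reqs:
        block_ids.update(req.block_ids)
    return sorted(block_ids)


out = SchedulerOutput(
    scheduled_new_reqs=[RequestData("a", [0, 1]), RequestData("b", [2])],
    scheduled_resumed_reqs=[RequestData("c", [1, 5])],
)
ids = blocks_to_transfer(out)
```

With these IDs, the sender would index each layer's cache along its block dimension instead of sending the whole tensor. Which dimension that is depends on the cache layout, which this thread does not specify.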

github-actions · 2025-02-18T02:36:00Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

DarkLight1337 and others added 30 commits November 1, 2024 14:09

[Frontend] Use a proper chat template for VLM2Vec (vllm-project#9912)

bb87acb

[Core] Refactor: Clean up unused argument in Scheduler._preempt (vllm…

6b3e1c2

…-project#9696) Signed-off-by: André Jonasson <andre.jonasson@gmail.com>

[torch.compile] use interpreter with stable api from pytorch (vllm-pr…

e1c27fc

…oject#9889) Signed-off-by: youkaichao <youkaichao@gmail.com>

[Bugfix/Core] Flashinfer k_scale and v_scale (vllm-project#9861)

04506c3

[1/N] pass the complete config from engine to executor (vllm-project#…

2d75d7c

…9933) Signed-off-by: youkaichao <youkaichao@gmail.com>

[Bugfix] PicklingError on RayTaskError (vllm-project#9934)

7c2fc9c

Signed-off-by: Gene Su <e870252314@gmail.com>

[ci/build] Bump the patch-update group with 10 updates (vllm-project#…

ac2ddd2

…9897) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com>

[Core][VLM] Add precise multi-modal placeholder tracking (vllm-projec…

569fcc0

…t#8346) Signed-off-by: Peter Salas <peter@fixie.ai>

[ci/build] Have dependabot ignore pinned dependencies (vllm-project#9935

d4775ce

) Signed-off-by: kevin <kevin@anyscale.com>

[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…

9550387

…odels (vllm-project#9559)

[torch.compile] fix cpu broken code (vllm-project#9947)

c812fe5

Signed-off-by: youkaichao <youkaichao@gmail.com>

[Docs] Update Granite 3.0 models in supported models table (vllm-proj…

68fc181

…ect#9930) Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Doc] Updated tpu-installation.rst with more details (vllm-project#9926)

171ccd6

Signed-off-by: Michael Green <mikegre@google.com>

[2/N] executor pass the complete config to worker/modelrunner (vllm-p…

22c99c0

…roject#9938) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>

[V1] Fix EngineArgs refactor on V1 (vllm-project#9954)

6d3ca46

[bugfix] fix chatglm dummy_data_for_glmv (vllm-project#9955)

21e61ba

Signed-off-by: youkaichao <youkaichao@gmail.com>

[3/N] model runner pass the whole config to model (vllm-project#9958)

f09a8e0

Signed-off-by: youkaichao <youkaichao@gmail.com>

[CI/Build] Quoting around > (vllm-project#9956)

d19ef1b

[torch.compile] Adding torch compile to vision-language models (vllm-…

f386574

…project#9946)

[bugfix] fix tsts (vllm-project#9959)

685bcc3

Signed-off-by: youkaichao <youkaichao@gmail.com>

[V1] Support per-request seed (vllm-project#9945)

9455b48

Signed-off-by: Nick Hill <nickhill@us.ibm.com>

[Model] Add support for H2OVL-Mississippi models (vllm-project#9747)

a45ebaf

Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <ywang@roblox.com>

[V1] Fix Configs (vllm-project#9971)

89a2c17

[Bugfix] Fix MiniCPMV and Mllama BNB bug (vllm-project#9917)

d2310a1

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

[Bugfix]Using the correct type hints (vllm-project#9885)

0f1221b

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

[Misc] Compute query_start_loc/seq_start_loc on CPU (vllm-project#9447)

2f70c75

Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>

[Bugfix] Fix E2EL mean and median stats (vllm-project#9984)

a95a2ff

Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com>

[Bugfix][OpenVINO] Fix circular reference vllm-project#9939 (vllm-pro…

fe486ec

…ject#9974) Signed-off-by: MengqingCao <cmq0113@163.com>

[Frontend] Multi-Modality Support for Loading Local Image Files (vllm…

a435ea4

…-project#9915) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

jeejeelee and others added 27 commits November 15, 2024 10:34

[Bugfix] Fix fully sharded LoRA bug (vllm-project#10352)

5a62465

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

[Misc] Fix some help info of arg_utils to improve readability (vllm-p…

1be25ac

…roject#10362)

[core][misc] keep compatibility for old-style classes (vllm-project#1…

ffddf91

…0356) Signed-off-by: youkaichao <youkaichao@gmail.com>

[Bugfix] Ensure special tokens are properly filtered out for guided s…

affa3bb

…tructured output with MistralTokenizer (vllm-project#10363) Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>

[Misc] Bump up test_fused_moe tolerance (vllm-project#10364)

e6d15ee

Signed-off-by: ElizaWszola <eliza@neuralmagic.com>

[Misc] bump mistral common version (vllm-project#10367)

b866673

Signed-off-by: simon-mo <simon.mo@hey.com>

[Docs] Add Nebius as sponsors (vllm-project#10371)

4467cd1

Signed-off-by: simon-mo <simon.mo@hey.com>

[Frontend] Add --version flag to CLI (vllm-project#10369)

de1a339

Signed-off-by: Russell Bryant <rbryant@redhat.com>

[Doc] Move PR template content to docs (vllm-project#10159)

b0a608b

Signed-off-by: Russell Bryant <rbryant@redhat.com>

[Docs] Misc updates to TPU installation instructions (vllm-project#10165

5bef6c8

)

[Frontend] Automatic detection of chat content format from AST (vllm-…

42cdb3c

…project#9919) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[doc] add doc for the plugin system (vllm-project#10372)

6d5a548

Signed-off-by: youkaichao <youkaichao@gmail.com>

[misc][plugin] improve log messages (vllm-project#10386)

e7257f4

Signed-off-by: youkaichao <youkaichao@gmail.com>

[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel (vllm-pro…

9a62e9a

…ject#10385) Signed-off-by: Randall Smith <Randall.Smith@amd.com>

[Misc] Update benchmark to support image_url file or http (vllm-proje…

2692313

…ct#10287) Signed-off-by: rbbang <anjaehyun87@gmail.com>

[Misc] Medusa supports custom bias (vllm-project#10361)

fae08af

[Bugfix] Fix M-RoPE position calculation when chunked prefill is enab…

24ec29c

…led (vllm-project#10388) Signed-off-by: imkero <kerorek@outlook.com>

[V1] Add code owners for V1 (vllm-project#10397)

d1bc041

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[2/N][torch.compile] make compilation cfg part of vllm cfg (vllm-proj…

80d031d

…ect#10383) Signed-off-by: youkaichao <youkaichao@gmail.com>

[V1] Refactor model executable interface for all text-only language m…

fb0e946

…odels (vllm-project#10374) Signed-off-by: Roger Wang <ywang@roblox.com>

[CI/Build] Fix IDC hpu [Device not found] issue (vllm-project#10384)

cf37750

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

[Bugfix][CPU] Fix CPU embedding runner with tensor parallel (vllm-pro…

0399523

…ject#10394) Signed-off-by: Isotr0py <2037008807@qq.com>

[platforms] refactor cpu code (vllm-project#10402)

dc08693

Signed-off-by: youkaichao <youkaichao@gmail.com>

[Hardware] [HPU]add mark_step for hpu (vllm-project#10239)

52002dd

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

[Bugfix] Fix mrope_position_delta in non-last prefill chunk (vllm-pro…

287ed74

…ject#10403) Signed-off-by: imkero <kerorek@outlook.com>

[Misc] Enhance offline_inference to support user-configurable paramet… (

c0adeb8

vllm-project#10392) Signed-off-by: wchen61 <wchen61@foxmail.com>

Implemented kvcache transfer (naive send/recv)

83f6707

Jocn2020 reviewed Nov 19, 2024


github-actions bot added the stale label Feb 18, 2025
