
Question about Optimizations of Inference for batch_size = 1 #746

Open
OlinLai opened this issue Feb 24, 2025 · 1 comment
OlinLai commented Feb 24, 2025

Some papers make the following observation:
"However, the following two issues can lead to low GPU utilization. First, the Decode stage of GPT requires frequent sequential computing of a single token. Second, to improve the Quality of Service (QoS) and meet the real-time requirements, the acceleration cluster will not process small batches of user input by waiting for more data, so the batch size is usually set to one. The architecture of the GPU is designed to process batches of data using data parallelism. In this case, the GPU will face insufficient computing intensity."

So, I wonder whether LightLLM has some specific optimizations for batch_size=1 inference.

If so, could you briefly introduce them or point me to the relevant part of the code?

Thanks a lot!
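
To make the quoted "insufficient computing intensity" point concrete, here is a rough back-of-the-envelope sketch. The layer shape and the roughly 100 FLOP/byte compute-bound threshold are illustrative assumptions, not figures from the quoted paper or from LightLLM.

```python
# Rough arithmetic-intensity estimate for one fp16 linear layer at batch size B.
# Shapes and the GPU threshold below are illustrative assumptions.
def arithmetic_intensity(B, N=11008, K=4096, dtype_bytes=2):
    flops = 2 * B * N * K                                 # multiply-adds
    bytes_moved = dtype_bytes * (N * K + B * K + B * N)   # weights + input + output
    return flops / bytes_moved

for B in (1, 8, 64, 512):
    print(B, round(arithmetic_intensity(B), 1))
# At B = 1 the intensity is ~1 FLOP/byte, far below the roughly 100+ FLOPs per
# byte a modern GPU needs to be compute-bound, so the kernel is bandwidth-bound.
```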

hiworldwzj (Collaborator) commented

Currently, we haven't made any special optimizations for batch size 1. That said, I know that replacing GEMM with a custom-optimized GEMV operator, as well as some specialized operator implementations, can accelerate batch-size-1 inference.
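
As an illustration of the GEMM-to-GEMV idea: when the batch size is 1, every linear projection `x @ W.T` degenerates to a matrix-vector product, so a GEMV path can be dispatched instead of GEMM. The sketch below is hypothetical and not LightLLM code; `linear_forward` and the tensor shapes are made up for illustration, and a real implementation would call a hand-tuned GEMV kernel on the fast path.

```python
import torch

def linear_forward(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: [batch, in_features], weight: [out_features, in_features]."""
    if x.shape[0] == 1:
        # Batch-size-1 fast path: a single matrix-vector product (GEMV).
        return torch.mv(weight, x.squeeze(0)).unsqueeze(0)
    # General path: a matrix-matrix product (GEMM).
    return x @ weight.t()

# Example: one decode step with batch size 1.
x = torch.randn(1, 4096)
w = torch.randn(11008, 4096)
print(torch.allclose(linear_forward(x, w), x @ w.t(), rtol=1e-4, atol=1e-3))
```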
