
Question about Optimizations of Inference for batch_size = 1 #746

Open
OlinLai opened this issue Feb 24, 2025 · 1 comment
OlinLai commented Feb 24, 2025

Some papers make the following observation:
"However, the following two issues can lead to low GPU utilization. First, the Decode stage of GPT requires frequent sequential computing of a single token. Second, to improve the Quality of Service (QoS) and meet the real-time requirements, the acceleration cluster will not process small batches of user input by waiting for more data, so the batch size is usually set to one. The architecture of the GPU is designed to process batches of data using data parallelism. In this case, the GPU will face insufficient computing intensity."

So, I wonder whether LightLLM has some specific optimizations for batch_size=1 inference.

If so, could you briefly introduce them or point me to the relevant part of the code?

Thanks a lot!
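
To make the quoted "insufficient computing intensity" point concrete, here is a rough back-of-the-envelope sketch. The layer shape and the roughly 100 FLOP/byte compute-bound threshold are illustrative assumptions, not figures from the quoted paper or from LightLLM.

```python
# Rough arithmetic-intensity estimate for one fp16 linear layer at batch size B.
# Shapes and the GPU threshold below are illustrative assumptions.
def arithmetic_intensity(B, N=11008, K=4096, dtype_bytes=2):
    flops = 2 * B * N * K                                 # multiply-adds
    bytes_moved = dtype_bytes * (N * K + B * K + B * N)   # weights + input + output
    return flops / bytes_moved

for B in (1, 8, 64, 512):
    print(B, round(arithmetic_intensity(B), 1))
# At B = 1 the intensity is ~1 FLOP/byte, far below the roughly 100+ FLOPs per
# byte a modern GPU needs to be compute-bound, so the kernel is bandwidth-bound.
```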

hiworldwzj (Collaborator) commented

Currently, we haven't made any special optimizations for batch size 1. That said, I know that replacing GEMM with a custom-optimized GEMV operator, as well as some specialized operator implementations, can accelerate batch-size-1 inference.
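
As an illustration of the GEMM-to-GEMV idea: when the batch size is 1, every linear projection `x @ W.T` degenerates to a matrix-vector product, so a GEMV path can be dispatched instead of GEMM. The sketch below is hypothetical and not LightLLM code; `linear_forward` and the tensor shapes are made up for illustration, and a real implementation would call a hand-tuned GEMV kernel on the fast path.

```python
import torch

def linear_forward(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: [batch, in_features], weight: [out_features, in_features]."""
    if x.shape[0] == 1:
        # Batch-size-1 fast path: a single matrix-vector product (GEMV).
        return torch.mv(weight, x.squeeze(0)).unsqueeze(0)
    # General path: a matrix-matrix product (GEMM).
    return x @ weight.t()

# Example: one decode step with batch size 1.
x = torch.randn(1, 4096)
w = torch.randn(11008, 4096)
print(torch.allclose(linear_forward(x, w), x @ w.t(), rtol=1e-4, atol=1e-3))
```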
