Some papers describe this perspective, for example:
"However, the following two issues can lead to low GPU utilization. First, the Decode stage of GPT requires frequent sequential computing of a single token. Second, to improve the Quality of Service (QoS) and meet the real-time requirements, the acceleration cluster will not process small batches of user input by waiting for more data, so the batch size is usually set to one. The architecture of GPU is designed to process batches of data using data parallelism. In this case, the GPU will face insufficient computing intensity."
So I wonder whether LightLLM has any specific optimizations for batch_size=1 inference.
If so, could you briefly describe them or point me to the relevant part of the code?
Thanks a lot!
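To make the "insufficient computing intensity" claim from the quoted passage concrete, here is a back-of-envelope arithmetic-intensity calculation for a single batch-1 decode-stage linear layer. The layer sizes are hypothetical illustration values, not taken from the paper or from LightLLM:

```python
# Back-of-envelope estimate (illustrative sizes, not from the paper):
# arithmetic intensity of one batch-1 linear layer during decode.
hidden, out_dim = 4096, 11008                # hypothetical layer dimensions

flops = 2 * hidden * out_dim                 # one multiply + one add per weight
bytes_moved = 4 * (hidden * out_dim + hidden + out_dim)  # fp32 weights + in/out vectors

intensity = flops / bytes_moved              # FLOPs per byte of memory traffic
print(f"{intensity:.2f} FLOP/byte")          # ~0.5 FLOP/byte
```

At roughly 0.5 FLOP per byte, the operation sits far below the compute/bandwidth ridge point of modern GPUs (typically tens of FLOPs per byte), so batch-1 decode is memory-bandwidth-bound and the compute units are largely idle.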
Currently, we haven't made any special optimizations for batch size 1. There are known techniques that could help, though: for example, replacing GEMM with a custom-optimized GEMV operator, along with some other specialized operator implementations, can accelerate batch-1 inference.
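For clarity, here is a minimal NumPy sketch (illustrative only, not LightLLM code) of the GEMM-to-GEMV equivalence mentioned above: with batch size 1, the linear layer's general matrix-matrix product degenerates into a matrix-vector product, which a dedicated GEMV kernel can serve without GEMM's tiling overhead. The dimensions are hypothetical:

```python
import numpy as np

# With batch size 1, the decode-stage linear layer x @ W.T (shape (1, h))
# is mathematically a matrix-vector product W @ x[0].
hidden, out_dim = 1024, 2048                 # hypothetical layer sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((out_dim, hidden))   # weight matrix
x = rng.standard_normal((1, hidden))         # a single decoded token's activation

y_gemm = x @ W.T                             # general GEMM path, shape (1, out_dim)
y_gemv = W @ x[0]                            # equivalent GEMV, shape (out_dim,)

assert np.allclose(y_gemm[0], y_gemv)
```

On GPU, a hand-tuned GEMV kernel can exploit this shape directly (one thread block per output chunk, streaming the weight row once), whereas a generic GEMM kernel pays for tiling logic sized for large batches.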