LLM inference with Intel AMX #664
Replies: 3 comments
- We support it on AWS instances.
- Hi, and thank you for the response! Compared to the code in the LLM Runtime, I removed the line that quantizes the model. Am I required to quantize the model?
- I'm afraid so. We only enable AMX with quantized weights (unquantized weights benefit little because of the intensive runtime conversion). In fact, you are not even running our optimized LLM Runtime if …
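For reference, a minimal sketch of the quantized path through the LLM Runtime (the model name below is only an example; `load_in_4bit=True` is the documented way to request weight-only quantization when loading):

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # example model; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit=True quantizes the weights on load; without it, the optimized
# LLM Runtime (and its AMX kernels) is not used.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=32)
```

Without the weight-only quantized load, generation falls back to the stock transformers path, which would explain the similar timings reported below.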
- I would like to perform model inference using the capabilities of Intel AMX. I ran inference on Llama-2 using the code provided in the LLM Runtime section, and I noticed that the inference times with the repository code, which uses the `intel_extension_for_transformers.transformers` library, are similar to those with the base transformers library.
Therefore, I was wondering whether I need to enable Intel AMX on my machine somehow. I am currently using the m7i.2xlarge and m7i.4xlarge instances on AWS. Any suggestions? Thank you!
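One way to confirm that AMX is exposed on the instance is to look at the CPU flags in /proc/cpuinfo; a minimal sketch (the amx_tile, amx_bf16, and amx_int8 flag names are what Linux reports on 4th-gen Xeon, e.g. the m7i family):

```python
# Sanity check: list the AMX-related CPU flags the kernel exposes.
def amx_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return sorted(fl for fl in flags if fl.startswith("amx"))
    return []

print(amx_flags())  # expected on m7i: ['amx_bf16', 'amx_int8', 'amx_tile']
```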