LLM inference with Intel AMX #664
Replies: 3 comments
- We support it on AWS instances.
- Hi, and thank you for the response! Compared to the code in the LLM Runtime, I removed the line that quantizes the model. Am I required to quantize the model?
- I'm afraid so. We only enable AMX with quantized weights (unquantized weights benefit little because of the intensive runtime conversion). In fact, you are not even running our optimized LLM Runtime if …
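For reference, a minimal sketch of the quantized path through the LLM Runtime (the model name below is only an example; `load_in_4bit=True` is the documented way to request weight-only quantization when loading):

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # example model; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit=True quantizes the weights on load; without it, the optimized
# LLM Runtime (and its AMX kernels) is not used.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=32)
```

Without the weight-only quantized load, generation falls back to the stock transformers path, which would explain the similar timings reported below.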
- I would like to perform model inference using the capabilities of Intel AMX. I ran inference on Llama-2 using the code provided in the LLM Runtime section, and I noticed that the inference times with the repository code, which uses the `intel_extension_for_transformers.transformers` library, are similar to those with the base transformers library.
Therefore, I was wondering whether I need to enable Intel AMX on my machine somehow. I am currently using the m7i.2xlarge and m7i.4xlarge instances on AWS. Any suggestions? Thank you!
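One way to confirm that AMX is exposed on the instance is to look at the CPU flags in /proc/cpuinfo; a minimal sketch (the amx_tile, amx_bf16, and amx_int8 flag names are what Linux reports on 4th-gen Xeon, e.g. the m7i family):

```python
# Sanity check: list the AMX-related CPU flags the kernel exposes.
def amx_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return sorted(fl for fl in flags if fl.startswith("amx"))
    return []

print(amx_flags())  # expected on m7i: ['amx_bf16', 'amx_int8', 'amx_tile']
```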