This repository has been archived by the owner on Oct 25, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 786ee07 · commit 99e0c73
Showing 466 changed files with 9,445 additions and 2,619 deletions.
86 changes: 86 additions & 0 deletions
...nsformers/transformers/kv_cache_compression/models/modeling_llama/index.rst.txt
@@ -0,0 +1,86 @@
intel_extension_for_transformers.transformers.kv_cache_compression.models.modeling_llama
==========================================================================================

.. py:module:: intel_extension_for_transformers.transformers.kv_cache_compression.models.modeling_llama

.. autoapi-nested-parse::

   PyTorch Llama model.


Classes
-------

.. autoapisummary::

   intel_extension_for_transformers.transformers.kv_cache_compression.models.modeling_llama.LlamaAttention
   intel_extension_for_transformers.transformers.kv_cache_compression.models.modeling_llama.LlamaFlashAttention2
   intel_extension_for_transformers.transformers.kv_cache_compression.models.modeling_llama.LlamaSdpaAttention


Functions
---------

.. autoapisummary::

   intel_extension_for_transformers.transformers.kv_cache_compression.models.modeling_llama.apply_rotary_pos_emb


Module Contents
---------------

.. py:function:: apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1)

   Applies Rotary Position Embedding to the query and key tensors.

   :param q: The query tensor.
   :type q: `torch.Tensor`
   :param k: The key tensor.
   :type k: `torch.Tensor`
   :param cos: The cosine part of the rotary embedding.
   :type cos: `torch.Tensor`
   :param sin: The sine part of the rotary embedding.
   :type sin: `torch.Tensor`
   :param position_ids: Deprecated and unused.
   :type position_ids: `torch.Tensor`, *optional*
   :param unsqueeze_dim: Specifies the dimension along which to unsqueeze cos[position_ids] and
      sin[position_ids] so that they can be properly broadcast to the dimensions of q and k. For example,
      if cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim] and q and
      k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
      cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
      the shape [batch_size, seq_len, heads, head_dim], set unsqueeze_dim=2.
   :type unsqueeze_dim: `int`, *optional*, defaults to 1

   :returns: `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
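   Example (a minimal sketch of the standard rotary-embedding application; it mirrors the upstream
   transformers implementation rather than this module's exact code):

   .. code-block:: python

      import torch

      def rotate_half(x):
          # Swap the two halves of the last dimension and negate the second half.
          x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
          return torch.cat((-x2, x1), dim=-1)

      def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
          # cos/sin: [batch_size, seq_len, head_dim]; unsqueeze so they broadcast
          # against q/k of shape [batch_size, heads, seq_len, head_dim].
          cos = cos.unsqueeze(unsqueeze_dim)
          sin = sin.unsqueeze(unsqueeze_dim)
          q_embed = (q * cos) + (rotate_half(q) * sin)
          k_embed = (k * cos) + (rotate_half(k) * sin)
          return q_embed, k_embed

      q = torch.randn(1, 8, 16, 64)   # [batch, heads, seq_len, head_dim]
      k = torch.randn(1, 8, 16, 64)
      cos = torch.randn(1, 16, 64)    # [batch, seq_len, head_dim]
      sin = torch.randn(1, 16, 64)
      q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)  # unsqueeze_dim=1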
.. py:class:: LlamaAttention(config: transformers.models.llama.configuration_llama.LlamaConfig, layer_idx: Optional[int] = None)

   Multi-headed attention from the 'Attention Is All You Need' paper.


.. py:class:: LlamaFlashAttention2(*args, **kwargs)

   Llama flash attention module.

   This module inherits from `LlamaAttention`, as the weights of the module stay
   untouched. The only required change is in the forward pass, which needs to correctly call the public API of
   flash attention and deal with padding tokens in case the input contains any of them.
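   A minimal sketch of such a call, assuming the external ``flash-attn`` package (not part of this
   repository) and unpadded fp16 inputs on a CUDA device:

   .. code-block:: python

      import torch
      from flash_attn import flash_attn_func  # requires the flash-attn package

      # flash_attn_func expects [batch, seq_len, num_heads, head_dim] tensors in fp16/bf16.
      q = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
      k = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
      v = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")

      # Causal self-attention; padded batches would instead be unpadded and routed
      # through the variable-length API with cumulative sequence lengths.
      out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)  # [1, 16, 8, 64]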
.. py:class:: LlamaSdpaAttention(config: transformers.models.llama.configuration_llama.LlamaConfig, layer_idx: Optional[int] = None)

   Llama attention module using torch.nn.functional.scaled_dot_product_attention.

   This module inherits from `LlamaAttention`, as the weights of the module stay untouched.
   The only changes are in the forward pass, to adapt to the SDPA API.
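   A minimal sketch of the SDPA call such a forward pass builds on (a standalone illustration,
   not this repository's implementation):

   .. code-block:: python

      import torch
      import torch.nn.functional as F

      # q, k, v: [batch, num_heads, seq_len, head_dim]
      q = torch.randn(1, 8, 16, 64)
      k = torch.randn(1, 8, 16, 64)
      v = torch.randn(1, 8, 16, 64)

      # Fused softmax(QK^T / sqrt(d)) @ V; PyTorch dispatches to flash or
      # memory-efficient kernels when they are available.
      out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
      print(out.shape)  # torch.Size([1, 8, 16, 64])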
48 changes: 48 additions & 0 deletions
...rs/transformers/modeling/modeling_gaudi/models/bart/modeling_bart/index.rst.txt
@@ -0,0 +1,48 @@
intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.bart.modeling_bart
================================================================================================

.. py:module:: intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.bart.modeling_bart

.. autoapi-nested-parse::

   PyTorch BART model.


Classes
-------

.. autoapisummary::

   intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.bart.modeling_bart.gaudi_BartLearnedPositionalEmbedding


Functions
---------

.. autoapisummary::

   intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.bart.modeling_bart.gaudi_BartAttention_forward


Module Contents
---------------

.. py:class:: gaudi_BartLearnedPositionalEmbedding(num_embeddings: int, embedding_dim: int)

   This module learns positional embeddings up to a fixed maximum size.


   .. py:method:: forward(input_ids: torch.Tensor, past_key_values_length: torch.Tensor = torch.tensor(0))

      `input_ids` shape is expected to be [bsz x seqlen].
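      A minimal sketch of what a learned positional embedding with a past-length offset typically
      computes (illustrative only; the class and attribute names here are hypothetical, and the Gaudi
      variant additionally keeps shapes static for HPU graphs):

      .. code-block:: python

         import torch
         import torch.nn as nn

         class LearnedPositionalEmbedding(nn.Module):
             # BART reserves two extra positions, so indices are shifted by this offset.
             OFFSET = 2

             def __init__(self, num_embeddings: int, embedding_dim: int):
                 super().__init__()
                 self.embed = nn.Embedding(num_embeddings + self.OFFSET, embedding_dim)

             def forward(self, input_ids, past_key_values_length=torch.tensor(0)):
                 bsz, seq_len = input_ids.shape
                 # Positions continue from the tokens already in the KV cache.
                 positions = torch.arange(seq_len, device=input_ids.device)
                 positions = positions.unsqueeze(0).expand(bsz, -1) + past_key_values_length
                 return self.embed(positions + self.OFFSET)

         emb = LearnedPositionalEmbedding(1024, 16)
         out = emb(torch.zeros(2, 5, dtype=torch.long), past_key_values_length=torch.tensor(3))
         print(out.shape)  # torch.Size([2, 5, 16])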
.. py:function:: gaudi_BartAttention_forward(self, hidden_states: torch.Tensor, key_value_states: Optional[torch.Tensor] = None, past_key_value: Optional[Tuple[torch.Tensor]] = None, attention_mask: Optional[torch.Tensor] = None, layer_head_mask: Optional[torch.Tensor] = None, output_attentions: bool = False, token_idx: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]

   Input shape: Batch x Time x Channel
15 changes: 15 additions & 0 deletions
...transformers/modeling/modeling_gaudi/models/llama/pos_shift_llama/index.rst.txt
@@ -0,0 +1,15 @@
intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.llama.pos_shift_llama
====================================================================================================

.. py:module:: intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.llama.pos_shift_llama

.. autoapi-nested-parse::

   Adapted from https://github.com/tomaarsen/attention_sinks

   Note (to accelerate inference with HPU graphs in v1.15.1):

   1. Avoid data-dependent dynamic control flow.
   2. Avoid updating tensors through in-place views (a[:, idx] = c); see the sketch below.
   3. Keep all shapes static.
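   A minimal illustration of point 2, with a pre-allocated fixed-length key/value cache and a
   position tensor (the names ``kv_cache`` and ``token_idx`` are hypothetical, not from this module):
   instead of writing through a sliced view, the cache is updated with ``index_copy_``, which keeps
   every shape static.

   .. code-block:: python

      import torch

      max_len, heads, head_dim = 128, 8, 64
      kv_cache = torch.zeros(1, heads, max_len, head_dim)  # fixed (static) size
      new_k = torch.randn(1, heads, 1, head_dim)            # key for the current step
      token_idx = torch.tensor([5])                          # write position kept as a tensor

      # Avoid: kv_cache[:, :, token_idx] = new_k   (in-place update through a view)
      kv_cache.index_copy_(2, token_idx, new_k)              # static-shape friendly update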
41 changes: 41 additions & 0 deletions
...nsformers/modeling/modeling_gaudi/models/mistral/modeling_mistral/index.rst.txt
@@ -0,0 +1,41 @@
intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.mistral.modeling_mistral
======================================================================================================

.. py:module:: intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.mistral.modeling_mistral

.. autoapi-nested-parse::

   PyTorch Mistral model.


Functions
---------

.. autoapisummary::

   intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.mistral.modeling_mistral.gaudi_mistral_rmsnorm_forward
   intel_extension_for_transformers.transformers.modeling.modeling_gaudi.models.mistral.modeling_mistral.gaudi_mistral_repeat_kv


Module Contents
---------------

.. py:function:: gaudi_mistral_rmsnorm_forward(self, hidden_states)

   The only difference is:

   - RMSNorm is overridden with the Habana fused RMSNorm kernel (see the sketch below).
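   A minimal sketch of the reference RMSNorm computation that the fused Habana kernel replaces
   (illustrative only; the fused op itself lives in the Habana software stack):

   .. code-block:: python

      import torch

      def rmsnorm_reference(hidden_states, weight, eps=1e-6):
          # Scale by the reciprocal root-mean-square of the last dimension.
          dtype = hidden_states.dtype
          hidden_states = hidden_states.to(torch.float32)
          variance = hidden_states.pow(2).mean(-1, keepdim=True)
          hidden_states = hidden_states * torch.rsqrt(variance + eps)
          return weight * hidden_states.to(dtype)

      x = torch.randn(2, 4, 16)
      w = torch.ones(16)
      print(rmsnorm_reference(x, w).shape)  # torch.Size([2, 4, 16])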
.. py:function:: gaudi_mistral_repeat_kv(query_states: torch.Tensor, key_states: torch.Tensor, value_states: torch.Tensor, attention_mask: torch.Tensor, n_rep: int)

   The only differences are:

   - A num_key_value_heads == 1 check is appended, since kv states can be broadcast during
     matmuls and therefore need to be expanded and reshaped.
   - New args query_states, key_states, value_states and attention_mask are added, and the expansion
     logic is updated (see the sketch below).
     The query states go from (batch, num_heads, seqlen, head_dim) to
     (batch, num_key_value_heads, n_rep, seqlen, head_dim).
     The key/value states go from (batch, num_key_value_heads, seqlen, head_dim) to
     (batch, num_key_value_heads, 1, seqlen, head_dim).
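   A minimal sketch of that reshaping, with made-up sizes (illustrative only; the singleton kv
   dimension then broadcasts against n_rep during the matmul):

   .. code-block:: python

      import torch

      batch, num_kv_heads, n_rep, seqlen, head_dim = 1, 2, 4, 16, 64
      num_heads = num_kv_heads * n_rep

      query_states = torch.randn(batch, num_heads, seqlen, head_dim)
      key_states = torch.randn(batch, num_kv_heads, seqlen, head_dim)

      # Group the query heads by their kv head instead of repeating the kv states.
      q = query_states.reshape(batch, num_kv_heads, n_rep, seqlen, head_dim)
      k = key_states.unsqueeze(2)  # (batch, num_kv_heads, 1, seqlen, head_dim)

      # The singleton dimension broadcasts against n_rep.
      attn_weights = torch.matmul(q, k.transpose(-2, -1))
      print(attn_weights.shape)  # torch.Size([1, 2, 4, 16, 16])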