
AlephBERT encoding for a batch of sentence pairs raises IndexError: index out of range in self #5

Open
Abigail-gs opened this issue Nov 17, 2022 · 1 comment


@Abigail-gs

I’m trying to get AlephBERT encodings for sentence pairs.
I managed to get the tokens for the batch of sentences, but when I feed them to AlephBERT I get:
IndexError: index out of range in self

For a single pair of sentences, the AlephBERT model does accept the tokens, but for lists of sentence pairs I get this error. I don’t understand where the problem is. :-(
Here is my code:

import torch
from transformers import AutoModel, AutoTokenizer

alephbert_tokenizer = AutoTokenizer.from_pretrained('onlplab/alephbert-base')
alephbert = AutoModel.from_pretrained('onlplab/alephbert-base')

sentence_a = ['שני דגים שחו בים','שניהם צללו לים','כמה נשארו?','שלושה דגים שחו בנהר', 'אחד צלל לים'] 
sentence_b = ['אחת שתיים שלוש','בקול גדול גדול','רק ישבו ופטפטו אחת שתיים שלוש','מאחורי ההר ישבו שלושה גמדים', 'לא אכלו ולא שתו אחת שתיים שלוש'] 

tokens = alephbert_tokenizer(sentence_a, sentence_b, 
                                padding = 'max_length', 
                                truncation = True,
                                max_length = 512, 
                                return_tensors = 'pt')

batch_input_ids = tokens['input_ids']
batch_token_type_ids = tokens['token_type_ids']
batch_attention_mask = tokens['attention_mask']

alephbert(batch_input_ids, batch_attention_mask, batch_token_type_ids)
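For what it's worth, the model can also be called by unpacking the tokenizer output, which passes every tensor by keyword and rules out any positional-ordering mistake (a minimal sketch, assuming the tokens batch built above is in scope; this is not the call I actually ran):

# Sketch: pass the BatchEncoding by keyword instead of positionally.
# Its keys ('input_ids', 'token_type_ids', 'attention_mask') match the
# argument names of BertModel.forward, so ** unpacking maps them correctly.
outputs = alephbert(**tokens)
print(outputs.last_hidden_state.shape)  # expected (5, 512, hidden_size) for this batch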

Here is the traceback:

IndexError                                Traceback (most recent call last)
Input In [76], in <cell line: 1>()
----> 1 alephbert(batch['input_ids'], batch['attention_mask'], batch['token_type_ids'])

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\torch\nn\modules\module.py:889, in Module._call_impl(self, *input, **kwargs)
    887     result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889     result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891         _global_forward_hooks.values(),
    892         self._forward_hooks.values()):
    893     hook_result = hook(self, input, result)

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\transformers\models\bert\modeling_bert.py:1010, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1003 # Prepare head mask if needed
   1004 # 1.0 in head_mask indicate we keep the head
   1005 # attention_probs has shape bsz x n_heads x N x N
   1006 # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
   1007 # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
   1008 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
-> 1010 embedding_output = self.embeddings(
   1011     input_ids=input_ids,
   1012     position_ids=position_ids,
   1013     token_type_ids=token_type_ids,
   1014     inputs_embeds=inputs_embeds,
   1015     past_key_values_length=past_key_values_length,
   1016 )
   1017 encoder_outputs = self.encoder(
   1018     embedding_output,
   1019     attention_mask=extended_attention_mask,
   (...)
   1027     return_dict=return_dict,
   1028 )
   1029 sequence_output = encoder_outputs[0]

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\torch\nn\modules\module.py:889, in Module._call_impl(self, *input, **kwargs)
    887     result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889     result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891         _global_forward_hooks.values(),
    892         self._forward_hooks.values()):
    893     hook_result = hook(self, input, result)

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\transformers\models\bert\modeling_bert.py:236, in BertEmbeddings.forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    234 if inputs_embeds is None:
    235     inputs_embeds = self.word_embeddings(input_ids)
--> 236 token_type_embeddings = self.token_type_embeddings(token_type_ids)
    238 embeddings = inputs_embeds + token_type_embeddings
    239 if self.position_embedding_type == "absolute":

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\torch\nn\modules\module.py:889, in Module._call_impl(self, *input, **kwargs)
    887     result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889     result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891         _global_forward_hooks.values(),
    892         self._forward_hooks.values()):
    893     hook_result = hook(self, input, result)

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\torch\nn\modules\sparse.py:156, in Embedding.forward(self, input)
    155 def forward(self, input: Tensor) -> Tensor:
--> 156     return F.embedding(
    157         input, self.weight, self.padding_idx, self.max_norm,
    158         self.norm_type, self.scale_grad_by_freq, self.sparse)

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\torch\nn\functional.py:1916, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1910     # Note [embedding_renorm set_grad_enabled]
   1911     # XXX: equivalent to
   1912     # with torch.no_grad():
   1913     #   torch.embedding_renorm_
   1914     # remove once script supports set_grad_enabled
   1915     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self
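As far as I understand, this IndexError from torch.embedding means that one of the id tensors contains a value larger than the corresponding embedding table, and here it is raised from the token_type_embeddings lookup. A quick diagnostic (a sketch, assuming the tokens batch from above is still in scope) is to compare the maxima against the model config:

# Sketch: compare the largest ids in the batch with the embedding table sizes.
print(tokens['input_ids'].max().item(), alephbert.config.vocab_size)
print(tokens['token_type_ids'].max().item(), alephbert.config.type_vocab_size)
print(tokens['input_ids'].shape)  # should be (5, 512) for this batch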

For reference, here is the working example for a single pair:

sentence_a1 = 'אני אוהב שוקולד, ועוגות גבינה'
sentence_b1 = 'וארטיק וסוכריות ותות גינה'

single_s_tok = alephbert_tokenizer(sentence_a1, sentence_b1, is_split_into_words=True)
input_ids = torch.tensor(single_s_tok['input_ids']).unsqueeze(0)
attention_mask = torch.tensor(single_s_tok['attention_mask']).unsqueeze(0)
token_type_ids = torch.tensor(single_s_tok['token_type_ids']).unsqueeze(0)
alephbert(input_ids, attention_mask, token_type_ids)
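One difference between the two snippets: is_split_into_words is meant for input that is already split into lists of words, while the batched call above passes raw strings. A single-pair variant that builds the tensors the same way as the batched call might look like this (a sketch, single_tokens is just a name I made up, not code I actually ran):

# Sketch: let the tokenizer produce batched tensors for one pair directly,
# mirroring the batched call instead of is_split_into_words + manual unsqueeze.
single_tokens = alephbert_tokenizer(sentence_a1, sentence_b1, return_tensors='pt')
alephbert(**single_tokens)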

I have been stuck on this for some time now :-(, any help is very much appreciated.

@NoySal

NoySal commented Aug 7, 2024

Abigail, has your issue been solved?
I'm stuck on the same issue and would love some tips...
