Underexplaining a crash-causing keyword as a simple speed-increasing optimization #1596

FaaizMemonPurdue opened this issue Feb 2, 2025 · 0 comments

Doc request
When calling the tokenization script listed under [preprocessing](https://huggingface.co/docs/transformers/main/en/tasks/token_classification#preprocess):
```python
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```

Using the argument `batched=True` is not just a speed optimization, as the documentation suggests here:

[Screenshot of the tutorial text describing `batched=True` as a way to speed up preprocessing]

With `batched=False`, this script outright crashes on `label_ids.append(label[word_idx])`, because `label` is then a single integer rather than a list of per-word tags, so it cannot be indexed.
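
For context, here is a minimal sketch of how the two modes differ. The `wnut_17` dataset and DistilBERT checkpoint follow the linked tutorial, but the exact names are only illustrative, and the snippet assumes `tokenize_and_align_labels` from above is already defined:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

wnut = load_dataset("wnut_17")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# batched=True: examples["ner_tags"] is a list of per-example tag lists,
# so `label` is a list and label[word_idx] looks up the tag for one word.
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

# batched=False: examples["ner_tags"] is the flat tag list of a single example,
# so `label` is an int and label[word_idx] raises
# "TypeError: 'int' object is not subscriptable".
# tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=False)
```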
