Underexplaining a crash-causing keyword as a simple speed-increasing optimization #1596

FaaizMemonPurdue opened this issue Feb 2, 2025 · 0 comments

Doc request
When calling the tokenization script listed under [preprocessing](https://huggingface.co/docs/transformers/main/en/tasks/token_classification#preprocess):
```python
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```

Using the argument `batched=True` is not just a speed optimization, as the documentation suggests here:

[Screenshot of the tutorial text describing `batched=True` as a way to speed up preprocessing]

With `batched=False`, this script outright crashes on `label_ids.append(label[word_idx])`, because `label` is then a single integer rather than a list of per-word tags, so it cannot be indexed.
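
For context, here is a minimal sketch of how the two modes differ. The `wnut_17` dataset and DistilBERT checkpoint follow the linked tutorial, but the exact names are only illustrative, and the snippet assumes `tokenize_and_align_labels` from above is already defined:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

wnut = load_dataset("wnut_17")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# batched=True: examples["ner_tags"] is a list of per-example tag lists,
# so `label` is a list and label[word_idx] looks up the tag for one word.
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

# batched=False: examples["ner_tags"] is the flat tag list of a single example,
# so `label` is an int and label[word_idx] raises
# "TypeError: 'int' object is not subscriptable".
# tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=False)
```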
