
[Feature Request] Overlapping Ratio Threshold support #81

Open
ain-soph opened this issue Sep 17, 2024 · 3 comments



ain-soph commented Sep 17, 2024

Overlapping Ratio

Currently, the result of find_overlap is treated as True whenever there is any overlap at all, however small:

if find_overlap(true_range, pred_range) and true not in true_which_overlapped_with_pred:

def find_overlap(true_range: range, pred_range: range) -> set:
    """Find the overlap between two ranges.

    Return the overlapping values if present, else return an empty set().

    Examples:
    >>> find_overlap((1, 2), (2, 3))
    {2}
    >>> find_overlap((1, 2), (3, 4))
    set()
    """
    true_set = set(true_range)
    pred_set = set(pred_range)
    overlaps = true_set.intersection(pred_set)
    return overlaps

However, in most cases an overlapping-ratio threshold would be preferable, something like this:

pred = {'start': 10, 'end': 15}
label = {'start': 12, 'end': 18}

# calculate union and intersection
union = {'start': 10, 'end': 18}
intersection = {'start': 12, 'end': 15}

# calculate ratio
ratio = (15 - 12) / (18 - 10)

return ratio > threshold

The current find_overlap materializes both ranges as sets to find the overlap, which seems time-inefficient. The overlap can be obtained directly from the start and end values.

Here's my implementation:

def find_overlap(self, true: dict[str, int | str], pred: dict[str, int | str]) -> bool:
    # intersection: latest start to earliest end
    start_max = max(true['start'], pred['start'])
    end_min = min(true['end'], pred['end'])
    if start_max >= end_min:
        return False  # no overlap at all
    # union: earliest start to latest end
    start_min = min(true['start'], pred['start'])
    end_max = max(true['end'], pred['end'])
    # intersection-over-union of the two spans
    overlap_ratio = (end_min - start_max) / (end_max - start_min)
    return overlap_ratio > self.overlap_ratio_threshold
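
For instance, with the numbers from the example above, a 0.5 threshold would reject a match that the current any-overlap check accepts:

>>> # intersection / union for pred = (10, 15), label = (12, 18)
>>> (min(15, 18) - max(10, 12)) / (max(15, 18) - min(10, 12))
0.375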

Last Character excluded

I wonder why we include the last token, which is very counter-intuitive. This comes from #32. Maybe @aflueckiger could provide an explanation? Does the end offset in your data include the last character?

I think for most data, the start and end are offsets into the original text string, i.e. text[start:end], which means the last character is excluded. text[1:3] and text[3:5] don't overlap at all.

# overlapping needs to take into account last token as well
pred_range = range(pred["start"], pred["end"] + 1)
true_range = range(true["start"], true["end"] + 1)
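
For illustration, a quick check in plain Python (nothing repo-specific) shows that exclusive end offsets keep the two spans above disjoint, while the +1 padding makes them spuriously overlap:

>>> text = "abcdefgh"
>>> text[1:3], text[3:5]
('bc', 'de')
>>> set(range(1, 3)) & set(range(3, 5))          # exclusive ends: no overlap
set()
>>> set(range(1, 3 + 1)) & set(range(3, 5 + 1))  # with the +1 padding: spurious overlap
{3}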

Any support for Hugging Face Evaluate?

Would the maintainers consider adopting the Hugging Face Evaluate standard? That would mean inheriting from evaluate.Metric and pushing the metric to the Hugging Face Hub; users could then call metric = evaluate.load('{hub_url}') directly.

Example: https://huggingface.co/spaces/evaluate-metric/glue/blob/main/glue.py
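
A minimal sketch of what that could look like, modeled on the GLUE example (the class name, feature schema, and hardcoded tags are my assumptions, not an existing module):

import datasets
import evaluate

class Nervaluate(evaluate.Metric):
    """Hypothetical evaluate.Metric wrapper around this library."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Entity-level NER evaluation with partial-match support.",
            citation="",
            features=datasets.Features({
                "predictions": datasets.Sequence({
                    "label": datasets.Value("string"),
                    "start": datasets.Value("int32"),
                    "end": datasets.Value("int32"),
                }),
                "references": datasets.Sequence({
                    "label": datasets.Value("string"),
                    "start": datasets.Value("int32"),
                    "end": datasets.Value("int32"),
                }),
            }),
        )

    def _compute(self, predictions, references):
        from nervaluate import Evaluator
        # Tags are hardcoded for the sketch; a real module would take them as a parameter.
        evaluator = Evaluator(references, predictions, tags=["ORG"])
        results = evaluator.evaluate()[0]  # overall results dict; exact return shape depends on the version
        return results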

aflueckiger commented Sep 21, 2024

The original code as well as my additions are indeed confusing. This boils down to the CoNLL format, which is based on token offsets instead of character offsets, if I see it correctly.

Since we have token indices, we have to add one to take the last token into account.

>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']

>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']

It is baffling that there is still no up-to-date, non-hacky NER evaluation procedure. A Hugging Face implementation that gets rid of historic assumptions is absolutely desirable.
I also think that an overlap ratio is a good idea.

Yet I don't have the capacity for an HF implementation, since NER has not been my focus recently.

What is the plan with this repo, @ivyleavedtoadflax @davidsbatista?


ain-soph commented Sep 24, 2024

@aflueckiger thanks for your positive feedback.
But I'm still confused about the "last token included" part.

["Test", "my", "DOUBLE", "ORG"]  # len(4)

when we want to get the ["DOUBLE", "ORG"], it shall be [2:4] for sure. The entity shall be {'start': 2, 'end': 4}.
I don't see any reason to be the case of [2:3]

You mean in your data, the entity is recorded as {'start': 2, 'end': 3}? I think that data record format is strange to me.


aflueckiger commented Sep 24, 2024

Unless I am totally confused, the token indices are not saved as part of the dataset. They are simply an enumeration of the tokens as the text is tokenized in the original format (see here for an example).
Thus, we have to add +1 to the end indices; otherwise we end up with empty spans for single-word entities.

# Token IDs
# 0: Test
# 1: my
# 2: DOUBLE
# 3: ORG

>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']

>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']

I hope it is clearer now. I fully agree that the code is confusing and needs a complete revamp though.
