
[Feature Request] Overlapping Ratio Threshold support #81

Open
ain-soph opened this issue Sep 17, 2024 · 3 comments



ain-soph commented Sep 17, 2024

Overlapping Ratio

Currently, the result of find_overlap is treated as True whenever there is any overlap at all, however small:

if find_overlap(true_range, pred_range) and true not in true_which_overlapped_with_pred:

def find_overlap(true_range: range, pred_range: range) -> set:
    """Find the overlap between two ranges.

    Return the overlapping values if present, else return an empty set().

    Examples:
    >>> find_overlap((1, 2), (2, 3))
    {2}
    >>> find_overlap((1, 2), (3, 4))
    set()
    """
    true_set = set(true_range)
    pred_set = set(pred_range)
    overlaps = true_set.intersection(pred_set)
    return overlaps

However, in most cases an overlapping-ratio threshold would be preferable, something like this:

pred = {'start': 10, 'end': 15}
label = {'start': 12, 'end': 18}

# calculate union and intersection
union = {'start': 10, 'end': 18}
intersection = {'start': 12, 'end': 15}

# calculate ratio
ratio = (15 - 12) / (18 - 10)

return ratio > threshold

The current find_overlap materializes both ranges as sets to find the overlap, which seems time-inefficient. The overlap can be obtained directly from the start and end values.

Here's my implementation:

def find_overlap(self, true: dict[str, int | str], pred: dict[str, int | str]) -> bool:
    # intersection: latest start to earliest end
    start_max = max(true['start'], pred['start'])
    end_min = min(true['end'], pred['end'])
    if start_max >= end_min:
        return False  # no overlap at all
    # union: earliest start to latest end
    start_min = min(true['start'], pred['start'])
    end_max = max(true['end'], pred['end'])
    # intersection-over-union of the two spans
    overlap_ratio = (end_min - start_max) / (end_max - start_min)
    return overlap_ratio > self.overlap_ratio_threshold
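
For instance, with the numbers from the example above, a 0.5 threshold would reject a match that the current any-overlap check accepts:

>>> # intersection / union for pred = (10, 15), label = (12, 18)
>>> (min(15, 18) - max(10, 12)) / (max(15, 18) - min(10, 12))
0.375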

Last Character excluded

I wonder why we include the last token, which is very counter-intuitive. This comes from #32. Maybe @aflueckiger could provide an explanation? Does the end offset in your data include the last character?

I think for most data, the start and end are offsets into the original text string, i.e. text[start:end], which means the last character is excluded. text[1:3] and text[3:5] don't overlap at all.

# overlapping needs to take into account last token as well
pred_range = range(pred["start"], pred["end"] + 1)
true_range = range(true["start"], true["end"] + 1)
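
For illustration, a quick check in plain Python (nothing repo-specific) shows that exclusive end offsets keep the two spans above disjoint, while the +1 padding makes them spuriously overlap:

>>> text = "abcdefgh"
>>> text[1:3], text[3:5]
('bc', 'de')
>>> set(range(1, 3)) & set(range(3, 5))          # exclusive ends: no overlap
set()
>>> set(range(1, 3 + 1)) & set(range(3, 5 + 1))  # with the +1 padding: spurious overlap
{3}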

Any support for Hugging Face Evaluate?

Would the maintainers consider adopting the Hugging Face Evaluate standard? That would mean inheriting from evaluate.Metric and pushing the metric to the Hugging Face Hub; users could then call metric = evaluate.load('{hub_url}') directly.

Example: https://huggingface.co/spaces/evaluate-metric/glue/blob/main/glue.py
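
A minimal sketch of what that could look like, modeled on the GLUE example (the class name, feature schema, and hardcoded tags are my assumptions, not an existing module):

import datasets
import evaluate

class Nervaluate(evaluate.Metric):
    """Hypothetical evaluate.Metric wrapper around this library."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Entity-level NER evaluation with partial-match support.",
            citation="",
            features=datasets.Features({
                "predictions": datasets.Sequence({
                    "label": datasets.Value("string"),
                    "start": datasets.Value("int32"),
                    "end": datasets.Value("int32"),
                }),
                "references": datasets.Sequence({
                    "label": datasets.Value("string"),
                    "start": datasets.Value("int32"),
                    "end": datasets.Value("int32"),
                }),
            }),
        )

    def _compute(self, predictions, references):
        from nervaluate import Evaluator
        # Tags are hardcoded for the sketch; a real module would take them as a parameter.
        evaluator = Evaluator(references, predictions, tags=["ORG"])
        results = evaluator.evaluate()[0]  # overall results dict; exact return shape depends on the version
        return results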

aflueckiger commented Sep 21, 2024

The original code as well as my additions are indeed confusing. This boils down to the CoNLL format, which is based on token offsets instead of character offsets, if I see it correctly.

Since we have token indices, we have to add one to take the last token into account.

>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']

>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']

It is baffling that there is still no up-to-date, non-hacky NER evaluation procedure. A Hugging Face implementation that gets rid of historic assumptions is absolutely desirable.
I also think that an overlap ratio is a good idea.

Yet I don't have the capacity for an HF implementation, since NER has not been my focus recently.

What is the plan with this repo, @ivyleavedtoadflax @davidsbatista?


ain-soph commented Sep 24, 2024

@aflueckiger thanks for your positive feedback.
But I'm still confused about the "last token included" part.

["Test", "my", "DOUBLE", "ORG"]  # len(4)

when we want to get the ["DOUBLE", "ORG"], it shall be [2:4] for sure. The entity shall be {'start': 2, 'end': 4}.
I don't see any reason to be the case of [2:3]

You mean in your data, the entity is recorded as {'start': 2, 'end': 3}? I think that data record format is strange to me.


aflueckiger commented Sep 24, 2024

Unless I am totally confused, the token indices are not saved as part of the dataset. They are simply an enumeration of the tokens as the text is tokenized in the original format (see here for an example).
Thus, we have to add +1 to the end indices; otherwise we end up with empty spans for single-word entities.

# Token IDs
# 0: Test
# 1: my
# 2: DOUBLE
# 3: ORG

>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']

>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']

I hope it is clearer now. I fully agree that the code is confusing and needs a complete revamp though.
