Add range validation #473

Status: Draft. Wants to merge 1 commit into base: main.
44 changes: 42 additions & 2 deletions nomenclature/processor/data_validator.py
@@ -7,7 +7,7 @@
from pandas import concat
from pyam import IamDataFrame
from pyam.logging import adjust_log_level
from pydantic import computed_field, field_validator, model_validator
from pydantic import computed_field, field_validator, model_validator, Field

from nomenclature.definition import DataStructureDefinition
from nomenclature.error import ErrorCollector
@@ -86,9 +86,49 @@ def criteria(self):
)


class DataValidationCriteriaRange(DataValidationCriteria):
range: list[float] = Field(..., min_length=2, max_length=2)

@model_validator(mode="after")
def check_range_is_valid(self):
Contributor:

This validator represents a design choice, in my opinion. We certainly want the range to be ordered, and as I see it there are two options to ensure this:

  1. (implemented here) we check and throw an error if the range is not ordered
  2. We take the input value for range, sort it, and only then save it into the attribute

The same design question arises for the ordering of the severity levels. Currently we throw an error if they are not sorted with the most severe first. Alternatively, we could also take the input, sort it, and then save it.

The question does not need to (and probably should not) be answered here, I just wanted to raise the point to be discussed at some stage. My (slight) personal preference would be to take whatever the user provides, sort it, and then work with it. This way we could potentially reduce the number of frustrated nomenclature users.

Member:

The design-choice on checking the order of severity levels was my suggestion - I think it's better to force users to follow an easy-to-read convention. And it also avoids the ordering as one extra step in the code, improving readability.

Contributor:

> The design-choice on checking the order of severity levels was my suggestion - I think it's better to force users to follow an easy-to-read convention. And it also avoids the ordering as one extra step in the code, improving readability.

Fair point in making that design choice and sticking to it here as well, works for me. You don't really save any code though, since in one case you'd be sorting the inputs and in the other you're checking that they arrive sorted: six of one, half a dozen of the other.
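For reference, the second option discussed above (normalizing the input instead of rejecting it) could look roughly like the following. This is a minimal sketch assuming pydantic v2; the class name is an illustrative stand-in, not the PR's implementation.

```python
from pydantic import BaseModel, Field, field_validator


class RangeCriteria(BaseModel):
    """Illustrative model that sorts the range instead of erroring on it."""

    range: list[float] = Field(..., min_length=2, max_length=2)

    @field_validator("range", mode="after")
    @classmethod
    def sort_range(cls, value: list[float]) -> list[float]:
        # Accept [5, 1] as well as [1, 5]; normalize rather than raise.
        return sorted(value)
```

With this approach the `check_range_is_valid` model validator would become unnecessary, at the cost of silently reordering user input.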

if self.range[0] > self.range[1]:
raise ValueError("Validation range is invalid: " + str(self.criteria))
Contributor:

This error message does not contain enough information for the user to fix the issue:

Suggested change
raise ValueError("Validation range is invalid: " + str(self.criteria))
raise ValueError("Validation range must be ordered with the smaller value first: " + str(self.criteria))

return self

@computed_field
Contributor:

Nice use of the computed_field decorator 👍 To make things more explicit and follow the pydantic docs (https://docs.pydantic.dev/2.10/concepts/fields/#the-computed_field-decorator), I'd add the property decorator as well for this and lower_bound.

def upper_bound(self) -> float:
return self.range[1]

@computed_field
def lower_bound(self) -> float:
return self.range[0]

@property
def validation_args(self):
"""Attributes used for validation (as bounds)."""
return self.model_dump(
exclude_none=True,
exclude_unset=True,
exclude=["warning_level", "range"],
)

@property
def criteria(self):
return self.model_dump(
exclude_none=True,
exclude_unset=True,
exclude=["warning_level", "lower_bound", "upper_bound"],
)
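The reviewer's suggestion of stacking `@property` under `@computed_field` could be sketched as follows. This is a standalone illustration assuming pydantic v2; the class name is hypothetical, and only the decorator pattern mirrors the PR.

```python
from pydantic import BaseModel, Field, computed_field


class BoundsFromRange(BaseModel):
    """Illustrative model deriving explicit bounds from a two-element range."""

    range: list[float] = Field(..., min_length=2, max_length=2)

    @computed_field
    @property
    def lower_bound(self) -> float:
        return self.range[0]

    @computed_field
    @property
    def upper_bound(self) -> float:
        return self.range[1]
```

Note that computed fields are included in `model_dump()` output, which is why the `validation_args` and `criteria` properties above have to exclude them explicitly.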


class DataValidationCriteriaMultiple(IamcDataFilter):
validation: (
list[DataValidationCriteriaValue | DataValidationCriteriaBounds] | None
list[
DataValidationCriteriaValue
Contributor:

I would change this to:

Suggested change
DataValidationCriteriaValue
DataValidationCriteria

| DataValidationCriteriaBounds
Contributor:

Suggested change
| DataValidationCriteriaBounds

| DataValidationCriteriaRange
Contributor:

Suggested change
| DataValidationCriteriaRange

]
| None
) = None

@model_validator(mode="after")
11 changes: 11 additions & 0 deletions tests/data/validation/validate_data/validate_warning_range.yaml
@@ -0,0 +1,11 @@
- variable: Primary Energy
year: 2010
validation:
- range: [ 1, 5 ]
- warning_level: low
upper_bound: 2.5
lower_bound: 1
- variable: Primary Energy|Coal
year: 2010
upper_bound: 5
lower_bound: 1
8 changes: 7 additions & 1 deletion tests/test_validate_data.py
@@ -137,7 +137,7 @@ def test_DataValidator_apply_fails(simple_df, file, item_1, item_2, item_3, capl

@pytest.mark.parametrize(
"file, value",
[("joined", 6.0), ("joined", 3.0), ("legacy", 6.0)],
[("joined", 6.0), ("joined", 3.0), ("legacy", 6.0), ("range", 6.0)],
)
def test_DataValidator_validate_with_warning(file, value, simple_df, caplog):
"""Checks that failed validation rows are printed in log."""
@@ -154,6 +154,7 @@ def test_DataValidator_validate_with_warning(file, value, simple_df, caplog):
0 model_a scen_a World Primary Energy EJ/yr 2010 6.0 error
1 model_a scen_b World Primary Energy EJ/yr 2010 7.0 error"""
)

if file == "legacy":
# prints both error and low warning levels for legacy format
# because these are treated as independent validation-criteria
@@ -164,6 +165,11 @@
0 model_a scen_a World Primary Energy EJ/yr 2010 6.0 low
1 model_a scen_b World Primary Energy EJ/yr 2010 7.0 low"""

if file == "range":
failed_validation_message = failed_validation_message.replace(
"upper_bound: 5.0, lower_bound: 1.0", "range: [1.0, 5.0]"
)

if value == 3.0:
# prints each warning level when each is triggered by different rows
failed_validation_message = """