Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: parallel file validation for a datapackage #1722

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

pierrecamilleri
Copy link
Collaborator

@pierrecamilleri pierrecamilleri commented Dec 13, 2024

Context

  • In validator.py, Resource was imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor).
  • The tests would not intercept this error, as they were performed on a schema with a foreign key, and the validation would fall back on synchronous validation in this case.
  • Importing Resource (not as a type) into validator.py leads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.

Suggested changes and fixes

  • I repaired the tests, using datapackages with no foreign keys to test parallel validation. I checked that these tests would have failed because with the bug.
  • I moved the implementation of Validator.validate_resource and Validator.validate_package methods to the Resource.validate and Package.validate methods. This solves the circular import problem, and removes duplication (method signatures).
  • The tests have been dispatched as well.
  • I soft deprecated the Validator class, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.

Note

I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).

Archive of explorations
  • Test that it works correctly
    • Works without error, however, I could not produce any significant performance gain in my tests.
  • Check why the tests did not intercept this error
    • The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the validate_parallel function was actually never executed in the tests.
  • See if I can find a more elegant way to deal with the circular import. I actually don't see the benefit of having this validator.py class instead of dispatching the methods to Resource and DataPackage directly, but I may have missed something.

@pierrecamilleri pierrecamilleri marked this pull request as ready for review January 27, 2025 14:14
Copy link
Contributor

@pdelboca pdelboca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @pierrecamilleri !

Just one small comment but feel free to merge :)

self, time=timer.time, errors=exception.to_errors()
)

# TODO: remove in next version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually like to add more explicit messages for future maintainers.

Something like:

  • Remove in X.Y.Z when foo gets deprecated
  • Remove when Y is removed
  • Remove when we no longer support X

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed ! (This was copy-pasted).
I think this is a breaking change from v4 to v5, so let's say "remove in v6".
(I'll check for any dependency to the hash, but to me this feature is dropped with no dependencies)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding --parallel CLI flag breaks validation
2 participants