DM-47201: Fix duplicates in non-find-first dataset search #1151

Draft
wants to merge 2 commits into main from tickets/DM-47201


dhirving
Contributor

@dhirving dhirving commented Feb 6, 2025

Fix an issue where duplicate results could appear in a non-find-first dataset search, if the same dataset appeared in multiple collections in a chain.

This was occurring because we were forcing the addition of the collection key field to make the rows distinct. But on a non-find-first search, we don't have the window function to de-duplicate the rows by dataset ID. So we need to:

  1. Keep the collection key out of the rows (because it prevents rows from being de-duplicated).
  2. Treat dataset ID as a unique key instead (so that we don't drop rows with the same data ID but different dataset IDs).
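
The effect of the two points above can be reproduced with a minimal sketch. The table name `membership`, its columns, and the sample rows are hypothetical stand-ins chosen for illustration; they are not the Butler's actual schema or the SQL that the query builder emits. The sketch only shows how the choice of columns in a `SELECT DISTINCT` changes which rows collapse.

```python
import sqlite3

# Hypothetical toy schema: one row per (dataset, collection) membership.
# Dataset 1 appears in two collections of the same chain; dataset 2 shares
# its data ID with dataset 1 but is a different dataset.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE membership (dataset_id INTEGER, data_id TEXT, collection_key TEXT);
    INSERT INTO membership VALUES
        (1, 'visit=1', 'run_a'),
        (1, 'visit=1', 'tagged_b'),
        (2, 'visit=1', 'run_c');
    """
)

# Collection key included: the two rows for dataset 1 differ, so DISTINCT
# cannot collapse them and the dataset is reported twice.
with_collection = conn.execute(
    "SELECT DISTINCT dataset_id, data_id, collection_key FROM membership "
    "ORDER BY dataset_id, collection_key"
).fetchall()

# Collection key left out, dataset ID kept as the unique key: one row per dataset.
by_dataset_id = conn.execute(
    "SELECT DISTINCT dataset_id, data_id FROM membership ORDER BY dataset_id"
).fetchall()

# Data ID alone as the key: dataset 2 is wrongly merged into dataset 1.
by_data_id = conn.execute("SELECT DISTINCT data_id FROM membership").fetchall()

print(len(with_collection), by_dataset_id, by_data_id)
```

In this toy data the first query returns three rows, the second returns one row per dataset, and the third returns a single row, which is why the dataset ID has to act as the unique key once the collection key is removed.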

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

@dhirving dhirving force-pushed the tickets/DM-47201 branch 3 times, most recently from 4427829 to f98563f Compare February 6, 2025 22:43
@dhirving
Contributor Author

dhirving commented Feb 6, 2025

@TallJimbo I'm not quite done with this, but could you take a quick look and let me know if this seems like I'm solving the right problem?

It seems excessively complicated to me but I haven't thought of a better way to handle this yet -- I'll take a closer look in the morning with fresh eyes.

(I will also be adding more tests to this -- we don't have sufficient coverage around non-trivial non-find-first queries.)

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.36%. Comparing base (6704b41) to head (bbfa0cf).

✅ All tests successful. No failed tests found.

Files with missing lines                               Patch %   Lines
...t/daf/butler/direct_query_driver/_query_builder.py  76.47%    4 Missing ⚠️
...hon/lsst/daf/butler/direct_query_driver/_driver.py  83.33%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1151   +/-   ##
=======================================
  Coverage   89.36%   89.36%           
=======================================
  Files         367      367           
  Lines       49540    49553   +13     
  Branches     6016     6019    +3     
=======================================
+ Hits        44269    44281   +12     
- Misses       3852     3853    +1     
  Partials     1419     1419           


@dhirving
Contributor Author

dhirving commented Feb 6, 2025

By the way, I had mentioned that dataset fields were missing from the joins stage... it turned out to not be relevant because needs_dataset_distinct wasn't actually causing the issue. The fields were missing in analyze_projection because in DirectQueryDriver.build_query(), we don't add the fields from the projection stage until after we analyze the projection:

        # Finish setting up the projection part of the builder.
        builder.analyze_projection()
        # The joins-stage query also needs to include all columns needed by the
        # downstream projection query.  Note that this:
        # - never adds new dimensions to the joins stage (since those are
        #   always a superset of the projection-stage dimensions);
        # - does not affect our previous determination of
        #   needs_dataset_distinct, because any dataset fields being added to
        #   the joins stage here are already in the projection.
        builder.joins_analysis.columns.update(builder.projection_columns)
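
The ordering effect described in this comment can be illustrated with a toy model. `ToyBuilder` and its attributes are hypothetical and much simpler than the real query builder; the sketch only assumes that the analysis step inspects the joins-stage columns as they stand at the moment it is called.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyBuilder:
    """Hypothetical stand-in for a query builder with two column sets."""

    joins_columns: set[str] = field(default_factory=set)
    projection_columns: set[str] = field(default_factory=set)
    seen_during_analysis: set[str] = field(default_factory=set)

    def analyze_projection(self) -> None:
        # Snapshot the joins-stage columns visible at analysis time.
        self.seen_during_analysis = set(self.joins_columns)


builder = ToyBuilder(
    joins_columns={"visit"},
    projection_columns={"visit", "dataset_id"},
)

# Same order as in the snippet above: analyze first, then copy the
# projection columns into the joins stage.
builder.analyze_projection()
builder.joins_columns.update(builder.projection_columns)

# Columns the analysis step could not see, although they end up in the joins stage.
missing_at_analysis = builder.projection_columns - builder.seen_during_analysis
print(missing_at_analysis)
```

In this model `dataset_id` is absent during analysis but present afterwards, which matches the observation that the fields looked missing inside `analyze_projection` without that being the cause of the duplicates.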

timj and others added 2 commits February 28, 2025 10:14
Fix an issue where duplicate results could appear in a non-find-first dataset search, if the same dataset appeared in multiple collections in a chain.

This was occurring because we were forcing the addition of the collection key field to make the rows distinct. But on a non-find-first search, we don't have the window function to de-duplicate the rows by dataset ID, so we need to keep the collection key out of the rows and treat dataset ID as a unique key instead.