DM-47201: Fix duplicates in non-find-first dataset search #1151

dhirving · 2025-02-06T22:32:30Z

Fix an issue where duplicate results could appear in a non-find-first dataset search, if the same dataset appeared in multiple collections in a chain.

This was occurring because we were forcing the addition of the collection key field to make the rows distinct. But on a non-find-first search, we don't have the window function to de-duplicate the rows by dataset ID. So we need to:

- Keep the collection key out of the rows (because this is preventing rows from being de-duplicated).
- Treat dataset ID as a unique key instead (so that we don't drop rows with the same data ID but different dataset IDs).
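
To illustrate the two points above, here is a minimal sketch using the standard-library sqlite3 module. It is not the butler query code: the table name `tags`, its columns, and the helper `count` are all hypothetical and exist only for this example. It shows that a `SELECT DISTINCT` that includes a collection column keeps one row per collection, that de-duplicating on dataset ID collapses those rows, and that de-duplicating on data ID alone would drop a distinct dataset.

```python
import sqlite3

# Hypothetical toy table: one row per (collection, dataset) membership.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tags (collection TEXT, dataset_id INTEGER, data_id TEXT);
    INSERT INTO tags VALUES
        ('run_a',  1, 'visit=1'),
        ('tagged', 1, 'visit=1'),  -- same dataset, second collection in the chain
        ('run_b',  2, 'visit=1');  -- different dataset, same data ID
    """
)


def count(columns: str) -> int:
    """Count the rows left after SELECT DISTINCT over the given columns."""
    sql = f"SELECT COUNT(*) FROM (SELECT DISTINCT {columns} FROM tags)"
    return conn.execute(sql).fetchone()[0]


# Including the collection column makes every row unique: dataset 1 appears twice.
with_collection_key = count("collection, dataset_id, data_id")

# Dropping the collection column and keeping dataset ID gives one row per dataset.
by_dataset_id = count("dataset_id, data_id")

# De-duplicating on data ID alone would wrongly merge datasets 1 and 2.
by_data_id_only = count("data_id")

print(with_collection_key, by_dataset_id, by_data_id_only)
```

In this toy data the three counts differ, which is the reason for treating dataset ID, rather than the collection key or the data ID, as the unique key.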

Checklist

- ran Jenkins
- added a release note for user-visible changes to doc/changes
- (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

dhirving · 2025-02-06T22:52:23Z

@TallJimbo I'm not quite done with this, but could you take a quick look and let me know if this seems like I'm solving the right problem?

It seems excessively complicated to me but I haven't thought of a better way to handle this yet -- I'll take a closer look in the morning with fresh eyes.

(I will also be adding more tests to this -- we don't have sufficient coverage around non-trivial non-find-first queries.)

codecov · 2025-02-06T22:57:15Z

Codecov Report

Attention: Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.36%. Comparing base (6704b41) to head (bbfa0cf).

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/daf/butler/direct_query_driver/_query_builder.py | 76.47% | 4 Missing ⚠️ |
| ...hon/lsst/daf/butler/direct_query_driver/_driver.py | 83.33% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1151   +/-   ##
=======================================
  Coverage   89.36%   89.36%           
=======================================
  Files         367      367           
  Lines       49540    49553   +13     
  Branches     6016     6019    +3     
=======================================
+ Hits        44269    44281   +12     
- Misses       3852     3853    +1     
  Partials     1419     1419


dhirving · 2025-02-06T23:05:50Z

By the way, I had mentioned that dataset fields were missing from the joins stage. That turned out not to be relevant, because needs_dataset_distinct wasn't actually causing the issue. The fields were missing in analyze_projection because, in DirectQueryDriver.build_query(), we don't add the fields from the projection stage to the joins stage until after we analyze the projection:

        # Finish setting up the projection part of the builder.
        builder.analyze_projection()
        # The joins-stage query also needs to include all columns needed by the
        # downstream projection query.  Note that this:
        # - never adds new dimensions to the joins stage (since those are
        #   always a superset of the projection-stage dimensions);
        # - does not affect our previous determination of
        #   needs_dataset_distinct, because any dataset fields being added to
        #   the joins stage here are already in the projection.
        builder.joins_analysis.columns.update(builder.projection_columns)
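
The ordering effect described above can be sketched with a toy stand-in. `ToyBuilder` and its attributes are hypothetical and only mimic the order of the two calls in the excerpt; they are not the real builder class. The sketch records which joins-stage columns are visible at the moment the analysis runs.

```python
class ToyBuilder:
    """Hypothetical stand-in that only models the order of operations."""

    def __init__(self, joins_columns: set[str], projection_columns: set[str]) -> None:
        self.joins_columns = set(joins_columns)
        self.projection_columns = set(projection_columns)
        self.seen_during_analysis: frozenset[str] | None = None

    def analyze_projection(self) -> None:
        # Snapshot the joins-stage columns as they exist at analysis time.
        self.seen_during_analysis = frozenset(self.joins_columns)


builder = ToyBuilder(
    joins_columns={"dataset_id"},
    projection_columns={"dataset_id", "run"},
)

# Same order as in the excerpt: analyze first, then copy projection columns over.
builder.analyze_projection()
builder.joins_columns.update(builder.projection_columns)

# The analysis never saw "run", even though the joins stage has it afterwards.
print(sorted(builder.seen_during_analysis))
print(sorted(builder.joins_columns))
```

Under these assumptions, any check made inside the analysis step sees the joins stage without the projection-only fields, which matches the observation that the fields appeared to be missing there.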

Fix an issue where duplicate results could appear in a non-find-first dataset search, if the same dataset appeared in multiple collections in a chain. This was occurring because we were forcing the addition of the collection key field to make the rows distinct. But on a non-find-first search, we don't have the window function to de-duplicate the rows by dataset ID, so we need to keep the collection key out of the rows and treat dataset ID as a unique key instead.

dhirving force-pushed the tickets/DM-47201 branch 3 times, most recently from 4427829 to f98563f Compare February 6, 2025 22:43

timj and others added 2 commits February 28, 2025 10:14

Check that deduplication occurs with RUN/CALIBRATION queries

9edadfc

dhirving force-pushed the tickets/DM-47201 branch from f98563f to bbfa0cf Compare February 28, 2025 17:14
