
materialize-databricks: reset needsMerge flag after each transaction #2425

Merged 2 commits into main on Feb 24, 2025

Conversation

@williamhbaker (Member) commented on Feb 19, 2025:

Description:

The next transaction might not have any updates, so reset the needsMerge flag to prevent merge queries from being run indefinitely after one merge query has been run.
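The fix can be sketched as follows. This is a minimal, hypothetical model of the transactor state, not the actual materialize-databricks implementation: the `transactor`, `store`, and `commit` names are illustrative, and the real connector builds full SQL rather than returning a query kind.

```go
package main

import "fmt"

// transactor is a hypothetical sketch of the relevant connector state;
// the real materialize-databricks transactor carries much more.
type transactor struct {
	needsMerge bool
}

// store records whether any document in the current transaction updates
// an existing key; only then is a merge query required.
func (t *transactor) store(isUpdate bool) {
	if isUpdate {
		t.needsMerge = true
	}
}

// commit picks MERGE INTO vs a plain COPY INTO based on the flag, then
// resets it so the next transaction starts clean (the fix in this PR).
func (t *transactor) commit() string {
	query := "COPY INTO"
	if t.needsMerge {
		query = "MERGE INTO"
	}
	t.needsMerge = false // reset after each transaction
	return query
}

func main() {
	var t transactor
	t.store(true) // transaction 1 updates existing keys
	fmt.Println(t.commit()) // MERGE INTO
	t.store(false) // transaction 2 has only new keys
	fmt.Println(t.commit()) // COPY INTO, not another merge
}
```

Without the reset line, the second `commit` would still see `needsMerge == true` and emit a merge query forever.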

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



The next transaction might not have any updates, so reset the `needsMerge` flag
to prevent merge queries from being run indefinitely after one merge query has
been run.
@williamhbaker force-pushed the wb/databricks-reset-merge-flag branch from 26bbf69 to 1ec7d89 on February 19, 2025 at 16:50
@mdibaiee (Member) left a comment:

Can you add a test case to our integration test to show this scenario please?

@williamhbaker force-pushed the wb/databricks-reset-merge-flag branch from a5cdc7c to e62db23 on February 20, 2025 at 15:57
"Queries": [
"\n\tCOPY INTO `some-schema`.simple FROM (\n SELECT\n\t\tid::LONG, canary::STRING, flow_published_at::TIMESTAMP, flow_document::STRING\n FROM '/Volumes/main/some-schema/flow_staging/flow_temp_tables'\n\t)\n FILEFORMAT = JSON\n FILES = ('<uuid>')\n FORMAT_OPTIONS ( 'mode' = 'FAILFAST', 'ignoreMissingFiles' = 'false' )\n\tCOPY_OPTIONS ( 'mergeSchema' = 'true' )\n ;\n"
"\n\tCOPY INTO `some-schema`.duplicate_keys_standard FROM (\n SELECT\n\t\tid::LONG, flow_published_at::TIMESTAMP, int::LONG, str::STRING, flow_document::STRING\n FROM '/Volumes/main/some-schema/flow_staging/flow_temp_tables'\n\t)\n FILEFORMAT = JSON\n FILES = ('<uuid>')\n FORMAT_OPTIONS ( 'mode' = 'FAILFAST', 'ignoreMissingFiles' = 'false' )\n\tCOPY_OPTIONS ( 'mergeSchema' = 'true' )\n ;\n"
@williamhbaker (Member, Author) commented:

Without the code change to reset the needsMerge flag, this would have been run as a merge query even though all the keys for the transaction are new.

```diff
@@ -1,4 +1,5 @@
 [
 [],
```
@williamhbaker (Member, Author) commented:

Added this initial empty transaction for unrelated reasons since I was updating all the test snapshots anyway (see commit message) - note that this causes the materialized timestamps to shift up for the subsequent transactions, which accounts for a lot of the diff in the snapshots.

…ions

Extends the materialization integration test fixture to include an additional
transaction for the "duplicate keys" scenario where all of the keys are new. The
progression is then 1) All keys are new for the initial transaction, 2) Updates
to existing keys for the next transaction 3) (New) A transaction with all new
keys.

This exercises the behavior some materializations implement where merge types of
queries are only run if some of the keys to store have updates, and otherwise a
more efficient direct insert is run. It is reflected in the test snapshots for
materializations that persist queries in their state, most notably for
databricks, where a bug in this behavior was fixed in the prior commit.

Since I had to update most of the test snapshots for this, I also included the
following changes:

* Updated the google sheets materialization test setup to use an encrypted
  config file like all the other materialization tests.
* Did a bit of cleanup of the CI spec file, mostly related to the above.
* Added an empty transaction to the test fixture, which is run first. This
  simulates what happens if a materialization is configured with a "notBefore"
  setting, and processes some empty transactions at first due to skipping over
  data. This revealed that materialize-sqlserver still wasn't handling this
  correctly, so that is fixed as well.
@williamhbaker force-pushed the wb/databricks-reset-merge-flag branch from e62db23 to 5e7f007 on February 20, 2025 at 16:55
@williamhbaker (Member, Author) commented:
> Can you add a test case to our integration test to show this scenario please?

👍 I extended our existing "duplicate keys" test to illustrate this scenario. We need to transition through transactions where there are merges to one where there aren't, so I had to add on an additional transaction to the fixture. The outcome is apparent for materialize-databricks since it shows the queries it will run in the connector checkpoint.

I made some other opportunistic changes while I was updating all the test snapshots which are described in the commit message. It took some fiddling to get all the tests passing actually but in the end I think we're better off for it!

@mdibaiee (Member) left a comment:

Thanks!

@williamhbaker merged commit f715347 into main on Feb 24, 2025
53 of 56 checks passed
@williamhbaker deleted the wb/databricks-reset-merge-flag branch on February 24, 2025 at 14:23