-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
materialize-databricks: reset needsMerge
flag after each transaction
#2425
Conversation
The next transaction might not have any updates, so reset the `needsMerge` flag to prevent merge queries from being run indefinitely after one merge query has been run.
26bbf69
to
1ec7d89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case to our integration test to show this scenario please?
a5cdc7c
to
e62db23
Compare
"Queries": [ | ||
"\n\tCOPY INTO `some-schema`.simple FROM (\n SELECT\n\t\tid::LONG, canary::STRING, flow_published_at::TIMESTAMP, flow_document::STRING\n FROM '/Volumes/main/some-schema/flow_staging/flow_temp_tables'\n\t)\n FILEFORMAT = JSON\n FILES = ('<uuid>')\n FORMAT_OPTIONS ( 'mode' = 'FAILFAST', 'ignoreMissingFiles' = 'false' )\n\tCOPY_OPTIONS ( 'mergeSchema' = 'true' )\n ;\n" | ||
"\n\tCOPY INTO `some-schema`.duplicate_keys_standard FROM (\n SELECT\n\t\tid::LONG, flow_published_at::TIMESTAMP, int::LONG, str::STRING, flow_document::STRING\n FROM '/Volumes/main/some-schema/flow_staging/flow_temp_tables'\n\t)\n FILEFORMAT = JSON\n FILES = ('<uuid>')\n FORMAT_OPTIONS ( 'mode' = 'FAILFAST', 'ignoreMissingFiles' = 'false' )\n\tCOPY_OPTIONS ( 'mergeSchema' = 'true' )\n ;\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the code change to reset the needsMerge
flag, this would have been run as a merge query even though all the keys for the transaction are new.
@@ -1,4 +1,5 @@ | |||
[ | |||
[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this initial empty transaction for unrelated reasons since I was updating all the test snapshots anyway (see commit message) - note that this causes the materialized timestamps to shift up for the subsequent transactions, which accounts for a lot of the diff in the snapshots.
…ions Extends the materialization integration test fixture to include an additional transaction for the "duplicate keys" scenario where all of the keys are new. The progression is then 1) All keys are new for the initial transaction, 2) Updates to existing keys for the next transaction 3) (New) A transaction with all new keys. This exercises the behavior some materializations implement where merge types of queries are only run if some of the keys to store have updates, and otherwise a more efficient direct insert is run. It is reflected in the test snapshots for materializations that persist queries in their state, most notably for databricks, where a bug in this behavior was fixed in the prior commit. Since I had to update most of the test snapshots for this, I also included the following changes: * Updated the google sheets materialization test setup to use an encrypted config file like all the other materialization tests. * Did a bit of cleaned up the CI spec file mostly related to the above. * Added an empty transaction to the test fixture, which is run first. This simulates what happens if a materialization is configured with a "notBefore" setting, and processes some empty transactions at first due to skipping over data. This revealed that materialize-sqlserver still wasn't handling this correctly, so that is fixed as well.
e62db23
to
5e7f007
Compare
👍 I extended our existing "duplicate keys" test to illustrate this scenario. We need to transition through transactions where there are merges to one where there aren't, so I had to add on an additional transaction to the fixture. The outcome is apparent for I made some other opportunistic changes while I was updating all the test snapshots which are described in the commit message. It took some fiddling to get all the tests passing actually but in the end I think we're better off for it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Description:
The next transaction might not have any updates, so reset the
needsMerge
flag to prevent merge queries from being run indefinitely after one merge query has been run.Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is