
materialize-databricks: reset needsMerge flag after each transaction #2425

Merged 2 commits into main on Feb 24, 2025

Conversation

@williamhbaker (Member) commented on Feb 19, 2025:

Description:

The next transaction might not have any updates, so reset the needsMerge flag to prevent merge queries from being run indefinitely after one merge query has been run.
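The fix can be sketched as follows. This is a minimal, hypothetical model of the transactor state, not the actual materialize-databricks implementation: the `transactor`, `store`, and `commit` names are illustrative, and the real connector builds full SQL rather than returning a query kind.

```go
package main

import "fmt"

// transactor is a hypothetical sketch of the relevant connector state;
// the real materialize-databricks transactor carries much more.
type transactor struct {
	needsMerge bool
}

// store records whether any document in the current transaction updates
// an existing key; only then is a merge query required.
func (t *transactor) store(isUpdate bool) {
	if isUpdate {
		t.needsMerge = true
	}
}

// commit picks MERGE INTO vs a plain COPY INTO based on the flag, then
// resets it so the next transaction starts clean (the fix in this PR).
func (t *transactor) commit() string {
	query := "COPY INTO"
	if t.needsMerge {
		query = "MERGE INTO"
	}
	t.needsMerge = false // reset after each transaction
	return query
}

func main() {
	var t transactor
	t.store(true) // transaction 1 updates existing keys
	fmt.Println(t.commit()) // MERGE INTO
	t.store(false) // transaction 2 has only new keys
	fmt.Println(t.commit()) // COPY INTO, not another merge
}
```

Without the reset line, the second `commit` would still see `needsMerge == true` and emit a merge query forever.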

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



The next transaction might not have any updates, so reset the `needsMerge` flag
to prevent merge queries from being run indefinitely after one merge query has
been run.
@williamhbaker force-pushed the wb/databricks-reset-merge-flag branch from 26bbf69 to 1ec7d89 on February 19, 2025 at 16:50
@mdibaiee (Member) left a comment:

Can you add a test case to our integration test to show this scenario please?

@williamhbaker force-pushed the wb/databricks-reset-merge-flag branch from a5cdc7c to e62db23 on February 20, 2025 at 15:57
"Queries": [
"\n\tCOPY INTO `some-schema`.simple FROM (\n SELECT\n\t\tid::LONG, canary::STRING, flow_published_at::TIMESTAMP, flow_document::STRING\n FROM '/Volumes/main/some-schema/flow_staging/flow_temp_tables'\n\t)\n FILEFORMAT = JSON\n FILES = ('<uuid>')\n FORMAT_OPTIONS ( 'mode' = 'FAILFAST', 'ignoreMissingFiles' = 'false' )\n\tCOPY_OPTIONS ( 'mergeSchema' = 'true' )\n ;\n"
"\n\tCOPY INTO `some-schema`.duplicate_keys_standard FROM (\n SELECT\n\t\tid::LONG, flow_published_at::TIMESTAMP, int::LONG, str::STRING, flow_document::STRING\n FROM '/Volumes/main/some-schema/flow_staging/flow_temp_tables'\n\t)\n FILEFORMAT = JSON\n FILES = ('<uuid>')\n FORMAT_OPTIONS ( 'mode' = 'FAILFAST', 'ignoreMissingFiles' = 'false' )\n\tCOPY_OPTIONS ( 'mergeSchema' = 'true' )\n ;\n"
@williamhbaker (Member, Author) commented:

Without the code change to reset the needsMerge flag, this would have been run as a merge query even though all the keys for the transaction are new.

```diff
@@ -1,4 +1,5 @@
 [
 [],
```
@williamhbaker (Member, Author) commented:

Added this initial empty transaction for unrelated reasons since I was updating all the test snapshots anyway (see commit message) - note that this causes the materialized timestamps to shift up for the subsequent transactions, which accounts for a lot of the diff in the snapshots.

…ions

Extends the materialization integration test fixture to include an additional
transaction for the "duplicate keys" scenario where all of the keys are new. The
progression is then 1) All keys are new for the initial transaction, 2) Updates
to existing keys for the next transaction 3) (New) A transaction with all new
keys.

This exercises the behavior some materializations implement where merge types of
queries are only run if some of the keys to store have updates, and otherwise a
more efficient direct insert is run. It is reflected in the test snapshots for
materializations that persist queries in their state, most notably for
databricks, where a bug in this behavior was fixed in the prior commit.

Since I had to update most of the test snapshots for this, I also included the
following changes:

* Updated the google sheets materialization test setup to use an encrypted
  config file like all the other materialization tests.
* Did a bit of cleanup of the CI spec file, mostly related to the above.
* Added an empty transaction to the test fixture, which is run first. This
  simulates what happens if a materialization is configured with a "notBefore"
  setting, and processes some empty transactions at first due to skipping over
  data. This revealed that materialize-sqlserver still wasn't handling this
  correctly, so that is fixed as well.
@williamhbaker force-pushed the wb/databricks-reset-merge-flag branch from e62db23 to 5e7f007 on February 20, 2025 at 16:55
@williamhbaker (Member, Author) commented:
> Can you add a test case to our integration test to show this scenario please?

👍 I extended our existing "duplicate keys" test to illustrate this scenario. We need to transition through transactions where there are merges to one where there aren't, so I had to add on an additional transaction to the fixture. The outcome is apparent for materialize-databricks since it shows the queries it will run in the connector checkpoint.

I made some other opportunistic changes while I was updating all the test snapshots which are described in the commit message. It took some fiddling to get all the tests passing actually but in the end I think we're better off for it!

@mdibaiee (Member) left a comment:

Thanks!

@williamhbaker merged commit f715347 into main on Feb 24, 2025
53 of 56 checks passed
@williamhbaker deleted the wb/databricks-reset-merge-flag branch on February 24, 2025 at 14:23