Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-mysql: Row IDs and View Discovery #2451

Merged
merged 4 commits into from
Feb 27, 2025

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Feb 26, 2025

Description:

This PR adds two distinct features which I'm copying over from source-postgres-batch to other SQL batch captures:

  • Row IDs and their use as the fallback collection key for tables without a primary key.
  • An advanced option to return views from discovery.

I'm partly just being lazy by combining the two into a single PR, but it also partly makes sense to do so because views don't have a primary key and also don't always have a useful cursor column, and so we need the keyless-table full-refresh row ID behavior to really make them work nicely.

This PR implements #2382 for source-mysql-batch.


This change is Reviewable

@willdonnelly willdonnelly requested a review from a team February 26, 2025 00:17
@willdonnelly willdonnelly added the change:planned This is a planned change label Feb 26, 2025
@willdonnelly willdonnelly force-pushed the wgd/batch-mysql-rowid-20250224 branch from cb71cbd to 871f2b2 Compare February 26, 2025 00:20
@@ -112,10 +112,19 @@ func (r Resource) Validate() error {
type documentMetadata struct {
Polled time.Time `json:"polled" jsonschema:"title=Polled Timestamp,description=The time at which the update query which produced this document as executed."`
Index int `json:"index" jsonschema:"title=Result Index,description=The index of this document within the query execution which produced it."`
RowID int64 `json:"row_id" jsonschema:"title=Row ID,description=Row ID of the Document, counting up from zero."`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this add a required row_id field to all pre-existing collections, which do not have that field in their documents? That seems like it would be problematic, since it would prevent reading any pre-existing data from the collections without schema violations if there were new materializations set up from the collections or backfills of materializations etc.

I believe this question would also apply to the similar change to source-postgres-batch, although I didn't think of it when I was looking at that PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a very good catch. Adding the Row ID field to all documents going forward was intentional, but I missed the issue with discovering it as required vs optional.

It will definitely have to be required in contexts where it's used as a key, so the obvious options are:

  • it's optional or required depending on the setting of the keyless_row_id flag, so any capture which might possibly use it as a key will always discover it as required but old ones won't unless they opt into the new behavior
  • it's optional most of the time and only becomes required for a particular table when we actually use the row ID as a fallback key

Option (2) seems less impactful by default but it would mean that users couldn't explicitly change the key selection to the row ID (which is something I could actually see people wanting to do occasionally because keyless tables doing a full-refresh with inferred row ID deletions is currently the only way to get deletions out of the batch SQL captures)

This copies over the "Row ID" property and related changes from
source-postgres-batch, including:
 - The `/_meta/row_id` and `/_meta/op` properties.
 - Feature-flagged logic to use `/_meta/row_id` as the collection
   key in discovery for source tables without a primary key.
 - Special cases for inferring deletes/updates when capturing a
   full-refresh table with `/_meta/row_id` as the collection key.
 - Tests of keyless-table behavior and the new feature flag.
Implements an advanced option for returning views from discovery.
@willdonnelly willdonnelly force-pushed the wgd/batch-mysql-rowid-20250224 branch from 477cc3c to 90ebe2d Compare February 26, 2025 23:20
Copy link
Member

@williamhbaker williamhbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@willdonnelly willdonnelly merged commit 71a18bf into main Feb 27, 2025
54 of 56 checks passed
@willdonnelly willdonnelly deleted the wgd/batch-mysql-rowid-20250224 branch February 27, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:planned This is a planned change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants