-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
source-mysql: Row IDs and View Discovery #2451
Conversation
cb71cbd
to
871f2b2
Compare
@@ -112,10 +112,19 @@ func (r Resource) Validate() error { | |||
type documentMetadata struct { | |||
Polled time.Time `json:"polled" jsonschema:"title=Polled Timestamp,description=The time at which the update query which produced this document as executed."` | |||
Index int `json:"index" jsonschema:"title=Result Index,description=The index of this document within the query execution which produced it."` | |||
RowID int64 `json:"row_id" jsonschema:"title=Row ID,description=Row ID of the Document, counting up from zero."` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this add a required row_id
field to all pre-existing collections, which do not have that field in their documents? That seems like it would be problematic, since it would prevent reading any pre-existing data from the collections without schema violations if there were new materializations set up from the collections or backfills of materializations etc.
I believe this question would also apply to the similar change to source-postgres-batch, although I didn't think of it when I was looking at that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's a very good catch. Adding the Row ID field to all documents going forward was intentional, but I missed the issue with discovering it as required vs optional.
It will definitely have to be required in contexts where it's used as a key, so the obvious options are:
- it's optional or required depending on the setting of the keyless_row_id flag, so any capture which might possibly use it as a key will always discover it as required but old ones won't unless they opt into the new behavior
- it's optional most of the time and only becomes required for a particular table when we actually use the row ID as a fallback key
Option (2) seems less impactful by default but it would mean that users couldn't explicitly change the key selection to the row ID (which is something I could actually see people wanting to do occasionally because keyless tables doing a full-refresh with inferred row ID deletions is currently the only way to get deletions out of the batch SQL captures)
This copies over the "Row ID" property and related changes from source-postgres-batch, including: - The `/_meta/row_id` and `/_meta/op` properties. - Feature-flagged logic to use `/_meta/row_id` as the collection key in discovery for source tables without a primary key. - Special cases for inferring deletes/updates when capturing a full-refresh table with `/_meta/row_id` as the collection key. - Tests of keyless-table behavior and the new feature flag.
Implements an advanced option for returning views from discovery.
477cc3c
to
90ebe2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description:
This PR adds two distinct features which I'm copying over from source-postgres-batch to other SQL batch captures:
I'm partly just being lazy by combining the two into a single PR, but it also partly makes sense to do so because views don't have a primary key and also don't always have a useful cursor column, and so we need the keyless-table full-refresh row ID behavior to really make them work nicely.
This PR implements #2382 for source-mysql-batch.
This change is