Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

materializations: improved handling for "ambiguous" materialized fields #1200

Merged
merged 2 commits into from
Jan 23, 2024

Conversation

williamhbaker
Copy link
Member

@williamhbaker williamhbaker commented Jan 22, 2024

Description:

The materializations for some fields may be ambiguous if the destination systems are for example case insensitive, and a field like "thisField" will be materialized as the same column as "thisfield". Presently this is a concern for materialize-redshift, which treats everything as lowercase, and materialize-snowflake because of the historical way we have quoted identifiers.

Historically we have not handled these cases well, and would end up with a error during apply because of trying to create a duplicate column. More recently, changes were made to outright forbid these ambiguous fields, and instead require alternate projections of them in order to be materialized. Forbidding the fields was simple, but excessive since users generally want to use field selection (rather than projections on collections) to pick which ambiguous field they actually want as the quick and easy way to get something working.

This change enables that by implementing the following behaviors:

  • For a new binding of a materialization, ambiguous fields will all be optional if they aren't otherwise forbidden. The constraint description will say that only one of them should be selected to materialize, since we don't have any kind of mutual exclusion constraint. If a user selects more than one of these fields for a new materialization, it will fail to apply.
  • Once an ambiguous field is selected for materialization, all others from that set of ambiguous fields will be forbidden. If a binding is published with no ambiguous field selected, they will all still be constrained as optional so that it is possible to select one for an existing binding.
  • In the event of a pre-existing table with no previously persisted spec for that binding, the ambiguous fields will still be set as optional, because we don't know which projection the existing column should correspond to without a field selection. This is different than other pre-existing columns which we will recommend by default, but seems to be necessary given the ambiguous nature of the columns.

Also, a change is included to enforce that materialized collection keys are not null, or are nullable with a default value annotation as part of Validate calls. Materializations that use typed columns will currently fail in unspecified ways if a null value is actually present in a collection, and this ensures that Validate will not allow key fields that may be null.

Workflow steps:

Materialize a collection with fields that the destination system considers ambiguous. A blind publication of the materialization will not fail, but will also not include any of the ambiguous fields by default. Selecting one of them via field selection will work, and forbidden constraints will then result for other fields that are amiguous with the selected field.

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

There's some test snapshot update churn, since I re-ran all the tests with the latest binaries that include the flow_truncated metadata field.


This change is Reviewable

Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

c = &pm.Response_Validated_Constraint{
Type: pm.Response_Validated_Constraint_FIELD_FORBIDDEN,
Reason: fmt.Sprintf(
"Flow collection field '%s' is ambiguous with fields already being materialized as '%s' in the destination. Consider using an alternate, unambiguous projection of this field to allow it to be materialized",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 all this was no-doubt tedious to implement, but will be really helpful for users trying to understand what's happening and why.

c = &pm.Response_Validated_Constraint{
Type: pm.Response_Validated_Constraint_FIELD_OPTIONAL,
Reason: fmt.Sprintf(
"Flow collection field '%s' would be materialized as '%s', which is ambiguous with fields [%s]. Only a single field from this set should be selected. Consider using alternate projections if you want to materialize more than one of these fields",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting this error is the same as the one in the new bindings case. Maybe it could be extracted into a common constant, but ... that confuses Go tooling and there's just one instance of duplication so probably best to leave as-is 🤷. Maybe just comment both to note that they're effectively the same reason and language should be changed together ?

Reason: "This field is a key in the current materialization",
}
} else {
// TODO(whb): Really this should be "FIELD_RECOMMENDED", but that is not a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lemme know if you think we need to implement this.

The materializations for some fields may be ambiguous if the destination systems are for example
case insensitive, and a field like "thisField" will be materialized as the same column as
"thisfield". Presently this is a concern for `materialize-redshift`, which treats everything as
lowercase, and `materialize-snowflake` because of the historical way we have quoted identifiers.

Historically we have not handled these cases well, and would end up with a error during apply
because of trying to create a duplicate column. More recently, changes were made to outright forbid
these ambiguous fields, and instead require alternate projections of them in order to be
materialized. Forbidding the fields was simple, but excessive since users generally want to use
field selection (rather than projections on collections) to pick which ambiguous field they actually
want as the quick and easy way to get something working.

This change enables that by implementing the following behaviors:
* For a new binding of a materialization, ambiguous fields will all be optional if they aren't
  otherwise forbidden. The constraint description will say that only one of them should be selected
  to materialize, since we don't have any kind of mutual exclusion constraint. If a user selects
  more than one of these fields for a new materialization, it will fail to apply.
* Once an ambiguous field is selected for materialization, all others from that set of ambiguous
  fields will be forbidden. If a binding is published with no ambiguous field selected, they will
  all still be constrained as optional so that it is possible to select one for an existing binding.
* In the event of a pre-existing table with no previously persisted spec for that binding, the
  ambiguous fields will still be set as optional, because we don't know which projection the
  existing column should correspond to without a field selection. This is different than other
  pre-existing columns which we will recommend by default, but seems to be necessary given the
  ambiguous nature of the columns.
… default value

Materializations do not support collections with nullable keys, and will generally fail if a null
value is every encountered for a collection key.

This changes validation to enforce that a nullable collection key not be allowed, unless it has a
default value annotation. If there is a default value annotation, the connector is assured that it
will never actually see a null value, and it will not try to drop NOT NULL constraints from the
materialized field in future `Apply` calls.
@williamhbaker williamhbaker merged commit 5d53313 into main Jan 23, 2024
43 of 44 checks passed
@williamhbaker williamhbaker deleted the wb/ambiguous-fields branch January 23, 2024 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants