-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
materializations: various enhancements for handling problematic field names #1266
Conversation
dfe5736
to
4d9d8cc
Compare
4d9d8cc
to
fce52d1
Compare
materialize-elasticsearch/driver.go
Outdated
// also truncate index names that contain a '#' character to drop everything including & | ||
// after that '#' character, so we'll normalize those too. | ||
if slices.Contains([]rune{'*', '<', '"', ' ', '\\', '/', ',', '|', '>', '?', ':', '#'}, r) { | ||
if !afterPrefix { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is this on a separate line due to line length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a separate condition, since index names can't begin with an underscore and if a proposed index name has a disallowed character at the beginning it won't be valid to replace it with an underscore that would be at the beginning.
But looking at this now I think there's a simpler way to factor this that removes the second nested check of if !afterPrefix
so I'll push that change.
TypeMapper: mapper, | ||
ColumnValidator: columnValidator, | ||
MaxColumnCharLength: 255, | ||
CaseInsensitiveColumns: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm under the impression that in Snowflake unless the identifier is quoted, the fields will be case-insensitive. Does that not mean we should set this to true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CaseInsensitiveColumns
(which is a proxy for caseInsensitiveFields
in the materialize-boilerplate
validator) means that even if the fields are quoted, the destination system will still not allow more than one column with the same name that differs only in capitalization. Snowflake does allow "thisField"
vs "thisfield"
, so having it be false
represents that. Admittedly I'm finding it hard to come up with a variable name that succinctly represents what this toggle does, so let me know if you have suggestions on something that would be more clear.
…rialized fields Many systems will preserve capitalization in the names of materialized fields (for example, columns of a SQL table), but still will not allow columns with names that differ only in capitalization to co-exist within the same table. To prevent errors during `Apply`, an additional configuration is provided for `Validate` to consider field names to be case insensitive for the purposes of determining ambiguous columns. This is relevant for materializations to BigQuery, Databricks, MotherDuck, MySQL, and SQLServer. For SQL materializations, the configuration is made available through `materialize-sql` as a new configuration of the `sql.Dialect`. I also moved the `MaxColumnCharLength` (name adjusted slightly) out of the `sql.Endpoint` configuration and into the `sql.Dialect`, since it seemed to thematically fit better there alongside the case-sensitivity configuration.
There was a systemic logic error in how we were escaping quote characters within a quoted identifier. This fixes a number of those. Databricks and MySQL will be fixed in later commits which also change how they are sanitizing identifiers.
MySQL does not allow identifiers to have trailing spaces or characters with code points greater than U+FFFF. (ref https://dev.mysql.com/doc/refman/8.0/en/identifiers.html). These disallowed characters will be sanitized with an underscore.
Databricks has a set of characters that are strictly disallowed for column names, regardless of how their identifiers are quoted. These characters will be sanitized to underscores. Databricks also has a different, not entirely overlapping set of disallowed characters for table names. These will be sanitized as part of the resource path response, rather than returning an error from `Validate`. This will allow table names with spaces, which may be a common occurrence for Databricks materializations linked to source captures that will automatically add new bindings without user involvement in what the configured name for the tables are.
…e-defined "Metadata" fields ElasticSearch has a number of pre-defined "Metadata" fields (see https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html) that exist for every index, and trying to create an index with a separate mapping with any of these names will fail with a "field is defined more than once" sort of error. All we can really do is differentiate our Flow field by appending "_flow" to the name of the field to allow the field to be created. Doing this kind of sanitization is tricky because there are numerous places it has to be included for things to work out end-to-end without a higher-level framework like `materialize-sql`. And such a general framework may actually make sense to construct for materializations in general, but I'm not tackling that just yet.
Adds a set of tests that ensure destination systems can create materialized fields for Flow field names with some pretty challenging characters. The `materialize-motherduck` tests were enabled and ran as part of this, since they now work. Some additional handling in the test helpers was added to support delta updates configurations, since `materialize-motherduck` must currently always be set to delta updates.
Previously we would only normalize characters in index names that would be allowed by Flow collection names, but that's not fully sufficient since the configured index name in the resource configuration can be anything. It will usually be the collection name, but not always. This extends normalization to all characters that pose problems in index names.
fce52d1
to
7f0e8ea
Compare
Description:
Various enhancements for handling problematic field names in materializations that manage the equivalent of "columns" in their destination systems.
A few highlights:
materialize-elasticsearch
will sanitize field names by appending_flow
to them, in cases where they conflict with ES pre-deefined "Metadata" fields and would not otherwise be able to be materializedmaterialize-databricks
will sanitize field names that contain one of the set of characters that are always disallowed for column names, no matter how hard to you try to quote or escape themmaterialize-mysql
will sanitize field names under some less common circumstances...characters with high code points, and trailing spacesSee individual commit messages for more details.
I also did some manual testing with the absurd schema used for the validate & apply tests to make sure that we could actually materialize data to tables with these columns for all materializations (including storing and loading). And I did a bonus test where a resource path was configured with a table name of
" ,;{}().- problematic#Table � 𐀀 嶲 "
and all materializations worked with that exceptmaterialize-mysql
, which may suggest that there is still some limitation with how MySQL itself or the client driver handles weird characters. It doesn't seem worth spending a ton of time trying to figure out why this is right now, but we'll probably have to handle it eventually.Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is