Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

materializations: various enhancements for handling problematic field names #1266

Merged
merged 7 commits into from
Feb 22, 2024

Conversation

williamhbaker
Copy link
Member

@williamhbaker williamhbaker commented Feb 15, 2024

Description:

Various enhancements for handling problematic field names in materializations that manage the equivalent of "columns" in their destination systems.

A few highlights:

  • materialize-elasticsearch will sanitize field names by appending _flow to them, in cases where they conflict with ES pre-deefined "Metadata" fields and would not otherwise be able to be materialized
  • materialize-databricks will sanitize field names that contain one of the set of characters that are always disallowed for column names, no matter how hard to you try to quote or escape them
  • materialize-mysql will sanitize field names under some less common circumstances...characters with high code points, and trailing spaces
  • Fixed a systemic issue with escaping identifier quotes within quoted identifiers for SQL materializations
  • Better handling for "ambiguous" field names, where the destination system preserves capitalization but still doesn't allow columns to co-exist that differ only in capitalization - this is actually pretty common

See individual commit messages for more details.

I also did some manual testing with the absurd schema used for the validate & apply tests to make sure that we could actually materialize data to tables with these columns for all materializations (including storing and loading). And I did a bonus test where a resource path was configured with a table name of " ,;{}().- problematic#Table � 𐀀 嶲 " and all materializations worked with that except materialize-mysql, which may suggest that there is still some limitation with how MySQL itself or the client driver handles weird characters. It doesn't seem worth spending a ton of time trying to figure out why this is right now, but we'll probably have to handle it eventually.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

@williamhbaker williamhbaker changed the title materializations: various enhancements for handling problematic column names materializations: various enhancements for handling problematic field names Feb 15, 2024
@williamhbaker williamhbaker force-pushed the wb/weird-columns branch 2 times, most recently from dfe5736 to 4d9d8cc Compare February 15, 2024 20:30
@williamhbaker williamhbaker marked this pull request as ready for review February 15, 2024 20:51
// also truncate index names that contain a '#' character to drop everything including &
// after that '#' character, so we'll normalize those too.
if slices.Contains([]rune{'*', '<', '"', ' ', '\\', '/', ',', '|', '>', '?', ':', '#'}, r) {
if !afterPrefix {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this on a separate line due to line length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a separate condition, since index names can't begin with an underscore and if a proposed index name has a disallowed character at the beginning it won't be valid to replace it with an underscore that would be at the beginning.

But looking at this now I think there's a simpler way to factor this that removes the second nested check of if !afterPrefix so I'll push that change.

TypeMapper: mapper,
ColumnValidator: columnValidator,
MaxColumnCharLength: 255,
CaseInsensitiveColumns: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm under the impression that in Snowflake unless the identifier is quoted, the fields will be case-insensitive. Does that not mean we should set this to true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CaseInsensitiveColumns (which is a proxy for caseInsensitiveFields in the materialize-boilerplate validator) means that even if the fields are quoted, the destination system will still not allow more than one column with the same name that differs only in capitalization. Snowflake does allow "thisField" vs "thisfield", so having it be false represents that. Admittedly I'm finding it hard to come up with a variable name that succinctly represents what this toggle does, so let me know if you have suggestions on something that would be more clear.

…rialized fields

Many systems will preserve capitalization in the names of materialized fields (for example, columns
of a SQL table), but still will not allow columns with names that differ only in capitalization to
co-exist within the same table. To prevent errors during `Apply`, an additional configuration is
provided for `Validate` to consider field names to be case insensitive for the purposes of
determining ambiguous columns.

This is relevant for materializations to BigQuery, Databricks, MotherDuck, MySQL, and SQLServer.

For SQL materializations, the configuration is made available through `materialize-sql` as a new
configuration of the `sql.Dialect`. I also moved the `MaxColumnCharLength` (name adjusted slightly)
out of the `sql.Endpoint` configuration and into the `sql.Dialect`, since it seemed to thematically
fit better there alongside the case-sensitivity configuration.
There was a systemic logic error in how we were escaping quote characters within a quoted
identifier. This fixes a number of those. Databricks and MySQL will be fixed in later commits which
also change how they are sanitizing identifiers.
MySQL does not allow identifiers to have trailing spaces or characters with code points greater than
U+FFFF. (ref https://dev.mysql.com/doc/refman/8.0/en/identifiers.html). These disallowed characters
will be sanitized with an underscore.
Databricks has a set of characters that are strictly disallowed for column names, regardless of how
their identifiers are quoted. These characters will be sanitized to underscores.

Databricks also has a different, not entirely overlapping set of disallowed characters for table
names. These will be sanitized as part of the resource path response, rather than returning an error
from `Validate`. This will allow table names with spaces, which may be a common occurrence for
Databricks materializations linked to source captures that will automatically add new bindings
without user involvement in what the configured name for the tables are.
…e-defined "Metadata" fields

ElasticSearch has a number of pre-defined "Metadata" fields (see
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html) that exist for
every index, and trying to create an index with a separate mapping with any of these names will fail
with a "field is defined more than once" sort of error. All we can really do is differentiate our
Flow field by appending "_flow" to the name of the field to allow the field to be created.

Doing this kind of sanitization is tricky because there are numerous places it has to be included
for things to work out end-to-end without a higher-level framework like `materialize-sql`. And such
a general framework may actually make sense to construct for materializations in general, but I'm
not tackling that just yet.
Adds a set of tests that ensure destination systems can create materialized fields for Flow field
names with some pretty challenging characters.

The `materialize-motherduck` tests were enabled and ran as part of this, since they now work. Some
additional handling in the test helpers was added to support delta updates configurations, since
`materialize-motherduck` must currently always be set to delta updates.
Previously we would only normalize characters in index names that would be allowed by Flow
collection names, but that's not fully sufficient since the configured index name in the resource
configuration can be anything. It will usually be the collection name, but not always.

This extends normalization to all characters that pose problems in index names.
@williamhbaker williamhbaker merged commit 4a8cec2 into main Feb 22, 2024
46 of 49 checks passed
@williamhbaker williamhbaker deleted the wb/weird-columns branch February 22, 2024 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants