materializations: various enhancements for handling problematic field names #1266

williamhbaker · 2024-02-15T15:37:25Z

Description:

Various enhancements for handling problematic field names in materializations that manage the equivalent of "columns" in their destination systems.

A few highlights:

materialize-elasticsearch will sanitize field names by appending _flow to them, in cases where they conflict with ES pre-deefined "Metadata" fields and would not otherwise be able to be materialized
materialize-databricks will sanitize field names that contain one of the set of characters that are always disallowed for column names, no matter how hard to you try to quote or escape them
materialize-mysql will sanitize field names under some less common circumstances...characters with high code points, and trailing spaces
Fixed a systemic issue with escaping identifier quotes within quoted identifiers for SQL materializations
Better handling for "ambiguous" field names, where the destination system preserves capitalization but still doesn't allow columns to co-exist that differ only in capitalization - this is actually pretty common

See individual commit messages for more details.

I also did some manual testing with the absurd schema used for the validate & apply tests to make sure that we could actually materialize data to tables with these columns for all materializations (including storing and loading). And I did a bonus test where a resource path was configured with a table name of " ,;{}().- problematic#Table � 𐀀 嶲 " and all materializations worked with that except materialize-mysql, which may suggest that there is still some limitation with how MySQL itself or the client driver handles weird characters. It doesn't seem worth spending a ton of time trying to figure out why this is right now, but we'll probably have to handle it eventually.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)

This change is

mdibaiee · 2024-02-21T11:12:10Z

materialize-elasticsearch/driver.go

+		// also truncate index names that contain a '#' character to drop everything including &
+		// after that '#' character, so we'll normalize those too.
+		if slices.Contains([]rune{'*', '<', '"', ' ', '\\', '/', ',', '|', '>', '?', ':', '#'}, r) {
+			if !afterPrefix {


nit: is this on a separate line due to line length?

It's a separate condition, since index names can't begin with an underscore and if a proposed index name has a disallowed character at the beginning it won't be valid to replace it with an underscore that would be at the beginning.

But looking at this now I think there's a simpler way to factor this that removes the second nested check of if !afterPrefix so I'll push that change.

mdibaiee · 2024-02-21T11:23:22Z

materialize-snowflake/sqlgen.go

+		TypeMapper:             mapper,
+		ColumnValidator:        columnValidator,
+		MaxColumnCharLength:    255,
+		CaseInsensitiveColumns: false,


I'm under the impression that in Snowflake unless the identifier is quoted, the fields will be case-insensitive. Does that not mean we should set this to true?

CaseInsensitiveColumns (which is a proxy for caseInsensitiveFields in the materialize-boilerplate validator) means that even if the fields are quoted, the destination system will still not allow more than one column with the same name that differs only in capitalization. Snowflake does allow "thisField" vs "thisfield", so having it be false represents that. Admittedly I'm finding it hard to come up with a variable name that succinctly represents what this toggle does, so let me know if you have suggestions on something that would be more clear.

…rialized fields Many systems will preserve capitalization in the names of materialized fields (for example, columns of a SQL table), but still will not allow columns with names that differ only in capitalization to co-exist within the same table. To prevent errors during `Apply`, an additional configuration is provided for `Validate` to consider field names to be case insensitive for the purposes of determining ambiguous columns. This is relevant for materializations to BigQuery, Databricks, MotherDuck, MySQL, and SQLServer. For SQL materializations, the configuration is made available through `materialize-sql` as a new configuration of the `sql.Dialect`. I also moved the `MaxColumnCharLength` (name adjusted slightly) out of the `sql.Endpoint` configuration and into the `sql.Dialect`, since it seemed to thematically fit better there alongside the case-sensitivity configuration.

There was a systemic logic error in how we were escaping quote characters within a quoted identifier. This fixes a number of those. Databricks and MySQL will be fixed in later commits which also change how they are sanitizing identifiers.

MySQL does not allow identifiers to have trailing spaces or characters with code points greater than U+FFFF. (ref https://dev.mysql.com/doc/refman/8.0/en/identifiers.html). These disallowed characters will be sanitized with an underscore.

Databricks has a set of characters that are strictly disallowed for column names, regardless of how their identifiers are quoted. These characters will be sanitized to underscores. Databricks also has a different, not entirely overlapping set of disallowed characters for table names. These will be sanitized as part of the resource path response, rather than returning an error from `Validate`. This will allow table names with spaces, which may be a common occurrence for Databricks materializations linked to source captures that will automatically add new bindings without user involvement in what the configured name for the tables are.

…e-defined "Metadata" fields ElasticSearch has a number of pre-defined "Metadata" fields (see https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html) that exist for every index, and trying to create an index with a separate mapping with any of these names will fail with a "field is defined more than once" sort of error. All we can really do is differentiate our Flow field by appending "_flow" to the name of the field to allow the field to be created. Doing this kind of sanitization is tricky because there are numerous places it has to be included for things to work out end-to-end without a higher-level framework like `materialize-sql`. And such a general framework may actually make sense to construct for materializations in general, but I'm not tackling that just yet.

Adds a set of tests that ensure destination systems can create materialized fields for Flow field names with some pretty challenging characters. The `materialize-motherduck` tests were enabled and ran as part of this, since they now work. Some additional handling in the test helpers was added to support delta updates configurations, since `materialize-motherduck` must currently always be set to delta updates.

Previously we would only normalize characters in index names that would be allowed by Flow collection names, but that's not fully sufficient since the configured index name in the resource configuration can be anything. It will usually be the collection name, but not always. This extends normalization to all characters that pose problems in index names.

williamhbaker changed the title ~~materializations: various enhancements for handling problematic column names~~ materializations: various enhancements for handling problematic field names Feb 15, 2024

williamhbaker force-pushed the wb/weird-columns branch 2 times, most recently from dfe5736 to 4d9d8cc Compare February 15, 2024 20:30

williamhbaker marked this pull request as ready for review February 15, 2024 20:51

williamhbaker requested a review from mdibaiee February 15, 2024 20:51

williamhbaker force-pushed the wb/weird-columns branch from 4d9d8cc to fce52d1 Compare February 16, 2024 21:06

mdibaiee reviewed Feb 21, 2024

View reviewed changes

williamhbaker added 7 commits February 21, 2024 09:44

materialize-mysql: sanitize certain field names

a08494f

MySQL does not allow identifiers to have trailing spaces or characters with code points greater than U+FFFF. (ref https://dev.mysql.com/doc/refman/8.0/en/identifiers.html). These disallowed characters will be sanitized with an underscore.

williamhbaker force-pushed the wb/weird-columns branch from fce52d1 to 7f0e8ea Compare February 21, 2024 14:45

mdibaiee approved these changes Feb 22, 2024

View reviewed changes

williamhbaker merged commit 4a8cec2 into main Feb 22, 2024
46 of 49 checks passed

williamhbaker deleted the wb/weird-columns branch February 22, 2024 14:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

materializations: various enhancements for handling problematic field names #1266

materializations: various enhancements for handling problematic field names #1266

williamhbaker commented Feb 15, 2024 •

edited

Loading

mdibaiee Feb 21, 2024

williamhbaker Feb 21, 2024

mdibaiee Feb 21, 2024

williamhbaker Feb 21, 2024

materializations: various enhancements for handling problematic field names #1266

materializations: various enhancements for handling problematic field names #1266

Conversation

williamhbaker commented Feb 15, 2024 • edited Loading

mdibaiee Feb 21, 2024

Choose a reason for hiding this comment

williamhbaker Feb 21, 2024

Choose a reason for hiding this comment

mdibaiee Feb 21, 2024

Choose a reason for hiding this comment

williamhbaker Feb 21, 2024

Choose a reason for hiding this comment

williamhbaker commented Feb 15, 2024 •

edited

Loading