Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The shape detector lens is not producing sensible caveats #363

Closed
okennedy opened this issue Jan 3, 2020 · 1 comment
Closed

The shape detector lens is not producing sensible caveats #363

okennedy opened this issue Jan 3, 2020 · 1 comment
Assignees

Comments

@okennedy
Copy link
Member

okennedy commented Jan 3, 2020

No description provided.

@okennedy okennedy added the bug label Jan 3, 2020
@okennedy okennedy self-assigned this Jan 3, 2020
@okennedy okennedy added enhancement and removed bug labels Jan 4, 2020
@okennedy
Copy link
Member Author

okennedy commented Jan 4, 2020

Ok... so this was a case of reality (and me) being stupider than I'd originally expected.

The specific error being seen was based on the NYC Cause of Death test files (now in the repo under /test/NYC_CoD). These two files come from the NYS open data portal and include data for 2008-2014 and 2008-2016 respectively. When given to the shape detector:

  • The shape detector inconsistently reports no nulls in the DEATHS column (which contains . characters for some lines) in the first dataset:
    • The second data instead reports a small %age of nulls in the same column
    • Two other columns: DEATH_RATE and AGE_ADJUSTED_DEATH_RATE also contain . characters that are reported properly as nulls.
  • The shape detector does not report the M/F -> M/F/Male/Female or the White Non-Hispanic/Black Non-Hispanic -> Non-Hispanic White/Non-Hispanic Black errors in the 2nd dataset.

🤦‍♂️

These errors result from a combination of several issues that should now be resolved (as of 8a37630)

CAST Inconsistency

Spark's CAST operation evaluates CAST('.' AS bigint) to 0, while Mimir's evaluates it to NULL. Curiously, CAST('.' AS double) evaluates to NULL in Spark as well...

The inconsistency between Mimir and Spark needs to be fixed (#364), but it was particularly pronounced because evaluation could be shared between Mimir and Spark. Due to an old optimization: Mimir would take over evaluation of the final stages of a query, since these usually had Scala UDFs and SQLite had rather poor performance due to repeated crossing of the JVM boundary.

I simplified the compiler pipeline, removing this unnecessary optimization, and behavior of the Null-ish facets should now be more stable and in particular, users shouldn't see query results that differ based on query complexity.

Legitimate Data Errors

So... it turns out that in the 2015/2016 data dump, NYC added blank cells as missing values. These are universally interpreted as NULL by Mimir and Spark, so the 2nd dataset actually had a mix of '.'s and ''s in the DEATHS column, and actually did legitimately have nulls where the first dataset did not.

Senility

I could have sworn that there were domain facets already implemented. I could have also sworn we had an oxfordComma method in StringUtils. Apparently, I was wrong on both counts... at least until now. There are two new facet types: DrawnFromRange locks in the min/max values of sequential-typed column and DrawnFromDomain locks in the set of distinct values of any string-typed column with fewer than 20 distinct values. This now correctly handles the two test data files.

@okennedy okennedy closed this as completed Jan 4, 2020
okennedy added a commit that referenced this issue Jan 4, 2020
Previously, ShapeWatcher Caveats would all get folded in under the same model.  Now, they get expanded out: each facet gets its own model (although if a single facet produces multiple warnings they'll get folded together).  (Should fix #363 fully with respect to the Vizier UI)

Also a small bugfix in the stringification of the DrawnFromRange facet warning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant