-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: DH-18143: Improve handling of sort order for Iceberg tables #6646
base: main
Are you sure you want to change the base?
feat: DH-18143: Improve handling of sort order for Iceberg tables #6646
Conversation
fe2f89f
to
67d1003
Compare
67d1003
to
54e1f08
Compare
This PR has three commits as follows:
I am not a fan of Commit 2 for two reasons (please review the PR so the following make more sense):
|
extensions/iceberg/src/main/java/io/deephaven/iceberg/location/IcebergTableParquetLocation.java
Show resolved
Hide resolved
extensions/iceberg/src/main/java/io/deephaven/iceberg/location/IcebergTableParquetLocation.java
Show resolved
Hide resolved
if (field.nullOrder() == NullOrder.NULLS_FIRST && field.direction() == SortDirection.ASC) { | ||
sortColumn = SortColumn.asc(columnName); | ||
} else if (field.nullOrder() == NullOrder.NULLS_LAST && field.direction() == SortDirection.DESC) { | ||
sortColumn = SortColumn.desc(columnName); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should raise the issue of null-first, nulls-last with the engine team. Arguably, this is something we should want to support.
Additionally, we may need to hold of on handling any floating point columns.
-NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN
, https://iceberg.apache.org/spec/#sorting
The -NaN
v NaN
is something I have not seen before, but another issue to raise w/ engine team.
In the meantime, I think the strategy of breaking and returning what we have so far should be OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @return A stream of {@link DataFile} objects. | ||
*/ | ||
public static Stream<DataFile> allDataFiles(@NotNull final Table table, @NotNull ManifestFile manifestFile) { | ||
return toStream(ManifestFiles.read(manifestFile, table.io())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recently learned that the files themselves may have metadata, ie org.apache.iceberg.ManifestReader#spec
. It makes me want to add caution extending these helper methods too far. While we aren't passing along ManifestReader#spec
today, we may need to in the future and might need to model it as appropriate.
extensions/iceberg/src/main/java/io/deephaven/iceberg/location/IcebergTableParquetLocation.java
Outdated
Show resolved
Hide resolved
private List<SortColumn> computeSortedColumns() { | ||
final Integer sortOrderId = dataFile.sortOrderId(); | ||
// If sort order ID is missing, unknown or unsorted, we fall back to reading sort columns from the parquet file | ||
if (sortOrderId == null) { | ||
return super.getSortedColumns(); | ||
} | ||
final SortOrder sortOrder = tableAdapter.icebergTable().sortOrders().get(sortOrderId); | ||
if (sortOrder == null || sortOrder.isUnsorted()) { | ||
return super.getSortedColumns(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an interesting question: if the metadata exists on the file itself, should we prefer it? I can imagine a case where we are setting more specific sort column information in the file itself.
For example, maybe Iceberg knows this table is sorted on columns [A, B]
, but the parquet metadata gives us more information that it is sorted on columns [A, B, C]
.
There's also an argument to be made that we should completely ignore the metadata from the file itself, and only rely on Iceberg. This saves us from needing to materialize the parquet file metadata (at least from this code path). In particular, if Iceberg explicitly gives us back sortOrder.isUnsorted()
, maybe we should be okay just returning an empty list?
It's also possible that we want this to be configurable... it's not obvious to me what the best course of action is.
Related to DH-18143