Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: DH-18143: Improve handling of sort order for Iceberg tables #6646

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Feb 13, 2025

Related to DH-18143

@malhotrashivam malhotrashivam added this to the 0.38.0 milestone Feb 13, 2025
@malhotrashivam malhotrashivam self-assigned this Feb 13, 2025
@malhotrashivam malhotrashivam marked this pull request as draft February 13, 2025 21:54
@malhotrashivam malhotrashivam force-pushed the nightly/sm-sort-order branch 2 times, most recently from fe2f89f to 67d1003 Compare February 14, 2025 19:04
@malhotrashivam malhotrashivam removed the request for review from devinrsmith February 14, 2025 19:04
@malhotrashivam
Copy link
Contributor Author

This PR has three commits as follows:

  • Commit1: If the manifest file indicates that the data file is sorted by specific columns, we should return those column names in TableLocation::getSortedColumns. This can help later for pushdown predicate-style filtering.
  • Commit2: If a Deephaven table being written is sorted on a column, we should set the corresponding sort order in the manifest file.
  • Commit3: If the table has a default sort order, we should allow user to opt into the sorting on default sort columns during writing, or opt out.

I am not a fan of Commit 2 for two reasons (please review the PR so the following make more sense):

  • Commit 2 adds extra complexity to iceberg writing code, which is not needed right now.
  • Commit 2 can only add a single column name to the data file, since we have a constraint (https://deephaven.atlassian.net/browse/DH-18700). Now a single column name is also added to the parquet file as a sort column. Therefore, even if we don't add the column name to data file, deephaven would still be read as a sort column through the parquet reading code and all tests will pass. Although, other iceberg reading tools will not know about such sort columns if we don't add it to the data file.
    So I think we can remove commit 2 from this PR.

@malhotrashivam malhotrashivam marked this pull request as ready for review February 14, 2025 20:54
Comment on lines +75 to +79
if (field.nullOrder() == NullOrder.NULLS_FIRST && field.direction() == SortDirection.ASC) {
sortColumn = SortColumn.asc(columnName);
} else if (field.nullOrder() == NullOrder.NULLS_LAST && field.direction() == SortDirection.DESC) {
sortColumn = SortColumn.desc(columnName);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should raise the issue of null-first, nulls-last with the engine team. Arguably, this is something we should want to support.

Additionally, we may need to hold of on handling any floating point columns.

-NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN, https://iceberg.apache.org/spec/#sorting

The -NaN v NaN is something I have not seen before, but another issue to raise w/ engine team.

In the meantime, I think the strategy of breaking and returning what we have so far should be OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* @return A stream of {@link DataFile} objects.
*/
public static Stream<DataFile> allDataFiles(@NotNull final Table table, @NotNull ManifestFile manifestFile) {
return toStream(ManifestFiles.read(manifestFile, table.io()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently learned that the files themselves may have metadata, ie org.apache.iceberg.ManifestReader#spec. It makes me want to add caution extending these helper methods too far. While we aren't passing along ManifestReader#spec today, we may need to in the future and might need to model it as appropriate.

Comment on lines 59 to 68
private List<SortColumn> computeSortedColumns() {
final Integer sortOrderId = dataFile.sortOrderId();
// If sort order ID is missing, unknown or unsorted, we fall back to reading sort columns from the parquet file
if (sortOrderId == null) {
return super.getSortedColumns();
}
final SortOrder sortOrder = tableAdapter.icebergTable().sortOrders().get(sortOrderId);
if (sortOrder == null || sortOrder.isUnsorted()) {
return super.getSortedColumns();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting question: if the metadata exists on the file itself, should we prefer it? I can imagine a case where we are setting more specific sort column information in the file itself.

For example, maybe Iceberg knows this table is sorted on columns [A, B], but the parquet metadata gives us more information that it is sorted on columns [A, B, C].

There's also an argument to be made that we should completely ignore the metadata from the file itself, and only rely on Iceberg. This saves us from needing to materialize the parquet file metadata (at least from this code path). In particular, if Iceberg explicitly gives us back sortOrder.isUnsorted(), maybe we should be okay just returning an empty list?

It's also possible that we want this to be configurable... it's not obvious to me what the best course of action is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants