Allow reading IPC files that use Arrow extension type #9373

Open
GPSnoopy opened this issue Jun 14, 2023 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature


Problem description

This is partially related to #9112 but much narrower in scope.

According to the official Apache Arrow specification, an implementation that reads an Arrow file containing an extension type it does not recognize should fall back to the underlying (parent) type. The specification says:

This extension metadata can annotate any of the built-in Arrow logical types. The intent is that an implementation that does not support an extension type can still handle the underlying data. For example a 16-byte UUID value could be embedded in FixedSizeBinary(16), and implementations that do not have this extension type can still work with the underlying binary values and pass along the custom_metadata in subsequent Arrow protocol messages.

Currently, using LazyFrame::scan_ipc() panics at the following line:

dt => panic!("Arrow datatype {dt:?} not supported by Polars. You probably need to activate that data-type feature."),

I suspect that the following

            ArrowDataType::Extension(name, _, _) if name == "POLARS_EXTENSION_TYPE" => {
                #[cfg(feature = "object")]
                {
                    DataType::Object("extension")
                }
                #[cfg(not(feature = "object"))]
                {
                    panic!("activate the 'object' feature to be able to load POLARS_EXTENSION_TYPE")
                }
            }

should be followed by something like

            ArrowDataType::Extension(_, _, _) => dt.to_logical_type().into(),

which would cause the function to recurse with the parent logical type when encountering an unknown extension.
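
The fallback described above can be sketched in isolation. This is a minimal illustration only, not the actual Polars or arrow2 code: `ArrowType` and `FrameType` are hypothetical, simplified stand-ins for `ArrowDataType` and the Polars `DataType`, and the `to_logical_type` helper here merely mimics the behaviour assumed of the real method (stripping extension wrappers until a built-in type is reached).

```rust
// Hypothetical, simplified stand-in for the Arrow data type enum.
#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Boolean,
    Int64,
    FixedSizeBinary(usize),
    // (extension name, storage type, optional serialized metadata)
    Extension(String, Box<ArrowType>, Option<String>),
}

// Hypothetical, simplified stand-in for the dataframe-side data type enum.
#[derive(Debug, Clone, PartialEq)]
enum FrameType {
    Boolean,
    Int64,
    Binary,
}

impl ArrowType {
    // Strip any number of nested extension wrappers and return the storage type.
    fn to_logical_type(&self) -> &ArrowType {
        match self {
            ArrowType::Extension(_, inner, _) => inner.to_logical_type(),
            other => other,
        }
    }
}

impl From<&ArrowType> for FrameType {
    fn from(dt: &ArrowType) -> Self {
        match dt {
            ArrowType::Boolean => FrameType::Boolean,
            ArrowType::Int64 => FrameType::Int64,
            ArrowType::FixedSizeBinary(_) => FrameType::Binary,
            // Unknown extension: convert the storage type instead of panicking.
            ArrowType::Extension(_, _, _) => dt.to_logical_type().into(),
        }
    }
}
```

With these toy types, a `uuid` extension wrapping `FixedSizeBinary(16)` converts to `FrameType::Binary`, and the extension name is simply ignored. Whether the real conversion should also preserve the extension metadata is a separate design question that this sketch does not address.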

GPSnoopy added the enhancement label Jun 14, 2023

GPSnoopy commented Jun 15, 2023

I have created two PRs:

The Arrow2 PR needs to be merged first so that I can amend the Polars PR to depend on the official Arrow2 repository again.


GPSnoopy commented Jun 20, 2023

The above does not seem to work for extension types whose underlying type is Boolean. I am not sure why; I can't see any special code path for it.

The arrow2 PR is also not perfect. When the logical type being cast to already matches, no conversion is performed, but the array's data type is also not updated to the target data type. (I do not think this is why Boolean fails, though, since other types do work.)

GPSnoopy commented

To be honest, I have not been able to get extension types to work smoothly with arrow2 and Polars. It is a shame because, according to the documentation, extension types are the proper way to extend existing types. For the time being, I am just adding extra metadata to the fields instead.

I'll close the PRs (they are still there in case anyone wants to pick up where I left off), but I'm leaving this ticket open as a feature request.
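
For readers considering the same workaround, the idea of tagging fields with extra metadata can be sketched as follows. This is only an illustration under assumptions: the `Field` struct, the `with_annotation` and `semantic_type` helpers, and the `myapp:semantic_type` key are all hypothetical and are not taken from arrow2, Polars, or the closed PRs.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified field description; real Arrow libraries define their own Field type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    // Built-in storage type, written as a string to keep the sketch small.
    data_type: String,
    // Free-form key/value annotations carried alongside the field.
    metadata: BTreeMap<String, String>,
}

impl Field {
    fn new(name: &str, data_type: &str) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
            metadata: BTreeMap::new(),
        }
    }

    // Builder-style helper that attaches one key/value annotation.
    fn with_annotation(mut self, key: &str, value: &str) -> Self {
        self.metadata.insert(key.to_string(), value.to_string());
        self
    }

    // Look up the application-level type tag (hypothetical key), if one was attached.
    fn semantic_type(&self) -> Option<&str> {
        self.metadata.get("myapp:semantic_type").map(String::as_str)
    }
}
```

Because the field keeps a built-in data type, a reader that knows nothing about the annotation still sees ordinary data, while the producing application can recover its own tag with `semantic_type()`.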
