Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Arrow Extension types #9112

Open
kylebarron opened this issue May 29, 2023 · 12 comments
Open

Support for Arrow Extension types #9112

kylebarron opened this issue May 29, 2023 · 12 comments
Labels
A-dtype Area: data types in general enhancement New feature or an improvement of an existing feature wish Features we would ideally want to support, but not right now

Comments

@kylebarron
Copy link
Contributor

kylebarron commented May 29, 2023

Problem description

(Creating this as a stub after closing #4014)

Now that the Array data type has been implemented, there may be interest in supporting Arrow extension types, which allow storing other pieces of metadata on the column. Some examples:

@kylebarron kylebarron added the enhancement New feature or an improvement of an existing feature label May 29, 2023
@spenczar
Copy link

I am very interested in FixedShapeTensor support. In case it helps guide implementation, I'll explain more about how I use them, which is to store covariance matrixes.

These covariance matrixes are typically 6x6, in my case, but sometimes 3x3. I'm interested in getting their diagonal elements, and in selecting specific off-axis elements.

Nullability of individual elements is very important to me as well.

@ritchie46
Copy link
Member

We can store a metadata flag on the Array data-type. I don't feel much for different codepaths for extension typed arrays. This would really add a lot of complexity and code-bloat.

I have to check, but I also don't think that's needed. Keeping the meta-data around should be enough.

@spenczar
Copy link

As of 6fbc142, polars.from_arrow raises an exception when passed an extension array:

>>> import polars
>>> import pyarrow as pa
>>> import numpy as np
>>> tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
>>> polars.from_arrow(tensor_array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/swnelson/code/3p/polars/py-polars/polars/convert.py", line 605, in from_arrow
    data=pl.Series._from_arrow(name, data, rechunk=rechunk),
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/swnelson/code/3p/polars/py-polars/polars/series/series.py", line 325, in _from_arrow
    return cls._from_pyseries(arrow_to_pyseries(name, values, rechunk))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/swnelson/code/3p/polars/py-polars/polars/utils/_construction.py", line 178, in arrow_to_pyseries
    pys = PySeries.from_arrow(name, array)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
exceptions.ComputeError: cannot create series from Extension("arrow.fixed_shape_tensor", FixedSizeList(Field { name: "item", data_type: Int64, is_nullable: true, metadata: {} }, 3), Some("{\"shape\":[3]}"))

Similar issue if passed a pyarrow.Table which contains an Extension array:

>>> table = pa.table([tensor_array], names=["vals"])
>>> polars.from_arrow(table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/swnelson/code/3p/polars/py-polars/polars/convert.py", line 599, in from_arrow
    return pl.DataFrame._from_arrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/swnelson/code/3p/polars/py-polars/polars/dataframe/frame.py", line 612, in _from_arrow
    arrow_to_pydf(
  File "/Users/swnelson/code/3p/polars/py-polars/polars/utils/_construction.py", line 1338, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
exceptions.ComputeError: cannot create series from Extension("arrow.fixed_shape_tensor", FixedSizeList(Field { name: "item", data_type: Int64, is_nullable: true, metadata: {} }, 3), Some("{\"shape\":[3]}"))

@kylebarron
Copy link
Contributor Author

Just a note that the geospatial types would need to have metadata on both fixed-size and variable-size arrays, as a series of Polygons or LineStrings would only have the Array type as its most inner type.

@mschrader15
Copy link

I second the use case from #9112 (comment)

@kylebarron
Copy link
Contributor Author

Are there any thoughts about where the extension metadata should be stored in Polars? In Arrow, extension metadata is stored on the field (though arrow2 and presumably also polars-arrow has a DataType::Extension() variant which also can store extension metadata). In that case the metadata would be repeated for each chunk of a ChunkedArray/Series (though potentially under an Arc?).

Would it be preferred to store extension metadata on the ChunkedArray or on the Series? If it's stored on ChunkedArray, it would have to be added in more places in the code? I don't know whether adding it onto the Series itself would make sense, because you couldn't use it from the typed representation?

- pub struct Series(pub Arc<dyn SeriesTrait>);
+ pub struct Series(pub Arc<dyn SeriesTrait>, SeriesMetadata);

My motivation for extension types is to support extended logical types in pyo3-polars. It looks like those APIs always accept and return a Series, and so either approach would work for that.

@rok
Copy link

rok commented May 8, 2024

Arrow's growing canonical extension list (fixed shape tensor, variable shape tensor, UUID, JSON) would be another argument for adding extension types.

@adienes
Copy link

adienes commented Jul 6, 2024

the lack of support (and in particular, not ignoring unknown extensions) is making multi-language workflows more difficult. see for example apache/arrow-julia#508

@AlexanderNenninger
Copy link

AlexanderNenninger commented Aug 6, 2024

I recently wrote an issue to support such use cases: #15689. The idea is for Polars to provide very basic infrastructure and let the rest be handled by extensions and plugins.

@DeflateAwning
Copy link
Contributor

I'd be more than happy to try to take a stab at this. Can anyone give a guide on what's required though? Is this an insanely big undertaking? It's really too bad it's blocking extensions like GeoPolars.

@deanm0000
Copy link
Collaborator

@DeflateAwning I think the 0th step would be for @ritchie46 to say PRs welcome on an implementation since the last thing he said wasn't an endorsement.

If we're talking about blockers to geopolars, it isn't just extension types, it's also union types. Extension types let you store metadata (such as which geometry type, crs) while union types let you have a geometry column with more than just one geometry at a time.

Here's a little bit of extra info geoarrow/geoarrow-rs#731.

Note that since then the Arrow2Arrow trait has been deleted so it's even more difficult to round trip between the two.

Assuming the support was there, you could look at the implementation of Array #8943 and try to use that as a very rough template of how to add the new dtype. Both Union and Extension already exist in polars-arrow just as FixedSizeList already existed before the Array type was added.

@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 17, 2025

I doubt Polars would want to fully support union types, potentially ever, and given the complexity of union types I don't blame them.

I think the most realistic way forward is the LOAD-PASSTHROUGH option described in #7288 (comment), which would support arbitrary and any Arrow data types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dtype Area: data types in general enhancement New feature or an improvement of an existing feature wish Features we would ideally want to support, but not right now
Projects
None yet
Development

No branches or pull requests

10 participants