-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Arrow Extension types #9112
Comments
I am very interested in FixedShapeTensor support. In case it helps guide implementation, I'll explain more about how I use them, which is to store covariance matrixes. These covariance matrixes are typically 6x6, in my case, but sometimes 3x3. I'm interested in getting their diagonal elements, and in selecting specific off-axis elements. Nullability of individual elements is very important to me as well. |
We can store a metadata flag on the I have to check, but I also don't think that's needed. Keeping the meta-data around should be enough. |
As of 6fbc142, >>> import polars
>>> import pyarrow as pa
>>> import numpy as np
>>> tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
>>> polars.from_arrow(tensor_array)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/swnelson/code/3p/polars/py-polars/polars/convert.py", line 605, in from_arrow
data=pl.Series._from_arrow(name, data, rechunk=rechunk),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/swnelson/code/3p/polars/py-polars/polars/series/series.py", line 325, in _from_arrow
return cls._from_pyseries(arrow_to_pyseries(name, values, rechunk))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/swnelson/code/3p/polars/py-polars/polars/utils/_construction.py", line 178, in arrow_to_pyseries
pys = PySeries.from_arrow(name, array)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
exceptions.ComputeError: cannot create series from Extension("arrow.fixed_shape_tensor", FixedSizeList(Field { name: "item", data_type: Int64, is_nullable: true, metadata: {} }, 3), Some("{\"shape\":[3]}")) Similar issue if passed a pyarrow.Table which contains an Extension array: >>> table = pa.table([tensor_array], names=["vals"])
>>> polars.from_arrow(table)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/swnelson/code/3p/polars/py-polars/polars/convert.py", line 599, in from_arrow
return pl.DataFrame._from_arrow(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/swnelson/code/3p/polars/py-polars/polars/dataframe/frame.py", line 612, in _from_arrow
arrow_to_pydf(
File "/Users/swnelson/code/3p/polars/py-polars/polars/utils/_construction.py", line 1338, in arrow_to_pydf
pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
exceptions.ComputeError: cannot create series from Extension("arrow.fixed_shape_tensor", FixedSizeList(Field { name: "item", data_type: Int64, is_nullable: true, metadata: {} }, 3), Some("{\"shape\":[3]}")) |
Just a note that the geospatial types would need to have metadata on both fixed-size and variable-size arrays, as a series of |
I second the use case from #9112 (comment) |
Are there any thoughts about where the extension metadata should be stored in Polars? In Arrow, extension metadata is stored on the field (though arrow2 and presumably also polars-arrow has a Would it be preferred to store extension metadata on the ChunkedArray or on the Series? If it's stored on ChunkedArray, it would have to be added in more places in the code? I don't know whether adding it onto the Series itself would make sense, because you couldn't use it from the typed representation? - pub struct Series(pub Arc<dyn SeriesTrait>);
+ pub struct Series(pub Arc<dyn SeriesTrait>, SeriesMetadata); My motivation for extension types is to support extended logical types in pyo3-polars. It looks like those APIs always accept and return a |
Arrow's growing canonical extension list (fixed shape tensor, variable shape tensor, UUID, JSON) would be another argument for adding extension types. |
the lack of support (and in particular, not ignoring unknown extensions) is making multi-language workflows more difficult. see for example apache/arrow-julia#508 |
I recently wrote an issue to support such use cases: #15689. The idea is for Polars to provide very basic infrastructure and let the rest be handled by extensions and plugins. |
I'd be more than happy to try to take a stab at this. Can anyone give a guide on what's required though? Is this an insanely big undertaking? It's really too bad it's blocking extensions like GeoPolars. |
@DeflateAwning I think the 0th step would be for @ritchie46 to say PRs welcome on an implementation since the last thing he said wasn't an endorsement. If we're talking about blockers to geopolars, it isn't just extension types, it's also union types. Extension types let you store metadata (such as which geometry type, crs) while union types let you have a geometry column with more than just one geometry at a time. Here's a little bit of extra info geoarrow/geoarrow-rs#731. Note that since then the Arrow2Arrow trait has been deleted so it's even more difficult to round trip between the two. Assuming the support was there, you could look at the implementation of Array #8943 and try to use that as a very rough template of how to add the new dtype. Both Union and Extension already exist in polars-arrow just as FixedSizeList already existed before the Array type was added. |
I doubt Polars would want to fully support union types, potentially ever, and given the complexity of union types I don't blame them. I think the most realistic way forward is the |
Problem description
(Creating this as a stub after closing #4014)
Now that the
Array
data type has been implemented, there may be interest in supporting Arrow extension types, which allow storing other pieces of metadata on the column. Some examples:arrow.fixed_shape_tensor
which store shape and array orderThe text was updated successfully, but these errors were encountered: