Support for PyArrow's ExtensionType. #15213

Closed
caseyclements opened this issue Mar 21, 2024 · 4 comments
Labels: A-io (Area: reading and writing data), enhancement (New feature or an improvement of an existing feature)

caseyclements commented Mar 21, 2024

Description

Add support for PyArrow Extension Types.

Context / Motivation

Here are some details on extending pyarrow.

The following binary example comes straight from the PyArrow documentation. We recently added Polars support to PyMongoArrow, and used this to create an ObjectIdType. Its implementation is almost identical to the example.

In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key.
If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.
See the MongoDB documentation on ObjectIds.

Until Polars supports pyarrow.ExtensionTypes, we must cast them to their base Arrow classes.

To reproduce the issue:

import pyarrow as pa
import polars as pl
import uuid


class UuidType(pa.ExtensionType):
    """For example, we could define a custom UUID type for 128-bit numbers
    which can be represented as FixedSizeBinary type with 16 bytes
    """

    def __init__(self):
        super().__init__(pa.binary(16), "my_package.uuid")

    def __arrow_ext_serialize__(self):
        # Since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Sanity checks, not required but illustrate the method signature.
        assert storage_type == pa.binary(16)
        assert serialized == b''
        # Return an instance of this subclass given the serialized
        # metadata.
        return UuidType()


uuid_type = UuidType()
storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
extension_arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)

print(f"{pl.from_arrow(storage_array) = }")
try:
    print(f"{pl.from_arrow(extension_arr) = }")
except pl.exceptions.ComputeError as exc:
    print(f"{exc = }")

caseyclements added the enhancement label Mar 21, 2024

caseyclements commented Mar 21, 2024

Here's the output of the example above.
See also test_polars.py in pymongoarrow.

(polars) ~/src/mongo-arrow/sandbox (main)
$ python polars_extensiontypes.py
pl_arr = shape: (4,)

Series: '' [binary]
[
	b"\xd1\x91\x0b\xae\xad&J\x99\xb0\x1f\xfe\x04E\xd0}\x99"
	b"\xf4\\%\xe4\xc7\xeeMA\xb6\xd4s\xd8d\xaa8R"
	b"\xfb\xd3aa\x87\xf5M\xda\xae\x8e\xf1\xa5]\x04\xb4E"
	b"\x08G\xcc\x86\x1cW@\xf8\x8c:X\x9e#=\xa2F"
]
exc = ComputeError('cannot create series from Extension("my_package.uuid", FixedSizeBinary(16), Some(""))')

alexander-beedie added the A-io label Mar 21, 2024

deanm0000 (Collaborator) commented

I would like to see this. I'm not sure whether it is the same as #9112. It is definitely not the same as #9373, but the two are highly related.

I'd also like to see this in conjunction with Union Types (#10827).


caseyclements commented Mar 22, 2024

I didn't see #9112 when I did a search of existing issues. This is a duplicate.

caseyclements (Author) commented

Closing as a duplicate of #9112.

caseyclements closed this as not planned Mar 22, 2024