Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sedona arrow udf example #1859

Draft
wants to merge 15 commits into
base: master
Choose a base branch
from
Draft

Sedona arrow udf example #1859

wants to merge 15 commits into from

Conversation

Imbruced
Copy link
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

  • Yes, and the PR name follows the format [GH-XXX] my subject.

  • No:

    • this is a documentation update. The PR name follows the format [DOCS] my subject
    • this is a CI update. The PR name follows the format [CI] my subject

What changes were proposed in this PR?

How was this patch tested?

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation.
  • No, this PR does not affect any public API so no need to change the documentation.

@Imbruced
Copy link
Member Author

need to adjust it before I ll reopen it again

@Imbruced Imbruced reopened this Mar 16, 2025
@Imbruced
Copy link
Member Author

need to add docs for this one


val batchIter = if (batchSize > 0) new BatchIterator(iter, batchSize) else Iterator(iter)

val columnarBatchIter = new ArrowPythonRunner(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a battle for this particular PR, but do we get to choose what the Python function is evaluating on or are we leaning on built-in Spark things such that we are forced to have this be a function of a pandas series? (if it could be a function of, for example, two numpy arrays for points or Arrow buffers more generally, it would open up some options in terms of speed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not super opting this solution as well. I just wanted to unlock the arrow based udf in Sedona. I totally agree that we can do better. Right now based on my internal tests it's 2 times faster than normal udf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! At some point my Spark/Scala will be good enough to see if there's any room to improve on that 🙂

Copy link
Member Author

@Imbruced Imbruced Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to help on that 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants