Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.concat handle multiple integer, float and null types #812

Merged
merged 4 commits into from
Jan 17, 2024

Conversation

lkarthee
Copy link
Member

@lkarthee lkarthee commented Jan 11, 2024

This is an attempt handle concatenation of multiple {:s, _}, {:u, _}, {:f, _} and new null type. Two issues are being addressed:

  • type casting / promotion
  • concatenation involving a :null series

Concatenation of null series

  • Type ++ Null - allowed
  • Null ++ Type - allowed
  • Null ++ Null - allowed

Type casting

Type casting is allowed only when there is no loss of information.

  • When same numeric types are involved, it is casted to largest type among the series. For example, {:s, 8}, {:s, 16} -> {:s, 8} will be casted to {:s, 16} . In case of {:u, 16}, {:u, 32}, {:u, 16} will be casted to {:u, 32}
  • When different numeric types are involved, it is casted to {:f, 64}. I have not changed the existing behaviour. Will update this behaviour later.
  • numeric and other types - error

As the discussion of type casting is going on, I have not written tests.

@lkarthee
Copy link
Member Author

ss = Series.from_list(["a", "b", "c"])
sn = Series.from_list([nil, nil, nil], dtype: :null)
s8 = Series.from_list([1, 2, 3], dtype: {:s, 8})
s16 = Series.from_list([16, 17, 18], dtype: {:s, 16})
f32 = Series.from_list([32.0, 33.0, 34.0], dtype: {:f, 32})
f64 = Series.from_list([64.0, 65.0, 6.4], dtype: {:f, 64})
u8 = Series.from_list([10, 11, 12], dtype: {:u, 8})
Series.concat([s8, sn])
Series.concat([ss, sn]) 
# Series.concat([ss, s8]) # fails 
# Series.concat([ss, f32]) # fails 
Series.concat([sn, f32])
Series.concat([f32, sn])
Series.concat([sn, s8, s16])
Series.concat([sn, s8, f32])
Series.concat([u8, s8])  # cast to {:f, 64}

@josevalim
Copy link
Member

Polars logic is that it must preserve the type on the left side, rather than trying to find a magic type that satisfies both sides, so it performs fewer casting. I'd say that's a sane we can choose to mimic if we want to. Overall, we have to choose between:

  1. Define our own casting logic that aims to unify both sides (this PR)
  2. Align with Polars and document that the type of the left side is preserved

Any opinions?

@lkarthee
Copy link
Member Author

lkarthee commented Jan 11, 2024

Null ++ Type and Type ++ Null can be done. We can cast to a same higher type if there is no loss of information. For example, {:s, _} to {:u, _} casting should be explicit.

I am not sure about casting from ints to floats.

@cigrainger
Copy link
Member

I'm definitely tempted to cast in such a way that we preserve information. That being said, it can end up being surprising, even when well documented. There's something to be said for the simplicity of 'always left side' vs. 'here's a hierarchy'. I'm not entirely sure. I think the hierarchy/preservation approach can become confusing if, say, you're reducing over a number of concatenations. But the opposite is also true: let's say you have a number of Series and one of them has default to null because it was empty and you didn't know (this happens plenty for me). I would find it surprising and frustrating if I was getting a null type at the end of some function and I couldn't figure out why.

All of this is to say I think there are pros and cons both ways and I'm just not sure. How complex do we think juggling a hierarchy will be going forward?

@billylanchantin
Copy link
Member

I agree with @cigrainger. I think I could be swayed in either direction, but I lean toward permissive casting.

We had to make a similar call with the duration types. For example we decided that:

  • {:duration, :millisecond} + {:duration, :nanosecond} = {:duration, :nanosecond}

We enforced this on our side:

defp enforce_highest_precision([
%Series{dtype: {left_base, left_timeunit}} = left,
%Series{dtype: {right_base, right_timeunit}} = right
])

However, my pragmatic side says that following Polars will be easier in the long term. Left-side-wins is a simple take that's easy to apply.

I found this comment on my issue about duration + date illuminating:

I view time + duration as the equivalent of time.offset_by(duration), which is not commutative obviously--you cannot offset a duration by a time. A date/datetime is a position on the time axis and a duration is a movement along the time axis.

Some (many?) Polars Rust maintainers view operations as main_object.operation(arg) instead of operation(arg1, arg2). However, the Polars Python maintainers seem to have the opposite take.

As per @alexander-beedie comment, the python side is already commutative. So it makes sense to have Rust and Python behavior be the same. And according to the "principle of least surprise" keeping addition commutative makes sense even though duration + date is yucky.

(Possibly because things like a + b in Python are well known to actually be __add__(a, b).)

So ultimately my issue got addressed because the Python API had set precedent. But it remains the case that left-side-wins is their default take, so we'll likely have to fight this battle on many fronts if we choose to.

@alexander-beedie
Copy link

alexander-beedie commented Jan 11, 2024

FYI: I added a little additional context here ;)
pola-rs/polars#11190 (comment)

@billylanchantin
Copy link
Member

Whoops I didn't mean to summon you, @alexander-beedie! But thank you for the extra context! And of course all the work on Polars, which we get to benefit from :)

@philss
Copy link

philss commented Jan 12, 2024

I was following the approach of casting to preserve information for when the user is mixing different numeric dtypes in #794.
I notice that Polars (Rust) does some casting - for example, the subtraction u8 - f64 of series results in a f64 series -, but sometimes I have to do it myself - in the case of u8 - u8, I have to cast the left side to signed int in order to perform the operation.

However, my pragmatic side says that following Polars will be easier in the long term. Left-side-wins is a simple take that's easy to apply.

I agree with @billylanchantin here! I think in the long term, it will be probably easier to maintain. But I have my doubts if this is going to lead to a pleasant experience for the users. My take - and it's not a strong opinion - is that we would want to keep casting for binary operations (like add or subtract), but we would keep APIs like concat/1 more restrictive, requiring the user to explicitly cast the series to a single dtype.

But yeah, I agree with @cigrainger that this can be confusing even if we document it - and probably more confusing mixing up the behaviours like I'm suggesting here 😅. But it may be a nicer experience IMHO.

@lkarthee
Copy link
Member Author

lkarthee commented Jan 12, 2024

Here are the main themes from the above comments:

  • Are operations commutative ? a ++ b = b ++ a
  • Surprise element or unintuitive to user.
  • Preserve information.
  • Complexity of rules.
  • Maintainability w.r.to Polars alignment.
  • Precedent for diverging from polars.
  • Effort required to implement and maintain say commutative for null & type casting.

My opinion:

  • For a user coming from elixir, it might surprise that b ++ a is not allowed ?
  • Say if nil is valid in a column, but when appending order matters for a list of nils is unintuitive.
  • Casting in this case does not result in any loss of info.
  • Same type casting to a higher rank should be allowed for convenience.
  • Can be restrictive w.r.to casting to a different numeric type {:s, n} to {:u, n} or {:s, n} to {:f, n} or {:u, n} to {:f, n} and vice versa.
  • There is a precedent of diverging from polars - date time and duration.
  • Can there be a flag to disallow/opt out of the divergent behaviour ?

@cigrainger
Copy link
Member

cigrainger commented Jan 12, 2024

Good feedback from a trusted friend who is a heavy R/dplyr user: "math not being commutative would break my %!&@ brain".

@billylanchantin
Copy link
Member

@cigrainger I agree with your friend! Although ++ isn't commutative, it is commutative WRT the output type (or ought to be IMHO).

@lkarthee Excellent summary.

Can there be a flag to disallow/opt out of the divergent behaviour ?

I vote no because 1) it would be a maintenance burden and 2) you can use explicit casting to get a different behavior if you want.

@billylanchantin
Copy link
Member

@philss I just re-read your comment. To clarify, I think we could diverge from Polars and cast. I was trying to outline the pitfalls of that decision in my comment, but I can see how it came off as me arguing for keeping in line with Polars. Sorry if I muddled the convo!

@josevalim
Copy link
Member

It seems everyone is in favor of this change, so let's go ahead! Only tests + docs missing. :) Thank you @lkarthee

@lkarthee
Copy link
Member Author

Need clarity on when types don't match - fail or cast ?

  • int to float
  • unsigned to signed
Series.concat([u8, s8]) # should fail ?
Series.concat([u8, f32]) # ??? fail or cast ?

@josevalim
Copy link
Member

U8 with S8 becomes S16. Anything with float becomes float. + probably already works like that.

@billylanchantin
Copy link
Member

@josevalim For my own edification, is the logic there that {:u, x} == {:s, 2x}?

@josevalim
Copy link
Member

@billylanchantin correct!

@josevalim josevalim merged commit b7e6048 into elixir-explorer:main Jan 17, 2024
3 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@lkarthee lkarthee deleted the fix_series_concat branch January 19, 2024 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants