-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Series.concat handle multiple integer, float and null types #812
Conversation
ss = Series.from_list(["a", "b", "c"])
sn = Series.from_list([nil, nil, nil], dtype: :null)
s8 = Series.from_list([1, 2, 3], dtype: {:s, 8})
s16 = Series.from_list([16, 17, 18], dtype: {:s, 16})
f32 = Series.from_list([32.0, 33.0, 34.0], dtype: {:f, 32})
f64 = Series.from_list([64.0, 65.0, 6.4], dtype: {:f, 64})
u8 = Series.from_list([10, 11, 12], dtype: {:u, 8})
Series.concat([s8, sn])
Series.concat([ss, sn])
# Series.concat([ss, s8]) # fails
# Series.concat([ss, f32]) # fails
Series.concat([sn, f32])
Series.concat([f32, sn])
Series.concat([sn, s8, s16])
Series.concat([sn, s8, f32])
Series.concat([u8, s8]) # cast to {:f, 64} |
Polars logic is that it must preserve the type on the left side, rather than trying to find a magic type that satisfies both sides, so it performs fewer casting. I'd say that's a sane we can choose to mimic if we want to. Overall, we have to choose between:
Any opinions? |
Null ++ Type and Type ++ Null can be done. We can cast to a same higher type if there is no loss of information. For example, I am not sure about casting from ints to floats. |
I'm definitely tempted to cast in such a way that we preserve information. That being said, it can end up being surprising, even when well documented. There's something to be said for the simplicity of 'always left side' vs. 'here's a hierarchy'. I'm not entirely sure. I think the hierarchy/preservation approach can become confusing if, say, you're reducing over a number of concatenations. But the opposite is also true: let's say you have a number of Series and one of them has default to All of this is to say I think there are pros and cons both ways and I'm just not sure. How complex do we think juggling a hierarchy will be going forward? |
I agree with @cigrainger. I think I could be swayed in either direction, but I lean toward permissive casting. We had to make a similar call with the duration types. For example we decided that:
We enforced this on our side: explorer/lib/explorer/series.ex Lines 3097 to 3100 in 9c9df97
However, my pragmatic side says that following Polars will be easier in the long term. Left-side-wins is a simple take that's easy to apply. I found this comment on my issue about duration + date illuminating:
Some (many?) Polars Rust maintainers view operations as
(Possibly because things like So ultimately my issue got addressed because the Python API had set precedent. But it remains the case that left-side-wins is their default take, so we'll likely have to fight this battle on many fronts if we choose to. |
FYI: I added a little additional context here ;) |
Whoops I didn't mean to summon you, @alexander-beedie! But thank you for the extra context! And of course all the work on Polars, which we get to benefit from :) |
I was following the approach of casting to preserve information for when the user is mixing different numeric dtypes in #794.
I agree with @billylanchantin here! I think in the long term, it will be probably easier to maintain. But I have my doubts if this is going to lead to a pleasant experience for the users. My take - and it's not a strong opinion - is that we would want to keep casting for binary operations (like add or subtract), but we would keep APIs like But yeah, I agree with @cigrainger that this can be confusing even if we document it - and probably more confusing mixing up the behaviours like I'm suggesting here 😅. But it may be a nicer experience IMHO. |
Here are the main themes from the above comments:
My opinion:
|
Good feedback from a trusted friend who is a heavy R/dplyr user: "math not being commutative would break my %!&@ brain". |
@cigrainger I agree with your friend! Although @lkarthee Excellent summary.
I vote no because 1) it would be a maintenance burden and 2) you can use explicit casting to get a different behavior if you want. |
@philss I just re-read your comment. To clarify, I think we could diverge from Polars and cast. I was trying to outline the pitfalls of that decision in my comment, but I can see how it came off as me arguing for keeping in line with Polars. Sorry if I muddled the convo! |
It seems everyone is in favor of this change, so let's go ahead! Only tests + docs missing. :) Thank you @lkarthee |
Need clarity on when types don't match - fail or cast ?
Series.concat([u8, s8]) # should fail ?
Series.concat([u8, f32]) # ??? fail or cast ? |
U8 with S8 becomes S16. Anything with float becomes float. + probably already works like that. |
@josevalim For my own edification, is the logic there that |
@billylanchantin correct! |
💚 💙 💜 💛 ❤️ |
This is an attempt handle concatenation of multiple {:s, _}, {:u, _}, {:f, _} and new null type. Two issues are being addressed:
Concatenation of null series
Type casting
Type casting is allowed only when there is no loss of information.
As the discussion of type casting is going on, I have not written tests.