Series.concat handle multiple integer, float and null types #812

lkarthee · 2024-01-11T10:31:24Z

This is an attempt handle concatenation of multiple {:s, _}, {:u, _}, {:f, _} and new null type. Two issues are being addressed:

type casting / promotion
concatenation involving a :null series

Concatenation of null series

Type ++ Null - allowed
Null ++ Type - allowed
Null ++ Null - allowed

Type casting

Type casting is allowed only when there is no loss of information.

When same numeric types are involved, it is casted to largest type among the series. For example, {:s, 8}, {:s, 16} -> {:s, 8} will be casted to {:s, 16} . In case of {:u, 16}, {:u, 32}, {:u, 16} will be casted to {:u, 32}
When different numeric types are involved, it is casted to {:f, 64}. I have not changed the existing behaviour. Will update this behaviour later.
numeric and other types - error

As the discussion of type casting is going on, I have not written tests.

lkarthee · 2024-01-11T11:04:03Z

ss = Series.from_list(["a", "b", "c"])
sn = Series.from_list([nil, nil, nil], dtype: :null)
s8 = Series.from_list([1, 2, 3], dtype: {:s, 8})
s16 = Series.from_list([16, 17, 18], dtype: {:s, 16})
f32 = Series.from_list([32.0, 33.0, 34.0], dtype: {:f, 32})
f64 = Series.from_list([64.0, 65.0, 6.4], dtype: {:f, 64})
u8 = Series.from_list([10, 11, 12], dtype: {:u, 8})
Series.concat([s8, sn])
Series.concat([ss, sn]) 
# Series.concat([ss, s8]) # fails 
# Series.concat([ss, f32]) # fails 
Series.concat([sn, f32])
Series.concat([f32, sn])
Series.concat([sn, s8, s16])
Series.concat([sn, s8, f32])
Series.concat([u8, s8])  # cast to {:f, 64}

josevalim · 2024-01-11T11:06:38Z

Polars logic is that it must preserve the type on the left side, rather than trying to find a magic type that satisfies both sides, so it performs fewer casting. I'd say that's a sane we can choose to mimic if we want to. Overall, we have to choose between:

Define our own casting logic that aims to unify both sides (this PR)
Align with Polars and document that the type of the left side is preserved

Any opinions?

lkarthee · 2024-01-11T11:14:16Z

Null ++ Type and Type ++ Null can be done. We can cast to a same higher type if there is no loss of information. For example, {:s, _} to {:u, _} casting should be explicit.

I am not sure about casting from ints to floats.

cigrainger · 2024-01-11T11:24:39Z

I'm definitely tempted to cast in such a way that we preserve information. That being said, it can end up being surprising, even when well documented. There's something to be said for the simplicity of 'always left side' vs. 'here's a hierarchy'. I'm not entirely sure. I think the hierarchy/preservation approach can become confusing if, say, you're reducing over a number of concatenations. But the opposite is also true: let's say you have a number of Series and one of them has default to null because it was empty and you didn't know (this happens plenty for me). I would find it surprising and frustrating if I was getting a null type at the end of some function and I couldn't figure out why.

All of this is to say I think there are pros and cons both ways and I'm just not sure. How complex do we think juggling a hierarchy will be going forward?

billylanchantin · 2024-01-11T14:59:08Z

I agree with @cigrainger. I think I could be swayed in either direction, but I lean toward permissive casting.

We had to make a similar call with the duration types. For example we decided that:

{:duration, :millisecond} + {:duration, :nanosecond} = {:duration, :nanosecond}

We enforced this on our side:

explorer/lib/explorer/series.ex

Lines 3097 to 3100 in 9c9df97

    
           defp enforce_highest_precision([ 
        
                  %Series{dtype: {left_base, left_timeunit}} = left, 
        
                  %Series{dtype: {right_base, right_timeunit}} = right 
        
                ])

However, my pragmatic side says that following Polars will be easier in the long term. Left-side-wins is a simple take that's easy to apply.

I found this comment on my issue about duration + date illuminating:

I view time + duration as the equivalent of time.offset_by(duration), which is not commutative obviously--you cannot offset a duration by a time. A date/datetime is a position on the time axis and a duration is a movement along the time axis.

Some (many?) Polars Rust maintainers view operations as main_object.operation(arg) instead of operation(arg1, arg2). However, the Polars Python maintainers seem to have the opposite take.

As per @alexander-beedie comment, the python side is already commutative. So it makes sense to have Rust and Python behavior be the same. And according to the "principle of least surprise" keeping addition commutative makes sense even though duration + date is yucky.

(Possibly because things like a + b in Python are well known to actually be __add__(a, b).)

So ultimately my issue got addressed because the Python API had set precedent. But it remains the case that left-side-wins is their default take, so we'll likely have to fight this battle on many fronts if we choose to.

alexander-beedie · 2024-01-11T15:03:36Z

FYI: I added a little additional context here ;)
pola-rs/polars#11190 (comment)

billylanchantin · 2024-01-11T15:15:08Z

Whoops I didn't mean to summon you, @alexander-beedie! But thank you for the extra context! And of course all the work on Polars, which we get to benefit from :)

philss · 2024-01-12T04:38:58Z

I was following the approach of casting to preserve information for when the user is mixing different numeric dtypes in #794.
I notice that Polars (Rust) does some casting - for example, the subtraction u8 - f64 of series results in a f64 series -, but sometimes I have to do it myself - in the case of u8 - u8, I have to cast the left side to signed int in order to perform the operation.

However, my pragmatic side says that following Polars will be easier in the long term. Left-side-wins is a simple take that's easy to apply.

I agree with @billylanchantin here! I think in the long term, it will be probably easier to maintain. But I have my doubts if this is going to lead to a pleasant experience for the users. My take - and it's not a strong opinion - is that we would want to keep casting for binary operations (like add or subtract), but we would keep APIs like concat/1 more restrictive, requiring the user to explicitly cast the series to a single dtype.

But yeah, I agree with @cigrainger that this can be confusing even if we document it - and probably more confusing mixing up the behaviours like I'm suggesting here 😅. But it may be a nicer experience IMHO.

lkarthee · 2024-01-12T06:34:52Z

Here are the main themes from the above comments:

Are operations commutative ? a ++ b = b ++ a
Surprise element or unintuitive to user.
Preserve information.
Complexity of rules.
Maintainability w.r.to Polars alignment.
Precedent for diverging from polars.
Effort required to implement and maintain say commutative for null & type casting.

My opinion:

For a user coming from elixir, it might surprise that b ++ a is not allowed ?
Say if nil is valid in a column, but when appending order matters for a list of nils is unintuitive.
Casting in this case does not result in any loss of info.
Same type casting to a higher rank should be allowed for convenience.
Can be restrictive w.r.to casting to a different numeric type {:s, n} to {:u, n} or {:s, n} to {:f, n} or {:u, n} to {:f, n} and vice versa.
There is a precedent of diverging from polars - date time and duration.
Can there be a flag to disallow/opt out of the divergent behaviour ?

cigrainger · 2024-01-12T10:09:54Z

Good feedback from a trusted friend who is a heavy R/dplyr user: "math not being commutative would break my %!&@ brain".

billylanchantin · 2024-01-12T14:58:30Z

@cigrainger I agree with your friend! Although ++ isn't commutative, it is commutative WRT the output type (or ought to be IMHO).

@lkarthee Excellent summary.

Can there be a flag to disallow/opt out of the divergent behaviour ?

I vote no because 1) it would be a maintenance burden and 2) you can use explicit casting to get a different behavior if you want.

billylanchantin · 2024-01-12T15:07:38Z

@philss I just re-read your comment. To clarify, I think we could diverge from Polars and cast. I was trying to outline the pitfalls of that decision in my comment, but I can see how it came off as me arguing for keeping in line with Polars. Sorry if I muddled the convo!

josevalim · 2024-01-15T07:51:33Z

It seems everyone is in favor of this change, so let's go ahead! Only tests + docs missing. :) Thank you @lkarthee

lkarthee · 2024-01-15T08:38:59Z

Need clarity on when types don't match - fail or cast ?

int to float
unsigned to signed

Series.concat([u8, s8]) # should fail ?
Series.concat([u8, f32]) # ??? fail or cast ?

josevalim · 2024-01-15T08:43:42Z

U8 with S8 becomes S16. Anything with float becomes float. + probably already works like that.

billylanchantin · 2024-01-15T15:06:09Z

@josevalim For my own edification, is the logic there that {:u, x} == {:s, 2x}?

josevalim · 2024-01-15T15:43:14Z

@billylanchantin correct!

josevalim · 2024-01-17T08:30:13Z

💚 💙 💜 💛 ❤️

lkarthee added 3 commits January 11, 2024 15:04

Series.concat - {:s, _}, {:f, _} and :null

260f6cb

Update series.ex

71f82bf

remove unused _var

b875564

lkarthee mentioned this pull request Jan 11, 2024

Support Series.from_list([1, 2, 3], dtype: :null) #806

Closed

Merge branch 'elixir-explorer:main' into fix_series_concat

b9d05c2

josevalim merged commit b7e6048 into elixir-explorer:main Jan 17, 2024
3 checks passed

lkarthee deleted the fix_series_concat branch January 19, 2024 02:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Series.concat handle multiple integer, float and null types #812

Series.concat handle multiple integer, float and null types #812

lkarthee commented Jan 11, 2024 •

edited

Loading

lkarthee commented Jan 11, 2024

josevalim commented Jan 11, 2024

lkarthee commented Jan 11, 2024 •

edited

Loading

cigrainger commented Jan 11, 2024

billylanchantin commented Jan 11, 2024

alexander-beedie commented Jan 11, 2024 •

edited

Loading

billylanchantin commented Jan 11, 2024

philss commented Jan 12, 2024

lkarthee commented Jan 12, 2024 •

edited

Loading

cigrainger commented Jan 12, 2024 •

edited

Loading

billylanchantin commented Jan 12, 2024

billylanchantin commented Jan 12, 2024

josevalim commented Jan 15, 2024

lkarthee commented Jan 15, 2024

josevalim commented Jan 15, 2024

billylanchantin commented Jan 15, 2024

josevalim commented Jan 15, 2024

josevalim commented Jan 17, 2024

Series.concat handle multiple integer, float and null types #812

Series.concat handle multiple integer, float and null types #812

Conversation

lkarthee commented Jan 11, 2024 • edited Loading

Concatenation of null series

Type casting

lkarthee commented Jan 11, 2024

josevalim commented Jan 11, 2024

lkarthee commented Jan 11, 2024 • edited Loading

cigrainger commented Jan 11, 2024

billylanchantin commented Jan 11, 2024

alexander-beedie commented Jan 11, 2024 • edited Loading

billylanchantin commented Jan 11, 2024

philss commented Jan 12, 2024

lkarthee commented Jan 12, 2024 • edited Loading

cigrainger commented Jan 12, 2024 • edited Loading

billylanchantin commented Jan 12, 2024

billylanchantin commented Jan 12, 2024

josevalim commented Jan 15, 2024

lkarthee commented Jan 15, 2024

josevalim commented Jan 15, 2024

billylanchantin commented Jan 15, 2024

josevalim commented Jan 15, 2024

josevalim commented Jan 17, 2024

lkarthee commented Jan 11, 2024 •

edited

Loading

lkarthee commented Jan 11, 2024 •

edited

Loading

alexander-beedie commented Jan 11, 2024 •

edited

Loading

lkarthee commented Jan 12, 2024 •

edited

Loading

cigrainger commented Jan 12, 2024 •

edited

Loading