Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.8.0 summarise/count bug? #842

Closed
mlineen opened this issue Jan 29, 2024 · 8 comments
Closed

v0.8.0 summarise/count bug? #842

mlineen opened this issue Jan 29, 2024 · 8 comments

Comments

@mlineen
Copy link
Contributor

mlineen commented Jan 29, 2024

Not sure if this is a bug or a feature, but when a nil is present in the series, summarise with count skips the nil (this behavior changed from 0.7.2 to 0.8.0):

iex(3)> Explorer.DataFrame.new(testing: [1, nil, 3]) |> Explorer.DataFrame.summarise(count: count(testing))
#Explorer.DataFrame<
  Polars[1 x 1]
  count u32 [2]
>
iex(2)> Explorer.DataFrame.new(testing: [1, 2, 3]) |> Explorer.DataFrame.summarise(count: count(testing))
#Explorer.DataFrame<
  Polars[1 x 1]
  count u32 [3]
>
@billylanchantin
Copy link
Member

Hi @mlineen,

Thanks for the report!

Not sure if this is a bug or a feature

This looks like a bug to me. I didn't see any notes in the Polars release about changing the behavior of count. And AFAICT, the behavior should match that of Explorer.Series.count/1. It does not since:

import Explorer.Series
[1, nil, 3] |> from_list() |> count() #=> 3

I'll try to dig more into it tonight to make sure.

@billylanchantin
Copy link
Member

billylanchantin commented Jan 30, 2024

Interesting! So this isn't a "bug" per se. As of Polars 0.36.2, this is the default behavior. From the release:

💥 Breaking changes

We now need to use a different function under the hood if we want the old behavior.

WDYT @philss, @josevalim, @cigrainger?

@josevalim
Copy link
Member

@billylanchantin this is tricky. count in SQL does not include nulls indeed. And there is also a chance the behaviour of Series.count outside of a lazy query does not handle nils. So probably what we need to do is:

  • Make Series.count always discard nils, both inside and outside lazy queries
  • Make Series.size return the whole size. Inside groups, it should return the size of each group

WDYT?

@billylanchantin
Copy link
Member

I think that's a good plan. We'll want to call out that count (now) works like it does in SQL in the docs.

Will we also need to tackle Series.size not being available in lazy series? We may be able to make that work with len:

https://docs.rs/polars/latest/polars/?search=len

@benkimpel
Copy link

@billylanchantin this is tricky. count in SQL does not include nulls indeed. And there is also a chance the behaviour of Series.count outside of a lazy query does not handle nils. So probably what we need to do is:

* Make `Series.count` always discard nils, both inside and outside lazy queries

* Make `Series.size` return the whole size. Inside groups, it should return the size of each group

WDYT?

@josevalim is this an accurate summary of your proposal?

Explorer Series.count = Polars Series count
Explorer Series.size ≈ Polars Series len (≈ because I'm not sure what Polars behavior is regarding groups)

(I work with @mlineen)

@josevalim
Copy link
Member

Yes!!

@cigrainger
Copy link
Member

Yep I agree with that proposal!

@billylanchantin
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants