Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: sub-day precision Date should be floored when treated as integer #116

Merged
merged 4 commits into from
Feb 6, 2025

Conversation

eitsupi
Copy link

@eitsupi eitsupi commented Feb 6, 2025

Same as apache/arrow-nanoarrow#674, duckdb/duckdb-r#981

Because the Date type in R allows sub-days, it must be rounded to -Inf before being forced to integerish.
Conventional rounding in the direction of zero would change the direction of rounding before and after Unix origin.

@gaborcsardi
Copy link
Member

Thanks! I wonder if there is a faster way to do this, because it makes the conversion ~50% slower:

days <- as.Date(rep(c(-1.1, -0.1, 0, 0.1, 1.1), 10000), origin = "1970-01-01")
f1 <- function(x) mode(x) <- "integer"
f2 <- function(x) .Date(as.integer(floor(as.numeric(x))))
bench::mark(f1(days), f2(days), check = FALSE)
# A tibble: 2 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f1(days)     95.5µs  105µs     9371.     195KB     31.3  4193    14      447ms
2 f2(days)    105.5µs  147µs     6681.     781KB     96.6  2421    35      362ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

@gaborcsardi
Copy link
Member

I don't really understand why, but this is even faster than the original:

days <- as.Date(rep(c(-1.1, -0.1, 0, 0.1, 1.1), 10000), origin = "1970-01-01")
f1 <- function(x) mode(x) <- "integer"
f2 <- function(x) .Date(as.integer(floor(as.numeric(x))))
f3 <- function(x) { cls <- class(x); x <- floor(as.numeric(x)); class(x) <- cls; x }

stopifnot(f2(days) == f3(days))

bench::mark(f1(days), f2(days), f3(days), check = FALSE)
# A tibble: 3 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f1(days)    95.5µs 104.5µs     9340.     195KB     34.0  4121    15      441ms
2 f2(days)     106µs 150.5µs     6840.     792KB    103.   2649    40      387ms
3 f3(days)    54.3µs  79.5µs    12617.     409KB     92.2  4652    34      369ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

@gaborcsardi
Copy link
Member

Very slightly faster, still:

f4 <- function(x) { cls <- class(x); x <- floor(unclass(x)); class(x) <- cls; x }

@gaborcsardi
Copy link
Member

gaborcsardi commented Feb 6, 2025

Except that I forgot to convert to integer. /o\ Never mind then.

Proper benchmark:

days <- as.Date(rep(c(-1.1, -0.1, 0, 0.1, 1.1), 10000), origin = "1970-01-01")
f2 <- function(x) .Date(as.integer(floor(as.numeric(x))))
f3 <- function(x) { cls <- class(x); x <- as.integer(floor(as.numeric(x))); class(x) <- cls; x }
f4 <- function(x) { cls <- class(x); x <- as.integer(floor(unclass(x))); class(x) <- cls; x }

waldo::compare(f2(days), f3(days))
#> ✔ No differences

waldo::compare(f2(days), f4(days))
#> ✔ No differences

bench::mark(f2(days), f3(days), f4(days))
# A tibble: 3 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f2(days)    106.1µs  143µs     6805.     781KB     89.0  2371    31      348ms
2 f3(days)     99.9µs  129µs     7783.     605KB     78.2  2986    30      384ms
3 f4(days)    145.1µs  173µs     5796.     605KB     60.2  1347    14      232ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

@gaborcsardi gaborcsardi merged commit c53fcc1 into r-lib:main Feb 6, 2025
19 checks passed
@gaborcsardi
Copy link
Member

Thank you!

@eitsupi eitsupi deleted the fix-sub-date branch February 6, 2025 13:22
@eitsupi
Copy link
Author

eitsupi commented Feb 6, 2025

Thanks for taking a look at this. I did a little digging and it seems that the time consuming part is mainly the floor() and the process of converting double to integer.
So to make it any faster, I think it would have to be implemented in C++, but I don't have the C++ skills to do it here.
(Also, I don't know if it's worth it considering the percentage of execution time to the overall process.)

In the polars package I did the processing on the Rust side.
https://github.com/pola-rs/r-polars/blob/b900d6391b82a9245d8529e93f7caeb3ccfede8e/src/rust/src/series/construction.rs#L141-L160

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants