pl$DataFrame() fails when columns of type "ivs_iv" is present in dataframe #368
It seems …
> pak::pak("ivs")
✔ Updated metadata database: 2.90 MB in 6 files.
✔ Updating metadata database ... done
→ Will install 1 package.
→ Will download 1 package with unknown size.
+ ivs 0.2.0 [dl]
ℹ Getting 1 pkg with unknown size
✔ Got ivs 0.2.0 (x86_64-pc-linux-gnu-ubuntu-22.04) (412.73 kB)
✔ Downloaded 1 package (412.73 kB)in 3.5s
✔ Installed ivs 0.2.0 (62ms)
✔ 1 pkg + 5 deps: kept 5, added 1, dld 1 (412.73 kB) [18.4s]
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> library(ivs)
> t_date <- as.Date("2020-05-05")
test_df <- tibble(id = 1:5,
grp = c("a", "a", "b", "b", "b"),
start = rep(t_date+1:5),
end = rep(t_date+11:7))
# adding an iv-variable to the dataframe
test_df_iv <- test_df |>
mutate(range = ivs::iv(start, end))
> test_df_iv$range
<iv<date>[5]>
[1] [2020-05-06, 2020-05-16) [2020-05-07, 2020-05-15) [2020-05-08, 2020-05-14) [2020-05-09, 2020-05-13)
[5] [2020-05-10, 2020-05-12)
> test_df_iv$range |> class()
[1] "ivs_iv" "vctrs_rcrd" "vctrs_vctr"
> test_df_iv |> arrow::as_arrow_table()
Table
5 rows x 5 columns
$id <int32>
$grp <string>
$start <date32[day]>
$end <date32[day]>
$range <<iv<date>[0]>>
> test_df_iv |> arrow::as_arrow_table() |> _$range
ChunkedArray
<<iv<date>[0]>>
[
-- is_valid: all not null
-- child 0 type: date32[day]
[
2020-05-06,
2020-05-07,
2020-05-08,
2020-05-09,
2020-05-10
]
-- child 1 type: date32[day]
[
2020-05-16,
2020-05-15,
2020-05-14,
2020-05-13,
2020-05-12
]
]
But when I try to convert this to polars I get an error. Perhaps the … In other words, it's an upstream issue.
> test_df_iv |> arrow::as_arrow_table() |> polars::pl$from_arrow()
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.4/src/ffi/schema.rs:501:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'explicit panic', src/rdataframe/mod.rs:82:1
Error: Execution halted with the following contexts
0: In R: in pl$from_arrow:
0: During function call [polars::pl$from_arrow(arrow::as_arrow_table(test_df_iv))]
1: user function panicked: from_arrow_record_batches
When I write this data to Parquet and try to read it back, DuckDB reads it successfully but Python Polars fails.
In [1]: import polars as pl
In [2]: pl.read_parquet("test.parquet")
---------------------------------------------------------------------------
ArrowErrorException Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.read_parquet("test.parquet")
File ~/.local/lib/python3.10/site-packages/polars/io/parquet/functions.py:132, in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, low_memory, pyarrow_options, use_statistics, rechunk)
121 import pyarrow.parquet
123 return from_arrow( # type: ignore[return-value]
124 pa.parquet.read_table(
125 source_prep,
(...)
129 )
130 )
--> 132 return pl.DataFrame._read_parquet(
133 source_prep,
134 columns=columns,
135 n_rows=n_rows,
136 parallel=parallel,
137 row_count_name=row_count_name,
138 row_count_offset=row_count_offset,
139 low_memory=low_memory,
140 use_statistics=use_statistics,
141 rechunk=rechunk,
142 )
File ~/.local/lib/python3.10/site-packages/polars/dataframe/frame.py:852, in DataFrame._read_parquet(cls, source, columns, n_rows, parallel, row_count_name, row_count_offset, low_memory, use_statistics, rechunk)
850 projection, columns = handle_projection_columns(columns)
851 self = cls.__new__(cls)
--> 852 self._df = PyDataFrame.read_parquet(
853 source,
854 columns,
855 projection,
856 n_rows,
857 parallel,
858 _prepare_row_count_args(row_count_name, row_count_offset),
859 low_memory=low_memory,
860 use_statistics=use_statistics,
861 rechunk=rechunk,
862 )
863 return self
ArrowErrorException: OutOfSpec("In <KeyValue@d8>::value(): Invalid utf-8: invalid utf-8 sequence of 1 bytes from index 83")
Even if this bug is resolved, I think it is necessary to implement dedicated processing to convert vectors built on …
I have a fix for this upcoming, but got interrupted. Will update later.
Currently polars just ignores the vctrs annotations and traits and converts the object as what it is underneath: a list of two vectors. However, that gives a length mismatch, 2 by 5. Even if it is possible to import an ivs_iv-classed vector, none of the methods from the ivs package would know what to do with a polars Series or DataFrame. You might want to switch to the polars pl$date_range instead.
Polars should support vctrs vectors, I think. On the occasion of this issue I have refactored the polars import of Robj's, and I have also added a dependency-injection method, as_polars_series.YourClass, so that any classed Robj can be supported by polars OR tidypolars OR the end user. The code example below is from PR #369.
Examples before
library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(polars)
library(tidypolars)
#> Warning: package 'tidypolars' was built under R version 4.3.1
#> Registered S3 method overwritten by 'tidypolars':
#> method from
#> print.DataFrame polars
t_date <- as.Date("2020-05-05")
test_df <- tibble(id = 1:5,
grp = c("a", "a", "b", "b", "b"),
start = rep(t_date+1:5),
end = rep(t_date+11:7))
# adding an iv-variable to the dataframe
test_df_iv <- test_df |>
mutate(range = ivs::iv(start, end))
class(test_df_iv$range)
#> [1] "ivs_iv" "vctrs_rcrd" "vctrs_vctr"
unclass(test_df_iv$range)
#> $start
#> [1] "2020-05-06" "2020-05-07" "2020-05-08" "2020-05-09" "2020-05-10"
#>
#> $end
#> [1] "2020-05-16" "2020-05-15" "2020-05-14" "2020-05-13" "2020-05-12"
# importing as plain Dates by removing the vctrs attributes
test_df_plain = test_df_iv
test_df_plain[,c("range_1","range_2")] = unclass(test_df_iv$range)
test_df_plain$range = NULL
pl$DataFrame(test_df_plain)
#> shape: (5, 6)
#> ┌─────┬─────┬────────────┬────────────┬────────────┬────────────┐
#> │ id ┆ grp ┆ start ┆ end ┆ range_1 ┆ range_2 │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ i32 ┆ str ┆ date ┆ date ┆ date ┆ date │
#> ╞═════╪═════╪════════════╪════════════╪════════════╪════════════╡
#> │ 1 ┆ a ┆ 2020-05-06 ┆ 2020-05-16 ┆ 2020-05-06 ┆ 2020-05-16 │
#> │ 2 ┆ a ┆ 2020-05-07 ┆ 2020-05-15 ┆ 2020-05-07 ┆ 2020-05-15 │
#> │ 3 ┆ b ┆ 2020-05-08 ┆ 2020-05-14 ┆ 2020-05-08 ┆ 2020-05-14 │
#> │ 4 ┆ b ┆ 2020-05-09 ┆ 2020-05-13 ┆ 2020-05-09 ┆ 2020-05-13 │
#> │ 5 ┆ b ┆ 2020-05-10 ┆ 2020-05-12 ┆ 2020-05-10 ┆ 2020-05-12 │
#> └─────┴─────┴────────────┴────────────┴────────────┴────────────┘
# or make a Struct Series (a struct is pretty close to a DataFrame in a Series)
pl$select(unclass(test_df_iv$range))$to_struct()$alias("range_struct")
#> polars Series: shape: (5,)
#> Series: 'range_struct' [struct[2]]
#> [
#> {2020-05-06,2020-05-16}
#> {2020-05-07,2020-05-15}
#> {2020-05-08,2020-05-14}
#> {2020-05-09,2020-05-13}
#> {2020-05-10,2020-05-12}
#> ]
# use polars date_range instead of ivs
test_df_plain$range_1 = NULL
test_df_plain$range_2 = NULL
pl$DataFrame(test_df_plain)$with_columns(
#as Date
pl$date_range(
pl$col("start"),
pl$col("end"),
interval = "1d",
explode = FALSE
)$alias("range_as_date_ranges"),
#or some Datetime
pl$date_range(
pl$col("start"),
pl$col("end"),
interval = "1d42m5s",
explode = FALSE
)$alias("range_as_datetime_ranges")
)
#> shape: (5, 6)
#> ┌─────┬─────┬────────────┬────────────┬────────────────────────────┬──────────────────────────┐
#> │ id ┆ grp ┆ start ┆ end ┆ range_as_date_ranges ┆ range_as_datetime_ranges │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ i32 ┆ str ┆ date ┆ date ┆ list[date] ┆ list[datetime[μs]] │
#> ╞═════╪═════╪════════════╪════════════╪════════════════════════════╪══════════════════════════╡
#> │ 1 ┆ a ┆ 2020-05-06 ┆ 2020-05-16 ┆ [2020-05-06, 2020-05-07, … ┆ [2020-05-06 00:00:00, │
#> │ ┆ ┆ ┆ ┆ 2020-… ┆ 2020-05-07… │
#> │ 2 ┆ a ┆ 2020-05-07 ┆ 2020-05-15 ┆ [2020-05-07, 2020-05-08, … ┆ [2020-05-07 00:00:00, │
#> │ ┆ ┆ ┆ ┆ 2020-… ┆ 2020-05-08… │
#> │ 3 ┆ b ┆ 2020-05-08 ┆ 2020-05-14 ┆ [2020-05-08, 2020-05-09, … ┆ [2020-05-08 00:00:00, │
#> │ ┆ ┆ ┆ ┆ 2020-… ┆ 2020-05-09… │
#> │ 4 ┆ b ┆ 2020-05-09 ┆ 2020-05-13 ┆ [2020-05-09, 2020-05-10, … ┆ [2020-05-09 00:00:00, │
#> │ ┆ ┆ ┆ ┆ 2020-… ┆ 2020-05-10… │
#> │ 5 ┆ b ┆ 2020-05-10 ┆ 2020-05-12 ┆ [2020-05-10, 2020-05-11, ┆ [2020-05-10 00:00:00, │
#> │ ┆ ┆ ┆ ┆ 2020-05… ┆ 2020-05-11… │
#> └─────┴─────┴────────────┴────────────┴────────────────────────────┴──────────────────────────┘
# But ...
# from the perspective of a package extending polars, or of an end user, it would be ugly to hand-code all this.
# I have added a method, polars::as_polars_series (likely released with polars 0.8.0), which
# users or package maintainers can use to modify/extend how Robj are converted into Series,
# e.g. define a generic conversion for any "vctrs_rcrd"
as_polars_series.vctrs_rcrd = function(x, ...) {
pl$DataFrame(unclass(x))$to_struct()
}
# now it just works
pl$lit(test_df_iv$range)
#> polars Expr: Series
pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬─────────────────────────┐
#> │ id ┆ grp ┆ start ┆ end ┆ range │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ i32 ┆ str ┆ date ┆ date ┆ struct[2] │
#> ╞═════╪═════╪════════════╪════════════╪═════════════════════════╡
#> │ 1 ┆ a ┆ 2020-05-06 ┆ 2020-05-16 ┆ {2020-05-06,2020-05-16} │
#> │ 2 ┆ a ┆ 2020-05-07 ┆ 2020-05-15 ┆ {2020-05-07,2020-05-15} │
#> │ 3 ┆ b ┆ 2020-05-08 ┆ 2020-05-14 ┆ {2020-05-08,2020-05-14} │
#> │ 4 ┆ b ┆ 2020-05-09 ┆ 2020-05-13 ┆ {2020-05-09,2020-05-13} │
#> │ 5 ┆ b ┆ 2020-05-10 ┆ 2020-05-12 ┆ {2020-05-10,2020-05-12} │
#> └─────┴─────┴────────────┴────────────┴─────────────────────────┘
x = test_df_iv$range
# or define a more specialized conversion for ivs_iv, where we specifically use "start" and "end"
as_polars_series.ivs_iv = function(x, ...) {
pl$DataFrame(unclass(x))$select(
pl$date_range(
pl$col("start"),
pl$col("end"),
interval = "1d",
explode = FALSE
)$alias("ivs_iv")
)$to_series()
}
pl$lit(test_df_iv$range)
#> polars Expr: Series[ivs_iv]
pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬───────────────────────────────────┐
#> │ id ┆ grp ┆ start ┆ end ┆ range │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ i32 ┆ str ┆ date ┆ date ┆ list[date] │
#> ╞═════╪═════╪════════════╪════════════╪═══════════════════════════════════╡
#> │ 1 ┆ a ┆ 2020-05-06 ┆ 2020-05-16 ┆ [2020-05-06, 2020-05-07, … 2020-… │
#> │ 2 ┆ a ┆ 2020-05-07 ┆ 2020-05-15 ┆ [2020-05-07, 2020-05-08, … 2020-… │
#> │ 3 ┆ b ┆ 2020-05-08 ┆ 2020-05-14 ┆ [2020-05-08, 2020-05-09, … 2020-… │
#> │ 4 ┆ b ┆ 2020-05-09 ┆ 2020-05-13 ┆ [2020-05-09, 2020-05-10, … 2020-… │
#> │ 5 ┆ b ┆ 2020-05-10 ┆ 2020-05-12 ┆ [2020-05-10, 2020-05-11, 2020-05… │
#> └─────┴─────┴────────────┴────────────┴───────────────────────────────────┘
# final gotcha: select and with_columns unpack a single list input,
# because they expect it to be an arg list
pl$select(test_df_iv$range) # naively converts the two fields to date columns
#> shape: (5, 2)
#> ┌────────────┬────────────┐
#> │ start ┆ end │
#> │ --- ┆ --- │
#> │ date ┆ date │
#> ╞════════════╪════════════╡
#> │ 2020-05-06 ┆ 2020-05-16 │
#> │ 2020-05-07 ┆ 2020-05-15 │
#> │ 2020-05-08 ┆ 2020-05-14 │
#> │ 2020-05-09 ┆ 2020-05-13 │
#> │ 2020-05-10 ┆ 2020-05-12 │
#> └────────────┴────────────┘
# to avoid this, a first and only list arg must be wrapped in a list
pl$select(list(test_df_iv$range)) # now converted via as_polars_series.ivs_iv
#> shape: (5, 1)
#> ┌───────────────────────────────────┐
#> │ ivs_iv │
#> │ --- │
#> │ list[date] │
#> ╞═══════════════════════════════════╡
#> │ [2020-05-06, 2020-05-07, … 2020-… │
#> │ [2020-05-07, 2020-05-08, … 2020-… │
#> │ [2020-05-08, 2020-05-09, … 2020-… │
#> │ [2020-05-09, 2020-05-10, … 2020-… │
#> │ [2020-05-10, 2020-05-11, 2020-05… │
#> └───────────────────────────────────┘
Created on 2023-08-31 with reprex v2.0.2
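The reprex defines first as_polars_series.vctrs_rcrd and then as_polars_series.ivs_iv, and the second one takes over. That is ordinary S3 dispatch on the class vector c("ivs_iv", "vctrs_rcrd", "vctrs_vctr"): the first class with a matching method wins. A minimal base-R sketch of just that mechanism follows; as_series_sketch and toy_iv are hypothetical stand-ins invented for illustration, not part of polars or ivs, and the methods return labels instead of doing a real conversion.

```r
# Hypothetical generic standing in for polars::as_polars_series
as_series_sketch <- function(x, ...) UseMethod("as_series_sketch")

# Fallback used when no class-specific method exists
as_series_sketch.default <- function(x, ...) "default conversion"

# Broad rule covering every vctrs record vector
as_series_sketch.vctrs_rcrd <- function(x, ...) "struct of fields"

# Toy object carrying the same class vector as an ivs interval vector
toy_iv <- structure(
  list(start = as.Date("2020-05-06"), end = as.Date("2020-05-16")),
  class = c("ivs_iv", "vctrs_rcrd", "vctrs_vctr")
)

# No ivs_iv method yet, so dispatch falls through to vctrs_rcrd
res_before <- as_series_sketch(toy_iv)

# A more specific rule for the first class in the vector
as_series_sketch.ivs_iv <- function(x, ...) "list of dates from start to end"

# The ivs_iv method is now found first
res_after <- as_series_sketch(toy_iv)

print(c(before = res_before, after = res_after))
```

The practical consequence is that a package such as tidypolars could ship the broad vctrs_rcrd rule while an end user still overrides it for one specific class.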
Hi @cathblatter, as of the released polars 0.8.0, which includes PR #369, the example above works. Should we close this issue?
I can confirm this works like a charm. Thank you very much @sorhawell and also @etiennebacher for your speedy adjustments 🥳
I think the real problem here is that Polars doesn't support Arrow's "Extension type" (pola-rs/polars#9112).
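To look at the extension type that comment refers to, here is a minimal R sketch. It assumes the arrow and ivs packages are installed and that arrow wraps vctrs classes through arrow::vctrs_extension_array(). As far as I understand (this is not confirmed in the thread), the extension metadata is a serialized R object held as raw bytes rather than UTF-8 text, which would be consistent with the Utf8Error above; treat that link as an assumption.

```r
library(arrow)
library(ivs)

# Same intervals as in the issue
t_date <- as.Date("2020-05-05")
rng <- iv(t_date + 1:5, t_date + 11:7)

# Wrap the interval vector in arrow's extension array for vctrs classes
ext_array <- arrow::vctrs_extension_array(rng)
ext_type <- ext_array$type

# Name under which the extension type is registered
print(ext_type$extension_name())

# Plain Arrow type used for storage underneath the extension wrapper
print(ext_type$storage_type())

# Extension metadata comes back as a raw vector, i.e. arbitrary bytes
ext_meta <- ext_type$extension_metadata()
print(is.raw(ext_meta))

# The storage array holds the data without the extension wrapper
storage <- ext_array$storage()
print(storage$type)
```

A consumer without extension-type support would presumably have to work from the storage array alone, which is what the struct conversion in the reprex above amounts to.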
Hi, I came across an issue with date-range columns from the {ivs} package while using {tidypolars}; at the author's suggestion I'm posting it here.
In brief: pl$DataFrame() fails when an interval column (date range) created with the {ivs} package has been added to the data frame before converting, with the following error:
I'm aware it's a bit of a special column type, and I am just curious whether you plan to support this type of variable at some point. I find it extremely convenient to use r-polars with large routine data for the time saved (and such data often contains date ranges).
Let me know if you want me to provide more info. In the meantime 🙌 thanks for the work!