pl$DataFrame() fails when a column of type "ivs_iv" is present in the data frame #368

cathblatter · 2023-08-28T16:33:27Z

Hi, I came across an issue with date-range columns from the {ivs} package while using {tidypolars}. As per the author's suggestion, I'm posting it here.

In brief: pl$DataFrame() fails when an interval column (date range) created with the {ivs} package was added to the data frame before converting. It fails with the following error:

library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(polars)

t_date <- as.Date("2020-05-05")

test_df <- tibble(id = 1:5, 
                   grp = c("a", "a", "b", "b", "b"),
                   start = rep(t_date+1:5),
                   end = rep(t_date+11:7))

# adding an iv-variable to the dataframe
test_df_iv <- test_df |> 
    mutate(range = ivs::iv(start, end))

pl$DataFrame(test_df_iv)
#> Error: in set_column_from_robj: ShapeMismatch(ErrString("unable to add a column of length 2 to a dataframe of height 5"))

I'm aware it's a bit of a special column type, and I'm just curious whether you plan to support this type of variable at some point. I find it extremely convenient to use r-polars with large routine data for the time it saves (and such data often contains date ranges).

Let me know if you want me to provide more info. In the meantime 🙌 thanks for the work!

eitsupi · 2023-08-29T05:06:17Z

It seems ivs is based on the vctrs package, and the arrow package already supports vctrs classes (regardless of whether the result is the intended Arrow type).

> pak::pak("ivs")
✔ Updated metadata database: 2.90 MB in 6 files.                          
✔ Updating metadata database ... done                                     
                                                                           
→ Will install 1 package.
→ Will download 1 package with unknown size.
+ ivs   0.2.0 [dl]
ℹ Getting 1 pkg with unknown size
✔ Got ivs 0.2.0 (x86_64-pc-linux-gnu-ubuntu-22.04) (412.73 kB)     
✔ Downloaded 1 package (412.73 kB)in 3.5s                          
✔ Installed ivs 0.2.0  (62ms)                               
✔ 1 pkg + 5 deps: kept 5, added 1, dld 1 (412.73 kB) [18.4s]                                  

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


> library(ivs)

> t_date <- as.Date("2020-05-05")
 
    test_df <- tibble(id = 1:5, 
                       grp = c("a", "a", "b", "b", "b"),
                       start = rep(t_date+1:5),
                       end = rep(t_date+11:7))
 
    # adding an iv-variable to the dataframe
    test_df_iv <- test_df |> 
        mutate(range = ivs::iv(start, end))

> test_df_iv$range
<iv<date>[5]>
[1] [2020-05-06, 2020-05-16) [2020-05-07, 2020-05-15) [2020-05-08, 2020-05-14) [2020-05-09, 2020-05-13)
[5] [2020-05-10, 2020-05-12)

> test_df_iv$range |> class()
[1] "ivs_iv"     "vctrs_rcrd" "vctrs_vctr"

> test_df_iv |> arrow::as_arrow_table()
Table
5 rows x 5 columns
$id <int32>
$grp <string>
$start <date32[day]>
$end <date32[day]>
$range <<iv<date>[0]>>

> test_df_iv |> arrow::as_arrow_table() |> _$range
ChunkedArray
<<iv<date>[0]>>
[
  -- is_valid: all not null
  -- child 0 type: date32[day]
    [
      2020-05-06,
      2020-05-07,
      2020-05-08,
      2020-05-09,
      2020-05-10
    ]
  -- child 1 type: date32[day]
    [
      2020-05-16,
      2020-05-15,
      2020-05-14,
      2020-05-13,
      2020-05-12
    ]
]

But when I try to convert this to polars I get an error. Perhaps the arrow2 crate does not support this type.

In other words, it's an upstream issue.

> test_df_iv |> arrow::as_arrow_table() |> polars::pl$from_arrow()
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.4/src/ffi/schema.rs:501:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'explicit panic', src/rdataframe/mod.rs:82:1
Error: Execution halted with the following contexts
   0: In R: in pl$from_arrow:
   0: During function call [polars::pl$from_arrow(arrow::as_arrow_table(test_df_iv))]
   1: user function panicked: from_arrow_record_batches

When I write this data to Parquet and try to read it, DuckDB can read it successfully but Python Polars fails to read it.

In [1]: import polars as pl

In [2]: pl.read_parquet("test.parquet")
---------------------------------------------------------------------------
ArrowErrorException                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.read_parquet("test.parquet")

File ~/.local/lib/python3.10/site-packages/polars/io/parquet/functions.py:132, in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, low_memory, pyarrow_options, use_statistics, rechunk)
    121     import pyarrow.parquet
    123     return from_arrow(  # type: ignore[return-value]
    124         pa.parquet.read_table(
    125             source_prep,
   (...)
    129         )
    130     )
--> 132 return pl.DataFrame._read_parquet(
    133     source_prep,
    134     columns=columns,
    135     n_rows=n_rows,
    136     parallel=parallel,
    137     row_count_name=row_count_name,
    138     row_count_offset=row_count_offset,
    139     low_memory=low_memory,
    140     use_statistics=use_statistics,
    141     rechunk=rechunk,
    142 )

File ~/.local/lib/python3.10/site-packages/polars/dataframe/frame.py:852, in DataFrame._read_parquet(cls, source, columns, n_rows, parallel, row_count_name, row_count_offset, low_memory, use_statistics, rechunk)
    850 projection, columns = handle_projection_columns(columns)
    851 self = cls.__new__(cls)
--> 852 self._df = PyDataFrame.read_parquet(
    853     source,
    854     columns,
    855     projection,
    856     n_rows,
    857     parallel,
    858     _prepare_row_count_args(row_count_name, row_count_offset),
    859     low_memory=low_memory,
    860     use_statistics=use_statistics,
    861     rechunk=rechunk,
    862 )
    863 return self

ArrowErrorException: OutOfSpec("In <KeyValue@d8>::value(): Invalid utf-8: invalid utf-8 sequence of 1 bytes from index 83")

eitsupi · 2023-08-29T05:20:37Z

Even if this bug is resolved, I think dedicated processing is still needed to convert vctrs-based vectors, such as those from the clock and ivs packages, to the intended Arrow type.
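One possible shape for such dedicated processing, sketched in base R: unpack every record-style column into one plain column per field before handing the data frame to a converter. This is only an illustration under stated assumptions. `flatten_rcrd_cols` is a hypothetical helper (not part of polars, arrow, ivs or vctrs), and the `range` column below is a hand-built stand-in that imitates the list-of-fields layout of an ivs_iv vector so that the sketch runs without extra packages.

```r
# Hypothetical helper: replace each "vctrs_rcrd" column by one plain column
# per field, named <column>_<field>. All other columns are kept as they are.
flatten_rcrd_cols <- function(df) {
  out <- list()
  for (nm in names(df)) {
    col <- df[[nm]]
    if (inherits(col, "vctrs_rcrd")) {
      fields <- unclass(col)  # plain list of field vectors
      names(fields) <- paste(nm, names(fields), sep = "_")
      out <- c(out, fields)
    } else {
      out[[nm]] <- col
    }
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Stand-in data: "range" imitates the storage layout of an ivs_iv vector.
t_date <- as.Date("2020-05-05")
rng <- structure(
  list(start = t_date + 1:5, end = t_date + 11:7),
  class = c("ivs_iv", "vctrs_rcrd", "vctrs_vctr")
)
df <- structure(
  list(id = 1:5, range = rng),
  class = "data.frame", row.names = c(NA, -5L)
)

flat <- flatten_rcrd_cols(df)
names(flat)
```

The result holds only plain integer and Date columns, at the cost of losing the interval class itself.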

sorhawell · 2023-08-29T11:53:02Z

I have a fix for this coming up, but got interrupted. Will update later.

sorhawell · 2023-08-30T22:19:47Z

Currently polars just ignores the vctrs annotations and traits and converts the column as what it is underneath: a list of two vectors. That gives a length mismatch of 2 versus 5. Even if it were possible to import an ivs_iv-classed vector, none of the methods from the package would know what to do with polars Series and DataFrames. You might want to switch to polars' pl$date_range instead, as in the reprex further down.
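To illustrate the 2-versus-5 mismatch, here is a minimal base-R sketch. It assumes only that a vctrs record is stored as a list with one vector per field, which matches the unclass() output shown later in this thread. The names `fields` and `n_rows` are illustrative, and nothing here calls polars.

```r
# Stand-in for the storage of an ivs_iv vector: one vector per field.
t_date <- as.Date("2020-05-05")
fields <- list(start = t_date + 1:5, end = t_date + 11:7)

# Seen as a plain list, the container has one element per field ...
length(fields)
# ... while each field has one element per row.
lengths(fields)

# A converter that ignores the class sees 2 elements where the 5-row
# data frame needs 5, even though every field would fit on its own.
n_rows <- 5L
length(fields) == n_rows
all(lengths(fields) == n_rows)
```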

I think polars should support vctrs vectors. On the occasion of this issue I have refactored how polars imports Robj's, and I have also added a dependency-injection method, as_polars_series.YourClass, so that any classed Robj can be supported by polars, by tidypolars, or by the end user.

The code example below is from PR #369. The examples before as_polars_series should also work in polars 0.7.0.
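The mechanism behind this kind of extension point is ordinary S3 dispatch. A minimal base-R sketch follows, assuming nothing about polars internals: `to_columns` and the class `my_rcrd` are hypothetical stand-ins for the real generic and for a package's vector class.

```r
# Hypothetical converter generic; NOT the polars API.
to_columns <- function(x, ...) UseMethod("to_columns")

# Default: treat the object as a single plain column.
to_columns.default <- function(x, ...) list(value = x)

# Class-specific method that any package or user could add later:
# unpack a record into one column per field.
to_columns.my_rcrd <- function(x, ...) unclass(x)

rec <- structure(
  list(start = as.Date("2020-05-06") + 0:1, end = as.Date("2020-05-16") - 0:1),
  class = "my_rcrd"
)

names(to_columns(1:3))  # falls through to the default method
names(to_columns(rec))  # dispatches on "my_rcrd"
```

Because the method is looked up by class at call time, the package that owns the generic never needs to know about the classes it ends up supporting.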

library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(polars)
library(tidypolars)
#> Warning: package 'tidypolars' was built under R version 4.3.1
#> Registered S3 method overwritten by 'tidypolars':
#>   method          from  
#>   print.DataFrame polars
t_date <- as.Date("2020-05-05")

test_df <- tibble(id = 1:5, 
                   grp = c("a", "a", "b", "b", "b"),
                   start = rep(t_date+1:5),
                   end = rep(t_date+11:7))

# adding an iv-variable to the dataframe
test_df_iv <- test_df |> 
    mutate(range = ivs::iv(start, end))

class(test_df_iv$range)
#> [1] "ivs_iv"     "vctrs_rcrd" "vctrs_vctr"
unclass(test_df_iv$range)
#> $start
#> [1] "2020-05-06" "2020-05-07" "2020-05-08" "2020-05-09" "2020-05-10"
#> 
#> $end
#> [1] "2020-05-16" "2020-05-15" "2020-05-14" "2020-05-13" "2020-05-12"


# importing as plain Dates by removing the vctrs class
test_df_plain = test_df_iv
test_df_plain[,c("range_1","range_2")] = unclass(test_df_iv$range)
test_df_plain$range = NULL
pl$DataFrame(test_df_plain)
#> shape: (5, 6)
#> ┌─────┬─────┬────────────┬────────────┬────────────┬────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range_1    ┆ range_2    │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ date       ┆ date       │
#> ╞═════╪═════╪════════════╪════════════╪════════════╪════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ 2020-05-06 ┆ 2020-05-16 │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ 2020-05-07 ┆ 2020-05-15 │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ 2020-05-08 ┆ 2020-05-14 │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ 2020-05-09 ┆ 2020-05-13 │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ 2020-05-10 ┆ 2020-05-12 │
#> └─────┴─────┴────────────┴────────────┴────────────┴────────────┘

# or make a Struct Series (a struct is pretty close to a DataFrame inside a Series)
pl$select(unclass(test_df_iv$range))$to_struct()$alias("range_struct")
#> polars Series: shape: (5,)
#> Series: 'range_struct' [struct[2]]
#> [
#>  {2020-05-06,2020-05-16}
#>  {2020-05-07,2020-05-15}
#>  {2020-05-08,2020-05-14}
#>  {2020-05-09,2020-05-13}
#>  {2020-05-10,2020-05-12}
#> ]

# use polars date_range instead of ivs
test_df_plain$range_1 = NULL
test_df_plain$range_2 = NULL
pl$DataFrame(test_df_plain)$with_columns(
  #as Date
  pl$date_range(
      pl$col("start"), 
      pl$col("end"),
      interval = "1d",
      explode = FALSE
    )$alias("range_as_date_ranges"),
  
  #or some Datetime
  pl$date_range(
      pl$col("start"), 
      pl$col("end"),
      interval = "1d42m5s",
      explode = FALSE
    )$alias("range_as_datetime_ranges")
)
#> shape: (5, 6)
#> ┌─────┬─────┬────────────┬────────────┬────────────────────────────┬──────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range_as_date_ranges       ┆ range_as_datetime_ranges │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                        ┆ ---                      │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ list[date]                 ┆ list[datetime[μs]]       │
#> ╞═════╪═════╪════════════╪════════════╪════════════════════════════╪══════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ [2020-05-06, 2020-05-07, … ┆ [2020-05-06 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-07…              │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ [2020-05-07, 2020-05-08, … ┆ [2020-05-07 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-08…              │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ [2020-05-08, 2020-05-09, … ┆ [2020-05-08 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-09…              │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ [2020-05-09, 2020-05-10, … ┆ [2020-05-09 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-10…              │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ [2020-05-10, 2020-05-11,   ┆ [2020-05-10 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-05…                   ┆ 2020-05-11…              │
#> └─────┴─────┴────────────┴────────────┴────────────────────────────┴──────────────────────────┘


# But ....
# from the perspective of a package extending polars, or of an end user, hand-coding all this could be ugly.
# I have added a generic, polars::as_polars_series (likely released with polars 0.8.0), whose methods
# users or package maintainers can define to modify/extend how Robj's are converted into Series

# e.g. define a generic conversion for any "vctrs_rcrd"
as_polars_series.vctrs_rcrd = function(x, ...) {
  pl$DataFrame(unclass(x))$to_struct()
}

# now it just works
pl$lit(test_df_iv$range)
#> polars Expr: Series
pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬─────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range                   │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                     │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ struct[2]               │
#> ╞═════╪═════╪════════════╪════════════╪═════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ {2020-05-06,2020-05-16} │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ {2020-05-07,2020-05-15} │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ {2020-05-08,2020-05-14} │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ {2020-05-09,2020-05-13} │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ {2020-05-10,2020-05-12} │
#> └─────┴─────┴────────────┴────────────┴─────────────────────────┘
x = test_df_iv$range


# or define a more specialized conversion for ivs_iv, where we specifically use "start" and "end"
as_polars_series.ivs_iv = function(x, ...) {
  pl$DataFrame(unclass(x))$select(
    pl$date_range(
      pl$col("start"), 
      pl$col("end"),
      interval = "1d",
      explode = FALSE
    )$alias("ivs_iv")
  )$to_series()
}

pl$lit(test_df_iv$range)
#> polars Expr: Series[ivs_iv]
pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬───────────────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range                             │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                               │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ list[date]                        │
#> ╞═════╪═════╪════════════╪════════════╪═══════════════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ [2020-05-06, 2020-05-07, … 2020-… │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ [2020-05-07, 2020-05-08, … 2020-… │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ [2020-05-08, 2020-05-09, … 2020-… │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ [2020-05-09, 2020-05-10, … 2020-… │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ [2020-05-10, 2020-05-11, 2020-05… │
#> └─────┴─────┴────────────┴────────────┴───────────────────────────────────┘


# final gotcha: select and with_columns unpack a single list input,
# as they expect it to be an arg list
pl$select(test_df_iv$range) # naively converts them to dates
#> shape: (5, 2)
#> ┌────────────┬────────────┐
#> │ start      ┆ end        │
#> │ ---        ┆ ---        │
#> │ date       ┆ date       │
#> ╞════════════╪════════════╡
#> │ 2020-05-06 ┆ 2020-05-16 │
#> │ 2020-05-07 ┆ 2020-05-15 │
#> │ 2020-05-08 ┆ 2020-05-14 │
#> │ 2020-05-09 ┆ 2020-05-13 │
#> │ 2020-05-10 ┆ 2020-05-12 │
#> └────────────┴────────────┘


# to avoid this, a first-and-only list arg must be wrapped in a list
pl$select(list(test_df_iv$range)) # now converted via as_polars_series.ivs_iv
#> shape: (5, 1)
#> ┌───────────────────────────────────┐
#> │ ivs_iv                            │
#> │ ---                               │
#> │ list[date]                        │
#> ╞═══════════════════════════════════╡
#> │ [2020-05-06, 2020-05-07, … 2020-… │
#> │ [2020-05-07, 2020-05-08, … 2020-… │
#> │ [2020-05-08, 2020-05-09, … 2020-… │
#> │ [2020-05-09, 2020-05-10, … 2020-… │
#> │ [2020-05-10, 2020-05-11, 2020-05… │
#> └───────────────────────────────────┘

Created on 2023-08-31 with reprex v2.0.2
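The final gotcha in the reprex can be imitated in base R. The sketch below is a guess at the general rule (a lone list-like argument is taken to be the argument list itself), not the actual polars implementation. `collect_exprs` and the class `my_rcrd` are hypothetical.

```r
# Hypothetical argument collector that imitates the unpacking rule.
collect_exprs <- function(...) {
  args <- list(...)
  # A single list-like argument is taken to be the argument list itself.
  if (length(args) == 1L && is.list(args[[1L]])) {
    args <- unclass(args[[1L]])
  }
  args
}

# Stand-in for an ivs_iv vector: a classed list with two fields.
t_date <- as.Date("2020-05-05")
rng <- structure(
  list(start = t_date + 1:5, end = t_date + 11:7),
  class = "my_rcrd"
)

names(collect_exprs(rng))         # the record is unpacked into its fields
length(collect_exprs(list(rng)))  # wrapped in list(): stays one argument
```

Wrapping the vector in list() means the outer list is what gets unpacked, so the record itself reaches the converter intact.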

sorhawell · 2023-09-10T11:43:26Z

Hi @cathblatter, as of the released polars 0.8.0, which includes PR #369, the example above works. Should we close this issue?

cathblatter · 2023-09-11T00:17:53Z

Confirm this works like a charm - thank you very much @sorhawell & also @etiennebacher for your speedy adjustments🥳

eitsupi · 2023-12-04T12:12:50Z

> But when I try to convert this to polars I get an error. Perhaps the arrow2 crate does not support this type.

I think the real problem here is that Polars doesn't support Arrow's extension types (pola-rs/polars#9112).
I will open a new issue.

sorhawell mentioned this issue Aug 30, 2023

refactor lit, col, DataFrame, Series #369

Merged

cathblatter closed this as completed Sep 11, 2023

eitsupi mentioned this issue Dec 4, 2023

Rewrite as_polars_series (for vctr package based vectors) #570

Open