Cannot read columns of Time type by Polars #121
Comments
Additional information: With Polars' new streaming engine, which is an experimental feature at this time, the column is read as Float64 instead of raising an error, and the file is recognized as zero-row data.
>>> import polars as pl
>>> pl.__version__
'1.22.0'
>>> pl.scan_parquet('nanoparquet.parquet').collect(new_streaming=True)
shape: (0, 1)
┌──────┐
│ time │
│ ---  │
│ f64  │
╞══════╡
└──────┘
Well, that column is definitely not Float64 (=
❯ read_parquet_schema("nanoparquet.parquet")
            file_name   name r_type  type type_length repetition_type
1 nanoparquet.parquet schema   <NA>  <NA>          NA            <NA>
2 nanoparquet.parquet   time    hms INT32          NA        REQUIRED
  converted_type logical_type num_children scale precision field_id
1           <NA>                         1    NA        NA       NA
2    TIME_MILLIS TIME, TR....           NA    NA        NA       NA
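For context on the schema above: in the Parquet format, a TIME column with millisecond precision is stored as an INT32 holding the number of milliseconds since midnight, which is why Float64 is not a sensible reading of it. Here is a minimal Python sketch of decoding such a value using only the standard library. The helper name `time_millis_to_time` is made up for illustration and is not part of nanoparquet or Polars.

```python
from datetime import time

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def time_millis_to_time(ms: int) -> time:
    """Decode a Parquet TIME(MILLIS) value (INT32 ms since midnight)."""
    if not 0 <= ms < MILLIS_PER_DAY:
        raise ValueError(f"value out of range for a time of day: {ms}")
    # Peel off milliseconds, then seconds, then minutes.
    total_seconds, millis = divmod(ms, 1000)
    total_minutes, second = divmod(total_seconds, 60)
    hour, minute = divmod(total_minutes, 60)
    # datetime.time takes microseconds, so scale the millisecond part.
    return time(hour, minute, second, millis * 1000)


# hms::hms(0) in the example data corresponds to midnight.
print(time_millis_to_time(0))
```

The single value in the test file, `hms::hms(0)`, would be stored as the integer 0 under this encoding.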
Cross post: pola-rs/polars#21195
Here is a workaround:
td <- data.frame(time = hms::hms(0))
nanoparquet::write_parquet(
  td,
  "nanoparquet.parquet",
  options = nanoparquet::parquet_options(write_arrow_metadata = FALSE)
)
It is entirely possible that there is a bug in the nanoparquet arrow metadata writer, but even then this is still a bug in Polars, because it should not fail to read the file because of the metadata. Here is what the metadata looks like, as written by nanoparquet and by arrow:
❯ parse_arrow_schema(read_parquet_metadata("nanoparquet.parquet")[["file_meta_data"]][["key_value_metadata"]][[1]][["value"]])
$columns
name type_type type nullable dictionary custom_metadata
1 time FloatingPoint DOUBLE TRUE characte....
$custom_metadata
[1] key value
<0 rows> (or 0-length row.names)
$endianness
[1] "Little"
$features
character(0)
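To illustrate the point that a reader should not fail because of the metadata, here is a small Python sketch of one possible policy: treat the Parquet schema as authoritative and use the embedded Arrow schema only as a hint when the two agree. This is not how Polars, Arrow C++, or DuckDB are implemented; the function name `resolve_type` and the `COMPATIBLE_HINTS` table are assumptions made up for this example.

```python
from typing import Optional

# Hypothetical table: Arrow-side type names accepted as a hint
# for each Parquet-side type. Illustrative, not exhaustive.
COMPATIBLE_HINTS = {
    "TIME_MILLIS": {"Time32"},
    "TIME_MICROS": {"Time64"},
    "DOUBLE": {"FloatingPoint"},
}


def resolve_type(parquet_type: str, arrow_hint: Optional[str]) -> str:
    """Pick the type to use for a column.

    The Arrow hint is used only when it is compatible with the
    Parquet type; otherwise the Parquet type wins.
    """
    if arrow_hint in COMPATIBLE_HINTS.get(parquet_type, set()):
        return arrow_hint
    return parquet_type


# The file in this issue: Parquet says TIME_MILLIS, the Arrow hint says FloatingPoint.
print(resolve_type("TIME_MILLIS", "FloatingPoint"))
```

Under this policy the mismatched hint is ignored rather than causing an error or a zero-row Float64 column.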
nanoparquet probably creates the arrow schema before converting the column from double to integer, so this is a bug, and probably an easy one to fix.
Actually, that's not what's happening, and the double value in the arrow schema is intentional. It should probably be a
This should be fixed now on
Confirmed, this was fixed by e54aa46.
I've noticed that when I try to read a column of Time type written by nanoparquet with Polars, I get an error.
Since Arrow C++ and DuckDB read the file fine, this may not be a problem in nanoparquet but in Polars (or both may have a problem).
In R (for comparison, a file with the same contents is created by arrow::write_parquet()):
Try reading each one in Python Polars (non-Python Polars will give the same error):