Cannot read columns of Time type by Polars #121

Closed
eitsupi opened this issue Feb 11, 2025 · 7 comments
Labels
bug an unexpected problem or unintended behavior

@eitsupi

eitsupi commented Feb 11, 2025

I've noticed that reading a Time-type column written by nanoparquet fails with an error in Polars.
Since Arrow C++ and DuckDB read the same file without problems, this may be a bug in Polars rather than in nanoparquet (or both may have a problem).

In R (for comparison, a file with the same contents is also written with arrow::write_parquet()):

data.frame(time = hms::hms(0)) |>
  nanoparquet::write_parquet("nanoparquet.parquet")

data.frame(time = hms::hms(0)) |>
  arrow::write_parquet("arrow.parquet", version = 1)

Then read each file with Python Polars (Polars outside Python gives the same error):

>>> import polars as pl
>>> pl.read_parquet("arrow.parquet")
shape: (1, 1)
┌──────────┐
│ time     │
│ ---      │
│ time     │
╞══════════╡
│ 00:00:00 │
└──────────┘
>>> pl.read_parquet("nanoparquet.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.local/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 241, in read_parquet
    return lf.collect()
           ^^^^^^^^^^^^
  File "/home/username/.local/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2066, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: parquet: Not yet supported: reading parquet type Int32 to Float64 still not implemented
eitsupi changed the title from "Cannot read columns of type Time in Polars" to "Cannot read columns of Time type by Polars" on Feb 11, 2025
@eitsupi
Author

eitsupi commented Feb 11, 2025

Additional information: with Polars' new streaming engine, which is experimental at this time, reading the file does not raise an error. Instead, the column is read as Float64 and the data is treated as having zero rows.

>>> import polars as pl
>>> pl.__version__
'1.22.0'
>>> pl.scan_parquet('nanoparquet.parquet').collect(new_streaming=True)
shape: (0, 1)
┌──────┐
│ time │
│ ---  │
│ f64  │
╞══════╡
└──────┘

@gaborcsardi
Member

Well, that column is definitely not Float64 (=DOUBLE in Parquet), so I suspect that this will be a bug in Polars, but I'll see what we can do about it:

❯ read_parquet_schema("nanoparquet.parquet")
            file_name   name r_type  type type_length repetition_type
1 nanoparquet.parquet schema   <NA>  <NA>          NA            <NA>
2 nanoparquet.parquet   time    hms INT32          NA        REQUIRED
  converted_type logical_type num_children scale precision field_id
1           <NA>                         1    NA        NA       NA
2    TIME_MILLIS TIME, TR....           NA    NA        NA       NA
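
For context, the schema above says the column is stored as INT32 with the TIME_MILLIS converted type, which in Parquet means milliseconds since midnight. The following is a minimal Python sketch of that decoding, using only the standard library; the helper name `time_millis_to_time` is made up for illustration and is not part of nanoparquet, Arrow, or Polars.

```python
from datetime import time

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def time_millis_to_time(millis: int) -> time:
    """Decode a Parquet TIME_MILLIS value (INT32, ms since midnight).

    Hypothetical helper for illustration only.
    """
    if not 0 <= millis < MILLIS_PER_DAY:
        raise ValueError(f"TIME_MILLIS value out of range: {millis}")
    # Split milliseconds into hours, minutes, seconds and leftover ms.
    seconds, ms = divmod(millis, 1000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    # datetime.time takes microseconds, so scale the millisecond part.
    return time(hour, minute, second, ms * 1000)


# The value written in this report is hms::hms(0), i.e. midnight.
print(time_millis_to_time(0))
```

A reader that honors the Parquet column type should therefore produce a time of day here, never a floating-point number.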

@eitsupi
Author

eitsupi commented Feb 11, 2025

Cross post: pola-rs/polars#21195

@gaborcsardi
Member

Here is a workaround:

td <- data.frame(time = hms::hms(0))
# Skip the embedded Arrow schema so readers rely on the Parquet types only
nanoparquet::write_parquet(
  td,
  "nanoparquet.parquet",
  options = nanoparquet::parquet_options(write_arrow_metadata = FALSE)
)
❯ python3
Python 3.13.0 (main, Oct  7 2024, 05:02:14) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.read_parquet("nanoparquet.parquet")
shape: (1, 1)
┌──────────┐
│ time     │
│ ---      │
│ time     │
╞══════════╡
│ 00:00:00 │
└──────────┘

It is entirely possible that there is a bug in the nanoparquet Arrow metadata writer, but even then this is still a bug in Polars, because Polars should not fail to read the file on account of the metadata.

Here is what the Arrow metadata looks like, as written by nanoparquet and by arrow:

❯ parse_arrow_schema(read_parquet_metadata("nanoparquet.parquet")[["file_meta_data"]][["key_value_metadata"]][[1]][["value"]])
$columns
  name     type_type   type nullable dictionary custom_metadata
1 time FloatingPoint DOUBLE     TRUE               characte....

$custom_metadata
[1] key   value
<0 rows> (or 0-length row.names)

$endianness
[1] "Little"

$features
character(0)
❯ parse_arrow_schema(read_parquet_metadata("arrow.parquet")[["file_meta_data"]][["key_value_metadata"]][[1]][["value"]][[2]])
$columns
  name type_type       type nullable dictionary custom_metadata
1 time      Time SECOND, 32     TRUE               characte....

$custom_metadata
  key
1   r
                                                                                                                                                     value
1 A\n3\n263170\n197888\n5\nUTF-8\n531\n1\n531\n1\n254\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n4\ntime\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n

$endianness
[1] "Little"

$features
character(0)

nanoparquet probably creates the Arrow schema before converting the column from double to integer, so this is a bug in nanoparquet, and probably an easy one to fix.

@gaborcsardi
Member

Actually, that's not what's happening: the double type in the Arrow schema is written intentionally. It should probably be a Time type instead.
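
To make the mismatch concrete, here is a small Python sketch of the kind of consistency check a writer's test suite could run: it compares a column's Parquet physical type with the type family recorded in the embedded Arrow schema. The table and the function `hint_matches` are hypothetical and deliberately incomplete; this is not how Polars, Arrow, or nanoparquet decide anything. The type names are the ones shown in the outputs above.

```python
# Hypothetical, incomplete table: for each Parquet physical type, the
# Arrow type families that would be a sensible embedded-schema hint.
COMPATIBLE_ARROW_HINTS = {
    "INT32": {"Int", "Date", "Time", "Decimal"},
    "INT64": {"Int", "Timestamp", "Time", "Duration", "Decimal"},
    "FLOAT": {"FloatingPoint"},
    "DOUBLE": {"FloatingPoint"},
}


def hint_matches(physical_type: str, arrow_type_type: str) -> bool:
    """Return True if the Arrow hint is plausible for the physical type."""
    return arrow_type_type in COMPATIBLE_ARROW_HINTS.get(physical_type, set())


# nanoparquet file from this report: INT32 column, Arrow hint FloatingPoint.
print(hint_matches("INT32", "FloatingPoint"))
# Same INT32 column with a Time hint, as suggested above.
print(hint_matches("INT32", "Time"))
```

The first call reports a mismatch and the second does not, which matches the observation that the file reads fine once the Arrow metadata is omitted.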

gaborcsardi added the bug label on Feb 16, 2025
@gaborcsardi
Member

This should be fixed now on main. Thanks for the report!

@eitsupi
Author

eitsupi commented Feb 17, 2025

Confirmed, this was fixed by e54aa46.
Thanks for the update!

eitsupi closed this as completed on Feb 17, 2025