Cannot read columns of Time type by Polars #121

eitsupi · 2025-02-11T15:32:20Z

I've noticed that when I try to read a column of Time type written by nanoparquet with Polars, I get an error.
Since the file reads fine in Arrow C++ and DuckDB, this may be a problem in Polars rather than in nanoparquet (or both may have a problem).

In R (for comparison, a file with the same contents is also written with arrow::write_parquet()):

data.frame(time = hms::hms(0)) |>
  nanoparquet::write_parquet("nanoparquet.parquet")

data.frame(time = hms::hms(0)) |>
  arrow::write_parquet("arrow.parquet", version = 1)

Try reading each one in Python Polars (non-Python Polars will give the same error):

>>> import polars as pl
>>> pl.read_parquet("arrow.parquet")
shape: (1, 1)
┌──────────┐
│ time     │
│ ---      │
│ time     │
╞══════════╡
│ 00:00:00 │
└──────────┘
>>> pl.read_parquet("nanoparquet.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.local/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 241, in read_parquet
    return lf.collect()
           ^^^^^^^^^^^^
  File "/home/username/.local/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2066, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: parquet: Not yet supported: reading parquet type Int32 to Float64 still not implemented


eitsupi · 2025-02-11T15:51:43Z

Additional information: With Polars' new streaming engine, which is an experimental feature at this time, no error is raised. Instead, the column is read as Float64 and the result has zero rows.

>>> import polars as pl
>>> pl.__version__
'1.22.0'
>>> pl.scan_parquet('nanoparquet.parquet').collect(new_streaming=True)
shape: (0, 1)
┌──────┐
│ time │
│ ---  │
│ f64  │
╞══════╡
└──────┘

gaborcsardi · 2025-02-11T16:37:09Z

Well, that column is definitely not Float64 (=DOUBLE in Parquet), so I suspect that this will be a bug in Polars, but I'll see what we can do about it:

❯ read_parquet_schema("nanoparquet.parquet")
            file_name   name r_type  type type_length repetition_type
1 nanoparquet.parquet schema   <NA>  <NA>          NA            <NA>
2 nanoparquet.parquet   time    hms INT32          NA        REQUIRED
  converted_type logical_type num_children scale precision field_id
1           <NA>                         1    NA        NA       NA
2    TIME_MILLIS TIME, TR....           NA    NA        NA       NA
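
For readers unfamiliar with the schema output above: a TIME_MILLIS column stores each value as an INT32 count of milliseconds after midnight, which is why the physical type is INT32 and not DOUBLE. The following is a minimal sketch of that decoding using only the Python standard library. The function name `time_millis_to_time` and the range check are illustrative assumptions, not part of nanoparquet or Polars.

```python
from datetime import time

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def time_millis_to_time(millis: int) -> time:
    """Decode a TIME_MILLIS value (milliseconds after midnight).

    Hypothetical helper for illustration only; real readers do this
    conversion internally.
    """
    if not 0 <= millis < MILLIS_PER_DAY:
        raise ValueError(f"value out of range for a time of day: {millis}")
    seconds, ms = divmod(millis, 1000)
    minutes, s = divmod(seconds, 60)
    h, m = divmod(minutes, 60)
    return time(hour=h, minute=m, second=s, microsecond=ms * 1000)


# hms::hms(0) is zero seconds after midnight, stored as the integer 0
print(time_millis_to_time(0))  # 00:00:00
```

A reader that treats the column as Float64 has no such integer to decode, which is consistent with the error reported above.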

eitsupi · 2025-02-11T16:52:36Z

Cross post: pola-rs/polars#21195

gaborcsardi · 2025-02-16T15:16:17Z

Here is a workaround:

td <- data.frame(time = hms::hms(0))
nanoparquet::write_parquet(
  td,
  "nanoparquet.parquet",
  # Do not embed the Arrow schema, so readers rely on the Parquet schema only
  options = nanoparquet::parquet_options(write_arrow_metadata = FALSE)
)

❯ python3
Python 3.13.0 (main, Oct  7 2024, 05:02:14) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.read_parquet("nanoparquet.parquet")
shape: (1, 1)
┌──────────┐
│ time     │
│ ---      │
│ time     │
╞══════════╡
│ 00:00:00 │
└──────────┘

It is entirely possible that there is a bug in the nanoparquet arrow metadata writer, but even then this is still a bug in Polars, because it should not fail reading the file because of the metadata.

Here is what the embedded Arrow metadata looks like, as written by nanoparquet and by arrow:

❯ parse_arrow_schema(read_parquet_metadata("nanoparquet.parquet")[["file_meta_data"]][["key_value_metadata"]][[1]][["value"]])
$columns
  name     type_type   type nullable dictionary custom_metadata
1 time FloatingPoint DOUBLE     TRUE               characte....

$custom_metadata
[1] key   value
<0 rows> (or 0-length row.names)

$endianness
[1] "Little"

$features
character(0)

❯ parse_arrow_schema(read_parquet_metadata("arrow.parquet")[["file_meta_data"]][["key_value_metadata"]][[1]][["value"]][[2]])
$columns
  name type_type       type nullable dictionary custom_metadata
1 time      Time SECOND, 32     TRUE               characte....

$custom_metadata
  key
1   r
                                                                                                                                                     value
1 A\n3\n263170\n197888\n5\nUTF-8\n531\n1\n531\n1\n254\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n4\ntime\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n

$endianness
[1] "Little"

$features
character(0)

nanoparquet probably creates the arrow schema before converting the column to integer from double, so this is a bug, probably easy to fix.

gaborcsardi · 2025-02-16T16:02:33Z

Actually, that's not what's happening, and the double value in the arrow schema is intentional. It should probably be a Time value.
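
To make the mismatch discussed in this thread concrete, here is a rough sketch of a consistency check between the Parquet schema and the embedded Arrow schema. It is a simplified model, not how nanoparquet or Polars work internally: the `ParquetColumn` type, the `COMPATIBLE_ARROW_TYPES` table, and `find_mismatches` are all hypothetical, and the table only covers the few types mentioned here.

```python
from typing import NamedTuple, Optional


class ParquetColumn(NamedTuple):
    """Simplified view of one leaf column in the Parquet schema (hypothetical)."""
    name: str
    physical_type: str
    logical_type: Optional[str]


# Assumed, deliberately tiny compatibility table:
# (Parquet physical type, Parquet logical type) -> acceptable Arrow type families
COMPATIBLE_ARROW_TYPES = {
    ("INT32", "TIME"): {"Time"},
    ("INT32", None): {"Int"},
    ("DOUBLE", None): {"FloatingPoint"},
}


def find_mismatches(parquet_cols, arrow_types):
    """Return (name, parquet type, arrow type) for columns whose Arrow
    metadata contradicts the Parquet schema."""
    mismatches = []
    for col in parquet_cols:
        arrow_type = arrow_types.get(col.name)
        if arrow_type is None:
            continue  # no Arrow metadata for this column, nothing to contradict
        key = (col.physical_type, col.logical_type)
        if arrow_type not in COMPATIBLE_ARROW_TYPES.get(key, set()):
            parquet_type = f"{col.physical_type}/{col.logical_type}"
            mismatches.append((col.name, parquet_type, arrow_type))
    return mismatches


cols = [ParquetColumn("time", "INT32", "TIME")]

# Arrow metadata as written by nanoparquet before the fix
print(find_mismatches(cols, {"time": "FloatingPoint"}))
# Arrow metadata declaring a Time type, as in the file written by arrow
print(find_mismatches(cols, {"time": "Time"}))
# No Arrow metadata at all, as with write_arrow_metadata = FALSE
print(find_mismatches(cols, {}))
```

Only the first call reports a mismatch. The other two correspond to the two situations in which Polars read the file without error in this thread.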

gaborcsardi · 2025-02-17T09:52:49Z

This should be fixed now on main. Thanks for the report!

eitsupi · 2025-02-17T14:37:27Z

Confirmed, this was fixed by e54aa46.
Thanks for the update!

eitsupi changed the title ~~Cannot read columns of type Time in Polars~~ Cannot read columns of Time type by Polars Feb 11, 2025

eitsupi mentioned this issue Feb 11, 2025

Cannot read the Time type of a Parquet file correctly pola-rs/polars#21195

gaborcsardi added the bug an unexpected problem or unintended behavior label Feb 16, 2025

eitsupi closed this as completed Feb 17, 2025
