Parquet reader fails when file has fewer columns than reader_schema #14980
Comments
I'm not sure if I'm a huge fan of just silently returning null columns for missing columns - what if you misspelled a column? Perhaps this could be some sort of opt-in option? Do you know why DataFusion and PyArrow chose this default behavior?
I am already very happy if it's just an opt-in at plan level for the Scan execution! You likely won't ever run into a misspelled column because the schema comes from the table metadata rather than being typed by hand. Not sure why it's the default, but if this weren't possible then in theory you would have to rewrite all old parquet files simply to add a null column, just so you could read newer parquet files where you added an additional column.
I also have this use case, but it should be opt-in. Sometimes newer parquet files have more columns and you don't want to touch old files to add them. I have this in a non-delta extract layer and currently have to read metadata in an extra step to know which columns are in which file.
The reader schema should belong to the file. If you want to add
@ritchie46 that wouldn't work though; the way Delta Lake works is that there is one single schema for the table, but each individual parquet file can contain only a subset of the columns due to how schema evolution works. If the reader_schema requires the schema of each single parquet file to match exactly how that file is structured, then you would have to query each file's metadata just to read it. DataFusion and PyArrow have no problem reading parquet tables with mixed schemas as long as you provide a top-level schema to read the dataset with. See the DataFusion docs here: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html#structfield.file_schema
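For reference, a minimal sketch of the PyArrow behavior described above, assuming two files shaped like the reproducible example below (file names are illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# One top-level schema for the whole dataset; individual files may
# contain only a subset of these columns.
table_schema = pa.schema([("foo", pa.string()), ("bar", pa.int64())])

# PyArrow fills the missing "bar" column with nulls for the file that
# lacks it, instead of erroring.
dataset = ds.dataset(["file1.parquet", "file2.parquet"], schema=table_schema)
print(dataset.to_table())
```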
If you fetch the metadata of the file, you can get the file schema. That can be used.
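A sketch of that metadata fetch with PyArrow (it reads only the parquet footer, not the data):

```python
import pyarrow.parquet as pq

# read_schema only touches the footer, so this is cheap, but it is
# still one extra round trip per file in the dataset.
schema = pq.read_schema("file1.parquet")
print(schema.names)  # e.g. ['foo']
```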
My whole point is that you can avoid that if you have a priori knowledge of the correct schema.
Closed as completed via #18922
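For anyone landing here: a minimal sketch of the opt-in that resolution points at, assuming the allow_missing_columns parameter on scan_parquet (verify the exact name against your Polars version):

```python
import polars as pl

# Opt in to inserting all-null columns for files that lack them,
# instead of raising a schema-mismatch error.
lf = pl.scan_parquet(
    ["file1.parquet", "file2.parquet"],
    allow_missing_columns=True,
)
print(lf.collect())  # "bar" is null for rows from file1.parquet
```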
Checks
Reproducible example
Create 2 parquet files, one file having {"foo": "Utf8"}, the other file having {"foo": "Utf8", "bar": "Int64"}
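A minimal sketch of that setup in Python (file names are illustrative):

```python
import polars as pl

# File 1: only "foo" (Utf8)
pl.DataFrame({"foo": ["a", "b"]}).write_parquet("file1.parquet")

# File 2: "foo" (Utf8) plus the evolved column "bar" (Int64)
pl.DataFrame({"foo": ["c"], "bar": [1]}).write_parquet("file2.parquet")

# Scanning both files together fails, because file1.parquet is
# missing the "bar" column present in the unified schema.
pl.scan_parquet(["file1.parquet", "file2.parquet"]).collect()
```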
Log output
Issue description
I am trying to make schema-evolved Delta tables readable with polars-deltalake; however, Polars does not seem to automatically create null arrays for columns that are missing from a parquet file when you read it with a reader schema.
Expected behavior
When you read a parquet file with a reader_schema and only a subset of the columns is available in the file, Polars should create null arrays for the missing columns with their respective types.
This is also the behavior of DataFusion and PyArrow when you scan multiple parquet files with a provided schema.
Installed versions