Reader cannot handle if only part of a column chunk is dictionary encoded #110

gaborcsardi · 2025-01-29T23:52:15Z

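Apparently a writer can dictionary-encode only some of the data pages in a column chunk and write the remaining pages with another encoding (PLAIN in the example below). The reader currently cannot handle this mix.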

Example file at https://github.com/nflverse/nflverse-data/releases/download/pbp/play_by_play_2023.parquet

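A problematic column chunk is in column 28, in rows 12000:25000:

# Assumes `x` holds the example file as a data frame, e.g. x <- arrow::read_parquet("play_by_play_2023.parquet")
arrow::write_parquet(x[12000:25000, 28, drop = FALSE], "/tmp/p2.parquet")
nanoparquet:::read_parquet_pages("/tmp/p2.parquet")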

        file_name row_group column       page_type page_header_offset
1 /tmp/p2.parquet         0      0 DICTIONARY_PAGE                  4
2 /tmp/p2.parquet         0      0       DATA_PAGE             384970
3 /tmp/p2.parquet         0      0       DATA_PAGE             406693
  uncompressed_page_size compressed_page_size crc num_values       encoding
1                1115159               384945  NA      11246          PLAIN
2                  21538                21544  NA      12288 RLE_DICTIONARY
3                  67383                23643  NA        713          PLAIN
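
The page listing above shows a dictionary page, then an RLE_DICTIONARY data page, then a PLAIN data page, all in the same column chunk. A reader therefore has to choose the decoder per page rather than per chunk. The following is only a minimal sketch of that idea, not nanoparquet's implementation: the `Page` class and `read_column_chunk` function are hypothetical, and page payloads are modeled as already-decoded Python lists rather than raw bytes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    # Hypothetical, simplified page model for illustration only.
    page_type: str  # "DICTIONARY_PAGE" or "DATA_PAGE"
    encoding: str   # "PLAIN" or "RLE_DICTIONARY"
    payload: List   # dictionary values, dictionary indices, or plain values


def read_column_chunk(pages: List[Page]) -> List:
    """Decode all pages of one column chunk, picking the decoder per page."""
    dictionary: Optional[List] = None
    out: List = []
    for page in pages:
        if page.page_type == "DICTIONARY_PAGE":
            # Remember the dictionary for later dictionary-encoded pages.
            dictionary = page.payload
        elif page.encoding == "RLE_DICTIONARY":
            if dictionary is None:
                raise ValueError("dictionary-encoded page without a dictionary")
            # The payload holds indices into the dictionary.
            out.extend(dictionary[i] for i in page.payload)
        elif page.encoding == "PLAIN":
            # The payload holds the values themselves; no dictionary lookup.
            out.extend(page.payload)
        else:
            raise ValueError(f"unsupported encoding: {page.encoding}")
    return out


# Same page layout as in the listing: dictionary, dict-encoded data, plain data.
pages = [
    Page("DICTIONARY_PAGE", "PLAIN", ["a", "b", "c"]),
    Page("DATA_PAGE", "RLE_DICTIONARY", [0, 1, 1, 2, 0]),
    Page("DATA_PAGE", "PLAIN", ["d", "e"]),
]
values = read_column_chunk(pages)
print(values)  # ['a', 'b', 'b', 'c', 'a', 'd', 'e']
```

The point of the sketch is that the PLAIN page after the fallback contributes its values directly, so a reader that assumes every data page in a chunk with a dictionary holds indices would misread it.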

Originally posted by @jack-davison in #71

gaborcsardi · 2025-02-06T15:23:02Z

Here is a way to create such files. With a REQUIRED column:

import pyarrow as pa
import pyarrow.parquet as pq
schema = pa.schema(fields=[
    pa.field(name = 'x', type = pa.int32(), nullable = False)
])
data = [ range(2000) ]
table = pa.table(data = data, schema = schema)
pq.write_table(table, 'mixed-int32.parquet', dictionary_pagesize_limit = 400)

With an OPTIONAL one:

import pyarrow as pa
import pyarrow.parquet as pq
# Fields created from a plain array are nullable, i.e. OPTIONAL, by default
table = pa.table({'x': pa.array(range(2000), type=pa.int32())})
pq.write_table(table, 'mixed-int32-miss.parquet', dictionary_pagesize_limit = 400)
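
A rough back-of-the-envelope check of why these recipes produce a mixed chunk, assuming a PLAIN-encoded int32 dictionary needs 4 bytes per distinct value. The helper name `full_dictionary_bytes` is made up for this sketch, and the exact point at which a writer stops using the dictionary is an implementation detail that this does not model.

```python
INT32_BYTES = 4  # size of one PLAIN-encoded int32 dictionary entry


def full_dictionary_bytes(n_distinct: int, value_bytes: int = INT32_BYTES) -> int:
    """Bytes a dictionary would need to hold every distinct value."""
    return n_distinct * value_bytes


n_distinct = 2000  # range(2000) has no repeated values
limit = 400        # dictionary_pagesize_limit used in the recipes above

needed = full_dictionary_bytes(n_distinct)
print(needed)          # 8000
print(needed > limit)  # True: the dictionary cannot cover the whole column
```

Since a dictionary covering all 2000 values would be far larger than the limit, the writer can dictionary-encode only the first part of the column and has to fall back to another encoding for the rest.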

gaborcsardi · 2025-02-08T21:19:58Z

This is now fixed by #117:

❯ read_parquet("play_by_play_2023.parquet")
# A data frame: 49,665 × 372
   play_id game_id       old_game_id home_team away_team season_type  week posteam posteam_type defteam side_of_field yardline_100
     <dbl> <chr>         <chr>       <chr>     <chr>     <chr>       <int> <chr>   <chr>        <chr>   <chr>                <dbl>
 1       1 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 NA      NA           NA      NA                      NA
 2      39 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     ARI                     35
 3      55 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     75
 4      77 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     72
 5     102 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     66
 6     124 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     64
 7     147 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     64
 8     172 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     52
 9     197 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     51
10     220 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     51
# ℹ 49,655 more rows
# ℹ 360 more variables: game_date <chr>, quarter_seconds_remaining <dbl>, half_seconds_remaining <dbl>,
#   game_seconds_remaining <dbl>, game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>,
#   goal_to_go <int>, time <chr>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>,
#   shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, qb_kneel <dbl>, qb_spike <dbl>, qb_scramble <dbl>, pass_length <chr>,
#   pass_location <chr>, air_yards <dbl>, yards_after_catch <dbl>, run_location <chr>, run_gap <chr>, field_goal_result <chr>,
#   kick_distance <dbl>, extra_point_result <chr>, two_point_conv_result <chr>, home_timeouts_remaining <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Release is coming soon!

gaborcsardi added the bug label (an unexpected problem or unintended behavior) Jan 30, 2025

gaborcsardi mentioned this issue Jan 30, 2025

[feature request] read parquet from URL (or from raw vector?) #71

Closed

gaborcsardi mentioned this issue Feb 7, 2025

Support a dict + non-dict pages mix within a column chunk #117

Merged

gaborcsardi added a commit that referenced this issue Feb 8, 2025

Add NEWS for #110

f7bfb2f

[ci skip]

gaborcsardi closed this as completed in #117 Feb 8, 2025
