Reader cannot handle if only part of a column chunk is dictionary encoded #110

Closed
gaborcsardi opened this issue Jan 29, 2025 · 3 comments · Fixed by #117
Labels
bug an unexpected problem or unintended behavior

Comments

gaborcsardi commented Jan 29, 2025

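Apparently a writer can dictionary encode only some of the data pages of a column chunk and write the remaining pages as PLAIN. The reader cannot handle such a chunk.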

Example file at https://github.com/nflverse/nflverse-data/releases/download/pbp/play_by_play_2023.parquet

A problematic column chunk is in column 28, between elements 12000:25000:

arrow::write_parquet(x[12000:25000,28,drop = FALSE], "/tmp/p2.parquet")
nanoparquet:::read_parquet_pages("/tmp/p2.parquet")
        file_name row_group column       page_type page_header_offset
1 /tmp/p2.parquet         0      0 DICTIONARY_PAGE                  4
2 /tmp/p2.parquet         0      0       DATA_PAGE             384970
3 /tmp/p2.parquet         0      0       DATA_PAGE             406693
  uncompressed_page_size compressed_page_size crc num_values       encoding
1                1115159               384945  NA      11246          PLAIN
2                  21538                21544  NA      12288 RLE_DICTIONARY
3                  67383                23643  NA        713          PLAIN
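
For illustration, here is a minimal sketch of what a reader has to do for such a chunk: pick the decoder per data page, not once per column chunk. This is not nanoparquet's code. The page model (`Page`, `decode_column_chunk`) is hypothetical and heavily simplified, with index lists standing in for the RLE/bit-packed payload.

```python
from typing import List, Sequence, Tuple

# Hypothetical, simplified page model: (encoding, payload).
# For "RLE_DICTIONARY" the payload holds indices into the dictionary;
# for "PLAIN" it holds the values themselves.
Page = Tuple[str, Sequence[int]]


def decode_column_chunk(dictionary: Sequence[int], pages: List[Page]) -> List[int]:
    """Decode every data page according to its own encoding."""
    values: List[int] = []
    for encoding, payload in pages:
        if encoding in ("RLE_DICTIONARY", "PLAIN_DICTIONARY"):
            # Dictionary-encoded page: map each index to its dictionary entry.
            values.extend(dictionary[i] for i in payload)
        elif encoding == "PLAIN":
            # Fallback page: the values are stored directly.
            values.extend(payload)
        else:
            raise ValueError(f"unsupported encoding: {encoding}")
    return values


# Same shape as the listing above: a dictionary page, one dictionary-encoded
# data page, then a PLAIN data page written after the writer fell back.
dictionary = [10, 20, 30]
pages = [
    ("RLE_DICTIONARY", [0, 1, 2, 1]),
    ("PLAIN", [40, 50]),
]
decoded = decode_column_chunk(dictionary, pages)
print(decoded)
```

A reader that assumes one encoding for the whole chunk would treat the values of the last page as dictionary indices, or the other way around.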

Originally posted by @jack-davison in #71

gaborcsardi added the bug label Jan 30, 2025

gaborcsardi commented Feb 6, 2025

Here is a way to create such files. With a REQUIRED column:

import pyarrow as pa
import pyarrow.parquet as pq
schema = pa.schema(fields=[
    pa.field(name = 'x', type = pa.int32(), nullable = False)
])
data = [ range(2000) ]
table = pa.table(data = data, schema = schema)
pq.write_table(table, 'mixed-int32.parquet', dictionary_pagesize_limit = 400)

With an OPTIONAL one:

import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'x': pa.array(range(2000), type=pa.int32())})
pq.write_table(table, 'mixed-int32-miss.parquet', dictionary_pagesize_limit = 400)
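
A rough back-of-the-envelope check of why these recipes should produce mixed pages. It assumes each dictionary entry of an int32 column takes 4 bytes and ignores page headers. When exactly the writer notices the limit and falls back to PLAIN is writer-specific and not modelled here.

```python
INT32_BYTES = 4    # size of one int32 dictionary entry (assumption: stored as plain 4-byte values)
n_distinct = 2000  # range(2000) has 2000 distinct values
limit = 400        # dictionary_pagesize_limit used above, in bytes

# Size the dictionary would need if every value stayed dictionary encoded.
dictionary_bytes = n_distinct * INT32_BYTES
# Number of entries that fit before the limit is reached.
entries_within_limit = limit // INT32_BYTES

print(dictionary_bytes, entries_within_limit)
```

The full dictionary would need far more than the 400 byte limit, so only the first part of the column can be dictionary encoded.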

@gaborcsardi

This is now fixed by #117:

❯ read_parquet("play_by_play_2023.parquet")
# A data frame: 49,665 × 372
   play_id game_id       old_game_id home_team away_team season_type  week posteam posteam_type defteam side_of_field yardline_100
     <dbl> <chr>         <chr>       <chr>     <chr>     <chr>       <int> <chr>   <chr>        <chr>   <chr>                <dbl>
 1       1 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 NA      NA           NA      NA                      NA
 2      39 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     ARI                     35
 3      55 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     75
 4      77 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     72
 5     102 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     66
 6     124 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     64
 7     147 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     64
 8     172 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     52
 9     197 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     51
10     220 2023_01_ARI_… 2023091007  WAS       ARI       REG             1 WAS     home         ARI     WAS                     51
# ℹ 49,655 more rows
# ℹ 360 more variables: game_date <chr>, quarter_seconds_remaining <dbl>, half_seconds_remaining <dbl>,
#   game_seconds_remaining <dbl>, game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>,
#   goal_to_go <int>, time <chr>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>,
#   shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, qb_kneel <dbl>, qb_spike <dbl>, qb_scramble <dbl>, pass_length <chr>,
#   pass_location <chr>, air_yards <dbl>, yards_after_catch <dbl>, run_location <chr>, run_gap <chr>, field_goal_result <chr>,
#   kick_distance <dbl>, extra_point_result <chr>, two_point_conv_result <chr>, home_timeouts_remaining <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Release is coming soon!
