[feature request] read parquet from URL (or from raw vector?) #71
Comments
Yes, we could definitely do one or both of those. The challenge with HTTP support is keeping the package lean, but reading from a raw vector is pretty straightforward. Btw, we could also support reading from an R connection, then you could do read_parquet(url("https://...."))
Either of these would be great!
Reading from a connection would be great, as that's how we read rds files from a URL!
To clarify, for a Parquet file, reading from a connection means that we would need to read the whole file first, save it to a temporary file, and then read it from there. You can also do that yourself relatively easily as a workaround.
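The workaround described above could look roughly like this. It is only a sketch: `read_parquet_url()` and its `reader` argument are hypothetical names made up for illustration, not part of nanoparquet, and it assumes the file fits comfortably on local disk.

```r
# Hypothetical helper, not a nanoparquet function: download the whole file
# to a temporary path, then read it from there.
read_parquet_url <- function(url, reader = nanoparquet::read_parquet) {
  tmp <- tempfile(fileext = ".parquet")
  # Remove the temporary copy once the data has been read.
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the download binary; on Windows the default text mode
  # can corrupt a Parquet file.
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  reader(tmp)
}
```

Something like `read_parquet_url("https://....")` would then behave like a local read, at the cost of one full download per call.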
Dev version can read from a connection now.
I consider this closed now, release coming soon.
I was just playing with the new release and used Tan's data to test it out, but it seems like nanoparquet fails to read it from a connection, while arrow reads it fine:
packageVersion("arrow")
#> [1] '18.1.0.1'
packageVersion("nanoparquet")
#> [1] '0.4.0'
path <- "https://github.com/nflverse/nflverse-data/releases/download/pbp/play_by_play_2023.parquet"
# nanoparquet
con <- url(path)
nanoparquet::read_parquet(con)
#> Error in nanoparquet::read_parquet(con): No leading magic bytes, invalid Parquet file at 'C:\Users\JD38\AppData\Local\Temp\Rtmp2tFLVk\file175947d534de9.parquet' @ lib/ParquetReader.cpp:72
# arrow
arrow::read_parquet(path) |> tibble::tibble()
#> # A tibble: 49,665 × 372
#> play_id game_id old_game_id home_team away_team season_type week posteam
#> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 1 2023_01_AR… 2023091007 WAS ARI REG 1 <NA>
#> 2 39 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 3 55 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 4 77 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 5 102 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 6 124 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 7 147 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 8 172 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 9 197 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 10 220 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> # ℹ 49,655 more rows
#> # ℹ 364 more variables: posteam_type <chr>, defteam <chr>, side_of_field <chr>,
#> # yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>,
#> # half_seconds_remaining <dbl>, game_seconds_remaining <dbl>,
#> # game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>,
#> # down <dbl>, goal_to_go <int>, time <chr>, yrdln <chr>, ydstogo <dbl>,
#> # ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>, …
Created on 2025-01-29 with reprex v2.1.1
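A "No leading magic bytes" error means the reader did not find the `PAR1` marker that a standard Parquet file carries in its first and last four bytes. One way to narrow down whether the temporary copy itself is damaged is to check those bytes directly. This is a rough diagnostic sketch; `has_parquet_magic()` is a hypothetical helper, not a nanoparquet or arrow function, and it reads the whole file into memory for simplicity.

```r
# Hypothetical diagnostic helper: TRUE if the file starts and ends with the
# 4-byte Parquet magic "PAR1".
has_parquet_magic <- function(path) {
  size <- file.size(path)
  # The smallest candidate file needs room for both 4-byte markers.
  if (is.na(size) || size < 8) return(FALSE)
  bytes <- readBin(path, what = "raw", n = size)
  magic <- charToRaw("PAR1")
  identical(head(bytes, 4), magic) && identical(tail(bytes, 4), magic)
}
```

Running it on a manually downloaded copy of the file (saved in binary mode) and on the temporary file named in the error message would show whether the bytes were altered on the way to disk.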
Just as an FYI: I wrote the same data to a Parquet file with nanoparquet instead of arrow and uploaded it to a test release. This works perfectly fine.
path <- "https://github.com/nflverse/nflverse-data/releases/download/test/pbp_2023_nanop.parquet"
con <- url(path)
a <- nanoparquet::read_parquet(con) |> tibble::as_tibble()
print(a)
#> # A tibble: 49,665 × 372
#> play_id game_id old_game_id home_team away_team season_type week posteam
#> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 1 2023_01_AR… 2023091007 WAS ARI REG 1 <NA>
#> 2 39 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 3 55 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 4 77 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 5 102 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 6 124 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 7 147 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 8 172 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 9 197 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> 10 220 2023_01_AR… 2023091007 WAS ARI REG 1 WAS
#> # ℹ 49,655 more rows
#> # ℹ 364 more variables: posteam_type <chr>, defteam <chr>, side_of_field <chr>,
#> # yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>,
#> # half_seconds_remaining <dbl>, game_seconds_remaining <dbl>,
#> # game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>,
#> # down <dbl>, goal_to_go <int>, time <chr>, yrdln <chr>, ydstogo <dbl>,
#> # ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>, …
packageVersion("nanoparquet")
#> [1] '0.4.0.9000'
Created on 2025-01-31 with reprex v2.1.1
Hi! Excited by the looks of this package. A frequent use case I have is reading a parquet from a URL, e.g.
Is this something that would be in-scope for nanoparquet?