Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402 #122

Closed
TurnaevEvgeny opened this issue Feb 13, 2025 · 15 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@TurnaevEvgeny

Hi,
I encountered a weird bug where nanoparquet fails to write a data frame.

I narrowed it down to this:

bad_df = head(vm_data_df[vm_data_df$tract_id %in% TRACT_ID,c("address_id"),drop = F], n = 3)
Browse[1]> bad_df
     address_id
3367         NA
3368         NA
3369         NA

nanoparquet::write_parquet(bad_df, predict_data_file)
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402

The file on disk after this operation contains only PAR1.

Somehow, if I take only 1 or 2 rows, it works:

Browse[1]> bad_df = head(vm_data_df[vm_data_df$tract_id %in% TRACT_ID,c("address_id"),drop = F], n = 2)
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file)
Browse[1]> bad_df = head(vm_data_df[vm_data_df$tract_id %in% TRACT_ID,c("address_id"),drop = F], n = 3)
Browse[1]> bad_df[3,]
[1] NA
Browse[1]> bad_df[3,"address_id"]
[1] NA
Browse[1]> class(bad_df[3,"address_id"])
[1] "integer"

If I add a character column in front, it works:

Browse[1]> head(bad_df)
     address_id
3367         NA
3368         NA
3369         NA
Browse[1]> bad_df$some_char_col = "foo"
Browse[1]> bad_df[,c("some_char_col", "address_id")]
     some_char_col address_id
3367           foo         NA
3368           foo         NA
3369           foo         NA
Browse[1]> nanoparquet::write_parquet(bad_df[,c("some_char_col", "address_id")], predict_data_file)
Browse[1]>

It also works without compression, but fails with every compression type:

Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "snappy")
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "gzip")
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "zstd")
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "uncompressed")
# works

I think it fails in ParquetOutFile::write_dictionary_page, and it's something about the double buffer rather than anything compression-specific, since it fails with all compression types.
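To illustrate the suspicion, here is a hedged sketch of what such a double-buffered write path could look like (the function and structure below are my assumptions for illustration, not nanoparquet's actual code). Uncompressed bytes can go straight to the file stream, while compressed pages are first staged in a second, in-memory stream, so a bug in that staging buffer would surface for every compression type:

#include <fstream>
#include <sstream>
#include <string>

// Hypothetical double-buffered page write, for illustration only.
void write_page(std::ofstream& file, const std::string& page,
                bool compressed) {
    if (!compressed) {
        // Direct path: bytes go straight to the file stream.
        file.write(page.data(), (std::streamsize) page.size());
        return;
    }
    // Compressed path: stage the raw page in an in-memory buffer
    // first (the compression step itself is omitted here), then
    // copy the result to the file. A broken staging buffer would
    // therefore fail for snappy, gzip, and zstd alike.
    std::ostringstream staging;
    staging.write(page.data(), (std::streamsize) page.size());
    const std::string out = staging.str();
    file.write(out.data(), (std::streamsize) out.size());
}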

It fails for me inside a Docker image, and I can't reproduce it on my Mac laptop.

System info
uname -a
Linux c342a369b96b 6.8.0-1019-aws #21~22.04.1-Ubuntu SMP Thu Nov  7 17:33:30 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] C

time zone: Etc/UTC
tzcode source: system (glibc)

other attached packages:
...
 nanoparquet_0.4.1

Default parquet options

Browse[1]> parquet_options()
$class
[1] "tbl"

$compression_level
[1] NA

$keep_row_groups
[1] FALSE

$num_rows_per_row_group
[1] 10000000

$use_arrow_metadata
[1] TRUE

$write_arrow_metadata
[1] TRUE

$write_data_page_version
[1] 1

$write_minmax_values
[1] TRUE

I've tried changing the parquet options with options(), but it seems to have no effect.

I can recompile the package from a branch with any additional debugging. Please let me know where I can put debug prints to see what's going on in ParquetOutFile::write_dictionary_page.

@gaborcsardi
Member

Can you try to create a reproducible example? Or share the data frame that triggers this? Thanks!

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 13, 2025

The thing is, it only fails for me in Docker.

A minimal example in Docker:

bad_df = readRDS("./bad_df.rds")
library(nanoparquet)
nanoparquet::write_parquet(bad_df, "./foo.parq")
#Error in nanoparquet::write_parquet(bad_df, "./foo.parq") :
#  Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:349

But it works for me on my laptop.
The line number is different because in Docker I have nanoparquet 0.3.1 by default, but I also tried 0.4, as you can see from the original message.

bad_df.rds.zip

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 13, 2025

Update: reproduced on Linux outside Docker, on Ubuntu 22.04.

R

R version 4.2.2 Patched (2022-11-10 r83330) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(nanoparquet)
> bad_df = readRDS("./bad_df.rds")
> nanoparquet::write_parquet(bad_df, "./foo.parq")
Error in nanoparquet::write_parquet(bad_df, "./foo.parq") :
  Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402

@TurnaevEvgeny
Author

I added a debug print:

+++ b/src/lib/ParquetOutFile.cpp
void ParquetOutFile::write_dictionary_(
   uint32_t end = file.tellp();
+  std::cout << "\n\n start = " << start << " end = " << end << "   size = " << size << "\n\n";
   if (end - start != size) {

and I see:

start = 0 end = 4294967295 size = 0

So I changed it to a signed type (int32_t end_signed = file.tellp();) and now I see:

start = 0 end = -1 size = 0

It seems tellp() returned an error. I'm not sure why size is zero.
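For reference, tellp() signals failure by returning pos_type(-1), and storing that in an unsigned 32-bit variable wraps it to 4294967295, which matches both debug outputs above. A minimal standalone illustration (not nanoparquet code):

#include <cstdint>
#include <iostream>

int main() {
    // An ostream constructed with a null streambuf has badbit set,
    // so tellp() signals failure by returning pos_type(-1).
    std::ostream broken(nullptr);
    std::cout << broken.tellp() << "\n";  // prints -1
    // The same value stored in an unsigned 32-bit variable wraps
    // around, exactly as in the first debug output.
    uint32_t end = static_cast<uint32_t>(broken.tellp());
    std::cout << end << "\n";             // prints 4294967295
    return 0;
}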

@gaborcsardi
Member

Thanks, I can indeed reproduce it with

docker run -ti -v `pwd`/bad_df.rds:`pwd`/bad_df.rds -w `pwd` ghcr.io/r-hub/r-minimal/r-minimal bash

and then

bad_df = readRDS("./bad_df.rds")
library(nanoparquet)
nanoparquet::write_parquet(bad_df, "./foo.parq")

Seems like a bug that happens on Linux when writing a page that only has missing values.

@gaborcsardi added the bug label Feb 14, 2025
@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 14, 2025

Thanks. Currently I think it is something in RParquetOutFile::get_size_dictionary that gets the dictionary size wrong.
I don't think it is only about data frames with missing values: I derived this bad_df from a much larger data frame with more columns and much more data that still fails. The common theme, though, is that if I add a character column as the first column, it works. I can also produce a larger data frame with a mix of NA and integer values, similar to bad_df, that fails.

Somehow Rf_xlength returns zero in RParquetOutFile::get_size_dictionary. I'm not familiar with the R API; I suspect this is the vector length.

parquet::Type::INT32
dictidx: 0x56754b8d63d0  Rf_xlength(dictidx): 0 sizeof(int): 4
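Update: Rf_xlength() does return the number of elements of an R vector (SEXP), so a value of 0 here means the dictionary-index vector itself is empty, which would be consistent with an all-NA column whose dictionary contains no distinct values. A hypothetical sketch of the suspected size computation (illustrative only, not nanoparquet's actual get_size_dictionary):

#include <Rinternals.h>

// Hypothetical sketch only: Rf_xlength() returns the element count
// of an R vector, so an empty dictionary vector gives n = 0 and a
// computed byte size of 0, matching the debug output above.
extern "C" SEXP dict_byte_size(SEXP dictidx) {
    R_xlen_t n = Rf_xlength(dictidx);        // 0 for bad_df
    double size = (double) n * sizeof(int);  // 0 bytes
    return Rf_ScalarReal(size);
}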

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 14, 2025

I compared the dump from a good data frame with the bad one.
In the good one, the calls look like this:

ParquetOutFile::write_column idx 0
 ParquetOutFile::write_data_pages
ParquetOutFile::calculate_column_data_size
 ParquetOutFile::write_data_page
ParquetOutFile::calculate_column_data_size
 ParquetOutFile::write_data_
 ParquetOutFile::compress

For the bad data frame, it's:

ParquetOutFile::write_column idx 0
ParquetOutFile::write_dictionary_page
RParquetOutFile::get_size_dictionary called
INTSXP
idx = 0 from: 0 until: 3 sel: SchemaElement(type=INT32, type_length=<null>, repetition_type=OPTIONAL, name=address_id, num_children=<null>, converted_type=INT_32, scale=<null>, precision=<null>, field_id=<null>, logicalType=LogicalType(STRING=<null>, MAP=<null>, LIST=<null>, ENUM=<null>, DECIMAL=<null>, DATE=<null>, TIME=<null>, TIMESTAMP=<null>, INTEGER=IntType(bitWidth= , isSigned=1), UNKNOWN=<null>, JSON=<null>, BSON=<null>, UUID=<null>, FLOAT16=<null>))

0x5a21975585e0
RParquetOutFile::create_dictionary idx 0 from: 0 until = 3
parquet::Type::INT32
dictidx: 0x5a219756e3d0  Rf_xlength(dictidx): 0 sizeof(int): 4
RParquetOutFile::create_dictionary idx 0 from: 0 until = 3
ParquetOutFile::write_dictionary_


 start = 0 end = -1   size = 0

Error in nanoparquet::write_parquet(bad_df, "./foo.parq") :
  Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:418

So the difference is that this section is executed for bad_df:

  if (encodings[idx] == Encoding::RLE_DICTIONARY) {
    uint32_t dictionary_page_offset = pfile.tellp();
    write_dictionary_page(idx, from, until);
    cmd->__set_dictionary_page_offset(dictionary_page_offset);
  }

Maybe for bad_df the address_id column encoding is set to RLE_DICTIONARY, and then later either RParquetOutFile::create_dictionary or ParquetOutFile::write_dictionary_ can't figure out the length correctly?

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 14, 2025

Hmm, not sure if this is related, but now that I upgraded to nanoparquet 0.4.1 on my Mac, I can't read files generated by AWS Athena.
I get this error:

Error in nanoparquet::read_parquet(f_path) : 
  Found dictionary page instead of data page

@gaborcsardi
Member

I can't read files generated by AWS Athena.

That's probably a different thing, can you please open an issue with an example file? Thanks.

gaborcsardi added a commit that referenced this issue Feb 16, 2025
Seems like it is not allowed to have an empty buffer
without an underlying container. So now I initialize
the buffer with a small static container.

For #122.
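The commit message suggests the failure mode: a stream buffer set up over an empty container has no valid put area, so tellp() on the wrapping stream cannot report a position. A hedged standalone reconstruction (the buffer class below is illustrative, not nanoparquet's actual one):

#include <iostream>
#include <streambuf>
#include <vector>

// Illustrative memory-backed output buffer, loosely modelled on the
// failure mode described above (not nanoparquet's actual class).
struct vec_buf : std::streambuf {
    explicit vec_buf(std::vector<char>& v) {
        // For an empty vector, data() may be a null pointer, which
        // leaves the buffer without a valid put area.
        setp(v.data(), v.data() + v.size());
    }
    // tellp() reaches this via pubseekoff(0, cur, out); with no
    // underlying storage there is no position to report.
    pos_type seekoff(off_type off, std::ios_base::seekdir,
                     std::ios_base::openmode) override {
        if (pbase() == nullptr) return pos_type(off_type(-1));
        return pos_type(pptr() - pbase() + off);
    }
};

int main() {
    std::vector<char> empty;            // e.g. a zero-length dictionary
    vec_buf bad(empty);
    std::ostream out1(&bad);
    std::cout << out1.tellp() << "\n";  // prints -1, as in the debug logs

    std::vector<char> small(64);        // the fix: start from a small,
    vec_buf good(small);                // preallocated container
    std::ostream out2(&good);
    std::cout << out2.tellp() << "\n";  // prints 0
    return 0;
}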
@gaborcsardi
Member

I am fairly sure that this is now fixed in main. Can you please try?

@TurnaevEvgeny
Author

It works, thanks!

@al-obrien

Given this appears to break some common operations, is another patch for CRAN planned in the coming weeks?

@gaborcsardi
Member

@al-obrien Do you see this as well? My guess was that it should be rare, given that only one user reported it over several months, and that you'd have to have a zero-length dictionary page (e.g. from an all-NA column) as the first page to write.

But I can definitely do a quick release, let me fix #132 first.

@al-obrien

That's correct, I did see this error come up.

gaborcsardi added a commit that referenced this issue Feb 22, 2025
For #122 and #132 fixes.

[ci skip]
@gaborcsardi
Member

Release happening here: #135.
