Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402 #122

Closed
TurnaevEvgeny opened this issue Feb 13, 2025 · 15 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@TurnaevEvgeny

Hi,
I encountered a weird bug where nanoparquet fails to write a data frame.

I narrowed it down to this:

bad_df = head(vm_data_df[vm_data_df$tract_id %in% TRACT_ID,c("address_id"),drop = F], n = 3)
Browse[1]> bad_df
     address_id
3367         NA
3368         NA
3369         NA

nanoparquet::write_parquet(bad_df, predict_data_file)
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402

The file on disk after this operation contains only PAR1.

Somehow, if I take only 1 or 2 rows, it works:

Browse[1]> bad_df = head(vm_data_df[vm_data_df$tract_id %in% TRACT_ID,c("address_id"),drop = F], n = 2)
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file)
Browse[1]> bad_df = head(vm_data_df[vm_data_df$tract_id %in% TRACT_ID,c("address_id"),drop = F], n = 3)
Browse[1]> bad_df[3,]
[1] NA
Browse[1]> bad_df[3,"address_id"]
[1] NA
Browse[1]> class(bad_df[3,"address_id"])
[1] "integer"

If I add a character column in front, it works:

Browse[1]> head(bad_df)
     address_id
3367         NA
3368         NA
3369         NA
Browse[1]> bad_df$some_char_col = "foo"
Browse[1]> bad_df[,c("some_char_col", "address_id")]
     some_char_col address_id
3367           foo         NA
3368           foo         NA
3369           foo         NA
Browse[1]> nanoparquet::write_parquet(bad_df[,c("some_char_col", "address_id")], predict_data_file)
Browse[1]>

It also works without compression, but fails with every compression type:

Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "snappy")
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "gzip")
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "zstd")
Error during wrapup: Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> nanoparquet::write_parquet(bad_df, predict_data_file, compression = "uncompressed")
# works

I think it fails in ParquetOutFile::write_dictionary_page, and it's something about the double buffer rather than anything compression-specific, since it fails with all compression types.
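To illustrate the suspicion, here is a hedged sketch of what such a double-buffered write path could look like (the function and structure below are my assumptions for illustration, not nanoparquet's actual code). Uncompressed bytes can go straight to the file stream, while compressed pages are first staged in a second, in-memory stream, so a bug in that staging buffer would surface for every compression type:

#include <fstream>
#include <sstream>
#include <string>

// Hypothetical double-buffered page write, for illustration only.
void write_page(std::ofstream& file, const std::string& page,
                bool compressed) {
    if (!compressed) {
        // Direct path: bytes go straight to the file stream.
        file.write(page.data(), (std::streamsize) page.size());
        return;
    }
    // Compressed path: stage the raw page in an in-memory buffer
    // first (the compression step itself is omitted here), then
    // copy the result to the file. A broken staging buffer would
    // therefore fail for snappy, gzip, and zstd alike.
    std::ostringstream staging;
    staging.write(page.data(), (std::streamsize) page.size());
    const std::string out = staging.str();
    file.write(out.data(), (std::streamsize) out.size());
}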

It fails for me inside a Docker image, and I can't reproduce it on my Mac laptop.

System info
uname -a
Linux c342a369b96b 6.8.0-1019-aws #21~22.04.1-Ubuntu SMP Thu Nov  7 17:33:30 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] C

time zone: Etc/UTC
tzcode source: system (glibc)

other attached packages:
...
 nanoparquet_0.4.1

Default parquet options

Browse[1]> parquet_options()
$class
[1] "tbl"

$compression_level
[1] NA

$keep_row_groups
[1] FALSE

$num_rows_per_row_group
[1] 10000000

$use_arrow_metadata
[1] TRUE

$write_arrow_metadata
[1] TRUE

$write_data_page_version
[1] 1

$write_minmax_values
[1] TRUE

I've tried changing the parquet options with options(), but it seems to have no effect.

I can recompile the package from a branch with any additional debugging. Please let me know where I can put debug prints to see what's going on in ParquetOutFile::write_dictionary_page.

@gaborcsardi
Member

Can you try to create a reproducible example? Or share the data frame that triggers this? Thanks!

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 13, 2025

The thing is, it only fails for me in Docker.

A minimal example in Docker:

bad_df = readRDS("./bad_df.rds")
library(nanoparquet)
nanoparquet::write_parquet(bad_df, "./foo.parq")
#Error in nanoparquet::write_parquet(bad_df, "./foo.parq") :
#  Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:349

But it works for me on my laptop.
The line number is different because in Docker I have nanoparquet 0.3.1 by default, but I also tried 0.4, as you can see from the original message.

bad_df.rds.zip

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 13, 2025

Update: reproduced on Linux outside Docker, on Ubuntu 22.04.

R

R version 4.2.2 Patched (2022-11-10 r83330) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(nanoparquet)
> bad_df = readRDS("./bad_df.rds")
> nanoparquet::write_parquet(bad_df, "./foo.parq")
Error in nanoparquet::write_parquet(bad_df, "./foo.parq") :
  Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:402

@TurnaevEvgeny
Author

I added a debug print:

+++ b/src/lib/ParquetOutFile.cpp
void ParquetOutFile::write_dictionary_(
   uint32_t end = file.tellp();
+  std::cout << "\n\n start = " << start << " end = " << end << "   size = " << size << "\n\n";
   if (end - start != size) {

and I see:

start = 0 end = 4294967295 size = 0

So I changed it to a signed type (int32_t end_signed = file.tellp();) and now I see:

start = 0 end = -1 size = 0

It seems tellp() returned an error. I'm not sure why size is zero.
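For reference, tellp() signals failure by returning pos_type(-1), and storing that in an unsigned 32-bit variable wraps it to 4294967295, which matches both debug outputs above. A minimal standalone illustration (not nanoparquet code):

#include <cstdint>
#include <iostream>

int main() {
    // An ostream constructed with a null streambuf has badbit set,
    // so tellp() signals failure by returning pos_type(-1).
    std::ostream broken(nullptr);
    std::cout << broken.tellp() << "\n";  // prints -1
    // The same value stored in an unsigned 32-bit variable wraps
    // around, exactly as in the first debug output.
    uint32_t end = static_cast<uint32_t>(broken.tellp());
    std::cout << end << "\n";             // prints 4294967295
    return 0;
}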

@gaborcsardi
Member

Thanks, I can indeed reproduce it with

docker run -ti -v `pwd`/bad_df.rds:`pwd`/bad_df.rds -w `pwd` ghcr.io/r-hub/r-minimal/r-minimal bash

and then

bad_df = readRDS("./bad_df.rds")
library(nanoparquet)
nanoparquet::write_parquet(bad_df, "./foo.parq")

Seems like a bug that happens on Linux when writing a page that only has missing values.

@gaborcsardi added the bug label Feb 14, 2025
@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 14, 2025

Thanks. Currently I think it is something in RParquetOutFile::get_size_dictionary that gets the dictionary size wrong.
I don't think it is only about data frames with missing values: I derived this bad_df from a much larger data frame with more columns and much more data that still fails. The common theme, though, is that if I add a character column as the first column, it works. I can also produce a larger data frame with a mix of NA and integer values, similar to bad_df, that fails.

Somehow Rf_xlength returns zero in RParquetOutFile::get_size_dictionary. I'm not familiar with the R API; I suspect this is the vector length.

parquet::Type::INT32
dictidx: 0x56754b8d63d0  Rf_xlength(dictidx): 0 sizeof(int): 4
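Update: Rf_xlength() does return the number of elements of an R vector (SEXP), so a value of 0 here means the dictionary-index vector itself is empty, which would be consistent with an all-NA column whose dictionary contains no distinct values. A hypothetical sketch of the suspected size computation (illustrative only, not nanoparquet's actual get_size_dictionary):

#include <Rinternals.h>

// Hypothetical sketch only: Rf_xlength() returns the element count
// of an R vector, so an empty dictionary vector gives n = 0 and a
// computed byte size of 0, matching the debug output above.
extern "C" SEXP dict_byte_size(SEXP dictidx) {
    R_xlen_t n = Rf_xlength(dictidx);        // 0 for bad_df
    double size = (double) n * sizeof(int);  // 0 bytes
    return Rf_ScalarReal(size);
}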

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 14, 2025

I compared the dump from a good data frame with the bad one.
In the good one, the calls look like this:

ParquetOutFile::write_column idx 0
 ParquetOutFile::write_data_pages
ParquetOutFile::calculate_column_data_size
 ParquetOutFile::write_data_page
ParquetOutFile::calculate_column_data_size
 ParquetOutFile::write_data_
 ParquetOutFile::compress

For the bad data frame, it's:

ParquetOutFile::write_column idx 0
ParquetOutFile::write_dictionary_page
RParquetOutFile::get_size_dictionary called
INTSXP
idx = 0 from: 0 until: 3 sel: SchemaElement(type=INT32, type_length=<null>, repetition_type=OPTIONAL, name=address_id, num_children=<null>, converted_type=INT_32, scale=<null>, precision=<null>, field_id=<null>, logicalType=LogicalType(STRING=<null>, MAP=<null>, LIST=<null>, ENUM=<null>, DECIMAL=<null>, DATE=<null>, TIME=<null>, TIMESTAMP=<null>, INTEGER=IntType(bitWidth= , isSigned=1), UNKNOWN=<null>, JSON=<null>, BSON=<null>, UUID=<null>, FLOAT16=<null>))

0x5a21975585e0
RParquetOutFile::create_dictionary idx 0 from: 0 until = 3
parquet::Type::INT32
dictidx: 0x5a219756e3d0  Rf_xlength(dictidx): 0 sizeof(int): 4
RParquetOutFile::create_dictionary idx 0 from: 0 until = 3
ParquetOutFile::write_dictionary_


 start = 0 end = -1   size = 0

Error in nanoparquet::write_parquet(bad_df, "./foo.parq") :
  Wrong number of bytes written for parquet dictionary @lib/ParquetOutFile.cpp:418

So the difference is that this section is executed for bad_df:

  if (encodings[idx] == Encoding::RLE_DICTIONARY) {
    uint32_t dictionary_page_offset = pfile.tellp();
    write_dictionary_page(idx, from, until);
    cmd->__set_dictionary_page_offset(dictionary_page_offset);
  }

Maybe for bad_df the address_id column encoding is set to RLE_DICTIONARY, and then later either RParquetOutFile::create_dictionary or ParquetOutFile::write_dictionary_ can't figure out the length correctly?

@TurnaevEvgeny
Author

TurnaevEvgeny commented Feb 14, 2025

Hmm, not sure if this is related, but now that I upgraded to nanoparquet 0.4.1 on my Mac, I can't read files generated by AWS Athena.
I get this error:

Error in nanoparquet::read_parquet(f_path) : 
  Found dictionary page instead of data page

@gaborcsardi
Member

I can't read files generated by AWS Athena.

That's probably a different thing, can you please open an issue with an example file? Thanks.

gaborcsardi added a commit that referenced this issue Feb 16, 2025
Seems like it is not allowed to have an empty buffer
without an underlying container. So now I initialize
the buffer with a small static container.

For #122.
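The commit message suggests the failure mode: a stream buffer set up over an empty container has no valid put area, so tellp() on the wrapping stream cannot report a position. A hedged standalone reconstruction (the buffer class below is illustrative, not nanoparquet's actual one):

#include <iostream>
#include <streambuf>
#include <vector>

// Illustrative memory-backed output buffer, loosely modelled on the
// failure mode described above (not nanoparquet's actual class).
struct vec_buf : std::streambuf {
    explicit vec_buf(std::vector<char>& v) {
        // For an empty vector, data() may be a null pointer, which
        // leaves the buffer without a valid put area.
        setp(v.data(), v.data() + v.size());
    }
    // tellp() reaches this via pubseekoff(0, cur, out); with no
    // underlying storage there is no position to report.
    pos_type seekoff(off_type off, std::ios_base::seekdir,
                     std::ios_base::openmode) override {
        if (pbase() == nullptr) return pos_type(off_type(-1));
        return pos_type(pptr() - pbase() + off);
    }
};

int main() {
    std::vector<char> empty;            // e.g. a zero-length dictionary
    vec_buf bad(empty);
    std::ostream out1(&bad);
    std::cout << out1.tellp() << "\n";  // prints -1, as in the debug logs

    std::vector<char> small(64);        // the fix: start from a small,
    vec_buf good(small);                // preallocated container
    std::ostream out2(&good);
    std::cout << out2.tellp() << "\n";  // prints 0
    return 0;
}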
@gaborcsardi
Member

I am fairly sure that this is now fixed in main. Can you please try?

@TurnaevEvgeny
Author

It works, thanks!

@al-obrien

Given this appears to break some common operations, is another patch for CRAN planned in the coming weeks?

@gaborcsardi
Member

@al-obrien Do you see this as well? My guess was that it should be rare, given that only one user reported it over several months, and that you'd have to have a zero-length dictionary page (e.g. from an all-NA column) as the first page to write.

But I can definitely do a quick release, let me fix #132 first.

@al-obrien

That's correct, I did see this error come up.

gaborcsardi added a commit that referenced this issue Feb 22, 2025
For #122 and #132 fixes.

[ci skip]
@gaborcsardi
Member

Release happening here: #135.
