Parquet Modular Encryption #3766

Closed
shicholas opened this issue Jun 22, 2022 · 5 comments
Comments

@shicholas

Describe your feature request

When turning a Polars DataFrame into a Parquet file, I'd like to be able to pass in some options for modular encryption (https://github.com/apache/parquet-format/blob/master/Encryption.md).

Similarly, when turning a Parquet file into a Polars DataFrame, I'd like to be able to decrypt the Parquet and create the Polars DataFrame in the same function.

This is a good article explaining the benefits and ease-of-use of modular encryption.

@jorgecarleitao
Collaborator

Interesting request! I went through the document, and it seems to be a software development exercise in parquet2 and arrow2 (when not using pyarrow).

Does pyarrow support this already?

@shicholas
Author

Thanks for your interest here! It looks like PyArrow documents the Parquet modular encryption format: https://arrow.apache.org/docs/dev/python/generated/pyarrow.parquet.encryption.EncryptionConfiguration.html
Note that this configuration lets you specify encryption keys for individual columns.

Does your comment mean we'd need to first get arrow2 to support this encryption? If you like the idea, how should I proceed with helping? Thanks!

@jorgecarleitao
Collaborator

Yep. I am not familiar with it, so take this with a grain of salt. I think we need to support that in parquet2; arrow2 only cares about how pages are deserialized to arrow2. I imagine that parquet2 would provide a fallible streaming iterator adapter to decrypt the pages (prior to decompressing them) and would decrypt the metadata prior to passing it to thrift, so that arrow2 and other consumers do not need to worry about encryption.

> If you like the idea, how should I proceed with helping? Thanks!

I definitely love the idea :)

I created jorgecarleitao/parquet2#154 and jorgecarleitao/parquet2#155 about this there.
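To make the ordering described above concrete (decrypt each page first, then decompress it), here is a rough Python sketch of such an adapter. It is not parquet2 code: `toy_cipher` is a placeholder XOR routine standing in for the AES-based ciphers that the Parquet encryption format specifies, and every function name here is hypothetical.

```python
import zlib
from typing import Callable, Iterable, Iterator


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR 'cipher' (NOT secure, NOT AES); it is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def decrypt_pages(
    pages: Iterable[bytes], decrypt: Callable[[bytes], bytes]
) -> Iterator[bytes]:
    """Lazily decrypt pages; an error from `decrypt` surfaces when that page is pulled."""
    for page in pages:
        yield decrypt(page)


def decompress_pages(pages: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress pages that have already been decrypted."""
    for page in pages:
        yield zlib.decompress(page)


key = b"0123456789112345"
raw_pages = [b"page-0" * 10, b"page-1" * 10]

# Writer side: compress each page first, then encrypt the compressed bytes.
stored = [toy_cipher(zlib.compress(page), key) for page in raw_pages]

# Reader side: decrypt, then decompress; downstream consumers only see plain pages.
decoded = list(
    decompress_pages(decrypt_pages(stored, lambda page: toy_cipher(page, key)))
)
```

Because the adapter sits in front of decompression, the consumer of `decoded` never has to know that the stored pages were encrypted.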

@alexander-beedie
Collaborator

Deferring to upstream issue: jorgecarleitao/parquet2#154 👍

@dridk

dridk commented Apr 18, 2024

DuckDB 0.10 added Parquet encryption support. Combined with Polars (pola.rs), it makes encrypting a Parquet file really easy:

```python
import duckdb
import polars as pl

df = pl.DataFrame({"name": range(10)})

# Register a 128-bit (16-byte) key under the name 'key128'
duckdb.sql("PRAGMA add_parquet_key('key128', '0123456789112345')")

# Encrypt: DuckDB picks up the local Polars DataFrame `df` by name
duckdb.sql("COPY df TO 'test.parquet' (ENCRYPTION_CONFIG {footer_key: 'key128'})")

# Decrypt: read the file back into a Polars DataFrame
df = duckdb.sql(
    "SELECT * FROM read_parquet('test.parquet', encryption_config = {footer_key: 'key128'})"
).pl()
```
