title |
---|
Overview |
Today, a vast amount of data is stored in
- File formats like Parquet or CSV
- Data lakes like S3 or GCS
- Table formats like Delta Lake or Iceberg
ParadeDB's integrations that make it easy to ingest this data without data processing engines or ETL tools, which can be complex and error-prone.
In this example, we will query and copy a Parquet file stored in S3 to Postgres. The Parquet file contains 3 million NYC taxi trips from January 2024, hosted in a public S3 bucket provided by ParadeDB.
To begin, let's create a Postgres foreign data wrapper, which is how ParadeDB connects to S3.
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;
CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
Next, let's query the foreign table trips
. You'll notice that the column names and types of this table are automatically
inferred from the Parquet file.
SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
Queries over this table are powered by DuckDB, an in-process analytical query engine. This means that you can run fast analytical queries over data lakes from ParadeDB.
SELECT COUNT(*) FROM trips;
Finally, let's copy this table into a Postgres heap table. For demonstration, we will copy over the first 100 rows.
CREATE TABLE trips_copy AS SELECT * FROM trips LIMIT 100;
That's it! Please refer to the other sections for instructions on how to ingest from other file and table formats and object stores.