42.2.0
Dandandan committed Feb 11, 2025
1 parent a43ce8b commit b6a6998
Showing 628 changed files with 33,706 additions and 41,685 deletions.
12 changes: 7 additions & 5 deletions .github/actions/setup-builder/action.yaml
Original file line number Diff line number Diff line change
@@ -28,16 +28,18 @@ runs:
- name: Install Build Dependencies
shell: bash
run: |
apt-get update
apt-get install -y protobuf-compiler
RETRY="ci/scripts/retry"
"${RETRY}" apt-get update
"${RETRY}" apt-get install -y protobuf-compiler
- name: Setup Rust toolchain
shell: bash
# rustfmt is needed for the substrait build script
run: |
RETRY="ci/scripts/retry"
echo "Installing ${{ inputs.rust-version }}"
rustup toolchain install ${{ inputs.rust-version }}
rustup default ${{ inputs.rust-version }}
rustup component add rustfmt
"${RETRY}" rustup toolchain install ${{ inputs.rust-version }}
"${RETRY}" rustup default ${{ inputs.rust-version }}
"${RETRY}" rustup component add rustfmt
- name: Configure rust runtime env
uses: ./.github/actions/setup-rust-runtime
- name: Fixup git permissions
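
The hunk above routes each network-bound command through `ci/scripts/retry`, whose contents are not shown in this diff. A minimal sketch of what such a wrapper might look like, assuming a fixed attempt count and a fixed delay; the function name `retry` and the `MAX_ATTEMPTS` and `RETRY_DELAY` variables are hypothetical and are not taken from the repository:

```bash
# Hypothetical sketch of a retry wrapper; NOT the actual ci/scripts/retry.
# MAX_ATTEMPTS and RETRY_DELAY are names invented for this illustration.
retry() {
  local max_attempts="${MAX_ATTEMPTS:-3}"
  local delay="${RETRY_DELAY:-1}"
  local attempt=1
  while true; do
    # Run the wrapped command with its arguments passed through unchanged.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used, so the CI step still fails.
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, sleeping ${delay}s" >&2
    sleep "${delay}"
    attempt=$((attempt + 1))
  done
}
```

With a wrapper of this shape, `retry apt-get update` would rerun the command after a transient failure and still report failure if every attempt fails.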
11 changes: 8 additions & 3 deletions .github/workflows/rust.yml
@@ -521,7 +521,7 @@ jobs:
run: taplo format --check

config-docs-check:
name: check configs.md is up-to-date
name: check configs.md and ***_functions.md is up-to-date
needs: [ linux-build-lib ]
runs-on: ubuntu-latest
container:
@@ -542,6 +542,11 @@ jobs:
# If you encounter an error, run './dev/update_config_docs.sh' and commit
./dev/update_config_docs.sh
git diff --exit-code
- name: Check if any of the ***_functions.md has been modified
run: |
# If you encounter an error, run './dev/update_function_docs.sh' and commit
./dev/update_function_docs.sh
git diff --exit-code
# Verify MSRV for the crates which are directly used by other projects:
# - datafusion
@@ -569,9 +574,9 @@ jobs:
#
# To reproduce:
# 1. Install the version of Rust that is failing. Example:
# rustup install 1.76.0
# rustup install 1.79.0
# 2. Run the command that failed with that version. Example:
# cargo +1.76.0 check -p datafusion
# cargo +1.79.0 check -p datafusion
#
# To resolve, either:
# 1. Change your code to use older Rust features,
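
The `config-docs-check` steps in this file follow a "regenerate, then fail on any difference" pattern: run the update script, then let `git diff --exit-code` fail the job if a committed file changed. A rough sketch of the same idea outside a git checkout, using plain `diff` as a stand-in for `git diff --exit-code`; the helper name `check_generated` and its arguments are hypothetical:

```bash
# Hypothetical helper; the name and arguments are invented for illustration.
# Usage: check_generated <committed-file> <generator-command> [args...]
check_generated() {
  local committed="$1"
  shift
  local fresh
  fresh="$(mktemp)"
  # Regenerate the document into a scratch file instead of in place.
  "$@" > "${fresh}"
  local status=0
  # diff exits non-zero when the files differ, much like `git diff --exit-code`.
  if ! diff -u "${committed}" "${fresh}"; then
    echo "${committed} is stale; rerun the generator and commit the result" >&2
    status=1
  fi
  rm -f "${fresh}"
  return "${status}"
}
```

The CI version gets the same effect more cheaply because the update scripts overwrite the tracked files directly and git already knows the committed contents.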
34 changes: 17 additions & 17 deletions Cargo.toml
@@ -31,6 +31,7 @@ members = [
"datafusion/functions-aggregate-common",
"datafusion/functions-nested",
"datafusion/functions-window",
"datafusion/functions-window-common",
"datafusion/optimizer",
"datafusion/physical-expr",
"datafusion/physical-expr-common",
@@ -69,26 +70,25 @@ version = "42.2.0"
ahash = { version = "0.8", default-features = false, features = [
"runtime-rng",
] }
arrow = { version = "53.1.0", features = [
arrow = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", features = [
"prettyprint",
] }
arrow-array = { version = "53.1.0", default-features = false, features = [
arrow-array = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false, features = [
"chrono-tz",
] }
arrow-buffer = { version = "53.1.0", default-features = false }
arrow-flight = { version = "53.1.0", features = [
"flight-sql-experimental",
] }
arrow-ipc = { version = "53.1.0", default-features = false, features = [
arrow-buffer = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false }
arrow-flight = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", features = ["flight-sql-experimental"]}

arrow-ipc = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false, features = [
"lz4",
] }
arrow-ord = { version = "53.1.0", default-features = false }
arrow-schema = { version = "53.1.0", default-features = false }
arrow-string = { version = "53.1.0", default-features = false }
arrow-ord = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false }
arrow-schema = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false }
arrow-string = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false }
async-trait = "0.1.73"
bigdecimal = "=0.4.1"
bytes = "1.4"
chrono = { version = "0.4.34", default-features = false }
chrono = { version = "0.4.38", default-features = false }
ctor = "0.2.0"
dashmap = "6.0.1"
datafusion = { path = "datafusion/core", version = "42.2.0", default-features = false }
@@ -103,6 +103,7 @@ datafusion-functions-aggregate = { path = "datafusion/functions-aggregate", vers
datafusion-functions-aggregate-common = { path = "datafusion/functions-aggregate-common", version = "42.2.0" }
datafusion-functions-nested = { path = "datafusion/functions-nested", version = "42.2.0" }
datafusion-functions-window = { path = "datafusion/functions-window", version = "42.2.0" }
datafusion-functions-window-common = { path = "datafusion/functions-window-common", version = "42.2.0" }
datafusion-optimizer = { path = "datafusion/optimizer", version = "42.2.0", default-features = false }
datafusion-physical-expr = { path = "datafusion/physical-expr", version = "42.2.0", default-features = false }
datafusion-physical-expr-common = { path = "datafusion/physical-expr-common", version = "42.2.0", default-features = false }
@@ -124,20 +125,20 @@ log = "^0.4"
num_cpus = "1.13.0"
object_store = { version = "0.11.0", default-features = false }
parking_lot = "0.12"
parquet = { version = "53.1.0", default-features = false, features = [
parquet = { git = "https://github.com/Coralogix/arrow-rs.git", tag = "v53.3.0-tonic-downgrade", default-features = false, features = [
"arrow",
"async",
"object_store",
] }
pbjson = { version = "0.7.0" }
# Should match arrow-flight's version of prost.
prost = "0.13.1"
prost-derive = "0.13.1"
prost = "0.12"
prost-derive = "0.12"
rand = "0.8"
regex = "1.8"
rstest = "0.22.0"
rstest = "0.23.0"
serde_json = "1"
sqlparser = { version = "0.50.0", features = ["visitor"] }
sqlparser = { version = "0.51.0", features = ["visitor"] }
tempfile = "3"
thiserror = "1.0.44"
tokio = { version = "1.36", features = ["macros", "rt", "sync"] }
@@ -167,4 +168,3 @@ large_futures = "warn"

[workspace.lints.rust]
unexpected_cfgs = { level = "warn", check-cfg = ["cfg(tarpaulin)"] }
unused_imports = "deny"
28 changes: 21 additions & 7 deletions README.md
@@ -42,14 +42,23 @@
</a>

DataFusion is an extensible query engine written in [Rust] that
uses [Apache Arrow] as its in-memory format. DataFusion's target users are
developers building fast and feature rich database and analytic systems,
customized to particular workloads. See [use cases] for examples.
uses [Apache Arrow] as its in-memory format.

"Out of the box," DataFusion offers [SQL] and [`Dataframe`] APIs,
excellent [performance], built-in support for CSV, Parquet, JSON, and Avro,
extensive customization, and a great community.
[Python Bindings] are also available.
This crate provides libraries and binaries for developers building fast and
feature rich database and analytic systems, customized to particular workloads.
See [use cases] for examples. The following related subprojects target end users:

- [DataFusion Python](https://github.com/apache/datafusion-python/) offers a Python interface for SQL and DataFrame
queries.
- [DataFusion Ray](https://github.com/apache/datafusion-ray/) provides a distributed version of DataFusion that scales
out on Ray clusters.
- [DataFusion Comet](https://github.com/apache/datafusion-comet/) is an accelerator for Apache Spark based on
DataFusion.

"Out of the box,"
DataFusion offers [SQL] and [`Dataframe`] APIs, excellent [performance],
built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and
a great community.

DataFusion features a full query planner, a columnar, streaming, multi-threaded,
vectorized execution engine, and partitioned data sources. You can
@@ -125,3 +134,8 @@ For example, given the releases `1.78.0`, `1.79.0`, `1.80.0`, `1.80.1` and `1.81
If a hotfix is released for the minimum supported Rust version (MSRV), the MSRV will be the minor version with all hotfixes, even if it surpasses the four-month window.

We enforce this policy using a [MSRV CI Check](https://github.com/search?q=repo%3Aapache%2Fdatafusion+rust-version+language%3ATOML+path%3A%2F%5ECargo.toml%2F&type=code)

## DataFusion API evolution policy

Public methods in Apache DataFusion evolve over time as part of the API lifecycle.
Deprecated methods will be phased out in accordance with the [policy](https://datafusion.apache.org/library-user-guide/api-health.html), ensuring the API is stable and healthy.
38 changes: 38 additions & 0 deletions benchmarks/README.md
@@ -330,6 +330,16 @@ steps.
The tests sort the entire dataset using several different sort
orders.

## IMDB

Run the Join Order Benchmark (JOB) on the IMDB dataset.

The Internet Movie Database (IMDB) dataset contains real-world movie data. Unlike synthetic datasets like TPCH, which assume uniform data distribution and uncorrelated columns, the IMDB dataset includes skewed data and correlated columns (which are common in real datasets), making it more suitable for testing query optimizers, particularly for cardinality estimation.

This benchmark is derived from [Join Order Benchmark](https://github.com/gregrahn/join-order-benchmark).

See paper [How Good Are Query Optimizers, Really](http://www.vldb.org/pvldb/vol9/p204-leis.pdf) for more details.

## TPCH

Run the tpch benchmark.
@@ -342,6 +352,34 @@ This benchmarks is derived from the [TPC-H][1] version
[2]: https://github.com/databricks/tpch-dbgen.git,
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

## External Aggregation

Run the benchmark for aggregations with limited memory.

When the memory limit is exceeded, the aggregation intermediate results will be spilled to disk, and finally read back with sort-merge.

External aggregation benchmarks run several aggregation queries with different memory limits on the TPCH `lineitem` table. Queries can be found in [`external_aggr.rs`](src/bin/external_aggr.rs).

This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.
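
To make the spill-and-merge idea above concrete, here is a toy sketch in shell. It is not how DataFusion implements external aggregation: a row count per chunk stands in for the memory limit, temporary files stand in for spill files, and the function name `external_sum` is hypothetical. Input lines are assumed to be `<key> <integer>` pairs.

```bash
# Toy external aggregation (sum per key); an illustration only.
# Usage: external_sum <max-rows-per-chunk> < input
external_sum() {
  local max_rows="$1"
  local spill_dir
  spill_dir="$(mktemp -d)"
  # Cut stdin into chunks that each fit the pretend memory budget.
  split -l "${max_rows}" - "${spill_dir}/chunk."
  local chunk
  for chunk in "${spill_dir}"/chunk.*; do
    [ -e "${chunk}" ] || continue  # empty input produces no chunks
    # Aggregate one chunk in memory, then "spill" it sorted by key.
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' "${chunk}" \
      | LC_ALL=C sort -k1,1 > "${chunk}.sorted"
    rm -f "${chunk}"
  done
  if ls "${spill_dir}"/*.sorted >/dev/null 2>&1; then
    # Sort-merge the spilled runs; equal keys are now adjacent, so the
    # final pass only needs to remember one running total at a time.
    LC_ALL=C sort -m -k1,1 "${spill_dir}"/*.sorted \
      | awk 'NR > 1 && $1 != key { print key, total; total = 0 }
             { key = $1; total += $2 }
             END { if (NR > 0) print key, total }'
  fi
  rm -rf "${spill_dir}"
}
```

For example, `printf 'b 2\na 1\nb 3\nc 4\na 5\n' | external_sum 2` splits the five rows into three chunks and prints `a 6`, `b 5`, `c 4`. A smaller chunk size forces more spilled runs, which is the effect the benchmark's lower memory limits are meant to exercise.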

### External Aggregation Example Runs
1. Run all queries with predefined memory limits:
```bash
# Under 'benchmarks/' directory
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
```

2. Run a query with a specific memory limit:
```bash
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
```

3. Run all queries with `bench.sh` script:
```bash
./bench.sh data external_aggr
./bench.sh run external_aggr
```


# Older Benchmarks
