Script and documentation for regenerating sqlite test files #14290

Merged · 6 commits · Feb 6, 2025
Changes from 2 commits
7 changes: 7 additions & 0 deletions datafusion/sqllogictest/README.md
@@ -250,6 +250,13 @@ database engine. The output is a full script that is a copy of the prototype scr

You can update the tests / generate expected output by passing the `--complete` argument.

To regenerate and complete the sqlite test suite's files in `datafusion-testing/data/sqlite/`, please refer to the
`./regenerate_sqlite_files.sh` script.

_WARNING_: The `regenerate_sqlite_files.sh` script is experimental. Read and understand it before running it, and run it with an abundance of caution.
When run, the script clones a remote repository locally, replaces the location of a dependency with a custom git
version, replaces an existing .rs file with one from a GitHub gist, and runs various commands locally.
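
Before running the script, it can help to confirm that the external tools it calls are available. The script itself only checks for `sd`; the sketch below also checks the other commands that appear in it. The helper name `missing_commands` is hypothetical, and the tool list is read off the script rather than taken from an official requirements list.

```shell
# Hypothetical helper: print each command from the argument list that is not on PATH.
missing_commands() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
    fi
  done
}

# Tools invoked by regenerate_sqlite_files.sh (list derived from reading the script).
missing=$(missing_commands sd git curl dos2unix rename sed cargo)
if [ -n "$missing" ]; then
  echo "Missing commands:" $missing
else
  echo "All required commands found"
fi
```

The check only reports what is missing and does not exit, so it can be run on its own before deciding whether to proceed.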

```shell
# Update ddl.slt with output from running
cargo test --test sqllogictests -- ddl --complete
```
179 changes: 179 additions & 0 deletions datafusion/sqllogictest/regenerate_sqlite_files.sh
@@ -0,0 +1,179 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

echo "This script is experimental! Please read the following completely to understand
what this script does before running it.

This script is designed to regenerate the .slt files in datafusion-testing/data/sqlite/
from source files obtained from a git repository. To do that the following steps are
performed:

- Verify required commands are installed
- Clone the remote git repository into /tmp/sqlitetesting
- Delete all existing .slt files in datafusion-testing/data/sqlite/ folder
- Copy the .test files from the cloned git repo into datafusion-testing
- Remove a few directories and files from the copied files (not relevant to DataFusion)
- Rename the .test files to .slt and cleanse the files. Cleansing involves:
  - running dos2unix
  - removing all references to mysql, mssql and postgresql
  - adding a new 'control resultmode valuewise' statement at the top of all files
  - changing 'AS REAL' to 'AS FLOAT8'
- Replace the sqllogictest-rs dependency in the Cargo.toml with a version from
  a git repository that has been custom written to properly complete the files
  by comparing datafusion results with postgresql
- Replace the sqllogictests.rs file with a customized version from a github gist
  that will work with the customized sqllogictest-rs dependency
- Run the sqlite test with completion (takes > 1 hr)
- Update a few results to ignore known issues
- Run sqlite test to verify results
- Perform cleanup to revert changes to the Cargo.toml file/sqllogictest.rs file
- Delete backup files and the /tmp/sqlitetesting directory
"
read -r -p "Do you understand and accept the risk? (yes/no): " acknowledgement

if [ "${acknowledgement,,}" != "yes" ]; then
  exit 0
else
  echo "Ok, Proceeding..."
fi

if [ ! -x "$(command -v sd)" ]; then
  echo "This script requires 'sd' to be installed. Install via 'cargo install sd' or using your local package manager"
  exit 1
else
  echo "sd command is installed, proceeding..."
fi

# resolves to the repository root when the script is run from datafusion/sqllogictest/
DF_HOME=$(realpath "../../")

# location where we'll clone sql-logic-test repos
if [ ! -d "/tmp/sqlitetesting" ]; then
  mkdir /tmp/sqlitetesting
fi

if [ ! -d "/tmp/sqlitetesting/sql-logic-test" ]; then
  echo "Cloning sql-logic-test into /tmp/sqlitetesting/"
  cd /tmp/sqlitetesting/ || exit;
  git clone https://github.com/hydromatic/sql-logic-test.git
fi

echo "Removing all existing .slt files from datafusion-testing/data/sqlite/ directory"

cd "$DF_HOME/datafusion-testing/data/sqlite/" || exit;
find ./ -type f -name "*.slt" -exec rm {} \;

echo "Copying .test files from sql-logic-test to datafusion-testing/data/sqlite/"

cp -r /tmp/sqlitetesting/sql-logic-test/src/main/resources/test/* ./

echo "Removing 'evidence/*' and 'index/delete/*' directories from datafusion-testing"

find ./evidence/ -type f -name "*.test" -exec rm {} \;
rm -rf ./index/delete/1
rm -rf ./index/delete/10
rm -rf ./index/delete/100
rm -rf ./index/delete/1000
rm -rf ./index/delete/10000
# this file is empty and causes the sqllogictest-rs code to fail
rm ./index/view/10/slt_good_0.test

echo "Renaming .test files to .slt and cleansing the files ..."

# add hash-threshold lines into these 3 files as they were missing
sed -i '1s/^/hash-threshold 8\n\n/' ./select1.test
sed -i '1s/^/hash-threshold 8\n\n/' ./select4.test
sed -i '1s/^/hash-threshold 8\n\n/' ./select5.test
# rename
find ./ -type f -name "*.test" -exec rename -f 's/\.test/\.slt/' {} \;
# gotta love windows :/
find ./ -type f -name "*.slt" -exec dos2unix --quiet {} \;
# add in control resultmode
find ./ -type f -name "*.slt" -exec sd -f i 'hash-threshold 8\n' 'hash-threshold 8\ncontrol resultmode valuewise\n' {} \;
# remove mysql tests and skipif lines
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'skipif mysql.+\n' '' {} \;
# remove postgres skipif
find ./ -type f -name "*.slt" -exec sd -f i 'skipif postgresql(.+)\n' '' {} \;
# remove mssql tests
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'skipif mssql # not compatible\n' '' {} \;
# change REAL datatype to FLOAT8
find ./ -type f -name "*.slt" -exec sd -f i 'AS REAL' 'AS FLOAT8' {} \;

echo "Updating the datafusion/sqllogictest/Cargo.toml file with an updated sqllogictest dependency"

# update the sqllogictest Cargo.toml with the new repo for sqllogictest-rs (tied to a specific hash)
cd "$DF_HOME" || exit;
sed -i -e 's~^sqllogictest.*~sqllogictest = { git = "https://github.com/Omega359/sqllogictest-rs.git", rev = "1cd933d" }~' datafusion/sqllogictest/Cargo.toml

Contributor:
what is this needed for? Some sort of update to the runner? I wonder if you can publish it to crates.io or something (sqllogictests-omega359 🤔 ) Not sure if that would be better

Contributor Author:
That dependency is hacked to all hell to do the completion the way I wanted by using a comparison db (pg) to compare results to if the df results don't match the sqlite results. I sure don't want that thing published anywhere :)
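
To see what the `sed` substitution above does without touching the real manifest, it can be run against a scratch file. This is a minimal sketch: the sample `Cargo.toml` contents and version numbers are made up for illustration, and the substitution prints to stdout instead of editing in place.

```shell
# Scratch manifest with made-up contents (not the real datafusion/sqllogictest/Cargo.toml).
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
[dependencies]
sqllogictest = "0.0.0"
tokio = "1"
EOF

# Same substitution as the script, minus -i, so the result goes to stdout.
pinned=$(sed -e 's~^sqllogictest.*~sqllogictest = { git = "https://github.com/Omega359/sqllogictest-rs.git", rev = "1cd933d" }~' "$scratch")
printf '%s\n' "$pinned"
rm -f "$scratch"
```

Because the pattern matches any line beginning with `sqllogictest`, it would rewrite every such line, so it relies on the manifest containing only one.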


echo "Replacing the datafusion/sqllogictest/bin/sqllogictests.rs file with a custom version required for running completion"

# replace sqllogictests.rs with a customized version. This is tied to a specific version of the gist
curl --silent https://gist.githubusercontent.com/Omega359/5e6548a096fbe0c36ce14c547776db56/raw/3ccd0ee8049657c496d5068f56baa8408c1f00a3/sqllogictest.rs > datafusion/sqllogictest/bin/sqllogictests.rs

echo "Running the sqllogictests with sqlite completion. This will take approximately an hour to run"

cargo test --profile release-nonlto --features postgres --test sqllogictests -- --include-sqlite --postgres-runner --complete

Contributor:
I wonder if this needs to also use the --postgres-runner command?

Contributor Author:
I think it does so that the pg db is started. I'll take a look at my custom version of the sqllogictests files and see. I could just require that PG is started and available via PG_URI or something.
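
The alternative floated above, requiring an already-running PostgreSQL instead of starting one, could be guarded with a check like the following. This is only a sketch: `PG_URI` is the variable name suggested in the comment, the function name `require_pg_uri` and the connection string are hypothetical, and nothing here confirms that the customized runner reads that variable.

```shell
# Hypothetical guard: succeed only if PG_URI is set to a non-empty value.
require_pg_uri() {
  if [ -z "${PG_URI:-}" ]; then
    echo "PG_URI is not set; start PostgreSQL and export PG_URI first" >&2
    return 1
  fi
  echo "Using PostgreSQL at $PG_URI"
}

# Example with a made-up connection string.
PG_URI="postgresql://postgres@127.0.0.1/postgres"
require_pg_uri
```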


if [ $? -eq 0 ]; then
  echo "Applying patches for #13784 (https://github.com/apache/datafusion/issues/13784)"

  sd -f i 'query I rowsort label-2475\n' '# Datafusion - #13784 - https://github.com/apache/datafusion/issues/13784\nskipif Datafusion\nquery I rowsort label-2475\n' datafusion-testing/data/sqlite/random/aggregates/slt_good_102.slt
  sd -f i 'query I rowsort label-3738\n' '# Datafusion - #13784 - https://github.com/apache/datafusion/issues/13784\nskipif Datafusion\nquery I rowsort label-3738\n' datafusion-testing/data/sqlite/random/aggregates/slt_good_112.slt

  echo "Running the sqllogictests with sqlite files. This will take approximately 20 minutes to run"

  cargo test --profile release-nonlto --test sqllogictests -- --include-sqlite

  if [ $? -eq 0 ]; then
    echo "Sqlite tests completed successfully. The datafusion-testing git submodule is ready to be pushed to a new remote and a PR created in https://github.com/apache/datafusion-testing/"
  else
    echo "Verification of the sqlite test files failed. Please correct the issues in the .slt files and run the test again using the command 'cargo test --profile release-nonlto --test sqllogictests -- --include-sqlite'"
  fi
else
  echo "Completion of sqlite test files failed!"
fi

echo "Cleaning up source code changes and temporary files and directories"

cd "$DF_HOME" || exit;
find ./datafusion-testing/data/sqlite/ -type f -name "*.bak" -exec rm {} \;
git checkout datafusion/sqllogictest/Cargo.toml
git checkout datafusion/sqllogictest/bin/sqllogictests.rs
rm -rf /tmp/sqlitetesting
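
One caveat of running cleanup as the final step is that it is skipped if the script is interrupted during the long test run, which would leave the modified Cargo.toml and sqllogictests.rs in place. A `trap` on `EXIT` is a common way to make cleanup run regardless. The sketch below demonstrates the pattern on a temporary directory; the `cleanup` function and the scratch files are illustrative and not part of this PR.

```shell
# Illustrative only: a scratch directory stands in for the files the script modifies.
workdir=$(mktemp -d)
echo "original" > "$workdir/Cargo.toml"

status=0
# Run the risky part in a subshell so its EXIT trap fires when the subshell ends.
(
  cp "$workdir/Cargo.toml" "$workdir/Cargo.toml.orig"
  cleanup() {
    # Restore the saved copy no matter how the subshell exits.
    mv "$workdir/Cargo.toml.orig" "$workdir/Cargo.toml"
  }
  trap cleanup EXIT
  # Turn Ctrl-C or a kill into a normal exit so the EXIT trap still runs.
  trap 'exit 130' INT TERM

  echo "modified" > "$workdir/Cargo.toml"
  exit 1   # stands in for a failing or interrupted step
) || status=$?

restored=$(cat "$workdir/Cargo.toml")
echo "exit status: $status, contents: $restored"
rm -rf "$workdir"
```

In the script itself, the body of `cleanup` would presumably be the existing `git checkout` and `rm -rf` commands, registered before the first file is modified.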
2 changes: 2 additions & 0 deletions docs/source/contributor-guide/testing.md
@@ -56,6 +56,8 @@ DataFusion's SQL implementation is tested using [sqllogictest](https://github.co

Like similar systems such as [DuckDB](https://duckdb.org/dev/testing), DataFusion has chosen to trade off a slightly higher barrier to contribution for longer term maintainability.

DataFusion has integrated [sqlite's test suite](https://sqlite.org/sqllogictest/doc/trunk/about.wiki) as a supplemental test suite that is run whenever a PR is merged into DataFusion. To run it manually, please refer to the [README](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md#running-tests-sqlite) file for instructions.

## Rust Integration Tests

There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/datafusion/tree/main/datafusion/core/tests) directory.