Add CI (#1)
* Add CI
milesgranger authored Feb 25, 2024
1 parent 130a076 commit ced24c7
Showing 7 changed files with 252 additions and 1 deletion.
132 changes: 132 additions & 0 deletions .github/workflows/CI.yml
@@ -0,0 +1,132 @@
name: CI

on:
  pull_request:
  release:
    types:
      - released
      - prereleased

jobs:
  # macos:
  #   runs-on: macos-latest
  #   strategy:
  #     matrix:
  #       python-version: ['3.9']
  #   steps:
  #     - uses: actions/checkout@v3
  #       with:
  #         submodules: recursive
  #     - uses: actions/cache@v3
  #       with:
  #         path: |
  #           ~/.cargo/bin/
  #           ~/.cargo/registry/index/
  #           ~/.cargo/registry/cache/
  #           ~/.cargo/git/db/
  #           target/
  #         key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
  #     - uses: actions/setup-python@v4
  #       with:
  #         python-version: ${{ matrix.python-version }}
  #     - name: Install Rust toolchain
  #       uses: dtolnay/rust-toolchain@stable
  #       with:
  #         targets: aarch64-apple-darwin
  #     - name: Build
  #       run: cargo build --release
  #     - name: Tests
  #       run: cargo test --no-default-features --release

  # windows:
  #   runs-on: windows-latest
  #   strategy:
  #     matrix:
  #       python-version: ['3.9']
  #       target: [x64]
  #   steps:
  #     - uses: actions/checkout@v3
  #       with:
  #         submodules: recursive
  #     - uses: actions/cache@v3
  #       with:
  #         path: |
  #           ~/.cargo/bin/
  #           ~/.cargo/registry/index/
  #           ~/.cargo/registry/cache/
  #           ~/.cargo/git/db/
  #           target/
  #         key: ${{ runner.os }}-${{ matrix.target }}-cargo-${{ hashFiles('**/Cargo.lock') }}
  #     - uses: actions/setup-python@v4
  #       with:
  #         python-version: ${{ matrix.python-version }}
  #         architecture: ${{ matrix.target }}
  #     - name: Install Rust toolchain
  #       uses: dtolnay/rust-toolchain@stable
  #     - name: Build
  #       if: matrix.target == 'x64'
  #       run: cargo build --release
  #     - name: Tests
  #       if: matrix.target == 'x64'
  #       run: cargo test --no-default-features --release

  linux:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
        target: [x86_64]
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      - uses: actions/cache@v3
        with:
          path: |
            ~/.cargo/bin/
            ~/.cargo/registry/index/
            ~/.cargo/registry/cache/
            ~/.cargo/git/db/
            target/
          key: ${{ runner.os }}-${{ matrix.target }}-cargo-${{ hashFiles('**/Cargo.lock') }}
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
      - name: Build
        run: cargo build --release
      - name: Tests
        run: cargo test --release -- --test-threads 1  # TODO: dbgen not thread-safe
      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          target: ${{ matrix.target }}
          sccache: true
          args: -i python --release --out dist
      - name: Install wheel
        run: pip install .
      - name: Python Test
        run: python -c "import pytpch"  # TODO: proper python side tests
      - name: Upload wheels
        uses: actions/upload-artifact@v3
        with:
          name: wheels
          path: dist

  pypi-publish:
    name: Upload pytpch release to PyPI
    runs-on: ubuntu-latest
    if: "startsWith(github.ref, 'refs/tags/')"
    needs: [ linux ]
    environment:
      name: pypi
      url: https://pypi.org/p/pytpch
    permissions:
      id-token: write
    steps:
      - uses: actions/download-artifact@v3
        with:
          name: wheels
      - name: Publish package distributions to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          skip-existing: true
          packages-dir: ./
1 change: 1 addition & 0 deletions .gitmodules
@@ -1,3 +1,4 @@
[submodule "tpch-dbgen"]
path = src/tpch-dbgen
url = git@github.com:milesgranger/libdbgen.git
branch = master
4 changes: 4 additions & 0 deletions Cargo.toml
@@ -2,6 +2,10 @@
name = "pytpch"
version = "0.1.0"
edition = "2021"
authors = ["Miles Granger <miles59923@gmail.com>"]
license = "MIT"
description = "bindings to libdbgen / tpch-dbgen"
readme = "README.md"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[lib]
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Miles Granger

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
68 changes: 68 additions & 0 deletions README.md
@@ -0,0 +1,68 @@
Ergonomically create [TPC-H](https://www.tpc.org/tpch/) data thru Python as Arrow tables.

```python

import pytpch
import pyarrow as pa

# Generate TPC-H data at scale 1 (~1GB)
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1)

# Generate a single table at scale 1
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1, table=pytpch.Table.Nation)

# Generate a single chunk out of n chunks of a single table.
# This is wildly helpful when generating larger scale factors, as you can make
# subsets of the data and store them, or join them after some sort of parallelism.
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1, table=pytpch.Table.Nation, n_chunks=10, nth_step=0)


# NOTE! As mentioned in the docs for this function, it is NOT thread-safe.
# If you want to generate data in parallel, you must do so in other processes for now
# by using things like `multiprocessing` or `concurrent.futures.ProcessPoolExecutor`.
# This is a TODO: the original C code uses copious amounts of global and static function
# variables to maintain state. The refactoring in milesgranger/libdbgen resets that state
# between function calls, but the shared globals are still there, so it is not thread-safe.
#
# Example of generating data in parallel:
from concurrent.futures import ProcessPoolExecutor

n_chunks = 10  # 10 total chunks

def gen_step(step):
    return pytpch.dbgen(sf=10, n_chunks=n_chunks, nth_step=step)

with ProcessPoolExecutor() as executor:
    jobs: list[dict[str, pa.Table]] = list(executor.map(gen_step, range(n_chunks)))


# Default reference queries provided (1-22) as:
print(pytpch.QUERY_1)
```

---

### Tell me more...

Python bindings (thru Rust, b/c why not) to [libdbgen](https://github.com/milesgranger/libdbgen)
which is a fork of [databricks/tpch-dbgen](https://github.com/databricks/tpch-dbgen) for generating
[TPC-H data](https://www.tpc.org/tpch/).

tpch-dbgen was originally a CLI that generates CSV files of TPC-H data. I wanted to turn it into an ergonomic
Python API for use in other projects.

TODOS (roughly in order of priority):
- [ ] Support for more than Linux x86_64 (mostly just adapting C lib and updating CI)
- [ ] Write directly to Arrow, removing CSV writing (w/ nanoarrow probably)
- [ ] Make thread safe (remove global and static function variables in C lib, and remove changing of CWD)
- [ ] Separate out the Rust stuff into its own crate.

### Build from source...

Roughly:

- `git clone --recursive git@github.com:milesgranger/pytpch.git`
- `python -m pip install maturin`
- `maturin build --release`

That'll only work on x86_64 Linux for now. You can try adapting `build.rs`, but good luck with that.
25 changes: 25 additions & 0 deletions pyproject.toml
@@ -0,0 +1,25 @@
[project]
name = "pytpch"
keywords = ["tpc-h"]
requires-python = ">=3.8"
license = { file = "LICENSE" }
dependencies = ["pyarrow"]

[project.urls]
homepage = "https://github.com/milesgranger/pytpch"
documentation = "https://github.com/milesgranger/pytpch"
repository = "https://github.com/milesgranger/pytpch"

[build-system]
requires = ["maturin>=0.14"]
build-backend = "maturin"

[tool.maturin]
strip = true

[project.optional-dependencies]
dev = [
"black==22.3.0",
"pytest>=5.30",
"pytest-xdist",
]
2 changes: 1 addition & 1 deletion src/tpch-dbgen
Submodule tpch-dbgen updated 10 files
+364 −430 bm_utils.c
+9 −0 compile_flags.txt
+90 −91 config.h
+689 −717 driver.c
+1 −0 driver.h
+364 −368 dss.h
+147 −156 dsstypes.h
+405 −367 print.c
+242 −285 text.c
+1 −1 varsub.c
