
Commit

Merge pull request #25 from KaveIO/inplace_build
Update setup to make it compatible with --use-feature=in-tree-build
mbaak authored Jul 26, 2021
2 parents 53cd8f7 + 303272f commit 6543aae
Showing 32 changed files with 2,104 additions and 805 deletions.
40 changes: 40 additions & 0 deletions .github/workflows/inplace_build.yml
@@ -0,0 +1,40 @@
name: In tree build

on:
  workflow_dispatch:
  pull_request:
  push:
    branches:
      - master

jobs:
  build:
    name: ${{ matrix.platform }}
    strategy:
      fail-fast: false
      matrix:
        platform: [windows-latest, macos-latest, ubuntu-latest]

    runs-on: ${{ matrix.platform }}

    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true

      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: Add requirements
        run: |
          python -m pip install --upgrade pip wheel setuptools jupyter

      - name: Build and install
        run: pip install --use-feature=in-tree-build --verbose ".[test]"

      - name: Unit test
        run: pytest tests/phik_python/test_phik.py -v

      - name: Integration test
        run: pytest tests/phik_python/integration/test_notebooks.py -v
8 changes: 6 additions & 2 deletions .github/workflows/test_matrix.yml
@@ -9,6 +9,7 @@ on:

jobs:
  build:
    name: ${{ matrix.platform }} Python ${{ matrix.python-version }}
    strategy:
      fail-fast: false
      matrix:
@@ -33,5 +34,8 @@ jobs:
      - name: Build and install
        run: pip install --verbose ".[test]"

      - name: Test
        run: pytest
      - name: Unit test
        run: pytest tests/phik_python/test_phik.py -v

      - name: Integration test
        run: pytest tests/phik_python/integration/test_notebooks.py -v
1 change: 1 addition & 0 deletions .github/workflows/wheels.yml
@@ -56,6 +56,7 @@ jobs:
      - name: Build wheel
        run: python -m cibuildwheel --output-dir wheelhouse
        env:
          CIBW_ENVIRONMENT: MACOSX_DEPLOYMENT_TARGET=10.13
          CIBW_BUILD: 'cp36-* cp37-* cp38-* cp39-*'
          CIBW_TEST_EXTRAS: test
          CIBW_TEST_COMMAND: pytest {project}/tests/phik_python/test_phik.py
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.so
*egg-info*
80 changes: 80 additions & 0 deletions CHANGES.rst
@@ -0,0 +1,80 @@
=============
Release notes
=============

Version 0.12.0, July 2021
-------------------------

C++ Extension
~~~~~~~~~~~~~

Phi_K contains an optional C++ extension to compute the significance matrix using the `hypergeometric` method
(also called the `Patefield` method).

Note that the PyPI-distributed wheels contain a pre-built extension for Linux, MacOS and Windows.

A manual (pip) setup will attempt to build and install the extension; if the build fails, the package is installed
without the extension. In that case, using the `hypergeometric` method will trigger a
NotImplementedError.

Compiler requirements through Pybind11:

- Clang/LLVM 3.3 or newer (for Apple Xcode's clang, this is 5.0.0 or newer)
- GCC 4.8 or newer
- Microsoft Visual Studio 2015 Update 3 or newer
- Intel classic C++ compiler 18 or newer (ICC 20.2 tested in CI)
- Cygwin/GCC (previously tested on 2.5.1)
- NVCC (CUDA 11.0 tested in CI)
- NVIDIA PGI (20.9 tested in CI)


Other
~~~~~

* You can now manually set the number of parallel jobs in the evaluation of Phi_K or its statistical significance
(when using MC simulations). For example, to use 4 parallel jobs do:

.. code-block:: python

    df.phik_matrix(njobs = 4)
    df.significance_matrix(njobs = 4)

The default value is -1, in which case all available cores are used. When using ``njobs=1`` no parallel processing
is applied.

* Phi_K can now be calculated with an independent expectation histogram:

.. code-block:: python

    from phik.phik import phik_from_hist2d

    cols = ["mileage", "car_size"]
    interval_cols = ["mileage"]
    observed = df1[cols].hist2d()
    expected = df2[cols].hist2d()
    phik_value = phik_from_hist2d(observed=observed, expected=expected)

The expected histogram is taken to be (relatively) large in number of counts
compared with the observed histogram.

Alternatively, one can compare two (pre-binned) datasets against each other directly; see the combined sketch after
this list for a fuller example. Again, the expected dataset is assumed to be relatively large:

.. code-block:: python

    from phik.phik import phik_observed_vs_expected_from_rebinned_df

    phik_matrix = phik_observed_vs_expected_from_rebinned_df(df1_binned, df2_binned)
* Added links in the README to the basic and advanced Phi_K tutorials on Google Colab.
* Migrated the Spark example Phi_K notebook from popmon to using histogrammar directly for histogram creation.
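
As a combined illustration of the options above (this sketch is not part of the changelog or the package
documentation), the following assumes a pandas ``DataFrame`` with the columns shown and that ``phik_from_hist2d``
accepts plain 2-D count arrays; the column names, bin edges and sample sizes are illustrative only:

.. code-block:: python

    import numpy as np
    import pandas as pd

    import phik  # noqa: F401  -- registers the .phik_matrix / .significance_matrix accessors
    from phik.phik import phik_from_hist2d

    # toy mixed-type data (illustrative column names)
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "mileage": rng.exponential(50_000, size=1000),
        "car_size": rng.choice(["S", "M", "L", "XL"], size=1000),
    })

    # parallel evaluation: njobs=-1 (default) uses all cores, njobs=1 disables parallel processing
    corr_matrix = df.phik_matrix(interval_cols=["mileage"], njobs=4)
    sig_matrix = df.significance_matrix(interval_cols=["mileage"], njobs=4)

    # Phi_K against an independent, much larger expectation histogram
    sizes = ["S", "M", "L", "XL"]
    mileage_bins = np.linspace(0.0, 300_000.0, 11)
    size_codes = pd.Categorical(df["car_size"], categories=sizes).codes
    observed, _, _ = np.histogram2d(
        df["mileage"], size_codes, bins=[mileage_bins, np.arange(len(sizes) + 1) - 0.5]
    )
    expected = observed * 100.0  # stand-in for a reference sample with far higher counts
    phik_value = phik_from_hist2d(observed=observed, expected=expected)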




Older versions
--------------

* Please see documentation for full details: https://phik.readthedocs.io
5 changes: 5 additions & 0 deletions MANIFEST.in
@@ -1,7 +1,12 @@
include NOTICE
include LICENSE
include CMakeLists.txt
include phik/simcore/CMakeLists.txt
recursive-include phik *.hpp
recursive-include phik *.cpp

global-include README.rst
global-include CMakeLists.txt
global-exclude *.py[cod] __pycache__ *.so
exclude docs tests .readthedocs.yml
recursive-exclude tests *.py
61 changes: 23 additions & 38 deletions README.rst
@@ -2,25 +2,40 @@
Phi_K Correlation Analyzer Library
==================================

* Version: 0.11.2. Released: Mar 2021
* Documentation: https://phik.readthedocs.io
* Version: 0.12.0. Released: Jul 2021
* Release notes: https://github.com/KaveIO/PhiK/blob/master/CHANGES.rst
* Repository: https://github.com/kaveio/phik
* Documentation: https://phik.readthedocs.io
* Publication: `[offical] <https://www.sciencedirect.com/science/article/abs/pii/S0167947320301341>`_ `[arxiv pre-print] <https://arxiv.org/abs/1811.11440>`_

Phi_K is a practical correlation constant that works consistently between categorical, ordinal and interval variables.
It is based on several refinements to Pearson's hypothesis test of independence of two variables.
It is based on several refinements to Pearson's hypothesis test of independence of two variables. Essentially, the
contingency test statistic of two variables is interpreted as coming from a rotated bi-variate normal distribution,
where the tilt is interpreted as Phi_K.

The combined features of Phi_K form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables.
Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution.
These are useful features when studying the correlation matrix of variables with mixed types.

The presented algorithms are easy to use and available through this public Python library: the correlation analyzer package.
Emphasis is paid to the proper evaluation of statistical significance of correlations and to the interpretation of variable relationships
For details on the methodology behind the calculations, please see our publication. Emphasis is placed on the proper evaluation of the statistical significance of correlations and on the interpretation of variable relationships
in a contingency table, in particular in case of low statistics samples.
The presented algorithms are easy to use and available through this public Python library.

For example, the Phi_K correlation analyzer package has been used to study surveys, insurance claims, correlograms, etc.
For details on the methodology behind the calculations, please see our publication.
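
As a quick illustration of the reversion to Pearson for a bi-variate normal input (this sketch is not part of the
README; the sample size and correlation value are arbitrary, and the two coefficients agree only approximately
because of binning):

.. code-block:: python

    import numpy as np
    import pandas as pd

    import phik  # noqa: F401  -- registers the .phik_matrix accessor on DataFrames

    # bivariate normal sample with true correlation 0.6
    rng = np.random.default_rng(0)
    cov = [[1.0, 0.6], [0.6, 1.0]]
    xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)
    df = pd.DataFrame(xy, columns=["x", "y"])

    pearson = df.corr().loc["x", "y"]
    phik_value = df.phik_matrix(interval_cols=["x", "y"]).loc["x", "y"]

    # for a bivariate normal input, Phi_K is expected to be close to |Pearson|
    print(f"pearson = {pearson:.2f}, phik = {phik_value:.2f}")
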
Example notebooks
=================

.. list-table::
:widths: 60 40
:header-rows: 1

* - Static link
- Google Colab link
* - `basic tutorial <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_basic.ipynb>`_
- `basic on colab <https://colab.research.google.com/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_basic.ipynb>`_
* - `advanced tutorial (detailed configuration) <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_advanced.ipynb>`_
- `advanced on colab <https://colab.research.google.com/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_advanced.ipynb>`_
* - `spark tutorial <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_spark.ipynb>`_
- no spark available

Documentation
=============
@@ -29,7 +44,6 @@ The entire Phi_K documentation including tutorials can be found at `read-the-doc
See the tutorials for detailed examples on how to run the code with pandas. We also have one example on how
to calculate the Phi_K correlation matrix for a Spark dataframe.


Check it out
============

@@ -56,35 +70,6 @@ You can now use the package in Python with:
**Congratulations, you are now ready to use the PhiK correlation analyzer library!**

Speedups
--------

Phi_K can use the Numba JIT library for faster computation of certain operations.
You can either install Numba separately or use the `numba` extra specifier while installing:

.. code-block:: bash

    $ pip install phik[numba]

C++ Extension
-------------

Phi_K contains an optional C++ extension to compute the significance matrix using the `hypergeometric` method.

Note that the PyPI-distributed wheels contain a pre-built extension for Linux, MacOS and Windows.

The setup will attempt to build and install the extension; if the build fails, the package is installed without the extension.
Using the `hypergeometric` method without the extension will trigger a NotImplementedError.

Compiler requirements through Pybind11:

- Clang/LLVM 3.3 or newer (for Apple Xcode's clang, this is 5.0.0 or newer)
- GCC 4.8 or newer
- Microsoft Visual Studio 2015 Update 3 or newer
- Intel classic C++ compiler 18 or newer (ICC 20.2 tested in CI)
- Cygwin/GCC (previously tested on 2.5.1)
- NVCC (CUDA 11.0 tested in CI)
- NVIDIA PGI (20.9 tested in CI)

Quick run
=========
@@ -136,4 +121,4 @@ Contact and support

* Issues and Ideas: https://github.com/kaveio/phik/issues

Please note that KPMG provides support only on a best-effort basis.
Please note that support is (only) provided on a best-effort basis.
2 changes: 1 addition & 1 deletion docs/source/tutorials.rst
Expand Up @@ -6,7 +6,7 @@ This section contains materials on how to use the Phi_K correlation analysis cod
There are additional side notes on how certain aspects work and where to find parts of the code.
For more in depth explanations on the functionality of the code-base, try the `API docs <phik_index.html>`_.

The tutorials are available in the ``python/phik/notebooks`` directory. We have:
The tutorials are available in the ``phik/notebooks`` directory. We have:

* A basic tutorial: this covers the basics of calculating Phi_K, the statistical significance, and interpreting the correlation.
* An advanced tutorial: this shows how to use the advanced features of the ``PhiK`` library.
