
Commit

Merge pull request #25 from KaveIO/inplace_build
Update setup to make it compatible with --use-feature=in-tree-build
mbaak authored Jul 26, 2021
2 parents 53cd8f7 + 303272f commit 6543aae
Showing 32 changed files with 2,104 additions and 805 deletions.
40 changes: 40 additions & 0 deletions .github/workflows/inplace_build.yml
@@ -0,0 +1,40 @@
name: In tree build

on:
  workflow_dispatch:
  pull_request:
  push:
    branches:
      - master

jobs:
  build:
    name: ${{ matrix.platform }}
    strategy:
      fail-fast: false
      matrix:
        platform: [windows-latest, macos-latest, ubuntu-latest]

    runs-on: ${{ matrix.platform }}

    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true

      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: Add requirements
        run: |
          python -m pip install --upgrade pip wheel setuptools jupyter

      - name: Build and install
        run: pip install --use-feature=in-tree-build --verbose ".[test]"

      - name: Unit test
        run: pytest tests/phik_python/test_phik.py -v

      - name: Integration test
        run: pytest tests/phik_python/integration/test_notebooks.py -v
8 changes: 6 additions & 2 deletions .github/workflows/test_matrix.yml
@@ -9,6 +9,7 @@ on:

jobs:
  build:
    name: ${{ matrix.platform }} Python ${{ matrix.python-version }}
    strategy:
      fail-fast: false
      matrix:
@@ -33,5 +34,8 @@ jobs:
      - name: Build and install
        run: pip install --verbose ".[test]"

      - name: Test
        run: pytest
      - name: Unit test
        run: pytest tests/phik_python/test_phik.py -v

      - name: Integration test
        run: pytest tests/phik_python/integration/test_notebooks.py -v
1 change: 1 addition & 0 deletions .github/workflows/wheels.yml
@@ -56,6 +56,7 @@ jobs:
      - name: Build wheel
        run: python -m cibuildwheel --output-dir wheelhouse
        env:
          CIBW_ENVIRONMENT: MACOSX_DEPLOYMENT_TARGET=10.13
          CIBW_BUILD: 'cp36-* cp37-* cp38-* cp39-*'
          CIBW_TEST_EXTRAS: test
          CIBW_TEST_COMMAND: pytest {project}/tests/phik_python/test_phik.py
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.so
*egg-info*
80 changes: 80 additions & 0 deletions CHANGES.rst
@@ -0,0 +1,80 @@
=============
Release notes
=============

Version 0.12.0, July 2021
-------------------------

C++ Extension
~~~~~~~~~~~~~

Phi_K contains an optional C++ extension to compute the significance matrix using the `hypergeometric` method
(also called the `Patefield` method).

Note that the PyPI-distributed wheels contain a pre-built extension for Linux, MacOS and Windows.

A manual (pip) setup will attempt to build and install the extension; if the build fails, the package is installed
without the extension. In that case, using the `hypergeometric` method will trigger a
NotImplementedError.

Compiler requirements through Pybind11:

- Clang/LLVM 3.3 or newer (for Apple Xcode's clang, this is 5.0.0 or newer)
- GCC 4.8 or newer
- Microsoft Visual Studio 2015 Update 3 or newer
- Intel classic C++ compiler 18 or newer (ICC 20.2 tested in CI)
- Cygwin/GCC (previously tested on 2.5.1)
- NVCC (CUDA 11.0 tested in CI)
- NVIDIA PGI (20.9 tested in CI)


Other
~~~~~

* You can now manually set the number of parallel jobs in the evaluation of Phi_K or its statistical significance
(when using MC simulations). For example, to use 4 parallel jobs do:

.. code-block:: python

    df.phik_matrix(njobs = 4)
    df.significance_matrix(njobs = 4)

The default value is -1, in which case all available cores are used. When using ``njobs=1`` no parallel processing
is applied.

* Phi_K can now be calculated with an independent expectation histogram:

.. code-block:: python

    from phik.phik import phik_from_hist2d

    cols = ["mileage", "car_size"]
    interval_cols = ["mileage"]
    observed = df1[cols].hist2d()
    expected = df2[cols].hist2d()
    phik_value = phik_from_hist2d(observed=observed, expected=expected)

The expected histogram is taken to be (relatively) large in number of counts
compared with the observed histogram.

Alternatively, one can compare two (pre-binned) datasets against each other directly; see the combined sketch after
this list for a fuller example. Again, the expected dataset is assumed to be relatively large:

.. code-block:: python

    from phik.phik import phik_observed_vs_expected_from_rebinned_df

    phik_matrix = phik_observed_vs_expected_from_rebinned_df(df1_binned, df2_binned)
* Added links in the README to the basic and advanced Phi_K tutorials on Google Colab.
* Migrated the Spark example Phi_K notebook from popmon to using histogrammar directly for histogram creation.
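
As a combined illustration of the options above (this sketch is not part of the changelog or the package
documentation), the following assumes a pandas ``DataFrame`` with the columns shown and that ``phik_from_hist2d``
accepts plain 2-D count arrays; the column names, bin edges and sample sizes are illustrative only:

.. code-block:: python

    import numpy as np
    import pandas as pd

    import phik  # noqa: F401  -- registers the .phik_matrix / .significance_matrix accessors
    from phik.phik import phik_from_hist2d

    # toy mixed-type data (illustrative column names)
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "mileage": rng.exponential(50_000, size=1000),
        "car_size": rng.choice(["S", "M", "L", "XL"], size=1000),
    })

    # parallel evaluation: njobs=-1 (default) uses all cores, njobs=1 disables parallel processing
    corr_matrix = df.phik_matrix(interval_cols=["mileage"], njobs=4)
    sig_matrix = df.significance_matrix(interval_cols=["mileage"], njobs=4)

    # Phi_K against an independent, much larger expectation histogram
    sizes = ["S", "M", "L", "XL"]
    mileage_bins = np.linspace(0.0, 300_000.0, 11)
    size_codes = pd.Categorical(df["car_size"], categories=sizes).codes
    observed, _, _ = np.histogram2d(
        df["mileage"], size_codes, bins=[mileage_bins, np.arange(len(sizes) + 1) - 0.5]
    )
    expected = observed * 100.0  # stand-in for a reference sample with far higher counts
    phik_value = phik_from_hist2d(observed=observed, expected=expected)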




Older versions
--------------

* Please see documentation for full details: https://phik.readthedocs.io
5 changes: 5 additions & 0 deletions MANIFEST.in
@@ -1,7 +1,12 @@
include NOTICE
include LICENSE
include CMakeLists.txt
include phik/simcore/CMakeLists.txt
recursive-include phik *.hpp
recursive-include phik *.cpp

global-include README.rst
global-include CMakeLists.txt
global-exclude *.py[cod] __pycache__ *.so
exclude docs tests .readthedocs.yml
recursive-exclude tests *.py
61 changes: 23 additions & 38 deletions README.rst
@@ -2,25 +2,40 @@
Phi_K Correlation Analyzer Library
==================================

* Version: 0.11.2. Released: Mar 2021
* Documentation: https://phik.readthedocs.io
* Version: 0.12.0. Released: Jul 2021
* Release notes: https://github.com/KaveIO/PhiK/blob/master/CHANGES.rst
* Repository: https://github.com/kaveio/phik
* Documentation: https://phik.readthedocs.io
* Publication: `[offical] <https://www.sciencedirect.com/science/article/abs/pii/S0167947320301341>`_ `[arxiv pre-print] <https://arxiv.org/abs/1811.11440>`_

Phi_K is a practical correlation constant that works consistently between categorical, ordinal and interval variables.
It is based on several refinements to Pearson's hypothesis test of independence of two variables.
It is based on several refinements to Pearson's hypothesis test of independence of two variables. Essentially, the
contingency test statistic of two variables is interpreted as coming from a rotated bi-variate normal distribution,
where the tilt is interpreted as Phi_K.

The combined features of Phi_K form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables.
Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution.
These are useful features when studying the correlation matrix of variables with mixed types.

The presented algorithms are easy to use and available through this public Python library: the correlation analyzer package.
Emphasis is paid to the proper evaluation of statistical significance of correlations and to the interpretation of variable relationships
For details on the methodology behind the calculations, please see our publication. Emphasis is placed on the proper evaluation of the statistical significance of correlations and on the interpretation of variable relationships
in a contingency table, in particular in case of low statistics samples.
The presented algorithms are easy to use and available through this public Python library.

For example, the Phi_K correlation analyzer package has been used to study surveys, insurance claims, correlograms, etc.
For details on the methodology behind the calculations, please see our publication.
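
As a quick illustration of the reversion to Pearson for a bi-variate normal input (this sketch is not part of the
README; the sample size and correlation value are arbitrary, and the two coefficients agree only approximately
because of binning):

.. code-block:: python

    import numpy as np
    import pandas as pd

    import phik  # noqa: F401  -- registers the .phik_matrix accessor on DataFrames

    # bivariate normal sample with true correlation 0.6
    rng = np.random.default_rng(0)
    cov = [[1.0, 0.6], [0.6, 1.0]]
    xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)
    df = pd.DataFrame(xy, columns=["x", "y"])

    pearson = df.corr().loc["x", "y"]
    phik_value = df.phik_matrix(interval_cols=["x", "y"]).loc["x", "y"]

    # for a bivariate normal input, Phi_K is expected to be close to |Pearson|
    print(f"pearson = {pearson:.2f}, phik = {phik_value:.2f}")
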
Example notebooks
=================

.. list-table::
:widths: 60 40
:header-rows: 1

* - Static link
- Google Colab link
* - `basic tutorial <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_basic.ipynb>`_
- `basic on colab <https://colab.research.google.com/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_basic.ipynb>`_
* - `advanced tutorial (detailed configuration) <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_advanced.ipynb>`_
- `advanced on colab <https://colab.research.google.com/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_advanced.ipynb>`_
* - `spark tutorial <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_spark.ipynb>`_
- no spark available

Documentation
=============
@@ -29,7 +44,6 @@ The entire Phi_K documentation including tutorials can be found at `read-the-doc
See the tutorials for detailed examples on how to run the code with pandas. We also have one example on how
to calculate the Phi_K correlation matrix for a Spark dataframe.


Check it out
============

@@ -56,35 +70,6 @@ You can now use the package in Python with:
**Congratulations, you are now ready to use the PhiK correlation analyzer library!**

Speedups
--------

Phi_K can use the Numba JIT library for faster computation of certain operations.
You can either install Numba separately or use the `numba` extra specifier while installing:

.. code-block:: bash

    $ pip install phik[numba]

C++ Extension
-------------

Phi_K contains an optional C++ extension to compute the significance matrix using the `hypergeometric` method.

Note that the PyPI-distributed wheels contain a pre-built extension for Linux, MacOS and Windows.

The setup will attempt to build and install the extension; if the build fails, the package is installed without the extension.
Using the `hypergeometric` method without the extension will trigger a NotImplementedError.

Compiler requirements through Pybind11:

- Clang/LLVM 3.3 or newer (for Apple Xcode's clang, this is 5.0.0 or newer)
- GCC 4.8 or newer
- Microsoft Visual Studio 2015 Update 3 or newer
- Intel classic C++ compiler 18 or newer (ICC 20.2 tested in CI)
- Cygwin/GCC (previously tested on 2.5.1)
- NVCC (CUDA 11.0 tested in CI)
- NVIDIA PGI (20.9 tested in CI)

Quick run
=========
@@ -136,4 +121,4 @@ Contact and support

* Issues and Ideas: https://github.com/kaveio/phik/issues

Please note that KPMG provides support only on a best-effort basis.
Please note that support is (only) provided on a best-effort basis.
2 changes: 1 addition & 1 deletion docs/source/tutorials.rst
Expand Up @@ -6,7 +6,7 @@ This section contains materials on how to use the Phi_K correlation analysis cod
There are additional side notes on how certain aspects work and where to find parts of the code.
For more in depth explanations on the functionality of the code-base, try the `API docs <phik_index.html>`_.

The tutorials are available in the ``python/phik/notebooks`` directory. We have:
The tutorials are available in the ``phik/notebooks`` directory. We have:

* A basic tutorial: this covers the basics of calculating Phi_K, the statistical significance, and interpreting the correlation.
* An advanced tutorial: this shows how to use the advanced features of the ``PhiK`` library.
