Merge pull request #136 from yzhao062/development

V 0.7.5
yzhao062 · Oct 13, 2019 · da68f50
2 parents 03ec97c + 850b89a
Showing 16 changed files with 176 additions and 95 deletions.
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -79,6 +79,10 @@ v<0.7.3>, <06/12/2019> -- Fix bugs in SO_GAAL and MO_GAAL.
 v<0.7.4>, <07/10/2019> -- Fix bugs and update documentation.
 v<0.7.4>, <07/17/2019> -- Update dependency (six and joblib).
 v<0.7.4>, <07/19/2019> -- Update deprecation information.
+v<0.7.5>, <09/24/2019> -- Fix one-dimensional data error in LSCP.
+v<0.7.5>, <10/13/2019> -- Document kNN and Isolation Forest's incoming changes.
+v<0.7.5>, <10/13/2019> -- SOD optimization (created by John-Almardeny in June).
+v<0.7.5>, <10/13/2019> -- Documentation updates.
 
 
 

diff --git a/README.rst b/README.rst
@@ -53,7 +53,8 @@ Python Outlier Detection (PyOD)
 
 
 .. image:: https://circleci.com/gh/yzhao062/pyod.svg?style=svg
-    :target: https://circleci.com/gh/yzhao062/pyod
+   :target: https://circleci.com/gh/yzhao062/pyod
+   :alt: Circle CI
 
 
 .. image:: https://coveralls.io/repos/github/yzhao062/pyod/badge.svg
@@ -71,18 +72,14 @@ Python Outlier Detection (PyOD)
    :alt: License
 
 
-.. image:: https://img.shields.io/badge/link-996.icu-red.svg
-   :target: https://github.com/996icu/996.ICU
-   :alt: 996.ICU
-
 -----
 
 PyOD is a comprehensive and scalable **Python toolkit** for **detecting outlying objects** in 
 multivariate data. This exciting yet challenging field is commonly referred as 
 `Outlier Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_
 or `Anomaly Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_.
 Since 2017, PyOD has been successfully used in various academic researches and
-commercial products [#Ramakrishnan2019Anomaly]_ [#Krishnan2019AlphaClean]_ [#Zhao2018DCSO]_ [#Zhao2019LSCP]_.
+commercial products [#Li2019MADGAN]_ [#Zhao2019LSCP]_.
 It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including
 `Analytics Vidhya <https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/>`_,
 `KDnuggets <https://www.kdnuggets.com/2019/02/outlier-detection-methods-cheat-sheet.html>`_,
@@ -591,18 +588,16 @@ Reference
 
 .. [#Kriegel2009Outlier] Kriegel, H.P., Kröger, P., Schubert, E. and Zimek, A., 2009, April. Outlier detection in axis-parallel subspaces of high dimensional data. In *Pacific-Asia Conference on Knowledge Discovery and Data Mining*\ , pp. 831-838. Springer, Berlin, Heidelberg.
 
-.. [#Krishnan2019AlphaClean] Krishnan, S. and Wu, E., 2019. AlphaClean: Automatic Generation of Data Cleaning Pipelines. arXiv preprint arXiv:1904.11827.
-
 .. [#Lazarevic2005Feature] Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In *KDD '05*. 2005.
 
+.. [#Li2019MADGAN] Li, D., Chen, D., Jin, B., Shi, L., Goh, J. and Ng, S.K., 2019, September. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In *International Conference on Artificial Neural Networks* (pp. 703-716). Springer, Cham.
+
 .. [#Liu2008Isolation] Liu, F.T., Ting, K.M. and Zhou, Z.H., 2008, December. Isolation forest. In *International Conference on Data Mining*\ , pp. 413-422. IEEE.
 
 .. [#Liu2019Generative] Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M. and He, X., 2019. Generative adversarial active learning for unsupervised outlier detection. *IEEE Transactions on Knowledge and Data Engineering*.
 
 .. [#Papadimitriou2003LOCI] Papadimitriou, S., Kitagawa, H., Gibbons, P.B. and Faloutsos, C., 2003, March. LOCI: Fast outlier detection using the local correlation integral. In *ICDE '03*, pp. 315-326. IEEE.
 
-.. [#Ramakrishnan2019Anomaly] Ramakrishnan, J., Shaabani, E., Li, C. and Sustik, M.A., 2019. Anomaly Detection for an E-commerce Pricing System. arXiv preprint arXiv:1902.09566.
-
 .. [#Ramaswamy2000Efficient] Ramaswamy, S., Rastogi, R. and Shim, K., 2000, May. Efficient algorithms for mining outliers from large data sets. *ACM Sigmod Record*\ , 29(2), pp. 427-438.
 
 .. [#Rousseeuw1999A] Rousseeuw, P.J. and Driessen, K.V., 1999. A fast algorithm for the minimum covariance determinant estimator. *Technometrics*\ , 41(3), pp.212-223.
@@ -613,8 +608,6 @@ Reference
 
 .. [#Tang2002Enhancing] Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In *Pacific-Asia Conference on Knowledge Discovery and Data Mining*, pp. 535-548. Springer, Berlin, Heidelberg.
 
-.. [#Zhao2018DCSO] Zhao, Y. and Hryniewicki, M.K. DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles. *ACM SIGKDD Workshop on Outlier Detection De-constructed (ODD v5.0)*\ , 2018.
-
 .. [#Zhao2018XGBOD] Zhao, Y. and Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. *IEEE International Joint Conference on Neural Networks*\ , 2018.
 
 .. [#Zhao2019LSCP] Zhao, Y., Nasrullah, Z., Hryniewicki, M.K. and Li, Z., 2019, May. LSCP: Locally selective combination in parallel outlier ensembles. In *Proceedings of the 2019 SIAM International Conference on Data Mining (SDM)*, pp. 585-593. Society for Industrial and Applied Mathematics.
diff --git a/docs/about.rst b/docs/about.rst
@@ -5,14 +5,23 @@ About us
 Core Development Team
 ---------------------
 
-Yue Zhao (initialized the project in 2017): `Homepage <https://www.yuezhao.me/>`_
+Yue Zhao (Ph.D. Student @ Carnegie Mellon University):
 
-Zain Nasrullah (joined in 2018):
-`LinkedIn (Zain Nasrullah) <https://www.linkedin.com/in/zain-nasrullah-097a2b85>`_
+- initialized the project in 2017
+- `Homepage <https://www.andrew.cmu.edu/user/yuezhao2/>`_
 
-Winston (Zheng) Li (joined in 2018):
-`LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl>`_
+Zain Nasrullah (Data Scientist at RBC; MSc in Computer Science):
 
-Yahya Almardeny (joined in 2019):
-`LinkedIn (Yahya Almardeny) <https://www.linkedin.com/in/yahya-almardeny/>`_
+- joined in 2018
+- `LinkedIn (Zain Nasrullah) <https://www.linkedin.com/in/zain-nasrullah-097a2b85>`_
+
+Winston (Zheng) Li (Founder of `arima <https://www.arimadata.com/>`_, Stat Ph.D., Instructor @ Northeastern U):
+
+- joined in 2018
+- `LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl>`_
+
+Yahya Almardeny (Software Systems & Machine Learning Engineer @ TSSG):
+
+- joined in 2019
+- `LinkedIn (Yahya Almardeny) <https://www.linkedin.com/in/yahya-almardeny/>`_
 
diff --git a/docs/faq.rst b/docs/faq.rst
@@ -8,7 +8,7 @@ What is the Next?
 
 This is the central place to track important things to be fixed/added:
 
-- GPU support
+- GPU support (note that Keras with the TensorFlow backend runs on the GPU automatically; auto_encoder_example.py takes around 96.95 seconds on an RTX 2060 GPU).
 - Installation efficiency improvement, such as using docker
 - Add contact channel with `Gitter <https://gitter.im>`_
 - Support additional languages, see `Manage Translations <https://docs.readthedocs.io/en/latest/guides/manage-translations.html>`_

diff --git a/docs/index.rst b/docs/index.rst
@@ -58,7 +58,8 @@ Welcome to PyOD documentation!
 
 
 .. image:: https://circleci.com/gh/yzhao062/pyod.svg?style=svg
-    :target: https://circleci.com/gh/yzhao062/pyod
+   :target: https://circleci.com/gh/yzhao062/pyod
+   :alt: Circle CI
 
 
 .. image:: https://coveralls.io/repos/github/yzhao062/pyod/badge.svg
@@ -76,20 +77,14 @@ Welcome to PyOD documentation!
    :alt: License
 
 
-.. image:: https://img.shields.io/badge/link-996.icu-red.svg
-   :target: https://github.com/996icu/996.ICU
-   :alt: 996.ICU
-
-
 ----
 
 PyOD is a comprehensive and scalable **Python toolkit** for **detecting outlying objects** in
 multivariate data. This exciting yet challenging field is commonly referred as
 `Outlier Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_
 or `Anomaly Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_.
 Since 2017, PyOD :cite:`a-zhao2019pyod` has been successfully used in various
-academic researches and commercial products
-:cite:`a-ramakrishnan2019anomaly,a-krishnan2019alphaclean,a-zhao2018dcso,a-zhao2019lscp`.
+academic researches and commercial products :cite:`a-li2019mad,a-zhao2019lscp`.
 It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including
 `Analytics Vidhya <https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/>`_,
 `Towards Data Science <https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1>`_,

diff --git a/docs/pubs.rst b/docs/pubs.rst
@@ -39,20 +39,30 @@ We are appreciated that PyOD has been increasingly referred and cited in scienti
 
 **2019**
 
+Amorim, M., Bortoloti, F.D., Ciarelli, P.M., Salles, E.O. and Cavalieri, D.C., 2019. Novelty Detection in Social Media by Fusing Text and Image Into a Single Structure. *IEEE Access*, 7, pp.132786-132802.
+
+Li, D., Chen, D., Jin, B., Shi, L., Goh, J. and Ng, S.K., 2019, September. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In *International Conference on Artificial Neural Networks* (pp. 703-716). Springer, Cham.
+
 Ishii, Y. and Takanashi, M., 2019. Low-cost Unsupervised Outlier Detection by Autoencoders with Robust Estimation. *Journal of Information Processing*, 27, pp.335-339.
 
+Ramakrishnan, J., Shaabani, E., Li, C. and Sustik, M.A., 2019. Anomaly detection for an e-commerce pricing system. arXiv preprint arXiv:1902.09566.
+
 Klaeger, T., Schult, A. and Oehm, L., 2019. Using anomaly detection to support classification of fast running (packaging) processes. arXiv preprint arXiv:1906.02473.
 
 Krishnan, S. and Wu, E., 2019. AlphaClean: Automatic Generation of Data Cleaning Pipelines. arXiv preprint arXiv:1904.11827.
 
 Kumar Das, S., Kumar Mishra, A. and Roy, P., 2019. Automatic Diabetes Prediction Using Tree Based Ensemble Learners. *International Journal of Computational Intelligence & IoT*, 2(2).
 
+Li, Y., Zha, D., Zou, N. and Hu, X., 2019. PyODDS: An End-to-End Outlier Detection System. arXiv preprint arXiv:1910.02575.
+
 Ramakrishnan, J., Shaabani, E., Li, C. and Sustik, M.A., 2019. Anomaly Detection for an E-commerce Pricing System. arXiv preprint arXiv:1902.09566.
 
 Trinh, H.D., Giupponi, L. and Dini, P., 2019. Urban Anomaly Detection by processing Mobile Traffic Traces with LSTM Neural Networks. *IEEE International Conference on Sensing, Communication and Networking (IEEE SECON)*.
 
 Wan, C., Li, Z. and Zhao, Y., 2019. SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula. arXiv preprint arXiv:1904.07998.
 
+Wang, H., Bah, M.J. and Hammad, M., 2019. Progress in Outlier Detection Techniques: A Survey. *IEEE Access*, 7, pp.107964-108000.
+
 Weng, Y., Zhang, N. and Xia, C., 2019. Multi-Agent-Based Unsupervised Detection of Energy Consumption Anomalies on Smart Campus. *IEEE Access*, 7, pp.2169-2178.
 
 Zhao, Y., Hryniewicki, M.K., Nasrullah, Z., and Li, Z., 2019. LSCP: Locally Selective Combination in Parallel Outlier Ensembles. *SIAM International Conference on Data Mining (SDM)*, SIAM.

diff --git a/docs/zreferences.bib b/docs/zreferences.bib
@@ -200,9 +200,12 @@ @article{liu2019generative
 }
 
 @article{zhao2019pyod,
-  title={PyOD: A Python Toolbox for Scalable Outlier Detection},
+  title={PyOD: A python toolbox for scalable outlier detection},
   author={Zhao, Yue and Nasrullah, Zain and Li, Zheng},
-  journal={arXiv preprint arXiv:1901.01588},
+  journal={Journal of Machine Learning Research},
+  volume={20},
+  number={96},
+  pages={1--7},
   year={2019}
 }
 
@@ -256,4 +259,13 @@ @inproceedings{kriegel2009outlier
   pages={831--838},
   year={2009},
   organization={Springer}
+}
+
+@inproceedings{li2019mad,
+  title={MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks},
+  author={Li, Dan and Chen, Dacheng and Jin, Baihong and Shi, Lei and Goh, Jonathan and Ng, See-Kiong},
+  booktitle={International Conference on Artificial Neural Networks},
+  pages={703--716},
+  year={2019},
+  organization={Springer}
 }
diff --git a/pyod/models/iforest.py b/pyod/models/iforest.py
@@ -17,7 +17,34 @@
 from ..utils.utility import _sklearn_version_20
 
 
-# TODO: behavior of Isolation Forest will change in sklearn 0.22, to update.
+# TODO: behavior of Isolation Forest will change in sklearn 0.22. See below.
+# In 0.22, scikit-learn will start adjusting decision_function values by
+# an offset so that values below zero mark outliers. In other words, it is
+# an absolute shift, which SHOULD NOT affect the result of PyOD at all as
+# the order is still preserved.
+
+# Behaviour of the decision_function, which can be either 'old' or 'new'.
+# Passing behaviour='new' makes the decision_function change to match other
+# anomaly detection algorithm APIs, which will be the default behaviour in
+# the future. As explained in detail in the offset_ attribute documentation,
+# the decision_function becomes dependent on the contamination parameter,
+# in such a way that 0 becomes its natural threshold to detect outliers.
+
+# offset_ : float
+# Offset used to define the decision function from the raw scores.
+# We have the relation: decision_function = score_samples - offset_.
+# Assuming behaviour == 'new', offset_ is defined as follows.
+# When the contamination parameter is set to "auto",
+# the offset is equal to -0.5 as the scores of inliers are close to 0 and the
+# scores of outliers are close to -1. When a contamination parameter different
+# than "auto" is provided, the offset is defined in such a way we obtain the
+# expected number of outliers (samples with decision function < 0) in training.
+# Assuming the behaviour parameter is set to 'old',
+# we always have offset_ = -0.5, making the decision function independent from
+# the contamination parameter.
+
+# check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html for more information
+
 
 class IForest(BaseDetector):
     """Wrapper of scikit-learn Isolation Forest with more functionalities.

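Note on the iforest.py comment block: the claim that the offset is an absolute shift that leaves the ordering intact can be checked with a small sketch. The scores below are made up for illustration and no scikit-learn call is involved; only the relation decision_function = score_samples - offset_ and the value -0.5 come from the quoted documentation.

```python
# Hypothetical raw anomaly scores (lower = more abnormal); illustration only.
score_samples = [-0.62, -0.41, -0.48, -0.75, -0.39]
offset = -0.5  # value the quoted docs give for behaviour='old'

# Relation quoted in the comment: decision_function = score_samples - offset_
decision_function = [s - offset for s in score_samples]


def ranking(values):
    """Return sample indices sorted from most to least abnormal."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Subtracting a constant changes every value but not the ranking.
same_order = ranking(score_samples) == ranking(decision_function)

# After the shift, zero acts as the threshold: negative values are outliers.
flagged = [i for i, d in enumerate(decision_function) if d < 0]
```

Because the ranking is unchanged, anything derived from the order of scores (for example a percentile-based threshold) should be unaffected by the shift; only code that compares against a fixed numeric cut-off would notice it.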
diff --git a/pyod/models/knn.py b/pyod/models/knn.py
@@ -6,6 +6,8 @@
 from __future__ import division
 from __future__ import print_function
 
+from warnings import warn
+
 import numpy as np
 from sklearn.neighbors import NearestNeighbors
 from sklearn.neighbors import BallTree
@@ -14,6 +16,9 @@
 
 from .base import BaseDetector
 
+# TODO: algorithm parameter is deprecated and will be removed in 0.7.6.
+# Warning has been turned on.
+# TODO: since BallTree is used by default, may introduce its parameters.
 
 class KNN(BaseDetector):
     # noinspection PyPep8
@@ -62,8 +67,12 @@ class KNN(BaseDetector):
         Note: fitting on sparse input will override the setting of
         this parameter, using brute force.
 
+        .. deprecated:: 0.7.4
+           ``algorithm`` is deprecated in PyOD 0.7.4 and will be
+           removed in 0.7.6. BallTree will be used for consistency.
+
     leaf_size : int, optional (default = 30)
-        Leaf size passed to BallTree or KDTree.  This can affect the
+        Leaf size passed to BallTree. This can affect the
         speed of the construction and query, as well as the memory
         required to store the tree.  The optimal value depends on the
         nature of the problem.
@@ -144,6 +153,11 @@ def __init__(self, contamination=0.1, n_neighbors=5, method='largest',
         self.metric_params = metric_params
         self.n_jobs = n_jobs
 
+        if self.algorithm != 'auto' and self.algorithm != 'ball_tree':
+            warn('algorithm parameter is deprecated and will be removed '
+                 'in version 0.7.6. By default, ball_tree will be used.',
+                 FutureWarning)
+
         self.neigh_ = NearestNeighbors(n_neighbors=self.n_neighbors,
                                        radius=self.radius,
                                        algorithm=self.algorithm,
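
Note on the knn.py change: the deprecation check can be exercised in isolation. The sketch below copies the condition and message from the hunk into a hypothetical stand-alone function (`check_algorithm` and `warning_categories` are names made up here, not part of PyOD) and uses the standard `warnings.catch_warnings` context manager to record what is emitted.

```python
import warnings


def check_algorithm(algorithm='auto'):
    """Hypothetical stand-alone copy of the check added to KNN.__init__."""
    if algorithm != 'auto' and algorithm != 'ball_tree':
        warnings.warn('algorithm parameter is deprecated and will be removed '
                      'in version 0.7.6. By default, ball_tree will be used.',
                      FutureWarning)
    return algorithm


def warning_categories(algorithm):
    """Return the categories of all warnings raised for this argument."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # do not let filters hide repeats
        check_algorithm(algorithm)
    return [w.category for w in caught]
```

With this condition, only values other than 'auto' and 'ball_tree' (for example 'kd_tree' or 'brute') trigger the FutureWarning, so callers relying on the default see no change.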

diff --git a/pyod/models/lscp.py b/pyod/models/lscp.py
@@ -296,18 +296,30 @@ def _get_local_region(self, X_test_norm):
                 "Local max features greater than 1.0, reducing to 1.0")
             self.local_max_features = 1.0
 
+        if self.X_train_norm_.shape[1] * self.local_min_features < 1:
+            warnings.warn(
+                "Local min features smaller than 1, increasing to 1.0")
+            self.local_min_features = 1.0
+
         # perform multiple iterations
         for _ in range(self.local_region_iterations):
 
-            # randomly generate feature subspaces
-            features = generate_bagging_indices(
-                self.random_state,
-                bootstrap_features=False,
-                n_features=self.X_train_norm_.shape[1],
-                min_features=int(
-                    self.X_train_norm_.shape[1] * self.local_min_features),
-                max_features=int(
-                    self.X_train_norm_.shape[1] * self.local_max_features))
+            # if min and max are the same, then use all features
+            if self.local_max_features == self.local_min_features:
+                features = range(0, self.X_train_norm_.shape[1])
+                warnings.warn("Local min features equals local max features; "
+                              "use all features instead.")
+
+            else:
+                # randomly generate feature subspaces
+                features = generate_bagging_indices(
+                    self.random_state,
+                    bootstrap_features=False,
+                    n_features=self.X_train_norm_.shape[1],
+                    min_features=int(
+                        self.X_train_norm_.shape[1] * self.local_min_features),
+                    max_features=int(
+                        self.X_train_norm_.shape[1] * self.local_max_features))
 
             # build KDTree out of training subspace
             tree = KDTree(self.X_train_norm_[:, features])
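
Note on the lscp.py change: the one-dimensional fix comes from two guards working together. The sketch below restates them as a hypothetical pure function (`resolve_feature_bounds` is not a PyOD name), assuming the existing max check clamps fractions above 1.0 as its warning message suggests. It returns the integer bounds that would be passed to generate_bagging_indices and whether the all-features shortcut is taken.

```python
def resolve_feature_bounds(n_features, local_min_features,
                           local_max_features):
    """Hypothetical restatement of the guards in _get_local_region."""
    # Assumed existing guard: cap the max fraction at 1.0.
    if local_max_features > 1.0:
        local_max_features = 1.0

    # New guard: a min fraction that yields fewer than one feature
    # is raised to 1.0, i.e. to the full feature set.
    if n_features * local_min_features < 1:
        local_min_features = 1.0

    # New shortcut: equal fractions mean no random subspace is drawn.
    use_all = local_max_features == local_min_features

    return (int(n_features * local_min_features),
            int(n_features * local_max_features),
            use_all)
```

For a single feature with fractions 0.5 and 1.0 (values chosen for illustration), the min fraction is raised to 1.0, the two fractions become equal, and all features are used, which avoids requesting zero features. As the patch is written, the same path is taken for wider data whenever the min fraction rounds below one feature: 10 features with a min fraction of 0.05 also ends up using all 10.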

diff --git a/pyod/models/sklearn_base.py b/pyod/models/sklearn_base.py
@@ -29,17 +29,6 @@ def _get_n_jobs(n_jobs):
     -------
     n_jobs : int
         The actual number of jobs as positive integer.
-    Examples
-    --------
-    >>> from sklearn.utils import _get_n_jobs
-    >>> _get_n_jobs(4)
-    4
-    >>> jobs = _get_n_jobs(-2)
-    >>> assert jobs == max(cpu_count() - 1, 1)
-    >>> _get_n_jobs(0)
-    Traceback (most recent call last):
-    ...
-    ValueError: Parameter n_jobs == 0 has no meaning.
     """
     if n_jobs < 0:
         return max(cpu_count() + 1 + n_jobs, 1)
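
Note on the removed docstring example: the behaviour it described can still be illustrated without importing the private helper. The sketch below is a hypothetical re-implementation (`get_n_jobs` and its `n_cpus` parameter are made up here so the result does not depend on the machine); the negative branch matches the context lines above and the zero case follows the removed doctest.

```python
from multiprocessing import cpu_count


def get_n_jobs(n_jobs, n_cpus=None):
    """Hypothetical sketch of the n_jobs resolution rule."""
    if n_cpus is None:
        n_cpus = cpu_count()
    if n_jobs < 0:
        # -1 means all CPUs, -2 all but one, and so on; never below 1.
        return max(n_cpus + 1 + n_jobs, 1)
    if n_jobs == 0:
        raise ValueError('Parameter n_jobs == 0 has no meaning.')
    return n_jobs
```

On an assumed 8-CPU machine, `get_n_jobs(-1, n_cpus=8)` gives 8 and `get_n_jobs(-2, n_cpus=8)` gives 7, while positive values pass through unchanged.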