Merge pull request #136 from yzhao062/development
V 0.7.5
yzhao062 authored Oct 13, 2019
2 parents 03ec97c + 850b89a commit da68f50
Showing 16 changed files with 176 additions and 95 deletions.
4 changes: 4 additions & 0 deletions CHANGES.txt
@@ -79,6 +79,10 @@ v<0.7.3>, <06/12/2019> -- Fix bugs in SO_GAAL and MO_GAAL.
v<0.7.4>, <07/10/2019> -- Fix bugs and update documentation.
v<0.7.4>, <07/17/2019> -- Update dependency (six and joblib).
v<0.7.4>, <07/19/2019> -- Update deprecation information.
v<0.7.5>, <09/24/2019> -- Fix one dimensional data error in LSCP.
v<0.7.5>, <10/13/2019> -- Document kNN and Isolation Forest's incoming changes.
v<0.7.5>, <10/13/2019> -- SOD optimization (created by John-Almardeny in June).
v<0.7.5>, <10/13/2019> -- Documentation updates.



17 changes: 5 additions & 12 deletions README.rst
@@ -53,7 +53,8 @@ Python Outlier Detection (PyOD)


.. image:: https://circleci.com/gh/yzhao062/pyod.svg?style=svg
:target: https://circleci.com/gh/yzhao062/pyod
:target: https://circleci.com/gh/yzhao062/pyod
:alt: Circle CI


.. image:: https://coveralls.io/repos/github/yzhao062/pyod/badge.svg
@@ -71,18 +72,14 @@ Python Outlier Detection (PyOD)
:alt: License


.. image:: https://img.shields.io/badge/link-996.icu-red.svg
:target: https://github.com/996icu/996.ICU
:alt: 996.ICU

-----

PyOD is a comprehensive and scalable **Python toolkit** for **detecting outlying objects** in
multivariate data. This exciting yet challenging field is commonly referred to as
`Outlier Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_
or `Anomaly Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_.
Since 2017, PyOD has been successfully used in various academic research projects and
commercial products [#Ramakrishnan2019Anomaly]_ [#Krishnan2019AlphaClean]_ [#Zhao2018DCSO]_ [#Zhao2019LSCP]_.
commercial products [#Li2019MADGAN]_ [#Zhao2019LSCP]_.
It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including
`Analytics Vidhya <https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/>`_,
`KDnuggets <https://www.kdnuggets.com/2019/02/outlier-detection-methods-cheat-sheet.html>`_,
@@ -591,18 +588,16 @@ Reference
.. [#Kriegel2009Outlier] Kriegel, H.P., Kröger, P., Schubert, E. and Zimek, A., 2009, April. Outlier detection in axis-parallel subspaces of high dimensional data. In *Pacific-Asia Conference on Knowledge Discovery and Data Mining*\ , pp. 831-838. Springer, Berlin, Heidelberg.
.. [#Krishnan2019AlphaClean] Krishnan, S. and Wu, E., 2019. AlphaClean: Automatic Generation of Data Cleaning Pipelines. arXiv preprint arXiv:1904.11827.
.. [#Lazarevic2005Feature] Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In *KDD '05*. 2005.
.. [#Li2019MADGAN] Li, D., Chen, D., Jin, B., Shi, L., Goh, J. and Ng, S.K., 2019, September. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In *International Conference on Artificial Neural Networks* (pp. 703-716). Springer, Cham.
.. [#Liu2008Isolation] Liu, F.T., Ting, K.M. and Zhou, Z.H., 2008, December. Isolation forest. In *International Conference on Data Mining*\ , pp. 413-422. IEEE.
.. [#Liu2019Generative] Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M. and He, X., 2019. Generative adversarial active learning for unsupervised outlier detection. *IEEE Transactions on Knowledge and Data Engineering*.
.. [#Papadimitriou2003LOCI] Papadimitriou, S., Kitagawa, H., Gibbons, P.B. and Faloutsos, C., 2003, March. LOCI: Fast outlier detection using the local correlation integral. In *ICDE '03*, pp. 315-326. IEEE.
.. [#Ramakrishnan2019Anomaly] Ramakrishnan, J., Shaabani, E., Li, C. and Sustik, M.A., 2019. Anomaly Detection for an E-commerce Pricing System. arXiv preprint arXiv:1902.09566.
.. [#Ramaswamy2000Efficient] Ramaswamy, S., Rastogi, R. and Shim, K., 2000, May. Efficient algorithms for mining outliers from large data sets. *ACM Sigmod Record*\ , 29(2), pp. 427-438.
.. [#Rousseeuw1999A] Rousseeuw, P.J. and Driessen, K.V., 1999. A fast algorithm for the minimum covariance determinant estimator. *Technometrics*\ , 41(3), pp.212-223.
@@ -613,8 +608,6 @@ Reference
.. [#Tang2002Enhancing] Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In *Pacific-Asia Conference on Knowledge Discovery and Data Mining*, pp. 535-548. Springer, Berlin, Heidelberg.
.. [#Zhao2018DCSO] Zhao, Y. and Hryniewicki, M.K. DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles. *ACM SIGKDD Workshop on Outlier Detection De-constructed (ODD v5.0)*\ , 2018.
.. [#Zhao2018XGBOD] Zhao, Y. and Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. *IEEE International Joint Conference on Neural Networks*\ , 2018.
.. [#Zhao2019LSCP] Zhao, Y., Nasrullah, Z., Hryniewicki, M.K. and Li, Z., 2019, May. LSCP: Locally selective combination in parallel outlier ensembles. In *Proceedings of the 2019 SIAM International Conference on Data Mining (SDM)*, pp. 585-593. Society for Industrial and Applied Mathematics.
23 changes: 16 additions & 7 deletions docs/about.rst
@@ -5,14 +5,23 @@ About us
Core Development Team
---------------------

Yue Zhao (initialized the project in 2017): `Homepage <https://www.yuezhao.me/>`_
Yue Zhao (Ph.D. Student @ Carnegie Mellon University):

Zain Nasrullah (joined in 2018):
`LinkedIn (Zain Nasrullah) <https://www.linkedin.com/in/zain-nasrullah-097a2b85>`_
- initialized the project in 2017
- `Homepage <https://www.andrew.cmu.edu/user/yuezhao2/>`_

Winston (Zheng) Li (joined in 2018):
`LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl>`_
Zain Nasrullah (Data Scientist at RBC; MSc in Computer Science):

Yahya Almardeny (joined in 2019):
`LinkedIn (Yahya Almardeny) <https://www.linkedin.com/in/yahya-almardeny/>`_
- joined in 2018
- `LinkedIn (Zain Nasrullah) <https://www.linkedin.com/in/zain-nasrullah-097a2b85>`_

Winston (Zheng) Li (Founder of `arima <https://www.arimadata.com/>`_, Stat Ph.D., Instructor @ Northeastern U):

- joined in 2018
- `LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl>`_

Yahya Almardeny (Software Systems & Machine Learning Engineer @ TSSG):

- joined in 2019
- `LinkedIn (Yahya Almardeny) <https://www.linkedin.com/in/yahya-almardeny/>`_

2 changes: 1 addition & 1 deletion docs/faq.rst
@@ -8,7 +8,7 @@ What is the Next?

This is the central place to track important things to be fixed/added:

- GPU support
- GPU support (note that Keras with the TensorFlow backend runs on the GPU automatically; auto_encoder_example.py takes around 96.95 seconds on an RTX 2060 GPU).
- Installation efficiency improvement, such as using docker
- Add contact channel with `Gitter <https://gitter.im>`_
- Support additional languages, see `Manage Translations <https://docs.readthedocs.io/en/latest/guides/manage-translations.html>`_
11 changes: 3 additions & 8 deletions docs/index.rst
@@ -58,7 +58,8 @@ Welcome to PyOD documentation!


.. image:: https://circleci.com/gh/yzhao062/pyod.svg?style=svg
:target: https://circleci.com/gh/yzhao062/pyod
:target: https://circleci.com/gh/yzhao062/pyod
:alt: Circle CI


.. image:: https://coveralls.io/repos/github/yzhao062/pyod/badge.svg
@@ -76,20 +77,14 @@ Welcome to PyOD documentation!
:alt: License


.. image:: https://img.shields.io/badge/link-996.icu-red.svg
:target: https://github.com/996icu/996.ICU
:alt: 996.ICU


----

PyOD is a comprehensive and scalable **Python toolkit** for **detecting outlying objects** in
multivariate data. This exciting yet challenging field is commonly referred to as
`Outlier Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_
or `Anomaly Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_.
Since 2017, PyOD :cite:`a-zhao2019pyod` has been successfully used in various
academic research projects and commercial products
:cite:`a-ramakrishnan2019anomaly,a-krishnan2019alphaclean,a-zhao2018dcso,a-zhao2019lscp`.
academic research projects and commercial products :cite:`a-li2019mad,a-zhao2019lscp`.
It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including
`Analytics Vidhya <https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/>`_,
`Towards Data Science <https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1>`_,
10 changes: 10 additions & 0 deletions docs/pubs.rst
@@ -39,20 +39,30 @@ We are appreciated that PyOD has been increasingly referred and cited in scienti

**2019**

Amorim, M., Bortoloti, F.D., Ciarelli, P.M., Salles, E.O. and Cavalieri, D.C., 2019. Novelty Detection in Social Media by Fusing Text and Image Into a Single Structure. *IEEE Access*, 7, pp.132786-132802.

Li, D., Chen, D., Jin, B., Shi, L., Goh, J. and Ng, S.K., 2019, September. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In *International Conference on Artificial Neural Networks* (pp. 703-716). Springer, Cham.

Ishii, Y. and Takanashi, M., 2019. Low-cost Unsupervised Outlier Detection by Autoencoders with Robust Estimation. *Journal of Information Processing*, 27, pp.335-339.

Ramakrishnan, J., Shaabani, E., Li, C. and Sustik, M.A., 2019. Anomaly detection for an e-commerce pricing system. arXiv preprint arXiv:1902.09566.
Klaeger, T., Schult, A. and Oehm, L., 2019. Using anomaly detection to support classification of fast running (packaging) processes. arXiv preprint arXiv:1906.02473.

Krishnan, S. and Wu, E., 2019. AlphaClean: Automatic Generation of Data Cleaning Pipelines. arXiv preprint arXiv:1904.11827.

Kumar Das, S., Kumar Mishra, A. and Roy, P., 2019. Automatic Diabetes Prediction Using Tree Based Ensemble Learners. *International Journal of Computational Intelligence & IoT*, 2(2).

Li, Y., Zha, D., Zou, N. and Hu, X., 2019. PyODDS: An End-to-End Outlier Detection System. arXiv preprint arXiv:1910.02575.

Ramakrishnan, J., Shaabani, E., Li, C. and Sustik, M.A., 2019. Anomaly Detection for an E-commerce Pricing System. arXiv preprint arXiv:1902.09566.

Trinh, H.D., Giupponi, L. and Dini, P., 2019. Urban Anomaly Detection by processing Mobile Traffic Traces with LSTM Neural Networks. *IEEE International Conference on Sensing, Communication and Networking (IEEE SECON)*.

Wan, C., Li, Z. and Zhao, Y., 2019. SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula. arXiv preprint arXiv:1904.07998.

Wang, H., Bah, M.J. and Hammad, M., 2019. Progress in Outlier Detection Techniques: A Survey. *IEEE Access*, 7, pp.107964-108000.

Weng, Y., Zhang, N. and Xia, C., 2019. Multi-Agent-Based Unsupervised Detection of Energy Consumption Anomalies on Smart Campus. *IEEE Access*, 7, pp.2169-2178.

Zhao, Y., Hryniewicki, M.K., Nasrullah, Z., and Li, Z., 2019. LSCP: Locally Selective Combination in Parallel Outlier Ensembles. *SIAM International Conference on Data Mining (SDM)*, SIAM.
16 changes: 14 additions & 2 deletions docs/zreferences.bib
@@ -200,9 +200,12 @@ @article{liu2019generative
}

@article{zhao2019pyod,
title={PyOD: A Python Toolbox for Scalable Outlier Detection},
title={PyOD: A python toolbox for scalable outlier detection},
author={Zhao, Yue and Nasrullah, Zain and Li, Zheng},
journal={arXiv preprint arXiv:1901.01588},
journal={Journal of Machine Learning Research},
volume={20},
number={96},
pages={1--7},
year={2019}
}

@@ -256,4 +259,13 @@ @inproceedings{kriegel2009outlier
pages={831--838},
year={2009},
organization={Springer}
}

@inproceedings{li2019mad,
title={MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks},
author={Li, Dan and Chen, Dacheng and Jin, Baihong and Shi, Lei and Goh, Jonathan and Ng, See-Kiong},
booktitle={International Conference on Artificial Neural Networks},
pages={703--716},
year={2019},
organization={Springer}
}
29 changes: 28 additions & 1 deletion pyod/models/iforest.py
@@ -17,7 +17,34 @@
from ..utils.utility import _sklearn_version_20


# TODO: behavior of Isolation Forest will change in sklearn 0.22, to update.
# TODO: behavior of Isolation Forest will change in sklearn 0.22. See below.
# In 0.22, scikit-learn will start shifting decision_function values by an
# offset so that values below zero indicate outliers. In other words, it is
# a constant shift, which SHOULD NOT affect the result of PyOD at all, as
# the order is still preserved.

# Behaviour of the decision_function which can be either ‘old’ or ‘new’.
# Passing behaviour='new' makes the decision_function change to match other
# anomaly detection algorithm API which will be the default behaviour in the
# future. As explained in details in the offset_ attribute documentation,
# the decision_function becomes dependent on the contamination parameter,
# in such a way that 0 becomes its natural threshold to detect outliers.

# offset_ : float
# Offset used to define the decision function from the raw scores.
# We have the relation: decision_function = score_samples - offset_.
# Assuming behaviour == ‘new’, offset_ is defined as follows.
# When the contamination parameter is set to “auto”,
# the offset is equal to -0.5 as the scores of inliers are close to 0 and the
# scores of outliers are close to -1. When a contamination parameter different
# than “auto” is provided, the offset is defined in such a way we obtain the
# expected number of outliers (samples with decision function < 0) in training.
# Assuming the behaviour parameter is set to ‘old’,
# we always have offset_ = -0.5, making the decision function independent from
# the contamination parameter.

# check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html for more information
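
The claim in the comments above, that subtracting a constant offset leaves the ranking of samples unchanged, can be checked with a small standalone sketch. It does not call scikit-learn or PyOD; the score values, both offsets, and the helper names `shift_scores` and `ranking` are made up for illustration.

```python
def shift_scores(score_samples, offset):
    """Apply decision_function = score_samples - offset_ elementwise."""
    return [s - offset for s in score_samples]


def ranking(scores):
    """Return sample indices ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Made-up raw scores; lower means more anomalous.
raw_scores = [-0.72, -0.41, -0.55, -0.38, -0.90]

# 'old' behaviour: the offset is always -0.5.
old_style = shift_scores(raw_scores, -0.5)

# 'new' behaviour: the offset depends on contamination; -0.6 is hypothetical.
new_style = shift_scores(raw_scores, -0.6)

# A constant shift changes which values fall below zero, not their order.
same_order = ranking(raw_scores) == ranking(old_style) == ranking(new_style)
```

Only the position of the zero threshold moves between the two behaviours, so any logic that relies on the relative order of scores should be unaffected.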


class IForest(BaseDetector):
"""Wrapper of scikit-learn Isolation Forest with more functionalities.
16 changes: 15 additions & 1 deletion pyod/models/knn.py
@@ -6,6 +6,8 @@
from __future__ import division
from __future__ import print_function

from warnings import warn

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import BallTree
@@ -14,6 +16,9 @@

from .base import BaseDetector

# TODO: algorithm parameter is deprecated and will be removed in 0.7.6.
# Warning has been turned on.
# TODO: since Ball_tree is used by default, may introduce its parameters.

class KNN(BaseDetector):
# noinspection PyPep8
@@ -62,8 +67,12 @@ class KNN(BaseDetector):
Note: fitting on sparse input will override the setting of
this parameter, using brute force.
.. deprecated:: 0.7.4
``algorithm`` is deprecated in PyOD 0.7.4 and will be removed
in 0.7.6. BallTree will be used for consistency.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or KDTree. This can affect the
Leaf size passed to BallTree. This can affect the
speed of the construction and query, as well as the memory
required to store the tree. The optimal value depends on the
nature of the problem.
Expand Down Expand Up @@ -144,6 +153,11 @@ def __init__(self, contamination=0.1, n_neighbors=5, method='largest',
self.metric_params = metric_params
self.n_jobs = n_jobs

if self.algorithm != 'auto' and self.algorithm != 'ball_tree':
warn('algorithm parameter is deprecated and will be removed '
'in version 0.7.6. By default, ball_tree will be used.',
FutureWarning)

self.neigh_ = NearestNeighbors(n_neighbors=self.n_neighbors,
radius=self.radius,
algorithm=self.algorithm,
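
The deprecation check added to `KNN.__init__` in this hunk can be exercised in isolation. The following is a minimal sketch that uses only the standard `warnings` module; `check_algorithm` and `warning_categories` are hypothetical helpers written for this illustration and are not part of PyOD.

```python
import warnings


def check_algorithm(algorithm):
    """Hypothetical helper mirroring the check shown in KNN.__init__."""
    if algorithm not in ('auto', 'ball_tree'):
        warnings.warn('algorithm parameter is deprecated and will be removed '
                      'in version 0.7.6. By default, ball_tree will be used.',
                      FutureWarning)
    return algorithm


def warning_categories(algorithm):
    """Collect the categories of warnings raised for a given value."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even if the same one was raised earlier.
        warnings.simplefilter('always')
        check_algorithm(algorithm)
    return [w.category for w in caught]


kd_tree_warnings = warning_categories('kd_tree')
auto_warnings = warning_categories('auto')
```

With this check, only values other than `'auto'` and `'ball_tree'` trigger the `FutureWarning`, so callers relying on the defaults should see no change.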
30 changes: 21 additions & 9 deletions pyod/models/lscp.py
@@ -296,18 +296,30 @@ def _get_local_region(self, X_test_norm):
"Local max features greater than 1.0, reducing to 1.0")
self.local_max_features = 1.0

if self.X_train_norm_.shape[1] * self.local_min_features < 1:
warnings.warn(
"Local min features smaller than 1, increasing to 1.0")
self.local_min_features = 1.0

# perform multiple iterations
for _ in range(self.local_region_iterations):

# randomly generate feature subspaces
features = generate_bagging_indices(
self.random_state,
bootstrap_features=False,
n_features=self.X_train_norm_.shape[1],
min_features=int(
self.X_train_norm_.shape[1] * self.local_min_features),
max_features=int(
self.X_train_norm_.shape[1] * self.local_max_features))
# if min and max are the same, then use all features
if self.local_max_features == self.local_min_features:
features = range(0, self.X_train_norm_.shape[1])
warnings.warn("Local min features equals local max features; "
"use all features instead.")

else:
# randomly generate feature subspaces
features = generate_bagging_indices(
self.random_state,
bootstrap_features=False,
n_features=self.X_train_norm_.shape[1],
min_features=int(
self.X_train_norm_.shape[1] * self.local_min_features),
max_features=int(
self.X_train_norm_.shape[1] * self.local_max_features))

# build KDTree out of training subspace
tree = KDTree(self.X_train_norm_[:, features])
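
To see why this hunk fixes the one-dimensional case, the bound handling can be sketched as a standalone function. `feature_selection` is a hypothetical name, and the condition for clamping the maximum is an assumption, since the hunk shows only the warning text for that branch.

```python
def feature_selection(n_features, local_min_features, local_max_features):
    """Hypothetical sketch of the feature-bound logic in _get_local_region.

    Returns a list of all feature indices when min and max fractions are
    equal, otherwise the (min_features, max_features) pair that would be
    passed to the random subspace generator.
    """
    # Assumed condition: only the warning message is visible in the diff.
    if local_max_features > 1.0:
        local_max_features = 1.0

    # Added in this commit: avoid asking for fewer than one feature.
    if n_features * local_min_features < 1:
        local_min_features = 1.0

    # Added in this commit: equal bounds leave nothing to sample from.
    if local_max_features == local_min_features:
        return list(range(n_features))

    return (int(n_features * local_min_features),
            int(n_features * local_max_features))


one_dimensional = feature_selection(1, 0.5, 1.0)
ten_dimensional = feature_selection(10, 0.5, 1.0)
```

For a single feature and a minimum fraction of 0.5, the old code would have requested `int(0.5) == 0` features as the lower bound; the new checks raise the fraction to 1.0 and then fall back to using every feature.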
11 changes: 0 additions & 11 deletions pyod/models/sklearn_base.py
@@ -29,17 +29,6 @@ def _get_n_jobs(n_jobs):
-------
n_jobs : int
The actual number of jobs as positive integer.
Examples
--------
>>> from sklearn.utils import _get_n_jobs
>>> _get_n_jobs(4)
4
>>> jobs = _get_n_jobs(-2)
>>> assert jobs == max(cpu_count() - 1, 1)
>>> _get_n_jobs(0)
Traceback (most recent call last):
...
ValueError: Parameter n_jobs == 0 has no meaning.
"""
if n_jobs < 0:
return max(cpu_count() + 1 + n_jobs, 1)
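
Since the docstring example was removed, the mapping from `n_jobs` to a positive job count can be illustrated with a deterministic sketch. `resolve_n_jobs` is a hypothetical stand-in that takes the CPU count as an argument instead of calling `cpu_count()`. Only the negative branch is visible in the hunk; the zero and positive branches are assumptions inferred from the removed docstring.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Hypothetical, deterministic stand-in for _get_n_jobs."""
    if n_jobs < 0:
        # Visible in the diff: -1 means all CPUs, -2 all but one, and so on,
        # never dropping below one job.
        return max(n_cpus + 1 + n_jobs, 1)
    if n_jobs == 0:
        # Assumed from the removed docstring example.
        raise ValueError('Parameter n_jobs == 0 has no meaning.')
    # Assumed: positive values are returned unchanged.
    return n_jobs


all_cpus = resolve_n_jobs(-1, 8)
all_but_one = resolve_n_jobs(-2, 8)
```

Passing the CPU count explicitly keeps the example reproducible across machines; the real function reads it from the system.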