The problem of RRCF training data to get the model #78

Open
Zhoulinfeng0510 opened this issue Sep 18, 2020 · 7 comments

Comments

@Zhoulinfeng0510

Can RRCF obtain a model from the training set data, and then use this model to detect anomalies in the new data stream?

@mdbartos
Member

Yes. In this case you would:

  • Construct a forest from a fixed training set
  • For each new point in the data stream:
    • Insert the new point into each tree
    • Compute the codisp score of the new point for each tree
    • Delete the new point from each tree

You can also use a similar approach for classification:
https://klabum.github.io/rrcf/classification.html

@Zhoulinfeng0510
Author

Yep! I want to know more about how to obtain such a model. My current understanding is that I should use the to_dict function in the API. Is this correct? If so, could you please give me a specific code example? Thank you very much for your reply.

@mdbartos
Member

This should work:

Train model (same example as in README)

import numpy as np
import pandas as pd
import rrcf

# Set parameters
np.random.seed(0)
n = 2010
d = 3
num_trees = 10
tree_size = 10

# Generate data
X = np.zeros((n, d))
X[:1000,0] = 5
X[1000:2000,0] = -5
X += 0.01*np.random.randn(*X.shape)

# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=(n // tree_size, tree_size),
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)

Save forest to json file

# Write learned model to json file
import json

# Convert forest to list of dictionaries
out_json = [tree.to_dict() for tree in forest]

# Write forest to file
with open('forest.json', 'w') as outfile:
    json.dump(out_json, outfile)

Read forest from json file

# Read json file into new forest
with open('forest.json', 'r') as infile:
    forest_obj = json.load(infile)
    
new_forest = []
for tree_obj in forest_obj:
    tree = rrcf.RCTree.from_dict(tree_obj)
    new_forest.append(tree)

Compare:

>>> forest[0]

>>>+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)
>>> new_forest[0]

>>>+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)

@Zhoulinfeng0510
Author

Okay, I think I now understand how RRCF works in this case! Thank you very much! :)
After further research, I found another problem: for multi-dimensional streaming data, calculating codisp is a problem. I used shingle to create a sliding window, which produces data of shape m x n, but the insert_point function only accepts 1 x d data.
Does rrcf have a better way to calculate anomaly scores for multidimensional, sliding-window data?

@mdbartos
Member

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.
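
A minimal sketch of the layout described above, assuming the stream is held as a NumPy array with one row per time step and one column per variable. The helper name `make_shingles` is hypothetical and not part of rrcf.

```python
import numpy as np

def make_shingles(data, shingle_size):
    """Turn a (T x m) series into (T - n + 1) points of length n*m each.

    Each output row is laid out variable by variable:
    [x_1(t), ..., x_1(t+n-1), x_2(t), ..., x_2(t+n-1), ..., x_m(t+n-1)]
    """
    data = np.asarray(data)
    if data.ndim == 1:
        data = data[:, None]  # treat a 1-D series as a single variable
    num_steps = data.shape[0]
    if shingle_size > num_steps:
        raise ValueError("shingle_size exceeds the number of time steps")
    return np.stack([
        # window is (n x m); transpose to (m x n), then flatten row by row
        data[t:t + shingle_size].T.ravel()
        for t in range(num_steps - shingle_size + 1)
    ])

# Example: 6 time steps, m = 2 variables, shingle size n = 3
series = np.arange(12).reshape(6, 2)
points = make_shingles(series, 3)
print(points.shape)  # (4, 6)
print(points[0])     # [0 2 4 1 3 5]
```

Each row of `points` is a flat vector of length nm, so it can be passed to `insert_point` one row at a time.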

@Zhoulinfeng0510
Author

Thank you very much for your reply; I have solved the above problem. However, I have run into another problem when using RRCF. In Figure 1, the middle segment of the data (orange line) is clearly abnormal. However, in Figure 2, the highest anomaly score within that segment is only 0.25, and later segments that show little abnormality also occasionally reach 0.25. This confuses me.
Figure_1
Figure_2

@yasirroni

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.

This should be added to the documentation examples (I didn't see it there; either I missed it or it isn't documented).
