The problem of RRCF training data to get the model #78

Zhoulinfeng0510 · 2020-09-18T12:11:55Z

Can RRCF obtain a model from the training set data, and then use this model to detect anomalies in the new data stream?

mdbartos · 2020-09-18T22:53:40Z

Yes. In this case you would:

Construct a forest from a fixed training set
For each new point in the data stream:
- Insert the new point into each tree
- Compute the codisp score of the new point for each tree
- Delete the new point from each tree
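The per-point steps above can be sketched roughly as follows. This is a minimal sketch, not the library's own helper: `score_point` and the `"new"` label are hypothetical names, and averaging the per-tree scores is one common way to combine them. It assumes each tree exposes `insert_point`, `codisp`, and `forget_point`, as `rrcf.RCTree` does.

```python
def score_point(forest, point, index="new"):
    """Average CoDisp of `point` across a trained forest, leaving the forest unchanged.

    `index` is a temporary label (hypothetical choice); it must not collide
    with any label already stored in the trees.
    """
    scores = []
    for tree in forest:
        # Temporarily insert the new point under the temporary label
        tree.insert_point(point, index=index)
        # Score the point in this tree
        scores.append(tree.codisp(index))
        # Remove the point again so the trained tree is restored
        tree.forget_point(index)
    # Combine the per-tree scores (assumption: a simple mean)
    return sum(scores) / len(scores)
```

With a forest built from the training set, `score_point(forest, x_new)` would be called once per incoming point. Because every point is deleted after scoring, the trees keep reflecting only the training data.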

You can also use a similar approach for classification:
https://klabum.github.io/rrcf/classification.html

Zhoulinfeng0510 · 2020-09-21T10:59:38Z

Yep! I'd like to know more about how to obtain such a model. My current understanding is that I should use the to_dict function in the API. Is this correct? If so, could you please share some specific code? Thank you very much for your reply.

mdbartos · 2020-09-21T19:55:32Z

This should work:

Train model (same example as in README)

import numpy as np
import pandas as pd
import rrcf

# Set parameters
np.random.seed(0)
n = 2010
d = 3
num_trees = 10
tree_size = 10

# Generate data
X = np.zeros((n, d))
X[:1000,0] = 5
X[1000:2000,0] = -5
X += 0.01*np.random.randn(*X.shape)

# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=(n // tree_size, tree_size),
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)

Save forest to json file

# Write learned model to json file
import json

# Convert forest to list of dictionaries
out_json = [tree.to_dict() for tree in forest]

# Write forest to file
with open('forest.json', 'w') as outfile:
    json.dump(out_json, outfile)

Read forest from json file

# Read json file into new forest
with open('forest.json', 'r') as infile:
    forest_obj = json.load(infile)
    
new_forest = []
for tree_obj in forest_obj:
    tree = rrcf.RCTree.from_dict(tree_obj)
    new_forest.append(tree)

Compare:

>>> forest[0]

>>> 
─+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)

>>> new_forest[0]

>>>
─+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)

Zhoulinfeng0510 · 2020-09-28T12:48:28Z

Okay, I think I now understand how RRCF works. Thank you very much! :)
After further research, I found another problem: for multi-dimensional streaming data, calculating codisp is a problem. I used shingle to create a sliding window. The resulting data has shape m x n, but the insert_point function only accepts 1 x d data.
Does rrcf have a better way to calculate anomaly scores for multidimensional, sliding-window data?

mdbartos · 2020-09-30T21:56:08Z

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.
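As a rough illustration of that layout, the sketch below flattens a sliding window over a (time x variables) array into vectors of length nm, ordered variable by variable as described above. The function name `make_shingles` is hypothetical and not part of rrcf; it only needs NumPy.

```python
import numpy as np

def make_shingles(X, n):
    """Return one flattened shingle per window position.

    X : array of shape (T, m), T time steps of m variables
    n : shingle size (number of consecutive time steps per window)
    Row i is [x_1(t_i..t_i+n-1), x_2(t_i..t_i+n-1), ..., x_m(t_i..t_i+n-1)].
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        # Treat a single series as one variable
        X = X[:, None]
    T, m = X.shape
    if not 1 <= n <= T:
        raise ValueError("shingle size must be between 1 and the series length")
    # Transpose each (n, m) window to (m, n) so each variable's values stay contiguous
    return np.array([X[i:i + n].T.ravel() for i in range(T - n + 1)])

# Example: 4 time steps, 2 variables, shingle size 3
shingles = make_shingles(np.arange(8).reshape(4, 2), n=3)
print(shingles.shape)  # (2, 6)
```

Each row of the result can then be passed to `insert_point` as a single point of dimension nm.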

Zhoulinfeng0510 · 2020-10-15T07:01:51Z

Thank you very much for your reply; I have solved the problem above. However, I have run into the following problem when using RRCF. In Figure 1, a segment in the middle of the data (orange line) is clearly anomalous. However, in the second figure, the highest anomaly score within that segment is only 0.25, and later segments that show little anomaly also occasionally reach 0.25. This confuses me.

yasirroni · 2020-10-26T14:35:52Z

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.

This should be added to the docs example (I didn't see it there; either I missed it or it isn't documented).
