Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is the threshold of the Anomaly score dynamic for the Streaming detection? #81

Open
shihyulee opened this issue Nov 16, 2020 · 1 comment

Comments

@shihyulee
Copy link

On the streaming data, is it necessary to design the dynamic threshold of the anomaly score to adapt over time as the input data changes?

Thank you.

@yasirroni
Copy link

yasirroni commented Jan 13, 2021

In my case, static model with static threshold is prefered. Here is the code:

# Set tree parameters
NUM_TREES = 100 # around 100 trees or more is recommended
SHINGLE_SIZE = 10 # number of data treated as single data
NUM_SHINGLE = n_points - SHINGLE_SIZE + 1 # number of shingle
TREE_SIZE = NUM_SHINGLE # number of points in the tree for fifo, minutes in a day are 1440
SPROUT_PROBABILITY = 0.8 # the probability of leaf grow on single tree independently, lower means better generality

TREE_FULL_FLAG = False # flag to mark tree status

# Create a forest of empty trees
forest = [None] * NUM_TREES
for i in range(NUM_TREES):
    tree = rrcf.RCTree()
    forest[i] = tree

# Use the "shingle" generator to create rolling window
points = rrcf.shingle(X, size=SHINGLE_SIZE)

# transpose
def shingle_transpose(shingle):
    for val in shingle:
        yield val.T.flatten()

points = shingle_transpose(points)

# create a dict to store anomaly score of each point
avg_codisp = pd.Series(0.0, index=np.arange(NUM_SHINGLE))

# create a dict to store anomaly score of each point
avg_codisp = pd.Series(0.0, index=np.arange(NUM_SHINGLE))

# for each shingle...
for index, point in enumerate(points):
    # record number of tree that save the shingle
    freq = 0
    # for each tree in the forest...
    for tree in forest:
        # roll dice to determine point become a leave or not
        if np.random.random() <= SPROUT_PROBABILITY:
            # insert the new point into the tree
            tree.insert_point(point, index=index)
            
            # add 1 to freq record
            freq += 1
            
        # if tree already full
        if len(tree.leaves) == TREE_SIZE: 
            TREE_FULL_FLAG = True
            
    # if no tree save current shingle
    if freq == 0:
        # chose any tree
        tree = np.random.choice(forest)

        # insert the new point into the tree
        tree.insert_point(point, index=index)
        
        # if tree already full
        if len(tree.leaves) == TREE_SIZE: # if TREE_SIZE
            TREE_FULL_FLAG = True
    
    # if any tree is above permitted size...
    if TREE_FULL_FLAG:
        break

# Compute codisp
freq = np.zeros(index + 1)
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf) for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(freq, codisp.index.values, 1)

for i in range(index + 1):
    if freq[i] >= 1:
        avg_codisp[i] = avg_codisp[i] / freq[i]
        
# After any tree is above permitted size and if points haven't empty, for each shingle...
for point in points:
    # Continue index counter
    index += 1
    
    codisp = 0
    freq = 0
    for tree in forest:
        # Roll dice to plant or not
        if np.random.random() <= SPROUT_PROBABILITY:
            # Check if leave can sprout
            if len(tree.leaves) == TREE_SIZE:
                # Drop the oldest point (FIFO)
                tree.forget_point(min(tree.leaves.keys()))

            # Insert the new point into the tree
            tree.insert_point(point, index=index)
            
            codisp += tree.codisp(index)
            freq += 1

    if freq == 0: # leaf not sprouted at any tree
        tree = np.random.choice(forest)
        # Check if leave can sprout
        if len(tree.leaves) == TREE_SIZE:
            # Drop the oldest point (FIFO)
            tree.forget_point(min(tree.leaves.keys()))

        # Insert the new point into the tree
        tree.insert_point(point, index=index)

        codisp += tree.codisp(index)
        freq += 1
        
    avg_codisp[index] = codisp / freq

plt.plot(avg_codisp)
plt.show()

Then, I will double (based on your case) the maximum value of train CoDisp as the threshold, and you can use insert and delete the new point in testing or deployment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants