timeDecay & High-dimensional shingling #58

Open
orenefraty opened this issue Jun 4, 2019 · 4 comments


orenefraty commented Jun 4, 2019

Does the implementation support timeDecay? Meaning, how much of the recent past is considered when computing an anomaly score?
Same question about an attribution score (how much each dimension contributed to the anomaly, where the sum of all attributions equals the record's anomaly score).

Also, regarding examples - can you please add an example of a high-dimensional time series with shingling?

Thank you!
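On the timeDecay question, one common baseline (a minimal sketch, not a built-in decay mechanism of the library) is to run rrcf in streaming mode and cap each tree at a fixed size, forgetting the oldest point on every insertion; this gives a hard sliding window over the recent past rather than a smooth decay. The stream, window length, and injected anomaly below are placeholder assumptions.

```python
import numpy as np
import rrcf

# Placeholder stream with one injected anomaly (assumption; use your own data)
np.random.seed(0)
stream = np.random.randn(1000)
stream[600] += 8

num_trees = 40
window_size = 256  # how much of the recent past each tree is allowed to see

forest = [rrcf.RCTree() for _ in range(num_trees)]
avg_codisp = {}

for index, point in enumerate(stream):
    for tree in forest:
        # Once a tree holds `window_size` points, forget the oldest one,
        # so every tree is a sliding window over the most recent history
        if len(tree.leaves) >= window_size:
            tree.forget_point(index - window_size)
        tree.insert_point(point, index=index)
        avg_codisp[index] = avg_codisp.get(index, 0) + tree.codisp(index) / num_trees
```

Shrinking `window_size` weights the recent past more heavily, but it is a hard cutoff rather than a gradual decay; the Gaussian-sampling batch approach below gives a smoother version of the same idea.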

@nerooooooz

Same question on timeDecay... I also wonder whether this works well for trending data: in my test on trending data (with no anomalies), there were a lot of points with high CoDisp...

mdbartos (Member) commented Jul 3, 2019

There are a lot of different ways one could implement time decay. Here's one way using batch mode: a series of rolling Gaussian PDFs defines the sampling probability for each batch of trees.

import numpy as np
import pandas as pd
import scipy.stats
import rrcf
from matplotlib import pyplot as plt
import seaborn as sns

# Read data
taxi = pd.read_csv('../resources/nyc_taxi.csv', index_col=0)
taxi.index = pd.to_datetime(taxi.index)
data = taxi['value'].astype(float).values
n = data.shape[0]

# Set tree parameters
num_trees = 200
shingle_size = 48
tree_size = 1000

# Use the "shingle" generator to create rolling window
points = rrcf.shingle(data, size=shingle_size)
points = np.vstack([point for point in points])
num_points = points.shape[0]

# Chunk data into segments
num_chunks = 40
chunk_size = num_points // num_chunks
chunks = (chunk * chunk_size for chunk in range(num_chunks))

forest = []
pdfs = []
for chunk in chunks:
    # Create probability density function for sample
    g = scipy.stats.norm(loc=chunk, scale=400).pdf(np.linspace(0, num_points, num_points))
    g *= (1/g.sum())
    # Sample data with probability density function defined by g
    ixs = np.random.choice(num_points, size=(num_points // tree_size, tree_size), 
                                             replace=False, p=g)
    # Create trees from sampled points
    trees = [rrcf.RCTree(points[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)
    # Store pdf for plotting purposes
    pdfs.append(g)
    
# Compute CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(num_points))
index = np.zeros(num_points)
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf)
                        for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
avg_codisp.index = taxi.iloc[(shingle_size - 1):].index

Plotting:

# Plot results
fig, ax = plt.subplots(3, figsize=(10, 9))
(taxi['value'] / 1000).plot(ax=ax[0], color='0.5', alpha=0.8)
avg_codisp.plot(ax=ax[2], color='#E8685D', alpha=0.8, label='RRCF')
for pdf in pdfs[1:]:
    pd.Series(pdf, index=avg_codisp.index).plot(ax=ax[1], c='0.6')    
ax[0].set_title('Taxi passenger timeseries')
ax[1].set_title('Rolling probability density functions')
ax[2].set_title('Anomaly score')
ax[0].set_ylabel('Thousands of passengers')
ax[1].set_ylabel('p(x)')
ax[2].set_ylabel('CoDisp')
ax[0].set_xlabel('')
ax[1].set_xlabel('')
ax[2].set_xlabel('')
plt.tight_layout()
plt.savefig('decay_example.png')

[decay_example.png: taxi passenger timeseries, rolling probability density functions, and anomaly score]


vidbap commented Feb 18, 2020

Awesome example, thank you @mdbartos.
#1. Doesn't the shingle size already determine the time decay, i.e., how much past data is used, as defined by the rolling window? If so, why do the batching proposed here, and what is the advantage of sampling on top of it?

#2. Attribution score - I would also be interested in adding this as a feature. AWS's random cut forest implementation in Kinesis Data Analytics has an "explanation" output that does exactly this; however, we want to run it in our own datacenter to track operations anomalies.

Best regards.


Aubindb commented Mar 6, 2020

Hello,

I would also be grateful for a high-dimensional example. I managed to adapt the shingling to get multidimensional rolling windows, but I'm struggling to adapt the rest of the batch example.

Thank you!!
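Not from the maintainers, but here is a minimal sketch of one way to shingle a multidimensional series and push it through the same batch-mode forest as above: each point becomes a flattened window of `shingle_size` consecutive rows (so `shingle_size * d` features). The synthetic data and tree parameters are placeholder assumptions.

```python
import numpy as np
import rrcf

# Placeholder multivariate series: T timesteps, d dimensions (assumption; use your own data)
np.random.seed(0)
T, d = 2000, 3
data = np.random.randn(T, d)
data[1500] += 6  # injected anomaly across all dimensions

shingle_size = 8
num_trees = 100
tree_size = 256

# Multidimensional shingling: flatten each window of `shingle_size` rows
# into a single point with shingle_size * d features
num_points = T - shingle_size + 1
points = np.vstack([data[i:i + shingle_size].ravel() for i in range(num_points)])

# Build a batch-mode forest by sampling `tree_size` shingles per tree
forest = []
while len(forest) < num_trees:
    ixs = np.random.choice(num_points, size=(num_points // tree_size, tree_size),
                           replace=False)
    forest.extend(rrcf.RCTree(points[ix], index_labels=ix) for ix in ixs)

# Average CoDisp over all trees containing each shingle
avg_codisp = np.zeros(num_points)
counts = np.zeros(num_points)
for tree in forest:
    for leaf in tree.leaves:
        avg_codisp[leaf] += tree.codisp(leaf)
        counts[leaf] += 1
avg_codisp /= np.maximum(counts, 1)
# avg_codisp[i] scores the window covering rows i .. i + shingle_size - 1
```

If the dimensions are on very different scales, it may help to standardize each column before shingling, since cuts are made in the raw feature space.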
