timeDecay & High-dimensional shingling #58

Open
orenefraty opened this issue Jun 4, 2019 · 4 comments


orenefraty commented Jun 4, 2019

Does the implementation support timeDecay? Meaning, how much of the recent past is considered when computing an anomaly score?
Same question about an attribution score (how much each dimension contributed to the anomaly, where the sum of all attributions equals the record's anomaly score).

Also, regarding examples - can you please add an example of a high-dimensional time series with shingling?

Thank you!
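On the timeDecay question, one common baseline (a minimal sketch, not a built-in decay mechanism of the library) is to run rrcf in streaming mode and cap each tree at a fixed size, forgetting the oldest point on every insertion; this gives a hard sliding window over the recent past rather than a smooth decay. The stream, window length, and injected anomaly below are placeholder assumptions.

```python
import numpy as np
import rrcf

# Placeholder stream with one injected anomaly (assumption; use your own data)
np.random.seed(0)
stream = np.random.randn(1000)
stream[600] += 8

num_trees = 40
window_size = 256  # how much of the recent past each tree is allowed to see

forest = [rrcf.RCTree() for _ in range(num_trees)]
avg_codisp = {}

for index, point in enumerate(stream):
    for tree in forest:
        # Once a tree holds `window_size` points, forget the oldest one,
        # so every tree is a sliding window over the most recent history
        if len(tree.leaves) >= window_size:
            tree.forget_point(index - window_size)
        tree.insert_point(point, index=index)
        avg_codisp[index] = avg_codisp.get(index, 0) + tree.codisp(index) / num_trees
```

Shrinking `window_size` weights the recent past more heavily, but it is a hard cutoff rather than a gradual decay; the Gaussian-sampling batch approach below gives a smoother version of the same idea.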

@nerooooooz

Same question on timeDecay... I also wonder whether this works well for trending data: in my test on trending data (with no anomalies), there were a lot of points with high CoDisp...

mdbartos (Member) commented Jul 3, 2019

There are a lot of different ways one could implement time decay. Here's one way using batch mode: a series of rolling Gaussian PDFs defines the sampling probability for each batch of trees.

import numpy as np
import pandas as pd
import scipy.stats
import rrcf
from matplotlib import pyplot as plt
import seaborn as sns

# Read data
taxi = pd.read_csv('../resources/nyc_taxi.csv', index_col=0)
taxi.index = pd.to_datetime(taxi.index)
data = taxi['value'].astype(float).values
n = data.shape[0]

# Set tree parameters
num_trees = 200
shingle_size = 48
tree_size = 1000

# Use the "shingle" generator to create rolling window
points = rrcf.shingle(data, size=shingle_size)
points = np.vstack([point for point in points])
num_points = points.shape[0]

# Chunk data into segments
num_chunks = 40
chunk_size = num_points // num_chunks
chunks = (chunk * chunk_size for chunk in range(num_chunks))

forest = []
pdfs = []
for chunk in chunks:
    # Create probability density function for sample
    g = scipy.stats.norm(loc=chunk, scale=400).pdf(np.linspace(0, num_points, num_points))
    g *= (1/g.sum())
    # Sample data with probability density function defined by g
    ixs = np.random.choice(num_points, size=(num_points // tree_size, tree_size), 
                                             replace=False, p=g)
    # Create trees from sampled points
    trees = [rrcf.RCTree(points[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)
    # Store pdf for plotting purposes
    pdfs.append(g)
    
# Compute CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(num_points))
index = np.zeros(num_points)
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf)
                        for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
avg_codisp.index = taxi.iloc[(shingle_size - 1):].index

Plotting:

# Plot results
fig, ax = plt.subplots(3, figsize=(10, 9))
(taxi['value'] / 1000).plot(ax=ax[0], color='0.5', alpha=0.8)
avg_codisp.plot(ax=ax[2], color='#E8685D', alpha=0.8, label='RRCF')
for pdf in pdfs[1:]:
    pd.Series(pdf, index=avg_codisp.index).plot(ax=ax[1], c='0.6')    
ax[0].set_title('Taxi passenger timeseries')
ax[1].set_title('Rolling probability density functions')
ax[2].set_title('Anomaly score')
ax[0].set_ylabel('Thousands of passengers')
ax[1].set_ylabel('p(x)')
ax[2].set_ylabel('CoDisp')
ax[0].set_xlabel('')
ax[1].set_xlabel('')
ax[2].set_xlabel('')
plt.tight_layout()
plt.savefig('decay_example.png')

[decay_example.png: taxi passenger timeseries, rolling probability density functions, and anomaly score]


vidbap commented Feb 18, 2020

Awesome example, thank you @mdbartos.
#1. Doesn't the shingle size already determine the time decay, i.e., how much past data is used, as defined by the rolling window? If so, why do the batching proposed here, and what is the advantage of sampling on top of it?

#2. Attribution score - I would also be interested in adding this as a feature. AWS's random cut forest implementation in Kinesis Data Analytics has an "explanation" output that does exactly this; however, we want to run it in our own datacenter to track operations anomalies.

Best regards.


Aubindb commented Mar 6, 2020

Hello,

I would also be grateful for a high-dimensional example. I managed to adapt the shingling to get multidimensional rolling windows, but I'm struggling to adapt the rest of the batch example.

Thank you!!
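Not from the maintainers, but here is a minimal sketch of one way to shingle a multidimensional series and push it through the same batch-mode forest as above: each point becomes a flattened window of `shingle_size` consecutive rows (so `shingle_size * d` features). The synthetic data and tree parameters are placeholder assumptions.

```python
import numpy as np
import rrcf

# Placeholder multivariate series: T timesteps, d dimensions (assumption; use your own data)
np.random.seed(0)
T, d = 2000, 3
data = np.random.randn(T, d)
data[1500] += 6  # injected anomaly across all dimensions

shingle_size = 8
num_trees = 100
tree_size = 256

# Multidimensional shingling: flatten each window of `shingle_size` rows
# into a single point with shingle_size * d features
num_points = T - shingle_size + 1
points = np.vstack([data[i:i + shingle_size].ravel() for i in range(num_points)])

# Build a batch-mode forest by sampling `tree_size` shingles per tree
forest = []
while len(forest) < num_trees:
    ixs = np.random.choice(num_points, size=(num_points // tree_size, tree_size),
                           replace=False)
    forest.extend(rrcf.RCTree(points[ix], index_labels=ix) for ix in ixs)

# Average CoDisp over all trees containing each shingle
avg_codisp = np.zeros(num_points)
counts = np.zeros(num_points)
for tree in forest:
    for leaf in tree.leaves:
        avg_codisp[leaf] += tree.codisp(leaf)
        counts[leaf] += 1
avg_codisp /= np.maximum(counts, 1)
# avg_codisp[i] scores the window covering rows i .. i + shingle_size - 1
```

If the dimensions are on very different scales, it may help to standardize each column before shingling, since cuts are made in the raw feature space.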
