timeDecay & High-dimensional shingling #58
Same question on timeDecay... and I also wonder whether this approach is suitable for trending data, because in my tests on trending data (with no anomalies) there are a lot of points with high CoDisp...
There are a lot of different ways one could implement time decay. Here's one way using batch mode. I just used a series of rolling Gaussian pdfs to define the sampling frequency.

```python
import numpy as np
import pandas as pd
import scipy.stats
import rrcf
from matplotlib import pyplot as plt
import seaborn as sns

# Read data
taxi = pd.read_csv('../resources/nyc_taxi.csv', index_col=0)
taxi.index = pd.to_datetime(taxi.index)
data = taxi['value'].astype(float).values
n = data.shape[0]

# Set tree parameters
num_trees = 200
shingle_size = 48
tree_size = 1000

# Use the "shingle" generator to create a rolling window
points = rrcf.shingle(data, size=shingle_size)
points = np.vstack([point for point in points])
num_points = points.shape[0]

# Chunk data into segments
num_chunks = 40
chunk_size = num_points // num_chunks
chunks = (chunk * chunk_size for chunk in range(num_chunks))

forest = []
pdfs = []
for chunk in chunks:
    # Create probability density function for sample
    g = scipy.stats.norm(loc=chunk, scale=400).pdf(np.linspace(0, num_points, num_points))
    g *= (1 / g.sum())
    # Sample data with probability density function defined by g
    ixs = np.random.choice(num_points, size=(num_points // tree_size, tree_size),
                           replace=False, p=g)
    # Create trees from sampled points
    trees = [rrcf.RCTree(points[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)
    # Store pdf for plotting purposes
    pdfs.append(g)

# Compute CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(num_points))
index = np.zeros(num_points)
for tree in forest:
    codisp = pd.Series({leaf: tree.codisp(leaf)
                        for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
avg_codisp.index = taxi.iloc[(shingle_size - 1):].index
```

Plotting:

```python
# Plot results
fig, ax = plt.subplots(3, figsize=(10, 9))
(taxi['value'] / 1000).plot(ax=ax[0], color='0.5', alpha=0.8)
avg_codisp.plot(ax=ax[2], color='#E8685D', alpha=0.8, label='RRCF')
for pdf in pdfs[1:]:
    pd.Series(pdf, index=avg_codisp.index).plot(ax=ax[1], c='0.6')
ax[0].set_title('Taxi passenger timeseries')
ax[1].set_title('Rolling probability density functions')
ax[2].set_title('Anomaly score')
ax[0].set_ylabel('Thousands of passengers')
ax[1].set_ylabel('p(x)')
ax[2].set_ylabel('CoDisp')
ax[0].set_xlabel('')
ax[1].set_xlabel('')
ax[2].set_xlabel('')
plt.tight_layout()
plt.savefig('decay_example.png')
```
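For streaming use, a simpler (if cruder) form of time decay is a fixed-size sliding window: once a tree reaches capacity, forget the oldest point before inserting the newest one, so old observations stop influencing CoDisp entirely. Here's a minimal sketch using the streaming insert/forget API (the synthetic stream and parameter values are just placeholders):

```python
import numpy as np
import rrcf

num_trees = 40
shingle_size = 48
tree_size = 256

# Placeholder stream; substitute real observations arriving one at a time
stream = np.random.randn(5000)

forest = [rrcf.RCTree() for _ in range(num_trees)]
avg_codisp = {}

for i, point in enumerate(rrcf.shingle(stream, size=shingle_size)):
    score = 0.0
    for tree in forest:
        # Sliding-window "decay": drop the oldest shingle once the tree is full
        if len(tree.leaves) > tree_size:
            tree.forget_point(i - tree_size)
        tree.insert_point(point, index=i)
        score += tree.codisp(i)
    avg_codisp[i] = score / num_trees
```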
Awesome example, thank you @mdbartos. Regarding #2, the attribution score: I would also be interested in adding this as a feature. AWS has a random cut forest implementation in Kinesis Data Analytics with an "explanation" capability that does exactly this; however, we want to use it in our own datacenter to track operations anomalies. Best regards.
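rrcf doesn't ship an attribution score, but as a stopgap one could approximate per-dimension contributions by ablation: re-score the point with each dimension replaced by a baseline value, and treat the drop in CoDisp as that dimension's share. The helper below is purely hypothetical (`ablation_attribution` is not an rrcf function, and this is not the algorithm behind the Kinesis explanations); it assumes trees built with integer index labels, as in the batch example above.

```python
import numpy as np
import rrcf

def ablation_attribution(forest, x, baseline):
    """Hypothetical helper: leave-one-out attribution for a single point.

    `forest` is a list of rrcf.RCTree with integer index labels,
    `x` is the query point, and `baseline` holds per-dimension neutral
    values (e.g. the column medians of the training points).
    """
    def score(p):
        total = 0.0
        for tree in forest:
            ix = max(tree.leaves) + 1  # temporary unused integer label
            tree.insert_point(p, index=ix)
            total += tree.codisp(ix)
            tree.forget_point(ix)      # leave the tree unchanged
        return total / len(forest)

    full = score(x)
    drops = np.empty(len(x))
    for d in range(len(x)):
        x_abl = np.array(x, dtype=float)
        x_abl[d] = baseline[d]
        # Contribution of dimension d = how much the score falls
        # when that dimension is neutralized
        drops[d] = full - score(x_abl)

    # Clip negatives and rescale so attributions sum to the full score
    drops = np.clip(drops, 0.0, None)
    if drops.sum() > 0:
        drops *= full / drops.sum()
    return drops
```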
Hello, I would also be grateful for an example in high dimensions. I managed to adapt the shingling to get multidimensional rolling windows, but I'm struggling to adapt the rest of the batch example. Thank you!!
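In case it helps, here's a rough sketch of how the batch example might extend to a multivariate series (the data shape, sizes, and variable names are illustrative): flatten each rolling window of shape `(shingle_size, d)` into a single vector, and the tree construction and CoDisp averaging carry over unchanged.

```python
import numpy as np
import rrcf

# Illustrative multivariate series: n observations, d dimensions
n, d = 2000, 3
data = np.random.randn(n, d)

shingle_size = 16
num_trees = 100
tree_size = 256

# Multidimensional shingling: each shingle is a (shingle_size, d) window
# flattened into a vector of length shingle_size * d
points = np.vstack([data[i:i + shingle_size].ravel()
                    for i in range(n - shingle_size + 1)])
num_points = points.shape[0]

# From here the batch example carries over unchanged
forest = []
for _ in range(num_trees):
    ixs = np.random.choice(num_points, size=tree_size, replace=False)
    forest.append(rrcf.RCTree(points[ixs], index_labels=ixs))

avg_codisp = np.zeros(num_points)
counts = np.zeros(num_points)
for tree in forest:
    for leaf in tree.leaves:
        avg_codisp[leaf] += tree.codisp(leaf)
        counts[leaf] += 1
# Points never sampled by any tree keep a score of zero
avg_codisp /= np.maximum(counts, 1)
```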
Does the implementation support time decay?
Meaning, how much of the recent past is considered when computing an anomaly?
Same question about an attribution score (how much each dimension contributed to the anomaly, where the sum of all attributions equals the record's anomaly score).
Also, regarding examples: can you please add an example of a high-dimensional time series with shingling?
Thank you!