What is the best way of running an aggregation over a subset of the file? #333

Answered by Bluetopia
Bluetopia asked this question in Q&A

So if I understand the suggestion correctly, it would be:

  • Instantiate the aggregation that performs the date filtering, with the start/end parameters specified.
  • Load the aggregation using GCToolKit.loadAggregation.
  • Analyze the file using GCToolKit.analyze().
  • Retrieve the result using jvm.getAggregation() (see the sketch below).
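
A minimal sketch of those steps, assuming a hypothetical, user-defined DateRangeSummary aggregation whose constructor takes the start/end window (its Aggregation/Aggregator definition is omitted here). Note that jvm.getAggregation() is written as returning the aggregation directly; some GCToolKit versions return the result wrapped in an Optional instead:

```java
import java.nio.file.Path;
import java.time.ZonedDateTime;

import com.microsoft.gctoolkit.GCToolKit;
import com.microsoft.gctoolkit.io.SingleGCLogFile;
import com.microsoft.gctoolkit.jvm.JavaVirtualMachine;

public class DateRangeExample {
    public static void main(String[] args) throws Exception {
        // 1. Instantiate the aggregation with the date window baked in.
        //    DateRangeSummary is a hypothetical user-defined Aggregation (definition omitted).
        DateRangeSummary summary = new DateRangeSummary(
                ZonedDateTime.parse("2023-01-01T00:00:00Z"),
                ZonedDateTime.parse("2023-01-02T00:00:00Z"));

        // 2. Register the instance with the toolkit.
        GCToolKit gcToolKit = new GCToolKit();
        gcToolKit.loadAggregation(summary);

        // 3. Analyze the log; every loaded aggregation sees the event stream during this single pass.
        JavaVirtualMachine jvm = gcToolKit.analyze(new SingleGCLogFile(Path.of("gc.log")));

        // 4. Pull the populated aggregation back out of the analysis result.
        DateRangeSummary result = jvm.getAggregation(DateRangeSummary.class);
        System.out.println(result);
    }
}
```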

If this is the case then I may have misunderstood the way this library is used. I thought I'd be able to "replay" the full event sequence against an aggregator, such that I could perform the same operation against a smaller data set. Rather, it seems that all the aggregations get run once during GCToolKit.analyze(), so if I want to re-process data, I'll need to cache the values of interest in the…
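
For the cache-and-reprocess route, a rough sketch of the idea (not anything the library prescribes): have the aggregation record the raw values of interest during the single analyze() pass, then run any later queries over that cached data in memory. PauseSample and the sample values below are hypothetical stand-ins for whatever the aggregation cached:

```java
import java.time.ZonedDateTime;
import java.util.List;

public class CachedReplayExample {

    // Hypothetical shape of the values cached by the aggregation during the
    // single GCToolKit.analyze() pass: one entry per observed pause event.
    record PauseSample(ZonedDateTime timestamp, double durationSeconds) {}

    // "Replay" the cached samples against an arbitrary date window without
    // re-parsing the GC log.
    static double totalPauseSeconds(List<PauseSample> cached,
                                    ZonedDateTime start, ZonedDateTime end) {
        return cached.stream()
                .filter(s -> !s.timestamp().isBefore(start) && s.timestamp().isBefore(end))
                .mapToDouble(PauseSample::durationSeconds)
                .sum();
    }

    public static void main(String[] args) {
        List<PauseSample> cached = List.of(
                new PauseSample(ZonedDateTime.parse("2023-01-01T00:10:00Z"), 0.045),
                new PauseSample(ZonedDateTime.parse("2023-01-01T06:30:00Z"), 0.120),
                new PauseSample(ZonedDateTime.parse("2023-01-02T01:00:00Z"), 0.300));

        // Only the first two samples fall inside the window (0.045 s + 0.120 s).
        System.out.println(totalPauseSeconds(cached,
                ZonedDateTime.parse("2023-01-01T00:00:00Z"),
                ZonedDateTime.parse("2023-01-02T00:00:00Z")));
    }
}
```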

Answer selected by karianna