I was looking to implement a search described in this article: threathunting-spl/Detecting_Beaconing.md at master · inodee/threathunting-spl · GitHub
TL;DR: The linked page shows a search that takes a data source such as firewall connection logs and identifies connections that are suspiciously uniform in their intervals. It uses both streamstats and eventstats to calculate the standard deviation of the time intervals between connections for each unique combination of src and dst IPs.
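For context, the linked search is roughly of this shape (a sketch, not the exact search from the repo; the field names src_ip/dest_ip and the thresholds are assumptions you'd adjust to your data):

```
index=firewall src_ip=* dest_ip=*
| fields _time src_ip dest_ip
| sort 0 src_ip dest_ip _time
| streamstats current=f last(_time) as prev_time by src_ip dest_ip
| eval time_delta = _time - prev_time
| eventstats count avg(time_delta) as avg_delta stdev(time_delta) as stdev_delta by src_ip dest_ip
| where count > 10 AND stdev_delta < 5
```

A low stdev_delta over many connections for the same pair is what flags a candidate beacon.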
The issue is that the data I would be using is enormous - I'm looking to run the above search over 24 hours' worth of data, but it fails due to memory limits. What I have in place so far: I do have an accelerated data model for the filtered firewall data, but I don't know how to combine tstats with the search in the above link. Summary indexing wouldn't work, since I still need the stats calculated over the larger time frame (i.e. running the search every hour against the last hour of data might miss whether a connection truly has a low standard deviation).
Has anyone successfully combined tstats with streamstats/eventstats, or built a search that works around just how resource-intensive this search is?
Well, the whole search is inefficient.
Firstly, the sorting - it's OK with small sets of data, but I can't imagine it being used effectively with several dozen gigabytes of data (and that's what you can easily get from network logs).
sort is a dataset processing command, so it needs all results returned from the indexers to the search head before it can run.
Then you have streamstats by src,url, which again generates huge memory consumption, since you'll certainly have many different URLs and many traffic sources.
Then comes another dataset processing command - eventstats.
Long story short - it's a highly inefficient way of processing network logs, since it needs to load all the events (probably several times) into the memory of a single search head.
You can improve it a bit by not sorting the data and instead calculating diff_time "backwards", but it will still be centralized and still uses eventstats. Put simply, if you want to calculate stats (especially eventstats) over many, many distinct values, it is memory-consuming.
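As a sketch of that "backwards" idea (an assumption-laden example, not the author's exact search): a plain search returns events newest-first, and streamstats tracks each by-group independently, so you can take each event's successor timestamp per pair without a global sort. If you only need one summary row per pair, a distributable stats can also replace eventstats:

```
index=firewall src_ip=* dest_ip=*
| fields _time src_ip dest_ip
| streamstats current=f window=1 last(_time) as next_time by src_ip dest_ip
| eval time_delta = next_time - _time
| stats count avg(time_delta) as avg_delta stdev(time_delta) as stdev_delta by src_ip dest_ip
| where count > 10 AND stdev_delta < 5
```

Because events arrive in descending time order, next_time holds the later event of the pair, so time_delta stays positive. streamstats still runs on the search head, but dropping sort and eventstats removes the two biggest memory consumers.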
Using an accelerated data model can help with search time, but I doubt it will reduce the memory footprint significantly.
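That said, if your accelerated model follows the CIM Network_Traffic schema (an assumption - substitute your own datamodel and field names), a tstats base search can at least push the event retrieval and first aggregation down to the indexers, with only the interval math done on the search head. Rows come back ordered by _time when it's the first split-by field, so no sort is needed before streamstats:

```
| tstats summariesonly=true count from datamodel=Network_Traffic
    where All_Traffic.action=allowed
    by _time span=1s All_Traffic.src All_Traffic.dest
| rename All_Traffic.* as *
| streamstats current=f window=1 last(_time) as prev_time by src dest
| eval time_delta = _time - prev_time
| stats count avg(time_delta) as avg_delta stdev(time_delta) as stdev_delta by src dest
| where count > 10 AND stdev_delta < 5
```

One caveat: span=1s collapses multiple connections for the same pair within one second into a single row, so sub-second beacons would be under-counted; the thresholds here are placeholders to tune.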