Splunk Search

Transforming data for normal distribution

variableName
Explorer

I am attempting to take traffic logs over an arbitrary period of time and map the number of accesses, and the times of those accesses, to a normal distribution model. My experience with Splunk is limited, but I am sure there must be a way to do this, right?

My current course of thought is to discretize the data into some span of time, take the count of traffic in each span, and put those counts into buckets. Then use those buckets to map the data to a normal distribution, from which I should be able to detect shifts in the data.

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| timechart span=15min distinct_count(src)

The above search is the baseline of what I have and represents the data that I am working with (obscured, of course). Assume the span (span=time) and the aggregation (dc(field)) are arbitrary; this won't be the only way the search is applied. I've searched many previous threads and have tried using bin in nearly every way I can find. Unfortunately, timechart appears to interfere with bin and returns output that I can't make use of. I have also tried manipulating _time with host/src/dest in intervals of time, but I don't know a way to get that into a variable to use later in the search.

Any help would be appreciated, but I am trying not to use apps outside of "Search and Reporting".

Thanks in advance


ITWhisperer
SplunkTrust

I am unclear what you are trying to achieve, but perhaps you need to break down the timechart command into separate parts so you can see if the data you want can be built from the parts. For example:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min 
| stats distinct_count(src) by _time
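
One caveat worth noting: stats by _time only emits rows for periods that actually contain events, so quiet 15-minute windows silently drop out of the distribution. A sketch of one way to pad the gaps with zero counts, using makecontinuous and fillnull (and the field rename used later in this thread):

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min
| stats distinct_count(src) as distinct_sources_in_period by _time
``` add rows for any empty 15-minute periods, then treat them as zero traffic ```
| makecontinuous _time span=15min
| fillnull value=0 distinct_sources_in_period

Whether zero-traffic periods should count toward the distribution depends on the data, so treat this step as optional.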

 

variableName
Explorer

Using the search you just provided, how can I take the counts of each 15min interval and represent them along a normal distribution curve?


ITWhisperer
SplunkTrust

My understanding of a normal distribution in this context would be the counts plotted against the frequency with which those counts occur. So:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min 
| stats distinct_count(src) as distinct_sources_in_period by _time
| stats count as frequency by distinct_sources_in_period
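
To judge how closely this histogram tracks a bell curve, one optional extension is to overlay the frequency a fitted normal density would predict. This is only a sketch, assuming integer-valued counts (a bin width of 1), and it uses eval's exp, pow, sqrt, and pi functions:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min
| stats distinct_count(src) as distinct_sources_in_period by _time
| eventstats avg(distinct_sources_in_period) as mu stdev(distinct_sources_in_period) as sigma count as n
| stats count as frequency first(mu) as mu first(sigma) as sigma first(n) as n by distinct_sources_in_period
``` expected frequency under a normal fit: n * pdf(x; mu, sigma) * bin width (here 1) ```
| eval expected=round(n * exp(-pow(distinct_sources_in_period - mu, 2) / (2 * pow(sigma, 2))) / (sigma * sqrt(2 * pi())), 1)

Charting frequency against expected gives a quick visual check of how normal the counts really are.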

variableName
Explorer

That's definitely on the right track, thank you so much for helping with this. Is there a way to dynamically group the distinct_sources_in_period? My intent is to make use of standard deviations in the dataset.


ITWhisperer
SplunkTrust

Depending on your range of counts, you can use bin on them and set the span of the bins to an appropriate size:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min 
| stats distinct_count(src) as distinct_sources_in_period by _time
| bin distinct_sources_in_period span=1000
| stats count as frequency by distinct_sources_in_period
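
If the range of counts isn't known up front, bin also accepts a bins option, which caps the number of bins rather than fixing their width; a sketch of the same search with at most eight bins:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min
| stats distinct_count(src) as distinct_sources_in_period by _time
``` let Splunk choose a span that yields at most 8 bins ```
| bin distinct_sources_in_period bins=8
| stats count as frequency by distinct_sources_in_period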

variableName
Explorer

Awesome! Is there a way to get *roughly* eight bins that map to percentages of distinct_sources_in_period? Such as the lowest 0.1%, the next 2.1%, the next 13.6%, the next 34.1%, then the highest 34.1%, the highest 13.6%, the highest 2.1%, and the highest 0.1%?
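
For reference, those eight percentages are exactly the shares a normal distribution assigns to one-standard-deviation bands: about 0.1% beyond 3 sigma, 2.1% between 2 and 3 sigma, 13.6% between 1 and 2 sigma, and 34.1% within 1 sigma, on each side of the mean. A sketch that labels each period with its sigma band (the numeric prefixes are only there so the rows sort in order):

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min
| stats distinct_count(src) as distinct_sources_in_period by _time
| eventstats avg(distinct_sources_in_period) as average stdev(distinct_sources_in_period) as stddev
| eval z=(distinct_sources_in_period - average) / stddev
``` eight one-sigma bands, from below -3 sigma up to above +3 sigma ```
| eval band=case(z<-3, "1: below -3", z<-2, "2: -3 to -2", z<-1, "3: -2 to -1", z<0, "4: -1 to 0", z<1, "5: 0 to 1", z<2, "6: 1 to 2", z<3, "7: 2 to 3", true(), "8: 3 and above")
| stats count by band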


ITWhisperer
SplunkTrust

This sounds like you are trying to force your data into a normal distribution curve; that is, if you could set up the bins in this way, wouldn't you be guaranteed to get a normal distribution curve? ("Lies, damned lies, and statistics")

variableName
Explorer

Essentially yes, I'm trying to make use of a normal distribution curve for some other things I need to do, as well as for data that I need to represent visually. However, I need to keep the search well defined inside the "Search & Reporting" app and not reach out to other apps, which may not export well. To further complicate matters, I have a strong notion that if the normal distribution curve isn't discrete, it will lead to further complications down the road.

In broad terms, I need a way to dynamically set and maintain thresholds for data over a period of time (up to a year) that stays (relatively) accurate at time frames as narrow as fifteen minutes or as broad as a year. The solution I'm looking for also needs to generalize well, so that I can reuse it across several other data sets that are similar in nature but don't use the same fields. For example: destination or host in place of source; total count/avg/etc. in place of distinct count; and probably more that I haven't yet considered (a macro-based sketch of this kind of generalization follows below).

I really appreciate your help in this, thank you for taking the time to work with me.
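
On the reuse requirement: core Splunk search macros (Settings > Advanced search > Search macros, usable from Search & Reporting) accept arguments, so the statistics pipeline can be parameterized over the field, the aggregation, and the span. The macro below is purely illustrative; the name sigma_buckets(3) and its arguments agg, field, and span are assumptions, not something from this thread. The macro body would be:

| bin _time span=$span$
| stats $agg$($field$) as value by _time
| eventstats avg(value) as average stdev(value) as stddev
| eval halfdevs=round((value - average) / (stddev / 2), 0)
| stats count by halfdevs

which could then be invoked after any base search, for example:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") `sigma_buckets(dc, src, 15min)`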


ITWhisperer
SplunkTrust

So you could determine the average and standard deviation, work out how many standard deviations (or half standard deviations in my example, to give more points) your counts are away from the average, and then count these as your buckets.

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min 
| stats distinct_count(src) as distinct_sources_in_period by _time
| eventstats stdev(distinct_sources_in_period) as stddev avg(distinct_sources_in_period) as average 
| eval halfdevs=round((distinct_sources_in_period - average) / (stddev / 2), 0) 
| stats count by halfdevs
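
As a follow-up sketch, the same eventstats figures can drive the dynamic thresholds asked about above: compute each period's z-score and keep only the outliers. The 3-standard-deviation cut-off here is an arbitrary assumption, not something from the thread:

index="*" (host="192.168.0.*" OR dest="192.168.0.*") AND (dest!="198.24.14.0" AND src!="198.24.14.0")
| bin _time span=15min
| stats distinct_count(src) as distinct_sources_in_period by _time
| eventstats avg(distinct_sources_in_period) as average stdev(distinct_sources_in_period) as stddev
| eval zscore=(distinct_sources_in_period - average) / stddev
``` keep only periods more than 3 standard deviations from the mean ```
| where abs(zscore) > 3

Saved as a scheduled search, this would flag periods that cross the threshold without hard-coding the threshold itself.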

 


variableName
Explorer

That works incredibly well! Thank you for all of your help. I'm going to mark this as the solution.
