Splunk Search

How to calculate concurrency distribution

jgcsco
Path Finder

I have the following event that needs to calculate concurrency distribution:

Event, starttime=yyyy-mm-dd hh:mm:ss, duration=, sourceip=a.b.c.d

| rex "duration=(?<Duration>.*?),"

| eval StartTime=round(strptime(startTime,"%Y-%m-%dT%H:%M:%SZ"),0)
| eval _time=StartTime
| eval increment = mvappend("1","-1")
| mvexpand increment
| eval _time = if(increment==1, _time, _time + Duration)
| sort 0 + _time
| fillnull sourceip value="NULL"
| streamstats sum(increment) as post_concurrency by sourceip
| eval concurrency = if(increment==-1, post_concurrency+1, post_concurrency)
| stats count(concurrency) by concurrency

I want to take a look at the concurrency distribution to find out if it is matching a z-distribution.

It seems to be giving me the data I want, but I would like to get your opinion. I am not sure about the granularity of the concurrency here: is it counted by second? Both StartTime and Duration are down to the second.

Is there a way to make concurrency the x-axis and count(concurrency) the y-axis?

Thanks,

1 Solution

sideview
SplunkTrust

OK. Everything before the last line in your question, | stats count(concurrency) by concurrency, is from our previous question over at http://answers.splunk.com/answers/227393/how-to-use-the-concurrency-command-to-timechart-th.html, so this one is really about looking at a "frequency distribution of concurrency", not the nuts and bolts of how that concurrency is calculated.

Using | stats count by concurrency like you are here will produce results that look like what you want, but it'll be an extremely misleading visualization. The reason is that the rows coming into the stats command are all either the start time or the end time of a call. There is no representation for the time in between these points.

To see why this is a problem, let's look at a specific situation. Let's analyze a time period from 1pm to 2pm, and say we have 5 calls that all start right at 1pm and are each 15 minutes long. Thinking about this, we ultimately want to see a frequency distribution with a lot of concurrency=0 and a fair bit of concurrency=5.

However, let's pipe such a set into | stats count by concurrency. It will give us a distribution, but it turns out that concurrency=0 will have a value of 1, every value of concurrency from 1 to 4 will have a value of 2, and concurrency=5 will have a value of 1. This doesn't match our expectations very well.
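To make that concrete, here is a sketch of the ten rows those 5 calls produce, counting the raw running sum (post_concurrency); the adjusted concurrency field would shift each end row up by one, but the shape of the problem is the same:

_time     increment  post_concurrency
13:00:00      1          1
13:00:00      1          2
13:00:00      1          3
13:00:00      1          4
13:00:00      1          5
13:15:00     -1          4
13:15:00     -1          3
13:15:00     -1          2
13:15:00     -1          1
13:15:00     -1          0

Counting those ten rows gives exactly the distribution above (one row at 0, two each at 1 through 4, one at 5), and nothing at all represents the 45 minutes from 1:15pm to 2pm during which the real concurrency is zero.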

Now consider 5 other calls that start at 2pm but are each only 1 minute long. The stats count by concurrency output for those will be exactly the same as it was for the first set of calls. And if we look at a time period that covers both sets of calls, it'll be the same chart with all the y-axis values doubled. o_O

Instead I think it makes more sense to analyze some discrete unit of time like minutes or seconds, and look at the distribution of concurrency values for each sourceip across all the time periods.

In the search below I'm telling timechart to use not 1 second but rather 15 minutes as our bucket size, but that part is up to you. The search language looks a lot like what we did in the last question, except that at the end we have an extra untable and a final chart command.

... all the stuff before up to 
| eval concurrency = if(increment==-1, post_concurrency+1, post_concurrency)
| timechart span=15min max(concurrency) as max_concurrency last(post_concurrency) as last_concurrency by sourceip limit=20 
| filldown last_concurrency* 
| foreach "max_concurrency: *" [eval <<MATCHSTR>>=coalesce('max_concurrency: <<MATCHSTR>>','last_concurrency: <<MATCHSTR>>')] 
| fields - last_concurrency* max_concurrency*
| untable _time sourceip concurrency
| chart count over concurrency by sourceip

That search above will analyze all the 15-minute periods for each sourceip, give you a frequency distribution for each sourceip value, and graph them all as separate lines on the same frequency distribution chart. The filldown and foreach steps are there so that buckets with no start or end events still inherit the lingering concurrency value from the previous bucket instead of coming out empty.
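That chart is also the answer to your x-axis question: chart count over concurrency by sourceip puts concurrency on the x-axis and the count of 15-minute buckets on the y-axis, with one series per sourceip. Purely to illustrate the shape of the output (the sourceip names and numbers here are made up), the table behind the chart would look something like this:

concurrency   a.b.c.d   e.f.g.h
0             120       96
1             14        22
2             3         5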

If instead you want to just see a single overarching frequency distribution of "per-sourceip concurrency", replace that last chart command with our old friend, {drum fill} | stats count by concurrency.
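To spell that substitution out, only the final line changes, so the tail of the search would end up looking like this:

...
| untable _time sourceip concurrency
| stats count by concurrency

That gives one row per concurrency value, counting sourceip/time-bucket combinations across all sourceips combined.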


jgcsco
Path Finder

Thanks sideview for the detailed information. Will give it a try when I have a chance.
