How to calculate concurrency distribution

Path Finder

I have events like the following, and I need to calculate their concurrency distribution:

Event, starttime=yyyy-mm-dd hh:mm:ss, duration=, sourceip=a.b.c.d

| rex "duration=(?<Duration>.*?),"

| eval StartTime=round(strptime(startTime,"%Y-%m-%dT%H:%M:%SZ"),0)
| eval _time=StartTime
| eval increment = mvappend("1","-1")
| mvexpand increment
| eval _time = if(increment==1, _time, _time + Duration)
| sort 0 + _time
| fillnull sourceip value="NULL"
| streamstats sum(increment) as post_concurrency by sourceip
| eval concurrency = if(increment==-1, post_concurrency+1, post_concurrency)
| stats count(concurrency) by concurrency

I want to look at the concurrency distribution to find out whether it matches a z-distribution.

It seems to be giving me the data I want, but I would like to get your opinion. I am not sure about the granularity of concurrency here: is it counted per second? Both StartTime and Duration are precise to the second.

Is there a way to put concurrency on the x-axis and count(concurrency) on the y-axis?

Thanks,

1 Solution

SplunkTrust

OK. Everything before the last line in your question | stats count(concurrency) by concurrency is from our previous question over at http://answers.splunk.com/answers/227393/how-to-use-the-concurrency-command-to-timechart-th.html so this is really more concerned with looking at a "frequency distribution of concurrency", not the nuts and bolts of how that concurrency is calculated.

Using | stats count by concurrency like you are here will produce results that look like what you want, but it'll be an extremely misleading visualization. The reason is that the rows coming into the stats command are all either the start time or the end time of a call. There is no representation for the time in between these points.
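To make that concrete, here is a minimal Python sketch (not SPL) that mimics the mvexpand and streamstats steps for a single sourceip. The function name boundary_rows and the two sample calls are made up for illustration. Every row it produces sits on a start or an end; nothing is emitted for the time in between.

```python
def boundary_rows(calls):
    """Mimic mvexpand + streamstats for one sourceip.

    calls: list of (start_seconds, duration_seconds) tuples (hypothetical data).
    Returns a list of (time, increment, post_concurrency, concurrency) rows.
    """
    events = []
    for start, duration in calls:
        events.append((start, 1))              # call begins
        events.append((start + duration, -1))  # call ends
    events.sort(key=lambda e: e[0])            # like "sort 0 + _time"

    rows, running = [], 0
    for time, inc in events:
        running += inc                         # streamstats sum(increment)
        # On an end row, report the level just before the call ended.
        concurrency = running + 1 if inc == -1 else running
        rows.append((time, inc, running, concurrency))
    return rows


# Two overlapping calls: 0-100s and 50-150s.
rows = boundary_rows([(0, 100), (50, 100)])
for row in rows:
    print(row)
# (0, 1, 1, 1)
# (50, 1, 2, 2)
# (100, -1, 1, 2)
# (150, -1, 0, 1)
```

Four rows come out for 150 seconds of activity, so counting rows says nothing about how long each concurrency level lasted.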

To see why this is a problem, let's look at a specific situation. Let's analyze a time period from 1pm to 2pm. Say we have 5 calls that all start right at 1pm and that are each 15 minutes long. Thinking about this, we want to ultimately see some frequency distribution with a lot of concurrency=0, and a fair bit of concurrency=5.

However, let's pipe such a set into | stats count by concurrency. It will give us a distribution, but if we tally the running total after each row (post_concurrency), concurrency=0 will have a value of 1, then every value of concurrency from 1 to 4 will have a value of 2, then concurrency=5 will have a value of 1. (With the +1 adjustment on the end rows, each value from 1 to 5 simply gets a count of 2.) Either way, this doesn't match our expectations very well.

Now consider 5 other calls that start at 2pm but that are each only 1 minute long. The stats count by concurrency output here will be exactly the same as it was for the first calls. If we consider a time period that covers both sets of calls, it'll be the same chart with all y-axis values doubled. o_O
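Here is a small Python sketch (not SPL) of that thought experiment for a single sourceip. The helper name boundary_counts and the timestamps are invented for illustration. It tallies the boundary rows two ways: by the running total (post_concurrency) and by the adjusted concurrency field with the +1 on end rows.

```python
from collections import Counter


def boundary_counts(calls):
    """Count boundary rows per concurrency value, roughly what
    "stats count by ..." sees after mvexpand + streamstats.

    calls: list of (start_seconds, duration_seconds) tuples (hypothetical data).
    Returns (counts by post_concurrency, counts by adjusted concurrency).
    """
    events = sorted(
        [(s, 1) for s, _ in calls] + [(s + d, -1) for s, d in calls],
        key=lambda e: e[0],
    )
    post, adjusted, running = Counter(), Counter(), 0
    for _, inc in events:
        running += inc
        post[running] += 1
        adjusted[running + 1 if inc == -1 else running] += 1
    return post, adjusted


HOUR = 3600
long_calls = [(13 * HOUR, 15 * 60)] * 5   # five 15-minute calls at 1pm
short_calls = [(14 * HOUR, 60)] * 5       # five 1-minute calls at 2pm

post_long, adj_long = boundary_counts(long_calls)
post_short, adj_short = boundary_counts(short_calls)
post_both, _ = boundary_counts(long_calls + short_calls)

print(sorted(post_long.items()))   # [(0, 1), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]
print(post_long == post_short)     # True
print(sorted(post_both.items()))   # [(0, 2), (1, 4), (2, 4), (3, 4), (4, 4), (5, 2)]
print(sorted(adj_long.items()))    # [(1, 2), (2, 2), (3, 2), (4, 2), (5, 2)]
```

The 15-minute calls and the 1-minute calls produce identical tallies, because the durations never enter the count.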

Instead I think it makes more sense to analyze some discrete unit of time like minutes or seconds, and look at the distribution of concurrency values for each sourceip across all the time periods.

In the search below I'm telling timechart to use 15 minutes rather than 1 second as our bucket size, but that's up to you. The search language looks a lot like what we did in the last question, except at the end we have an extra untable and chart command.

... all the stuff before up to 
| eval concurrency = if(increment==-1, post_concurrency+1, post_concurrency)
| timechart span=15min max(concurrency) as max_concurrency last(post_concurrency) as last_concurrency by sourceip limit=20 
| filldown last_concurrency* 
| foreach "max_concurrency: *" [eval <<MATCHSTR>>=coalesce('max_concurrency: <<MATCHSTR>>','last_concurrency: <<MATCHSTR>>')] 
| fields - last_concurrency* max_concurrency*
| untable _time sourceip concurrency
| chart count over concurrency by sourceip

The search above will analyze all the 15-minute periods for each sourceip, give you a frequency distribution for each sourceip value, and graph them all as separate lines on the same frequency distribution chart.
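If it helps to see the bucketing idea outside of SPL, here is a rough Python sketch for a single sourceip. It is only an approximation of the timechart / filldown / coalesce steps: the function name bucketed_distribution and its parameters are hypothetical, and it assumes the level is 0 before any event is seen, whereas the search would leave those leading buckets empty.

```python
from collections import Counter


def bucketed_distribution(calls, range_start, range_end, span=900):
    """Histogram of per-bucket concurrency for one sourceip.

    calls: list of (start_seconds, duration_seconds) tuples (hypothetical data).
    Buckets are [range_start + k*span, range_start + (k+1)*span).
    """
    events = sorted(
        [(s, 1) for s, _ in calls] + [(s + d, -1) for s, d in calls],
        key=lambda e: e[0],
    )
    n_buckets = (range_end - range_start) // span
    bucket_max = [None] * n_buckets    # like max(concurrency) per bucket
    bucket_last = [None] * n_buckets   # like last(post_concurrency) per bucket

    running, carried = 0, 0            # assumption: level is 0 before any event
    for time, inc in events:
        running += inc
        concurrency = running + 1 if inc == -1 else running
        k = (time - range_start) // span
        if k < 0:
            carried = running          # level inherited from before the window
        elif k < n_buckets:
            prev = bucket_max[k]
            bucket_max[k] = concurrency if prev is None else max(prev, concurrency)
            bucket_last[k] = running

    histogram = Counter()
    for k in range(n_buckets):
        if bucket_max[k] is not None:
            histogram[bucket_max[k]] += 1   # bucket saw events: use its peak
            carried = bucket_last[k]
        else:
            histogram[carried] += 1         # quiet bucket: carry the last level forward
    return histogram


HOUR = 3600
# Five 15-minute calls at 1pm, analyzed over 1pm-2pm.
long_hist = bucketed_distribution([(13 * HOUR, 900)] * 5, 13 * HOUR, 14 * HOUR)
# Five 1-minute calls at 2pm, analyzed over 2pm-3pm.
short_hist = bucketed_distribution([(14 * HOUR, 60)] * 5, 14 * HOUR, 15 * HOUR)

print(dict(long_hist))    # {5: 2, 0: 2}
print(dict(short_hist))   # {5: 1, 0: 3}
```

Unlike the boundary-row count, the two scenarios now give different distributions. In the first one the end rows land exactly on 1:15, so the second bucket still reports a peak of 5; a smaller span reduces that kind of edge effect at the cost of more buckets.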

If instead you want to just see a single overarching frequency distribution of "per-sourceip concurrency", replace that last chart command with our old friend, {drum fill} | stats count by concurrency.

Path Finder

Thanks sideview for the detailed information. Will give it a try when I have a chance.
