Splunk Search

Process control chart, e.g. upper/lower control limits

bigtyma
Communicator

I have been asked to help a co-worker create a process control chart to understand an application's response time.

The following three events are generated for each test.

INFO=Signon_Screen RESPONSE_TIME=2.1000
INFO=Signon_Dept_Screen RESPONSE_TIME=0.6000
INFO=Citrix_Login_Comp RESPONSE_TIME=7.6000

The link below is a step in the right direction, but I am having trouble getting it to work.

http://splunk-base.splunk.com/answers/73300/which-search-is-faster-reusing-a-calculation-in-an-if-cl...


jonuwz
Influencer

Simplest solution:

index=_internal sourcetype="splunk_web_access" uri="/en-GB/api/shelper" | table spent | rename spent as val 
| autoregress val 
| eval rangeMR=abs(val-val_p1) 
| eventstats count as numR avg(val) as AVG sum(rangeMR) as mrAVG
| eval mrAVG=mrAVG/(numR-1) 
| eval UCL=AVG+2.66*mrAVG 
| eval LCL=AVG-2.66*mrAVG 
| table val_p1 val AVG LCL UCL
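
To sanity-check the numbers the search produces, the same arithmetic can be sketched outside Splunk. This is a minimal Python sketch, not SPL, and the sample values are made up for illustration. The 2.66 multiplier is the usual individuals-chart constant 3 / 1.128, where 1.128 is the d2 factor for ranges taken over 2 consecutive points.

```python
from statistics import mean

def xmr_limits(values):
    """Return (avg, lcl, ucl) for an individuals (XmR) control chart."""
    if len(values) < 2:
        raise ValueError("need at least two points to form a moving range")
    # Moving range: absolute difference between consecutive points,
    # the same quantity the search stores in rangeMR.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg = mean(values)
    # Equivalent to sum(rangeMR) / (numR - 1) in the search.
    mr_avg = mean(moving_ranges)
    ucl = avg + 2.66 * mr_avg
    lcl = avg - 2.66 * mr_avg
    return avg, lcl, ucl

# Hypothetical response times in seconds (illustrative only).
samples = [2.0, 2.2, 1.8, 2.1, 1.9]
avg, lcl, ucl = xmr_limits(samples)
print(f"AVG={avg:.4f} LCL={lcl:.4f} UCL={ucl:.4f}")
```

For these made-up samples the average is 2.0 and the mean moving range is 0.275, so the limits land at roughly 1.27 and 2.73.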

The pitfalls are worth mentioning though.

a) A process is said to be in control if its data points fall within a 6 sigma band. The naive approach is ±3 sigma around the mean. However, for data with a hard floor (e.g. 0 seconds), you can end up with an LCL < 0, which is nonsense. In that case you might want to set the LCL at 1 sigma and the UCL at 5 sigma. In other words, you need to know your data and process.
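
The negative-LCL problem is easy to reproduce. The Python sketch below uses made-up, skewed response times and shows the raw lower limit going below zero. Clamping to the floor is used here only as one simple workaround; it is not the asymmetric 1 sigma / 5 sigma choice mentioned above, and the right fix depends on your data.

```python
from statistics import mean

def xmr_limits_with_floor(values, floor=0.0):
    """Return (raw_lcl, clamped_lcl, ucl) for an XmR chart.

    `floor` is an assumed physical lower bound (0 seconds here).
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg = mean(values)
    mr_avg = mean(moving_ranges)
    raw_lcl = avg - 2.66 * mr_avg
    ucl = avg + 2.66 * mr_avg
    # A response time cannot be below the floor, so never report a lower LCL.
    return raw_lcl, max(floor, raw_lcl), ucl

# Hypothetical bursty response times in seconds (illustrative only).
samples = [0.5, 3.0, 0.5, 2.5, 0.5]
raw_lcl, lcl, ucl = xmr_limits_with_floor(samples)
print(f"raw LCL={raw_lcl:.3f} clamped LCL={lcl:.3f} UCL={ucl:.3f}")
```

With these samples the raw LCL is about -4.59 seconds, which is meaningless for a response time, so the clamped value of 0 is reported instead.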

b) The approach was invented when people sampled batches of ball bearings produced over the day. In technology, the number of samples is huge. A point beyond 3 sigma is roughly a 1 in a thousand event (about 0.27% of normally distributed points fall outside ±3 sigma). In high-volume processing, your process will be out of control far more often than your helpdesk can investigate. Backups, or anything else that creates network latency, will kill you unless the network is only a small part of the process.
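
To put a number on the volume problem, here is a rough Python sketch. It assumes independent, normally distributed measurements, which real response times rarely are, and the one million events per day figure is purely hypothetical.

```python
from statistics import NormalDist

def expected_out_of_control(n_points, k_sigma=3.0):
    """Return (probability, expected count) of points outside +/- k sigma.

    Assumes independent, normally distributed points.
    """
    # Two-sided tail probability of a standard normal beyond k sigma.
    p_outside = 2 * (1 - NormalDist().cdf(k_sigma))
    return p_outside, n_points * p_outside

# Hypothetical volume: one million events per day.
p, alarms = expected_out_of_control(1_000_000)
print(f"P(outside 3 sigma) = {p:.4%}, expected per day = {alarms:.0f}")
```

Even for a perfectly stable process, that is on the order of a few thousand flagged points per day under these assumptions.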

c) Process charts compute their limits over all the data. This means a steady stream of highly regular samples can tighten the limits enough to push earlier points on the chart out of control. Again, this shows the technique's roots in batch sampling of a shift's work, not live monitoring.
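
The retroactive effect can be shown with a small Python sketch. All values are made up: five noisy early readings are in control on their own, then fifty very regular readings are appended and the limits are recomputed over everything.

```python
from statistics import mean

def xmr_limits(values):
    """Return (lcl, ucl) for an XmR chart computed over all values."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg = mean(values)
    mr_avg = mean(moving_ranges)
    return avg - 2.66 * mr_avg, avg + 2.66 * mr_avg

def out_of_control(values):
    """Return the indexes of points outside limits built from `values`."""
    lcl, ucl = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Hypothetical data: a noisy start followed by a long, very regular run.
early = [2.0, 2.8, 2.0, 2.8, 2.0]
steady = [2.2, 2.3] * 25

before = out_of_control(early)           # limits from the early data only
after = out_of_control(early + steady)   # limits recomputed over everything
print("flagged before:", before)
print("flagged after: ", after)
```

Nothing is flagged at first, but once the regular run shrinks the mean moving range, the two earlier 2.8 readings fall outside the recomputed limits.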

d) The occasional spike means nothing in today's world. Running your data through a Fourier transform to look for regular spikes is far more fun and informative. If you want to use control charts for monitoring, you also need to look at loss functions (read Taguchi) for a mechanism to infer impact.
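
As a rough illustration of the Fourier idea, the Python sketch below runs a naive discrete Fourier transform over a made-up series with a spike every 8th sample and recovers that period. It assumes evenly spaced samples and uses a plain O(n^2) DFT for clarity; a real analysis would use an FFT library.

```python
import cmath

def dominant_period(values):
    """Return the period (in samples) of the strongest repeating pattern."""
    n = len(values)
    avg = sum(values) / n
    # Remove the constant level so the zero-frequency term does not dominate.
    centered = [v - avg for v in values]
    magnitudes = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        magnitudes[k] = abs(coeff)
    peak = max(magnitudes.values())
    # Regular spikes produce equally strong harmonics, so take the lowest
    # frequency that is (nearly) as strong as the peak.
    k_star = min(k for k, m in magnitudes.items() if m >= 0.99 * peak)
    return n / k_star

# Hypothetical series: baseline of 1.0 with a spike to 6.0 every 8th sample.
series = [6.0 if t % 8 == 0 else 1.0 for t in range(64)]
period = dominant_period(series)
print("dominant period in samples:", period)
```

A control chart would flag each of those spikes individually, whereas the spectrum shows directly that they recur every 8 samples.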

Unless you know what you are doing, using SPC charts to monitor data is a road to frustration.

This technique has massive value in tuning software components that you have written and control. It's utterly useless in a meta sense, for something like Citrix logons.

The technique dates back to a time when the number of moving parts needed to create an item was low and each manufacturing process could be tuned. That was the point: if you minimize variance, you reduce the monetary loss to the company from returned defective product.

I seriously doubt you, or anyone else working in IT, can name (let alone influence) all the moving parts in a Citrix logon.

bigtyma
Communicator

Thank you, if you have any other suggested reading please let me know!
