Process control chart, e.g. upper/lower control limits

bigtyma
Communicator

I have been asked to help a co-worker create a process control chart to understand an application's response time.

The following three events are generated for each test.

INFO=Signon_Screen RESPONSE_TIME=2.1000
INFO=Signon_Dept_Screen RESPONSE_TIME=0.6000
INFO=Citrix_Login_Comp RESPONSE_TIME=7.6000

The link below is a step in the right direction but I am having trouble getting this to work.

http://splunk-base.splunk.com/answers/73300/which-search-is-faster-reusing-a-calculation-in-an-if-cl...


jonuwz
Influencer

Simplest solution:

index=_internal sourcetype="splunk_web_access" uri="/en-GB/api/shelper" | table spent | rename spent as val 
| autoregress val 
| eval rangeMR=abs(val-val_p1) 
| eventstats count as numR avg(val) as AVG sum(rangeMR) as mrAVG
| eval mrAVG=mrAVG/(numR-1) 
| eval UCL=AVG+2.66*mrAVG 
| eval LCL=AVG-2.66*mrAVG 
| table val_p1 val AVG LCL UCL
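To sanity-check the numbers the search produces, the same individuals/moving-range (XmR) arithmetic can be reproduced outside Splunk. This is only a sketch: the function name `xmr_limits` and the sample values are made up for illustration and are not from the original post. The 2.66 factor is the usual XmR constant (3 / 1.128).

```python
from statistics import mean

def xmr_limits(values):
    """Return (average, LCL, UCL) for an individuals (XmR) chart."""
    if len(values) < 2:
        raise ValueError("need at least two samples to form a moving range")
    # Moving range: absolute difference between consecutive samples,
    # the same quantity as rangeMR in the search above.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg = mean(values)
    # Equivalent to sum(rangeMR) / (numR - 1) in the search.
    mr_avg = mean(moving_ranges)
    # 2.66 = 3 / d2, with d2 = 1.128 for moving ranges of size 2.
    return avg, avg - 2.66 * mr_avg, avg + 2.66 * mr_avg

# Hypothetical response times in seconds (illustrative only).
samples = [2.1, 2.4, 1.9, 2.2, 2.0]
avg, lcl, ucl = xmr_limits(samples)
print(f"AVG={avg:.4f} LCL={lcl:.4f} UCL={ucl:.4f}")
```

Any point outside the LCL..UCL band would be flagged as out of control, which is what the last `table` command lets you eyeball in Splunk.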

The pitfalls are worth mentioning though.

a) A process is said to be in control if its data points fall within a band 6 sigma wide. The naive approach is ±3 sigma around the mean. However, for data with a hard floor (e.g. 0 seconds) you can end up with an LCL < 0, which is nonsense. In that case you might set the LCL at 1 sigma below the mean and the UCL at 5 sigma above it. In other words, you need to know your data and your process.
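A small illustration of pitfall a), assuming limits are placed at multiples of the standard deviation around the mean. The helper `sigma_limits`, its parameters, and the sample data are hypothetical; the 1 sigma / 5 sigma split is just the example mentioned above, not a recommendation.

```python
from statistics import mean, pstdev

def sigma_limits(values, k_lower=3.0, k_upper=3.0, floor=0.0):
    """Return (LCL, UCL) at mean - k_lower*sigma and mean + k_upper*sigma.

    The LCL is clamped to `floor`, since a response time cannot be negative.
    """
    avg = mean(values)
    sigma = pstdev(values)  # population standard deviation of the samples
    return max(floor, avg - k_lower * sigma), avg + k_upper * sigma

# Hypothetical response times in seconds with one slow outlier.
times = [0.6, 0.5, 0.7, 0.6, 2.6]

naive_lcl = mean(times) - 3 * pstdev(times)            # falls below zero
symmetric = sigma_limits(times)                        # LCL clamped to 0
asymmetric = sigma_limits(times, k_lower=1.0, k_upper=5.0)

print(f"naive LCL  = {naive_lcl:.3f}")
print(f"symmetric  = {symmetric}")
print(f"asymmetric = {asymmetric}")
```

With skewed data like this, the symmetric LCL carries no information once clamped to zero, which is why the choice of limits has to come from knowing the process.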

b) The approach was invented for sampling batches of, say, ball bearings produced over a day. In technology, the number of samples is huge. 3 sigma is roughly 1 in a thousand on each side (about 0.27% of points in total for normally distributed data). In high-volume processing your process will be out of control far more often than your helpdesk can investigate. Backups, or anything else that creates network latency, will kill you unless the network is only a small part of the process.
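To put rough numbers on pitfall b): assuming normally distributed data (which response times often are not), the fraction of points outside ±3 sigma can be computed directly. The event volume below is a made-up figure for illustration.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

p = two_sided_tail(3.0)        # fraction of points outside +/- 3 sigma
events_per_day = 1_000_000     # hypothetical volume, not from the post
expected_alarms = p * events_per_day

print(f"P(outside 3 sigma) = {p:.5f}")
print(f"expected out-of-control points per day = {expected_alarms:.0f}")
```

Even for a perfectly healthy process, that is thousands of "out of control" points a day at this volume, far more than anyone can investigate.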

c) Control charts compute their limits over all the data. This means a steady stream of highly regular samples can tighten the limits enough that points in the past now fall out of control. Again, this shows the technique's roots in batch sampling over a shift's work, not live monitoring.

d) The occasional spike means nothing in today's world. Running your data through a Fourier transform to look for regular spikes is far more fun and informative. If you want to use control charts for monitoring, you also need to look at loss functions (read Taguchi) for a mechanism to infer impact.
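As a rough idea of what "look for regular spikes" in d) could mean, here is a minimal sketch using a plain discrete Fourier transform on synthetic data. The function names and the "at least half the peak" rule for picking the fundamental frequency are assumptions made for this example; real data would need evenly spaced samples and a more careful peak detector.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform, mean removed first."""
    n = len(signal)
    avg = sum(signal) / n
    centered = [x - avg for x in signal]
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(centered)))
        for k in range(n // 2 + 1)
    ]

def spike_period(signal):
    """Estimate the period (in samples) of a regular spike, or None."""
    mags = dft_magnitudes(signal)
    peak = max(mags[1:])
    if peak < 1e-9:
        return None  # flat signal, nothing periodic to report
    # Lowest frequency bin carrying at least half the peak magnitude.
    k = next(i for i, m in enumerate(mags) if i > 0 and m >= 0.5 * peak)
    return len(signal) / k

# Synthetic data: 1.0 s baseline with a 6.0 s spike every 8th sample.
signal = [6.0 if t % 8 == 0 else 1.0 for t in range(64)]
period = spike_period(signal)
print(f"estimated spike period: {period} samples")
```

A spike that recurs on a fixed schedule (a backup, a cron job) shows up as a clear frequency, whereas a control chart would only flag each spike individually.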

Unless you know what you are doing, using SPC charts to monitor data is a road to frustration.

This technique has massive value in tuning software components that you have written and control. It's utterly useless in a meta sense, like Citrix logons.

The technique dates back to a time when the number of moving parts needed to create an item was low and each manufacturing process could be tuned. That was the point: if you minimize variance, you reduce the monetary loss to the company from returned defective product.

I seriously doubt you, or anyone else working in IT, can name (let alone influence) all the moving parts in a Citrix logon.

bigtyma
Communicator

Thank you, if you have any other suggested reading please let me know!
