How to search a list of times when CPU is greater than 90% for more than 15 seconds and list top processes for each of those times?

wellhung · ‎07-13-2016

I don't need the entire tables, just the names of those processes will do so it would look like this:

 hosts       datetime                 top processes
-----------------------------------------------------------------------------------------
 myhost      01/01/1998 00:00:00      chrome.exe, notepad.exe

I already enabled perfmonmk for process. Thanks!

muebel · ‎07-13-2016

Hi wellhung,

This could be a good use of the bucket command (bin is its alias) to break the events into 15-second timespans, and then find any chunks of time with average CPU utilization greater than 90%.

sourcetype=perfmonmk | bin _time span=15s | stats avg(cpu_util) AS average_cpu_util by host _time process | where average_cpu_util > 90

You can play around a bit with the bucket span, or use other stats functions (perc90, median, etc.) to get the representation that you want.

Please let me know if this answers your question!

wellhung · ‎07-13-2016

Hi, thanks for replying.

I can't seem to use this query. Is "cpu_util" a counter? Is it supposed to be %_Processor_Time?
Neither gives me any results.

Could you please explain what this does?

stats avg(cpu_util) AS average_cpu_util by host _time process

The query does not yield any result from this part on, which looks like the most important part.

What counters should I have for this to work? Mine are basically like the ones on this page: http://blogs.splunk.com/2013/12/09/monitor-processes-per-user-on-microsoft-remote-desktop-services-s....

Thanks!

muebel · ‎07-13-2016

Hi wellhung, yeah, my search was just an example. You'll want to replace the pertinent fields with those that are in the data you are working with.

That stats line computes the average %_Processor_Time (if that's the field you are interested in) for each combination of host, 15-second _time bucket, and process, then builds a table of the results. The where clause at the end keeps only the rows whose average is greater than 90.

wellhung · ‎07-14-2016

Hi,

Do you know why my avgCPU always shows 100%? I don't think that can be right...

index=perfmon sourcetype="PerfmonMk:Process" instance!="_Total" AND instance!="Idle" AND instance!="wmi*" | bucket _time span=15s | stats avg(%_Processor_Time) AS "avgCPU" avg(Working_Set_-_Private) AS "AvgMemory" BY host _time instance | where avgCPU > 90 | DEDUP _time

In your query I'm not sure what "process" is, so I changed it to instance (which holds the names of the processes). Is this query giving me what I actually want? Or am I mangling data somewhere...

Thanks!

muebel · ‎07-14-2016

Yup, if the instance field contains the name of the process, that should work.

As for avgCPU being 100% for all time buckets, I guess that means all the events in those buckets have 100 for that field. Can you confirm this in the raw events?
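One caveat with a trailing dedup _time: it keeps only the first row for each _time value, so every other process (and host) in the same bucket is dropped. A rough sketch of how the per-process rows could instead be collapsed into one "top processes" row per host and time bucket, assuming the PerfmonMk field names %_Processor_Time and instance used in this thread (the names cpu and top_processes are just illustrative):

```
index=perfmon sourcetype="PerfmonMk:Process" instance!="_Total" instance!="Idle" instance!="wmi*"
| rename "%_Processor_Time" AS cpu
| bin _time span=15s
| stats avg(cpu) AS avgCPU BY host _time instance
| where avgCPU > 90
| stats values(instance) AS top_processes BY host _time
| eval top_processes=mvjoin(top_processes, ", ")
```

The rename up front is only there so that later commands don't have to quote a field name that starts with %. The second stats gathers every process that passed the filter into a multivalue field, and mvjoin flattens it into a comma-separated string.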

Could you post an example of a few of the events?

wellhung · ‎07-14-2016

Hi,

I guess that since I'm filtering on anything > 90 I would get only the 100%s, but what worries me is this: am I really getting the processes that run for at least 15 seconds, or am I getting everything that peaked above 90% at some point?

All I see is chrome at 100%, pages and pages of it. It might be true but I only use chrome on the server if I need to download something and for at least a few days I haven't even opened chrome there.

My question though, does the polling interval matter? When Splunk UF forwards the data does it only forward data at the time of polling or the whole lot? Say interval is 30s, does the forwarded data contain all the data since last poll 29.xx seconds ago or only data at the time of the poll?

Raw:

7/14/16
2:57:45.000 PM  
InetMgr 0   3504    23699456    
%_Processor_Time = 0 ID_Process = 3504 Working_Set_-_Private = 23699456 host = mehost index = perfmon instance = InetMgr object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
LogonUI 0   748 7745536 
%_Processor_Time = 0 ID_Process = 748 Working_Set_-_Private = 7745536 host = mehost index = perfmon instance = LogonUI object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
PRTG_Probe  0.15595768067475685 4104    21671936    
%_Processor_Time = 0.15595768067475685 ID_Process = 4104 Working_Set_-_Private = 21671936 host = mehost index = perfmon instance = PRTG_Probe object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
Ssms    0   676 76500992    
%_Processor_Time = 0 ID_Process = 676 Working_Set_-_Private = 76500992 host = mehost index = perfmon instance = Ssms object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
System  0.051985893558252283    4   73728   
%_Processor_Time = 0.051985893558252283 ID_Process = 4 Working_Set_-_Private = 73728 host = mehost index = perfmon instance = System object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
chrome  100 56  68251648    
%_Processor_Time = 100 ID_Process = 56 Working_Set_-_Private = 68251648 host = mehost  index = perfmon instance = chrome object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
chrome#1    0   2256    937984  
%_Processor_Time = 0 ID_Process = 2256 Working_Set_-_Private = 937984 host = mehost  index = perfmon instance = chrome#1 object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
chrome#2    100 2196    128057344   
%_Processor_Time = 100 ID_Process = 2196 Working_Set_-_Private = 128057344 host = mehost  index = perfmon instance = chrome#2 object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process
7/14/16
2:57:45.000 PM  
cmd 0   1668    610304  
%_Processor_Time = 0 ID_Process = 1668 Working_Set_-_Private = 610304 host = mehost index = perfmon instance = cmd object = Process source = PerfmonMk:Process sourcetype = PerfmonMk:Process

muebel · ‎07-14-2016

This could depend on your sampling period, and on whether an event is recorded when the process utilization is 0. If you only have one sample in a 15-second period and that sample is above 90%, the search will trigger.

I'd do a simple table of %_Processor_Time to get an idea of what the values generally look like. I can see chrome has a value of 100 there, but then System above it has 0.05(...)
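To see whether single-sample buckets are what is firing, one option is to count the samples in each bucket and require every one of them to be above the threshold. This is only a sketch, assuming the same %_Processor_Time and instance fields; samples, minCPU and avgCPU are illustrative names:

```
index=perfmon sourcetype="PerfmonMk:Process" instance!="_Total" instance!="Idle"
| rename "%_Processor_Time" AS cpu
| bin _time span=15s
| stats count AS samples min(cpu) AS minCPU avg(cpu) AS avgCPU BY host _time instance
| where samples >= 2 AND minCPU > 90
```

If the collection interval is longer than the span, no bucket will ever hold two samples and this returns nothing, in which case the span would need to be widened to cover at least two polls.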
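Something along these lines would do for that table, again assuming the field names from the events above (cpu is just a shorter alias introduced by the rename):

```
index=perfmon sourcetype="PerfmonMk:Process" instance!="_Total" instance!="Idle"
| rename "%_Processor_Time" AS cpu
| table _time host instance cpu
| sort - cpu
```

Swapping the last two commands for stats count BY cpu would give a value distribution rather than a row per event.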

A decimal value makes me think that it's representing the percentage as something between 0 and 1, but then the 100 value for the chrome process confuses that.

With that being said, the "stats avg()" function will work on whatever values you give it, so there could still be something up with the actual source data.

wellhung · ‎07-14-2016

I recreated the index. The %_Processor_Time table basically shows 100s and then a bunch of 0s, and nothing in between, for now, I hope. I'll come back to it tomorrow; maybe there will be some diversity.

 %_Processor_Time        count     %
-------------------------------------------
 0                       1,718     81.229%
 100                     282       13.333%
 0.15612418354205651     4         0.189%
 0.15594301259076812     3         0.142%
 0.15594309209218249     3         0.142%
 0.15600581901974953     3         0.142%
 0.15607107435167317     3         0.142%
 0.15613357457951121     3         0.142%
 0.1561871657303206      3         0.142%
 0.16447270967157646
