I don't need the entire tables; just the names of those processes will do, so it would look like this:
hosts    datetime               top processes
-----------------------------------------------------------
myhost   01/01/1998 00:00:00    chrome.exe, notepad.exe
I have already enabled PerfmonMk for the Process object. Thanks!
Hi wellhung,
This could be a good use of the bucket command (bin in the search below; bucket is an alias for it) to break the events into 15-second time spans and then find any chunks of time with an average CPU utilization greater than 90%:
sourcetype=perfmonmk | bin _time span=15s | stats avg(cpu_util) AS average_cpu_util by host _time process | where average_cpu_util > 90
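For anyone unsure what "bin _time span=15s" does to the timestamps: it groups events into 15-second buckets by rounding each _time down to the start of its span. A minimal Python sketch of that rounding, assuming buckets line up on multiples of 15 seconds since the Unix epoch (a simplification; the helper name bucket_start is made up for illustration):

```python
SPAN_SECONDS = 15  # mirrors span=15s in the search


def bucket_start(epoch_seconds: float, span: int = SPAN_SECONDS) -> int:
    """Round a timestamp down to the start of its span-sized bucket.

    Assumption: buckets are aligned to multiples of `span` seconds since
    the Unix epoch.
    """
    return int(epoch_seconds // span) * span


# Three samples a few seconds apart: the first two share a bucket,
# the third falls into the next one.
samples = [1468508265.0, 1468508275.5, 1468508280.0]
for t in samples:
    print(t, "->", bucket_start(t))
```

Every event whose timestamp rounds to the same bucket start is then treated as one group by the stats command that follows.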
You can play around a bit with the bucket span, or use other stats functions (perc90, median, etc.) to get the representation you want.
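To see why the choice of function matters, here is a rough Python comparison of mean and median on the same hypothetical bucket of %_Processor_Time samples. perc90 is left out on purpose, since Splunk's percentile calculation may not match any simple Python formula:

```python
from statistics import mean, median

THRESHOLD = 90.0

# Hypothetical %_Processor_Time samples that fell into one 15-second bucket.
bucket = [100.0, 100.0, 0.0]

avg_value = mean(bucket)    # pulled down by the single 0 sample
med_value = median(bucket)  # unaffected by the single low sample

print("avg:", avg_value, "passes:", avg_value > THRESHOLD)
print("median:", med_value, "passes:", med_value > THRESHOLD)
```

With these samples the average stays below 90 while the median clears it, so the same data can produce different results depending on the function you pick.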
Please let me know if this answers your question!
Hi, thanks for replying.
I can't seem to use this query. Is "cpu_util" a counter? Is it supposed to be %_Processor_Time?
Neither gives me any results.
Could you please explain what this does?
stats avg(cpu_util) AS average_cpu_util by host _time process
The query returns no results from this part onward, and this looks like the most important part.
What counters do I need for this to work? Mine are basically the same as the ones on this page: http://blogs.splunk.com/2013/12/09/monitor-processes-per-user-on-microsoft-remote-desktop-services-s....
Thanks!
Hi wellhung, yes, my search was just an example. You'll want to replace the relevant fields with the ones in the data you are working with.
That stats line finds the average %_Processor_Time (if that's the field you are interested in) for each combination of host, _time bucket (15 seconds), and process, then builds a table of the results. The where clause at the end keeps only the rows whose average %_Processor_Time is greater than 90.
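A minimal Python model of that pipeline (bin, then stats avg by host/_time/process, then where), using hypothetical events and a made-up helper high_cpu_buckets. It is only meant to show the grouping logic, not how Splunk implements it:

```python
from collections import defaultdict
from statistics import mean

SPAN = 15         # bin _time span=15s
THRESHOLD = 90.0  # where average_cpu_util > 90

# Hypothetical events: (epoch seconds, host, process, %_Processor_Time)
events = [
    (0, "myhost", "chrome", 100.0),
    (5, "myhost", "chrome", 95.0),
    (10, "myhost", "notepad", 3.0),
    (20, "myhost", "chrome", 100.0),
    (25, "myhost", "chrome", 0.0),
]


def high_cpu_buckets(events, span=SPAN, threshold=THRESHOLD):
    # bin _time span=15s: round each timestamp down to its bucket start
    groups = defaultdict(list)
    for t, host, process, cpu in events:
        bucket = (t // span) * span
        groups[(host, bucket, process)].append(cpu)
    # stats avg(...) by host _time process: one row per group
    rows = [(host, bucket, process, mean(vals))
            for (host, bucket, process), vals in groups.items()]
    # where average > threshold: keep only the busy rows
    return sorted(r for r in rows if r[3] > threshold)


for row in high_cpu_buckets(events):
    print(row)  # prints ('myhost', 0, 'chrome', 97.5)
```

The chrome bucket starting at 15 is dropped because its two samples (100 and 0) average to 50, and notepad never comes close to the threshold.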
Hi,
Do you know why my avgCPU always shows 100%? I don't think that can be right...
index=perfmon sourcetype="PerfmonMk:Process" instance!="_Total" AND instance!="Idle" AND instance!="wmi*" | bucket _time span=15s | stats avg(%_Processor_Time) AS "avgCPU" avg(Working_Set_-_Private) AS "AvgMemory" BY host _time instance | where avgCPU > 90 | DEDUP _time
In your query I wasn't sure what "process" was, so I changed it to instance (which holds the process names). Is this query giving me what I actually want, or am I mangling the data somewhere?
Thanks!
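On the "am I mangling the data somewhere" question, one step worth a second look is the trailing "| DEDUP _time". As far as I understand dedup, it keeps only the first result for each distinct value of the listed field, so after the stats it would leave just one row per 15-second bucket across all hosts and instances. A rough Python model of that behaviour, with made-up rows (including a hypothetical "otherhost") and a hypothetical helper dedup_on_time:

```python
# Hypothetical rows as they might come out of the stats command:
# (host, _time bucket, instance, avgCPU)
rows = [
    ("mehost", 0, "chrome", 100.0),
    ("mehost", 0, "chrome#2", 100.0),
    ("mehost", 15, "chrome", 100.0),
    ("otherhost", 15, "sqlservr", 95.0),
]


def dedup_on_time(rows):
    """Keep only the first row seen for each distinct _time value."""
    seen = set()
    kept = []
    for row in rows:
        t = row[1]
        if t not in seen:
            seen.add(t)
            kept.append(row)
    return kept


print(dedup_on_time(rows))
```

In this model, chrome#2 and the otherhost row disappear even though both are above 90, which is usually not what you want when listing every busy process.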
Yup, if the instance field contains the name of the process, that should work.
As for avgCPU showing 100% in every time bucket, I'd guess that means every event in those buckets has a value of 100 for that field. Can you confirm this in the raw events?
Could you post an example of a few of the events?
Hi,
I guess that since I'm filtering for anything > 90, I would only get the 100% values. What I'm worried about is this: am I really getting the processes that ran above 90% for at least 15 seconds, or am I getting everything that peaked above 90% at some point?
All I see is chrome at 100%, pages and pages of it. It might be true, but I only use chrome on the server when I need to download something, and I haven't even opened chrome there for at least a few days.
My question, though: does the polling interval matter? When the Splunk UF forwards the data, does it forward only the data at the time of polling, or the whole lot? Say the interval is 30s: does the forwarded data contain everything since the last poll 29.xx seconds ago, or only the data at the time of the poll?
Raw:
Common to all events: 7/14/16 2:57:45.000 PM, host = mehost, index = perfmon, object = Process, source = PerfmonMk:Process, sourcetype = PerfmonMk:Process

instance     %_Processor_Time        ID_Process   Working_Set_-_Private
InetMgr      0                       3504         23699456
LogonUI      0                       748          7745536
PRTG_Probe   0.15595768067475685     4104         21671936
Ssms         0                       676          76500992
System       0.051985893558252283    4            73728
chrome       100                     56           68251648
chrome#1     0                       2256         937984
chrome#2     100                     2196         128057344
cmd          0                       1668         610304
This could depend on your sampling period, and on whether an event is recorded at all when the process utilization is 0. If you only have one sample in a 15-second period and that sample is above 90%, the search will return that bucket.
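To make the sampling point concrete: the search only averages whatever samples exist in each bucket. Assuming one event per process per poll, a 30s polling interval (the example from the question) means a 15-second bucket holds at most one sample per process, so that single reading decides the outcome. A small Python illustration with a hypothetical helper bucket_triggers:

```python
from statistics import mean

THRESHOLD = 90.0


def bucket_triggers(samples, threshold=THRESHOLD):
    """True if the average of the samples in one bucket exceeds the threshold."""
    return mean(samples) > threshold


# Only one sample landed in the bucket: a single reading of 100 is enough.
print(bucket_triggers([100.0]))       # True
# A 0 reading was also recorded in the bucket: the average drops to 50.
print(bucket_triggers([100.0, 0.0]))  # False
```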
I'd run a simple table of %_Processor_Time to get an idea of what the values generally look like. I can see chrome with a value of 100 there, but System above it has 0.05(...).
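A rough local analogue of that kind of value breakdown (value, count, percent of all events), with made-up readings and a hypothetical helper value_breakdown, just to show what the summary is counting:

```python
from collections import Counter


def value_breakdown(values):
    """Return (value, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(value, n, round(100.0 * n / total, 3))
            for value, n in counts.most_common()]


# Hypothetical %_Processor_Time readings pulled from a handful of events.
readings = [0, 0, 0, 100, 0, 0.051985893558252283, 100, 0]
for value, n, pct in value_breakdown(readings):
    print(f"{value}\t{n}\t{pct}%")
```

If the real data shows almost nothing but 0 and 100, that points at the source values themselves rather than at the averaging.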
A decimal value makes me think the percentage is being represented as a fraction between 0 and 1, but the value of 100 for the chrome process contradicts that.
That said, the "stats avg()" function will work on whatever values you give it, so something could still be off with the source data itself.
I recreated the index. The %_Processor_Time table basically shows 100s and then a bunch of 0s, with nothing in between (for now, I hope). I'll come back to it tomorrow; maybe there will be some more variety by then.
%_Processor_Time       count   percent
0                      1,718   81.229%
100                    282     13.333%
0.15612418354205651    4       0.189%
0.15594301259076812    3       0.142%
0.15594309209218249    3       0.142%
0.15600581901974953    3       0.142%
0.15607107435167317    3       0.142%
0.15613357457951121    3       0.142%
0.1561871657303206     3       0.142%
0.16447270967157646