Splunk Search

Calculating the average time between start and stop jobs for a service

carlyleadmin
Contributor

Hi,

I've asked this question before and never got it to work. Maybe it was my fault for not being clear about what I wanted to accomplish.

I am forwarding Windows System logs. I have a service called "RS word processing service" that stops and starts throughout the day. As you can see from the logs, the recovery is usually quick: when the service stops, it starts back up in less than 5 seconds. The problem is when the service stops and does not recover in a timely manner, such as the case highlighted in the attached screenshot. I hope it is clear up to this point.

What I am trying to achieve is to get only the results where the time between stop and start is more than a minute or two, and to set an alert based on that. Is this possible? Can anyone help me with this?

0 Karma
1 Solution

carlyleadmin
Contributor

In case someone is in the same situation as I am, I just wanted to share the solution I applied. All the answers in this thread were helpful but did not exactly address my issue. I think I made what I wanted to achieve far more complicated than it needed to be. All I did was use the stats or streamstats command with latest() to pull the most recent state of the service, name it Final_State, and search for Final_State=stopped in my query. If the service stops and starts again during a restart, no alarm is triggered, because I am only looking for a final state of completely stopped. So far my alarm seems to be working fine. Thanks, everyone.

My search query:

"The SunGard Investran Scheduling Service"
| rex field=Message "The (?<Name>[a-zA-Z|\s]+) Service service entered the (?<State>[a-zA-Z]+) state."
| streamstats latest(State) AS Final_State
| search Final_State=stopped
| table Name Message Final_State host _time

0 Karma

elliotproebstel
Champion

You can append the following to the search you're running now, and it will create a new field called duration for each event.

| sort _time 
| streamstats earliest(_time) as start_time reset_after="("like(Message, \"%running%\")")" 
| eval duration=if(like(Message, "%running%"), _time-start_time, NULL) 
| eventstats max(duration) AS duration BY start_time
| fields - start_time

If you want to alert when the value of duration exceeds a certain threshold (say, 500 seconds), you can add this at the end:

| where duration>500
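
For comparison, here is a rough sketch of another way to measure the same stop-to-start gap without relying on reset_after. It pairs each "running" event with the event immediately before it on the same host and keeps only the pairs where the previous state was "stopped". This is an illustration under assumptions, not a tested answer: it assumes the Message text contains "entered the ... state" as quoted elsewhere in this thread and that only one service matches the base search. The field names State, prev_state, and prev_time and the 60-second threshold are made up for the example.

```spl
"The SunGard Investran Scheduling Service"
| rex field=Message "entered the (?<State>[a-zA-Z]+) state"
| search State=stopped OR State=running
| sort 0 _time
| streamstats current=f window=1 global=f last(State) AS prev_state last(_time) AS prev_time BY host
| where State="running" AND prev_state="stopped"
| eval duration=_time-prev_time
| where duration>60
| table host _time prev_time duration
```

Here current=f with window=1 makes streamstats report the values from the preceding event only, and global=f keeps a separate window per host. Note that this only produces a row once the service has started again, so it cannot flag a service that is still stopped.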
0 Karma

carlyleadmin
Contributor

Elliot, when I run your search I get an error:

"Error in 'streamstats' command: The argument 'reset_after=(like(Message, "%running%"))' is invalid."

0 Karma

carlyleadmin
Contributor

OK, maybe I am going about this the wrong way. I have this service, and I want to be alerted when it stops and does not start again automatically. I know the stop-start cycle takes no more than 5 seconds. I don't want to be alerted every time the service stops, only when it stops and does not start within 5 or 10 seconds. When that happens, the only entry in the System or Application log is the "stopped state" event; the "running state" event will not appear unless I manually start the service. If the "running state" event is already in the logs, the service recovered within the 5-second grace period and everything is okay, so I don't want those entries.

So my question is: would it be possible to set an alert on the above requirement (the 5-second rule), given that the running state is not in the logs after the service stops? How would Splunk know that the service stopped and did not start within 5 or 10 seconds?

I hope this makes sense.

0 Karma

robgora_deloitt
Path Finder

Are you using WinHostMon or WMI to get the status of the service?

0 Karma

carlyleadmin
Contributor

I am using WMI

0 Karma

robgora_deloitt
Path Finder

If you did an eval on the message, looking for "service entered the stopped state", you could then look for cases where the service has been in that state for more than 5 minutes, or whatever threshold you want.

That way, if the service has not started within that amount of time, an alert is sent to you.
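
One way this idea might look in a search, as a hedged sketch rather than a verified configuration: derive a state from the Message text with eval, keep the most recent state per host, and return a row only when that state is "stopped" and has lasted longer than the threshold. The base search string is borrowed from elsewhere in this thread. The field names State, last_state, last_change, and seconds_in_state are illustrative, and the 300-second threshold stands in for "5 minutes".

```spl
"The SunGard Investran Scheduling Service"
| eval State=case(like(Message, "%stopped state%"), "stopped", like(Message, "%running state%"), "running")
| where isnotnull(State)
| stats latest(State) AS last_state latest(_time) AS last_change BY host
| eval seconds_in_state=now()-last_change
| where last_state="stopped" AND seconds_in_state>300
```

Saved as a scheduled alert that triggers when the number of results is greater than zero, this would fire only while the last logged state is still "stopped". The search time range needs to be longer than the threshold, and a stop event older than that range would not be seen.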

0 Karma

robgora_deloitt
Path Finder
0 Karma

carlyleadmin
Contributor

Thanks, Robgora. It is a very helpful link, but I am still having a hard time configuring mine.

0 Karma