We have a job that runs on all hosts every 5 minutes and writes a completed message once it finishes. We treat that message as the sign of a successful run. I was able to create an alert based on it (trigger when the number of hosts is not equal to 100). The email lists every host where the message was found, so I have to work out myself which hosts the job did not run on.
I want to set up an alert that sends an email with the hostnames on which the job did not complete.
search: host=tm1-dc-cc-* "Completed monitor recovery service"
host=tm1-dc-cc-* "Completed monitor recovery service" --> needs to trigger an alert when the host count is not equal to 100.
host=tm1-dc-cc-* "Completed monitor recovery service"
| stats values(host) dc(host) as count
| where count!=100
Here, where count!=100 keeps the result row only when the host count differs from 100, which is what triggers the alert. Please accept the answer if it helps!
I believe you can write this kind of alert in two ways.
1) The first is to find when each host last wrote a completed message. If that was not recent, i.e. not within your threshold time, alert on those hosts.
E.g. your job runs every 5 minutes and I'm assuming your alert search does too, so you'd select data for, say, the last 60 minutes, find when each host last wrote a Completed message, and compare that with the current time. The search below returns results when a host has not written a completed message in the last 10 minutes. Your alert condition would be "if number of events is greater than 0".
host=tm1-dc-cc-* "Completed monitor recovery service" | table _time host | dedup host
| eval age=now()-_time | where age>600
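A variant of the same idea, as a rough sketch rather than a tested alert: instead of relying on dedup keeping the newest event per host, it asks stats for the latest timestamp explicitly. The field name last_seen and the earliest=-60m window are assumptions made for illustration; the 600-second threshold is the one used above.

```
host=tm1-dc-cc-* "Completed monitor recovery service" earliest=-60m
| stats latest(_time) as last_seen by host
| eval age=now()-last_seen
| where age>600
| convert ctime(last_seen)
```

Like the search above, this can only report hosts that wrote at least one completed message inside the search window. A host that wrote nothing in the last 60 minutes produces no row at all, which is the gap the lookup approach is meant to cover.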
2) The other option is to keep a lookup table file with all your host names. You can set up a scheduled search to update the lookup table regularly with new servers. Once you have the lookup table, use it in the search to find which hosts have not reported in the given period.
E.g. say you have a lookup table file tm_dc_hosts.csv with a column host. The search below can be your alert search, with the alert condition "if number of events is greater than 0".
[| inputlookup tm_dc_hosts.csv | table host ] "Completed monitor recovery service"
| table host | eval from=2
| append [| inputlookup tm_dc_hosts.csv | table host | eval from=1]
| stats max(from) as from by host | where from=1
@somesoni2 what does eval from=2 actually mean here? Could you please elaborate on the inputlookup query?
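As far as I can tell from the search itself, from is just a marker field. Rows that come from the event search are tagged from=2, and rows appended from the lookup are tagged from=1. After stats max(from) by host, a host present in both sets ends up with 2, while a host present only in the lookup keeps 1, so where from=1 leaves the hosts that never logged the message. The leading [| inputlookup ... | table host ] subsearch expands into a host=... OR host=... filter built from the lookup. Here is a small self-contained sketch of the marker trick, using makeresults and made-up host names (hostA, hostB, hostC are hypothetical) in place of real events and the real lookup:

```
| makeresults
| eval host=split("hostA,hostB,hostC", ",")
| mvexpand host
| eval from=1
| append [| makeresults | eval host=split("hostA,hostC", ",") | mvexpand host | eval from=2]
| stats max(from) as from by host
| where from=1
```

The first three hosts stand in for the lookup and the appended two stand in for hosts that logged the message, so the only row left is hostB, the host that is expected but did not report.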
It didn't work exactly as you suggested, but I was able to make it work with:
host=tm1-dc-cc-* | search NOT [search host=tm1-dc-cc-* "completed monitor recovery" | fields host | format] | stats count by host
Thank you though.
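One caveat with the search above: it only lists hosts that logged some other event in the time range, so a host that logged nothing at all would not show up. A possible way around that, sketched here under the assumption that the tm_dc_hosts.csv lookup with a host column from the earlier answer exists, is to start from the lookup and exclude the hosts that did log the message:

```
| inputlookup tm_dc_hosts.csv
| fields host
| search NOT [search host=tm1-dc-cc-* "Completed monitor recovery service" | dedup host | fields host]
```

The alert condition would stay "if number of events is greater than 0", and each remaining row is a host from the lookup with no completed message in the search window.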
Hi @mahasd
Use the search below:
your search | search NOT "Completed monitor recovery service" | stats c by host
Trigger if the count is greater than 0, and you will also get a list of hosts.
Thanks