I'm currently using a very old deployment monitor search to determine when forwarders are down and it doesn't seem to be working very well in 6.5 (false positives + non alerts). I know the Monitoring Console has some additional functionality.
Does anyone have a specific search for this? I'm hoping to alert if forwarders don't check in for 2-3 mins.
There is a forwarder dashboard in the DMC that you can enable. It has an associated alert that will notify you of missing forwarders. The dashboard will show you forwarder status (active/missing, version, data volume, etc). Note that this dashboard is strictly for forwarders, not data coming in via TCP inputs (there's another dashboard for that). The time period for a forwarder to be considered missing is 15 minutes.
http://docs.splunk.com/Documentation/Splunk/6.5.0/DMC/ForwardersDeployment
Splunk relies on a saved search that looks at the tcpin metrics reported by your indexers to build this dashboard and report on missing forwarders. If you have a lot of forwarders, this search can put a pretty heavy load on your indexers (the setup page in the DMC also warns about this). It also relies on the forwarder GUID to uniquely identify each forwarder. So if you reimage a host but retain the hostname, or reinstall the forwarder for some reason, a forwarder will appear to be missing when it's actually not. To clear up the missing forwarders, you'll need to periodically rebuild the forwarder asset data in the DMC (it's just a button click).
Beyond this, if you want to get to the right search you need to consider how many forwarders you have, the amount of change in your environment, and what you really want to monitor for (i.e., missing forwarders or missing data).
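If you'd rather roll your own alert off the same tcpin metrics, here's a rough sketch of what that could look like. This is not the DMC's actual saved search, just something along the same lines. The 24-hour lookback and the 180-second threshold are my assumptions (picked to match the 2-3 minute goal), and the field names (hostname, guid, sourceIp, version) are what I'd expect to see in the tcpin_connections metrics events, so check them against your own data first.

```
index=_internal source=*metrics.log* group=tcpin_connections earliest=-24h
| stats latest(_time) AS last_seen latest(guid) AS guid latest(sourceIp) AS source_ip latest(version) AS version BY hostname
| eval age_seconds = now() - last_seen
| where age_seconds > 180
| convert ctime(last_seen) AS "Last Seen"
| table hostname, guid, source_ip, version, "Last Seen", age_seconds
| sort - age_seconds
```

Grouping by hostname instead of GUID sidesteps the reinstall/reimage false positives, at the cost of treating two forwarders that share a hostname as one. A forwarder that has been silent for longer than the lookback window won't show up at all, so widen earliest if you need to catch long-dead hosts.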
I'm not sure what you mean when you say you "tried grabbing the search that the DMC uses".
The DMC alert is disabled by default. Did you try enabling it?
Thanks for the lengthy reply. That all makes sense from a monitoring perspective, and I've done a solid amount of research on that side of it.
Specifically, though, I was looking for a best-practice way of being alerted when forwarders are down/missing. I've tried grabbing the search that the DMC uses, but I've had no luck.
I spent a lot of time googling before posting this, but every search I've tried based on my findings has not worked as intended.
If you've got details about what you've tried, I'd love to know what and why it hasn't worked for you.
There is an app called "Broken Hosts" which might help here.
If not, something like this might work:
| metadata type=hosts index=_internal | eval age=now()-recentTime | eval status=if(age<1200,"UP","DOWN") | convert ctime(recentTime) as "Last Active On" | rename age as Age | eval Hour=floor(Age/3600) | eval Minute=floor((Age%3600)/60) | eval Age="-".Hour."h"." : ".Minute."m" | table host, status, "Last Active On", Age | search status=DOWN | lookup dnslookup clienthost AS host | search clientip!=''
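Note I'm using the DNS lookup because we de-register DNS entries when a host is decommissioned; otherwise, just remove the lookup and the final search. The 1200 in the status eval is the threshold in seconds (20 minutes), so lower it if you need faster alerts.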
I've tried a lot of searches, including ones with | metadata, and they all had weirdness. This one actually looks really accurate/promising, and I think I can make it work.
Thanks guys!