The "MS Exchange" app does a good job at reporting on services that are down; however, what if the service is down intentionally? What is the best way to tell the app NOT to alert on an intentionally stopped service?
The use case: an Exchange 2010 environment with no "Edge" servers, where the MSExchangeEdgeSync service has been intentionally disabled. The service is still installed, so when it is queried via PowerShell, the "MS Exchange" app reports it as down and the dashboards alert accordingly.
I can see several ways to achieve the goal (for example, adjusting the search or the eventtype); however, these approaches do not easily accommodate a mixed environment where the service is intentionally stopped on some systems but running (and therefore should be monitored) on others.
Instead of handling this in Splunk, is it better to have the "MS Exchange" admin remove the stopped service from the system?
Advice appreciated.
I had a similar situation where a client had some services intentionally disabled, and didn't want to have those services raise alerts on the front page (i.e. In the "Service Availability" panel).
The workaround I put in place was to:
1. Create a lookup table containing a list of the services the client doesn't want flagged as problematic.
I created a lookup called msx_service_exclusions.csv with the following format:
service_name,service_status,comment
MSExchangeEdgeSync,disabled,Requested to be disabled by client as this is not enabled on the client access servers (20130328-0931 R.Turk)
The reason why I put the comment field in will become clear later. You could also clean this up a bit more with formal lookup definitions, but I'm trying to keep it simple... ish.
2. Create custom (local) versions of the saved searches that generate the dashboards/alerts that detect the conditions you want ignored.
The reason you want this to be local is that when newer versions of the Splunk for Microsoft Exchange app come out, you don't want to blow away your customisations (plus you shouldn't really be making any changes in the default directory anyway...)
File: $SPLUNK_HOME/etc/apps/Splunk_for_Exchange/local/savedsearches.conf
The search should be on one line without breaks... I've split it out for readability. My customisations are indented.
[Static Health Overview - Service Availability]
search = eventtype=msexchange-topology
|stats latest(ServicesNotRunning) as ServicesNotRunning by Name
|eval Service=split(ServicesNotRunning,",")
    | lookup msx_service_exclusions.csv service_name AS Service OUTPUT service_status
    | search NOT service_status="disabled"
|eval ServiceCount=if(ServicesNotRunning!="",mvcount(Service),0)
|table Name,Service,ServiceCount
|addcoltotals fieldname=Service labelfield=Name label="# Problem Services"
|eval Service=if(Name="# Problem Services",ServiceCount,Service)
|search Name="# Problem Services" OR ServiceCount>0
|table Name,Service
|sort - Name
The first customisation creates a field service_status for all results returned by the search and sets it to disabled for every service you have listed in the lookup (leaving the rest NULL).
The second line keeps only the results that do NOT have this value of "disabled", effectively filtering out the unwanted results. The search [search NOT service_status="*"] would also work, as only matching services will have a value for service_status.
This should give you the desired result of no longer displaying disabled services. Add/rinse/repeat for scheduled searches & dashboard panels that also report on services that you don't want monitored.
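One caveat worth checking in your own environment: Service is a multivalue field after the split, so the lookup can return a multivalue service_status, and NOT service_status="disabled" may then drop the whole row for a host, including any other stopped services you do care about. A minimal, untested sketch of a per-service variant is below. It assumes the same field names as the search above and only covers the filtering part, so the later ServiceCount steps would need adjusting to match.

```
eventtype=msexchange-topology
| stats latest(ServicesNotRunning) as ServicesNotRunning by Name
| eval Service=split(ServicesNotRunning,",")
| mvexpand Service
| lookup msx_service_exclusions.csv service_name AS Service OUTPUT service_status
| where isnull(service_status) OR service_status!="disabled"
| stats values(Service) as Service by Name
```

The idea is to expand to one row per host and service, drop only the excluded rows, and then regroup by host.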
For cases where the service is only disabled on some hosts, you just change the lookup & the search accordingly. Example:
Lookup: msx_service_and_host_exclusions.csv
host,service_name,service_status,comment
msx_mbxsvr01,MSExchangeEdgeSync,disabled,Requested to be disabled by client as this is not enabled on the client access servers (20130328-0931 R.Turk)
Search (only changes):
....
| lookup msx_service_and_host_exclusions.csv host service_name AS Service OUTPUT service_status
| search NOT service_status="disabled"
....
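On the formal lookup definitions mentioned in step 1: a minimal sketch of what that could look like is below, assuming the CSV sits in the app's lookups directory. The stanza name is my own choice for illustration, not something the app ships with.

```
# $SPLUNK_HOME/etc/apps/Splunk_for_Exchange/local/transforms.conf
# Stanza name is an assumption; use whatever fits your naming scheme.
[msx_service_exclusions]
filename = msx_service_exclusions.csv
```

With a definition like this in place, the search can refer to the lookup by its stanza name instead of the file name, e.g. | lookup msx_service_exclusions service_name AS Service OUTPUT service_status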
Optional
To make things pretty, and so your Exchange admins don't forget which services they are intentionally disregarding (and why), you could create a simple dashboard (e.g. Service Exclusions) that displays the msx_service_exclusions.csv lookup in table form, including the reason each service has been excluded (which may change over time).
| inputlookup msx_service_exclusions.csv | fields service_name, service_status, comment
Also creating a scheduled monthly email alert to your Exchange admins with the same search above will remind them what's being filtered and why (in case things change).
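A rough sketch of such a scheduled search is below. The stanza name, schedule and recipient address are placeholders I made up for illustration, so adjust them to suit; it would go in the same local savedsearches.conf as above.

```
# Hypothetical stanza name and recipient; change both for your environment.
[Exchange Service Exclusions - Monthly Reminder]
search = | inputlookup msx_service_exclusions.csv | fields service_name, service_status, comment
enableSched = 1
# 08:00 on the first day of each month
cron_schedule = 0 8 1 * *
action.email = 1
action.email.to = exchange-admins@example.com
action.email.sendresults = 1
```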
EDIT
From what I can tell (not really being a Windows guy), the PowerShell script get-hoststats.ps1 in the TA-Exchange-2010-HubTransport app is what checks for running and not-running services that are registered on the host. If you had a way to safely remove the service from the system, the script would no longer see it, so it would stop being reported to Splunk as down. Example event below.
I would manually run the script in a test environment before rolling out though.
All that being said, IANAEA - (I Am Not An Exchange Administrator)
Hope this helps 🙂
I added a bit at the end - essentially it's going to boil down to the level of comfort your Exchange admin (maybe you?) has in making these changes. Historically I would say the path of least resistance is making the change at the reporting level (i.e. Splunk).
(also the saving you'd make in terms of licensed usage from making the change on the Exchange server would be negligible)
R. Turk,
Thanks, this is a very complete explanation of how to make this change in Splunk; however, a big part of my question was should this change be made in Splunk or is there a better way, such as disabling or removing the service from the Exchange server entirely.
Thanks again.