Greetings Splunkers,
I've seen the "autoLB" load-balancing mode mentioned in a lot of the online documentation, but I haven't found anything that goes into the details of how it actually works, specifically how the forwarder monitors an indexer and decides it is a valid recipient of raw data (e.g. Splunk-specific health checks/heartbeats).
My reasoning is that I would like to monitor the Splunk servers from a monitoring platform to see whether they are UP or DOWN, and while it's possible to script something up, if there's a method the forwarders already use, it would be best to use the same method so the check reflects actual impact.
A similar question was asked HERE (http://splunk-base.splunk.com/answers/8720/best-practice-for-monitoring-indexer-health) with no answers.
Can anyone please shed some light on this, or point me towards some documentation that details this (I did look, but the technical details were very much on the light side).
Regards,
RT 🙂
Have you looked at the Deployment Monitor app? It does some things around indexer health. Basically, health is generally going to come down to looking at the metrics to see whether any data has been indexed recently. I would imagine that forwarders simply try to connect to the indexer and move on to the next one if they don't get the proper ACK. I doubt you want to replicate that, but looking at the indexer's stats would seem to be a reasonable way to determine this.
Here's a boiled down version of what the Deployment Monitor is using to establish some level of monitoring of indexers:
index="_internal" source="*metrics.log" group=per_index_thruput series!="_*" | stats max(_time) as _time sum(kb) as kb by splunk_server | eval status = if(KB==0, "idle", if(parseQ_percentage>50, "overloaded", if(indexQ_percentage>50,"overloaded","normal")))
You could easily run a CLI or API search that checks on event counts from each Splunk server. Obviously, there will be a penalty for running the search. Similarly, you could check on known sources across all servers and report on those event counts by "splunk_server". For example, to check whether all servers are indexing, I could do this search:
index=_internal source=*metrics.log earliest=-2m | stats count by splunk_server
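If you want your monitoring platform to run the same check, a small script against splunkd's REST API could work. This is only a rough sketch, assuming the management port (8089), credentials for a monitoring user, and the Python requests library; none of those specifics come from the thread, so adjust for your environment:

# Rough sketch: ask splunkd for recent per-indexer event counts over the REST API.
# Assumes the management port (8089), a monitoring user, and the "requests" library.
import json
import requests

SEARCH = "search index=_internal source=*metrics.log earliest=-2m | stats count by splunk_server"

def recent_counts(search_host, username, password):
    # The export endpoint streams results back as newline-delimited JSON,
    # so there is no search job to poll afterwards.
    resp = requests.post(
        "https://%s:8089/services/search/jobs/export" % search_host,
        auth=(username, password),
        data={"search": SEARCH, "output_mode": "json"},
        verify=False,  # splunkd uses a self-signed certificate by default
    )
    resp.raise_for_status()
    counts = {}
    for line in resp.text.splitlines():
        if not line.strip():
            continue
        result = json.loads(line).get("result")
        if result:
            counts[result["splunk_server"]] = int(result["count"])
    return counts

if __name__ == "__main__":
    for server, count in sorted(recent_counts("search.example.com", "admin", "changeme").items()):
        print("%s: %d events in the last 2 minutes" % (server, count))

An indexer that reports zero events (or doesn't show up at all) for a few consecutive runs is a good candidate for a DOWN alert.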
Regarding auto load balancing, it works as follows... Assume I have 5 indexers, call them i1, i2, i3, i4, and i5. If my forwarder is running autoLB, it will pick one of them, send to it for the load-balancing interval, then switch to another indexer chosen from the rest of the pool, as sketched below.
Notice that Splunk removes the previously used indexer from the pool, and if the subsequent indexer fails, it is also removed from the pool until success.
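To make that concrete, here is a small illustration of that rotation in Python. This is pseudo-code for the behaviour described above, not Splunk's actual implementation; the only real setting referenced is autoLBFrequency in outputs.conf, which defaults to 30 seconds:

# Illustration only: the rotation behaviour described above, not Splunk's code.
import random
import time

INDEXERS = ["i1", "i2", "i3", "i4", "i5"]
AUTO_LB_FREQUENCY = 30  # seconds; the outputs.conf autoLBFrequency default

def can_connect(indexer):
    # Placeholder for opening a TCP connection to the indexer's receiving port.
    return True

def pick_next(previous):
    # Start with every indexer except the one that was just used.
    pool = [i for i in INDEXERS if i != previous]
    while pool:
        candidate = random.choice(pool)
        if can_connect(candidate):
            return candidate
        # A failed indexer is dropped from the pool until a connection succeeds.
        pool.remove(candidate)
    return None  # nothing reachable right now

def forward_forever():
    current = None
    while True:
        current = pick_next(current)
        if current:
            print("sending to %s for the next %d seconds" % (current, AUTO_LB_FREQUENCY))
        time.sleep(AUTO_LB_FREQUENCY)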
I don't know exactly how autoLB decides if an indexer is "good" or not, but a new feature of 4.2 is "Indexer Acknowledgement" -- http://www.splunk.com/base/Documentation/latest/Deploy/Protectagainstlossofin-flightdata
You could always configure Nagios or its peers to connect to your splunkd/splunkweb on their various ports. Splunkd's management port is HTTP(S), as is Splunkweb. The problem with connecting to the receiving port that forwarders send to is (a) you can't speak the protocol and (b) simply connecting doesn't mean all is well.
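If you just want an UP/DOWN view of splunkd itself, hitting the management port is easy enough. A minimal sketch, assuming port 8089, a monitoring user, and the Python requests library:

# Minimal sketch: probe splunkd's management port instead of the receiving port.
import requests

def splunkd_alive(host, username, password):
    try:
        resp = requests.get(
            "https://%s:8089/services/server/info" % host,
            auth=(username, password),
            verify=False,  # self-signed certificate by default
            timeout=10,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False

A 200 from /services/server/info tells you splunkd is up and answering, but as noted above it says nothing about whether data is actually being indexed, so pair it with one of the event-count checks.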
This is one of those cases where I feel like a Nagios passive check is of value. You could do something as simple as a CLI search every 2-3 minutes on each indexer. You would want something to validate that there is "recent" data showing up for that indexer (use splunk_server=xxxx), and if not, the passive check reports back to Nagios that all is not well. The nice thing about passive checks is if they don't update in a timeframe, Nagios can be set up to assume them to be down.
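A minimal sketch of submitting that passive result, assuming the recent-data check has already been run (for example with the REST search shown earlier, filtered by splunk_server=xxxx); the command-file path and service description here are just placeholders:

# Rough sketch: report an indexer's state to Nagios as a passive check result.
# The command-file path and the service description are placeholders.
import time

NAGIOS_CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

def submit_passive_result(indexer_host, has_recent_data):
    status = 0 if has_recent_data else 2  # 0 = OK, 2 = CRITICAL
    output = "recent data indexed" if has_recent_data else "no recent data indexed"
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;Splunk Indexing;%d;%s\n" % (
        int(time.time()), indexer_host, status, output)
    # The Nagios command file is a named pipe; writing a line submits the result.
    with open(NAGIOS_CMD_FILE, "w") as cmd:
        cmd.write(line)

If the script stops running, the results go stale, and Nagios freshness checking can then flag the service, which is exactly the fallback behaviour described above.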