Solved: Monitoring Splunk with Nagios to make sure it's ru...

gozulin · 03-20-2014

I have a few Splunk indexers and many forwarders, and I'd like to have a Nagios monitor that alerts me when something is broken.

I can think of two possibilities: checking forwarder-side or checking indexer-side. I would prefer forwarder-side.

Forwarder-side:
Checking data read by the Splunk forwarder.
Checking data sent by the Splunk forwarder over the network.

Indexer-side:
Checking on the indexers that the number of log entries for a particular host + source file combination is not 0 over a certain period of time (say, 1 hour).

What is the best way to go?

lguinn2 · 03-20-2014

You could do that with Splunk itself: from your search head, you can search the internal Splunk logs from across your environment, covering forwarders, indexers, etc.

Search to check that forwarders are sending data:

index=_internal source=*metrics.log group=tcpin_connections 
| eval sourceHost=if(isnull(hostname), sourceHost,hostname) 
| rename connectionType as connectType
| eval connectType=case(fwdType=="uf","univ fwder", fwdType=="lwf", "lightwt fwder",fwdType=="full", "heavy fwder", connectType=="cooked" or connectType=="cookedSSL","Splunk fwder", connectType=="raw" or connectType=="rawSSL","legacy fwder")
| eval version=if(isnull(version),"pre 4.2",version)
| rename version as Ver  arch as MachType
| fields connectType sourceIp sourceHost destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server Ver MachType
| eval Indexer= splunk_server
| eval Hour=relative_time(_time,"@h")
| stats avg(tcp_KBps) sum(tcp_eps) sum(tcp_Kprocessed) sum(kb) by Hour connectType sourceIp sourceHost MachType destPort Indexer Ver
| fieldformat Hour=strftime(Hour,"%x %H")

You can play around with this to get it to deliver exactly what you want. As written, it gives an hour-by-hour summary of how much data each forwarder sent to each indexer. The tcpin_connections entries in the internal metrics log don't carry any details about the data that was sent, in terms of sourcetype or host.

Using this as a starting point, you could create a scheduled search that triggers a script which feeds the info to Nagios. Or you could compare the results to a known list of indexers and forwarders, triggering an alert if any are missing.
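As a rough illustration of the "compare to a known list" idea, here is a minimal Python sketch. Python is my choice, not something from the thread, and the host names, the service description splunk_forwarding and the function names are all hypothetical. It assumes you already have the sourceHost values from the search results, and it only builds lines in the format of Nagios's PROCESS_SERVICE_CHECK_RESULT external command; getting them into Nagios (for example, by writing to its external command file) is left out.

```python
import time

# Nagios return codes used for passive check results.
OK, CRITICAL = 0, 2


def missing_forwarders(expected, seen):
    """Return the expected forwarders that did not appear in the search results."""
    return sorted(set(expected) - set(seen))


def passive_check_lines(expected, seen, service="splunk_forwarding", now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line per expected forwarder."""
    timestamp = int(time.time() if now is None else now)
    missing = set(missing_forwarders(expected, seen))
    lines = []
    for host in sorted(expected):
        if host in missing:
            code, output = CRITICAL, "no data received from forwarder"
        else:
            code, output = OK, "forwarder is sending data"
        lines.append(
            f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"
        )
    return lines


if __name__ == "__main__":
    # Hypothetical inventory and hypothetical sourceHost values from the search.
    expected_hosts = ["web01", "web02", "db01"]
    seen_hosts = ["web01", "db01"]
    for line in passive_check_lines(expected_hosts, seen_hosts):
        print(line)
```

Keeping the comparison separate from the line formatting means the same missing-host list could also drive a plain Splunk alert instead of Nagios.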

If you want to create a list of hosts, sources or sourcetypes that have not had new data recently, you could do this:

| metadata type=hosts
| eval timeSinceLastEvent = now() - recentTime
| where timeSinceLastEvent > 3600
| fieldformat timeSinceLastEvent=tostring(timeSinceLastEvent,"duration")
| eval recentTime =strftime(recentTime,"%x %X")
| eval firstTime =strftime(firstTime,"%x %X")
| eval lastTime =strftime(lastTime,"%x %X")

This displays a list of the hosts for which no new data has been indexed in the last hour. The same search can be used for sources or sourcetypes by changing type=hosts on the first line to type=sources or type=sourcetypes.
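If you would rather apply the threshold on the Nagios side, the same "older than one hour" test can be sketched in Python as below. This is only a sketch under assumptions: it presumes you have already pulled a host-to-recentTime mapping (epoch seconds) out of the metadata results by some means, and the function names and sample values are hypothetical. The exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
import time

STALE_AFTER_SECONDS = 3600  # same one-hour threshold as the metadata search


def stale_hosts(recent_times, now=None, threshold=STALE_AFTER_SECONDS):
    """Return {host: seconds since last indexed event} for hosts past the threshold."""
    current = time.time() if now is None else now
    return {
        host: int(current - recent)
        for host, recent in recent_times.items()
        if current - recent > threshold
    }


def nagios_status(stale):
    """Map the stale hosts to an (exit_code, message) pair in Nagios plugin style."""
    if not stale:
        return 0, "OK - all hosts have indexed data within the threshold"
    names = ", ".join(sorted(stale))
    return 2, f"CRITICAL - no recent data from: {names}"


if __name__ == "__main__":
    # Hypothetical recentTime values; a real plugin would exit with `code`.
    now = 1_000_000
    sample = {"web01": now - 120, "db01": now - 7200}
    code, message = nagios_status(stale_hosts(sample, now=now))
    print(code, message)
```

Like the search, the comparison is strict, so a host whose last event was indexed exactly 3600 seconds ago is not yet reported.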

gozulin · 04-07-2015

Thank you very much! That's a big help 🙂
