
Monitoring Splunk with Nagios to make sure it's running/forwarding/indexing

gozulin
Communicator

I have a few Splunk indexers and many forwarders, and I'd like a Nagios check that alerts me when something is broken.

There are two approaches I can think of: checking on the forwarder side or on the indexer side. I would prefer the forwarder side.

Forwarder-side:
- Check that data is being read by the Splunk forwarder.
- Check that data is being sent by the Splunk forwarder over the network.

Indexer-side:
- Check that the number of events indexed for a particular host + source file combination is not 0 over a certain period of time (say, 1 hour).

What is the best way to go?

1 Solution

lguinn2
Legend

You could do this with Splunk itself - from your search head, you can search the internal Splunk logs of all the forwarders, indexers, etc. in your environment.

Search to check that forwarders are sending data:

index=_internal source=*metrics.log group=tcpin_connections 
| eval sourceHost=if(isnull(hostname), sourceHost,hostname) 
| rename connectionType as connectType
| eval connectType=case(fwdType=="uf","univ fwder", fwdType=="lwf", "lightwt fwder",fwdType=="full", "heavy fwder", connectType=="cooked" or connectType=="cookedSSL","Splunk fwder", connectType=="raw" or connectType=="rawSSL","legacy fwder")
| eval version=if(isnull(version),"pre 4.2",version)
| rename version as Ver  arch as MachType
| fields connectType sourceIp sourceHost destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server Ver MachType
| eval Indexer= splunk_server
| eval Hour=relative_time(_time,"@h")
| stats avg(tcp_KBps) sum(tcp_eps) sum(tcp_Kprocessed) sum(kb) by Hour connectType sourceIp sourceHost MachType destPort Indexer Ver
| fieldformat Hour=strftime(Hour,"%x %H")

You can play around with this to get it to deliver exactly what you want. As written, it gives an hour-by-hour summary of how much data each forwarder sent to each indexer. Note that the internal metrics log doesn't include any details about the data itself, such as sourcetype or host.

Using this as a starting point, you could create a scheduled search that triggers a script which feeds the results to Nagios. Or you could compare the results to a known list of indexers and forwarders and trigger an alert if any are missing - see the sketch below.
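
Here is a minimal sketch of the "compare to a known list" idea. It assumes a hypothetical lookup file expected_forwarders.csv with a hostname column listing every forwarder that should be reporting - the lookup name and column are placeholders you would create yourself, not something Splunk ships with:

| inputlookup expected_forwarders.csv
| join type=left hostname
    [ search index=_internal source=*metrics.log group=tcpin_connections earliest=-1h
      | eval hostname=if(isnull(hostname), sourceHost, hostname)
      | stats count by hostname ]
| where isnull(count)

Any rows returned are expected forwarders that have sent nothing in the last hour. Saved as an alert that fires when the result count is greater than zero, the trigger action could run the script that pushes the status to Nagios.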

If you want to create a list of hosts, sources or sourcetypes that have not had new data recently, you could do this:

| metadata type=hosts
| eval timeSinceLastEvent = now() - recentTime
| where timeSinceLastEvent > 3600
| fieldformat timeSinceLastEvent=tostring(timeSinceLastEvent,"duration")
| eval recentTime = strftime(recentTime,"%x %X")
| eval firstTime = strftime(firstTime,"%x %X")
| eval lastTime = strftime(lastTime,"%x %X")

This displays a list of the hosts for which no new data has been indexed in the last hour. The same search works for sources or sourcetypes - just change the type= on the first line, as in the example below.
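
For example, to find sourcetypes that have gone quiet instead of hosts, only the first command changes (type= accepts hosts, sources, or sourcetypes):

| metadata type=sourcetypes
| eval timeSinceLastEvent = now() - recentTime
| where timeSinceLastEvent > 3600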

gozulin
Communicator

Thank you very much! That's a big help 🙂
