
Monitoring Critical Device Logging

Path Finder

We want to take an enterprise-level approach to monitoring critical device logging. We have a list of several hundred critical devices that we need to monitor for the presence of logs. The list is stored in a lookup, critical_device_sample, which looks like this:

SplunkHost,PairGroup,HoursLag,Index
10.1.2.x,,1,networkdevice
10.1.1.x,,1,networkdevice
switcha-tx,B,1,networkdevice
switchb-tx,B,1,networkdevice
switcha-ma,,1,networkdevice
neversentlogs,,2,proxy
proxy-ma,,2,proxy
proxy-tx,,2,proxy
proxy1-ga,A,1,proxy
proxy2-ga,A,1,proxy

The search:

| inputlookup critical_device_sample
| eval SplunkHost=lower(SplunkHost)
| join type=outer SplunkHost
    [ | metadata index=proxy index=networkdevice type=hosts
      | rename totalCount as Count, host as SplunkHost, lastTime as "Last Event"
      | eval actualhourslag=(now()-'Last Event')/60/60
      | eval SplunkHost=lower(SplunkHost) ]
| fieldformat "Last Event"=strftime('Last Event', "%c")
| where actualhourslag>HoursLag OR isnull(actualhourslag)

The above query works and shows critical devices that aren't reporting, but it needs some improvement. Specifically, we want to pair primary and backup devices using the PairGroup field. If the primary is down and the backup has taken over, we don't care, provided we are still receiving logs from one device in the pair. Detecting that a failover happened can be a separate query. If PairGroup is empty, the device is assumed to have no backup; the lookup table can also be modified to use the explicit word "nobackup" if that is easier.

One minor point requires an assumption: devices in a pair group should have the same HoursLag and Index. I know this is a leap, so if they differ, feel free to take the min value for HoursLag and the max(?) value for Index. Finally, there is no need to assume there are tertiary devices. However, if you can handle three devices sharing a PairGroup (for example, three "C" entries), fantastic.

We would like to search only the indexes named in the lookup, rather than keeping a hard-coded list in the query in sync with it. Any ideas on how to approach this problem? I'd be happy to see someone write out these queries, but pointing me in the right direction would help just as much. I'm fairly new to Splunk, so trying to decide between map, subsearches, joins, and appendpipe is enough of a rabbit hole...

Thank you in advance.


Builder

Hi @antb ,
I would use tstats instead of metadata for this one.

You could do something like this:


| tstats summariesonly=t count
where ( earliest=@d-7d latest=@m )
( [ | inputlookup critical_device_sample | stats count by Index | table Index | rename Index as index | format ] )
( [ | inputlookup critical_device_sample | stats count by SplunkHost | table SplunkHost | rename SplunkHost as host | format ] )
by _time host span=1d
| stats max(_time) as maxtime by host
| eval timecheck = relative_time(now(), "@d")
| where maxtime < timecheck

This search quickly finds events for the listed hosts over the last 7 days, then keeps only the hosts whose most recent day with events is before today. It is limited to the indexes and hosts found in your critical_device_sample lookup.


I would also change the format of your lookup to have the following fields:

SplunkHost,Group,Priority,HoursLag,Index


The reason for this is that you want to be able to look up whether a device is in a particular group and what its priority is. You could then adjust the above search to use this lookup and add further value/insight.
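
As one possible way to adjust the search for pairs, here is a minimal sketch rather than a tested solution. It assumes the lookup keeps its original columns (SplunkHost, PairGroup, HoursLag, Index); with the renamed layout above, substitute Group for PairGroup. PairKey and Members are made-up field names for illustration, and the search time range is assumed to be at least as long as the largest HoursLag. Each device is folded into its pair group (or stands alone if PairGroup is empty), and a group is reported only when the most recent event across all of its members is older than the group's smallest HoursLag, or when no member has logged at all in the time range.

```spl
| tstats latest(_time) as lastTime
    where [ | inputlookup critical_device_sample | stats count by Index | rename Index as index | fields index | format ]
    by host
| eval SplunkHost=lower(host)
| stats max(lastTime) as lastTime by SplunkHost
| append
    [ | inputlookup critical_device_sample
      | eval SplunkHost=lower(SplunkHost)
      | eval PairKey=if(isnull(PairGroup) OR PairGroup="", SplunkHost, "pair:".PairGroup)
      | fields SplunkHost PairKey HoursLag ]
| stats max(lastTime) as lastTime values(PairKey) as PairKey min(HoursLag) as HoursLag by SplunkHost
| where isnotnull(PairKey)
| stats max(lastTime) as lastTime min(HoursLag) as HoursLag values(SplunkHost) as Members by PairKey
| eval actualhourslag=round((now()-lastTime)/3600, 2)
| where isnull(lastTime) OR actualhourslag>HoursLag
```

Because the grouping is done with stats by PairKey, a group with three members is handled the same way as a pair, and hosts that are not in the lookup are dropped by the isnotnull(PairKey) filter.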


SplunkTrust

Hi antb,
I see some problems in your search:
First, your subsearch has two index= conditions; maybe you forgot an OR between them?
Second, your subsearch probably returns more than 50,000 results, which is the subsearch limit, so you are probably not processing all of the results. Put the search as the main search and the lookup as the subsearch instead.
Finally, join is a very slow command, and it's better to replace it with stats.

So I suggest changing your search to something like this:

| metadata index=proxy OR index=networkdevice type=hosts
| eval SplunkHost=lower(host)
| stats count BY SplunkHost
| append [ 
     | inputlookup critical_device_sample 
     | eval SplunkHost=lower(SplunkHost), count=0
     | fields SplunkHost, count, PairGroup
     ]
| stats values(PairGroup) AS PairGroup sum(count) AS Total BY SplunkHost
| where Total=0
| fillnull PairGroup value="nobackup"

This way, the servers with no logs are listed, and you know whether each resulting host has a backup. If you run the search every five minutes over a five-minute period, you have a quick search and an alert.

Your minor request isn't clear to me; could you describe it in more detail?

Bye.
Giuseppe


Path Finder

Hi Gcusello,

It may seem strange, but metadata accepts multiple indexes, as documented here: https://docs.splunk.com/Documentation/Splunk/7.3.0/SearchReference/Metadata

Additionally, I don't see the HoursLag field being taken into account. This is our tolerance: it lets us allow some time between log entries. Some devices may only report once a day, while for others we want an alert if more than 1 hour passes without logs.
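
A minimal sketch of how the HoursLag tolerance might be layered onto the append/stats pattern suggested above, offered as an illustration under assumptions rather than a verified query. It relies on the lastTime field that metadata returns for each host; actualhourslag is the same illustrative field name used in the original search. A host is reported when it has never been seen (no lastTime) or when its last event is older than its HoursLag. Pairing is not handled here.

```spl
| metadata type=hosts index=proxy index=networkdevice
| eval SplunkHost=lower(host)
| stats max(lastTime) as lastTime by SplunkHost
| append
    [ | inputlookup critical_device_sample
      | eval SplunkHost=lower(SplunkHost)
      | fields SplunkHost PairGroup HoursLag ]
| stats max(lastTime) as lastTime values(PairGroup) as PairGroup min(HoursLag) as HoursLag by SplunkHost
| where isnotnull(HoursLag)
| eval actualhourslag=round((now()-lastTime)/3600, 2)
| where isnull(lastTime) OR actualhourslag>HoursLag
| fillnull value="nobackup" PairGroup
```

The isnotnull(HoursLag) filter keeps only hosts that appear in the lookup, so unrelated hosts returned by metadata do not show up in the results.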

That said, I am going to take a look right now at the query you've suggested and let you know if this solved my issue in the next day or so. Appreciate the response!

As for the second (minor) request, I meant that the metadata search should pull the index names from the lookup table, as opposed to updating the query each time. That's a minor one, though, so no worries.
