Hello folks,
I have a compliance control requirement to alert when there is a log ingestion failure to Splunk. The desire is to focus at the sourcetype level as opposed to the host level (too many false positives) or the index level (loses granularity as sourcetypes increase). The keys to the requirement are that it dynamically expand as new sourcetypes come online and that the results consider the frequency of events on a per-sourcetype basis. For example, a generic 4-hour window wouldn't suffice for a sourcetype getting multiple events every second, nor would it properly handle a sourcetype that receives events once or twice per day.
I've tried the Meta Woot app, and while it's beneficial for other issues, it does not address the control requirements. Has anyone developed a query with reasonable performance times, or found another app to handle compliance logging failures, that considers the variance in event frequency rather than an absolute window?
Thanks!
The TrackMe app is powerful and would do what you want - requires a bit of investment in time to set it up.
https://splunkbase.splunk.com/app/4621
I've rolled my own with a regular saved search that uses tstats to collect index/sourcetype pairs and saves the results to a lookup, calculating the average latency and min/max gaps between events for each. Alerts then run to check current ingestion against those metrics per index/sourcetype.
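A rough sketch of that baseline search, for illustration (untested - the lookup name, the 7-day window, the 1-hour bucket size, and the 1.5x headroom are arbitrary choices you would tune, and the latency calculation is left out):
| tstats count where index=* earliest=-7d@d latest=@d by index sourcetype _time span=1h
``` tstats only returns non-empty buckets, so the spacing between rows reflects real quiet periods (at 1-hour resolution) ```
| sort 0 index sourcetype _time
| streamstats window=2 range(_time) as gap by index sourcetype
``` Drop the zero gap on each pair's first bucket - pairs with a single bucket fall out here ```
| where gap > 0
| stats min(gap) as min_gap avg(gap) as avg_gap max(gap) as max_gap by index sourcetype
``` Threshold = largest observed gap plus headroom, floored at 10 minutes ```
| eval threshold=max(600, ceiling(max_gap * 1.5))
| table index sourcetype min_gap avg_gap max_gap threshold
| outputlookup baseline.csv
The alert side can then compare now() - last_seen against each pair's threshold.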
There's an investment in time either way - but TrackMe is a good place to start.
Unfortunately, this solution exceeds my budget.
I hadn't realised they had switched to a licensed model.
The basic idea behind a roll-your-own technique is to have a lookup file that contains the index, the sourcetype, and the threshold in seconds within which you need to see data. You can create a simple example from all index/sourcetype pairs seen in the previous hour and give them a threshold of 10 minutes, e.g. like this:
| tstats latest(_time) as last_seen count where index=* earliest=-1h@h latest=@h by index sourcetype
| sort index sourcetype
| table index sourcetype
| eval threshold=600
| outputlookup monitor.csv
Now you have a control set that you use to look for missing data outside the threshold.
Now initialise the results file - it makes the SPL easier if all pairs are present there at the start.
| inputlookup monitor.csv
| fields - threshold
| eval last_seen=now(), missing_data=0
| outputlookup monitor_results.csv
Then you can run this as a scheduled alert at whatever frequency you want - in this example, every minute.
| tstats max(_time) as last_seen count where [ | inputlookup monitor.csv | fields index sourcetype ] earliest=-1m@m latest=@m by index sourcetype
``` We have data so reset missing indicator ```
| eval missing_data = 0
``` Grab all previous results and combine with what we found ```
| inputlookup monitor_results.csv append=t
| fields - threshold
| stats first(*) as * by index sourcetype
``` Get the threshold and see if the last seen exceeds the configured threshold ```
| lookup monitor.csv index sourcetype OUTPUT threshold
| eval exceeds_threshold = if(now() - last_seen > threshold, 1, 0)
``` Now work out if we need to alert - only alert the first time we exceed the threshold ```
| eval alert=if(exceeds_threshold = 1 AND missing_data = 0, 1, 0)
``` Increment the missing data counter to avoid continual alerts ```
| eval missing_data=if(exceeds_threshold = 1, missing_data + 1, missing_data)
``` Write out these results ```
| outputlookup monitor_results.csv
| where alert = 1
The logic is that this will alert the first time an index/sourcetype has not been seen for the given number of threshold seconds.
NB: This is a starting point, but gives you the principles of how to manage it.
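To cover the dynamic-expansion part of the requirement, you could also schedule something like this hourly (a sketch, untested - it reuses monitor.csv from above; new pairs get the 600-second default and any thresholds you have hand-tuned are preserved):
| tstats count where index=* earliest=-1h@h latest=@h by index sourcetype
``` Any pair seen this hour gets the default threshold ```
| eval threshold=600
| fields index sourcetype threshold
``` The existing control set rows come last in the pipeline ```
| inputlookup monitor.csv append=t
``` last() keeps the existing threshold for known pairs and the 600 default for new ones ```
| stats last(threshold) as threshold by index sourcetype
| outputlookup monitor.csv
New pairs should then flow into monitor_results.csv the first time the scheduled alert sees them.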
Hope this helps
Excellent starting point, very much appreciate the suggestion and the level of detail explaining the thought process.
Hi @b17gunnr
I think creating a search yourself might end up being cumbersome and make it hard to cover the variance. Have you seen the Splunkbase app TrackMe?
TrackMe is good for monitoring anomalies in ingestion (per host, sourcetype, etc.) and looks at things like event count, size, frequency, and lag/delay.
This solution is outside my current budget.