In my office we have a script on our log servers that monitors the hosts sending logs and alerts us if a machine starts pumping out an inordinate amount of logs. I'm trying to figure out whether it's possible to move this into Splunk and get rid of yet another hand-rolled script. My concern, though, is that this would have to be a batch job run once or twice a day, thus losing the real-time alerting that we get now. So I'm wondering: is there a way to set things up so that when new logs come in, the count of logs from that host is checked against some threshold that I set?
You'd have to more or less roll this yourself, e.g., run a search every 5 minutes and look at the last x minutes to see whether the number has changed drastically.
However, in the Splunk world you often tend to run your indexing system somewhere near capacity, so during a spike it can go over capacity, causing lag in the indexed data, which might make your volume appear to go down.
Options:
E.g., a search of sourcetype=foo host=bar | stats count as event_count | where event_count>50000 OR event_count<200. This would emit one result only if the count is outside the threshold, so you could make your alert condition be more than 0 events.
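As a sketch of the per-host variant the question asks about (the sourcetype and the thresholds here are placeholders you'd tune for your own data), a saved search along these lines, scheduled every 5 minutes, would keep the alerting close to real time:
sourcetype=foo earliest=-5m@m latest=@m | stats count as event_count by host | where event_count>50000 OR event_count<200
Set the alert condition to trigger when the number of results is greater than 0; each row returned is a host whose count for that window fell outside the threshold.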
I've found that a good way to track this kind of information is by leveraging the indexing metrics on the Splunk indexer instance. The indexing metrics are captured every 30 seconds for the top 10 source, sourcetype, index, and host series. (You can raise the number of series in the limits.conf file, under the [metrics] section.) Using metrics, you can look at and monitor trends by total kb, total events, average events per second, or the number of times a specific series shows up in a sample time frame. (There are more ways to split this out; those are just a few stats that I've found helpful.)
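For example (just a sketch of the idea; swap per_host_thruput for per_source_thruput or per_sourcetype_thruput depending on which dimension you care about), a quick way to eyeball those trends is:
index=_internal source=*metrics.log* group=per_host_thruput | timechart span=5m sum(kb) by series
And if the default top-10 series per sample isn't enough, the limits.conf change looks something like the following (I believe maxseries is the relevant setting):
[metrics]
maxseries = 20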
We have an email-alerting saved search set up to run every 5 minutes (from -6m@m to -1m@m) that uses the 'source' metrics to point out any log files that are becoming too chatty (which may lead to exceeding our license usage, the primary scenario this search was set up to alert us about).
Here is a slimmed-down version of our search:
index=_internal sourcetype=splunkd Metrics "group=per_source_thruput" NOT (splunk ("metrics.log" OR splunkd.log OR web_access.log OR splunkd_access.log)) NOT "/var/log/ftpd.debug" NOT wineventlog:security | stats avg(eps) as eps, count, sum(kb) as total_kb, avg(kbps) by series | search eps>=3 count>=3 total_kb>100
The list of NOTs is used to exclude known sources that produce a large amount of logs and are also unlikely to be runaway inputs (as opposed to some other poorly written apps that are likely to get stuck in an infinite loop and write out a few gigs of nonsense in a 5-minute window if you don't kill them fast enough). The limiting search at the end was more trial and error than anything. The idea is that I only want to know about repeat offenders (count>=3), don't bother me with anything too small (total_kb>100), and we have to be seeing a certain sustained level of activity (eps>=3). This approach is not at all adaptive based on series, other than to filter them out. But that's a limitation of my search, not the logs.
PLEASE NOTE: The search provided here is only an example, so don't just copy it, run it, and expect to get sane results on your system. It is simply one possible way of tracking your indexing/usage patterns, and merely a starting point, not a solution.
I've been meaning to overhaul this alert. We take aggregate snapshots of the indexing metrics info that end up in the summary index at 15-minute and 24-hour intervals. (This was set up for long-term indexing analysis, since _internal is very short-lived.) I would like to find a way of combining these snapshots with live metrics events to build an alert that flags any series exceeding some baseline value established from previous activity, say the last week. For example, make the alert report any series that is exceeding the previous average by 1.5 standard deviations. (This would take some work, but it seems doable... If I come up with something helpful, perhaps I'll update this post.)
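For what it's worth, a rough, untested sketch of that comparison using only the live metrics events (ignoring the summary index piece for the moment, and treating the one-week baseline and the 1.5x multiplier purely as example values) might look something like:
index=_internal source=*metrics.log* group=per_source_thruput earliest=-7d | eval window=if(_time >= relative_time(now(), "-5m@m"), "recent", "baseline") | stats avg(eval(if(window=="baseline", eps, null()))) as base_avg stdev(eval(if(window=="baseline", eps, null()))) as base_stdev avg(eval(if(window=="recent", eps, null()))) as recent_eps by series | where recent_eps > base_avg + 1.5 * base_stdev
In practice you'd want the baseline half to come from the summary index rather than searching a week of _internal every 5 minutes, but the shape of the comparison would be the same.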
That's a good point. I've updated the post to add a disclaimer.
I certainly did not intend to present this as a reusable solution; I was simply pointing out a starting point that leverages the indexing metrics. Hopefully the post is clearer about that now.
This is a pretty efficient and informative approach (we've already collected the data), under the presumption that you have sufficient indexing capacity to absorb at least some part of an unusual rise. In a tightly run shop that's probably true, but I'm hesitant to prescribe it as a general approach for all Splunk instances.