Splunk Search

How do I get the total number of events per day per host... Efficiently and accurately?

I-Man
Communicator

We are trying to create a summary index search so that we can record the number of events per day per host. I would use the following search, but it takes too long to run:

sistats count by host

Additionally, I tried the metrics.log approach. However, because eps is only an average, it is not accurate. We are also monitoring over 500 hosts, and I am not sure maxseries in limits.conf could safely be increased that high.

index=_internal source=*metrics.log group="per_host_thruput" | sichart avg(eps) by series

Does anyone know of a more efficient and accurate way of measuring the event count per host?

Your help is always appreciated 🙂

I-Man

1 Solution

Lowell
Super Champion

There are two different timestamps to keep in mind when looking at this kind of statistic: (1) the event's timestamp, which is the date/time information that Splunk extracts from an event, or (2) the time when the event was indexed.

Unfortunately, there are some gotchas (or limitations) with trying to capture stats with either of these fields. Normally this isn't a big problem, but it's important to be aware of the differences and know the sometimes hidden assumptions behind either approach.

If you create stats based on event timestamp, then you have no option other than to search over all of your events for the entire period. This can be resource intensive, although there are some tricks that help, as jrodman commented. One potential issue is events that are sometimes delayed beyond your summary interval. (If you're summarizing daily, and some events are delayed for a couple of days because of a down forwarder, then they will not be included in your count, because they didn't exist when the summary indexing saved search ran.)
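As a rough illustration of the event-timestamp approach, a scheduled summary search could be sketched like this. The index=* scope and the snapped one-day window are assumptions for illustration, not something stated in this thread; adjust them to your own indexes and schedule.

```spl
index=* earliest=-1d@d latest=@d
| sistats count by host
```

Scheduled shortly after midnight with summary indexing enabled, this covers the previous calendar day by event time. Any event for that day that arrives after the search has run is missed, which is the delay problem described above.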

If you go with stats based on index time, then trying to get your stats directly with a search becomes very painful. This is because _indextime is not optimized for searching the way "_time" is. However, you do have many more options with this approach, like using metrics as you pointed out. (Metrics can be quite complicated and are prone to misinterpretation. And I agree that trying a "maxseries" of 500 probably isn't going to look pretty. So this probably isn't an option for you.)
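To make the direct index-time search concrete, here is a minimal sketch using the _index_earliest and _index_latest time modifiers. The index=* scope and the 7-day event-time lookback are assumptions; the lookback only needs to be wide enough to cover your worst-case forwarding delay.

```spl
index=* earliest=-7d@d latest=now _index_earliest=-1d@d _index_latest=@d
| stats count by host
```

The event-time range (earliest/latest) still determines which events are read, and the index-time modifiers then narrow that set to events indexed yesterday. That is why this variant tends to be expensive: a full week of events is scanned to count one day of indexing.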

Another approach that looks at index time would be to capture bucket stats each day (probably in a lookup file), then do a daily comparison and summary index the delta as the number of events that were indexed per host.

At a basic level you could start with | metadata type=hosts, but you may need to capture this at a per-bucket level to account for buckets aging out. This could be complicated to implement, but it should be really fast, and you could conceivably run this at any interval that you wanted (weeks, days, hours, minutes). It's certainly nice not to have to search over all of your events.
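A minimal sketch of that baseline-and-delta idea, using the host-level totals from | metadata rather than per-bucket data. The index name main and the lookup file host_totals_baseline.csv are hypothetical, and referencing the CSV file directly in the lookup command assumes a Splunk version that allows it (otherwise a lookup definition is needed). The first search compares the current totals against the previously saved baseline:

```spl
| metadata type=hosts index=main
| fields host totalCount
| lookup host_totals_baseline.csv host OUTPUT totalCount AS previousCount
| eval events_indexed = totalCount - coalesce(previousCount, 0)
| fields host events_indexed
```

A second scheduled search, run afterwards, overwrites the baseline with the current totals:

```spl
| metadata type=hosts index=main
| fields host totalCount
| outputlookup host_totals_baseline.csv
```

The baseline search has to run once before the comparison search so that the lookup file exists. Because totalCount drops when buckets age out, a host's delta can come out too low or even negative with this simplified version, which is the reason for the per-bucket caveat above.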


Update:

I did some additional thinking about the metadata/bucket based approach. I thought perhaps | dbinspect would help, but it simply gives a count per axis (host/source/sourcetype) so there isn't a breakdown of the events per host. Having said that, this information is available in the Hosts.data files in each bucket, so it would be possible to write a custom search command to capture this info for this kind of stats collection. This would effectively handle the issue of bucket rotation, but wouldn't help with deleted events; then again there really isn't a good way to account for that in any of these situations.

If the straightforward "sistats" approach is working, stick with that. But if your volume increases or you need a more creative solution, baselining and tracking changes to Hosts.data may be a more efficient solution to consider.


0 Karma


I-Man
Communicator

Lowell, thanks so much for taking the time to explain the various ways of accomplishing this. I will stick with "sistats" for now; however, if our traffic increases, I will definitely look into the Hosts.data files. Thanks again!

0 Karma

I-Man
Communicator

Check the comments: an actual scheduled summary report runs much faster than an si command in the flashtimeline view. The saved summary search runs fast enough that I should be able to collect all the stats I need efficiently using sistats count by host.

0 Karma

I-Man
Communicator

I see what you're saying regarding the fields being extracted in flashtimeline vs. a summary report. I ran the "sistats count by host" command over the last 60 min in the flashtimeline view and it took about 40 min to run. When I scheduled the saved summary report, it took less than 3 minutes to run. This is efficient and will definitely work. Thanks for the explanation, jrodman!

0 Karma

jrodman
Splunk Employee

Be sure you're evaluating the speed of the search in context. In a summary report, the fields will not be extracted. In the flashtimeline view in the UI, they always are.

Try the 'advanced charting' view for a better sense.

The time window is set to the 1 day window, right?

0 Karma