We are trying to create a summary index search so that we can record the number of events per day per host. I would use the following search, but it takes too long to run:
sistats count by host
Additionally, I tried the metrics.log approach. However, because eps is just an average it is not accurate, and since we are monitoring over 500 hosts I am not sure maxseries in limits.conf could safely be increased that high.
index=_internal source=*metrics.log group="per_host_thruput" | sichart avg(eps) by series
Does anyone know of a more efficient and accurate way of measuring the event count per host?
Your help is always appreciated 🙂
I-Man
There are two different timestamps to keep in mind when looking at this kind of statistic: (1) the event's timestamp, which is the date/time information that Splunk extracts from an event, or (2) the time when the event was indexed.
Unfortunately, there are some gotchas (or limitations) with trying to capture stats with either of these fields. Normally this isn't a big problem, but it's important to be aware of the differences and know the sometimes hidden assumptions behind either approach.
If you create stats based on event timestamp, then you have no option other than to search over all of your events for an entire period of time. This can be resource intensive, although there are some tricks that help, as jrodman commented. One potential issue is events that are sometimes delayed beyond your summary interval. (If you're summarizing daily, and some events are delayed for a couple of days because of a down forwarder, then they will not be included in your count because they didn't exist when the summary indexing saved search ran.)
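To make the delayed-event problem concrete, here is a minimal Python sketch. It is not Splunk code: the event tuples, host names, dates, and the `summarize_day` helper are all invented for illustration. It counts events per host for one day by event timestamp, but only sees events that had already been indexed when the summary search ran.

```python
from collections import Counter
from datetime import datetime

# Hypothetical events: (host, event timestamp, time it was indexed).
EVENTS = [
    ("web01", datetime(2011, 3, 1, 10, 0), datetime(2011, 3, 1, 10, 0)),
    ("web01", datetime(2011, 3, 1, 23, 50), datetime(2011, 3, 1, 23, 51)),
    # This host's forwarder was down, so the event arrived two days late.
    ("db01", datetime(2011, 3, 1, 12, 0), datetime(2011, 3, 3, 9, 0)),
]

def summarize_day(events, day_start, day_end, run_time):
    """Count events per host whose timestamp falls in [day_start, day_end),
    using only events that had been indexed by run_time."""
    counts = Counter()
    for host, event_time, index_time in events:
        if day_start <= event_time < day_end and index_time <= run_time:
            counts[host] += 1
    return dict(counts)

day_start = datetime(2011, 3, 1)
day_end = datetime(2011, 3, 2)

# The summary search runs shortly after midnight: db01's event is not there yet.
at_schedule = summarize_day(EVENTS, day_start, day_end, datetime(2011, 3, 2, 0, 5))
# Re-running days later over the same window picks up the late event.
later = summarize_day(EVENTS, day_start, day_end, datetime(2011, 3, 4))
```

In this toy data the scheduled run records nothing for db01, while a later run over the same window would count its one event. That gap is what a daily summary silently bakes in.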
If you go with stats based on index time, then trying to directly get your stats using a search becomes very painful. This is because _indextime is not optimized for searching like "_time" is. However, you do have many more options with this approach, like using metrics as you pointed out. (Metrics can be quite complicated and are prone to misinterpretation. And I agree that trying a "maxseries" of 500 probably isn't going to look pretty, so this probably isn't an option for you.)
Another approach that looks at index time would be to capture bucket stats each day (probably in a lookup file), then do a daily comparison and summary index the delta as the number of events that were indexed per host.
At a basic level you could start with | metadata type=hosts, but you may need to capture this at a per-bucket level to account for buckets aging out. This could be complicated to implement, but it should be really fast, and you could conceivably run it at any interval you wanted (weeks, days, hours, minutes). It's certainly nice not to have to search over all of your events.
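The snapshot-and-compare idea can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a Splunk implementation: the `host_deltas` helper, the host names, and the counts are hypothetical, and the snapshots stand in for whatever cumulative per-host totals you save each day.

```python
def host_deltas(previous, current):
    """Return the per-host change in cumulative event count between two
    snapshots. Hosts missing from `previous` are treated as starting at zero."""
    return {host: total - previous.get(host, 0) for host, total in current.items()}

# Hypothetical snapshots of cumulative totals per host (e.g. saved daily to a lookup).
yesterday = {"web01": 1000, "db01": 400}
today = {"web01": 1250, "db01": 380, "app01": 75}

deltas = host_deltas(yesterday, today)

# db01's total shrank because an old bucket aged out, so its delta is negative
# even though it may have indexed new events. This is the per-bucket caveat.
suspect = sorted(host for host, delta in deltas.items() if delta < 0)
```

A negative delta is a sign that data was removed between snapshots, which is why host-level totals alone can undercount.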
Update:
I did some additional thinking about the metadata/bucket-based approach. I thought perhaps | dbinspect would help, but it simply gives a count per axis (host/source/sourcetype), so there isn't a breakdown of the events per host. Having said that, this information is available in the Hosts.data files in each bucket, so it would be possible to write a custom search command to capture this info for this kind of stats collection. This would effectively handle the issue of bucket rotation, but wouldn't help with deleted events; then again, there really isn't a good way to account for that in any of these situations.
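As a rough sketch of how per-bucket baselining could handle rotation, the Python below compares two snapshots keyed by (bucket, host). It assumes some parser has already turned each bucket's Hosts.data into per-host counts; that parser, the `indexed_since` helper, and the bucket ids are hypothetical and not part of Splunk.

```python
from collections import defaultdict

def indexed_since(previous, current):
    """Sum per-host count increases across buckets still present in `current`.
    Buckets that disappeared (aged out) are ignored rather than subtracted."""
    totals = defaultdict(int)
    for (bucket, host), count in current.items():
        totals[host] += count - previous.get((bucket, host), 0)
    return dict(totals)

# Hypothetical baselines keyed by (bucket id, host).
previous = {
    ("db_1", "db01"): 100,
    ("db_2", "db01"): 300,
    ("db_2", "web01"): 1000,
}
# db_1 has since been rotated out; db_3 is a new bucket.
current = {
    ("db_2", "db01"): 320,
    ("db_2", "web01"): 1100,
    ("db_3", "db01"): 60,
    ("db_3", "web01"): 150,
}

per_host = indexed_since(previous, current)
```

Because the rotated bucket is simply skipped, db01 still shows a positive count here. Events deleted from a bucket that is still present would reduce the figure, matching the limitation noted above.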
If the straightforward "sistats" approach is working, stick with that. But if your volume increases or you need a more creative solution, baselining and tracking changes to Hosts.data may be a more efficient solution to consider.
Lowell, thanks so much for taking the time to explain the various ways of accomplishing this. I will stick with "sistats" for now, however if our traffic increases then I will definitely look into the hosts.data files. Thanks again!
Check the comments: an actual scheduled summary report runs much faster than an si command in the flashtimeline view. The saved summary search runs fast enough that I should be able to collect all the stats I need efficiently using sistats count by host.
I see what you're saying regarding the fields being extracted in flashtimeline vs. a summary report. I ran the "sistats count by host" command over the last 60 min in the flashtimeline view and it took about 40 min to run. When I scheduled the saved summary report, it took less than 3 minutes to run. This is efficient and will definitely work. Thanks for the explanation jrodman!
Be sure you're evaluating the speed of the search in context. In a summary report, the fields will not be extracted. In the flashtimeline view in the UI, they always are.
Try the 'advanced charting' view for a better sense.
The time window is set to the 1 day window, right?
