I'm working on a search to analyze indexing delays. I'm essentially comparing _time and _indextime to see changes in forwarding and indexing performance. The problem I'm running into is that there are just way too many events to search effectively across a busy host or sourcetype. And I really don't need all the events; I would much rather deal with a small sample of them.
I'm guessing someone will point this out, but I really don't want to go to the effort of doing the whole summary-indexing thing. I only want to run this kind of analysis on demand; it's not worth incurring the constant overhead of searching across all events in the system.
So I was wondering if there is some way to drop out results from the search? This could be somewhat random, or more predictable like "keep only every 10th event". I'm looking for any ideas.
Preferably there would be a way to tell Splunk to do this within the base search, which would eliminate the need to even fetch these events from disk, but I doubt that's possible, so a search command would be good enough.
Any ideas?
Here is one search I'm running into issues with:
host=someforwarder.example.com index=_internal | eval _time=_indextime | sort _time | timechart span="5m" count
With this search, I can't even look at a 4-hour window because it exceeds the 10,000 event count. (And I'm guessing that sort doesn't handle more than 10,000 events.) So the end of the chart gets cut off. I would much rather randomly reduce the events throughout the timeframe so I can see the trend across the full time range.
Update:
I was able to get past my sorting limitation (it's a bit ugly) with the following search:
$basesearch$ | bucket _indextime as _time span=5m | stats count by _time | sort _time | timechart span=5m sum(count) as count
Splunk 6.4 added event sampling. From the release notes:
http://docs.splunk.com/Documentation/Splunk/6.4.0/ReleaseNotes/MeetSplunk
Data Sampling Mode for Dashboard Searches
Efficiently evaluate trends and patterns using sample ratio within search.
Indexed data has no primary key. If your events carry some kind of hash (such as a sha256) that is randomly distributed and available at index time, you can sample by searching for a random prefix of that hash string.
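As a rough sketch of the prefix idea, assume a hypothetical field named hypothetical_hash that holds a lowercase hex digest for each event (no such field exists by default). If the digests are evenly distributed, each two-character hex prefix should match about 1 in 256 events, so picking one prefix at random gives roughly a 0.4% sample:

```
sourcetype=access_combined hypothetical_hash=3f*
| timechart span=1h count
```

Changing the prefix between runs draws a different sample; a one-character prefix would give about 1 in 16 instead.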
If you do not have such a hash at index time, you can create a data model and add one as a calculated field, then apply sampling to the data model.
The general strategy is to have some evenly or randomly distributed key with high cardinality, then do a prefix search on a random string. The _time field already creates 86400 partitions every day, one per second, which the searches below use through the date_hour, date_minute, and date_second fields.
Sample 100 of the 86400 seconds in each day:
sourcetype=access_combined
[| gentimes start=1/1/2000 end=1/2/2000 increment=1s | head 86400 | eval _time=starttime%86400 | eval r=random() | sort 100 r | sort _time
| eval date_hour=floor(_time/3600) | eval date_minute=floor(_time/60%60) | eval date_second=floor(_time%60)
| return 86400 date_hour date_minute date_second]
| timechart count span=1h
Sample 10 of the 3600 seconds in each hour:
sourcetype=access_combined
[| gentimes start=1/1/2000 end=1/2/2000 increment=1s | head 3600 | eval _time=starttime%3600 | eval r=random() | sort 10 r | sort _time
| eval date_hour="*" | eval date_minute=floor(_time/60%60) | eval date_second=floor(_time%60)
| return 86400 date_hour date_minute date_second]
| timechart count span=1h
Sample about 0.1% of the 86400 seconds in each day:
sourcetype=access_combined [| gentimes start=1/1/2000 end=1/2/2000 increment=1s | head 86400 | eval _time=starttime%86400
| eval r=random()%1000 | search r=0 | eval date_hour=floor(_time/3600) | eval date_minute=floor(_time/60%60) | eval date_second=floor(_time%60)
| return 86400 date_hour date_minute date_second] | timechart count span=1h
Sample about 0.1% of the 3600 seconds in each hour:
sourcetype=access_combined [| gentimes start=1/1/2000 end=1/2/2000 increment=1s | head 3600 | eval _time=starttime%3600
| eval r=random()%1000 | search r=0 | eval date_hour="*" | eval date_minute=floor(_time/60%60) | eval date_second=floor(_time%60)
| return 86400 date_hour date_minute date_second] | timechart count span=1h
A poor way to do this might be adding date_second=1 OR date_second=11 OR date_second=21 ... to your base search. That might give skewed results if something generates data every minute or every 10 seconds or whatever, but if the timestamps are effectively random by second, it might be okay. You of course have more options if you do it as a post-processing step: for example, you can add where (_serial % 10) = 0 to get every 10th event, or where (random() % 10) = 0 for a random sample. However, this probably won't speed up your search in any significant way: the time to do a subtraction plus any kind of averaging into a bucket is pretty close to zero compared to the time for getting the data in the first place, so even filtering down to one in a thousand won't get you a very noticeable improvement. It might help with the sort, though I'm not really sure why your example query sorts on the delay and then tries to do a timechart on it.
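A sketch of the random-sample variant, applied to the search from the question. The factor of 10 is only an example, and the final eval scales the sampled count back up to an estimate of the real volume. This filters after the events have been retrieved, so it should not be expected to reduce the time spent reading from disk.

```
host=someforwarder.example.com index=_internal
| where (random() % 10) = 0
| eval _time=_indextime
| timechart span=5m count
| eval count=count*10
```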
timechart has no problems with more than 10,000 events. The 10,000 event limit really only applies to what can be physically displayed and paged through in the GUI results; search commands get all data and all results. The 10,000 item limit you are seeing is just because sort defaults to returning only 10,000 items, but you can override that with sort 0 _time or sort 10000000 _time.
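For example, the original search with the sort limit lifted might look like this (a sketch; the 0 tells sort to return all results):

```
host=someforwarder.example.com index=_internal
| eval _time=_indextime
| sort 0 _time
| timechart span=5m count
```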
I would recommend random() over _serial, just because the eval of _serial can't be efficiently map-reduced, which in turn would force the sort and timechart to be done on the search head only. random() would also avoid possible interactions with cycles in your data (e.g., a log with events that always log the same 10 lines in a row).
I thought about a couple of those options, but I missed the _serial field; I like that idea the best so far. Thanks for your help! And yeah, if there is any kind of internal feature request for getting the core search engine to return only some sort of sample like this, please add me to the list of requesters. I have a few other searches where I think that kind of thing may be beneficial.
To answer your question about my usage of sorting and timechart: first, I'm not actually calculating delay in this specific graph. I'm simply trying to make a "timeline"-like graph based on the index time rather than the event time. I have a view that shows the normal "timeline" right next to the index-timeline, which lets the admin quickly determine whether indexing delays were caused by a forwarder restart, or find a more complex delay pattern. (I had one forwarder that would only forward _internal events every couple of hours. Upgrading to 4.1 seems to have fixed it.)
And yes, it would be nice to be able to get Splunk to return random or otherwise specified samples of large data sets in the base search.