Long time Splunk fan here. Initially when we started using Splunk, our queries were simple, and so searches ran fast.
So I loved Splunk.
However, over time, as usage grew, our queries got more complex and our datasets grew large.
Unfortunately, neither Splunk nor any other log analysis product deals with this well (and Splunk probably deals with it best among them, for that matter).
Our "needle in a haystack" class of queries works okay, e.g. user=badman or ip=evil_ip.
We have a fair number of those, and while they have gotten slower as our data has grown massive, they are still tolerable.
To some extent we've built our own apps on top of Splunk that maintain important aggregates, though that still doesn't help ad hoc queries, since those queries don't hit indexes.
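For anyone curious, the aggregate-maintaining pattern we use is essentially summary indexing: a scheduled search rolls counts up into a separate index, and dashboards query the rollup instead of the raw events. A minimal sketch using the collect command (the index, sourcetype, and field names here are made up for illustration):

```
index=web sourcetype=access_combined
| stats count AS hits BY status, host
| collect index=web_summary
```

Searching index=web_summary afterward is cheap, but as I said, this only helps the queries you anticipated in advance - a truly ad hoc search still has to scan the raw events.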
Specifically, we have queries of the following types that just run too slowly, on the order of many minutes, or tens of minutes:
These are queries that don't benefit from indexing, and they effectively "break" Splunk despite our having many nodes set up.
It feels like I'd be better off sshing into one machine and just grepping the logs, and in some cases even that isn't possible due to security restrictions. So much as I loved Splunk as an introductory user, I've become very frustrated with search speed as data volumes have grown and queries have shifted from simple "search for a word" lookups to the more complex examples outlined above.
1. Is search speed a major pain point for users with advanced queries and/or large datasets?
[Hopefully, if you can chime in and be vocal, the good folks at Splunk will hear our feedback 🙂]
Personally, it is incredibly frustrating to watch queries run for so long, to the point I have to go surf reddit while the advanced searches run.
2. What are folks doing to make searches that can't use indexes run fast in these scenarios?
Modifying applications has helped in some cases, but since we don't control some libraries, it isn't feasible in others. Admittedly, some of my regexes aren't as optimized as they could be, but when I'm debugging a fire, it's impractical to figure out exactly what is or isn't in my logs, so I have to play it safe with a broad regex.
This might not be the answer you're looking for, but it sounds like you're dealing with a lot of data. Once you get to a certain point, you need to consider all levels of optimization. Your bottleneck might be hardware and not necessarily the type of query you're running.
You may want to look at your hardware architecture. Disks are probably the most important factor: array stripe size, file system, file system block size, etc.
You can read more here:
Thank you for the suggestion, though I am first trying to figure out what can be done in software (primarily through query optimization). These classes of searches are not running a few percentage points slower; they are taking many minutes instead of the seconds indexed searches take, to the point that it isn't feasible to throw hardware at the problem.
Not trying to sound argumentative with what I'm writing, but I'm confused on a couple of things... and pretty tired. One thing I'm not quite wrapping my head around is what you mean by "queries that don't benefit from indexing" - are you saying your queries are such that you can't add "index=x" due to the nature of the query? Alternatively, are you breaking up your data into multiple indices?
I'm also not getting "over time...queries got more complex" alongside "search for error" - I'd think that even when looking for errors you would wrap things like index=, sourcetype=, source=, etc. within the search itself, otherwise you end up with God knows what in your results (maybe that's the point?). Perhaps you could prepend index!=x for indexes where you know you won't see relevant logs. I have a large MSSP-type deployment with lots of indices. Where I have queries looking for failed logins that need to cover a lot of ground, I don't have them looking in my netflow logs, for example, which saves a huge amount of time. YMMV, of course.
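To make the scoping point concrete, here's a rough sketch (index and sourcetype names invented for illustration): the first search below scans every index for the phrase, while the second lets Splunk prune everything outside the auth index and sourcetype before any events come off disk.

```
"failed login"

index=auth sourcetype=WinEventLog "failed login"
```

The two return the same hits if failed logins only ever land in that index, but the second does a fraction of the work.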
If you are running 5.x, you could look into report acceleration. If you aren't running 5.x... why not?
There is also the possibility that, given the growth of your instance, the hardware may be undersized and you need another indexer or something.
It looks like report acceleration is geared at making queries that you run often faster (vs. enabling truly ad hoc queries). While that helps some use cases, it does not help mine.
If you define field extractions for the field instead of using rex and sed, then when you search on that field in the initial search clause, Splunk will (unless told otherwise in the extraction config) use the index to search for the field value first, and then, once events are off disk, do a round of filtering to get down to just the field=theValue results. So if you've been using rex and sed wildly instead of defining field extractions, that's one way to speed everything up.
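For example, a search-time extraction can be defined once in props.conf instead of rex-ing in every search (the sourcetype and field names here are illustrative, not from your deployment):

```
# props.conf - search-time field extraction (illustrative names)
[my_sourcetype]
EXTRACT-username = username=(?<username>\S+)
```

After that, a search like sourcetype=my_sourcetype username=badman lets Splunk use indexed terms to narrow the event set first, rather than regexing every raw event in the pipeline.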
*error* will have to get everything off disk.
(*error OR error*), although it's of course not the same thing, will in general perform a lot better.
More specifics would help even more, though - I'd advise posting a few more actual searches. And as another commenter said, you can use the Job Inspector to see where the time is spent in the search language, and compare 'scanCount' vs 'eventCount': scanCount tells you how many events were pulled off disk, and eventCount how many of those were actually needed to run the search or report.
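As a rough illustration of that comparison (index and field names invented): the first search below drags every event in the index off disk and filters afterward, while the second puts the filter in the initial search clause so the index can prune first. If the Job Inspector shows scanCount vastly larger than eventCount on the first form, that gap is roughly your wasted disk I/O.

```
index=web | search status=500

index=web status=500
```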
*error will use the index - each of these does a prefix/suffix search against indexed terms, and you can think of it as a big OR over all the matching terms. And you can absolutely extract your restricted_path token there as an extracted field; I think leaving INDEXED_VALUE set to True would work just fine. Basically, anything that can be done in a rex can be done in a real field extraction. Rex is more of a quick-and-dirty tool, IMO, and as I mentioned, it will usually be slower than a real field extraction.
The kinds of searches that are slow for me are (giving representative examples here, as I don't know if I can share my exact searches; I'll try to post some Job Inspector stats):
1) Search for "error" (generates lots of results)
2) Search for *error* (doesn't use an index)
3) Searches which use rex/sed and so need access to the raw data of search results, e.g. error | rex mode=sed 's/username=.*/username=[DELETED]/g'
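A slightly cheaper variant of the type-3 example, for what it's worth: scope with indexed terms first so the sed runs over fewer events, and swap the greedy .* (which chews to end-of-line) for a bounded match. Index and sourcetype names here are made up:

```
index=app sourcetype=authlog error
| rex mode=sed "s/username=[^ ]*/username=[DELETED]/g"
```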
In most cases we can show you the most inefficient parts of your searches, and in many if not most cases there is indeed a way to write the search differently so as to speed it up without sacrificing functionality. Post the searches and we'll get cracking. 😃