Advanced searches are too slow since they don't seem to make use of indexes

Explorer

Hey folks,

Long time Splunk fan here. Initially when we started using Splunk, our queries were simple, and so searches ran fast.

So I loved Splunk.

However, over time, as usage grew, our queries got more complex and our datasets grew large.

Unfortunately, neither Splunk nor any other log analysis product deals with this well (and Splunk probably deals with it best of them all, for that matter).

Our "needle in a haystack" class of queries works okay, e.g. user=badman or ip=evil_ipp.
We have a fair number of those, and while they have gotten slower with how massive our data has become, they are still tolerable.

To some extent, we've built our own apps on top of Splunk that maintain important aggregates, though that still doesn't work for ad hoc queries, since the queries are such that they don't hit indexes.

Specifically, we have queries of these types that just run too slow, on the order of many minutes or tens of minutes:

  • Queries which generate lots of results, e.g. error
  • Queries which have wildcard prefixes, e.g. *Error AND *.php
  • Queries which use Splunk for complex analysis and do data manipulation (e.g. using rex and regex replacement)

These are queries which don't benefit from indexing, and effectively "break" Splunk despite having many nodes set up.

It feels like I am better off sshing into one machine and just grepping logs, and in some cases this isn't even possible due to security restrictions. So much as I loved Splunk as an introductory user, I've become very frustrated with search speed as data volumes have grown and our queries have shifted from the simple "searching for a word" to the more complex examples outlined above.

1. Is search speed a major pain point for users of advanced queries and/or large datasets?

[ Hopefully, if you can chime in and be vocal, the good folks at Splunk will hear our feedback 🙂 ]

Personally, it is incredibly frustrating to watch queries run for so long, to the point I have to go surf reddit while the advanced searches run.

2. What are folks doing to make searches that can't use indexes run fast in these scenarios?

Modifying applications has helped in some cases, but because we don't control some of the libraries, it isn't always feasible. Other times, admittedly, some of my regexes aren't as optimized as they could be, but when I'm debugging a fire, it's impractical for me to find out what is or isn't in my logs, so I have to play it safe with a regex.

Engager

This might not be the answer you're looking for, but it sounds like you're dealing with a lot of data. Once you get to a certain point you need to consider all levels of optimization. Your bottleneck might be hardware and not necessarily the type of query you're running.

You may want to look at your hardware architecture. Disks are probably the most important factor: array strip size, file system, file system block size, etc.

You can read more here:
http://wiki.splunk.com/Community:HardwareTuningFactors

Explorer

Thank you for your suggestion, though I am first trying to figure out what can be done in software (primarily through query optimization). These classes of searches are not running a few percentage points slower; they are taking many minutes instead of the seconds that indexed searches take, to the point that it is not feasible to throw hardware at the problem.

Motivator

I'm not trying to sound argumentative, but I'm confused about a couple of things... and pretty tired. One thing I'm not quite wrapping my head around is what you mean by "queries that don't benefit from indexing". Are you saying your queries are such that you can't add "index=x" due to the nature of the query? Alternatively, are you breaking up your data into multiple indexes?

I'm also not following "over time... queries got more complex" and "search for error". I'd think that even when looking for errors you would include things like index=, sourcetype=, source=, etc. within the search itself; otherwise you end up with God knows what in your results (maybe that's the point?). Perhaps you could prepend index!=x to your query, where x is an index you know won't contain the logs you want. I have a large MSSP-type deployment with lots of indexes. When I have queries looking for failed logins that need to cover lots of ground, I don't have them looking in my netflow logs, for example, which saves a huge amount of time. YMMV of course.

If you are running 5.x you could look into report acceleration. If you aren't running 5.x... why not?

There is also the possibility that, given the growth of your instance, the hardware may be undersized and you need another indexer or something.

Explorer

It looks like report acceleration is a solution geared toward making queries that you run often faster (vs. being able to truly run an ad hoc query). While that helps some use cases, it does not help mine.

SplunkTrust

If you define field extractions for the field instead of using rex and sed, and then just search on that field in the initial search clause, Splunk will (unless told otherwise in the extraction config) use the index to search for the field value first. Then, once the events are off disk, it does a round of filtering to get down to just the field=theValue results. So if you've been using rex and sed wildly instead of doing field extractions, that's one way to speed everything up.

And indeed *error* will have to get everything off disk. (*error OR error*), although it's of course not the same thing, will in general perform a lot better.

More specifics may help even more though - I advise adding a few more actual searches. And as was said by another commenter, you can use the Job Inspector to look at where the time gets spent in the search language, and you can compare the 'scanCount' vs 'eventCount' to tell you how many events were gotten off disk (scanCount), versus how many of those events ended up being needed to run the actual search or report (eventCount).
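To make the scanCount vs. eventCount comparison concrete, here is a minimal Python sketch. The function name and the sample numbers are hypothetical and not taken from any real job; it only shows the arithmetic you would do by hand with the two values the Job Inspector reports.

```python
def scan_efficiency(scan_count: int, event_count: int) -> float:
    """Fraction of events read off disk (scanCount) that the search kept (eventCount)."""
    if scan_count == 0:
        # Nothing was read off disk, so there is no ratio to report.
        return 0.0
    return event_count / scan_count


# Hypothetical values copied by hand from a Job Inspector page.
ratio = scan_efficiency(scan_count=1_000_000, event_count=1_200)
print(f"{ratio:.2%} of scanned events were actually needed")
```

A ratio close to 1 suggests the index narrowed things down well before events were read; a very small ratio suggests most of the time went into reading events that were later thrown away.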

SplunkTrust

error* and *error will use the index - they'll each do prefix/suffix searches against indexed terms and then you can think of it as doing a big OR statement with all of those. And you can absolutely extract your restricted_path token there, as an extracted field. And I think leaving INDEXED_VALUE set to True would work just fine. Basically anything that can be done in a rex, can be done in a real field extraction. Rex is more of a quick and dirty tool imo, and as I mentioned it will usually be slower than doing a real field extraction.
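As a rough intuition for why a prefix search against indexed terms is cheap, here is a toy Python model. It assumes the terms are simply kept in a sorted list, which is an assumption for illustration only and not a description of Splunk's actual index format; the function names and sample terms are made up.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Find terms starting with `prefix` using binary search on a sorted list."""
    i = bisect_left(sorted_terms, prefix)  # jump straight to the first candidate
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


def substring_matches(sorted_terms, fragment):
    """A *fragment* style match has no ordering to exploit: check every term."""
    return [t for t in sorted_terms if fragment in t]


# Hypothetical indexed terms.
terms = sorted(["error", "errors", "errno", "info", "phperror", "warn"])
print(prefix_matches(terms, "error"))
print(substring_matches(terms, "error"))
```

In this toy model, the terms found by the prefix lookup are what would then be combined into the "big OR" described above, while the substring version has to touch every term.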

Explorer

Also good point about error* vs. *error, and I'll ensure that my searches use them.

Does *error also use an index or does it force reading everything from disk?

Explorer

I wanted to note that the cases where I am using rex are cases where field=value can't typically be applied.

E.g. http://server/*restricted_path* is one example.

Explorer

The kinds of searches that are slow for me are (giving representative examples here as I don't know if I can share my exact searches; will try to post some job inspector stats):
1) Search for "error" ( generates lots of results )
2) Search for *error* ( doesn't use an index )
3) Searches which use "rex" / sed such that they need access to the data of search results, e.g. error | rex mode=sed 's/username=.*/username=[DELETED]/g'
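To show what the substitution in example 3 does to an event, here is a small Python sketch using the standard `re` module. It is not Splunk's rex implementation, and the sample log line and function names are hypothetical. It also shows that the `.*` in the representative pattern removes everything after `username=`, while a narrower pattern keeps the rest of the line.

```python
import re


def mask_greedy(line: str) -> str:
    """Same pattern as the example: everything after 'username=' is replaced."""
    return re.sub(r"username=.*", "username=[DELETED]", line)


def mask_token(line: str) -> str:
    """Narrower variant: replace only the non-space token after 'username='."""
    return re.sub(r"username=\S+", "username=[DELETED]", line)


# Hypothetical event text.
event = "error login failed username=alice ip=10.0.0.1"
print(mask_greedy(event))
print(mask_token(event))
```

Either way, the rewrite works on the raw text of each event, which is why this kind of search needs the matching events themselves and not just an index hit.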

SplunkTrust

In most cases we can help show you the most inefficient parts of your searches, and in many cases, if not most, there is indeed a way to write the search differently so as to speed it up without sacrificing functionality. Post the searches and we'll get cracking. 😃

Influencer

Launch the job inspector and see where the bottlenecks are in your complex queries. Post the query and a screenshot of the job inspector.