Monitoring Splunk

Distributed search performance

jerdmann
Path Finder

Hi there! We have an environment with a single search head and 14 peers, and our distributed searches seem to take much longer than they should. A given search takes roughly 15-20 seconds to complete, regardless of whether it's scoped to the last minute or the last week.

What's really interesting is that from the search head, if I scope the search to a single indexer using "splunk_server=", it completes in almost exactly the same time as if I run it locally on that indexer. However, as soon as I add even a second indexer, the search time goes up disproportionately (in most cases, from <1s to >5s).
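
For reference, a single-peer version of the benchmark looks something like this (the peer name is just an example from our environment; the exact splunk_server value depends on how the peer registers itself):

index=_internal splunk_server=TTNET-NY-SPIND-1.ttnet.local earliest=-1m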

What gives? I find it odd that distributing to even just 2 peers takes much longer than the sum of the individual searches on those peers. Are there any performance / tuning parameters we should be looking at? Our environment is almost completely vanilla Splunk and the knowledge bundle we distribute is ~500K, so I'm not sure what else we need to look at.
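
In case it's useful, here's roughly how we sized the bundle on the search head (assuming the default bundle location under $SPLUNK_HOME/var/run; adjust the path if yours differs):

ls -lh $SPLUNK_HOME/var/run/*.bundle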

Any input? Thanks for the help!

EDIT
Yep, no problem.

It looks like for any given search, we spend ~8-10 seconds in dispatch.createProviderQueue. This is the single constant that's jumping out at me across searches, regardless of how big or small the search is.

I checked the Splunk manual, and this value is supposed to be the time it takes to connect to each search peer? Our highest-latency peers are ~150 ms away, so I'm not sure why this value is so high.

Let me know if you need anything else from me. Thanks again!

Duration (seconds)      Component   Invocations Input count Output count
    0.039   command.fields  51  7,972   7,972
    0.43    command.remotetl    51  7,972   -
    0.784   command.search  51  -   7,972
    0.165   command.search.calcfields   36  8,007   8,007
    0.043   command.search.index    51  -   -
    0.036   command.search.fieldalias   36  8,007   8,007
    0.036   command.search.filter   36  -   -
    0   command.search.index.usec_1_8   63  -   -
    0   command.search.index.usec_8_64  16  -   -
    0.274   command.search.kv   36  -   -
    0.199   command.search.rawdata  36  -   -
    0.139   command.search.typer    51  7,972   7,972
    0.036   command.search.lookups  36  8,007   8,007
    0.034   command.search.tags 51  7,972   7,972
    0.018   command.search.summary  51  -   -
    0.001   dispatch.check_disk_usage   1   -   -
    8.482   dispatch.createProviderQueue    1   -   -
    0.043   dispatch.evaluate   1   -   -
    0.042   dispatch.evaluate.search    1   -   -
    1.057   dispatch.fetch  66  -   -
    0.001   dispatch.preview    1   -   -
    0.475   dispatch.process_remote_timeline    25  1,347,453   -
    5.116   dispatch.readEventsInResults    1   -   -
    0.118   dispatch.remote_timeline_fullevents 25  1,401,267   1,417
    0.078   dispatch.stream.local   4   -   -
    1.11    dispatch.stream.remote  47  -   2,589,235
    0.179   dispatch.stream.remote.TTNET-CH-SPIND-1.ttnet.local 4   -   280,901
    0.145   dispatch.stream.remote.TTNET-NY-SPIND-2.ttnet.local 4   -   284,779
    0.134   dispatch.stream.remote.TTNET-NY-SPIND-1.ttnet.local 4   -   239,333
    0.119   dispatch.stream.remote.TTNET-DE-SPIND-1.ttnet.local 3   -   172,445
    0.115   dispatch.stream.remote.TTNET-UK-SPIND-2.ttnet.local 4   -   237,858
    0.093   dispatch.stream.remote.TTNET-UK-SPIND-1.ttnet.local 4   -   278,894
    0.086   dispatch.stream.remote.TTNET-JP-SPIND-1.ttnet.local 4   -   199,469
    0.048   dispatch.stream.remote.TTNET-SY-SPIND-1.ttnet.local 3   -   176,112
    0.047   dispatch.stream.remote.TTNET-JP-SPIND-2.ttnet.local 3   -   127,786
    0.037   dispatch.stream.remote.TTNET-SG-SPIND-1.ttnet.local 3   -   157,387
    0.036   dispatch.stream.remote.TTNET-SG-SPIND-2.ttnet.local 3   -   173,048
    0.031   dispatch.stream.remote.TTNET-DE-SPIND-2.ttnet.local 3   -   107,137
    0.029   dispatch.stream.remote.TTNET-SY-SPIND-2.ttnet.local 3   -   137,724
    0.011   dispatch.stream.remote.TTNET-CH-SPIND-2.ttnet.local 2   -   16,362
    0.305   dispatch.timeline   66  -   -
    0.019   dispatch.writeStatus    7   -   -
    0.042   startup.handoff 1   -   -
1 Solution

gkanapathy
Splunk Employee

[Edited to reflect that SSL may not need to be disabled]

As noted in the comments, it looks like this is due to SSL setup time, and in particular sequential SSL setup time. If you have significant latency between the search head and the indexers, you'll see this. By default, when using SSL, Splunk sets up the connection to each indexer sequentially (though once the connections are established, the searches run in parallel, of course). It is possible to set up the connections in parallel by setting the following:

On the search head limits.conf:

[search]
multi_threaded_setup = true
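
As a rough sanity check, assuming a TLS handshake costs on the order of three to four round trips and that many of your peers are near that 150 ms latency figure: that's roughly 0.5-0.6 seconds of setup per peer, and doing it sequentially across 14 peers works out to around 7-8 seconds, which is in the same ballpark as the ~8.5 seconds you're seeing in dispatch.createProviderQueue.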

Additionally, in Splunk 5.0.3 and older (and possibly in more recent versions), the SSL implementation would not work reliably with multi_threaded_setup enabled. (This is why it is not enabled by default.) On those versions you will also have to disable SSL between the search head and the indexers. In newer versions it should work, but the fix has not been thoroughly verified, so if you run into problems on a newer version, try it without SSL as well. To turn off SSL properly, you should set:

On the indexer server.conf:

[settings]
enableSplunkdSSL = false

On the search head server.conf:

[general]
useHTTPClientCompression = on-http

On the search head distsearch.conf:

[distributedSearch]
trySSLFirst = false
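
To confirm the search head is actually picking up the limits.conf change, a quick check with btool (assuming a default $SPLUNK_HOME) is:

$SPLUNK_HOME/bin/splunk btool limits list search | grep multi_threaded_setup

Note that you may need to restart splunkd on the search head for the limits.conf change to take effect.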

gkanapathy
Splunk Employee

Yes, it should be server.conf for that setting; I'll edit the answer above. That said, the original poster did not disable SSL and seems to be having no problems, so it may not be necessary.

ltawfall
Path Finder

You said to put this in the web.conf on the indexers. This conf option doesn't exist in that file. Is it supposed to be in server.conf on the indexers??
[settings]
enableSplunkdSSL = false

jerdmann
Path Finder

Yep, I will keep an eye on things. Thanks again!

gkanapathy
Splunk Employee

Okay. In older versions of Splunk, turning on multi_threaded_setup while leaving SSL on would cause unreliability, sometimes crashes. If you see that, try turning SSL off. If not, then I guess the problem was fixed.

jerdmann
Path Finder

Yep, this solved our problem! We actually just applied the multi_threaded_setup option to our search head, and now the dispatch.createProviderQueue step is down from ~8-10 seconds to ~1.

We haven't touched the SSL settings, because this change alone gives us the performance improvement we were looking for, but we'll keep them in mind if we ever decide we need to go even faster.

Thanks a ton!

gkanapathy
Splunk Employee

Perhaps the docs are referring to the lack of compression if you just turn off SSL. If you use the settings above, you should be okay.

jerdmann
Path Finder

Perfect, this is exactly what I was looking for! I will give this a shot and follow up with how everything goes.

Also, it's interesting that the manual says having splunkd SSL enabled will improve distributed search performance. Maybe that's more of an event-throughput thing (or maybe it's inaccurate)?

jerdmann
Path Finder

That is actually an excellent point. Let me review the config to see if that can be disabled/tweaked. We will probably want to leave this on in the long run, but it's worth a shot to see if that's the root cause.

sowings
Splunk Employee

Spin-up time for creating SSL contexts to dispatch to each of the indexers?

jerdmann
Path Finder

Anyone have any input on this one? For now, we're just writing this off as a lost cause, and hopefully the experience will improve when we upgrade to Splunk 6.

The ~10-15 seconds per search isn't a dealbreaker per se; it's just frustrating that distributed search is so much slower than local searches, given our indexer count. The event throughput itself is fine; it's just that initial time penalty tacked onto every search.

Let me know what you think either way though. Thanks again!

jerdmann
Path Finder

Yep, no problem. Here's the search I'm using as a benchmark; it's pretty straightforward. What's interesting is that I can change the index to something else (or *) and the search time (and the stats above) doesn't really change. I can also crank the time range up to the last hour or day and the search time still doesn't change.

It seems like our problem has nothing to do with the results / event count. It's almost like we lose a bunch of time in the distribution / peer "handshaking" process if that makes any sense.

Let me know what you think. Thanks again!

index=_internal earliest=-1m

gkanapathy
Splunk Employee

Also, you should try running the search in "fast" mode rather than verbose or smart mode and see if that makes a difference.
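
If you want to rule the UI out, you can also dispatch the same benchmark in fast mode straight against the REST API and time it from there (illustrative credentials and hostname; default management port 8089 assumed):

curl -k -u admin:changeme https://searchhead:8089/services/search/jobs -d search="search index=_internal earliest=-1m" -d adhoc_search_level=fast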

gkanapathy
Splunk Employee

It would be helpful if you included the search query, or at least something with the same structure.

_d_
Splunk Employee

We'll need more information. Run a search, open the Job Inspector (the "i" button below the search bar time picker), and paste the "Execution costs" portion here.
