Hi there! We have an environment of a single search head with 14 peers, and it seems like our distributed searches take much longer than they should. A given search takes around ~15-20 seconds to complete, regardless of whether the search is scoped to last minute or last week.
What's really interesting is that from the search head, if I scope the search to a single indexer with "splunk_server=<peer name>", the same search finishes far faster than the distributed version.
What gives? I find it odd that distributing to even just 2 peers takes much longer than the sum of the individual searches on those peers. Are there any performance / tuning parameters we should be looking at? Our environment is almost completely vanilla Splunk and the knowledge bundle we distribute is ~500K, so I'm not sure what else we need to look at.
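(For anyone who wants to check the same thing, the ~500K bundle size can be read off the bundle files the search head creates. Paths assume a default install with $SPLUNK_HOME set.)
# knowledge bundle files created by the search head
ls -lh $SPLUNK_HOME/var/run/*.bundle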
Any input? Thanks for the help!
EDIT
Yep, no problem.
It looks like for any given search, we spend ~8-10 seconds in dispatch.createProviderQueue. This is the single constant that's jumping out at me across searches, regardless of how big or small the search is.
I checked the Splunk manual, and this value is supposed to be the time it takes to connect to each search peer. Our highest-latency peers are ~150ms away, and even spread across all 14 peers, ~8.5 seconds works out to roughly 0.6 seconds per peer, so I'm not sure how this value ends up so high.
Let me know if you need anything else from me. Thanks again!
Duration (seconds) Component Invocations Input count Output count
0.039 command.fields 51 7,972 7,972
0.43 command.remotetl 51 7,972 -
0.784 command.search 51 - 7,972
0.165 command.search.calcfields 36 8,007 8,007
0.043 command.search.index 51 - -
0.036 command.search.fieldalias 36 8,007 8,007
0.036 command.search.filter 36 - -
0 command.search.index.usec_1_8 63 - -
0 command.search.index.usec_8_64 16 - -
0.274 command.search.kv 36 - -
0.199 command.search.rawdata 36 - -
0.139 command.search.typer 51 7,972 7,972
0.036 command.search.lookups 36 8,007 8,007
0.034 command.search.tags 51 7,972 7,972
0.018 command.search.summary 51 - -
0.001 dispatch.check_disk_usage 1 - -
8.482 dispatch.createProviderQueue 1 - -
0.043 dispatch.evaluate 1 - -
0.042 dispatch.evaluate.search 1 - -
1.057 dispatch.fetch 66 - -
0.001 dispatch.preview 1 - -
0.475 dispatch.process_remote_timeline 25 1,347,453 -
5.116 dispatch.readEventsInResults 1 - -
0.118 dispatch.remote_timeline_fullevents 25 1,401,267 1,417
0.078 dispatch.stream.local 4 - -
1.11 dispatch.stream.remote 47 - 2,589,235
0.179 dispatch.stream.remote.TTNET-CH-SPIND-1.ttnet.local 4 - 280,901
0.145 dispatch.stream.remote.TTNET-NY-SPIND-2.ttnet.local 4 - 284,779
0.134 dispatch.stream.remote.TTNET-NY-SPIND-1.ttnet.local 4 - 239,333
0.119 dispatch.stream.remote.TTNET-DE-SPIND-1.ttnet.local 3 - 172,445
0.115 dispatch.stream.remote.TTNET-UK-SPIND-2.ttnet.local 4 - 237,858
0.093 dispatch.stream.remote.TTNET-UK-SPIND-1.ttnet.local 4 - 278,894
0.086 dispatch.stream.remote.TTNET-JP-SPIND-1.ttnet.local 4 - 199,469
0.048 dispatch.stream.remote.TTNET-SY-SPIND-1.ttnet.local 3 - 176,112
0.047 dispatch.stream.remote.TTNET-JP-SPIND-2.ttnet.local 3 - 127,786
0.037 dispatch.stream.remote.TTNET-SG-SPIND-1.ttnet.local 3 - 157,387
0.036 dispatch.stream.remote.TTNET-SG-SPIND-2.ttnet.local 3 - 173,048
0.031 dispatch.stream.remote.TTNET-DE-SPIND-2.ttnet.local 3 - 107,137
0.029 dispatch.stream.remote.TTNET-SY-SPIND-2.ttnet.local 3 - 137,724
0.011 dispatch.stream.remote.TTNET-CH-SPIND-2.ttnet.local 2 - 16,362
0.305 dispatch.timeline 66 - -
0.019 dispatch.writeStatus 7 - -
0.042 startup.handoff 1 - -
[Edited to reflect that SSL may not need to be disabled]
As noted in the comments, it looks like this is caused by SSL setup time, and in particular by the connections being set up sequentially. If there is significant latency between the search head and the indexers, you'll see exactly this pattern. By default, Splunk establishes the SSL connection to each indexer one after another (though once the connections are up, the searches themselves run in parallel). You can tell Splunk to set up the connections in parallel instead by setting:
On the search head limits.conf:
[search]
multi_threaded_setup = true
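If it helps, one way to confirm the setting is actually being picked up after the change (command and path assume a default $SPLUNK_HOME install) is btool:
# show the effective value and which limits.conf it comes from
$SPLUNK_HOME/bin/splunk btool limits list search --debug | grep multi_threaded_setup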
Additionally, in Splunk 5.0.3 and older (and possibly in some more recent builds), the SSL implementation would not work reliably with multi_threaded_setup enabled, which is why it is not enabled by default. On those versions you will have to disable SSL between the search head and the indexers. In newer versions it should work, but the fix has not been thoroughly verified, so if you run into problems there, try it without SSL as well. To turn off SSL properly, you should set:
On the indexer server.conf:
[sslConfig]
enableSplunkdSSL = false
On the search head server.conf:
[httpServer]
useHTTPClientCompression = on-http
On the search head distsearch.conf:
[distributedSearch]
trySSLFirst = false
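Note that (assuming a standard install) the server.conf and distsearch.conf changes above only take effect after restarting splunkd on the affected instances:
$SPLUNK_HOME/bin/splunk restart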
Yes, it should be server.conf for that setting. I'll edit the answer above. Although, the original questioner did not disable SSL and seems to be having no problems, so that may not be necessary.
You said to put this in the web.conf on the indexers. This conf option doesn't exist in that file. Is it supposed to be in server.conf on the indexers??
[settings]
enableSplunkdSSL = false
Yep, I will keep an eye on things. Thanks again!
Okay. In older versions of Splunk, turning on multi_threaded_setup while leaving SSL on would cause unreliability, sometimes crashes. If you see that, try turning it off. If not, then I guess the problem was fixed.
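If it helps, you can confirm which version and build each instance is running (assuming a default $SPLUNK_HOME) with:
$SPLUNK_HOME/bin/splunk version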
Yep, this solved our problem! We actually just applied the multi_threaded_setup option to our search head, and now the dispatch.createProviderQueue step is down from ~8-10 seconds to ~1.
We haven't touched the SSL settings because the multi_threaded_setup change alone gives us the performance improvement we were looking for, although we'll keep the SSL option in mind if we ever decide we need to go even faster.
Thanks a ton!
Perhaps the docs are referring to the lack of compression if you just turn off SSL. If you use the settings above, you should be okay.
Perfect, this is exactly what I was looking for! I will give this a shot and follow up with how everything goes.
Also, it's interesting that the manual says that having splunkd SSL enabled will improve distributed search performance. Maybe that's more of an event throughput thing (or maybe it's inaccurate)?
That is actually an excellent point. Let me review the config to see if that can be disabled/tweaked. We will probably want to leave this on in the long run, but it's worth a shot to see if that's the root cause.
Spin-up time for creating SSL contexts to dispatch to each of the indexers?
Anyone have any input on this one? For now, we're just writing this off as a lost cause, and hopefully the experience will improve when we upgrade to Splunk 6.
The ~10-15 seconds per search isn't a dealbreaker per se. It's just a pain to know that distributed search seems so much slower than local searches given our indexer count. Actually, the event throughput is fine; it's just that initial time penalty tacked on to every search.
Let me know what you think either way though. Thanks again!
Yep, no problem. Here's the search I'm using as a benchmark. Pretty straightforward. What's interesting is that I can change the index to something else (or *), and the search time (and above stats) doesn't really change. Also, I can crank up the time range to last hour or day and the search time still doesn't change.
It seems like our problem has nothing to do with the results / event count. It's almost like we lose a bunch of time in the distribution / peer "handshaking" process if that makes any sense.
Let me know what you think. Thanks again!
index=_internal earliest=-1m
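For a repeatable wall-clock number, I can run the same thing from the search head CLI as well (the credentials below are placeholders, and the path assumes a default $SPLUNK_HOME):
time $SPLUNK_HOME/bin/splunk search "index=_internal earliest=-1m" -auth admin:changeme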
Also, you should try running the search in "fast" mode rather than verbose or smart mode and see if that makes a difference.
It would be helpful if you included the search query, or at least something with the same structure.
Will need more information. Run a search, look at the Job Inspector (the "i" button below the search bar time picker), and paste the "Execution costs" portion here.