Hi everyone,
I am running Splunk 6.2.2 on a distributed setup with 3 search heads in a search head cluster and 4 non-clustered indexers. Splunk seems sluggish, and I am trying to figure out why startup.handoff always seems to take a long time. For example:
This search has completed and has returned 595 results by scanning 806 events in 6.179 seconds.
Duration (seconds) Component Invocations Input count Output count
0.01 command.fields 14 595 595
0.02 command.remotetl 14 595 -
0.26 command.search 14 - 595
0.02 command.search.fieldalias 8 806 806
0.02 command.search.calcfields 8 806 806
0.01 command.search.index 14 - -
0.01 command.search.filter 8 - -
0.00 command.search.index.usec_1_8 1,431 - -
0.00 command.search.index.usec_8_64 25 - -
0.08 command.search.rawdata 8 - -
0.07 command.search.kv 8 - -
0.05 command.search.typer 8 595 595
0.03 command.search.lookups 8 806 806
0.01 command.search.tags 8 595 595
0.00 command.search.summary 14 - -
0.00 dispatch.check_disk_usage 1 - -
0.12 dispatch.createdSearchResultInfrastructure 1 - -
1.00 dispatch.evaluate 1 - -
1.00 dispatch.evaluate.search 1 - -
0.28 dispatch.fetch 15 - -
1.04 dispatch.finalizeRemoteTimeline 1 - -
0.00 dispatch.localSearch 1 - -
0.01 dispatch.parserThread 13 - -
0.02 dispatch.process_remote_timeline 2 109,035 -
0.17 dispatch.readEventsInResults 1 - -
0.04 dispatch.remote_timeline_fullevents 7 673,420 363
0.00 dispatch.stream.local 1 - -
0.28 dispatch.stream.remote 13 - 869,624
0.08 dispatch.stream.remote.SPLUNKIDX06 3 - 259,267
0.07 dispatch.stream.remote.SPLUNKIDX05 3 - 263,521
0.07 dispatch.stream.remote.SPLUNKIDX04 3 - 190,198
0.06 dispatch.stream.remote.SPLUNKIDX03 3 - 152,511
0.06 dispatch.timeline 15 - -
0.03 dispatch.writeStatus 8 - -
0.17 startup.configuration 6 - -
6.40 startup.handoff 6 - -
Any good tips on where to look for the problem?
Thank you
From the documentation: startup.handoff The time elapsed between the forking of a separate search process and the beginning of useful work of the forked search processes. In other words it is the approximate time it takes to build the search apparatus. This is cumulative across all involved peers. If this takes a long time, it could be indicative of I/O issues with .conf files or the dispatch directory.
The things that I have seen affecting this have been:
1. CPU contention on the search head or the indexers
2. Memory pressure (swapping) on the search head or the indexers
3. Very large search bundles
4. Slow disk I/O on the search head or the indexers
If you are using the SoS app or the Distributed Management Console, you should be able to see whether your problem is #1 or #2.
For 3 - Really large search bundles are most often due to large lookup tables. You can grab one of the bundles from the search head or an indexer, move it to a temp directory, and expand it with tar. Then check whether one or more directories is very large (du -m --max-depth=1); that can point you to a lookup table you may not need to distribute. You will sometimes also see bundle replication timeouts in splunkd.log that indicate this problem.
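If it helps, here is a rough sketch of that check, assuming the default bundle location under $SPLUNK_HOME/var/run on the search head (adjust the paths for your own environment):

    mkdir -p /tmp/bundlecheck && cd /tmp/bundlecheck
    # copy the newest knowledge bundle from the search head and unpack it
    cp "$(ls -t $SPLUNK_HOME/var/run/*.bundle | head -n 1)" .
    tar -xf ./*.bundle
    # see which directories dominate the size - large lookups stand out
    du -m --max-depth=1 . | sort -n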
For 4 - this is more difficult to measure, as slowness here can also come from network utilization when transferring the bundle to the indexers, but you should be able to measure disk I/O with a standard OS tool.
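For example, a quick way to watch disk I/O is sysstat's iostat (assuming it is installed on your hosts); keep an eye on %util and await for the volumes that hold $SPLUNK_HOME and the dispatch directory:

    # extended per-device statistics, refreshed every 5 seconds
    iostat -x 5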
It could also be that one indexer is much slower than all the others and that will cause the whole startup process to be slow.
If you find your answer, please post it for others' benefit.
Hello,
Is it possible that Splunk changed the way it calculates the search time? I have searches showing "152.98 startup.handoff" alongside "This search has completed and ... in 2.272 seconds.", so the huge handoff time is not included there. Looks like benchmark cheating 😉
Regards
Arnim
I'm having a similar issue, with startup.handoff taking the majority of the search time.
Duration (seconds) Component Invocations Input count Output count
0.00 command.fields 15 197 197
0.08 command.search 15 - 197
0.04 command.search.index 24 - -
0.01 command.search.filter 2 - -
0.00 command.search.calcfields 2 1,167 1,167
0.00 command.search.fieldalias 2 1,167 1,167
0.00 command.search.index.usec_1_8 489 - -
0.00 command.search.index.usec_512_4096 5 - -
0.02 command.search.rawdata 2 - -
0.01 command.search.kv 2 - -
0.00 command.search.typer 2 197 197
0.00 command.search.lookups 2 1,167 1,167
0.00 command.search.summary 15 - -
0.00 command.search.tags 2 197 197
0.00 dispatch.check_disk_usage 1 - -
0.00 dispatch.createdSearchResultInfrastructure 1 - -
0.04 dispatch.evaluate 1 - -
0.03 dispatch.evaluate.search 1 - -
0.04 dispatch.fetch 16 - -
0.06 dispatch.localSearch 1 - -
0.01 dispatch.readEventsInResults 1 - -
0.08 dispatch.stream.local 15 - -
0.31 dispatch.timeline 16 - -
0.05 dispatch.writeStatus 6 - -
0.02 startup.configuration 1 - -
0.64 startup.handoff 1 - -
Open a new question, and include a pastebinned search.log from that job.
I have the same problem too, but no memory swapping issue.
Execution costs
Duration (seconds) Component Invocations Input count Output count
0.02 command.dedup 16 596 187
0.01 command.fields 15 2,869 2,869
0.02 command.prededup 15 2,869 596
0.02 command.rename 16 187 187
0.74 command.search 30 2,869 5,738
0.40 command.search.index 789 - -
0.08 command.search.fieldalias 9 2,869 2,869
0.07 command.search.calcfields 9 2,869 2,869
0.02 command.search.filter 24 - -
0.00 command.search.index.usec_1_8 242 - -
0.00 command.search.index.usec_8_64 6 - -
0.28 command.search.typer 9 2,869 2,869
0.12 command.search.lookups 9 2,869 2,869
0.08 command.search.kv 9 - -
0.04 command.search.rawdata 9 - -
0.01 command.search.tags 9 2,869 2,869
0.00 command.search.summary 15 - -
0.00 dispatch.check_disk_usage 1 - -
1.42 dispatch.createdSearchResultInfrastructure 1 - -
0.32 dispatch.evaluate 1 - -
0.32 dispatch.evaluate.search 1 - -
0.00 dispatch.evaluate.dedup 1 - -
0.00 dispatch.evaluate.rename 1 - -
2.30 dispatch.fetch 16 - -
0.00 dispatch.localSearch 1 - -
0.84 dispatch.parserThread 14 - -
0.00 dispatch.preview 3 - -
0.00 dispatch.readEventsInResults 1 - -
0.02 dispatch.results_combiner 16 - -
0.00 dispatch.stream.local 1 - -
0.73 dispatch.stream.remote 14 - 436,030
0.31 dispatch.stream.remote.splunkqa2i 4 - 176,291
0.19 dispatch.stream.remote.splunk4i 4 - 121,692
0.19 dispatch.stream.remote.splunk2i 4 - 129,755
0.04 dispatch.stream.remote.splunk1i 2 - 8,292
0.18 dispatch.timeline 16 - -
3.05 dispatch.writeStatus 11 - -
0.38 startup.configuration 5 - -
8.06 startup.handoff 5 - -
Open a new question, and include a pastebinned search.log from that job.
You shouldn't be swapping, ever. Memory is cheap!
Those log messages should be from the actual search execution, not the startup... so they're not to blame.
Well I am out of ideas, but thanks for your help anyway.
In the end I was able to improve overall performance by reducing swappiness on the hosts to 10; the search heads seem much more responsive now.
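For anyone else trying this, the change amounts to the standard Linux sysctl tweak sketched below (the sysctl.d file name here is just an arbitrary choice):

    # check the current value
    cat /proc/sys/vm/swappiness
    # lower it for the running system
    sudo sysctl vm.swappiness=10
    # persist the setting across reboots
    echo 'vm.swappiness = 10' | sudo tee /etc/sysctl.d/99-swappiness.conf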
If anyone has additional ideas on troubleshooting startup.handoff, it would be great to hear them.
Check the search.log linked from the job inspector and scroll through it looking for gaps greater than a second between timestamps.
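If scrolling is tedious, a small gawk sketch like this can flag the gaps; it assumes every line of search.log starts with a timestamp in the 06-25-2015 21:23:56.002 format and that gawk (for mktime) is available:

    gawk '{
      split($1, d, "-"); split($2, t, ":")
      # build an epoch value from "MM-DD-YYYY HH:MM:SS.mmm"
      ts = mktime(d[3] " " d[1] " " d[2] " " t[1] " " t[2] " " int(t[3])) + (t[3] - int(t[3]))
      if (prev != "" && ts - prev > 1)
        printf "gap of %.3f s before: %s\n", ts - prev, $0
      prev = ts
    }' search.log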
Thanks for your reply. I don't get anything over 1 second; the closest I see is something like:
06-25-2015 21:23:56.002 INFO DispatchThread - Generating results preview took 2 ms
06-25-2015 21:23:56.844 INFO NewTransam - Finalizing. Committing all open txns withheld
06-25-2015 21:30:12.649 INFO DispatchThread - Generating results preview took 5 ms
06-25-2015 21:30:13.527 INFO NewTransam - Finalizing. Committing all open txns withheld
The remaining events are very close together; the largest gap I can find is 0.2-0.3 seconds for a handful of events, and most are just milliseconds apart.
I don't get why the total search time is lower than the time spent in startup.handoff, like:
This search has completed and has returned 49 results by scanning 24,453 events in 4.975 seconds.
8.03 startup.handoff
Thank you