Splunk Search

How to troubleshoot why startup.handoff in the Search Job Inspector always seems to take a long time?

gustavomichels
Path Finder

Hi everyone,

I am running Splunk 6.2.2 on a distributed setup with 3 search heads in a search head cluster and 4 non-clustered indexers. Splunk seems sluggish and I am trying to figure out why startup.handoff always seem to take a long time. For example:

This search has completed and has returned 595 results by scanning 806 events in 6.179 seconds. 

Duration (seconds)      Component   Invocations     Input count     Output count
    0.01    command.fields  14  595     595
    0.02    command.remotetl    14  595     -
    0.26    command.search  14  -   595
    0.02    command.search.fieldalias   8   806     806
    0.02    command.search.calcfields   8   806     806
    0.01    command.search.index    14  -   -
    0.01    command.search.filter   8   -   -
    0.00    command.search.index.usec_1_8   1,431   -   -
    0.00    command.search.index.usec_8_64  25  -   -
    0.08    command.search.rawdata  8   -   -
    0.07    command.search.kv   8   -   -
    0.05    command.search.typer    8   595     595
    0.03    command.search.lookups  8   806     806
    0.01    command.search.tags     8   595     595
    0.00    command.search.summary  14  -   -
    0.00    dispatch.check_disk_usage   1   -   -
    0.12    dispatch.createdSearchResultInfrastructure  1   -   -
    1.00    dispatch.evaluate   1   -   -
    1.00    dispatch.evaluate.search    1   -   -
    0.28    dispatch.fetch  15  -   -
    1.04    dispatch.finalizeRemoteTimeline     1   -   -
    0.00    dispatch.localSearch    1   -   -
    0.01    dispatch.parserThread   13  -   -
    0.02    dispatch.process_remote_timeline    2   109,035     -
    0.17    dispatch.readEventsInResults    1   -   -
    0.04    dispatch.remote_timeline_fullevents     7   673,420     363
    0.00    dispatch.stream.local   1   -   -
    0.28    dispatch.stream.remote  13  -   869,624
    0.08    dispatch.stream.remote.SPLUNKIDX06  3   -   259,267
    0.07    dispatch.stream.remote.SPLUNKIDX05  3   -   263,521
    0.07    dispatch.stream.remote.SPLUNKIDX04  3   -   190,198
    0.06    dispatch.stream.remote.SPLUNKIDX03  3   -   152,511
    0.06    dispatch.timeline   15  -   -
    0.03    dispatch.writeStatus    8   -   -
    0.17    startup.configuration   6   -   -
    6.40    startup.handoff     6   -   -

Any good tips where to look for the problem?

Thank you

1 Solution

sjohnson_splunk
Splunk Employee
Splunk Employee

From the documentation: startup.handoff The time elapsed between the forking of a separate search process and the beginning of useful work of the forked search processes. In other words it is the approximate time it takes to build the search apparatus. This is cumulative across all involved peers. If this takes a long time, it could be indicative of I/O issues with .conf files or the dispatch directory.

The things that I have seen affecting this have been:

  1. search head is very busy
  2. indexers are very busy
  3. large search bundles
  4. slow I/O on drive that hosts the splunk ($SPLUNK_HOME)

If you are using the SoS app or the Distributed Management Console you should be able to see if your problems are #1 or 2.

For 3 - Really large search bundles are most often due to large lookup tables. You can grab one of the bundles from the search head or indexers, move it to a temp directory and expand it using tar. Then look and see if one or more directories is very large (du -m --max-depth=1). This could point you to a lookup table you may not need to distribute. Sometimes will will see bundle replication timeouts that indicate this problem in the splunkd.log

4 - more difficult to measure as this could also be due to network utilization when transferring the bundle to the indexers. But you should be able to measure disk I/O using some tool.

It could also be that one indexer is much slower than all the others and that will cause the whole startup process to be slow.

If you find your answer, please post for other's benefit.

View solution in original post

Administrator
Explorer

Hello,

 

is it possible that splunk changed the way how it calculates the search time? I have searches with "152,98 startup.handoff" and "This search has completed and ... in 2.272 seconds. " so without putting the huge handoff time in there. Looks like benchmark cheating 😉

 

Regards

Arnim

 

 

sjohnson_splunk
Splunk Employee
Splunk Employee

From the documentation: startup.handoff The time elapsed between the forking of a separate search process and the beginning of useful work of the forked search processes. In other words it is the approximate time it takes to build the search apparatus. This is cumulative across all involved peers. If this takes a long time, it could be indicative of I/O issues with .conf files or the dispatch directory.

The things that I have seen affecting this have been:

  1. search head is very busy
  2. indexers are very busy
  3. large search bundles
  4. slow I/O on drive that hosts the splunk ($SPLUNK_HOME)

If you are using the SoS app or the Distributed Management Console you should be able to see if your problems are #1 or 2.

For 3 - Really large search bundles are most often due to large lookup tables. You can grab one of the bundles from the search head or indexers, move it to a temp directory and expand it using tar. Then look and see if one or more directories is very large (du -m --max-depth=1). This could point you to a lookup table you may not need to distribute. Sometimes will will see bundle replication timeouts that indicate this problem in the splunkd.log

4 - more difficult to measure as this could also be due to network utilization when transferring the bundle to the indexers. But you should be able to measure disk I/O using some tool.

It could also be that one indexer is much slower than all the others and that will cause the whole startup process to be slow.

If you find your answer, please post for other's benefit.

View solution in original post

cegoes
Explorer

I'm having a similiar issue, with startup.handoff taking the majority of the search time.

    0.00    command.fields  15  197 197
    0.08    command.search  15  -   197
    0.04    command.search.index    24  -   -
    0.01    command.search.filter   2   -   -
    0.00    command.search.calcfields   2   1,167   1,167
    0.00    command.search.fieldalias   2   1,167   1,167
    0.00    command.search.index.usec_1_8   489 -   -
    0.00    command.search.index.usec_512_4096  5   -   -
    0.02    command.search.rawdata  2   -   -
    0.01    command.search.kv   2   -   -
    0.00    command.search.typer    2   197 197
    0.00    command.search.lookups  2   1,167   1,167
    0.00    command.search.summary  15  -   -
    0.00    command.search.tags 2   197 197
    0.00    dispatch.check_disk_usage   1   -   -
    0.00    dispatch.createdSearchResultInfrastructure  1   -   -
    0.04    dispatch.evaluate   1   -   -
    0.03    dispatch.evaluate.search    1   -   -
    0.04    dispatch.fetch  16  -   -
    0.06    dispatch.localSearch    1   -   -
    0.01    dispatch.readEventsInResults    1   -   -
    0.08    dispatch.stream.local   15  -   -
    0.31    dispatch.timeline   16  -   -
    0.05    dispatch.writeStatus    6   -   -
    0.02    startup.configuration   1   -   -
    0.64    startup.handoff 1   -   -

martin_mueller
SplunkTrust
SplunkTrust

Open a new question, and include a pastebinned search.log from that job.

BP9906
Builder

I have the same problem too but no memory swap issue.

Execution costs

Duration (seconds)      Component   Invocations Input count Output count
    0.02    command.dedup   16  596 187
    0.01    command.fields  15  2,869   2,869
    0.02    command.prededup    15  2,869   596
    0.02    command.rename  16  187 187
    0.74    command.search  30  2,869   5,738
    0.40    command.search.index    789 -   -
    0.08    command.search.fieldalias   9   2,869   2,869
    0.07    command.search.calcfields   9   2,869   2,869
    0.02    command.search.filter   24  -   -
    0.00    command.search.index.usec_1_8   242 -   -
    0.00    command.search.index.usec_8_64  6   -   -
    0.28    command.search.typer    9   2,869   2,869
    0.12    command.search.lookups  9   2,869   2,869
    0.08    command.search.kv   9   -   -
    0.04    command.search.rawdata  9   -   -
    0.01    command.search.tags 9   2,869   2,869
    0.00    command.search.summary  15  -   -
    0.00    dispatch.check_disk_usage   1   -   -
    1.42    dispatch.createdSearchResultInfrastructure  1   -   -
    0.32    dispatch.evaluate   1   -   -
    0.32    dispatch.evaluate.search    1   -   -
    0.00    dispatch.evaluate.dedup 1   -   -
    0.00    dispatch.evaluate.rename    1   -   -
    2.30    dispatch.fetch  16  -   -
    0.00    dispatch.localSearch    1   -   -
    0.84    dispatch.parserThread   14  -   -
    0.00    dispatch.preview    3   -   -
    0.00    dispatch.readEventsInResults    1   -   -
    0.02    dispatch.results_combiner   16  -   -
    0.00    dispatch.stream.local   1   -   -
    0.73    dispatch.stream.remote  14  -   436,030
    0.31    dispatch.stream.remote.splunkqa2i   4   -   176,291
    0.19    dispatch.stream.remote.splunk4i 4   -   121,692
    0.19    dispatch.stream.remote.splunk2i 4   -   129,755
    0.04    dispatch.stream.remote.splunk1i 2   -   8,292
    0.18    dispatch.timeline   16  -   -
    3.05    dispatch.writeStatus    11  -   -
    0.38    startup.configuration   5   -   -
    8.06    startup.handoff 5   -   -

martin_mueller
SplunkTrust
SplunkTrust

Open a new question, and include a pastebinned search.log from that job.

martin_mueller
SplunkTrust
SplunkTrust

You shouldn't be swapping, ever. Memory is cheap!

martin_mueller
SplunkTrust
SplunkTrust

Those log messages should be from the actual search execution, not the startup... so they're not to blame.

gustavomichels
Path Finder

Well I am out of ideas, but thanks for your help anyway.

In the end I was able to improve performance overall by reducing swappiness on the hosts to 10, the search heads seem way more responsive now.

If someone has any additional ideas on troubleshooting startup.handoff it would be great to hear.

martin_mueller
SplunkTrust
SplunkTrust

Check the search.log linked to from the job inspector, scroll through it for gaps greater than a second in the timestamps.

gustavomichels
Path Finder

Thanks for your reply. I don't get anything over 1 second, the closest is stuff like:

06-25-2015 21:23:56.002 INFO  DispatchThread - Generating results preview took 2 ms
06-25-2015 21:23:56.844 INFO  NewTransam - Finalizing. Committing all open txns withheld

06-25-2015 21:30:12.649 INFO  DispatchThread - Generating results preview took 5 ms
06-25-2015 21:30:13.527 INFO  NewTransam - Finalizing. Committing all open txns withheld

Remaining events are very close, the max gap I can find is 0.2/0.3 seconds for a handful, most are milliseconds apart.

I don't get why search total times are lower than the time in startup.handoff, like:

This search has completed and has returned 49 results by scanning 24,453 events in 4.975 seconds.
8.03     startup.handoff

Thank you

Did you miss .conf21 Virtual?

Good news! The event's keynotes and many of its breakout sessions are now available online, and still totally FREE!