Apologies if this is blatantly obvious.
I have been troubleshooting search performance and, like many others, have gotten focused on the "startup.handoff" value when inspecting search jobs (due to its sheer size).
The value is always huge and is definitely not a time. If it were, some of our 20-second searches would actually have run for 100+ days.
What exactly does the startup.handoff value in the search job inspector represent, and if relevant, how do we reduce it?
If you want to understand the search job inspector results, start here. When you search for startup.handoff, you'll find this description:
The time elapsed between the forking of a separate search process and the beginning of useful work of the forked search processes. In other words it is the approximate time it takes to build the search apparatus. This is cumulative across all involved peers. If this takes a long time, it could be indicative of I/O issues with .conf files or the dispatch directory.
So apparently this value is cumulative, which means that in a large environment you will see much higher numbers than on a single-instance Splunk (you can divide the number by the number of peers involved to get a rough idea of how long each of them takes).
You should be able to see the effect of this parameter in action if there is a significant delay between starting a search and seeing the first results. If you are not experiencing this, then I don't think you need to take action. If you do, see here whether that already fixes it.
If it doesn't then I'm afraid we'll have to inspect your searches in more detail, but maybe I could already help you with some basic understanding of that number.
Thanks. Here's the confusion: I'm trying to avoid drilling into our environment (examining search logs, etc.)
and would rather better understand startup.handoff and how to tune it if applicable (conf variables to dig into, etc.).
Regardless of our results, what does the attribute represent, and how do we tune it if relevant?
Apologies if I'm missing the boat. A lot of folks are asking about this in various ways (Google is our friend 🙂).
Taking a very simple query, run in fast mode ("index=iis sourcetype=iis earliest=-65m@m latest=-5m@m|stats count by cs_host"),
the query took approximately 31 seconds to complete.
We have fourteen (14) indexers running, and our startup.handoff for the same search = 141,579,974,861.00
Doing the math (assuming it is seconds, as the output suggests): 141,579,974,861.00 / 14 / 86400 ≈ 117,047 days to run? (Without dividing by the 14 indexers, it is about 1,638,657 days.)
(If it's milliseconds, take it down a few notches.)
So I'm still confused (despite the documentation) as to what startup.handoff really represents.
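For anyone who wants to redo the arithmetic, here is a minimal Python sketch. It assumes the value really is cumulative across the 14 indexers and simply tries two unit interpretations (seconds and milliseconds). The helper name per_peer_days is made up for illustration and is not anything from Splunk.

```python
SECONDS_PER_DAY = 86_400


def per_peer_days(cumulative, peers, unit_seconds=1.0):
    """Average per-peer duration in days for a cumulative value.

    unit_seconds is the assumed length of one unit in seconds
    (1.0 for seconds, 0.001 for milliseconds).
    """
    return cumulative * unit_seconds / peers / SECONDS_PER_DAY


handoff = 141_579_974_861.00  # startup.handoff reported for the search
peers = 14                    # number of indexers involved

as_seconds = per_peer_days(handoff, peers)
as_millis = per_peer_days(handoff, peers, unit_seconds=0.001)

print(f"if seconds:      {as_seconds:,.0f} days per peer")
print(f"if milliseconds: {as_millis:,.0f} days per peer")
```

Under either assumption the per-peer figure is on the order of a hundred days or more, which cannot be reconciled with a search that finished in about 31 seconds.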
You'll have to trust me, as it's unclear how to upload an image to a response without first sending it to a website. Reading it right now (I ran the exact same query as listed above): startup.handoff = 142,529,121,751.359985352
As indicated, I do not want to get into digging into our environment specifically, unless there is something insanely wrong with ours. That's possible; however, others are reporting this too (Google is our friend), so I suspect not. I'm open to all possibilities, of course.
The label at the top of the column in the job inspector says Duration (seconds), so it's in seconds. The number you're reporting is so ridiculously large that I suspect it's not real--maybe there's a time discrepancy somewhere on one of your indexers or your SH/storage or something.