I have two systems running splunk, and for some as-yet unknown reason the exact same search on both systems produces a difference in the total time to complete the search.
The two systems are as follows:
System A = splunk v4.1.5 on Ubuntu, 2 Cores, 2 GBytes RAM, 60GB internal SATA disk T Specific search time = 0.891 seconds
System B = splunk v4.1.2 on RHEL, 16 cores, 18 GBytes RAM, 10TB of fast SAN disks Specific search time = 9.124 seconds
From the search inspection, both systems show the splunk components performing almost identically. The data is exactly the same, though system A actually has more data indexed, so the search does perform the query against more indexed data, so parts of the search perform "slower" (increases are only in 0.xxx's of a second). From all that I can see, this doesn't significantly affect the outcome, since System A still performs significantly quicker.
It would appear I have 8 seconds un-accounted for in the search time on System B. Stats from System B are below:
Duration (seconds) Component Invocations Input count Output count 0.003 command.fields 3 6 6 1.521 command.search 6 6 12 0.654 command.search.index 3 - - 0.353 command.search.kv 1 - - 0.352 command.search.fieldalias 1 6 6 0.138 command.search.rawdata 1 - - 0.024 command.search.typer 3 6 6 0.004 command.search.filter 4 - - 0.003 command.search.tags 3 6 6 0.001 command.search.lookups 1 6 6 1.522 dispatch.fetch 4 - - 0.003 dispatch.preview 3 - - 0.02 dispatch.timeline 4 - -
If I add up the time taken in the "Duration" column, I can never account for the additional 8 or so seconds. The inspect search stats for System A are identical, with the only exception being that the total time of the search is easily calculated using the values in the "Duration" column.
Why am I not able to account for the 8 seconds?
What is taking up 8 complete seconds, when splunk is reporting that all components are finishing within 1 or 2 seconds?
Any assistance appreciated.
The search inspector times don't account for the time it takes to start up the search (but it will in an upcoming release). I would expect that this accounts for the discrepancy.
Hi Stephen, I've run the strace and there is nothing obvious regarding "blocking" operations. There appears to be time "gaps" between entries in the log file, but I don't know if that translates into actual time gaps between actions. I'm unsure of how to interpret the log accurately.
It's a bit tricky, but you can use strace to figure out what the "splunkd search" process is doing. If you capture the command line arguments to this process and run, with the additional argument "--search='search
This is from the logs:
01-12-2011 12:56:02.503 INFO UserManager - Done setting user context: NULL -> [Removed]
01-12-2011 12:56:07.944 INFO IndexProcessor - lazy loading index main
As you can see, there are significant gap of time (5 seconds) between the above events. The other server does not have large gaps of time at all (any gaps can be measured in 0.1 or 0.01 seconds).
The best way to diagnose is to look at the search.log files in the job directories for the two searches and compare their relative timing. They're in $SPLUNK_HOME/var/run/splunk/dispatch/