Splunk Search

Seconds un-accounted for in job/search inspection

Explorer

Hi Guys,

I have two systems running splunk, and for some as-yet unknown reason the exact same search on both systems produces a difference in the total time to complete the search.

The two systems are as follows:

System A = splunk v4.1.5 on Ubuntu, 2 Cores, 2 GBytes RAM, 60GB internal SATA disk T Specific search time = 0.891 seconds

System B = splunk v4.1.2 on RHEL, 16 cores, 18 GBytes RAM, 10TB of fast SAN disks Specific search time = 9.124 seconds

From the search inspection, both systems show the splunk components performing almost identically. The data is exactly the same, though system A actually has more data indexed, so the search does perform the query against more indexed data, so parts of the search perform "slower" (increases are only in 0.xxx's of a second). From all that I can see, this doesn't significantly affect the outcome, since System A still performs significantly quicker.

It would appear I have 8 seconds un-accounted for in the search time on System B. Stats from System B are below:

Duration (seconds)  Component                                   Invocations     Input count     Output count
0.003               command.fields                          3                       6                       6
1.521               command.search                          6                       6                       12
0.654               command.search.index                3                       -                       -
0.353               command.search.kv                   1                       -                       -
0.352               command.search.fieldalias   1                       6                       6       
0.138               command.search.rawdata          1                       -                       -       
0.024               command.search.typer                3                       6                       6       
0.004               command.search.filter           4                       -                       -       
0.003               command.search.tags                 3                       6                       6       
0.001               command.search.lookups          1                       6                       6       
1.522               dispatch.fetch                          4                       -                       -       
0.003               dispatch.preview                        3                       -                       -       
0.02                     dispatch.timeline                  4                       -                           -       

If I add up the time taken in the "Duration" column, I can never account for the additional 8 or so seconds. The inspect search stats for System A are identical, with the only exception being that the total time of the search is easily calculated using the values in the "Duration" column.

Why am I not able to account for the 8 seconds?

What is taking up 8 complete seconds, when splunk is reporting that all components are finishing within 1 or 2 seconds?

Any assistance appreciated.

Regards

Simon

Splunk Employee
Splunk Employee

The search inspector times don't account for the time it takes to start up the search (but it will in an upcoming release). I would expect that this accounts for the discrepancy.

Explorer

Hi Stephen, I've run the strace and there is nothing obvious regarding "blocking" operations. There appears to be time "gaps" between entries in the log file, but I don't know if that translates into actual time gaps between actions. I'm unsure of how to interpret the log accurately.

0 Karma

Splunk Employee
Splunk Employee

It's a bit tricky, but you can use strace to figure out what the "splunkd search" process is doing. If you capture the command line arguments to this process and run, with the additional argument "--search='search '" and without "--id=...", under strace -Ttttf -o you'll see what the blocking operation is, most likely I/O.

0 Karma

Explorer

This is from the logs:

01-12-2011 12:56:02.503 INFO UserManager - Done setting user context: NULL -> [Removed]
01-12-2011 12:56:07.944 INFO IndexProcessor - lazy loading index main

As you can see, there are significant gap of time (5 seconds) between the above events. The other server does not have large gaps of time at all (any gaps can be measured in 0.1 or 0.01 seconds).

0 Karma

Splunk Employee
Splunk Employee

The best way to diagnose is to look at the search.log files in the job directories for the two searches and compare their relative timing. They're in $SPLUNK_HOME/var/run/splunk/dispatch/.

0 Karma

Explorer

Thanks Stephen!
Any ideas what would cause the search start-up to take a full 8 seconds, as opposed to virtually no time at all on another system?

0 Karma