The tailing processor rewrite in 4.1 seems much better overall than previous incarnations, but it appears to introduce a hardcoded delay on directory traversal.
There are consistent gaps in our debug output. We have these set:
category.TailingProcessor=DEBUG
category.WatchedFile=DEBUG
category.BatchReader=DEBUG
category.FileTracker=DEBUG
The gaps we see are always ~250-300ms, and they always occur when traversing into directories.
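For anyone who wants to locate gaps like these mechanically, here is a minimal sketch in Python. It assumes each log line starts with a timestamp shaped like `04-12-2010 10:15:01.100`; that layout, along with the names `TS_FORMAT`, `TS_LENGTH` and `find_gaps`, is an assumption made for illustration and may need adjusting to match the real splunkd.log.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log line (hypothetical;
# adjust both constants to match the actual log format).
TS_FORMAT = "%m-%d-%Y %H:%M:%S.%f"
TS_LENGTH = 23  # len("04-12-2010 10:15:01.100")


def find_gaps(lines, threshold_ms=200.0):
    """Return (gap_ms, previous_line, next_line) for each gap above threshold_ms."""
    gaps = []
    prev_ts = None
    prev_line = None
    for raw in lines:
        line = raw.rstrip("\n")
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_ts is not None:
            # Dividing two timedeltas yields a float number of milliseconds.
            gap_ms = (ts - prev_ts) / timedelta(milliseconds=1)
            if gap_ms > threshold_ms:
                gaps.append((gap_ms, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps
```

Passing an open file object (for example the debug log) as `lines` returns each pair of consecutive messages separated by more than the threshold, which makes it easier to see which component logged just before and just after each pause.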
Prior versions had similar problems, but these went away somewhat with the tailing_proc_speed option.
In the worst case (0.3s), for 10000 distinct directories, this equates to ~50 minutes of idle time introduced by the tailing engine.
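The arithmetic behind that estimate, as a small Python sketch. The function name `idle_minutes` is made up for illustration, and the calculation assumes exactly one pause per directory.

```python
def idle_minutes(directories, pause_ms):
    """Total idle time in minutes, assuming one pause of pause_ms per directory."""
    return directories * pause_ms / 60000  # 60000 ms per minute


worst = idle_minutes(10000, 300)  # 10000 * 0.3 s = 3000 s = 50 minutes
best = idle_minutes(10000, 250)   # 10000 * 0.25 s = 2500 s, about 41.7 minutes
print(f"{best:.1f} to {worst:.1f} minutes of idle time")
```

So across the observed 250-300ms range, the idle time for 10000 directories would fall between roughly 42 and 50 minutes.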
A few questions:
Is this really a hardcoded pause? If so, what's the reasoning?
Also, is there a way to tune / remove it?
There is no hardcoded pause in the new tailing processor for 4.1. The only limit is that any given file or directory is checked for changes at most once per second.
It may be worthwhile to correlate the log messages in splunkd with output from the strace command to see exactly what splunkd is doing during those 250-300ms. It could be checking files within a component that doesn't log every system call. Another possibility is that data is being read and put on a queue for processing.
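One way to do that correlation, sketched in Python under some assumptions: strace is attached with something like `strace -f -tt -T -p <pid> -o trace.out`, where `-T` appends the time spent in each system call as `<seconds>` at the end of the line. The function name `slow_calls`, the default set of syscall names, and the threshold are all illustrative choices, not anything splunkd or strace prescribes.

```python
import re

# Matches "name(args...) = result <seconds>" as printed by strace -T.
# Any pid or timestamp prefix before the syscall name is ignored.
CALL_RE = re.compile(r"(\w+)\(.*<(\d+\.\d+)>\s*$")


def slow_calls(lines, names=("getdents", "getdents64"), threshold_s=0.2):
    """Return (syscall, seconds) for matching calls slower than threshold_s."""
    slow = []
    for line in lines:
        match = CALL_RE.search(line)
        if match is None:
            continue  # unfinished or resumed calls, signals, exit lines
        name, seconds = match.group(1), float(match.group(2))
        if name in names and seconds > threshold_s:
            slow.append((name, seconds))
    return slow
```

If directory reads show up here with durations near 250-300ms, the pause is in the filesystem rather than in splunkd; if nothing shows up, the time is being spent elsewhere. Other calls such as `stat` or `open` can be added to `names` to widen the search.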
We concluded it must have been another component taking over while splunkd was in iowait. We found our high iowait was due to distance from the NFS filer (>3ms).
...any luck yet?
You mentioned in another question that the files are on network storage. Did you check how long the readdir/getdents calls were taking in the strace output?
I haven't yet correlated with any pauses in strace, so that's promising. I'm guessing it's another component doing heavy lifting, but at the same time we have a lot of inputs to digest. Is there any way to give priority to the various input processors over other components?