For some time now I've been getting very low performance in my infrastructure, and I'm looking for a way to confirm that the disks are what is holding it back. On most other systems, keeping iowait monitored and under control would be enough, but I don't think that holds for Splunk, at least not always. In my case, global CPU use is under 80%, there is enough free RAM, and iowait on the indexers is always under 1%, so I cannot understand why some queues are full even though there are still resources available.
What then occurred to me is that an iowait under 1% is far too low compared with the other metrics, so something other than what I expected must be happening. I ran strace for a while against the Splunk process and all its children, and found that they spend up to 93% of the total run time waiting (futex and epoll_wait) for something to happen! I think they are waiting for a chance to do an I/O operation but, and this is important, that chance is controlled by Splunk and not by the kernel (Splunk does not try to open a file until some condition is met). That would explain why I/O is very low even though performance and other resource usage are low too.
My questions here are:
1. Is this statement true? If so, under what circumstances will the OS of a Splunk indexer ever show high iowait?
2. Do full queues always imply low disk performance?
3. Is there any configuration tweak I can do?
4. Which parameters should be monitored in a Splunk system in terms of I/O performance? In those terms, is the queue fill percentage something like iowait?
-- 2. Do full queues always imply low disk performance?
Are you referring to the indexer queues? If so, these memory-based queues compensate for I/O bottlenecks by buffering the data in memory. The default queues are tiny, and most likely you should increase them.
How large are they?
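If it helps to answer that, the configured and current queue sizes can usually be read back from Splunk's own metrics.log. A minimal sketch, assuming the `group=queue` lines carry `name=`, `max_size_kb=` and `current_size_kb=` fields (worth confirming against your own log) and that the log sits under `$SPLUNK_HOME/var/log/splunk/`; the function name `queue_fill` is made up for illustration.

```shell
# queue_fill: read metrics.log lines on stdin and print, per queue,
# the configured size and the highest fill percentage seen.
# Assumed line shape (values are illustrative):
#   ... group=queue, name=indexqueue, max_size_kb=500, current_size_kb=250, ...
queue_fill() {
  awk '
    /group=queue/ {
      name = ""; max = 0; cur = 0
      for (i = 1; i <= NF; i++) {
        f = $i; sub(/,$/, "", f)              # drop the trailing comma
        split(f, kv, "=")
        if (kv[1] == "name")            name = kv[2]
        if (kv[1] == "max_size_kb")     max  = kv[2]
        if (kv[1] == "current_size_kb") cur  = kv[2]
      }
      if (name == "" || max == 0) next        # skip lines without sizes
      pct = 100 * cur / max
      size[name] = max
      if (!(name in peak) || pct > peak[name]) peak[name] = pct
    }
    END {
      for (n in peak)
        printf "%s max_size_kb=%d peak_fill=%.0f%%\n", n, size[n], peak[n]
    }'
}

# Usage (the path is an assumption about a default install):
# queue_fill < "$SPLUNK_HOME/var/log/splunk/metrics.log"
```

A queue whose peak fill sits at or near 100% is the one to look at first; the queues upstream of it tend to fill up as a consequence.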
Indexing only uses about 4 procs per pipeline (essentially 1 proc per queue). Therefore, looking at global CPU may not be a good measure. It is certainly possible to be CPU bound in the indexing pipeline, but it is also common to be I/O bound. What filesystem is in use? I have seen some "unfavorable" results with some versions of XFS, which were improved by upgrading the OS or moving to EXT4 (preferred). But as others have suggested, ensure THP is disabled, and make sure ulimits are sufficient.
I have enabled 2 pipelines, so that's 8 procs per instance. If so, where do all the other procs come from? ps aux | grep -i splunk | wc -l gives me something over 800...
And yes, we are using ext4.
My question is simpler than this... How can I be sure I have low disk performance if I have no iowait?
So iowait or nmon is reporting the disk busy % as under 1%? That seems unusually low; I see ~5% on really fast disks (although perhaps you have faster disks or less data).
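Note that iowait and disk busy % are not the same number: iowait is measured from the CPU side, while busy % is measured per device. A rough sketch of how the device-side figure can be derived by hand from two snapshots of /proc/diskstats, where column 13 counts the milliseconds the device had I/O in flight. The function name `disk_busy_pct`, the device `sda` and the temp file paths are placeholders, not anything from this thread.

```shell
# disk_busy_pct DEVICE BEFORE_FILE AFTER_FILE INTERVAL_SECONDS
# Both files are snapshots of /proc/diskstats. Column 13 is the time in
# milliseconds the device spent doing I/O, so the busy percentage is the
# growth of that counter divided by the elapsed time.
disk_busy_pct() {
  awk -v dev="$1" -v secs="$4" '
    $3 == dev {
      if (FILENAME == ARGV[1]) t0 = $13   # first snapshot
      else                     t1 = $13   # second snapshot
    }
    END { printf "%.1f\n", 100 * (t1 - t0) / (secs * 1000) }
  ' "$2" "$3"
}

# Usage (sda stands in for the device backing the Splunk indexes):
# cat /proc/diskstats > /tmp/ds.0; sleep 5; cat /proc/diskstats > /tmp/ds.1
# disk_busy_pct sda /tmp/ds.0 /tmp/ds.1 5
```

If sysstat is installed, `iostat -x` reports the same figure as %util along with per-request latency, which is usually the more telling number for slow disks.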
Also how much data per indexer is coming in? There are limits as to how much each indexer will be able to take before you start seeing some kind of issue.
Finally, have you changed the queue sizes at all? (Just curious)
Just a thought - what OS?
We see something very similar, and think we have tracked it down to THP. I am waiting for a change to go through which I hope will make a dramatic improvement.
Ideally you're looking for
If it says anything else, it might be worth investigating.
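For anyone wanting to run that check themselves, here is a small sketch. It assumes the usual Linux sysfs file, where the setting in effect is the word shown in square brackets (for example `always madvise [never]`), and it also tries the `redhat_transparent_hugepage` path that some Red Hat 6 kernels use. The helper name `thp_active` is made up. Since the goal in this thread is THP disabled, the value to hope for would presumably be `never`.

```shell
# thp_active: read a THP sysfs line on stdin and print the setting that
# is in effect, i.e. the word inside the square brackets.
thp_active() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Report whichever of the two known locations exists on this host.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
  if [ -r "$f" ]; then
    echo "$f: $(thp_active < "$f")"
  fi
done
```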
We have been getting painfully slow results on some searches, so we have been on a witch hunt for the culprit (after stringing up users with poor searches, and removing rt search from the worst offenders).
They WILL learn! We started looking at storage and system resources. We thought THP was disabled, but that turned out not to be the case (and what do you know, we are on exactly the same OS as detailed here: https://docs.splunk.com/Documentation/Splunk/7.0.0/ReleaseNotes/SplunkandTHP). We also have some low nproc and file handle limits, so we are adjusting them too. I have my fingers crossed that it helps as much as the article suggests, but we will see!
It seems we are going through the very same issue and reaching the same conclusions. Our nproc limit was increased in the indexing layer and we got almost no performance improvement. Have you tried using strace to check the syscall statistics? It could be really helpful. As I said, 93% of our CPU time in an indexer was wasted waiting... and there were still resources to be used.
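One thing worth checking after adjusting limits is what the running splunkd actually got, since `ulimit -a` in your own shell only shows your session's limits. A minimal sketch that pulls the soft limit out of /proc/<pid>/limits; the helper name `soft_limit` is hypothetical, and the use of `pgrep -o -x splunkd` to pick the oldest splunkd process is an assumption about how Splunk is running on your host.

```shell
# soft_limit NAME: read the content of /proc/<pid>/limits on stdin and
# print the soft limit of the row that starts with NAME
# (for example "Max open files" or "Max processes").
soft_limit() {
  awk -v name="$1" '
    index($0, name) == 1 {
      n = split(name, words, " ")   # the soft limit follows the name words
      print $(n + 1)
    }'
}

# Usage:
# pid=$(pgrep -o -x splunkd)
# soft_limit "Max open files" < "/proc/$pid/limits"
# soft_limit "Max processes"  < "/proc/$pid/limits"
```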
strace -y -tt -T -f -c -p [splunkd_parent_pid]
I used pstree -p to figure out splunkd parent pid...
Hope it helps!
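To turn the summary table from a run like that into a single figure, something along these lines could work. It is only a sketch: it assumes the standard `strace -c` table layout (`% time` in the first column, syscall name in the last), and `wait_share` is a made-up name. As far as I understand the strace man page, `-c` suppresses the per-call output (so `-y -tt -T` add nothing to the summary) and counts system CPU time by default, while `-w` makes it count wall-clock time instead, so it is worth knowing which of the two your percentage refers to.

```shell
# wait_share: read an `strace -c` summary on stdin and print the combined
# "% time" of the futex and epoll_wait rows.
wait_share() {
  awk '$NF == "futex" || $NF == "epoll_wait" { total += $1 }
       END { printf "%.2f\n", total }'
}

# Usage (stop strace with Ctrl-C to get the summary written out):
# strace -f -c -p "$pid" -o /tmp/strace.summary
# wait_share < /tmp/strace.summary
```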
I presume you mean the first child process (rather than the root parent).
It's the first child which spawns the search jobs, so running some of my problematic searches against that pid, I am seeing 80%+ for epoll_wait.
I'm not qualified to say if that is good/bad but it sounds broadly consistent with yours.