We are periodically seeing spikes in Storage I/O Saturation (Monitoring Console > Resource Usage: Deployment). When split by host, we can see this affects all six indexers nearly simultaneously on their /opt/splunkdata mount points. As expected, these spikes trigger Health Status notifications (warning or alert) throughout the day.
Of note: load averages regularly exceed 5 while CPU usage normally stays under 10% on each indexer (24 cores each), and RAM usage is around 30% per indexer. We are wondering whether our physical storage and/or network might be a bottleneck, or whether it's something on the Splunk side.
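Since the Linux load average also counts tasks blocked in uninterruptible I/O sleep, we suspect the load is I/O-driven rather than CPU-driven. Would something like this be a sensible first check during a spike (a minimal sketch; vmstat ships with most distributions)?

    # Print system-wide stats every 5 seconds. During a spike, a high 'b'
    # column (processes blocked on I/O) plus high 'wa' (% CPU time waiting
    # on I/O) alongside low 'us'/'sy' would implicate storage rather than compute.
    vmstat 5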
Speaking as a Splunk admin beginner: could someone please suggest where we should start troubleshooting these spikes, or explain in more detail what the Storage I/O Saturation indicator actually measures?
We are on Enterprise 9.0.4 across the board and are considering moving to the latest release sooner rather than later.
Thank you!
Hi @tretrigh,
in these situations the issue is usually the storage:
What kind of storage are you using?
Are you sure you're getting at least the required 800 IOPS from your storage?
You can measure your storage performance with a tool such as Bonnie++.
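For example, something like this (a minimal sketch; the scratch path, the 64g size, and the splunk user are assumptions, and the size should be at least twice the host's RAM so the page cache can't mask the disks):

    # Create a scratch directory on the volume under test (path is an assumption).
    mkdir -p /opt/splunkdata/bonnie_test && chown splunk /opt/splunkdata/bonnie_test
    # -d: directory to test; -s: total data size (>= 2x RAM to defeat caching);
    # -n 0: skip the small-file creation tests; -u: user to run as when started as root.
    bonnie++ -d /opt/splunkdata/bonnie_test -s 64g -n 0 -u splunk

The random-seeks figure it reports is a rough proxy for IOPS; fio can measure IOPS more directly if it's available.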
Ciao.
Giuseppe
Storage is all SSD on NetApp with RAID-DP, connected over a Fibre Channel backend. I'm waiting to hear back so we can match the times where we're seeing spikes with what the infrastructure team sees. I'm unsure about the IOPS limits at this point.
Of note: I've learned that the OS disk and the /splunkdata disk for each indexer all sit on the same aggregate. As I'm unfamiliar with NetApp, I don't know whether this matters (I'm assuming it's okay?).
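In the meantime, I'm planning to log timestamped device stats on each indexer so we have something concrete to line up with what the storage team sees (a rough sketch; assumes the sysstat package is installed, and the log path is arbitrary):

    # -x: extended device stats (r/s, w/s, await, %util); -t: timestamp each
    # report; 30: one report every 30 seconds until killed. Note the first
    # report shows averages since boot, so ignore it.
    iostat -xt 30 >> /var/tmp/splunkdata_iostat.log &

Sustained high await/%util on the device backing /opt/splunkdata at the spike times would point at the array or the FC path rather than at Splunk.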
Hi @tretrigh,
SSD storage should deliver the required performance.
Are all the indexers on the same node or on different ones?
Are resources shared or dedicated? They should be dedicated.
Maybe there's a momentary issue on the NetApp side.
Ciao.
Giuseppe