Troubleshooting High Storage I/O Saturation Spikes?

tretrigh
Path Finder

We are periodically seeing spikes of Storage I/O Saturation (Monitoring Console > Resource Usage: Deployment).  When split by host we can see that this is affecting all 6 indexers nearly simultaneously for the /opt/splunkdata mount points.  As expected, this triggers the Health Status notification throughout the day (warning or alert).

To note, Load Averages are regularly > 5% with CPU usage normally under 10% for each indexer (24 cores each).  RAM usage around 30% per indexer.  We are wondering if our physical storage and/or network might be a bottleneck or if it's something on the Splunk side.

For a Splunk Admin beginner, could someone please offer some suggestions on where we could start troubleshooting these spikes or explain in more detail the specifics around Storage I/O Saturation?

We are on Enterprise 9.0.4 across the board and are considering applying the recent update sooner rather than later.

Thank you!

gcusello
SplunkTrust

Hi @tretrigh,

In these situations the problem is usually the storage.

Which kind of storage are you using?

Are you sure that your storage delivers at least the required 800 IOPS?

You can measure your storage performance with a tool such as Bonnie++.
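
If you want a quick look at the IOPS an indexer is actually driving before running a full benchmark, a minimal sketch like the one below may help. It assumes a Linux host and reads the standard `/proc/diskstats` counters twice, then derives IOPS and a busy percentage similar to the `%util` column of `iostat`. The function names and the device name in the usage note are only illustrative, and I am assuming (not asserting) that the Monitoring Console's saturation figure is derived in a comparable way.

```python
import time
from typing import Dict, NamedTuple, Tuple


class DiskSample(NamedTuple):
    reads: int   # reads completed since boot
    writes: int  # writes completed since boot
    io_ms: int   # milliseconds the device spent doing I/O since boot


def parse_diskstats(text: str) -> Dict[str, DiskSample]:
    """Parse the contents of /proc/diskstats into per-device counters."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        # Layout: major minor name reads ... writes ... in_flight io_ms weighted_ms
        samples[fields[2]] = DiskSample(
            reads=int(fields[3]),
            writes=int(fields[7]),
            io_ms=int(fields[12]),
        )
    return samples


def io_rates(before: DiskSample, after: DiskSample,
             interval_s: float) -> Tuple[float, float]:
    """Return (IOPS, busy percent) between two samples taken interval_s apart."""
    ops = (after.reads - before.reads) + (after.writes - before.writes)
    iops = ops / interval_s
    busy_pct = 100.0 * (after.io_ms - before.io_ms) / (interval_s * 1000.0)
    return iops, busy_pct


def measure(device: str, interval_s: float = 5.0,
            path: str = "/proc/diskstats") -> Tuple[float, float]:
    """Sample one device twice and report its IOPS and busy percent."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return io_rates(first[device], second[device], interval_s)
```

For example, `measure("sdb", 5.0)` would sample a hypothetical device `sdb` for five seconds. Note that this shows the load currently on the disk, not the maximum the storage can sustain, so a benchmark is still needed for the latter.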

Ciao.

Giuseppe


tretrigh
Path Finder

Storage is all SSD on NetApp using RAID-DP, connected over a Fibre Channel backend. I'm waiting to hear back from the Infrastructure team about matching up the times of our spikes with what they see on their side. I'm unsure about the IOPS limits at this point.
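
To make that comparison easier, here is a rough sketch of how the spike times could be narrowed down to the windows where every indexer was saturated at once. It assumes the per-host saturation values have been exported as `(epoch_seconds, host, percent)` tuples; the function name, the 80% threshold and the one-minute bucket are placeholders I chose for illustration, not Splunk defaults.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def simultaneous_spikes(samples: Iterable[Tuple[float, str, float]],
                        hosts: Iterable[str],
                        threshold_pct: float = 80.0,
                        bucket_s: int = 60) -> List[int]:
    """Return bucket start times (epoch seconds) in which every listed host
    had at least one sample at or above threshold_pct."""
    expected = set(hosts)
    hot: Dict[int, Set[str]] = defaultdict(set)
    for epoch, host, pct in samples:
        if host in expected and pct >= threshold_pct:
            # Round the timestamp down to the start of its bucket.
            bucket = int(epoch) // bucket_s * bucket_s
            hot[bucket].add(host)
    return sorted(t for t, seen in hot.items() if seen == expected)
```

The returned timestamps are the windows worth handing to the storage team, since a spike that hits all hosts in the same minute points at something shared rather than at a single indexer.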

To note, I learned that the OS / disk and the /splunkdata disk for each indexer are all on the same aggregate. As I am unfamiliar with NetApp, I don't know if this matters (I'm assuming it is okay).

 

0 Karma

gcusello
SplunkTrust

Hi @tretrigh,

SSD storage should deliver the required performance.

Are all the indexers on the same node or on different ones?

Are the resources shared or dedicated? They should be dedicated.

Maybe there's a momentary issue on the NetApp.

Ciao.

Giuseppe
