Hi everyone, I've been hanging around the Splunk community for a while, mostly dealing with application logs, but I've recently jumped into the deep end of hardware monitoring for my home lab. I'm running some older Dell R720s, and I'm specifically trying to monitor the health and performance of their PERC H710P controllers. These manage a mix of SAS and SATA drives, and I've been seeing intermittent latency that I can't quite pin down.

One thing I've run into in my research is that OS-level disk metrics often don't show what's actually happening at the HBA or controller level. The OS might report high I/O wait, but it doesn't tell me whether the PERC H710P is struggling with its cache or whether a specific SATA drive is starting to throw media errors.

I spent most of last Saturday trying to get iDRAC7 to play nice with SNMP traps, but the data feels incredibly messy once it hits my indexer. I'm trying to avoid installing OpenManage Server Administrator (OMSA) on every host because I want to keep the footprint light, but maybe that's the only way to get the granular RAID status I'm looking for?

I'm curious whether anyone here has successfully built a dashboard for these older SAS/SATA controllers, or whether there's a preferred "agentless" way to catch things like RAID rebuild progress or battery backup unit (BBU) failures. It feels like a waste of a good controller if I can't actually see what it's doing.

Is it worth the effort to parse these hardware logs, or do most of you find that OS-level monitoring is "good enough" for identifying hardware failure?