I am looking to show I/O latency on our indexers, broken out by reads and writes. The Monitoring Console shows total IOPS, but we'd like to go a little more granular than that: we want to know whether our disk latency is caused by reads or writes on our hot/warm and cold mounts.
I'm looking at the introspection logs, at the fields listed below, and based on their descriptions it's not clear to me whether reads_kb_ps and writes_kb_ps are the fields that will provide this data.
avg_service_ms: Average time requests caused the CPU to be in use, in milliseconds.
avg_total_ms: Average queue + execution time for requests to be completed, in milliseconds.
cpu_pct: Percentage of time the CPU was servicing requests.
device: Device name (e.g., as listed under /dev on UNIX).
fs_type: Mounted device file system type.
interval: Interval over which sampling occurred, in seconds.
mount_point: Mount point(s) of the underlying device.
reads_kb_ps: Total number of kb read per second.
reads_ps: Number of read requests per second.
writes_kb_ps: Total number of kb written per second.
writes_ps: Number of write requests per second.
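Note that reads_kb_ps and writes_kb_ps measure throughput (KB transferred per second), not latency, and the IOStats component reports avg_service_ms / avg_total_ms for the device as a whole rather than split by operation type. One workaround is to chart the wait time per mount point alongside the read and write request rates, so you can see whether latency spikes line up with read-heavy or write-heavy periods. A sketch against the same index and sourcetype (<myhost> is a placeholder, as in the Monitoring Console search):

```
index=_introspection sourcetype=splunk_resource_usage component=IOStats host=<myhost>
| eval mount_point = 'data.mount_point'
| timechart minspan=60s
    avg(data.avg_total_ms) AS wait_ms,
    avg(data.reads_ps)     AS reads_ps,
    avg(data.writes_ps)    AS writes_ps
  BY mount_point
```

If wait_ms climbs only when writes_ps climbs (e.g., during heavy indexing), writes are the likely culprit; if it tracks reads_ps (e.g., during search load), reads are.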
I've looked all over and haven't been able to find anything helpful. I feel like someone else must be doing this type of performance metric.
This is the IOPS search the Monitoring Console uses. How can I pick this apart to give me what I'm looking for?
index=_introspection sourcetype=splunk_resource_usage component=IOStats host=<myhost>
| eval mount_point = 'data.mount_point'
| eval reads_ps = 'data.reads_ps'
| eval writes_ps = 'data.writes_ps'
| eval interval = 'data.interval'
| eval op_count = (reads_ps + writes_ps) * interval
| eval avg_service_ms = 'data.avg_service_ms'
| eval avg_wait_ms = 'data.avg_total_ms'
| eval cpu_pct = 'data.cpu_pct'
| eval network_pct = 'data.network_pct'
| timechart minspan=60s partial=f per_second(op_count) as iops, avg(data.cpu_pct) as avg_cpu_pct, avg(data.avg_service_ms) as avg_service_ms, avg(data.avg_total_ms) as avg_wait_ms, avg(data.network_pct) as avg_network_pct
| eval iops = round(iops)
| eval avg_cpu_pct = round(avg_cpu_pct)
| eval avg_service_ms = round(avg_service_ms)
| eval avg_wait_ms = round(avg_wait_ms)
| eval avg_network_pct = round(avg_network_pct)
| fields _time, iops avg_wait_ms
| rename avg_wait_ms as "Wait Time (ms)"
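The Monitoring Console search collapses reads and writes into a single op_count before charting. To get separate read and write series, you can keep them apart instead of summing them. A sketch of one way to adapt it (same index/sourcetype assumptions as above; field names after AS are arbitrary):

```
index=_introspection sourcetype=splunk_resource_usage component=IOStats host=<myhost>
| eval mount_point = 'data.mount_point'
| eval read_ops  = 'data.reads_ps'  * 'data.interval'
| eval write_ops = 'data.writes_ps' * 'data.interval'
| timechart minspan=60s partial=f
    per_second(read_ops)   AS read_iops,
    per_second(write_ops)  AS write_iops,
    avg(data.reads_kb_ps)  AS read_kb_ps,
    avg(data.writes_kb_ps) AS write_kb_ps,
    avg(data.avg_total_ms) AS wait_ms
```

This separates read vs. write volume (IOPS and KB/s), but the wait time remains a combined figure because the introspection data does not expose per-operation latency; correlating wait_ms against read_iops and write_iops is as close as this sourcetype gets.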
I have been using this app for a long time, and its predecessor app nmon for Splunk before that.
We're seeing a wide range. Some servers are showing 4-30 ms on our hot/warm disk, and other servers are showing up towards 2500 ms on our hot/warm disk, with spikes above that. I haven't even looked at our cold disk yet because the majority of our Splunk users are hitting the warm buckets. I'm looking to show historic values of I/O latency, not just what's currently going on.
Also refer to "What is the best app to monitor Linux in Splunk?". sar / iostat will work just fine, but you might want to look at the linked answer so you can get this data into Splunk easily and have prebuilt dashboards.
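Since the introspection data can't split latency by operation type, one option is to index iostat's extended stats yourself. On Linux with sysstat, `iostat -dx` reports r_await and w_await, the average per-read and per-write latencies in milliseconds, which is exactly the breakdown being asked for. A sketch of a scripted-input-style parser that turns that output into key=value events Splunk can index; the heredoc sample (sysstat 11-style columns) stands in for live output, which you would get by piping `iostat -dx 60 2` into the same function. Column layout varies by sysstat version, so the header row is used to find the columns by name rather than by position:

```shell
#!/bin/sh
# Sketch: convert `iostat -dx` extended stats into key=value lines.
# r_await / w_await = average latency of reads / writes, in ms.
parse_iostat() {
  awk '
    /^Device/ {                          # header row: map column name -> index
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    col["r_await"] && NF >= col["w_await"] {
        printf "device=%s r_await=%s w_await=%s\n",
               $1, $(col["r_await"]), $(col["w_await"])
    }'
}

# Sample input for illustration; in a real scripted input, replace the
# heredoc with:  iostat -dx 60 2 | parse_iostat
parse_iostat <<'EOF'
Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.01 1.20 3.50 8.40 120.0 340.0 38.6 0.12 9.8 4.2 12.1 0.9 2.1
EOF
```

Once indexed (sourcetype name is up to you), a simple `timechart avg(r_await), avg(w_await) by device` gives you the historic read-vs-write latency graph, for as far back as your retention allows.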
Thanks mpreddy. We can look on the box for current latency stats, but I need to look at historic values as well, the previous 6 weeks, for example. So I need to be able to graph something within Splunk.