Splunk Search

report on disk usage spikes?

mitag
Contributor

Need a report that:

  1. Lists volumes with significant disk usage spikes over a given timeframe.
  2. Plots those disk usage spikes over time.

P.S. Not interested in volumes with high percentage of used disk space - only in those that had a spike of say more than 20%.

I am assuming I'd need to:

  1. List volumes that had such a spike by calculating max and average values for e.g. UsePct for a volume and then leaving only those with the delta > 20;
  2. Run a timechart or something similar on those volumes.

Blanking out on how to do that and would appreciate your help - thanks!

P.P.S. This is as far as I've gotten - and it seems to correctly ID volumes with usage spikes (updated May 5):

sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| stats min(storage_used_percent) as min,
        avg(storage_used_percent) as avg,
        max(storage_used_percent) as max,
        by host, Name FileSystem DriveType
| eval delta = max - avg
| where delta>20
| sort - max delta avg

The above produces the full stats table for all hosts and volumes that had a spike. Adding | fields host Name to it would leave just the host and volume names. The question remains: what is the best way to plot storage_used_percent for those volumes over the timeframe of the search?

P.P.P.S. Bonus points for streamlining the above search and making it faster. More generally, I'm looking for a streamlined mechanism for pinpointing anomalies (spikes, unusual deviations or volatility) in any available metric, such as CPU, memory, disk and network utilization. (I have yet to properly configure the Splunk infrastructure apps; perhaps such mechanisms are included in those.)
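
One possible way to flag sudden jumps in any numeric metric is to compare each sample with a rolling baseline built from the samples just before it. The sketch below does this with streamstats. It is an untested illustration: the window of 12 samples, the 20-point threshold, and the field names baseline, jump, spike_samples and max_jump are assumptions chosen for this example and do not come from the searches above.

```
sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| sort 0 _time
| streamstats current=f window=12 avg(storage_used_percent) as baseline by host Name
| eval jump = storage_used_percent - baseline
| where jump > 20
| stats count as spike_samples max(jump) as max_jump by host Name
| sort -max_jump
```

Because current=f leaves the current sample out of its own baseline, a short burst does not dilute the value it is compared against. The same pattern should carry over to other metrics by swapping the base search and the metric field, although the window and threshold would likely need tuning for each metric.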


to4kawa
Ultra Champion

UPDATE:

sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| eval description=host."_".Name."_".FileSystem."_".DriveType
| bin _time span=1h
| stats min(storage_used_percent) as min,
           avg(storage_used_percent) as avg,
           max(storage_used_percent) as max by _time host description
| eval delta = max - avg
| eval flag = if(delta > 20,1,0)
| eventstats sum(flag) as flag by host
| where flag > 0
| sort 0 _time
| fields _time host max
| xyseries _time host max

I see, thanks for providing the details.
It would be much easier to help if other people wrote their questions like this too.


mitag
Contributor

Thank you - appreciate the kind words. Doesn't seem to be working though (probably something simple).

(Can't seem to post an image... Here is the link to the two screenshots. Hopefully this works.)


to4kawa
Ultra Champion

I've seen the screenshots.

The problem was caused by | where delta > 20

I've updated my answer.

mitag
Contributor

As is, it still doesn't work. See the same link above for two more screenshots. If I replace the last line with:

| timechart max(max) by host

... then it works.


to4kawa
Ultra Champion

Good news.

Please post the working query as an answer and accept it.

mitag
Contributor

I don't understand how yours works yet... 🙂

The one I've been battling with is this:

(sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs")
    [ search ((sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs"))
      | eval Name     = if (isnull (Name),       mount, Name)
      | eval FileSystem = if (isnull (FileSystem), Type, FileSystem)

      | stats min(storage_used_percent) as min,
              avg(storage_used_percent) as avg,
              max(storage_used_percent) as max,
              by host, Name FileSystem DriveType
      | eval delta = max - avg
      | where delta>20
      | table host Name
      ]
| timechart max(storage_used_percent) by host

... it works, but only for Windows hosts (sourcetype=WinHostMon source=disk). For Linux hosts (sourcetype=df source="df") it doesn't work yet.
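
Two things might explain the missing Linux hosts; both are guesses about the df data rather than confirmed facts. First, stats drops any event in which one of the by fields is null, so if df events carry no DriveType, Linux volumes never leave the subsearch. Second, the subsearch hands host and Name pairs back to the outer search as a filter, but on df events Name only exists inside the subsearch (it is created there by eval), so the outer search has nothing to match on. The untested sketch below assumes both apply. It also assumes storage_used_percent is populated for df events, and the names avg_used, max_used and volume are made up for this example.

```
(sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs")
| eval Name = coalesce(Name, mount)
| search
    [ search (sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs")
      | eval Name = coalesce(Name, mount)
      | stats avg(storage_used_percent) as avg_used max(storage_used_percent) as max_used by host Name
      | where max_used - avg_used > 20
      | fields host Name ]
| eval volume = host . " " . Name
| timechart limit=0 max(storage_used_percent) by volume
```

Normalizing Name in the outer search before the subsearch filter is applied gives both sourcetypes the same field to match on. Splitting the chart by volume rather than by host keeps several volumes on one host from collapsing into a single series.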

P.S. Thank you for all your help with this.


to4kawa
Ultra Champion

Hi @mitag
timechart builds its time buckets from the time range picker.
xyseries, however, only pivots the existing rows into columns (it swaps the vertical and horizontal axes).

For reference.


to4kawa
Ultra Champion

Hi @mitag

  1. Your query's results have no _time field, so nothing can build a timechart from them.
  2. You didn't provide sample logs. You may be able to write SPL without them, but others can't.
  3. After stats you can no longer compare against the original values. eventstats is better because it keeps the original events.
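
A minimal sketch of point 3, assuming the WinHostMon fields quoted in this thread: eventstats attaches the per-volume average and maximum to every event instead of collapsing the events, so _time survives and timechart can run in the same search without a subsearch. It is untested against real data, and the names avg_used, max_used and volume, along with the 15-minute span, are assumptions made up for this example.

```
sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| eventstats avg(storage_used_percent) as avg_used max(storage_used_percent) as max_used by host Name
| where max_used - avg_used > 20
| eval volume = host . " " . Name
| timechart span=15m limit=0 max(storage_used_percent) by volume
```

The where clause keeps every event of a volume whose overall maximum exceeds its average by more than 20 points, so the chart shows the whole timeline for those volumes and not only the spike itself.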

mitag
Contributor

Sorry for the delay! The sourcetype is the standard sourcetype=WinHostMon. Searching for Type=Disk or source=disk would give you disk stats. Events look like this:

Type=Disk
Name="C:"
DriveType="fixed"
TotalSpaceKB=116859900
FreeSpaceKB=62318744
FileSystem="NTFS"

(host = ws2016_016 source = disk sourcetype = WinHostMon)

(If you'd like, I can send you a sample of raw events.)

They are sampled every 5-15 minutes. Some additional fields are calculated - e.g. for the above single event these fields are:

 storage                114120.99609375
 storage_free            60858.1484375
 storage_free_percent       53.32774031126161
 storage_used            53262.84765625
 storage_used_percent       46.67225968873839

My specific case is this: on several of our hosts, the boot disk ("C:") went full (from about 45% to 100% within minutes, then after 15-45 minutes - back to normal). I need to do a report that only shows those hosts and volumes that had a spike, and plot those spikes over time.

We could of course just search for all hosts with volumes close to full (say, over 90%) - but that does not isolate the spikes correctly as some volumes have been close to full for a while.

So I am thinking:

  1. Calculate min, average and max storage_used_percent for each volume.
  2. Calculate the delta (difference) between max and avg for each volume / host.
  3. List hosts and volumes where that delta is over a threshold (say, 20%).
  4. Run a timechart command just on those volumes and hosts.

With the following search I am getting a list of hosts and volumes that had a spike:

sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| stats min(storage_used_percent) as min
        avg(storage_used_percent) as avg
        max(storage_used_percent) as max
        by host, Name FileSystem DriveType
| eval delta = max - avg
| where delta>20
| sort - max delta avg
| fields Name host

Now, how do I pipe the results into a timechart (or any other plotting mechanism)?

Thanks!


mitag
Contributor

Does this look right? (Feels weird - as if I am doing two very similar transforms one after another - i.e. doesn't feel efficient.)

sourcetype=WinHostMon source="disk" 
    [ search sourcetype=WinHostMon source="disk" 
      | stats min(storage_used_percent) as min,
              avg(storage_used_percent) as avg,
              max(storage_used_percent) as max,
              by host, Name FileSystem DriveType
      | eval delta = max - avg
      | where delta>20
      | sort - max delta avg
      | table host Name 
      ]
| timechart max(storage_used_percent) by host