report on disk usage spikes?

mitag · ‎04-29-2020

Need a report that:

Lists volumes with significant disk usage spikes over a given timeframe.
Plots those disk usage spikes over time.

P.S. Not interested in volumes with high percentage of used disk space - only in those that had a spike of say more than 20%.

I am assuming I'd need to:

List volumes that had such a spike by calculating max and average values for e.g. UsePct for a volume and then leaving only those with the delta > 20;
Run a timechart or something similar on those volumes.

Blanking out on how to do that and would appreciate your help - thanks!

P.P.S. This is as far as I've gotten - and it seems to correctly ID volumes with usage spikes (updated May 5):

sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| stats min(storage_used_percent) as min,
        avg(storage_used_percent) as avg,
        max(storage_used_percent) as max
        by host Name FileSystem DriveType
| eval delta = max - avg
| where delta>20
| sort - max delta avg

The above produces the full stats table for all hosts and their volumes that had a spike; adding | fields host Name to it would produce just the hosts and volume names; the question remains: what is the best way to plot storage_used_percent on those volumes over the timeframe of the search?

P.P.P.S. Bonus points for streamlining the above search and making it faster, and more generally for a streamlined mechanism for pinpointing anomalies (spikes, unusual deviations or volatility) in any available metric, such as CPU, memory, disk and network utilization. (I have yet to properly configure the Splunk infrastructure apps; perhaps such mechanisms are included in those.)
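
One way to approach the more general "spike on any metric" request is a rolling baseline rather than a whole-timeframe average. This is a different technique from the max-minus-average test above, and the sketch below is untested. It assumes the field names used in this thread (storage_used_percent, host, Name). The names baseline_pct, jump_pct, max_jump_pct and volume, the window of 12 samples, and the threshold of 20 are all illustrative choices, not anything confirmed by the posters.

```spl
sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| sort 0 _time
| streamstats global=f window=12 current=f
    avg(storage_used_percent) as baseline_pct
    by host Name
| eval jump_pct = storage_used_percent - baseline_pct
| eventstats max(jump_pct) as max_jump_pct by host Name
| where max_jump_pct > 20
| eval volume = host . " " . Name
| timechart limit=0 max(storage_used_percent) by volume
```

Here streamstats compares each sample with the average of the previous 12 samples for the same volume (global=f keeps a separate window per host and Name, current=f leaves the current sample out of its own baseline). The eventstats step then keeps every event of any volume that jumped at least once, so the chart shows the whole curve and not only the spike samples. Swapping storage_used_percent for another numeric field should apply the same idea to other metrics.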

to4kawa · ‎05-05-2020

UPDATE:

sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| eval description=host."_".Name."_".FileSystem."_".DriveType
| bin _time span=1h
| stats min(storage_used_percent) as min,
        avg(storage_used_percent) as avg,
        max(storage_used_percent) as max by _time host description
| eval delta = max - avg
| eval flag = if(delta > 20,1,0)
| eventstats sum(flag) as flag by host
| where flag > 0
| sort 0 _time
| fields _time host max
| xyseries _time host max

I see, thanks for providing the details.
It would be much easier to understand if other people wrote like this too.

mitag · ‎05-05-2020

Thank you - appreciate the kind words. Doesn't seem to be working though (probably something simple).

(Can't seem to post an image... Here is the link to the two screenshots. Hopefully this works.)

to4kawa · ‎05-05-2020

I see the pics.

This is because of |where delta > 20

My answer is updated.

mitag · ‎05-08-2020

As is, it still doesn't work. See the same link above for two more screenshots. If I replace the last line with:

| timechart max(max) by host

... then it works.

to4kawa · ‎05-08-2020

Good news.

Please post the corrected query and accept your own answer.

mitag · ‎05-08-2020

I don't understand how yours works yet... 🙂

The one I've been battling with is this:

(sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs")
    [ search ((sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs"))
      | eval Name     = if (isnull (Name),       mount, Name)
      | eval FileSystem = if (isnull (FileSystem), Type, FileSystem)

      | stats min(storage_used_percent) as min,
              avg(storage_used_percent) as avg,
              max(storage_used_percent) as max
              by host Name FileSystem DriveType
      | eval delta = max - avg
      | where delta>20
      | table host Name
      ]
| timechart max(storage_used_percent) by host

... it works, but only for Windows hosts (sourcetype=WinHostMon source=disk). For Linux hosts (sourcetype=df source="df"), not yet...
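
Two things in the search above could plausibly explain the Linux gap, assuming the df events carry mount and Type but no Name or DriveType (the thread implies this but does not show a df event). First, stats drops any event in which one of the by fields is missing, so grouping by DriveType would silently exclude every df event. Second, the eval statements live inside the subsearch only, so the Name="..." filter that the subsearch hands back cannot match df events in the outer search. A minimal, untested sketch that sidesteps both by grouping on a normalized field and returning only host (volume, avg_pct and max_pct are illustrative names):

```spl
(sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs")
    [ search (sourcetype=WinHostMon source=disk FileSystem!="SNFS") OR (sourcetype=df source="df" Type!="cvfs")
      | eval volume = coalesce(Name, mount)
      | stats avg(storage_used_percent) as avg_pct
              max(storage_used_percent) as max_pct
              by host volume
      | where max_pct - avg_pct > 20
      | dedup host
      | fields host
      ]
| timechart max(storage_used_percent) by host
```

As in the original, this plots the maximum across all volumes of each flagged host, not one line per volume.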

P.S. Thank you for all your help with this.

to4kawa · ‎05-09-2020

Hi @mitag
timechart creates its time buckets from the time picker range.
xyseries, however, only pivots the table (it swaps rows and columns).

For reference.

to4kawa · ‎04-29-2020

hi @mitag

Your query's output has no _time field, so nobody can build a timechart from it.
You didn't provide sample logs. You may be able to write SPL without logs, but others can't.
stats discards the original values, so they can't be compared afterwards; eventstats is a better fit.
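
The eventstats suggestion could look roughly like the sketch below. Unlike stats, eventstats keeps every original event (including _time) and only adds the per-volume aggregates as extra fields, so one pass can both filter and plot. This is untested and assumes the field names that appear elsewhere in this thread (storage_used_percent, host, Name); avg_pct, max_pct and volume are names made up for illustration.

```spl
sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| eventstats avg(storage_used_percent) as avg_pct
             max(storage_used_percent) as max_pct
             by host Name
| where max_pct - avg_pct > 20
| eval volume = host . " " . Name
| timechart limit=0 max(storage_used_percent) by volume
```

By default timechart keeps only the top 10 series and lumps the rest into OTHER; limit=0 is there to show every flagged volume.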

mitag · ‎05-05-2020

Sorry for the delay! The sourcetype is the standard sourcetype=WinHostMon. Searching for Type=Disk or source=disk would give you disk stats. Events look like this:

Type=Disk
Name="C:"
DriveType="fixed"
TotalSpaceKB=116859900
FreeSpaceKB=62318744
FileSystem="NTFS"

(host = ws2016_016 source = disk sourcetype = WinHostMon)

(If you'd like, I can send you a sample of raw events.)

They are sampled every 5-15 minutes. Some additional fields are calculated - e.g. for the above single event these fields are:

 storage                114120.99609375
 storage_free            60858.1484375
 storage_free_percent       53.32774031126161
 storage_used            53262.84765625
 storage_used_percent       46.67225968873839

My specific case is this: on several of our hosts, the boot disk ("C:") went full (from about 45% to 100% within minutes, then after 15-45 minutes - back to normal). I need to do a report that only shows those hosts and volumes that had a spike, and plot those spikes over time.

We could of course just search for all hosts with volumes close to full (say, over 90%) - but that does not isolate the spikes correctly as some volumes have been close to full for a while.

So I am thinking:

calculate min, average and max storage_used_percent for each volume;
calculate the delta (difference) between max and avg for each volume / host;
list hosts and volumes where that delta is over a threshold (say, 20%);
run a timechart command just on those volumes and hosts.

With the following search I am getting a list of hosts and volumes that had a spike:

sourcetype=WinHostMon source=disk FileSystem!="SNFS"
| stats min(storage_used_percent) as min
        avg(storage_used_percent) as avg
        max(storage_used_percent) as max
        by host, Name FileSystem DriveType
| eval delta = max - avg
| where delta>20
| sort - max delta avg
| fields Name host

Now, how do I pipe the results into a timechart (or any other plotting mechanism)?

Thanks!

mitag · ‎05-05-2020

Does this look right? (Feels weird - as if I am doing two very similar transforms one after another - i.e. doesn't feel efficient.)

sourcetype=WinHostMon source="disk" 
    [ search sourcetype=WinHostMon source="disk" 
      | stats min(storage_used_percent) as min,
              avg(storage_used_percent) as avg,
              max(storage_used_percent) as max
              by host Name FileSystem DriveType
      | eval delta = max - avg
      | where delta>20
      | sort - max delta avg
      | table host Name 
      ]
| timechart max(storage_used_percent) by host
