Solved: Aggregate rate for entire cluster from individual ...

abhisawa · ‎11-22-2014

I have a cluster of more than 100 hosts that receive data over the network from multiple sources. I can calculate the rate of incoming data by collecting the 'RX Bytes' field from 'ifconfig' output every minute. My Splunk query to create a timechart for a single host looks like this:

index=os source=interfaces eth0 host=hostname1 | sort  -_time | streamstats current=false last(RXbytes) as lastRX  | eval RX_Thruput_bytes = ((lastRX-RXbytes)/(1024*60)) | timechart span=10m avg(RX_Thruput_bytes)

How can I add up avg(RX_Thruput_bytes) across all 100 hosts to determine the rate of incoming data for the entire cluster?

abhisawa · ‎11-23-2014

After multiple iterations and cross-verifying the results against actual ifconfig data, the following query works correctly. I updated streamstats with by host so that each host's delta is calculated only against that host's own previous reading.

index=os source=interfaces eth0  | sort 0 - _time
| streamstats current=f window=1 global=f last(RXbytes) as lastRX last(_time) as lastTime by host 
| eval thruput_kb = case(lastRX > RXbytes, (lastRX-RXbytes)/(1024*(lastTime-_time)))
| bucket span=4h _time  |stats avg(thruput_kb) as average_kb_per_host by host _time
| timechart span=4h sum(average_kb_per_host) as cluster_thruput_kb

Thank you, Martin, for providing the initial approach.
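
For anyone who wants to cross-check the arithmetic outside Splunk, here is a minimal Python sketch of the same idea: compute each host's rate only from that host's own consecutive readings, average per host per time bucket, then sum the host averages. The function names and the (host, epoch_seconds, rx_bytes) sample layout are made up for illustration and are not part of the query above.

```python
from collections import defaultdict

# Hypothetical helper names and sample layout: (host, epoch_seconds, rx_bytes).

def per_host_rates(samples):
    """Return (host, time, KB/s) computed between consecutive readings of the SAME host."""
    by_host = defaultdict(list)
    for host, t, rx in samples:
        by_host[host].append((t, rx))
    rates = []
    for host, points in by_host.items():
        points.sort()
        for (t0, rx0), (t1, rx1) in zip(points, points[1:]):
            # Like the case() guard: skip pairs where the counter did not grow
            # (for example after a reset). The rate is stamped on the earlier reading.
            # The t1 > t0 check is an extra assumption to avoid dividing by zero.
            if rx1 > rx0 and t1 > t0:
                rates.append((host, t0, (rx1 - rx0) / (1024 * (t1 - t0))))
    return rates

def cluster_throughput(samples, span_s):
    """Average each host's rates per time bucket, then sum the host averages."""
    per_bucket_host = defaultdict(list)
    for host, t, rate in per_host_rates(samples):
        per_bucket_host[(t - t % span_s, host)].append(rate)
    cluster = defaultdict(float)
    for (bucket, _host), values in per_bucket_host.items():
        cluster[bucket] += sum(values) / len(values)
    return dict(cluster)

# Host "a" receives 1 KB/s and host "b" receives 2 KB/s.
samples = [
    ("a", 0, 0), ("b", 0, 0),
    ("a", 60, 61440), ("b", 60, 122880),
    ("a", 120, 122880), ("b", 120, 245760),
]
print(cluster_throughput(samples, span_s=600))  # {0: 3.0}
```

Grouping by host before taking differences is the step that matters: if readings from different hosts are interleaved, the "previous" counter value can belong to another host and the delta is meaningless.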

martin_mueller · ‎11-22-2014

Something like this?

  index=os source=interfaces eth0 | sort - _time
| streamstats current=f window=1 global=f last(RXbytes) as lastRX last(_time) as lastTime
| eval thruput_kb = case(lastRX > RXbytes, (lastRX-RXbytes)/(1024*(lastTime-_time)))
| timechart span=10m avg(thruput_kb) as average_kb_per_host dc(host) as hosts
| eval average_kb_per_cluster = average_kb_per_host * hosts | fields - average_kb_per_host hosts

Assuming every host reports in every bucket, dc(host) for each bucket will be the number of hosts in your cluster. Note that the overall average is slightly dirty from a statistics point of view: if a single host has more or fewer reports than the others in a ten-minute bucket, its throughput will be weighted slightly more or less than that of the other hosts. This might be more correct from a statistics point of view:

  index=os source=interfaces eth0 | sort - _time
| streamstats current=f window=1 global=f last(RXbytes) as lastRX last(_time) as lastTime
| eval thruput_kb = case(lastRX > RXbytes, (lastRX-RXbytes)/(1024*(lastTime-_time)))
| bucket span=10m _time | stats avg(thruput_kb) as average_kb_per_host by _time host
| timechart span=10m sum(thruput_kb) as cluster_thruput_kb

My brain isn't quite sure which is more correct right now, so do try both and think about what works best.
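
To make the weighting difference between the two approaches above concrete, here is a small Python sketch with made-up numbers. The function names and the (host, kb_per_s) reading layout are hypothetical. It compares "average of all readings times number of hosts" with "sum of per-host averages" for one time bucket.

```python
# Hypothetical readings for ONE time bucket: (host, kb_per_s).

def pooled_average_times_hosts(readings):
    """First approach: avg() over every reading, multiplied by dc(host)."""
    values = [v for _, v in readings]
    hosts = {h for h, _ in readings}
    return sum(values) / len(values) * len(hosts)

def sum_of_host_averages(readings):
    """Second approach: avg() per host first, then sum over hosts."""
    per_host = {}
    for host, value in readings:
        per_host.setdefault(host, []).append(value)
    return sum(sum(vs) / len(vs) for vs in per_host.values())

# Every host reports the same number of times: both approaches agree.
even = [("a", 10), ("a", 10), ("b", 40), ("b", 40)]
print(pooled_average_times_hosts(even), sum_of_host_averages(even))      # 50.0 50.0

# Host "a" reports four times, host "b" only once: "a" dominates the pooled average.
uneven = [("a", 10)] * 4 + [("b", 40)]
print(pooled_average_times_hosts(uneven), sum_of_host_averages(uneven))  # 32.0 50.0
```

When hosts report unevenly, the per-host average followed by a sum keeps each host's contribution independent of how many readings it happened to send in the bucket.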

martin_mueller · ‎11-23-2014

Those differences are expected - every time you run the search the underlying data changes a little because the time range has progressed a little.
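
The effect described above can be sketched in Python with invented data. The function name, the one-reading-per-minute events, and the window boundaries are all assumptions for illustration; how a relative time range is actually resolved depends on how it is defined in Splunk. The sketch only shows that moving the end of the window changes the partially filled edge bucket while fully covered buckets stay the same.

```python
# Hypothetical model: bucketed averages over events with earliest <= t < latest.

def bucket_averages(events, earliest, latest, span_s):
    """Average value per span-aligned bucket for events inside the window."""
    sums, counts = {}, {}
    for t, value in events:
        if earliest <= t < latest:
            bucket = t - t % span_s
            sums[bucket] = sums.get(bucket, 0.0) + value
            counts[bucket] = counts.get(bucket, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}

# One reading per minute; the value is simply the minute index.
events = [(m * 60, float(m)) for m in range(300)]

# Same search run twice, ten minutes apart; only "latest" has moved.
run1 = bucket_averages(events, earliest=3600, latest=10800 + 300, span_s=3600)
run2 = bucket_averages(events, earliest=3600, latest=10800 + 900, span_s=3600)

print(run1[7200], run2[7200])    # 149.5 149.5  (fully covered bucket is stable)
print(run1[10800], run2[10800])  # 182.0 187.0  (partial edge bucket shifts)
```

Running the search over a fixed range such as "Yesterday" removes this source of variation, because the window boundaries no longer move between runs.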

martin_mueller · ‎11-23-2014

Are you running the search over a fixed time range (e.g. "Yesterday") or a relative time range (e.g. "Last 24 hours")?

abhisawa · ‎11-23-2014

I am running it over Last 24 hours, and the difference is very minor.

abhisawa · ‎11-23-2014

Martin, thank you for taking a look at this query. Your second query is what I was looking for, with the following modifications:

For some reason the stats average was coming out as zero for a few hosts, so I changed stats avg(thruput_kb) as average_kb_per_host by _time host to stats avg(thruput_kb) as average_kb_per_host by host _time. It looks like the field order does matter.
I think in timechart span=10m sum(thruput_kb) as cluster_thruput_kb you meant sum(average_kb_per_host).

So the final query, shown below, gives me believable output in the chart, BUT every time I run it the timechart shows minor variations for the same 24 hours' worth of data.

Is that expected?

index=os source=interfaces eth0 | sort 0 - _time
| streamstats current=f window=1 global=f last(RXbytes) as lastRX last(_time) as lastTime
| eval thruput_kb = case(lastRX > RXbytes, (lastRX-RXbytes)/(1024*(lastTime-_time)))
| bucket span=1h _time | stats avg(thruput_kb) as average_kb_per_host by host _time
| timechart span=1h sum(average_kb_per_host) as cluster_thruput_kb
