What is the appropriate way to calculate metric rates on counters and sum them, either for a single stat or for a timechart? What does the rate() of a metric mean: rate per sample or rate per second? I am looking for guidance.
I am collecting bind9 stats from our dozen recursive DNS servers every 5 minutes. The stats are counters. I am searching with a 10-minute span so that each bucket has 2 samples for the rate calculation.
Base search:
| mstats rate(QrySuccess) as QrySuccess rate(QryFailure) as QryFailure rate(QrySERVFAIL) as QrySERVFAIL rate(QryFORMERR) as QryFORMERR
rate(QryNXDOMAIN) as QryNXDOMAIN rate(QryRecursion) as QryRecursion
prestats=false WHERE index="test_network_metrics" AND host="*" span=10m by host
| fields *
SingleStat Panel
| fields QrySuccess
| eval Success=QrySuccess/300
| stats sum(Success)
Timechart Panel
| fields QrySuccess host
| timechart span=10m latest(QrySuccess) as Success by host
The numbers don't look right: at peak I expect traffic on the order of thousands per second. I think I botched the stats. System-wide, I am running about 14M qph, or about 3900 qps. If I leave off the division by 300 (to convert 5 min to 1 sec), it looks closer to normal, at about 30% of what I am expecting. Below is what I get from processing hourly summaries of DNS query transaction logs.
I experimented with summing the latest on the target field, but the numbers come out about the same.
| fields QrySuccess host
| fillnull value=0.0 QrySuccess
| stats latest(QrySuccess) as Success by host
| addcoltotals labelfield=host fieldname=Success
| tail 1
| fields Success
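For comparison, here is a minimal sketch of the system-wide total for the single stat. It rests on an assumption: that rate() already returns a per-second value, computed as (latest - earliest) / (latest_time - earliest_time) within each span, as the Splunk documentation describes. If that holds, no division by 300 is needed. The field name TotalSuccessPerSec is made up for illustration, and the most recent bucket may hold too few samples to produce a rate.

```
| mstats rate(QrySuccess) as QrySuccess WHERE index="test_network_metrics" AND host="*" span=10m by host
| stats latest(QrySuccess) as Success by host
| stats sum(Success) as TotalSuccessPerSec
```

Under the same assumption, a timechart of the total across hosts could replace the last two lines with `| timechart span=10m sum(QrySuccess) as Success`.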