What is the appropriate way to calculate metric rates on counters and then sum them, either for a single stat or for a timechart? What does rate() of a metric mean: rate per sample or rate per second? I am looking for guidance.
I am extracting bind9 stats from our dozen recursive DNS servers every 5 minutes. The stats are counters. I am searching with a 10-minute span so that each bucket contains 2 samples for the rate calculations.
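To make the question concrete, this is how I picture a per-second rate being derived from the counter samples in one bucket. It is only a sketch of my assumption, (latest - earliest) / elapsed seconds, and not a statement of what rate() actually does. The function name and the sample values are made up, and counter resets are not handled.

```python
def per_second_rate(samples):
    """Assumed rate: (latest - earliest) / elapsed seconds.

    samples: list of (epoch_seconds, counter_value) pairs, in any order.
    Returns None when there is only one distinct timestamp.
    """
    ordered = sorted(samples)
    (t0, v0), (t1, v1) = ordered[0], ordered[-1]
    if t1 == t0:
        return None  # a single sample cannot produce a rate
    return (v1 - v0) / (t1 - t0)


# Hypothetical bucket: two samples taken 5 minutes (300 s) apart.
bucket = [(0, 1_000_000), (300, 1_090_000)]
print(per_second_rate(bucket))  # 300.0
```

If rate() works like this, the result is already per second and no further division by the sample interval would be needed.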
Base search: | mstats rate(QrySuccess) as QrySuccess rate(QryFailure) as QryFailure rate(QrySERVFAIL) as QrySERVFAIL rate(QryFORMERR) as QryFORMERR rate(QryNXDOMAIN) as QryNXDOMAIN rate(QryRecursion) as QryRecursion prestats=false WHERE index="test_network_metrics" AND host="*" span=10m by host | fields *
The numbers don't look quite right: at peak I expect traffic on the order of thousands of queries per second, so I think I botched the stats. System-wide, I am running about 14M qph, or about 3,900 qps. If I leave off the division by 300 (to convert 5 minutes to 1 second), it looks closer to normal, at about 30% of what I am expecting. Below is what I get from processing hourly summaries of DNS query transaction logs.
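For reference, here is the arithmetic behind the figures I expect, as a rough sketch. The even split across hosts is an assumption for illustration only, and system_rate is a hypothetical helper showing how I would expect per-host per-second rates to add up to a system-wide rate.

```python
QUERIES_PER_HOUR = 14_000_000  # system-wide figure from the query logs
HOSTS = 12                     # "a dozen" recursive servers

expected_qps = QUERIES_PER_HOUR / 3600   # hours -> seconds
per_host_qps = expected_qps / HOSTS      # assumes evenly spread load


def system_rate(per_host_rates):
    """Sum per-host per-second rates into one system-wide rate."""
    return sum(per_host_rates.values())


print(round(expected_qps))  # 3889
print(round(per_host_qps))  # 324
```

So a per-host rate of a few hundred per second, summed over all hosts, is roughly what I would expect to see if the rates are per second.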
I experimented with summing the latest on the target field, but the numbers come out about the same.