I am trying to understand what would cause a variance between the volume counted against our license quota and the size of the logs actually being ingested. I have not been able to find an explanation, and am wondering if anybody else has figured out the reason for the variance.
As an example, I am sending syslog from a wifi AP to our syslog server, which is running a UF. For Oct 27, the log file on the syslog server contains 484 lines totaling 91,462 bytes.
In Splunk if I search for these events using this search:
index=network sourcetype=wifi
| eval eventSize=len(_raw)
| stats sum(eventSize) count by sourcetype
I also get 484 events, and summing the len(_raw) function gives a total length of 90,978 characters. I would expect these two numbers to be pretty close, and they are.
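One observation on these two numbers: the gap between the file size (91,462 bytes) and the len(_raw) total (90,978 characters) is exactly 484, the same as the line count. That would be consistent with one line terminator per line not being part of _raw, although I have not confirmed that this is the cause. Below is a minimal Python sketch for checking it against the file on the syslog server. The function name `measure_log` and the sample line are hypothetical, and it assumes every event is a single line ending in `\n` or `\r\n`.

```python
def measure_log(data: bytes) -> dict:
    """Count lines, total bytes, and bytes excluding line terminators."""
    # keepends=True keeps the terminator on each line so it can be measured.
    lines = data.splitlines(keepends=True)
    # Payload is each line with its trailing CR/LF characters removed.
    payload = sum(len(line.rstrip(b"\r\n")) for line in lines)
    return {
        "lines": len(lines),
        "total_bytes": len(data),
        "payload_bytes": payload,
    }


# Hypothetical sample data standing in for the real wifi log file.
sample = b"<134>Oct 27 00:00:01 ap01 wifi: client associated\n" * 3
result = measure_log(sample)

# Gap between the two measurements quoted in the post.
reported_gap = 91_462 - 90_978
```

To run it against a real file, pass the file contents read in binary mode, for example `measure_log(open(path, "rb").read())`, where `path` is whatever the log file is called on the syslog server. If `total_bytes - payload_bytes` equals `lines`, the file-versus-len(_raw) gap is just the newlines.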
Now when I look in the _internal index for the *license_usage.log metrics for the wifi sourcetype, using this query:
index=_internal sourcetype=splunkd source=*license_usage.log type=Usage st=wifi
| stats sum(b) as bytes by st
I get a different sum of 103,794 bytes.
I am trying to determine how this could be, or whether it makes sense. This isn't a large difference (roughly 12% of the license_usage.log figure), but it is spread out over every index and sourcetype, and combined it adds up to a large part of our license quota that I cannot account for. Another example is our firewall logs, which is one of our largest indexes.
For Oct 27, len(_raw) = 62,899,298,079 characters. *license_usage.log = 81,139,209,296 bytes.
This is a difference of roughly 22% of the license_usage.log figure, over a much larger share of our quota. There are other large indexes with similar variance.
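As a sanity check on the arithmetic, here is a small Python sketch that computes the gap for both examples. The helper name `license_gap` is hypothetical; the inputs are the numbers quoted above. Note that the percentages depend on the baseline: the roughly 12% and 22% figures are relative to the license_usage.log total, and the gap is larger when expressed relative to len(_raw).

```python
def license_gap(raw_chars: int, licensed_bytes: int) -> dict:
    """Compare a len(_raw) total against a license_usage.log total."""
    extra = licensed_bytes - raw_chars
    return {
        "extra_bytes": extra,
        # Share of the licensed volume that len(_raw) does not explain.
        "pct_of_licensed": 100 * extra / licensed_bytes,
        # The same gap expressed as growth over the len(_raw) total.
        "pct_over_raw": 100 * extra / raw_chars,
    }


wifi = license_gap(90_978, 103_794)
firewall = license_gap(62_899_298_079, 81_139_209_296)

# Average unexplained bytes per wifi event (484 events on Oct 27).
wifi_extra_per_event = wifi["extra_bytes"] / 484

for name, gap in (("wifi", wifi), ("firewall", firewall)):
    print(f"{name}: {gap['extra_bytes']:,} extra bytes, "
          f"{gap['pct_of_licensed']:.1f}% of licensed, "
          f"{gap['pct_over_raw']:.1f}% over len(_raw)")
```

For the wifi sourcetype this works out to about 26 unexplained bytes per event, which may be a useful clue if the extra volume turns out to be a fixed per-event overhead rather than something proportional to event size.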
I only have one production cluster, and I don't have a great way to verify this would be the same result on someone else's cluster. Do other people have this issue?
I am trying to find a logical reason for why this would be the case. Things that I have tried to track down:
1) Using the License Usage dashboards in the Monitoring Console of our LM, on License Usage - Today I see numbers that align with the len(_raw) totals. On the Historical License Usage dashboard, if I select the "NoSplit" option, I get numbers that also align with the len(_raw) totals. If I change the option to "SplitByIndex", I get numbers that align with the *license_usage.log totals.
2) I have a support case open to try to understand this difference. My SE told me that the "NoSplit" option (which uses type=RolloverSummary in its base search) is the correct measure of license usage. My Support tech has told me that this is false, and that the "SplitByIndex" option (which uses type=Usage) is the true count. Based on my manual measurements of the logs on the syslog server, I have to agree with the SE, but I do not have any way to prove my LM is reporting incorrectly.
3) I have looked for duplicates using a variety of searches. Most show a couple of events, but we are talking fewer than 10, and they are typically isolated to log sources with verbose or debugging output like DNS.
4) I have looked for misconfigured inputs.conf or outputs.conf files. This did yield some results. I found one SH that had multiple outputs.conf files that were cloning some of the data inputs originating from that SH (a WIN!!), but with regard to the syslog wifi and firewall sourcetypes, this doesn't seem to be the case.
5) I have reviewed the character encoding that is described here: https://docs.splunk.com/Documentation/Splunk/latest/Data/Configurecharactersetencoding
Every props.conf that I can find is set to CHARSET=UTF-8, which as I understand it means Splunk treats all ingested logs as UTF-8. I thought this might be the culprit, as higher-order characters can take up multiple bytes in UTF-8. I do not think this is the case, as the syslog logs for wifi only use the lower ASCII range of UTF-8 (0-127), which I believe takes 1 byte per character. I am basing this on https://www.rfc-editor.org/rfc/rfc5424 and on visually inspecting the logs; there is not that much to them. This also aligns with my manual count of the characters in the shell and with len(_raw).
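The ASCII claim in item 5 can be checked mechanically rather than by eye. The following is a minimal Python sketch, not anything Splunk does internally: it decodes a log as UTF-8 and compares the character count with the byte count. The function name `encoding_profile` and both sample strings are hypothetical.

```python
def encoding_profile(data: bytes) -> dict:
    """Compare character count with UTF-8 byte count for a log's contents."""
    # Raises UnicodeDecodeError if the data is not valid UTF-8.
    text = data.decode("utf-8")
    return {
        "chars": len(text),
        "utf8_bytes": len(data),
        # Code points above 127 need two or more bytes in UTF-8.
        "non_ascii_chars": sum(1 for ch in text if ord(ch) > 127),
    }


# Hypothetical samples: one pure ASCII, one with a single accented character.
ascii_sample = "Oct 27 ap01 wifi: client associated\n".encode("utf-8")
mixed_sample = "Oct 27 ap01 wifi: SSID caf\u00e9\n".encode("utf-8")

ascii_profile = encoding_profile(ascii_sample)
mixed_profile = encoding_profile(mixed_sample)
```

If `non_ascii_chars` is 0 for the real wifi log, then `chars` and `utf8_bytes` are equal and multi-byte encoding cannot account for any of the extra volume for that sourcetype.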
Am I missing something here (or not understanding the underlying process)? Are there other root causes I am overlooking? Does my cluster have some sort of issue, maybe with the configuration or architecture? Do others see similar extra volume that is not easily explained, and is this normal behavior?
Anything helps, thanks.