I am trying to understand what would cause a variance between the volume counted against our license quota and the size of the logs being ingested. I have failed to find an explanation, and am wondering whether anybody else has figured out the reason for the variance.
As an example, I am sending syslog from a wifi AP to our syslog server, which is running a UF. On the syslog server, for Oct 27 I have 484 lines of logs, totaling 91,462 bytes in the log file.
In Splunk if I search for these events using this search:
index=network sourcetype=wifi
| eval eventSize=len(_raw)
| stats sum(eventSize) count by sourcetype
I also get 484 events, but using the len(_raw) function I get a total length of 90,978 characters. I would expect these numbers to be close, and they are.
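A note on those two numbers: 91,462 - 90,978 = 484, which is exactly one byte per line. That would be consistent with each line's newline terminator being counted on disk but not in _raw, although this is an inference from the arithmetic and not something confirmed in this thread. A minimal Python sketch of the same comparison, using made-up sample lines (not the real wifi log):

```python
# Hypothetical sample lines; the real wifi log is not reproduced here.
sample_lines = [
    "Oct 27 00:00:01 ap01 hostapd: station associated",
    "Oct 27 00:00:05 ap01 hostapd: station disassociated",
    "Oct 27 00:00:09 ap01 kernel: link up",
]

# What sits on disk: every line plus its "\n" terminator, encoded as UTF-8.
file_bytes = "".join(line + "\n" for line in sample_lines).encode("utf-8")

# Rough stand-in for sum(len(_raw)): characters per event, no terminator.
raw_chars = sum(len(line) for line in sample_lines)

# For pure-ASCII lines the gap equals the number of lines.
gap = len(file_bytes) - raw_chars
print(len(file_bytes), raw_chars, gap)
```

If the same holds for the real file, the on-disk size and the len(_raw) total agree once the 484 line terminators are set aside.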
Now when I look in the _internal index for the *license_usage.log metrics for the wifi sourcetype, using this query:
index=_internal sourcetype=splunkd source=*license_usage.log type=Usage st=wifi
| stats sum(b) as bytes by st
I get a different sum of 103,794 bytes.
I am trying to determine how this could be, or whether it makes sense. This isn't a large difference (roughly 12%), but it's spread across every index and sourcetype, and combined it adds up to a large part of our license quota that I cannot account for. Another example is our firewall logs, which are one of our largest indexes.
For Oct 27, len(_raw) = 62,899,298,079 characters. *license_usage.log = 81,139,209,296 bytes.
This is a difference of roughly 22%, over a much larger percentage of our quota. There are other large indexes with similar variance.
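For reference, the percentages quoted above come out when the gap is expressed as a share of the license_usage.log figure. A small Python sketch of that arithmetic, using only the numbers from this post (the function name is just for illustration):

```python
def gap_as_pct_of_reported(raw_chars: int, license_bytes: int) -> float:
    """Share of the license_usage.log figure not covered by len(_raw), in percent."""
    return (license_bytes - raw_chars) / license_bytes * 100


def gap_as_pct_of_raw(raw_chars: int, license_bytes: int) -> float:
    """The same gap, expressed relative to the len(_raw) total instead."""
    return (license_bytes - raw_chars) / raw_chars * 100


# Figures quoted in the post for Oct 27.
wifi = (90_978, 103_794)
firewall = (62_899_298_079, 81_139_209_296)

for name, (raw, reported) in {"wifi": wifi, "firewall": firewall}.items():
    print(
        f"{name}: {gap_as_pct_of_reported(raw, reported):.1f}% of reported, "
        f"{gap_as_pct_of_raw(raw, reported):.1f}% of len(_raw)"
    )
```

Measured against len(_raw) rather than the reported figure, the same gaps are about 14% for wifi and about 29% for the firewall logs, so it matters which baseline is used when comparing sourcetypes.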
I only have one production cluster, and I don't have a great way to verify this would be the same result on someone else's cluster. Do other people have this issue?
I am trying to find a logical reason for why this would be the case. Things that I have tried to track down:
1) Using the License Usage dashboards in the Monitoring Console on our LM: on License Usage - Today, I see numbers that align with the len(_raw) figures. On the Historical License Usage dashboard, if I switch to the "NoSplit" parameter, I get numbers that also align with the len(_raw) figures. If I change the Historical License Usage parameter to "SplitByIndex", I get numbers that align with the *license_usage.log figures.
2) I have a support case open to try to understand this difference. My SE told me that the "NoSplit" parameter (which uses type=RolloverSummary in its base search) is the correct metric for measuring license usage. My support tech told me that this is false, and that the "SplitByIndex" metric (using type=Usage) is the true count. Based on my manual measurements of the logs on the syslog server, I have to agree with the SE, but I do not have any way to prove my LM is reporting incorrectly.
3) I have looked for duplicates using a variety of searches. Most show a couple of events, but we are talking fewer than 10, and they are typically isolated to log sources with verbose or debugging output, like DNS.
4) I have looked for misconfigured inputs.conf or outputs.conf files. This did yield some results. I found one SH that had multiple outputs.conf files that were cloning some of the data inputs originating from that SH (a WIN!!), but with regard to the syslog wifi and firewall sourcetypes, this doesn't seem to be the case.
5) I have reviewed the character encoding that is described here: https://docs.splunk.com/Documentation/Splunk/latest/Data/Configurecharactersetencoding
Every props.conf that I can find is set to CHARSET=UTF-8, which as I understand it means Splunk is encoding all ingested logs as UTF-8. I thought this might be the culprit, as higher-order characters can take up multiple bytes in UTF-8. I do not think this is the case, though: the syslog logs for wifi only use the lower ASCII range of UTF-8 (0-127), which I believe takes up 1 byte per character. I am basing this on https://www.rfc-editor.org/rfc/rfc5424 and on visually inspecting the logs; there is not that much to them. This also aligns with my manual measurement of the characters in the shell and with len(_raw).
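The character-versus-byte point can be checked quickly outside Splunk. The Python sketch below assumes, as the post describes, that len(_raw) counts characters while the license figure is in bytes; both sample strings are invented for illustration.

```python
def chars_and_bytes(text: str) -> tuple:
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical events, not taken from the real logs.
ascii_event = "<134>Oct 27 00:00:01 ap01 hostapd: station associated"
mixed_event = "user=José site=Zürich status=✓"

for label, event in (("ascii", ascii_event), ("mixed", mixed_event)):
    n_chars, n_bytes = chars_and_bytes(event)
    print(f"{label}: {n_chars} chars, {n_bytes} bytes, extra={n_bytes - n_chars}")
```

For the pure-ASCII event the two counts match, which supports ruling out encoding as the cause for these syslog sources. Only sources containing characters outside 0-127 would show a gap from this effect.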
Am I missing something here (or not understanding the underlying process)?
Are there other root causes I am overlooking?
Does my cluster have some sort of issue, maybe with the configuration or architecture?
Do others see similar extra volume that is not easily explained, and is this normal behavior?
Anything helps, thanks.
Splunk measures license usage based on the data it receives on the index pipeline (the pipeline on the indexers from which data gets written to disk). At that point the data has been parsed and transformed from the raw data you'd see in the files: it has been split into events and mapped to metadata/tsidx files. The measured volume can be very different from the actual file size. Sometimes it is smaller, if there is a lot of whitespace in the log file or you're filtering events. It can be larger if the number of events is higher and more metadata is extracted. There is no clear guideline for how much a file will cost in licensing terms based only on its size on disk.
The rollover summaries are the license usage figures that Splunk uses to check whether you're going over your license.
This is very helpful to know.
Do you know if the additional metadata being extracted is counted against the quota? The Rollover Summaries count suggests that it is not.
I had a suspicion this was additional data getting added as things are being parsed, because you can see the additional fields and they must take up some space. I would guess this is why it varies from sourcetype to sourcetype and is not consistent across all inputs. The rollover summaries also make the most sense for the license usage count. Splunk support had given me some conflicting info about the rollover summaries, but I am not confident in it. I have had multiple people tell me that the rollover summaries are what I should be paying attention to. It has been a bit confusing for someone new to this type of license model. Thanks for helping confirm that.
I have also found a number of messy configurations that could be adding some additional events or conflicting with each other, and that need to be cleaned up.
One important thing to keep in mind is that license usage shows volume as per when it was indexed, which doesn't necessarily have to coincide with when the event took place (which is what you probably are looking at with the len(_raw) queries). So make sure to compare apples vs. apples by restricting that len(_raw) query by _indextime, rather than by _time.
Not saying this explains your case, but at least in theory it could explain differences.
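To make the index-time point concrete, here is a small Python sketch with invented events (the year, timestamps and sizes are hypothetical, chosen only to show the effect). Each event carries an event time, an index time and a size, and the same day's total is computed both ways:

```python
from datetime import date, datetime

# Hypothetical events: (event time, index time, size in bytes).
events = [
    (datetime(2023, 10, 26, 23, 59, 50), datetime(2023, 10, 27, 0, 0, 5), 200),
    (datetime(2023, 10, 27, 12, 0, 0), datetime(2023, 10, 27, 12, 0, 2), 300),
    (datetime(2023, 10, 27, 23, 59, 58), datetime(2023, 10, 28, 0, 0, 3), 250),
]


def bytes_for_day(rows, day: date, use_index_time: bool) -> int:
    """Sum event sizes for one calendar day, bucketed by event or index time."""
    position = 1 if use_index_time else 0
    return sum(row[2] for row in rows if row[position].date() == day)


day = date(2023, 10, 27)
by_event_time = bytes_for_day(events, day, use_index_time=False)
by_index_time = bytes_for_day(events, day, use_index_time=True)
print(by_event_time, by_index_time)
```

With these made-up rows the two totals differ even though no data was added or lost, because events near midnight land on different days depending on which timestamp is used. On a steady feed this effect should mostly cancel out from day to day, so it is more likely to explain day-level noise than a consistent 12-22% gap.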
Also not exactly sure how the difference between rolloversummary and usage data can be explained.
Thank you - That is a good distinction to make. I will change up my search.