Hi,
I need to show a customer that Splunk is processing their entire file, and thought a good way of doing it would be to calculate the total size of events from particular sources and then compare it to the size of the log file itself. Is this possible? If so, how?
You should pick the best answer that got you to a solution and click Accept to close the question.
You can use the license_usage.log file, as suggested by SloshBurch.
Here is the query:
index=_internal source="*license_usage.log*" type=Usage | stats sum(eval(b/1024/1024)) AS volume_MB by s
This will give you the size of each source in MB.
GREAT query. Using this one now and very helpful. Thanks so much!
Don't forget the license_usage.log file. Assuming there is no congestion, the license_usage.log file shows, for any source (s), sourcetype (st), index (i), or host (h), the bytes (b) of that event. You could therefore add up (sum) the bytes for that file to show its true size. Alternatively, the rollover events written each night show a summary statistic of the same.
If there is no value for those fields, then you may be on an old version of Splunk OR there was index congestion.
The question is about source, so unfortunately in most environments the usage.log will not be accurate. If you have a small Splunk environment it will probably work, but Splunk squashes the values of source and host to keep the event counts down for the usage.log file. It doesn't squash index or sourcetype, so those would be accurate. If you are trying to use host or source and your environment is not small, this will most likely be less accurate than summing up the lengths of all the _raw data.
If you are using the default LINE_BREAKER, which means each line is a single event, then you can count lines. If you are sending all of the data (not diverting any to nullQueue), then you can count bytes.
Both like this:
index=* source=MyFile | eval bytes=len(_raw) | stats count AS Lines sum(bytes) AS Bytes by source
I used to do it this way but recently learned that this won't be 100% accurate, for two reasons. First, it assumes len and the license counter measure the same thing (they don't: len measures characters while the license counter measures bytes). Second, _indextime is not the same as _time. Sometimes forwarders get backed up and an item may be indexed some time after its _time value.
These are both excellent points and my answer was very US-centric and not fully qualified.
If comparing byte or character counts, be aware that Splunk does not index LINE_BREAKER characters ([\r\n], by default), so allow for that in your comparison.
I would probably compare the event count in Splunk to the number of lines in the log file, assuming a 1:1 ratio. This may not work if you merge multiple lines into a single event or split lines into multiple events.
This is another VERY excellent point.
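For the file side of the comparison discussed above, something like the following Python sketch could produce the numbers to set against the Lines and Bytes totals from the search. It is only a rough illustration: file_totals is a made-up helper, it assumes line breaks are runs of \r and \n and that each non-empty line becomes one event, and it assumes the file is UTF-8.

```python
import re

def file_totals(data: bytes, encoding="utf-8"):
    """Count lines, bytes and characters in raw file content, ignoring line breaks."""
    # Split on runs of \r and \n (assumed line-break pattern) and drop empty pieces,
    # so the line-break characters themselves are not counted.
    lines = [ln for ln in re.split(rb"[\r\n]+", data) if ln]
    return {
        "lines": len(lines),
        "bytes": sum(len(ln) for ln in lines),
        "chars": sum(len(ln.decode(encoding, errors="replace")) for ln in lines),
    }

# Hypothetical sample content with mixed line endings and a blank line.
sample = b"first line\r\nsecond line\n\nthird\n"
print(file_totals(sample))  # {'lines': 3, 'bytes': 26, 'chars': 26}
```

For a real log you would read the file in binary mode and pass its contents in, for example file_totals(open(path, "rb").read()), where path is whatever file the customer wants checked. If events are merged or split differently from one event per line, only the byte and character totals remain comparable.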