I would like to check that a given file has been fully indexed by Splunk.
I tried counting the lines in the source file using "wc -l" against the number of events indexed in Splunk, but this doesn't match up because some of my events include multiple lines.
How can I do this?
Checking the line count of a source file against the number of lines indexed by Splunk is straightforward: sum the linecount field over the events instead of counting the events themselves. Here is an example with a file of 1117 lines, indexed as 7 multi-line events:
[root@beefysup01 ~]# grep -v -e "^\s*$" /var/log/Xorg.0.log | wc -l
1116
source="/var/log/Xorg.0.log" | stats sum(linecount)
Another method, although often less accurate, is to measure the byte count of the source file (again, excluding empty lines) and compare it against the aggregated byte count of all events indexed for that source, adding one byte per event for the line breaker that Splunk strips:
[root@beefysup01 ~]# grep -v -e "^\s*$" /var/log/anaconda.log | wc -c
841067
source="/var/log/anaconda.log" | eval esize=len(_raw) | stats sum(esize) AS sum_esize, count | eval fsize=sum_esize + count | fields fsize
Here is something that will make this task a lot easier.
You can use a REST API call of the following format:
https://xx.xx.xx.xx:8089/services/admin/inputstatus/TailingProcessor:FileStatus
You can run that as a curl command from the CLI or just paste it into a web browser. Replace xx.xx.xx.xx with the IP address of the host where the universal forwarder is installed (or of the indexer, if the indexer is monitoring the files directly).
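A quick sketch of the curl approach; the host placeholder and the admin:changeme credentials are assumptions, so substitute your own:

```shell
#!/bin/sh
# Build the FileStatus REST URL for the management port (8089 by default).
HOST="${SPLUNK_HOST:-xx.xx.xx.xx}"
URL="https://$HOST:8089/services/admin/inputstatus/TailingProcessor:FileStatus"
echo "$URL"

# Uncomment to fetch (requires network access to the management port;
# -k skips certificate verification for the default self-signed cert):
# curl -k -u admin:changeme "$URL"
```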
This REST API call returns a page showing each file that falls under a monitor: stanza and the status of its read.
Here is an example:
/opt/splunk/var/log/splunk/metrics.log.2
file position 25000134
file size 25000134
parent $SPLUNK_HOME/var/log/splunk
percent 100.00
type finished reading
Pretty self-explanatory.
If the file has not been fully read yet, the percentage will be less than 100.
If there is a reason why Splunk hasn't started reading the file (such as a CRC issue), it will be stated there. Really useful for troubleshooting.
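If you want to script this check, you can filter the response for the percent line. The here-doc below mimics the FileStatus output shown above; in practice you would pipe the curl output through the same awk filter:

```shell
#!/bin/sh
# Extract the "percent" value from a FileStatus-style block of text.
pct=$(awk '/^percent/ {print $2}' <<'EOF'
file position 25000134
file size 25000134
parent $SPLUNK_HOME/var/log/splunk
percent 100.00
type finished reading
EOF
)
echo "$pct"    # 100.00
```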
One more thing to note:
If you export the search results for one source, \r\n is appended to each event when the file is constructed. As a result, the exported file and the original source file can differ in size if the original source file used only \n as its newline character.
Great answer.
Additional info:
At index time, Splunk removes the character(s) that were parsed as the line breaker. So if \r\n was the line breaker, two bytes are removed per event; likewise, the \r\n of an empty line is removed.
However, Splunk still counts these removed bytes toward the licensing volume, as if the characters had remained in the source file or in the network source (TCP or UDP).
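A small sketch of the licensing point, using made-up sample strings: the same two events cost two extra licensed bytes per line when the source uses \r\n, even though the breaker is stripped from _raw either way.

```shell
#!/bin/sh
# Byte counts as they appear on the wire / on disk, which is what licensing sees.
printf 'alpha\r\nbeta\r\n' | wc -c    # 13 bytes with \r\n breakers
printf 'alpha\nbeta\n' | wc -c        # 11 bytes with \n breakers
# Indexed _raw is "alpha" + "beta" = 9 bytes in both cases.
```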