Getting Data In

How can I check that Splunk indexed the entire contents of a given file?

Splunk Employee

I would like to check that a given file has been fully indexed by Splunk.

I tried counting the lines in the source file using "wc -l" against the number of events indexed in Splunk, but this doesn't match up because some of my events include multiple lines.

How can I do this?

1 Solution

Splunk Employee

Checking the line count of a source file against the number of lines indexed by Splunk is straightforward. Here is an example with a 1,117-line file indexed as 7 events:

  • Count the number of lines in the source file, excluding lines that are empty or contain only whitespace characters (which Splunk doesn't index):

[root@beefysup01 ~]# grep -v -e "^\s*$" /var/log/Xorg.0.log | wc -l
1116
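To see why the blank-line exclusion matters, here is a minimal sketch of the same pipeline run against a throwaway file (rather than the Xorg.0.log from the example):

```shell
# Build a throwaway sample: 3 content lines, 1 empty line, 1 whitespace-only line
printf 'one\ntwo\n\n   \nthree\n' > /tmp/sample.log

wc -l < /tmp/sample.log                      # 5 lines in total
grep -v -e "^\s*$" /tmp/sample.log | wc -l   # 3 once blank lines are excluded
```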

  • In Splunk, search for all events from that source over All Time, and sum the values of the "linecount" field:

source="/var/log/Xorg.0.log" | stats sum(linecount)






The two numbers should match, provided that you are not working on a live file that is part of a rotation (e.g. /var/log/messages or $SPLUNK_HOME/var/log/splunk/metrics.log) and that you are not routing events from this file to the null queue.



Another method, though often less accurate, is to measure the byte count of the source file (again, excluding empty lines) and compare it against the aggregated byte count of all events indexed for that source:

  • Count the number of bytes in the source file, excluding lines that are empty or contain only whitespace characters (which Splunk doesn't index):

[root@beefysup01 ~]# grep -v -e "^\s*$" /var/log/anaconda.log | wc -c
841067
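The same exclusion applies byte-wise. A quick sanity check on a throwaway file (not the anaconda.log from the example) shows the blank line's newline byte being dropped:

```shell
# 4 + 4 + 1 + 6 = 15 bytes, one of which is a lone blank-line newline
printf 'one\ntwo\n\nthree\n' > /tmp/bytes.log

wc -c < /tmp/bytes.log                       # 15 bytes, blank line included
grep -v -e "^\s*$" /tmp/bytes.log | wc -c    # 14 bytes once the blank line is dropped
```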

  • In Splunk, search for all events from that source over All Time, compute the size of each event with the len() eval function on the raw event data (the "_raw" field), sum those sizes, and add the event count. The last step is important for an accurate byte count because Splunk "loses" one byte per event when it drops the trailing newline character of each event:

source="/var/log/anaconda.log" | eval esize=len(_raw) | stats sum(esize) AS sum_esize, count | eval fsize=sum_esize + count | fields fsize
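The "+ count" correction can be verified outside Splunk. Assuming one event per line for simplicity, awk's length($0) excludes the trailing newline, just as len(_raw) excludes the newline Splunk drops, so the sum of event lengths plus one byte per event should equal the file size:

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/events.log

# Sum of event lengths without their trailing newlines, plus one byte per event
awk '{ sum += length($0); n += 1 } END { print sum + n }' /tmp/events.log   # 17
wc -c < /tmp/events.log                                                     # 17
```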




Communicator

Here is something that will make this task a lot easier.
You can use a REST API call of the following format:
https://xx.xx.xx.xx:8089/services/admin/inputstatus/TailingProcessor:FileStatus

You can run that as a curl command from the CLI or simply open it in a web browser. Replace the x's with the IP address of the host where the Universal Forwarder is installed (or of the indexer, if the indexer is monitoring the files directly).

This REST API call returns a page showing each file that falls under a monitor: stanza and the status of its read.
Here is an example:

/opt/splunk/var/log/splunk/metrics.log.2

file position 25000134
file size 25000134
parent $SPLUNK_HOME/var/log/splunk
percent 100.00
type finished reading

Pretty self-explanatory.
If the file has not been fully read yet, the percentage will be less than 100.
If there is a reason why Splunk hasn't started reading the file (such as a CRC check), it will be stated there. Really good for troubleshooting.
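If you save that response to a file, a little awk makes it easy to spot anything Splunk has not finished reading. This is only a sketch against the plain-text layout shown above; the real endpoint returns XML (or JSON with output_mode=json), so adapt the parsing to whatever format you actually fetch:

```shell
# Sample modeled on the FileStatus layout above (not real endpoint output)
cat > /tmp/filestatus.txt <<'EOF'
/opt/splunk/var/log/splunk/metrics.log.2
percent 100.00
/opt/splunk/var/log/splunk/splunkd.log
percent 42.17
EOF

# Print every file whose read percentage is below 100
awk '/^\// { file = $0 } /^percent/ && $2 < 100 { print file, $2 }' /tmp/filestatus.txt
```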



Splunk Employee

One more note:

If you export events from search results for one source, \r\n is added as the line terminator when the file is constructed. As a result, the size of the exported file can differ from that of the original source file if the original used only \n as its newline character.
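The difference is easy to see with two tiny files; each CRLF line ending costs one extra byte over a plain LF:

```shell
printf 'one\ntwo\n' > /tmp/lf.txt        # LF endings, as in a typical Unix source file
printf 'one\r\ntwo\r\n' > /tmp/crlf.txt  # CRLF endings, as in an exported file

wc -c < /tmp/lf.txt     # 8 bytes
wc -c < /tmp/crlf.txt   # 10 bytes
```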

Splunk Employee

Great answer.

Additional info:

At index time, Splunk removes the character(s) parsed as the line-breaking character. So if \r\n was the line break, two bytes are removed per event; likewise, the \r\n on an empty line is removed.
However, these removed bytes still count toward licensing volume, as if the characters were present in the source file or the network source (TCP or UDP).
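As a rough illustration of that accounting (a sketch of the arithmetic, not Splunk's actual internals): for a CRLF source, the bytes stored per event drop by two, while the licensed volume reflects the original size:

```shell
# 5 + 5 + 7 = 17 bytes on disk, with \r\n terminating each of the 3 events
printf 'one\r\ntwo\r\nthree\r\n' > /tmp/crlf.log

licensed=$(wc -c < /tmp/crlf.log)   # full source size: 17 bytes
# Stored event bytes: each line with its \r\n terminator stripped (3 + 3 + 5 = 11)
stored=$(awk '{ sub(/\r$/, ""); sum += length($0) } END { print sum }' /tmp/crlf.log)
echo "licensed=$licensed stored=$stored"   # difference is 2 bytes per event
```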