Getting Data In

How can I check that Splunk indexed the entire contents of a given file?

Splunk Employee

I would like to check that a given file has been fully indexed by Splunk.

I tried counting the lines in the source file using "wc -l" against the number of events indexed in Splunk, but this doesn't match up because some of my events include multiple lines.

How can I do this?

1 Solution

Splunk Employee

Checking the line count of a source file against the number of lines indexed by Splunk is straightforward. Here is an example with a 1,117-line file indexed as 7 events:

  • Count the number of lines in the source file, excluding lines that are empty or contain only whitespace characters (which Splunk doesn't index):

[root@beefysup01 ~]# grep -v -e "^\s*$" /var/log/Xorg.0.log | wc -l
1116
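To see why the blank-line exclusion matters, here is a minimal sketch of the same pipeline run against a throwaway file (rather than the Xorg.0.log from the example):

```shell
# Build a throwaway sample: 3 content lines, 1 empty line, 1 whitespace-only line
printf 'one\ntwo\n\n   \nthree\n' > /tmp/sample.log

wc -l < /tmp/sample.log                      # 5 lines in total
grep -v -e "^\s*$" /tmp/sample.log | wc -l   # 3 once blank lines are excluded
```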

  • In Splunk, search for all events from that source over All Time, and sum the values of the "linecount" field:

source="/var/log/Xorg.0.log" | stats sum(linecount)






The two numbers should match, provided that you are not working on a live file that is part of a rotation (e.g. /var/log/messages or $SPLUNK_HOME/var/log/splunk/metrics.log) and that you are not routing events from this file to the null queue.



Another method, though often less accurate, is to measure the byte count of the source file (again, excluding empty lines) and compare it against the aggregated byte count of all events indexed for that source:

  • Count the number of bytes in the source file, excluding lines that are empty or contain only whitespace characters (which Splunk doesn't index):

[root@beefysup01 ~]# grep -v -e "^\s*$" /var/log/anaconda.log | wc -c
841067
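The same exclusion applies byte-wise. A quick sanity check on a throwaway file (not the anaconda.log from the example) shows the blank line's newline byte being dropped:

```shell
# 4 + 4 + 1 + 6 = 15 bytes, one of which is a lone blank-line newline
printf 'one\ntwo\n\nthree\n' > /tmp/bytes.log

wc -c < /tmp/bytes.log                       # 15 bytes, blank line included
grep -v -e "^\s*$" /tmp/bytes.log | wc -c    # 14 bytes once the blank line is dropped
```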

  • In Splunk, search for all events from that source over All Time, compute the size of each event with the len() eval function on the raw event data (the "_raw" field), sum those sizes, and add the event count. The last step is important for an accurate byte count because Splunk "loses" one byte per event when it drops the trailing newline character of each event:

source="/var/log/anaconda.log" | eval esize=len(_raw) | stats sum(esize) AS sum_esize, count | eval fsize=sum_esize + count | fields fsize
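The "+ count" correction can be verified outside Splunk. Assuming one event per line for simplicity, awk's length($0) excludes the trailing newline, just as len(_raw) excludes the newline Splunk drops, so the sum of event lengths plus one byte per event should equal the file size:

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/events.log

# Sum of event lengths without their trailing newlines, plus one byte per event
awk '{ sum += length($0); n += 1 } END { print sum + n }' /tmp/events.log   # 17
wc -c < /tmp/events.log                                                     # 17
```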




Communicator

Here is something that will make this task a lot easier.
You can use a REST API call of the following format:
https://xx.xx.xx.xx:8089/services/admin/inputstatus/TailingProcessor:FileStatus

You can run that as a curl command from the CLI or simply open it in a web browser. Replace the x's with the IP address of the host where the Universal Forwarder is installed (or of the indexer, if the indexer is monitoring the files directly).

This REST API call returns a page showing each file that falls under a monitor: stanza and the status of its read.
Here is an example:

/opt/splunk/var/log/splunk/metrics.log.2

file position 25000134
file size 25000134
parent $SPLUNK_HOME/var/log/splunk
percent 100.00
type finished reading

Pretty self-explanatory.
If the file has not been fully read yet, the percentage will be less than 100.
If there is a reason why Splunk hasn't started reading the file (such as a CRC check), it will be stated there. Really good for troubleshooting.
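If you save that response to a file, a little awk makes it easy to spot anything Splunk has not finished reading. This is only a sketch against the plain-text layout shown above; the real endpoint returns XML (or JSON with output_mode=json), so adapt the parsing to whatever format you actually fetch:

```shell
# Sample modeled on the FileStatus layout above (not real endpoint output)
cat > /tmp/filestatus.txt <<'EOF'
/opt/splunk/var/log/splunk/metrics.log.2
percent 100.00
/opt/splunk/var/log/splunk/splunkd.log
percent 42.17
EOF

# Print every file whose read percentage is below 100
awk '/^\// { file = $0 } /^percent/ && $2 < 100 { print file, $2 }' /tmp/filestatus.txt
```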



Splunk Employee

One more note:

If you export events from search results for one source, \r\n is added as the line terminator when the file is constructed. As a result, the size of the exported file can differ from that of the original source file if the original used only \n as its newline character.
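The difference is easy to see with two tiny files; each CRLF line ending costs one extra byte over a plain LF:

```shell
printf 'one\ntwo\n' > /tmp/lf.txt        # LF endings, as in a typical Unix source file
printf 'one\r\ntwo\r\n' > /tmp/crlf.txt  # CRLF endings, as in an exported file

wc -c < /tmp/lf.txt     # 8 bytes
wc -c < /tmp/crlf.txt   # 10 bytes
```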

Splunk Employee

Great answer.

Additional info:

At index time, Splunk removes the character(s) parsed as the line-breaking character. So if \r\n was the line break, two bytes are removed per event; likewise, the \r\n on an empty line is removed.
However, these removed bytes still count toward licensing volume, as if the characters were present in the source file or the network source (TCP or UDP).
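As a rough illustration of that accounting (a sketch of the arithmetic, not Splunk's actual internals): for a CRLF source, the bytes stored per event drop by two, while the licensed volume reflects the original size:

```shell
# 5 + 5 + 7 = 17 bytes on disk, with \r\n terminating each of the 3 events
printf 'one\r\ntwo\r\nthree\r\n' > /tmp/crlf.log

licensed=$(wc -c < /tmp/crlf.log)   # full source size: 17 bytes
# Stored event bytes: each line with its \r\n terminator stripped (3 + 3 + 5 = 11)
stored=$(awk '{ sub(/\r$/, ""); sum += length($0) } END { print sum }' /tmp/crlf.log)
echo "licensed=$licensed stored=$stored"   # difference is 2 bytes per event
```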