I have an FTP data collector which pulls in files from an FTP server and dumps them into a directory monitored by Splunk.
The files are all of the IDA00*.dat files and are sourced from ftp://ftp2.bom.gov.au/anon/gen/fwo/
My script checks this FTP server about every 6 hours, and if the modified date on a file has changed, it re-downloads the file and replaces the copy in /home/phoenix/data/bom/
Splunk is set up to monitor this directory with the following conf files:
[monitor:///home/phoenix/data/bom]
disabled = 0
followTail = 0
host = BOM
index = bom
crcSalt = <SOURCE>
[source::...[/\\]bom[/\\]IDA00001.dat]
KV_MODE = none
SHOULD_LINEMERGE = false
sourcetype = bomIDA00001
REPORT-extractIDA00001 = IDA00001_Fields
priority = 100
The priority of 100 is required because Splunk ignores .dat files by default. I also recently had to remove .dat from /opt/splunk/etc/default/props.conf, because the priority setting stopped working for some reason and the data was being treated as binary (but that's for another topic).
[IDA00001_Fields]
DELIMS = "#"
FIELDS = loc_id,location,state,forecast_date,issue_date,issue_time,min_0,max_0,min_1,max_1,min_2,max_2,min_3,max_3,min_4,max_4,min_5,max_5,min_6,max_6,min_7,max_7,forecast_0,forecast_1,forecast_2,forecast_3,forecast_4,forecast_5,forecast_6,forecast_7,dummy
This seemed to be working OK for a while, but for some reason Splunk has stopped indexing the files, even though new files are coming in with completely different data (in particular the forecast_date). I can only see data in index=bom from the 28th of Sept and earlier. It is now the 29th and there should be data in Splunk for that day.
Running the following shows some activity on the file in question:
grep IDA00001.dat /opt/splunk/var/log/splunk/splunkd.log
09-29-2011 13:52:24.489 +1000 INFO WatchedFile - File too small to check seekcrc, probably truncated. Will re-read entire file='/home/phoenix/data/bom/IDA00001.dat'.
09-29-2011 14:48:50.167 +1000 INFO WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/home/phoenix/data/bom/IDA00001.dat'.
09-29-2011 14:48:50.167 +1000 INFO WatchedFile - Will begin reading at offset=0 for file='/home/phoenix/data/bom/IDA00001.dat'.
So it seems Splunk is processing the files. But are they actually being indexed, given that the data is not showing up?
Any help would be appreciated.
Is it possible the timestamping has changed? I'm just thinking it might be indexing the data, but it's been stamped with a different date/time from the one you are expecting.
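One way to test the timestamp theory is to search the index over all time (including the future) and compare event time with index time. The following is only a sketch: the helper names bom_timestamp_query and check_bom_timestamps are made up for illustration, the splunk path is the one used elsewhere in this thread, and the CLI will prompt for credentials.

```shell
#!/bin/sh
# Path to the Splunk CLI; this default is an assumption taken from the thread.
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunk/bin/splunk}"

# Hypothetical helper: build a search covering all time, plus events
# stamped up to 5 years ahead, for one of the monitored files.
bom_timestamp_query() {
    file="${1:-IDA00001.dat}"
    printf '%s' "index=bom source=\"*${file}\" earliest=0 latest=+5y"
    printf '%s\n' ' | stats count min(_time) AS first_event max(_time) AS last_event max(_indextime) AS last_indexed'
}

# Hypothetical helper: run the search through the Splunk CLI.
check_bom_timestamps() {
    "$SPLUNK_BIN" search "$(bom_timestamp_query "$1")"
}
```

Calling check_bom_timestamps IDA00001.dat returns epoch values. If last_indexed is recent but last_event is far from today's date, the events are probably being indexed under an unexpected timestamp rather than dropped.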
Unfortunately not. After clearing the monitored directory and then cleaning the indexes with the command
/opt/splunk/bin/splunk stop; /opt/splunk/bin/splunk clean eventdata -f -index bom; /opt/splunk/bin/splunk clean eventdata -f -index bom_summary; /opt/splunk/bin/splunk start
I retrieve the files again and Splunk shows zero events in the index.
I think you have to clean the _thefishbucket index on the forwarder. That's where Splunk stores the information about which files have already been indexed.
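As a rough sketch of that suggestion, the fishbucket (the _thefishbucket index) can be cleaned with the same clean eventdata command used above. The run and reset_fishbucket helpers and the DRY_RUN switch are hypothetical names added here for illustration, and the splunk path is assumed from this thread. Be aware that this makes Splunk forget every monitored file, so all of them will be re-read.

```shell
#!/bin/sh
# Path to the Splunk CLI; this default is an assumption taken from the thread.
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunk/bin/splunk}"

# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop Splunk, wipe its file-tracking state, restart.
reset_fishbucket() {
    run "$SPLUNK_BIN" stop &&
    run "$SPLUNK_BIN" clean eventdata -f -index _thefishbucket &&
    run "$SPLUNK_BIN" start
}
```

Running DRY_RUN=1 first prints the three commands without touching anything, which is a cheap sanity check before the destructive run.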
Something I just remembered about this issue.
The file has the extension .dat, and this is classified as binary by one of the Splunk configuration files.
We ended up removing it from /etc/system/default/props.conf under the stanza
[source::....(0t|a|ali|asa|au|bmp|cg|cgi|class|d|dat|deb|del|dot|dvi|dylib|elc|eps|exe|ftn|gif|hlp|hqx|hs|icns|ico|inc|iso|jame|jin|jpeg|jpg|kml|la|lhs|lib|lo|lock|mcp|mid|mp3|mpg|msf|nib|o|obj|odt|ogg|ook|opt|os|pal|pbm|pdf|pem|pgm|plo|png|po|pod|pp|ppd|ppm|ppt|prc|ps|psd|psym|pyc|pyd|rast|rb|rde|rdf|rdr|rgb|ro|rpm|rsrc|so|ss|stg|strings|tdt|tif|tiff|tk|uue|vhd|xbm|xlb|xls|xlw)]
sourcetype = known_binary
Obviously, the correct way to do this would be to add an override to the props.conf in your own app rather than editing the default file, which should take precedence over this default.