I am trying to figure out an approach to a multiline log file problem I have. The device that generates the file writes it like a regular running log, except that once it reaches 10MB it becomes FIFO, dropping the oldest data as new data is written. The only way I can get this file is via FTP, with Splunk monitoring the download path. I have my multiline event breaking working correctly for the most part, aside from a few stray events that are truncated at the source, but I can live with that. The issue is that if I simply overwrite the file with a newly downloaded copy, Splunk duplicates many events, since the first 256 bytes of the file have a different CRC than before and so do the last 256 bytes; it's really only the middle portion that is potentially the same.
Is there any tweak or method anyone can suggest for dealing with this situation, with the goal of not indexing any duplicate events?
First FTP of File Example
MSCi MSS01 2010-12-08 09:43:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:44:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:45:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:46:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:47:09
40+ random lines
END OF REPORT
Second FTP of File ~1 hour later
MSCi MSS01 2010-12-08 10:43:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:47:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:46:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:45:09
40+ random lines
END OF REPORT
MSCi MSS01 2010-12-08 09:44:09
Thanks
Jerrad
Rather than have Splunk index the FTP'd file directly, you could perhaps have a script run after each FTP to extract just the previously unseen events into a new file, and have Splunk monitor that file instead. Something along the lines of the sketch below.
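For illustration, here is a rough sketch of that kind of script in Python. It assumes each event runs from an "MSCi MSS01 ..." header line through the "END OF REPORT" line, as in your samples; the file paths are made up, so adjust them to wherever your FTP job downloads to and wherever Splunk monitors. It keeps a small side file of hashes of events it has already emitted, so repeated downloads only append events it has not seen before.

#!/usr/bin/env python
# Sketch of a post-FTP dedup step (hypothetical paths, untested against the real feed).
# Splits the downloaded file into events (header line through "END OF REPORT"),
# hashes each event, and appends only unseen events to the file Splunk monitors.

import hashlib
import os

DOWNLOADED = "/data/ftp/mss01_report.log"      # path the FTP job writes to (assumed)
MONITORED  = "/data/splunk/mss01_unique.log"   # path Splunk actually monitors (assumed)
SEEN_DB    = "/data/splunk/mss01_seen.txt"     # hashes of events already written out

def read_events(path):
    """Yield complete events, each ending with an END OF REPORT line."""
    event = []
    with open(path) as f:
        for line in f:
            event.append(line)
            if line.strip() == "END OF REPORT":
                yield "".join(event)
                event = []
    # anything left over is a truncated event at the tail of the FIFO file; skip it

def main():
    # load hashes of events emitted by previous runs
    seen = set()
    if os.path.exists(SEEN_DB):
        with open(SEEN_DB) as f:
            seen = set(line.strip() for line in f)

    new_hashes = []
    with open(MONITORED, "a") as out:
        for event in read_events(DOWNLOADED):
            digest = hashlib.md5(event.encode("utf-8")).hexdigest()
            if digest not in seen:
                out.write(event)
                seen.add(digest)
                new_hashes.append(digest)

    # remember what we emitted this run
    if new_hashes:
        with open(SEEN_DB, "a") as f:
            f.write("\n".join(new_hashes) + "\n")

if __name__ == "__main__":
    main()

You would run it from your FTP script (or a cron job) right after each download, and point your Splunk monitor stanza at the new file instead of the raw download. The truncated partial event at the tail of the FIFO file is simply skipped, which matches the truncation you said you can live with.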