I am planning a Splunk deployment that involves indexing large number of gz files FTP from multiple sources.
Can I configure Splunk to monitor the directories containing these files directly?
My concern is that the directory I am thinking to have Splunk monitored is the FTP upload folder, and because the files are in the gz format. Will Splunk be confused when it sees a gz file still in the middle of transfer? Can I monitor the upload folder directly with Splunk? The plan is to use universal forwarder for the monitoring.
It depends on your environment if you can monitor the upload folder directly with splunk or not.
If the file under transferring via ftp have contains file extension like ".inprogress", you can avoid to index the file until the ftp transferring is done. Please use bracklist setting like as bellow for such a case.
In inputs.conf put a "blacklist = .inprogress$"
If the file under transferring via ftp does not have contains any extra extension like above, you should not monitor the file directly. In such a case, you need to move uploaded file to monitored directory by using script outside of splunk.
Hope this help.
Thanks for your reply. How does the gz file applied in your answer? I mean if the files were regular text files, could I monitor the FTP upload folder directly provided that there would be no special file extension to distinguish from completed files and files in transit.
Splunk uncompress gz file before indexing it, then splunk index the uncompressed text file. You can monitor the FTP upload folder directly and Splunk index the uploaded file. But I do not recommend it. If ftp transferring have netowrk connectivity issue, ftp client try to resend the file. However splunk can not distingush the file is new one or already index one. Since splunk recognize the file with hash algorithm, splunk does not understand the gz file is new one or not. It means there is possibiliy for splunk to index duplicated events. I faced this issue before. So, I do not recommend.