Actually, thinking about this for a minute, I will try the following workaround:
Since Splunk will reindex the file every X minutes (X being the interval of our internal scheduled task), why not include ONLY the entries that were added to the original log file in the last X minutes?
Here is what the new process will look like:
1. At midnight, we start with an empty filtered log file on the network share.
2. X minutes later, our scheduled task runs and adds the log entries recorded in the last X minutes.
3. Splunk indexes the added log entries.
4. X minutes later, our scheduled task runs again. This time Log Parser replaces the filtered log file with a new file containing only the entries recorded in the last X minutes, so no duplicate entries are written.
5. Splunk indexes the new log entries; no duplicates should be indexed.
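To make the idea concrete, here is a minimal sketch of what the scheduled task does at each run. This is illustrative Python rather than the actual Log Parser query; the file paths, the window length, and the leading-timestamp log format are all assumptions for the example.

```python
from datetime import datetime, timedelta

# Assumed timestamp format at the start of each log line;
# adjust to match the real log (e.g. W3C logs put date and
# time in the first two fields).
TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def filter_recent(lines, now, window_minutes):
    """Return only the lines whose leading timestamp falls within
    the last `window_minutes` before `now`."""
    cutoff = now - timedelta(minutes=window_minutes)
    recent = []
    for line in lines:
        try:
            # Assume the first 19 characters are the timestamp.
            ts = datetime.strptime(line[:19], TS_FORMAT)
        except ValueError:
            continue  # skip lines that don't start with a timestamp
        if cutoff <= ts <= now:
            recent.append(line)
    return recent

def write_filtered(src_path, dst_path, window_minutes):
    """One run of the scheduled task: rebuild the filtered file
    from scratch with only the recent entries."""
    with open(src_path) as src:
        recent = filter_recent(src, datetime.now(), window_minutes)
    # Overwrite rather than append, so entries Splunk has already
    # indexed never reappear in the monitored file.
    with open(dst_path, "w") as dst:
        dst.writelines(recent)
```

The key design point is the overwrite in `write_filtered`: because each run replaces the file instead of appending, every entry appears in the monitored file during exactly one X-minute window, which is what keeps Splunk from indexing duplicates.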
Splunk should not get more than 10-20 MB in any given day from this data source.
Will try this revision and post the results.
Update:
Have been running the process shown above for the last couple of hours. The log file has reached 1.25 MB during that time (as expected). This means that over 24 hours, we would see an index volume of roughly 12 MB for that server, which is what we would expect. Compare this to 200 MB per day over the last couple of days!