I have a couple of clusters with log files that reside on a shared cluster filesystem that all hosts in the cluster log data to. The clustered applications can execute on any node but will always log to the same directory and file.
How can I prevent multiple cluster hosts running Splunk from indexing the same log file? The filenames and paths do not contain any useful information, such as a hostname, that could help me narrow the input stanza.
I suppose if it's a shared filesystem, you could just have a single instance of Splunk monitor the entire log filesystem and forward everything. If simply looking at the filesystem provides no way to determine where a file originally came from, I'm not sure what else there is to do. If you want to spread the forwarding load out, you could partition the set of files across different Splunk instances, e.g., one node reads a whitelist of /var/log/[a-m]* and another reads /var/log/[n-z]*. The nodes aren't necessarily reading the files they wrote, but I don't know if that matters.
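The partitioning above could be sketched in inputs.conf like this. Note that whitelist values in a monitor stanza are regular expressions matched against the full path, not shell globs; the /var/log paths are just the example from above:

```
# inputs.conf on node A -- monitor the shared directory, but only
# index files whose names start with a-m (whitelist is a regex)
[monitor:///var/log]
whitelist = /var/log/[a-m][^/]*$

# inputs.conf on node B -- the other half of the alphabet
[monitor:///var/log]
whitelist = /var/log/[n-z][^/]*$
```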
The problem with having only one node monitor everything is that we would need to reconfigure the forwarder on another node if the first one goes down for whatever reason (maintenance, crash, etc). These are all high-availability clusters designed to lose one or more nodes without requiring reconfiguration.
It would have been great if one Splunk instance could let other instances know that certain files are already being indexed (perhaps via some sort of lockfile).
Put the forwarder's Splunk code on the shared filesystem and run it as a cluster resource as well. The forwarder's internal index keeps track of which log files have been forwarded and how much of each, allowing it to restart intelligently after a cluster event.
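A cluster manager typically needs a start/stop wrapper to treat the forwarder as a resource. A minimal sketch, assuming the forwarder lives at /shared/splunkforwarder on the cluster filesystem (the path is an assumption, and you would adapt this to whatever resource-script convention your cluster software expects):

```
#!/bin/sh
# Sketch of an init-style wrapper so the cluster manager can fail
# the shared forwarder over between nodes. SPLUNK_HOME points at
# the forwarder install on the shared filesystem (assumed path).
SPLUNK_HOME=/shared/splunkforwarder

case "$1" in
  start)   "$SPLUNK_HOME/bin/splunk" start --accept-license --no-prompt ;;
  stop)    "$SPLUNK_HOME/bin/splunk" stop ;;
  restart) "$SPLUNK_HOME/bin/splunk" restart ;;
  status)  "$SPLUNK_HOME/bin/splunk" status ;;
  *)       echo "Usage: $0 {start|stop|restart|status}" >&2; exit 2 ;;
esac
```

Because the forwarder's state (including its record of how far each file has been read) lives on the shared filesystem alongside the binaries, whichever node starts the resource picks up where the previous one left off.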
Sounds like a workable solution. I would have to run n+1 Splunk forwarders, since I still need to index local files and run scripts on each node, but that's a minor problem.
I was thinking: would it be possible to just let all nodes index the log files in question and then delete the duplicates on the central Splunk server using a scheduled search of some kind? I haven't been able to find a search command that lists all duplicate events so that I could feed them to "| delete". Does something like this exist?
I do not think you would be happy with 'delete', both because of its speed (it is not very fast) and because the space in the index would not be reclaimed.
Speed and space are probably not an issue, as the log volume would be quite small. What kind of search could I use to find only the duplicates (something like the opposite of dedup)?
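One way to list only the duplicated events is to group on the raw event text and timestamp and keep groups that occur more than once. A sketch (the index name is a placeholder):

```
index=cluster_logs
| stats count values(host) as hosts by _time, source, _raw
| where count > 1
```

Note a caveat for the delete idea: this returns aggregated rows rather than the underlying events, and "| delete" removes every event a search returns, so a naive duplicate search would delete all copies rather than keeping one.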
Another way would be to just forward all events to be indexed at all times, duplicates or not, and then use the dedup command on the entire raw event (| dedup _raw) to filter out duplicates on the fly while you search.
Obviously this is not as elegant, but at least you will never risk missing events under any conditions or unexpected scenarios, assuming that is the more important requirement.
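A search using this approach might look like the following (index and sourcetype are placeholders). Including _time in the dedup avoids collapsing legitimately identical messages that were logged at different times:

```
index=cluster_logs sourcetype=app_log
| dedup _time, _raw
```

This could also be baked into a saved search or an event type so users don't have to remember to add the dedup themselves.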