I am working with a hosting provider (Pantheon) that allows access to the access logs but does not allow installing a forwarder on its servers. So I installed a forwarder on a server I control, set up a process to pull in the log files, and configured the forwarder to index those files.
This question has been edited after additional information was found.
I came across a scenario where I was unaware of the round-robin DNS in place, so each time I connected to the hostname to pull in the log file, I could land on any one of several servers. Each pull overwrote the previous copy, and if I did not reach the same server as last time, the log file looked like a new file to Splunk and was indexed again. This happened over and over as the DNS result switched between the various endpoints.
The result was very strange: large numbers of duplicate entries, with the random nature of DNS making it hard to spot a pattern. Matters were confused further by a comment that mentioned log file truncation, and by my attempts to explain the results that way. It appeared that a circular log rotation practice was in place, but this was not the case.
The new question
Now that we are connecting to multiple servers to index a file of the same name, with server names that could change over time, would it be better to rename the file to include the hostname and monitor a single directory, or to create a directory for each host and have the forwarder recursively monitor all files in all directories under the root where the files are stored?
To index a file with the same name across multiple subdirectories, it was simple to add a monitor using the recursive wildcard "..." built into the path:
./splunk add monitor /path/to/logs/.../nginx-access.log
Under /path/to/logs/ I have a directory for each host that could appear, and each one contains an nginx-access.log file.
This seems to be the best way to do this. Once the strange behavior was explained, the rest was very straightforward. Thanks to Splunk and Pantheon support for helping get this figured out.
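For anyone setting up the same layout, here is a minimal sketch of the step that files each pulled log under its host's directory so the monitor above picks it up. The names LOG_ROOT, dest_for_host and store_log are made up for illustration, and the pull itself is left out because it depends on how you fetch the file and how you identify which server you reached.

```shell
#!/bin/sh
# Sketch only: LOG_ROOT, dest_for_host and store_log are hypothetical names.
# LOG_ROOT is the directory the forwarder monitors recursively.
LOG_ROOT="${LOG_ROOT:-/path/to/logs}"

# Print the per-host destination path for a pulled log file.
dest_for_host() {
    printf '%s/%s/nginx-access.log\n' "$LOG_ROOT" "$1"
}

# Copy a freshly pulled file into the directory for the host it came from,
# creating that directory on first sight of a new host.
store_log() {
    host="$1"
    src="$2"
    dest="$(dest_for_host "$host")"
    mkdir -p "$(dirname "$dest")"
    cp "$src" "$dest"
}
```

Usage would look something like `store_log "$server" /tmp/pulled/nginx-access.log`, where `$server` is whatever name or address identifies the server you actually connected to. Because each host only ever overwrites its own copy, a change in the DNS answer no longer replaces one server's log with another's.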
Wow, that's awful. I would initially echo dwaddle's suggestion to find a provider that offers a more sane experience.
As an implementer of the file monitor data input in Splunk, I've seen a lot of odd logging/filesystem/NFS/etc behaviors - this is among the worst. This is the first time I've seen someone actually observe a FIFO-like file (they should call it the Logness Monster.. you hear about it, and you never see it, but if you do -- get the eff out!).
You mentioned you "set up a process to pull in the log files" on your forwarder box -- what is this process? Do you have NFS access to the logs, or are you rsync'ing them across, or something else? If you have direct access to the raw logs, can you see how "tail -F" behaves on the file? Does it continue to follow the file, or is the file being truncated and thus confusing even tail?
If tail -F works reasonably well, you may be able to get away with a scripted input that just tails the log and dumps the contents to stdout - you would reduce the chance of dupes, but increase the chance of gaps if your forwarder has to be restarted (or if file truncation trips up tail -F).
If you need higher log fidelity than this (honestly, this would be "good enough" for web trends and the like), you're out of luck. If you need help setting up a scripted input to capture this data, feel free to ask or stop by in EFNet/#splunk - we can help.
I have revised the question; there were other forces at work that made this appear to be the case.
You could also look into changing the logging to use rotatelogs, but I'm assuming they won't give you direct access to the httpd.conf.
If that's the case, I would look into Ducky's suggestion.
They actually do use logrotate; the question has been edited to reflect the current findings.
So are they using a logfile as a "circular log" - as in wrapping back around to the beginning? Or are they actually rewriting the whole file with some early records removed?
Either way, this is bad juju. I think you will struggle to get this into Splunk correctly. This is the type of thing that would make me remove a provider from my list of suppliers I do business with ...
That.
If they want to save space, have them keep five rotated logfiles at a fifth the size.