Hey all:
I'm very interested in setting up Splunk to monitor all of my logs. One of the main requirements is the Apache log files on my web host. I have remote access to these files via HTTP (htaccess protected) and FTP. Due to how the log files are handled, their file names include the date.
Because these servers belong to ISPs, I cannot install the normal Splunk forwarding agents.
The same scenario exists with a client that I'm looking to set Splunk up for (minus the HTTP access to the logs).
I also have cron access on my personal web host, as well as CGI capabilities; however, I would rather have this set up as a pull-type solution so that I don't need to start opening things up on the firewall.
I'm sure that the above scenario is pretty common, so I would imagine one of the Splunk aficionados out there has already tackled this many times.
Many thanks in advance,
Greg
I think the biggest thing you have to decide is how up-to-date you need your log data to be. Here are some options, starting from the most "live" to the most infrequent.

1. If your hosted environment gives you cron, I would assume it's unix based. So if you can use rsync to push files to your local system (or preferably, rsync via ssh), you could set up a job to push your log files to you incrementally, and then your log files could be indexed incrementally, as often as you set your cron job to run. (You would probably want to play around with rsync's settings for how it handles temporary files. It may be preferable to have your log files appended, instead of copied to a temporary file which is then updated and renamed over the top of the original file. I imagine Splunk would do well with any of these, but some options will probably work better than others.)

2. Use wget or curl to recursively pull your web log files over HTTP or FTP. If you combine this with a "continue" mode, then the logs can be pulled incrementally.

Here is a quick example of doing a recursive copy using wget (option 2). This will copy the logs into a directory structure under C:\RemoteLogs.
wget --user=USER --password=PASS --continue --recursive --level=3 --no-parent -P C:\RemoteLogs http://my.site.example.net/path/to/my/logs
NOTES: This assumes you've installed a wget client for Windows. (With --recursive, you want -P to set the destination directory; -O would concatenate everything into a single file.) As with all HTTP traffic, your content will be transferred in an unencrypted form, so if your data is sensitive this may not be an acceptable solution. Also be aware that I haven't actually tested this, and some of the options are purely a guess, so be prepared to tweak the example. To run this on a regular basis, you should be able to set this up as a batch file and run it as a scheduled task. For disk space management purposes, this approach will require some sort of log file cleanup process that, first, removes old log files from the hosted environment and, second, removes the local copies of the log files.
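The incremental push in option 1 might look something like the sketch below: a script run from the hosted account's crontab. The indexer hostname, the paths, and the dated-filename pattern are all assumptions to adjust for your provider; this is untested, not a definitive implementation.

```shell
#!/bin/sh
# Sketch of a cron-driven push of today's rotated Apache log to a
# local Splunk box over ssh. Every name below is a placeholder.

# Build the dated log filename; the YYYYMMDD pattern is an assumption,
# so match it to whatever your host actually writes.
logfile_name() {
    echo "access_log.$1.gz"
}

push_logs() {
    log="$HOME/logs/$(logfile_name "$(date +%Y%m%d)")"
    # -a preserves timestamps so rsync can skip unchanged files;
    # -e ssh tunnels the transfer over an encrypted channel.
    rsync -a -e ssh "$log" splunk@indexer.example.net:/data/weblogs/
}

# Only touch the network when explicitly asked to.
if [ "${1:-}" = "run" ]; then
    push_logs
fi
```

A crontab entry such as `15 * * * * $HOME/bin/pushlogs.sh run` would then push the file hourly, and Splunk can index the growing local copy.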
Hi Lowell,
OK - so I've been in touch with the other ISP, and now better understand the options on both ISPs. Both provide SSH access (I'm working on getting it activated on the one). My personal provider provides scheduled tasks; the other does not. Both have CGI access for things like PHP/Perl. Both have FTP access to the log files. Both automatically name, rotate, and gzip the log files.
The one provider has the log files appear under domain.com/logs - access controllable by permissions and .htaccess. The other keeps the log files in ~/logs, where ~/www is where the domain web root is.
I do not have any Unix systems, and the boss does not want any. So Cygwin could be a good option for using wget --continue. However, I may be better off trying to convince the boss that having one Linux box that only does monitoring is not a risk to the company. This option is looking more and more interesting.
I am not concerned with the data being encrypted. It's just web site traffic data. No forms or anything like that exist on the site.
I've checked for rsync, wget, and curl on the hosting providers. My personal host has curl, but that is it. Neither have wget or rsync.
Cheers,
Greg
Hi Lowell,
Thanks for your clear and insightful response. A couple of added questions / clarifications however...
The Forwarder: Can it be set up to run simply using cron? I can only run things using cron (or CGI, of course) on my ISP, but I thought that perhaps there was a way to execute a "collect data and report" type function for the Forwarder.
Re: what platform. Oops... sorry for not including this point! The web servers are Unix, and the servers running Splunk are Windows (test machine XP Home OEM, production server will be Windows 2003 Server R2).
There is a very strong likelihood that I will end up using the option to pull the log files daily/weekly. Perhaps I'm lazy, or expecting way more from Splunk than it can provide, but is there any way to set up such scheduled transfers easily? I could of course find some tool to do recurring scheduled FTP downloads, but this starts to make things more complex and kludgy.
Thanks for your help,
Greg
Greg, looks like we are going to need some more info to help you out. First, do you have ssh access to your hosted environment? If so, that opens up a few options. Also, do you care if your log data is sent across the internet unencrypted? Next, can you find out which of these tools you can run on your hosted environment? Check for rsync, wget, and curl. If you can get shell access and run <program> --version, you can confirm whether or not they are installed. Also, do you know if your provider allows directory listings of your log files?
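The tool check can be done in one pass; a minimal sketch, assuming a POSIX shell on the hosted system:

```shell
#!/bin/sh
# Report whether each transfer tool exists on this system, and its
# version if it does.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $("$1" --version 2>/dev/null | head -n 1)"
    else
        echo "$1: not found"
    fi
}

for tool in rsync wget curl; do
    check_tool "$tool"
done
```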
Lowell's 5 star answer gives all the key points to think about, but I might consider using a simple shellscript wrapping wget -C or curl to get the logfiles updated partially as they're built out on the remote site. This is somewhat dependent upon the provider having reliable behavior, but they probably do.
geva, this would be a possible replacement for the rsync approach. It's a bit more quick-n-dirty than rsync. However, it may be your only option if you can't run rsync on your hosted system. I assume that Jrodman is suggesting a log pulling approach here, whereas rsync could be set up more easily in either push or pull mode.
Just a minor correction. I believe Jrodman means wget -c, not wget -C. The args are case-sensitive. (This is short for wget --continue, if you prefer the more verbose style of arguments for readability.)
They are pretty reliable, so no prob there. What would this accomplish? (I'm not a *nix expert)
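Roughly: the continue flag makes repeated transfers pick up where the previous one stopped, so each run only fetches bytes not yet copied locally. A hedged sketch using curl's equivalent resume option (-C -), since curl is available in this setup; the host, credentials, and filename pattern are placeholders, not the real site details.

```shell
#!/bin/sh
# Sketch of a resumable pull of one day's Apache log using curl.
# All site-specific values below are placeholders.

remote_url() {
    # $1 = base URL, $2 = log file name
    echo "$1/$2"
}

pull_log() {
    base="ftp://my.site.example.net/logs"
    name="access_log.$(date +%Y%m%d)"
    mkdir -p "$HOME/RemoteLogs"
    # -C - tells curl to resume from the size of the local copy, so an
    # interrupted or repeated run continues instead of starting over.
    curl --user USER:PASS -C - -o "$HOME/RemoteLogs/$name" \
        "$(remote_url "$base" "$name")"
}

if [ "${1:-}" = "run" ]; then
    pull_log
fi
```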
No, I don't believe the forwarder can be set up to run via cron. Even if you could rig up something like this, it's probably not a good idea. Running on Windows will make the rsync option more difficult (and jrodman's curl or wget approach). If you have a local unix setup, then you could set up a forwarder on that machine and forward your events to your central indexer, or you could set up something like cygwin and run rsync or other unix tools on Windows directly. The last time I tried rsync compiled for Windows, I wasn't very impressed; but cygwin should be fine.
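Under cygwin, the daily pull could be a short script launched by the Windows Task Scheduler; a sketch under the assumption that yesterday's log is fully rotated and gzipped by the time it runs (the host and filename pattern are placeholders):

```shell
#!/bin/sh
# Daily pull of yesterday's rotated log, meant to be launched by a
# Windows scheduled task via cygwin's bash. Names are placeholders.

dated_name() {
    # The dated-filename pattern is an assumption; match your host's.
    echo "access_log.$1.gz"
}

pull_yesterday() {
    # GNU date (shipped with cygwin) handles the date arithmetic.
    name="$(dated_name "$(date -d yesterday +%Y%m%d)")"
    dest="/cygdrive/c/RemoteLogs"
    mkdir -p "$dest"
    # -c resumes a partial download; -P keeps files under $dest.
    wget --user=USER --password=PASS -c -P "$dest" \
        "http://my.site.example.net/logs/$name"
}

if [ "${1:-}" = "run" ]; then
    pull_yesterday
fi
```

Splunk's directory monitoring should then pick up new files as they land under C:\RemoteLogs.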