Hey all:
I'm very interested in setting Splunk up to monitor all of my logs. One of the main requirements is the Apache log files on my web host. I've got access to these files remotely via HTTP (htaccess protected) and FTP. Due to how the log files are handled, their file names include the date.
Because these servers belong to ISPs, I cannot install the normal Splunk forwarding agents.
The same scenario exists with a client that I'm looking to set Splunk up for (minus the HTTP access to the logs).
I also have cron access on my personal web host, as well as CGI capabilities; however, I would rather have this set up as a pull-type solution so that I don't need to start opening things up on the firewall.
I'm sure that the above scenario is pretty common, so I would imagine one of the Splunk aficionados out there has already tackled this many times.
Many thanks in advance,
Greg
I think the biggest thing you have to decide is how up-to-date you need your log data to be. Here are some options, starting from the most "live" to the most infrequent.
- Forwarder: If you use a Splunk forwarder, your log files are monitored and forwarded very quickly; normally, from within a search, you see your events under a minute after they occurred. That is ideal, but it sounds like this isn't an option for you. Also, I'm not sure what the overall performance is like when forwarding over the Internet (everything we have is intranet, so I can't speak to that, although I assume enabling compression and encryption would be very helpful options here).
- Synchronization:
  - Push files from hosted environment: Another option is to synchronize your files to your local network. You didn't mention what platform you are on, but since you mentioned cron, I would assume it's Unix based. So if you can use rsync to push files to your local system (or preferably, rsync via ssh), you could set up a job to push your log files to you incrementally, and then your log files could be indexed incrementally, as often as you set your cron job to run. (You would probably want to play around with rsync's settings for how it handles temporary files. It may be preferable to have your log files appended to, instead of copied to a temporary file which is then updated and renamed over top of the original file. I imagine Splunk would do well with any of these, but some options will probably work better than others.) There is a rough sketch of this approach just after this list.
  - Do a recursive web copy using wget (thanks to Jrodman for this idea): You could use wget or curl to recursively pull your web log files over HTTP or FTP. If you combine this with a "continue" mode, then the logs can be pulled incrementally.
- Pull entire log files: You could set up a daily (or weekly?) job to pull the log files off of FTP and put them in a local location that is monitored by Splunk. You would want to do this after your log file rotation so that you don't have to worry about events being written to a file after you copy it. You may also be able to gzip the files on the hosted environment before you pick them up; I'm not sure what flexibility you have there. If you only want your data indexed once a day, then this would work fine. Once the file is copied locally, Splunk provides a bunch of different ways to have it loaded. You could use the de facto "monitor" input, or if you want the local copies deleted after they are indexed, you could use a "batch" input. And then you have various one-shot input options too. In fact, I think you can even upload a log file via an HTTP POST if you really want to (I think this is limited to 500MB, but I could be wrong; this would probably be the best option to start with anyway).
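For the rsync-over-ssh push described above, a sketch of the hosted-side cron job might look something like the following. This is untested, and the destination host, paths, and schedule are placeholders only:
# crontab entry on the web host: push new/changed Apache logs to your local machine every 15 minutes
*/15 * * * * rsync -az -e ssh /home/youruser/logs/ splunkbox.example.com:/data/remote-logs/myhost/
# experiment with rsync's --append option vs. its default temp-file-and-rename behavior to see which Splunk's monitor input handles best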
Here is a quick example of doing a recursive copy using wget (the recursive web copy option above). This will copy logs to a directory structure under C:\RemoteLogs.
wget --user=USER --password=PASS --continue --recursive --level=3 --no-parent http://my.site.example.net/path/to/my/logs --directory-prefix=C:\RemoteLogs
NOTES: This assumes you've installed a wget client for Windows. As with all HTTP traffic, your content will be transferred in an unencrypted form, so if your data is sensitive this may not be an acceptable solution. Also be aware that I haven't actually tested this, and some of the options are purely a guess, so be prepared to tweak the example. To run this on a regular basis, you should be able to put it in a batch file and set that up as a scheduled task. For disk space management purposes, this approach will require some sort of log file cleanup process(es): first to remove old log files from the hosted environment, and then to remove the local copies of the log files.
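On the Splunk side, picking the copied files up with the "monitor" or "batch" inputs mentioned in the last option might look roughly like this in inputs.conf. This is only a sketch; the path and sourcetype are placeholders, so double-check the inputs.conf documentation for your version:
# inputs.conf on the Windows Splunk box
# Keep the local copies and let Splunk watch them:
[monitor://C:\RemoteLogs]
sourcetype = access_combined
disabled = false
# Or index each file once and have Splunk delete it afterwards:
# [batch://C:\RemoteLogs]
# sourcetype = access_combined
# move_policy = sinkhole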
Hi Lowell,
OK - so I've been in touch with the other ISP, and now better understand the options on both ISPs. Both provide SSH access (I'm working on getting it activated on the one). My personal provider offers scheduled tasks; the other does not. Both have CGI access for things like PHP/Perl. Both have FTP access to the log files. Both automatically name, rotate, and gzip the log files.
The one provider has the log files appear under domain.com/logs - access controllable by permissions and .htaccess. The other keeps the log files in ~/logs, where ~/www is where the domain web root is.
I do not have any Unix systems, and the boss does not want any. So Cygwin could be a good option for using wget --continue. However, I may be better off trying to convince the boss that having one Linux box that only does monitoring is not a risk to the company. This option is looking more and more interesting.
I am not concerned about the data being sent unencrypted. It's just web site traffic data. No forms or anything like that exist on the site.
I've checked for rsync, wget, and curl on the hosting providers. My personal host has curl, but that is it. Neither has wget or rsync.
Cheers,
Greg
Hi Lowell,
Thanks for your clear and insightful response. A couple of added questions / clarifications, however...
The Forwarder: Can it be set up to run simply using cron? I can only run things using cron (or CGI, of course) on my ISP, but I thought that perhaps there was a way to execute a "collect data and report" type function for the Forwarder.
Re: what platform. Oops... sorry for not including this point! The web servers are Unix, and the servers running Splunk are Windows (test machine XP Home OEM, production server will be Windows 2003 Server R2).
There is a very strong likelihood that I will end up using the option to pull the log files daily/weekly. Perhaps I'm lazy, or expecting way more from Splunk than it can provide, but is there any way to set up such scheduled transfers easily? I could of course find some tool to do recurring scheduled FTP downloads, but this starts to make things more complex and kludgey.
Thanks for your help,
Greg
Greg, looks like we are going to need some more info to help you out. First, do you have ssh access to your hosted environment? If so, that opens up a few options. Also, do you care if your log data is sent across the Internet unencrypted? Next, can you find out which of these tools you can run on your hosted environment? Check for rsync, wget, and curl. If you can get shell access and run <program> --version, you can confirm whether or not they are installed. Also, do you know if your provider allows directory listings of your log files?
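If you do get shell access, one quick way to check all three tools at once might be a loop along these lines (just a sketch):
# run this on the hosted environment; prints the version of each tool, or a note if it's missing
for tool in rsync wget curl; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" --version | head -n 1
    else
        echo "$tool: not installed"
    fi
done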
Lowell's 5 star answer gives all the key points to think about, but I might consider using a simple shellscript wrapping wget -C or curl to get the logfiles updated partially as they're built out on the remote site. This is somewhat dependent upon the provider having reliable behavior, but they probably do.
geva, this would be a possible replacement for the rsync approach. It's a bit more quick-n-dirty than rsync. However, it may be your only option if you can't run rsync on your hosted system. I assume that Jrodman is suggesting a log-pulling approach here, whereas rsync could be set up more easily in either push or pull mode.
Just a minor correction. I believe Jrodman means wget -c, not wget -C. The args are case-sensitive. (This is short for wget --continue, if you prefer the more verbose style arguments for readability.)
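For what it's worth, a wrapper along the lines Jrodman suggests might look something like the sketch below, assuming a Unix-like environment (or Cygwin). The URL, date-based file name, and local path are invented, so adapt them to however your provider names the rotated logs; the commented curl line is an alternative if wget isn't available:
#!/bin/sh
# Incrementally pull today's Apache log over HTTP, resuming where the last run left off.
TODAY=$(date +%Y-%m-%d)
URL="http://my.site.example.net/logs/access_log.$TODAY"
DEST="/data/remote-logs/access_log.$TODAY"
# wget -c only fetches the bytes appended since the previous run
wget -c --user=USER --password=PASS -O "$DEST" "$URL"
# curl -C - -u USER:PASS -o "$DEST" "$URL"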
They are pretty reliable, so no prob there. What would this accomplish? (I'm not a *nix expert)
No, I don't believe the forwarder can be set up to run via cron. Even if you could rig up something like this, it's probably not a good idea. Running on Windows will make the rsync option more difficult (and Jrodman's curl or wget suggestion). If you have a local Unix setup, then you could set up a forwarder on that machine and forward your events to your central indexer, or you could set up something like Cygwin and run rsync and other Unix tools on Windows directly. The last time I tried rsync compiled for Windows, I wasn't very impressed, but Cygwin should be fine.
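If a local Unix (or Cygwin-friendly) box running a forwarder does become an option, the core of that setup might look roughly like this. A sketch only: the indexer host name, receiving port, and log path are placeholders, and the forwarder setup docs for your Splunk version are the authority here:
# inputs.conf on the forwarding machine: watch the directory the remote logs are synced into
[monitor:///data/remote-logs]
sourcetype = access_combined
# outputs.conf: forward events to the central indexer, with compression enabled
[tcpout]
defaultGroup = central_indexer
[tcpout:central_indexer]
server = splunk-indexer.example.com:9997
compressed = true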