Getting Data In

How can I add Apache logs from an external web host to Splunk on a private network?

Explorer

Hey all:

I'm very interested in setting up Splunk to monitor all of my logs. One of my main requirements is the Apache log files on my web host. I have remote access to these files via HTTP (htaccess-protected) and FTP. Because of how the log files are handled, their file names include the date.

Because these servers belong to ISPs, I cannot install the normal Splunk forwarding agents.

The same scenario exists with a client that I'm looking to set Splunk up for (minus the HTTP access to the logs).

I also have cron access on my personal web host, as well as CGI capabilities; however, I would rather have this set up as a pull-type solution so that I don't need to start opening things up on the firewall.

I'm sure that the above scenario is pretty common, so I would imagine one of the Splunk aficionados out there has already tackled this many times.

Many thanks in advance,

Greg

1 Solution

Super Champion

I think the biggest thing you have to decide is how up-to-date you need your log data to be. Here are some options, starting from the most "live" to the most infrequent.

  1. Forwarder: If you use a Splunk forwarder, your log files are monitored and forwarded very quickly; normally, from within a search you see your events less than a minute after they occurred. That is ideal, but it sounds like this isn't an option for you. Also, I'm not sure what the overall performance is like when forwarding over the Internet (everything we have is intranet, so I can't speak to that, although I assume enabling compression and encryption would be very helpful options here).
  2. Synchronization:
    1. Push files from the hosted environment: Another option is to synchronize your files to your local network. You didn't mention what platform you are on, but since you mentioned cron, I assume it's Unix-based. If you can use rsync to push files to your local system (or, preferably, rsync over ssh), you could set up a job to push your log files incrementally, and then your log files could be indexed incrementally, as often as you set your cron job to run. (You would probably want to experiment with rsync's settings for how it handles temporary files. It may be preferable to have your log files appended to, instead of copied to a temporary file which is then updated and renamed over top of the original file. I imagine Splunk would do well with any of these, but some options will probably work better than others.)
    2. Do a recursive web copy using wget (thanks to Jrodman for this idea): You could use wget or curl to recursively pull your web log files over HTTP or FTP. If you combine this with a "continue" mode, the logs can be pulled incrementally.
  3. Pull entire log files: You could set up a daily (or weekly?) job to pull the log files off of FTP and put them in a local location that is monitored by Splunk. You would want to do this after your log file rotation so that you don't have to worry about events being written to a file after you copy it. You may also be able to gzip the file on the hosted environment before you pick it up; I'm not sure what flexibility you have here. If you only want your data indexed once a day, then this would work fine. Once the file is copied locally, Splunk provides a number of different ways to have it loaded. You could use the de facto "monitor" input, or, if you want the local copies deleted after they are indexed, a "batch" input. There are also various one-shot input options. In fact, I think you can even upload a log file via an HTTP POST if you really want to (I think this is limited to 500 MB, but I could be wrong; this probably would be the best option to start with anyway).
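The rsync push in option 2.1 could be sketched as a cron job on the hosted server. This is only a sketch: the schedule, the `splunkpull` account, the destination host, and both paths are hypothetical placeholders you'd replace with your own.

```shell
# Hypothetical crontab entry on the hosted (Unix) server.
# Every 15 minutes, push newly appended log bytes to the indexing machine
# over ssh. --append transfers only data added since the last run, which
# suits log files that are appended to in place.
*/15 * * * * rsync -az --append -e ssh "$HOME/logs/" splunkpull@indexer.example.net:/data/remote-logs/myhost/
```

For this to run unattended you'd want ssh key authentication set up between the two machines, so the cron job doesn't prompt for a password.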

Here is a quick example of a recursive copy using wget (option 2.2). This will copy the logs to a directory structure under C:\RemoteLogs.

wget --user=USER --password=PASS --continue --recursive --level=3 --no-parent --directory-prefix=C:\RemoteLogs http://my.site.example.net/path/to/my/logs

NOTES: This assumes you've installed a wget client for Windows. As with all HTTP traffic, your content will be transferred unencrypted, so if your data is sensitive this may not be an acceptable solution. Also be aware that I haven't actually tested this, and some of the options are purely a guess, so be prepared to tweak the example. To run this on a regular basis, you should be able to put it in a batch file and set that up as a scheduled task. For disk space management purposes, this approach will require some sort of log file cleanup process that first removes old log files from the hosted environment, and then removes the local copies.
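On the Splunk side, the "monitor" and "batch" inputs mentioned in option 3 are defined in inputs.conf. Here is a minimal sketch, assuming the C:\RemoteLogs destination from the wget example (the sourcetype is an assumption; use whatever matches your logs):

```ini
# inputs.conf -- continuously monitor the pulled log directory
[monitor://C:\RemoteLogs]
disabled = false
sourcetype = access_combined

# Or, to index each file once and delete the local copy afterwards,
# use a batch input instead (sinkhole is the only supported move_policy):
# [batch://C:\RemoteLogs]
# move_policy = sinkhole
```

Restart Splunk (or reload the inputs) after editing the file so the new stanza takes effect.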


Explorer

Hi Lowell,

OK - so I've been in touch with the other ISP, and now better understand the options on both ISPs. Both provide SSH access (I'm working on getting it activated on the one). My personal provider offers scheduled tasks; the other does not. Both have CGI access for things like PHP/Perl. Both have FTP access to the log files. Both automatically name, rotate, and gzip the log files.

The one provider makes the log files appear under domain.com/logs, with access controllable by permissions and .htaccess. The other keeps the log files in ~/logs, where ~/www is the domain web root.

I do not have any Unix systems, and the boss does not want any. So Cygwin could be a good option for using wget --continue. However, I may be better off trying to convince the boss that having one Linux box that only does monitoring is not a risk to the company. This option is looking more and more interesting.

I am not concerned with the data being encrypted. It's just web site traffic data. No forms or anything like that exist on the site.

I've checked for rsync, wget, and curl on the hosting providers. My personal host has curl, but that is it. Neither has wget or rsync.

Cheers,

Greg


Explorer

Hi Lowell,

Thanks for your clear and insightful response. A couple of added questions / clarifications however...

The Forwarder: Can it be set up to run simply using cron? I can only run things via cron (or CGI, of course) on my ISP, but I thought there might be a way to execute a "collect data and report" type function with the Forwarder.

Re: what platform. Oops... sorry for not including this point! The web servers are Unix, and the servers running Splunk are Windows (test machine XP Home OEM, production server will be Windows 2003 Server R2).

There is a very strong likelihood that I will end up using the option to pull the log files daily/weekly. Perhaps I'm lazy, or expecting more from Splunk than it can provide, but is there any way to set up such scheduled transfers easily? I could of course find some tool to do recurring scheduled FTP downloads, but that starts to make things more complex and kludgey.

Thanks for your help,

Greg


Super Champion

Greg, it looks like we are going to need some more info to help you out. First, do you have SSH access to your hosted environment? If so, that opens up a few options. Also, do you care if your log data is sent across the Internet unencrypted? Next, can you find out which of these tools you can run on your hosted environment? Check for rsync, wget, and curl. If you can get shell access and run <program> --version, you can confirm whether or not they are installed. Also, do you know if your provider allows directory listings of your log files?


Splunk Employee

Lowell's 5-star answer gives all the key points to think about, but I might consider using a simple shell script wrapping wget -C or curl to pull the log files incrementally as they're built out on the remote site. This is somewhat dependent on the provider having reliable behavior, but they probably do.
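Such a pull script might look like the sketch below. The URL, credentials, and output file name are hypothetical; the key idea is curl's `-C -` flag, which resumes a partial download from wherever the local copy left off (curl's analogue of wget's continue mode). The script only builds and prints the command, so you can inspect it before letting it run for real.

```shell
#!/bin/sh
# Hypothetical values -- substitute your own host, path, and credentials.
URL="http://my.site.example.net/logs/access_log"
OUT="access_log"
# -C -  resume the transfer from the current size of the local file
# -sS   suppress the progress meter, but still report errors
CMD="curl -sS -C - -o $OUT $URL"
echo "$CMD"
# Uncomment to actually run the transfer:
# eval "$CMD"
```

Run from cron (or a scheduled task under Cygwin), each invocation fetches only the bytes appended since the previous run, so a file being monitored by Splunk grows incrementally rather than being rewritten.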

Super Champion

geva, this would be a possible replacement for the rsync approach. It's a bit more quick-and-dirty than rsync. However, it may be your only option if you can't run rsync on your hosted system. I assume Jrodman is suggesting a log-pulling approach here, whereas rsync could easily be set up in either push or pull mode.


Super Champion

Just a minor correction: I believe Jrodman means wget -c, not wget -C. The arguments are case-sensitive. (This is short for wget --continue, if you prefer the more verbose style of arguments for readability.)


Explorer

They are pretty reliable, so no problem there. What would this accomplish? (I'm not a *nix expert.)



Super Champion

No, I don't believe the forwarder can be set up to run via cron. Even if you could rig up something like this, it's probably not a good idea. Running on Windows will make the rsync option more difficult (and Jrodman's curl or wget approach, too). If you have a local Unix machine, you could set up a forwarder on it and forward your events to your central indexer, or you could set up something like Cygwin and run rsync or other Unix tools on Windows directly. The last time I tried rsync compiled for Windows, I wasn't very impressed; but Cygwin should be fine.
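On the Windows side, the recurring pull could be registered with the built-in task scheduler. A sketch, assuming a hypothetical C:\Scripts\pull_logs.bat that wraps whichever transfer command you settle on:

```shell
schtasks /Create /SC DAILY /ST 02:00 /TN "PullApacheLogs" /TR "C:\Scripts\pull_logs.bat"
```

Scheduling it shortly after the host's nightly log rotation avoids copying a file that is still being written to.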

