Getting Data In

How can I add Apache logs from an external web host to Splunk on a private network?

geva
Explorer

Hey all:

I'm very interested in setting Splunk up to monitor all of my logs. One of the main requirements is the Apache log files on my web host. I can access these files remotely via HTTP (htaccess protected) and FTP. Because of how the log files are handled, their file names include the date.

Because these servers belong to ISPs, I cannot install the normal Splunk forwarding agents.

The same scenario exists with a client that I'm looking to set Splunk up for (minus the HTTP access to the logs).

I also have cron access on my personal web host, as well as CGI capabilities; however, I would rather set this up as a pull-type solution so that I don't need to start opening things up on the firewall.

I'm sure that the above scenario is pretty common, so I would imagine one of the Splunk aficionados out there has already tackled this many times.

Many thanks in advance,

Greg


Lowell
Super Champion

I think the biggest thing you have to decide is how up-to-date you need your log data to be. Here are some options, starting from the most "live" and ending with the most infrequent.

  1. Forwarder: With a Splunk forwarder, your log files are monitored and forwarded very quickly, and from within a search you normally see your events less than a minute after they occurred. That is ideal, but it sounds like this isn't an option for you. Also, I'm not sure what the overall performance is like when forwarding over the Internet (everything we have is on an intranet, so I can't speak to that, although I assume enabling compression and encryption would be very helpful here).
  2. Synchronization:
    1. Push files from hosted environment: Another option is to synchronize your files to your local network. You didn't mention what platform you are on, but since you mentioned cron, I would assume it's Unix based. So if you can use rsync (preferably rsync via ssh) to push files to your local system, you could set up a job to push your log files to you incrementally, and then your log files could be indexed incrementally, as often as you set up your cron job to run. (You would probably want to play around with rsync's settings for how it handles temporary files. It may be preferable to have your log files appended to in place, instead of copied to a temporary file which is then updated and renamed over the top of the original file. I imagine Splunk would do well with any of these, but there are probably some options that will work better than others.)
    2. Recursive web copy using wget (thanks to Jrodman for this idea): You could use wget (or curl with some scripting, since curl does not recurse on its own) to pull your web log files over HTTP or FTP. If you combine this with a "continue" mode, then the logs could be pulled incrementally.
  3. Pull entire log files: You could set up a daily (or weekly?) job to pull the log files off of FTP and put them in a local location that is monitored by Splunk. You would want to do this after your log file rotation so that you don't have to worry about events being written to a file after you copy it. You may also be able to gzip the file on the hosted environment before you pick it up; I'm not sure what flexibility you have here. If you only want your data indexed once a day, then this would work fine. Once the file is copied locally, Splunk provides a bunch of different ways to have it loaded. You could use the de facto "monitor" input, or if you want the local copies deleted after they are indexed, you could use a "batch" input. And then you have various one-shot input options too. In fact, I think you can even upload a log file via an HTTP POST if you really want to (I think this is limited to 500 MB, but I could be wrong; this probably would be the best option to start with anyway).
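For option 3, here is a rough, untested-against-your-host sketch of a daily FTP pull in Python. The file naming pattern (`access_log.YYYY-MM-DD`, optionally `.gz`), the function names, and the staging/monitored directory split are all assumptions of mine, so adjust them to whatever your web host actually produces. The idea is to skip today's file (it may still be written to) and to skip anything already fetched, so each log is indexed once.

```python
import re
from datetime import date
from ftplib import FTP
from pathlib import Path

# Hypothetical naming scheme: access_log.YYYY-MM-DD with an optional .gz suffix.
DATED_LOG = re.compile(r"^access_log\.(\d{4})-(\d{2})-(\d{2})(\.gz)?$")


def files_to_fetch(remote_names, local_names, today):
    """Return remote logs dated before `today` that are not stored locally yet."""
    already = set(local_names)
    wanted = []
    for name in sorted(remote_names):
        match = DATED_LOG.match(name)
        if not match:
            continue  # not a dated access log
        year, month, day = (int(g) for g in match.groups()[:3])
        try:
            log_day = date(year, month, day)
        except ValueError:
            continue  # looks like a date but is not a valid one
        # Today's file may still be receiving events, so leave it for tomorrow.
        if log_day < today and name not in already:
            wanted.append(name)
    return wanted


def pull_logs(host, user, password, remote_dir, staging_dir, monitored_dir):
    """Download finished logs into staging_dir, then move them into monitored_dir.

    Both directories should be on the same volume so the final move is a
    simple rename and Splunk never sees a half-written file.
    """
    staging = Path(staging_dir)
    monitored = Path(monitored_dir)
    staging.mkdir(parents=True, exist_ok=True)
    monitored.mkdir(parents=True, exist_ok=True)
    local_names = [p.name for p in monitored.iterdir()]
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        for name in files_to_fetch(ftp.nlst(), local_names, date.today()):
            partial = staging / name
            with open(partial, "wb") as handle:
                ftp.retrbinary("RETR " + name, handle.write)
            partial.replace(monitored / name)

# Example call (placeholder values):
# pull_logs("ftp.example.net", "USER", "PASS", "/logs", r"C:\LogStaging", r"C:\RemoteLogs")
```

Note that if you use a "batch" input that deletes files after indexing, the "already stored locally" check no longer works on its own, and you would need to track fetched names some other way (for example, in a small text file).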

Here is a quick example of doing a recursive copy using wget (option 2.2). This will copy logs to a directory structure under C:\RemoteLogs.

wget --user=USER --password=PASS --continue --recursive --level=3 --no-parent --directory-prefix=C:\RemoteLogs http://my.site.example.net/path/to/my/logs/

NOTES: This assumes you've installed a wget client for Windows. As with all plain HTTP traffic, your content will be transferred in an unencrypted form, so if your data is sensitive this may not be an acceptable solution. Also be aware that I haven't actually tested this, and some of the options are purely a guess, so be prepared to tweak the example. To run this on a regular basis, you should be able to put it in a batch file and run that as a scheduled task. For disk space management purposes, this approach will require some sort of log file cleanup process that first removes old log files from the hosted environment, and then removes the local copies of those log files.
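For the local half of that cleanup, something along these lines could work. This is only a sketch: the function name and the 30-day figure in the example call are made up, and it decides purely on file modification time. Keep in mind the ordering mentioned above: if a file is deleted locally while it still exists on the web host, the next wget run will download it again and Splunk may index it a second time.

```python
import time
from pathlib import Path


def remove_old_logs(directory, max_age_days, now=None):
    """Delete files under `directory` older than max_age_days.

    Walks subdirectories too, since a recursive wget creates a directory
    tree. Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in Path(directory).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)

# Example call (placeholder values):
# remove_old_logs(r"C:\RemoteLogs", 30)
```

Pick a retention period comfortably longer than the time Splunk needs to index the files, and run the script from the same scheduled task as the download.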
