We wanted to index the log file for one of our IIS web servers. Since IIS by default writes a lot of data to its log files (a log entry for every request to every asset), we had to come up with a way to trim the log file before feeding it to Splunk and stay under our daily indexing limit (we average 100 MB on a busy day, and we have other servers we'd like to monitor without upgrading the license).
Fun fact: IIS either logs everything or nothing. Out of the box, you can't tell it to log only certain activities or only requests to certain assets.
Here is what we did: a scheduled task runs Log Parser every X minutes; the query pulls only the entries we care about out of the raw IIS log and writes them to a smaller, filtered log file (replacing the previous one), and Splunk monitors that filtered file instead of the raw log.
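The actual filtering is a Log Parser query; as a rough illustration of the idea, here is the same kind of thing sketched in Python. The paths, field positions, and keep/drop rules are all hypothetical stand-ins, so adjust them to match the #Fields: line and traffic on your own server.

    import os

    # Hypothetical paths; substitute your own.
    SOURCE_LOG = r"C:\inetpub\logs\LogFiles\W3SVC1\current.log"
    FILTERED_LOG = r"C:\SplunkWatch\filtered_iis.log"

    # Illustrative rule: drop the static-asset noise that makes up most
    # of an IIS log, but always keep error responses.
    STATIC_EXTENSIONS = (".css", ".js", ".gif", ".jpg", ".png", ".ico")

    def keep(line):
        if line.startswith("#"):            # W3C header lines
            return True
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 11:                # malformed line
            return False
        # Positions assume the default W3C field order; check the
        # #Fields: line in your own log and adjust.
        uri_stem, status = fields[4].lower(), fields[10]
        if status.startswith(("4", "5")):   # always keep errors
            return True
        return not uri_stem.endswith(STATIC_EXTENSIONS)

    def rewrite_filtered_log():
        tmp = FILTERED_LOG + ".tmp"
        with open(SOURCE_LOG, encoding="utf-8", errors="replace") as src, \
                open(tmp, "w", encoding="utf-8") as dst:
            dst.writelines(line for line in src if keep(line))
        os.replace(tmp, FILTERED_LOG)       # replace the file, as Log Parser does

    if __name__ == "__main__":
        rewrite_filtered_log()

The scheduled task just invokes this (or the equivalent Log Parser command) every X minutes.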
On the first day we added the filtered log file to Splunk, we went over our limit: Splunk indexed 200+ MB worth of data out of it. My guess is that Splunk thinks it should consume the file again every time it detects that Log Parser has replaced it.
My question is: why does Splunk keep reindexing the whole file?
I think the main difference between a standard IIS log file and our filtered log file is that IIS appends to its log, whereas we (Log Parser) replace ours, so its CRC/header/footer information gets changed. Splunk then resets its pointer for that file and thinks it needs to go back to the beginning and reindex everything.
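If my theory is right, the check looks something like the toy sketch below. As I understand it, Splunk keys each monitored file on a CRC of its first bytes (256 by default via initCrcLength, stored in its "fishbucket" along with a seek pointer), so a rewritten file with different leading bytes looks brand new to it. The path is hypothetical and the details are from memory:

    import zlib

    CRC_HEAD_BYTES = 256  # Splunk's default initCrcLength, as I recall

    def head_crc(path):
        """CRC of the file's first bytes, roughly how Splunk identifies it."""
        with open(path, "rb") as f:
            return zlib.crc32(f.read(CRC_HEAD_BYTES))

    # When Log Parser rewrites the filtered file, the first 256 bytes come
    # out different (new header date, different first entry), the CRC no
    # longer matches, and Splunk treats it as a file it has never seen,
    # so it indexes the whole thing again from byte 0.
    before = head_crc(r"C:\SplunkWatch\filtered_iis.log")
    # ... scheduled task runs and replaces the file ...
    after = head_crc(r"C:\SplunkWatch\filtered_iis.log")
    print("Looks like a new file to Splunk:", before != after)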
Much obliged for any answers/comments,
Actually, thinking about this for a minute, I will try the following workaround:
Since Splunk will reindex the file every X minutes (X being the interval of our internal scheduled task), why not include ONLY the entries that were added to the original log file in the last X minutes?
Here is what the new process will look like: the scheduled task still runs Log Parser every X minutes, but the query now selects only the entries whose timestamps fall within the last X minutes, so each run replaces the filtered file with just the newest slice of data instead of the whole filtered history.
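A Python sketch of that time-window step (the interval and paths are stand-ins, the parsing assumes IIS's default UTC timestamps, and in practice this is combined with the asset/status filtering shown earlier):

    import os
    from datetime import datetime, timedelta, timezone

    TASK_INTERVAL = timedelta(minutes=10)  # "X": a stand-in for our real interval
    SOURCE_LOG = r"C:\inetpub\logs\LogFiles\W3SVC1\current.log"
    FILTERED_LOG = r"C:\SplunkWatch\filtered_iis.log"

    def entry_time(line):
        """Parse the leading 'date time' of a W3C entry, e.g. '2011-03-15 16:04:59'."""
        try:
            stamp = " ".join(line.split(" ", 2)[:2])
            parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            return parsed.replace(tzinfo=timezone.utc)  # IIS logs in UTC by default
        except ValueError:
            return None

    def write_recent_entries():
        cutoff = datetime.now(timezone.utc) - TASK_INTERVAL
        tmp = FILTERED_LOG + ".tmp"
        with open(SOURCE_LOG, encoding="utf-8", errors="replace") as src, \
                open(tmp, "w", encoding="utf-8") as dst:
            for line in src:
                if line.startswith("#"):   # skip W3C header lines
                    continue
                when = entry_time(line)
                if when is not None and when >= cutoff:
                    dst.write(line)
        os.replace(tmp, FILTERED_LOG)

    if __name__ == "__main__":
        write_recent_entries()

Splunk should still reindex the file on every replacement, but now "everything" is only X minutes of data, so the daily indexed volume converges on the true volume of interesting entries.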
Will try this revision and post the results.
Update:
Have been running the process shown above for the last couple of hours. The filtered log file has reached 1.25 MB in that time (as expected). At that rate, 24 hours would give an index volume of roughly 12 MB for that server, which is about what we would expect. Compare that to 200 MB in the last couple of days!