
Excessive Sources and Index Latency

Communicator

First, my environment: an aggregation server that is essentially a syslog server writing to file, a universal forwarder set up to watch that directory and forward all events to an index pool of 6 servers, and finally a search head that queries the 6 indexers.

The problem I'm seeing is twofold. First, I am constantly behind on events, sometimes by hours, sometimes by days, and at other times indexing just seems to stop altogether. Second, when I monitor a real-time/all-time search, events come in at very random times: some are a day old, some are a few hours old, and so on.

I have SoS (Splunk on Splunk) installed and checked for blocked queues, excessive index load, and warnings/errors on my search peers, and everything is within reason. The only thing that jumps out at me is that the sourcetype I'm working with has excessive index latency (>300,000 seconds, i.e. more than three days).

My theory is that syslog is writing events so fast, and rotating files to disk so often, that it is creating more files than Splunk can keep up with, which would explain the random index times for events. I was toying with the idea of enabling the tail option for my sourcetype to skip past all the old data and get caught up, but I'm not confident that would permanently solve my problem.
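For reference, the "tail option" is the followTail setting in inputs.conf on the forwarder (the monitor path below is a placeholder). Splunk's documentation cautions that followTail should be disabled again once you have caught up:

```ini
# inputs.conf on the universal forwarder (monitor path is hypothetical)
[monitor:///var/log/syslog-aggregation]
followTail = 1   # start reading at the end of each file, skipping existing data
```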

Syslog is configured to rotate the file every 20 minutes or when it reaches 512 MB. It never hits the 20-minute mark; it rotates at 512 MB about every 4 minutes (lots of logs :) ). Thoughts?
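For a rough sense of scale, the rotation settings above work out to roughly 360 new files and 180 GB per day, a bit over 2 MB/s sustained. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the syslog volume described above:
# one 512 MB file roughly every 4 minutes.
rotate_size_mb = 512        # rotation size threshold
rotate_interval_min = 4     # observed rotation interval

files_per_day = 24 * 60 // rotate_interval_min            # new files per day
gb_per_day = rotate_size_mb * files_per_day / 1024        # daily volume in GB
mb_per_sec = rotate_size_mb / (rotate_interval_min * 60)  # sustained throughput

print(files_per_day, round(gb_per_day), round(mb_per_sec, 1))  # 360 180 2.1
```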

1 Solution

Communicator

Thanks lguinn!!

Your suggestions definitely put me on the right path. The forwarder was monitoring approximately 13k files, obviously way too many. I set up file system management scripts to help alleviate that, and also bumped the rotation size limit up to 4 GB since I knew we would be adding more ISA servers soon and wanted some headroom.

The other half of this was caused by the 256 KBps throughput limit that universal forwarders have by default. I had looked at this before but forgot that the actual "256" setting lives in "../etc/apps/SplunkForwarder/limits.conf" rather than the typical "../etc/system/default/limits.conf". Increasing this throughput drastically improved performance, and my indexers are keeping up just fine.
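For anyone hitting the same wall: the default 256 KBps cap is the maxKBps setting under the [thruput] stanza, and an override can go in $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder, which takes precedence over the app-level default. The value below is an example; 0 removes the limit entirely:

```ini
# $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder
[thruput]
maxKBps = 0   # 0 = unlimited; alternatively set a higher cap, e.g. 1024
```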

Thanks for everyone's help 🙂


Legend

Great! Now that you have changed the size/number of files, I'll bet that the forwarder's CPU and memory usage have also dropped to reasonable levels!


Legend

So, syslog is creating a new file every few minutes, hundreds of files per day, for a single forwarder to manage. And you are asking that single forwarder to push about 180 GB of data per day across the network (about 2 MB per second on average).

I think your problem is on the forwarder, not on the index tier. Here are some questions; the answers may reveal the problem (and the solution):

1. What do CPU, memory, and network utilization look like on the forwarder?

2. The forwarder will be more effective managing fewer, larger files. If you set your rotation to 2 GB files, that's still not very big, and the forwarder will be much happier. This can make a huge difference in performance.

3. How many files are in the directory overall? Remember that Splunk continues to monitor "old" files even after they have been indexed; the forwarder cannot know that you will never add more data to a file. If there are more than a few thousand files in the directory (which seems likely), forwarder performance will suffer. Write a script that moves files older than a certain age (say, a week) to a different directory, and Splunk performance may improve dramatically. Try this command to see what the forwarder is actually monitoring: ./splunk list monitor

4. 180 GB per day may not sound like a big deal, but load is rarely flat over a 24-hour period. What do the peaks look like?

My personal bet is that addressing questions 2 and 3 will completely clear up your problem. 180 GB per day may be approaching the upper bound of what Splunk can forward per day from a single server, but I think it is still reasonable; just not as thousands of little files.
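A minimal sketch of the file-management script suggested in point 3, assuming files untouched for a week can safely be moved out of the monitored directory (paths and the age threshold are hypothetical examples):

```python
"""Move aged syslog files out of the directory that Splunk monitors,
so the forwarder no longer has to track them."""
import os
import shutil
import time

def archive_old_files(log_dir, archive_dir, max_age_days=7):
    """Move regular files whose mtime is older than max_age_days into archive_dir."""
    os.makedirs(archive_dir, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # Skip subdirectories (including archive_dir if it is nested in log_dir)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            shutil.move(path, os.path.join(archive_dir, name))
            moved.append(name)
    return moved

# Example usage (hypothetical paths), e.g. from a daily cron job:
# archive_old_files("/var/log/syslog-agg", "/var/log/syslog-agg/archive")
```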

Splunk Employee

johnathan_cooper,

I'd check to make sure your timestamps are being recognized properly by Splunk.

If you do an RT search over all time and see data coming in at different times (like you said, several hours in the past or in the future), then either the timestamps in the logs are wrong or Splunk isn't recognizing them. I don't think it has anything to do with your indexers being overloaded or your queues being blocked.

Communicator

The sourcetype is CEF (Common Event Format), and the timestamp is epoch time in millisecond format (14 digits), stored in the "rt" field.

My props.conf has TIME_PREFIX set to "\srt=", and I performed quite a few spot checks in the real-time search to verify that the timestamp Splunk displays matches the epoch value in the "rt" field. Everything seems to match up perfectly.
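For comparison, a props.conf stanza along these lines should parse an epoch-milliseconds rt= value. The sourcetype name is a placeholder; TIME_FORMAT uses Splunk's %s%3N extension (epoch seconds plus three subsecond digits), so adjust the subsecond digits and lookahead to match your data:

```ini
# props.conf (sourcetype name is hypothetical)
[cef_sourcetype]
TIME_PREFIX = \srt=
TIME_FORMAT = %s%3N
MAX_TIMESTAMP_LOOKAHEAD = 16
```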

Is this an adequate method of verifying or should I be looking elsewhere?
