Diagnosing Indexing Latency

Communicator

I am monitoring logs across the LAN within a single datacenter. One Windows 2008 server acts as both the indexer and the Splunk server. It has 4 cores and 6 GB of memory.

The files being monitored are all on other, remote Windows servers. Installing forwarders on each of the hundreds of servers is not an option.

The server is not breathing hard at all. Average CPU is about 25%, with a peak of 40%, and almost 3 GB of memory is free.

Does not seem to be a server issue.

I run this search for the last 24 hours to look at latency.

index=* | eval latency = (_time - _indextime) * -1 | eval indextime=strftime(_indextime,"%+") | search latency > 0 | stats avg(latency) max(latency) min(latency) by host

The results are all over the place even for servers within the same datacenter. The servers that I am monitoring are not terribly busy.

I find it hard to believe that the fiber LAN within the datacenter is the issue.

How do I diagnose this?


Legend

First, you are calculating latency as the delta between the timestamp of an event and the time the event was indexed. This is a valid measure only if you are just indexing current data.

Whenever you create a new input, Splunk has to "catch up" by indexing any existing data in the file. For example, if you decide to index "mylog.log" and it has data for the past 3 months in it - the first events will show a 3-month lag between their timestamp and the index time. This is going to wildly skew your reporting.
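To make the skew concrete, here is a minimal Python sketch of the arithmetic. The timestamps are made-up epoch seconds, not values from any real index, and `latency_seconds` is a hypothetical helper that mirrors the `_indextime - _time` calculation in the search above. It suggests why a median (or filtering out backfilled hosts) may be a more useful summary than an average when old data is being caught up.

```python
from statistics import mean, median

def latency_seconds(event_time, index_time):
    """Indexing latency: how long after the event timestamp it was indexed."""
    return index_time - event_time

# Hypothetical (event_time, index_time) pairs in epoch seconds.
BASE = 1_700_000_000
steady_state = [(BASE, BASE + 2), (BASE + 60, BASE + 63), (BASE + 120, BASE + 121)]

# One event from a 3-month-old log line that was indexed only now.
backfilled = [(BASE - 90 * 86400, BASE)]

lat_steady = [latency_seconds(e, i) for e, i in steady_state]
lat_all = lat_steady + [latency_seconds(e, i) for e, i in backfilled]

print("steady-state average:", mean(lat_steady))
print("average with one backfilled event:", mean(lat_all))
print("median with one backfilled event:", median(lat_all))
```

A single backfilled event is enough to push the average from a couple of seconds to weeks, while the median barely moves.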

But let's assume that your environment is in a "steady state" - no new inputs or new Windows servers are being monitored - and you are still seeing the latency. My first questions are:

  • Does it matter? Do searches run quickly? Are there any errors in splunkd.log that would indicate that the system is unable to keep up with the incoming data? If you can't find any other evidence that something is wrong, then you may not actually need to do anything.
  • How many remote Windows servers are you monitoring?
  • How much data are you collecting from each server?

The Windows mechanisms - remote event logs, WMI - do not scale very well. I don't have the expertise to advise on this problem, but it is not unique to Splunk. You may already have the expertise to track this down; if not, you should be able to find resources to help you figure out what is happening with WMI, etc.

You might also want to look on the Splunk Wiki for Splunk-related tips: the "Troubleshooting Monitor Inputs" and "Performance Troubleshooting" pages might give you some fresh insights.


Champion

I'm not suggesting you aren't, but when you schedule a saved search, are you allowing a few minutes of indexing lag in each search's time window? For example, each one runs over -13m@m to -3m@m instead of -10m to now.
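Here is a rough Python sketch of why the lagged window helps. It is only a model, not how Splunk itself evaluates searches: it assumes a search runs every 10 minutes, matches events by their timestamp, and can only see events that were already indexed at run time. The helper names (`window`, `found_by`) and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def snap_to_minute(t):
    """Rough equivalent of the @m snap: truncate to the start of the minute."""
    return t.replace(second=0, microsecond=0)

def window(run_time, earliest_min, latest_min):
    """Half-open [start, end) window for earliest=-Nm@m latest=-Mm@m."""
    start = snap_to_minute(run_time - timedelta(minutes=earliest_min))
    end = snap_to_minute(run_time - timedelta(minutes=latest_min))
    return start, end

def found_by(event_time, index_time, run_time, earliest_min, latest_min):
    """True if the event falls in the window AND was indexed before the run."""
    start, end = window(run_time, earliest_min, latest_min)
    return start <= event_time < end and index_time <= run_time

# Hypothetical event: stamped 12:09:30, indexed two minutes later.
event_time = datetime(2024, 1, 1, 12, 9, 30)
index_time = event_time + timedelta(minutes=2)

# Two consecutive scheduled runs, ten minutes apart.
runs = [datetime(2024, 1, 1, 12, 10), datetime(2024, 1, 1, 12, 20)]

# "-10m to now": the first run is too early, the second looks too late.
naive_hits = sum(found_by(event_time, index_time, r, 10, 0) for r in runs)

# "-13m@m to -3m@m": the second run covers 12:07 to 12:17, after indexing.
lagged_hits = sum(found_by(event_time, index_time, r, 13, 3) for r in runs)

print("hits with -10m to now:", naive_hits)
print("hits with -13m@m to -3m@m:", lagged_hits)
```

In this model the 3-minute offset only covers indexing delays of up to 3 minutes, so the offset would need to be at least as large as the worst steady-state latency you actually observe.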

Communicator

Yes, the lag from the "backfile" conversion of existing logs is wildly annoying. In any case, I am looking purely at steady-state events.

This matters because I am running hundreds of saved-search alerts, and I am currently missing events in those searches because of the indexing delays. Real-time searches in that quantity would kill the system, so I am running scheduled searches with a time window.

There are 200+ remote servers, but not much data at all: less than 1 GB per day overall.
