I am monitoring logs across the LAN within the same datacenter. I have a single combined indexer/Splunk server running Windows 2008, with 4 cores and 6 GB of memory.
The files being monitored are all remote, on other Windows servers. Installing forwarders on each of the hundreds of servers is not an option.
The server is not breathing hard at all. Average CPU is about 25%, with a high of 40%. Almost 3 GB of memory is free.
Does not seem to be a server issue.
I run this search for the last 24 hours to look at latency.
index=* | eval latency = (_time - _indextime) * -1 | eval indextime=strftime(_indextime,"%+") | search latency > 0 | stats avg(latency) max(latency) min(latency) by host
The results are all over the place even for servers within the same datacenter. The servers that I am monitoring are not terribly busy.
I find it hard to believe that the fiber LAN within the datacenter is the issue.
How do I diagnose this?
First, you are calculating latency as the delta between the timestamp of an event and the time the event was indexed. This is a valid measure only if you are just indexing current data.
Whenever you create a new input, Splunk has to "catch up" by indexing any existing data in the file. For example, if you decide to index "mylog.log" and it has data for the past 3 months in it - the first events will show a 3-month lag between their timestamp and the index time. This is going to wildly skew your reporting.
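As a rough sketch of how to keep backfilled events from dominating the numbers, you could report the median and 95th percentile alongside the maximum, and split by source so that a single catching-up file stands out. The 24-hour window and the output field names (median_s, p95_s, max_s) are just illustrative choices, not something from your environment:

```
index=* earliest=-24h
| eval latency = _indextime - _time
| where latency > 0
| stats count median(latency) AS median_s perc95(latency) AS p95_s max(latency) AS max_s by host, source
| sort - p95_s
```

If a host shows a small median but a huge max, that usually points to a handful of old events rather than a steady indexing delay.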
But, let's assume that your environment is "steady state" - no new inputs or new Windows servers are being monitored - and you are still seeing the latency. My first questions are
The Windows mechanisms - remote event logs, WMI - do not scale very well. I don't have the expertise to advise on this problem, but it is not unique to Splunk. You may already have the expertise to track this down; if not, you should be able to find resources to help you figure out what is happening with WMI, etc.
Yes, the lag from the "backfile" conversion of logs is wildly annoying. In any case, I am looking purely at steady-state events.
This matters because I am running hundreds of saved search alerts. I am currently missing events in those searches because of the delays in indexing. Real-time searches in that quantity would kill the system, so I am doing scheduled searches with a time window.
200+ remote servers. Not much data at all. Overall doing less than 1 GB per day.
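One possible way to keep scheduled searches from missing late-arriving events, offered only as a sketch: select events by when they were indexed rather than by their timestamp, using the _index_earliest and _index_latest time modifiers. The 5-minute schedule, the 24-hour event-time range, and the "ERROR" search term below are hypothetical placeholders; the event-time range still has to be wide enough to cover the worst lag you expect.

```
index=* "ERROR" earliest=-24h latest=now _index_earliest=-5m@m _index_latest=@m
| eval latency = _indextime - _time
| stats count max(latency) AS max_latency_s by host
```

Run every 5 minutes, each execution would pick up whatever was indexed in the previous 5-minute slot, regardless of how old the event timestamps are.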