I have 11 indexing servers, all with 16 CPUs, RAID 10 storage, 1 Gb full-duplex networking, and no swap usage, and they all sit at about 60-80% idle. Something is not right: I am seeing indexing lag of up to 5 hours on most of the servers. Are there any tuning parameters that I need to check within Splunk to get better throughput on the indexer side?
Some suggestions:
1 - Install the Splunk on Splunk app on your search-head. Take a look at the "Indexing Performance" view. Are all queues blocked down to the indexer queue or is there blockage upstream of that?
2 - In the SoS app "Indexing Performance" view, do you see latency across the board in the "Measured indexing latency" table at the top of the page, or is it only affecting a subset of hosts/sourcetypes/sources/splunk_server?
3 - In the SoS app "Errors" view, do you see any reports of indexing throttling because some bucket may contain too many tsidx files?
4 - We might want to check the size of your metadata files, particularly your Sources.data. Run the following command against your $SPLUNK_DB and report the output:
find $SPLUNK_DB -maxdepth 3 -name "*.data" -size +25M | xargs ls -lh
You can get the value of $SPLUNK_DB from $SPLUNK_HOME/etc/splunk-launch.conf. By default, $SPLUNK_DB is set to $SPLUNK_HOME/var/lib/splunk.
If this command finds any metadata files larger than 25MB, that could be one of the reasons for your indexing performance degradation.
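If you'd rather do this check from a script than the shell one-liner, the same scan can be sketched in Python (the function name and the 25 MB threshold here are illustrative, not anything Splunk-specific):

```python
import os

def find_large_data_files(root, min_bytes=25 * 1024 * 1024, max_depth=3):
    """Walk `root` to at most `max_depth` levels and yield (path, size)
    for *.data metadata files of at least `min_bytes` -- the Python
    equivalent of `find $SPLUNK_DB -maxdepth 3 -name "*.data" -size +25M`."""
    root = root.rstrip(os.sep)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath.count(os.sep) - base_depth >= max_depth:
            dirnames[:] = []  # don't descend any deeper
            continue
        for name in filenames:
            if name.endswith(".data"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size >= min_bytes:
                    yield path, size
```

Pointing it at your $SPLUNK_DB path would list the same oversized metadata files as the find command.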
Rick, this could be due to a few things.
First - check the index time to confirm whether Splunk is seeing and indexing the data later than expected. The search below will find the delay in seconds:
source=<your delayed source> | eval delay=_indextime-_time | fields delay
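The arithmetic behind that eval is simply index time minus event time. As a minimal sketch, here it is in Python over events carrying epoch-second `_time` and `_indextime` fields (the dict-based helper is illustrative, not a Splunk API):

```python
def indexing_delay(events):
    """Given events as dicts with epoch-second '_time' (event timestamp)
    and '_indextime' (when the indexer wrote the event), return each
    event's indexing delay in seconds -- the same math as the eval above."""
    return [e["_indextime"] - e["_time"] for e in events]

events = [
    {"_time": 1_000_000, "_indextime": 1_018_000},  # indexed 5 hours late
    {"_time": 1_000_100, "_indextime": 1_000_105},  # indexed 5 seconds late
]
print(indexing_delay(events))  # → [18000, 5]
```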
Typically, delays of hours mean that the indexer is backed up OR the data is being read in at a slower pace than expected. To see if the indexer is backed up (we call it blocked), search as follows:
index=_internal source=*metrics.log blocked
If this returns events, the system is being blocked. If there are a lot of them, that is not a good sign and you should contact support. Support can determine whether it's disk speed or something else by identifying which part of the queue system is backed up.
Another thing to check is the maximum throughput for the indexers and forwarders. There is a maximum-throughput setting within limits.conf that is set to 256 KB per second on lightweight forwarders:
[thruput]
maxKBps = 256
The throughput limit is applied on indexers and forwarders alike, meaning any Splunk instance.
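For example, to raise the cap on a given instance you could put something like this in a local limits.conf (the 1024 value is only an illustration; 0 removes the limit entirely):

```ini
# $SPLUNK_HOME/etc/system/local/limits.conf
[thruput]
# Maximum KB per second this Splunk instance will process.
# 0 means unlimited; the lightweight-forwarder default is 256.
maxKBps = 1024
```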
The distribution of the blocked events and which queue they are from will tell us where to look next. If you are constantly adding new data sets and they are very large, then I suspect you need to tune some of the new inputs so they are parsed faster. It is also possible that your disks just can't keep up with your throughput (particularly if each indexer is running well over 3 MB/sec of indexing throughput).
You could run a search like this:
index=_internal host=* source=*metrics.log group=queue blocked=true | rename host as Indexer | chart count(blocked) as "Queue Blocks" by Indexer, name
to create a chart of the count of blocks by indexer and queue.
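The same per-indexer, per-queue tally can be sketched outside SPL with a Counter over parsed metrics events (the event dicts below stand in for what the search returns; field names mirror the search's host/name/blocked fields):

```python
from collections import Counter

def blocked_counts(events):
    """Count blocked-queue events by (indexer host, queue name),
    mirroring `chart count(blocked) by Indexer, name` above."""
    return Counter(
        (e["host"], e["name"])
        for e in events
        if e.get("blocked") == "true"
    )

events = [
    {"host": "idx1", "name": "indexqueue",   "blocked": "true"},
    {"host": "idx1", "name": "indexqueue",   "blocked": "true"},
    {"host": "idx2", "name": "parsingqueue", "blocked": "true"},
    {"host": "idx2", "name": "indexqueue",   "blocked": "false"},
]
print(blocked_counts(events))
```

A high count concentrated in one queue (e.g. the index queue) points at that stage as the bottleneck.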
So what I have now is 800 events for index=_internal source=*metrics.log blocked - is there a setting for index throughput? The problem is we're adding new sources daily, and we need to figure this out before the lag starts affecting more real-time searches. Calling support now!