According to my Deployment Monitor app, one of my indexers shows as backed up. I need help figuring out whether it is caused by a slow disk or by some complex regex.
I am providing the following logs as evidence of the issue.
indexer's splunkd.log
11-16-2010 17:23:14.625 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
11-16-2010 17:23:39.674 INFO TailingProcessor - ...continuing.
11-16-2010 17:23:43.608 WARN DateParserVerbose - Accepted time (Fri Oct 22 05:29:30 2010) is suspiciously far away from the previous event's time (Tue Nov 16 17:11:52 2010), but still accepted because it was extracted by the same pattern. Context="source::/var/genesys/GVHC/GVHC_Stat_Server2.log.20101116_163632_193.log|host::tuk1cc-g2|genesys_statserver_log|remoteport::37189"
1041 similar messages suppressed. First occurred at: Tue Nov 16 17:18:38 2010
11-16-2010 17:23:43.608 WARN DateParserVerbose - Failed to parse timestamp for event. Context="source::/var/genesys/QwestHSI/QwestHSI_UR_Server2.log.20101116_045930_983.log|host::cer1cc-g2|genesys_urserver_log|remoteport::47624" Text=" AttributeCustomerID 'QwestHSI'..."
25777 similar messages suppressed. First occurred at: Tue Nov 16 17:18:38 2010
11-16-2010 17:23:43.608 WARN DateParserVerbose - Failed to parse timestamp for event. Context="source::/var/genesys/QwestHSI/QwestHSI_UR_Server2.log.20101116_045930_983.log|host::cer1cc-g2|genesys_urserver_log|remoteport::47624" Text=" AttributeANI '606837491'..."
11-16-2010 17:23:43.608 WARN DateParserVerbose - Failed to parse timestamp for event. Context="source::/var/genesys/QwestHSI/QwestHSI_UR_Server2.log.20101116_045930_983.log|host::cer1cc-g2|genesys_urserver_log|remoteport::47624" Text=" AttributeDNIS '8665313546'..."
indexer's metrics.log
11-16-2010 17:26:41.086 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size=1000, filled_count=11, empty_count=7307, current_size=1000, largest_size=1000, smallest_size=1
11-16-2010 17:26:41.086 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size=1000, filled_count=4, empty_count=15508, current_size=1000, largest_size=1000, smallest_size=1
11-16-2010 17:27:35.067 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size=1000, filled_count=1, empty_count=0, current_size=1000, largest_size=1000, smallest_size=817
11-16-2010 17:27:35.067 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
11-16-2010 17:27:35.067 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
forwarder's metrics.log
11-16-2010 17:27:16.324 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size=1000, filled_count=4, empty_count=1132, current_size=1000, largest_size=1000, smallest_size=1
11-16-2010 17:27:16.324 INFO Metrics - group=tcpout_connections, apa-splunk, blocked=true, current_entries_count=1000, queue_size=1000
11-16-2010 17:28:18.903 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
11-16-2010 17:28:18.903 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
11-16-2010 17:28:18.904 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
11-16-2010 17:28:18.904 INFO Metrics - group=queue, name=tcpout_apa-splunk, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
11-16-2010 17:28:18.904 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
11-16-2010 17:28:18.904 INFO Metrics - group=tcpout_connections, apa-splunk, blocked=true, current_entries_count=1000, queue_size=1000
Here is a snapshot of the Deployment Monitor status
http://picpaste.com/pics/indexer-ObvwPnn9.1289928881.png
Here is a little more detail about the queues on the indexer
A queue is blocked if the queues downstream from it are blocked. The furthest-downstream queue you have that is showing blocked is the indexqueue on the indexer. Below that are only two things: the thruput throttle (set in limits.conf) and the OS/disk itself. (These two items are not really queues and are not recorded in metrics.log.) The thruput throttle might have been set accidentally in limits.conf ([thruput] maxKBps), though this is unlikely. If that's not the problem, then that points to slowness in writing out to disk.
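If you want to rule out the throttle, check whether a maxKBps value has been set; 0 means no throttle. A minimal sketch of what to look for, assuming the default $SPLUNK_HOME layout (the stanza below is only an illustration, not your actual config):

# $SPLUNK_HOME/etc/system/local/limits.conf
[thruput]
# 0 = unlimited; any positive value caps indexing throughput in KB per second
maxKBps = 0

You can also ask btool for the effective value:

$SPLUNK_HOME/bin/splunk cmd btool limits list thruput --debug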
I see no outputs.conf file in etc/system/local.
Yes, very late. With auto load balancing, 97% of the data gets indexed by the other indexer and 3% by this one. They (the two indexers) are both on the same subnet. Let me check the outputs.conf.
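In case an outputs.conf is lurking somewhere other than etc/system/local (for example, inside an app), a quick way to see the merged configuration and where each setting comes from (assuming the default $SPLUNK_HOME):

$SPLUNK_HOME/bin/splunk cmd btool outputs list --debug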
Also, is it permanently stuck, or just slow? That is, do events eventually make it through and get indexed, just late?
Actually, there's one other possibility, and that would be if your indexer was configured to index and forward as well. This would be another accidental config: specifying an output group in outputs.conf.
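For reference, a hypothetical sketch of what such an accidental index-and-forward config on the indexer might look like; the group name and destination host below are made up, not taken from your setup:

# $SPLUNK_HOME/etc/system/local/outputs.conf on the indexer
[tcpout]
defaultGroup = some_group            # hypothetical group name

[tcpout:some_group]
server = other-indexer.example.com:9997   # hypothetical destination

If a stanza like this exists on the indexer and the destination is slow or unreachable, its output queue can back up and block everything upstream.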
I see maxKBps = 0 (no throttle) for the indexers, so I guess that boils it down to a slow disk on the slow indexer.
An indexer is "backed up" if its parsingQueue is over 50% full most of the time. It seems like this is the case based on your queue stats (parsingQueue size seems to be >500 and often 1000).
It's very likely that one of your regexes to parse events is too complex/inefficient.
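To see how full each queue on that indexer actually is over time, rather than eyeballing individual metrics.log lines, a rough search against the internal index (the host value is a placeholder for your slow indexer; current_size and max_size are the standard metrics.log fields):

index=_internal source=*metrics.log* host=<your_indexer> group=queue
| eval fill_pct = round(current_size / max_size * 100, 1)
| timechart span=1m perc90(fill_pct) by name

If parsingqueue and aggqueue stay pegged near 100% while indexqueue drains, that points at parsing/regex; if indexqueue is the one pegged, that points at disk.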
Well, actually, since I have two indexers built the same way, the regex should not be the issue. It might just be slow disks on one of the servers.
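If it really is the disk, the OS on the slow indexer should show it. A quick sanity check, assuming a Linux host with the sysstat package installed, run while the queues are reported as blocked:

# extended per-device statistics every 5 seconds; sustained %util near 100
# and high await times suggest the disk, not the Splunk config, is the bottleneck
iostat -x 5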