One of our eight indexers has had two queues, the parsing and aggregation queues, filled up for a couple of hours. What can be done besides waiting for them to clear? I believe we use the default queue sizes, which are relatively small...
index=_internal host=<indexer name> "ERROR" sourcetype=splunkd
doesn't show much besides communication errors.
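Since the question mentions the default queue sizes: queue sizes are set per queue in server.conf on the indexer. The following is a minimal sketch. The stanza names are the standard parsing and aggregation queue names, but the 10MB values are purely illustrative assumptions, not recommendations.

```ini
# server.conf on the affected indexer (sketch; sizes are illustrative only)
# A larger maxSize only adds buffering. It does not remove whatever is
# slowing the pipeline downstream of these queues.
[queue=parsingQueue]
maxSize = 10MB

[queue=aggQueue]
maxSize = 10MB
```

Changes to server.conf typically require a splunkd restart. If a queue stays full regardless of its size, the bottleneck is in the processing behind it, so bigger queues would only delay the blocking.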
@danielbb You can go to Monitoring Console -> Indexing -> Input -> Data Quality and look for parsing or aggregation issues (line breaking, timestamps, etc.). If a source type shows a large number of them, try fixing that source type's configuration. That should help reduce the load on your parsing and aggregation queues.
If that's the case, then why does it appear only on one indexer?
Probably that indexer is receiving, most of the time, more data from a particular host whose data has parsing or aggregation issues. If all the indexers and their configurations are identical, the queues should ideally behave the same on all of them. Do all the indexers have similar specifications?
All the indexers are the same. What's the query to find parsing issues?
If you go to Monitoring Console -> Indexing -> Input -> Data Quality, you can see a list of source types with the columns Sourcetype, Total Issues, Host Count, Source Count, Line Breaking Issues, Timestamp Parsing Issues, and Aggregation Issues.
Click the row with the highest Line Breaking Issues count to drill down into the detailed log entries. You can do the same with the Timestamp Parsing Issues count.
I checked, and there are no such issues for this indexer.
If the parsing and aggregation queues stay blocked for a long time, they will start blocking your splunktcpin and tcpin queues as well. If you are not using useACK on the forwarder, you may lose data over the network.
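A minimal sketch of enabling indexer acknowledgment on a forwarder. The output group name my_indexers and the server addresses are hypothetical placeholders. useACK itself is the outputs.conf setting referred to above.

```ini
# outputs.conf on the forwarder (group name and servers are placeholders)
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# With useACK enabled, the forwarder keeps sent data until the indexer
# acknowledges it and resends it if no acknowledgment arrives.
useACK = true
```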
Thank you. How can I clear these two queues?
In your case, it looks like the aggregation queue is full, and the resulting back pressure is filling the parsing queue as well. Have a look at https://wiki.splunk.com/Community:HowIndexingWorks: line merging and timestamp parsing happen in the aggregation queue. I suggest defining TIME_FORMAT for as many of your logs as you can, so that Splunk can parse timestamps quickly.
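A sketch of an explicit timestamp definition in props.conf. It assumes a hypothetical source type my:sourcetype whose events begin with a timestamp such as 2018-03-14 10:22:33.123 +0000. Adjust the stanza name and format string to your actual data.

```ini
# props.conf on the indexers (source type name and format are hypothetical)
[my:sourcetype]
# The timestamp starts at the beginning of each event
TIME_PREFIX = ^
# Matches e.g. "2018-03-14 10:22:33.123 +0000"; %3N is milliseconds
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N %z
```

With TIME_FORMAT set, Splunk does not have to try its list of known timestamp formats against each event.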
Also, can you let us know whether the typing queue and the indexing queue were full at the same time on that indexer?
OK, do you know how to detect parsing issues recorded in _internal?
Try the query below to find timestamp parsing issues.
index=_internal host=<your Indexer> source="/opt/splunk/var/log/splunk/splunkd.log" component=DateParserVerbose
Even if you don't have timestamp parsing issues, I suggest configuring TIME_FORMAT for the sources that ingest the most data, so that Splunk does not have to work out the time format of your logs.
I have enhanced the search to provide more details.
index=_internal host=<your Indexer> source="/opt/splunk/var/log/splunk/splunkd.log" component=DateParserVerbose
| rex "Context: source=(?P<file_name>[^|]+)\|host=(?P<source_host>[^|]+)"
| stats count by file_name, source_host
| sort - count
The issue was fixed by changing MAX_TIMESTAMP_LOOKAHEAD from 23 to 35, because the timestamp was 30 characters long.
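This sketch shows what that fix looks like in props.conf, assuming a hypothetical source type name. MAX_TIMESTAMP_LOOKAHEAD is the number of characters Splunk scans for a timestamp, counted from the start of the event or from the end of the TIME_PREFIX match if one is set. A 23-character window therefore truncated a 30-character timestamp.

```ini
# props.conf on the indexers (the stanza name is hypothetical)
[my:sourcetype]
# Timestamps in this data are 30 characters long, so 23 cut them off;
# 35 covers the full timestamp with some slack.
MAX_TIMESTAMP_LOOKAHEAD = 35
```

The new value applies only to data indexed after the change. Events that were already indexed keep the timestamps assigned to them.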