Production Splunk stopped showing search results

Hi Team,
We have been facing an issue with Splunk for the last 6 hours. All of our dashboards suddenly stopped showing results, and the Splunk node (a CentOS box) is also not responding properly. I am attaching the entire splunkd.log.
I am seeing a lot of WARN messages in splunkd.log. Any inputs would be appreciated, as this is a production cluster.
Quick help will be appreciated.

Hello @prathapkcsc
From the UF side, data flowing to the HF/IDX is blocked, mainly because of full queues. You need to find out why the queues are blocked at the IDX end: likely causes are low disk space, an expensive search being run, or network connectivity issues.
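A quick way to confirm blocked queues is to search the internal metrics on the indexers; a minimal sketch (adjust the time range to the window of the outage):
index=_internal source=*metrics.log sourcetype=splunkd group=queue blocked=true
| stats count by host, name
| sort - count
If parsingqueue, aggqueue, typingqueue or indexqueue show up with high counts, work backwards from indexqueue, since a blocked indexqueue usually points to slow writes on the indexer itself.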

Hello,
The indexer has 36 TB of free space, and the application as a whole has no network issues.

Could you post any error messages from the IDX? Are all IDXs functioning well?
Are search results returning?

Hi,
After stopping the antivirus scanning process, server performance improved. I am now getting some search results (partially) in Splunk, but splunkd.log is still showing many errors:
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:03 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.676 -0500 WARN LineBreakingProcessor - Truncating line because limit of 10000 bytes has been exceeded with a line length >= 32773 - data_source="/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/syslog", data_host="sfiappnwh021.statefarm-dss.com", data_sourcetype="syslog"
12-10-2018 00:50:21.960 -0500 WARN HandleJobsDataProvider - Provenance argument was in an invalid format.
12-10-2018 00:50:21.998 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.
12-10-2018 00:50:24.156 -0500 WARN LineBreakingProcessor - Truncating line because limit of 10000 bytes has been exceeded with a line length >= 34066 - data_source="/data4/yarn/container-logs/application_1537825532257_21753/container_e42_1537825532257_21753_01_016922/syslog", data_host="sfiappnwh018.statefarm-dss.com", data_sourcetype="syslog"
12-10-2018 00:50:29.069 -0500 WARN HandleJobsDataProvider - Provenance argument was in an invalid format.
12-10-2018 00:50:29.106 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.
12-10-2018 00:50:30.096 -0500 WARN HandleJobsDataProvider - Provenance argument was in an invalid format.
12-10-2018 00:50:30.552 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.

These are basic warnings that have been there for a long time; they are not the cause of the issue. Also, regarding this one:
12-10-2018 00:50:21.998 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.
Raise a ticket with Splunk Support for it.
The problem might also be caused by some new service running. Do you run CrowdStrike or something similar? I remember a case in one of my environments where a newly installed product took our indexers down.
Run top on the IDX and see what is consuming resources.
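For example, from a shell on the indexer (standard CentOS commands, nothing Splunk-specific assumed):
# One-shot snapshot of load and the busiest processes
top -b -n 1 | head -n 20
# Processes sorted by CPU and by memory usage
ps aux --sort=-%cpu | head -n 10
ps aux --sort=-%mem | head -n 10
# Per-device I/O and iowait, if the sysstat package is installed
iostat -x 1 3
If an antivirus or backup agent sits at the top, that would fit what you saw when stopping the AV scan.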

Can these be ignored?
12-10-2018 06:01:23.883 +0000 INFO TailReader - Ignoring file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' due to: binary
12-10-2018 06:01:23.883 +0000 WARN FileClassifierManager - The file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' is invalid. Reason: binary
12-10-2018 06:01:23.734 +0000 INFO TailReader - Ignoring file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' due to: binary
12-10-2018 06:01:23.734 +0000 WARN FileClassifierManager - The file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' is invalid. Reason: binary
12-10-2018 06:01:23.655 +0000 INFO WatchedFile - Will begin reading at offset=0 for file='/var/log/hue/metrics-hue_server/metrics.log'.
12-10-2018 06:01:23.655 +0000 INFO TailReader - Ignoring file '/var/log/hue/metrics-hue_server/tmpZnYoeF' due to: failed_stat
12-10-2018 06:01:23.655 +0000 WARN FileClassifierManager - The file '/var/log/hue/metrics-hue_server/tmpZnYoeF' is invalid. Reason: failed_stat

We have not installed any new service. The environment was fine until yesterday morning, and we have not performed any changes recently.

You do not need to ignore them, but I do not think they are the cause of the infrastructure issue you are facing.
Have a look at why those files are flagged as failed_stat and binary; they are usually handled with an inputs.conf blacklist or a props.conf override (see the sketch below).
These are things that should be addressed, but you must pay attention to the bigger issue.
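As an illustration only, using the paths from your own log messages (the monitor stanza and blacklist pattern are hypothetical; adapt them to your real inputs):
# inputs.conf on the forwarder: skip the short-lived temp files that cause failed_stat
[monitor:///var/log/hue/metrics-hue_server]
blacklist = tmp[^/]+$
# props.conf: only if you are certain the flume logs flagged as binary are really text
[source::/var/log/flume-ng/*.log]
NO_BINARY_CHECK = true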

I really appreciate the quick inputs. I am going to raise a case with the Splunk support team, and once the issue is fixed I will post the resolution steps here.
Thank you!

If you're using a heavy forwarder, go to Settings > Monitoring Console > Indexing > Indexing Performance. In the Snapshot panel, check the status of your indexing.
If you're not using an HF, follow the same steps on your indexer(s) to see which queue is blocking.
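If the Monitoring Console itself is too slow to load, you can run the underlying check directly; a rough sketch using the queue metrics from metrics.log:
index=_internal source=*metrics.log group=queue
| eval fill_pct=round(current_size_kb/max_size_kb*100, 2)
| timechart span=5m perc90(fill_pct) by name
Queues that sit at or near 100% are the ones to chase.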
Where is this splunkd.log from? It seems to be from a forwarder.
We can see many messages such as:
12-09-2018 01:07:08.755 -0500 INFO TailReader - Could not send data to output queue (parsingQueue), retrying...
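If the internal logs are still searchable, something like this shows which hosts are hitting the blocked parsingQueue (a sketch; narrow the time range to the incident window):
index=_internal sourcetype=splunkd "Could not send data to output queue"
| stats count by host
| sort - count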

The above log is from the Splunk master. Yesterday I restarted the Splunk master, but after about an hour everything became bad again, as I said above.

I am also seeing the log entries below repeated continuously (look at the data_source field):
AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:01:52.337 -0500 WARN DateParserVerbose - A possible timestamp match (Tue Dec 24 06:51:04 2019) is outside of the acceptable time window. If this timestamp is correct, consider adjusting MAX_DAYS_AGO and MAX_DAYS_HENCE. Context: source::/var/log/hive/metrics-hivemetastore/metrics.log|host::sfiappnwh026.statefarm-dss.com|metrics|83303
12-09-2018 02:01:52.338 -0500 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:02:24.493 -0500 INFO TailReader - ...continuing.
12-09-2018 02:02:24.503 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Sun Dec 9 01:45:01 2018). Context: source::/var/log/rabbitmq_queue_size__check.out|host::sfisvlnwh007.statefarm-dss.com|breakable_text|233395
12-09-2018 02:03:01.494 -0500 INFO TailReader - Could not send data to output queue (parsingQueue), retrying...
12-09-2018 02:03:27.099 -0500 WARN PeriodicReapingTimeout - Spent 85084ms updating search-related banner messages
12-09-2018 02:03:27.105 -0500 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (96007 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
12-09-2018 02:03:28.717 -0500 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:28.717 -0500 WARN AggregatorMiningProcessor - Changing breaking behavior for event stream because MAX_EVENTS (256) was exceeded without a single event break. Will set BREAK_ONLY_BEFORE_DATE to False, and unset any MUST_NOT_BREAK_BEFORE or MUST_NOT_BREAK_AFTER rules. Typically this will amount to treating this data as single-line only. - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:28.792 -0500 INFO TailReader - ...continuing.
12-09-2018 02:03:28.872 -0500 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:28.872 -0500 WARN AggregatorMiningProcessor - Changing breaking behavior for event stream because MAX_EVENTS (256) was exceeded without a single event break. Will set BREAK_ONLY_BEFORE_DATE to False, and unset any MUST_NOT_BREAK_BEFORE or MUST_NOT_BREAK_AFTER rules. Typically this will amount to treating this data as single-line only. - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:33.795 -0500 INFO TailReader - Could not send data to output queue (parsingQueue), retrying...

Okay. If these are from the cluster master, can you share the log files of one of the indexers? If the master is unable to forward its data to the IDX, we need to identify whether it is an IDX issue.

Hi,
Below is the Splunk universal forwarder log. 10.61.1.81 is the master node; I am afraid 10.61.1.81 is an indexer too.
12-10-2018 04:16:20.474 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:17:20.475 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:17:57.809 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 200 seconds.
12-10-2018 04:18:20.478 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:19:20.482 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:19:37.826 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 300 seconds.
12-10-2018 04:20:20.485 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:20:27.834 +0000 WARN TcpOutputProc - Raw connection to ip=10.61.1.81:9997 timed out
12-10-2018 04:20:27.834 +0000 INFO TcpOutputProc - Ping connection to idx=10.61.1.81:9997 timed out. continuing connections
12-10-2018 04:21:17.842 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 400 seconds.
12-10-2018 04:21:20.489 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:22:20.492 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:22:57.859 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 500 seconds.
12-10-2018 04:23:20.495 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:24:20.498 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:24:37.876 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 600 seconds.

@prathapkcsc
From the logs I can deduce the following:
- The queues are blocked and full, which is why no data can be indexed or searched.
- I would check whether a resource-intensive search is running somewhere and impacting the R/W operations (see the search sketch below).
- You could do a quick restart of the box, but if the search is a continuous one, that would only interrupt it for a limited time before the issue starts again.
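To spot long-running or expensive searches, something along these lines against the audit index may help (a sketch; total_run_time is in seconds):
index=_audit sourcetype=audittrail action=search info=completed
| sort - total_run_time
| table _time, user, total_run_time, search
| head 20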

Also, let us know exactly where you pulled this log from; without that it is tough to be precise, and we can only give a broader perspective.
