Production Splunk stopped showing search results

prathapkcsc
Explorer

HI Team,

We have been facing an issue with Splunk for the last 6 hours. Our Splunk suddenly stopped showing results for all of the dashboards, and the Splunk node (a CentOS box) is also not responding properly. I am attaching the entire splunkd.log; it contains a lot of WARN messages. Please share any inputs on this, as it is a production cluster.
Quick help will be appreciated.

0 Karma

vr2312
Contributor

Hello @prathapkcsc

From the UF's point of view, the data flow to the HF/IDX is blocked, mainly because of full queues. You will need to look at why the queues are blocked at the IDX end: probably low disk space, an expensive search being run, or network connectivity.
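
A quick way to confirm this (just a sketch; run it where the indexers' _internal data is searchable, and adjust the time range) is to chart queue fill from Splunk's own metrics.log:

index=_internal sourcetype=splunkd source=*metrics.log group=queue
| eval fill_pct=round(current_size_kb/max_size_kb*100,2)
| timechart span=1m max(fill_pct) by name

Queues sitting near 100% (typically indexqueue or typingqueue on the indexers) point to where the pipeline is stuck.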

0 Karma

prathapkcsc
Explorer

Hello,
The indexer has 36 TB of free space, and the application has no network issues.

0 Karma

vr2312
Contributor

Could you post any error messages from the IDX? Are all the IDXs functioning well?

Are search results returning?

0 Karma

prathapkcsc
Explorer

Hi,
After stopping the antivirus scanning process, server performance improved. I am able to get some search results (partially) in Splunk, but splunkd.log is showing many errors:

12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:15 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.162 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Mon Dec 10 00:50:03 2018). Context: source::/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/stdout|host::sfiappnwh021.statefarm-dss.com|stdout-too_small|1431345
12-10-2018 00:50:19.676 -0500 WARN LineBreakingProcessor - Truncating line because limit of 10000 bytes has been exceeded with a line length >= 32773 - data_source="/data5/yarn/container-logs/application_1537825532257_22021/container_e42_1537825532257_22021_01_000002/syslog", data_host="sfiappnwh021.statefarm-dss.com", data_sourcetype="syslog"
12-10-2018 00:50:21.960 -0500 WARN HandleJobsDataProvider - Provenance argument was in an invalid format.
12-10-2018 00:50:21.998 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.
12-10-2018 00:50:24.156 -0500 WARN LineBreakingProcessor - Truncating line because limit of 10000 bytes has been exceeded with a line length >= 34066 - data_source="/data4/yarn/container-logs/application_1537825532257_21753/container_e42_1537825532257_21753_01_016922/syslog", data_host="sfiappnwh018.statefarm-dss.com", data_sourcetype="syslog"
12-10-2018 00:50:29.069 -0500 WARN HandleJobsDataProvider - Provenance argument was in an invalid format.
12-10-2018 00:50:29.106 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.
12-10-2018 00:50:30.096 -0500 WARN HandleJobsDataProvider - Provenance argument was in an invalid format.
12-10-2018 00:50:30.552 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.

0 Karma

vr2312
Contributor

These are basic errors that have been there for a long time; they are not the cause of the issue. As for this one:

12-10-2018 00:50:21.998 -0500 WARN AdminManager - Handler 'summarization' has not performed any capability checks for this operation (requestedAction=list, customAction="", item=""). This may be a bug.

Raise a ticket with Splunk for this.

It might also be due to some new service running. Do you run CrowdStrike or something similar? I remember one instance where a newly installed product took our indexers down.

Check top on the IDX and see what is consuming the resources.
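
In addition to top, if introspection data is enabled on that box (it is by default on full Splunk Enterprise instances), a rough equivalent from Splunk's side looks like this sketch (field names assume the standard splunk_resource_usage sourcetype; replace the host placeholder with your indexer):

index=_introspection sourcetype=splunk_resource_usage component=PerProcess host=<your_indexer>
| stats max(data.pct_cpu) AS pct_cpu max(data.mem_used) AS mem_used by data.process, data.args
| sort - pct_cpu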

0 Karma

prathapkcsc
Explorer

Can these be ignored?

12-10-2018 06:01:23.883 +0000 INFO TailReader - Ignoring file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' due to: binary

12-10-2018 06:01:23.883 +0000 WARN FileClassifierManager - The file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' is invalid. Reason: binary

12-10-2018 06:01:23.734 +0000 INFO TailReader - Ignoring file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' due to: binary

12-10-2018 06:01:23.734 +0000 WARN FileClassifierManager - The file '/var/log/flume-ng/flume-cmf-flume-AGENT-sfiappnwh040.statefarm-dss.com.log' is invalid. Reason: binary

12-10-2018 06:01:23.655 +0000 INFO WatchedFile - Will begin reading at offset=0 for file='/var/log/hue/metrics-hue_server/metrics.log'.

12-10-2018 06:01:23.655 +0000 INFO TailReader - Ignoring file '/var/log/hue/metrics-hue_server/tmpZnYoeF' due to: failed_stat

12-10-2018 06:01:23.655 +0000 WARN FileClassifierManager - The file '/var/log/hue/metrics-hue_server/tmpZnYoeF' is invalid. Reason: failed_stat

0 Karma

prathapkcsc
Explorer

We have not installed any new service. The environment was fine until yesterday morning, and we have not performed any activity recently.

0 Karma

vr2312
Contributor

You do not need to ignore those, but I do not find them to be the cause of the infrastructure issue you are facing.

http://docs.splunk.com/Documentation/Splunk/6.3.3/data/Configurecharactersetencoding#Comprehensive_l...

Check this for details on the failed_stat and binary errors.
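
If those flume files really are plain text that Splunk is misreading, a props.conf sketch on the forwarder monitoring them could look like the below (the stanza and settings are only an example; do not set NO_BINARY_CHECK if the files genuinely contain binary data):

# props.conf on the forwarder that monitors these files (example stanza)
[source::/var/log/flume-ng/flume-cmf-flume-*.log]
CHARSET = AUTO
NO_BINARY_CHECK = true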

These are things that need to be addressed, but you must pay attention to the bigger issue first.

0 Karma

prathapkcsc
Explorer

I am very thankful for your quick inputs. I am going to raise a case with the Splunk support team. Once the issue is fixed, I will post the resolution steps here.
Thank you!

0 Karma

amiftah_splunk
Splunk Employee

If you're using a heavy forwarder, go to Settings > Monitoring Console > Indexing > Indexing Performance. In the Snapshot panel, check the status of your indexing.
If you're not using a HF, follow the same steps on your indexer(s) to see what is blocking.
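
If the Monitoring Console is unavailable, an equivalent sketch over the internal metrics (assuming the indexers' _internal logs are searchable from where you run it) is:

index=_internal sourcetype=splunkd source=*metrics.log group=queue blocked=true
| stats count by host, name
| sort - count

Whichever host/queue combination shows up most often is where the pipeline is blocking.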

0 Karma

ddrillic
Ultra Champion

Where is this splunkd.log from? It seems to be from a forwarder.

We can see many messages such as -

12-09-2018 01:07:08.755 -0500 INFO  TailReader - Could not send data to output queue (parsingQueue), retrying...
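
To see how widespread those retries are, a small sketch (assuming the forwarders send their _internal logs onward, which is the default):

index=_internal sourcetype=splunkd component=TailReader "Could not send data to output queue"
| timechart span=5m count by host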
0 Karma

prathapkcsc
Explorer

The above log is from the Splunk master. Yesterday I restarted the Splunk master, but after an hour everything went bad again, as described above.

0 Karma

prathapkcsc
Explorer

I am also seeing the log entries below repeated continuously (look at the data_source values):
AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:01:52.337 -0500 WARN DateParserVerbose - A possible timestamp match (Tue Dec 24 06:51:04 2019) is outside of the acceptable time window. If this timestamp is correct, consider adjusting MAX_DAYS_AGO and MAX_DAYS_HENCE. Context: source::/var/log/hive/metrics-hivemetastore/metrics.log|host::sfiappnwh026.statefarm-dss.com|metrics|83303
12-09-2018 02:01:52.338 -0500 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:02:24.493 -0500 INFO TailReader - ...continuing.
12-09-2018 02:02:24.503 -0500 WARN DateParserVerbose - Failed to parse timestamp. Defaulting to timestamp of previous event (Sun Dec 9 01:45:01 2018). Context: source::/var/log/rabbitmq_queue_size__check.out|host::sfisvlnwh007.statefarm-dss.com|breakable_text|233395
12-09-2018 02:03:01.494 -0500 INFO TailReader - Could not send data to output queue (parsingQueue), retrying...
12-09-2018 02:03:27.099 -0500 WARN PeriodicReapingTimeout - Spent 85084ms updating search-related banner messages
12-09-2018 02:03:27.105 -0500 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (96007 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
12-09-2018 02:03:28.717 -0500 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:28.717 -0500 WARN AggregatorMiningProcessor - Changing breaking behavior for event stream because MAX_EVENTS (256) was exceeded without a single event break. Will set BREAK_ONLY_BEFORE_DATE to False, and unset any MUST_NOT_BREAK_BEFORE or MUST_NOT_BREAK_AFTER rules. Typically this will amount to treating this data as single-line only. - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:28.792 -0500 INFO TailReader - ...continuing.
12-09-2018 02:03:28.872 -0500 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:28.872 -0500 WARN AggregatorMiningProcessor - Changing breaking behavior for event stream because MAX_EVENTS (256) was exceeded without a single event break. Will set BREAK_ONLY_BEFORE_DATE to False, and unset any MUST_NOT_BREAK_BEFORE or MUST_NOT_BREAK_AFTER rules. Typically this will amount to treating this data as single-line only. - data_source="/var/log/hive/metrics-hivemetastore/metrics.log", data_host="sfiappnwh026.statefarm-dss.com", data_sourcetype="metrics"
12-09-2018 02:03:33.795 -0500 INFO TailReader - Could not send data to output queue (parsingQueue), retrying...

0 Karma

vr2312
Contributor

Okay, if these are from the cluster master, can you share the log files from one of the indexers? If the master is unable to forward its data to the IDX, we need to identify whether it is an IDX issue.

0 Karma

prathapkcsc
Explorer

Hi,
Below is the Splunk universal forwarder log. 10.61.1.81 is the master node; I am afraid 10.61.1.81 is an indexer too.

12-10-2018 04:16:20.474 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:17:20.475 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:17:57.809 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 200 seconds.
12-10-2018 04:18:20.478 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:19:20.482 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:19:37.826 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 300 seconds.
12-10-2018 04:20:20.485 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:20:27.834 +0000 WARN TcpOutputProc - Raw connection to ip=10.61.1.81:9997 timed out
12-10-2018 04:20:27.834 +0000 INFO TcpOutputProc - Ping connection to idx=10.61.1.81:9997 timed out. continuing connections
12-10-2018 04:21:17.842 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 400 seconds.
12-10-2018 04:21:20.489 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:22:20.492 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:22:57.859 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 500 seconds.
12-10-2018 04:23:20.495 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:24:20.498 +0000 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.61.1.78_8089_sfiappnwh027.statefarm-dss.com_sfiappnwh027.statefarm-dss.com_E56FB3B9-46F9-4430-8F91-AA8496ED0C2A
12-10-2018 04:24:37.876 +0000 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 600 seconds.

0 Karma

vr2312
Contributor

@prathapkcsc

From the logs I can deduce the following:

  1. The queues are blocked and full, which is preventing any data from being indexed or searched.
  2. I would check whether a resource-intensive search is running somewhere, impacting the R/W operations (see the search sketch below this list).
  3. You may do a quick restart of the box, but that would only interrupt the search for a limited time and then the issue would start again, if the search is a continuous one.
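
For point 2, a sketch of an audit-log search that surfaces the most expensive recent searches (audit logging is on by default; adjust the time range to cover when the problem started):

index=_audit action=search info=completed
| stats max(total_run_time) AS run_time_sec by user, search_id, search
| sort - run_time_sec
| head 20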
0 Karma

vr2312
Contributor

Also, let us know exactly where you took this log from; without that it is tough to pin down, and we can only give a broader perspective.

0 Karma