Monitoring Splunk

Timed out while waiting for splunkd daemon to respond : Corrupted Index

asarolkar
Builder

All,

We recently upgraded from Splunk 4.3.4 to 5.0.2 in an environment with one search head and multiple indexers.

We have multiple forwarders on different OS that push data into the search head.

We also upgraded the *Nix app from 4.5 to 4.6 (and reverted it because our search head is on Windows and there were some issues with version 4.6).

We also installed something called Splunk for WinSSHD.

I get the following error when I go to Manager -> Indexes:

Timed out while waiting for splunkd daemon to respond. Splunkd is hung

I have a hunch that installing one of those apps caused one of our indexes to get corrupted.



Or maybe this is a not-yet-known issue with v5.0.2.

Does anybody know how to proceed with investigating the indexes?
Familiarity with the root cause would be helpful here.

0 Karma

hexx
Splunk Employee

This issue will be fixed in our next maintenance release - version 5.0.3. Customers with a support contract in good standing can have access to the fix right away in the form of a patch (5.0.2.2). Just open a case with Splunk Support reporting these errors and mention bug SPL-61718 / patch 5.0.2.2.

jguarini
Path Finder

I am also having this issue. The only difference between me and the original poster and responders is that I have not added any indexes: this is a fresh install, and I still get the timeout message. I can add a new index via the CLI, though it does seem to take a while.

Splunk 5.0.2 running on Windows 7 64-bit
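For comparison, adding an index from the CLI looks roughly like this. The `SPLUNK_HOME` path and the index name are assumptions for illustration, and the command is shown via `echo` as a dry run; drop the `echo` to actually run it against a real install:

```shell
# Dry-run sketch of adding an index via the Splunk CLI.
# SPLUNK_HOME and "test_index" are illustrative assumptions.
SPLUNK_HOME="/opt/splunk"
echo "$SPLUNK_HOME/bin/splunk" add index test_index
```

On a real system the command may prompt for credentials, and the new index appears under Manager -> Indexes once splunkd responds.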

0 Karma

nazdrynau
Explorer

I have exactly the same issue. I have a couple of big indexes; the whole index size is about 230 GB.
I am getting the same error after upgrading to 5.0.2.
I will open a ticket with Splunk.

0 Karma

asarolkar
Builder

We did check the logs that MuS suggested.

Looking for "errors"/"crash logs" has not turned up anything useful so far.

0 Karma

asarolkar
Builder

We have already installed Splunk on Splunk and there are no obvious errors in splunkd logs as reported under "Errors"/"Crash logs".

Splunkd is timing out and there are no errors linking this to bucket issues or indexing spikes (we are under the daily licensing limit). We ran fsck just in case and no issues were found with our cold/thawed/all buckets for every index.

How do we fix the timeout issue if there is an issue with timeouts between Splunk and splunkd?
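The fsck pass described above can be sketched as below. Exact fsck flags vary by Splunk version (check `splunk fsck --help` on your install), so the flags here are assumptions, shown via `echo` as a dry run:

```shell
# Dry-run sketch of an fsck scan across all buckets and indexes.
# SPLUNK_HOME and the flag syntax are assumptions; verify against
# "splunk fsck --help" for your version before running for real.
SPLUNK_HOME="/opt/splunk"
echo "$SPLUNK_HOME/bin/splunk" fsck scan --all-buckets-all-indexes
```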

0 Karma

Drainy
Champion

What makes you believe that the indexes are corrupted? The timeout between Splunkweb and splunkd is hardcoded and, as I recall, is around 30 seconds. This means that if the system is under load, splunkd may not reply in time. Quite possibly it is still doing a bucket conversion, or there is a spike in indexing. Download the Splunk on Splunk app and try to install it (if you can). Checking the logs as MuS suggested is also a good start.

Drainy
Champion

I meant them both separately. You could use the supplied script to gather the data, but I meant just looking at the system load directly. In SoS there is a dashboard called "Indexing Performance" that can help indicate whether the system is under load. By default the middle chart displays four of the main internal queues Splunk uses; the lower one, called indexQueue, is the point where Splunk writes to disk. If these queues are starting to block, something is impacting Splunk somewhere. The timeout is hardcoded and cannot be altered.
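For reference, the queue metrics behind that dashboard are also logged in plain text to metrics.log under $SPLUNK_HOME/var/log/splunk. A rough sketch of spotting nearly-full queues from those lines follows; the sample lines in the heredoc and the 80% threshold are illustrative assumptions, not output from the poster's system:

```shell
# Sketch: flag Splunk queues that are nearly full, based on
# "group=queue" lines in metrics.log. The heredoc below stands in
# for $SPLUNK_HOME/var/log/splunk/metrics.log with made-up lines.
awk -F'[ ,=]+' '
  /group=queue/ {
    name = ""; cur = 0; max = 0
    for (i = 1; i <= NF; i++) {
      if ($i == "name")         name = $(i + 1)
      if ($i == "current_size") cur  = $(i + 1)
      if ($i == "max_size")     max  = $(i + 1)
    }
    # Report queues more than 80% full (an arbitrary threshold).
    if (max > 0 && cur / max > 0.8)
      printf "queue %s is %d%% full\n", name, 100 * cur / max
  }
' <<'EOF'
03-28-2013 10:00:01.000 INFO Metrics - group=queue, name=indexqueue, max_size=500, current_size=480
03-28-2013 10:00:01.000 INFO Metrics - group=queue, name=parsingqueue, max_size=1000, current_size=20
EOF
```

A persistently full indexqueue points at disk I/O; full upstream queues (parsing, aggregation, typing) point at processing load.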

0 Karma

asarolkar
Builder

Can you clarify what you mean by "queues"?

I assume you mean IO/CPU/memory and scheduler activity (all of which are provided by S.o.S).

~~~

Our search head is on a Windows server. Out of the box, S.o.S. on a Windows server does not capture those metrics.

We still have to enable the script that captures this information.

0 Karma

Drainy
Champion

Have you checked the system load? Looking at IO, CPU and MEM (although mainly CPU). Also in SOS are any of the queues filling up?

0 Karma


MuS
SplunkTrust

Hi asarolkar, as always a good start is to check splunkd.log in var/log/splunk under your %SPLUNK_HOME% install directory.
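A quick way to act on this advice is to grep that file for ERROR/WARN lines. The heredoc below stands in for a real splunkd.log, with made-up sample lines; point the grep at $SPLUNK_HOME/var/log/splunk/splunkd.log on an actual system:

```shell
# Sketch: surface ERROR and WARN lines from splunkd.log.
# The heredoc holds illustrative sample lines; on a real install,
# grep $SPLUNK_HOME/var/log/splunk/splunkd.log instead.
grep -E ' (ERROR|WARN) ' <<'EOF'
03-28-2013 10:00:01.000 INFO  IndexProcessor - shutting down
03-28-2013 10:00:02.000 ERROR databasePartitionPolicy - unable to open bucket
03-28-2013 10:00:03.000 WARN  TcpOutputProc - queue full
EOF
```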

0 Karma