<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Why did my indexers have a large spike in io? in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295100#M56064</link>
    <description>&lt;P&gt;I am almost positive that we are on dedicated LUNs for our Splunk servers, but I will certainly validate.  Also, the screenshot above is my production environment, which was not part of the outage that I mentioned in my first post.  Sorry for the confusion.&lt;/P&gt;</description>
    <pubDate>Mon, 13 Feb 2017 14:23:31 GMT</pubDate>
    <dc:creator>paimonsoror</dc:creator>
    <dc:date>2017-02-13T14:23:31Z</dc:date>
    <item>
      <title>Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295094#M56058</link>
      <description>&lt;P&gt;Hi Folks;&lt;/P&gt;

&lt;P&gt;Wondering if someone could help me out here. I just had a big issue with Splunk: 3 of my indexers crashed for a bit (replication factor of 3). On server 1, the service crashed with a bucket replication error (I fixed this); on server 2, the service crashed and was simply restarted; server 3 hung completely and required a reboot.&lt;/P&gt;

&lt;P&gt;After taking a quick peek, all of the stats look 'normal', including CPU, physical, and storage. However, one thing jumped out at me: the iostats:&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2481i5E04C601DF8383B4/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Any particular reason this would start to happen? I just checked my forwarders and I don't see anything out of the ordinary, such as a large ramp in data ingestion.&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2482i4A2EE8DE8D145CB4/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;I am working with my Linux team to restore one of my servers, and they are stating that there was a "kernel-level CPU soft lockup".&lt;/P&gt;

&lt;P&gt;Any advice would be helpful in triaging this!&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 23:11:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295094#M56058</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-10T23:11:43Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295095#M56059</link>
      <description>&lt;P&gt;Really interesting. We recently had a similar situation.&lt;/P&gt;

&lt;P&gt;This query can help identify stress on the indexers. If you run it for the past week, it would be interesting to see the results:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=_internal group=queue blocked name=indexqueue | timechart count by host
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;In our case, the indexers' queues filled up and port 9997 on some of them was closed for a couple of days. Only bouncing the indexers reopened port 9997. You can run &lt;CODE&gt;netstat -plnt | grep 9997&lt;/CODE&gt; to check whether it is open. We also created a monitoring page for port 9997 to detect this type of situation. We increased the indexers' queue sizes and we are doing much better.&lt;/P&gt;

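&lt;P&gt;As a rough sketch of that port check (just an illustration, not our exact monitoring code; assumes &lt;CODE&gt;ss&lt;/CODE&gt; or &lt;CODE&gt;netstat&lt;/CODE&gt; is available on the indexer):&lt;/P&gt;

```shell
# Check whether the default Splunk receiving port (9997) is listening.
# Prefer ss; fall back to netstat on older distributions.
port=9997
if command -v ss >/dev/null 2>&1; then
  listeners=$(ss -plnt 2>/dev/null | grep -c ":${port} ")
else
  listeners=$(netstat -plnt 2>/dev/null | grep -c ":${port} ")
fi

if [ "${listeners}" -gt 0 ]; then
  echo "port ${port} open"
else
  echo "port ${port} CLOSED"
fi
```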
&lt;P&gt;Support recommends running &lt;CODE&gt;iostat 1 5&lt;/CODE&gt; and says that %iowait shouldn't exceed 1% consistently over time. They didn't explain the reasoning behind the 1% threshold.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 00:51:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295095#M56059</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2017-02-11T00:51:30Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295096#M56060</link>
      <description>&lt;P&gt;Thanks for the quick response as always!! I have updated my original post with the query results for both my production environment and my test environment.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 20:56:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295096#M56060</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-11T20:56:31Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295097#M56061</link>
      <description>&lt;P&gt;Couldn't add more attachments to my original post @ddrillic so hopefully this works:&lt;/P&gt;

&lt;P&gt;Test Environment (Using about 200GB of license / day)&lt;BR /&gt;
&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2478i8013D7C6667F9594/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Prod Environment (Using about 1TB of license / day)&lt;BR /&gt;
&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2479i909C33A812C58CB2/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 20:59:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295097#M56061</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-11T20:59:19Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295098#M56062</link>
      <description>&lt;P&gt;Not sure if this helps tell any more of the story, but our performance team came back with the following data showing 4 of my 5 production indexers:&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2480i2E64A594A06297DB/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2017 13:27:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295098#M56062</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-13T13:27:44Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295099#M56063</link>
      <description>&lt;P&gt;I think the crucial question is when the excessive I/O started: before or after the first failure. If a host fails, Splunk is going to immediately begin trying to get back to the intended replication and search factors to protect your data. That could be a hard-hitting process if you have a lot of data and you're on shared disk. I imagine it could lead to a chain reaction in an extreme case. Actually, if you're on shared disk I wonder if someone else might have triggered this; what kind of storage are you working with?&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2017 14:01:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295099#M56063</guid>
      <dc:creator>jtacy</dc:creator>
      <dc:date>2017-02-13T14:01:56Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295100#M56064</link>
      <description>&lt;P&gt;I am almost positive that we are on dedicated LUNs for our Splunk servers, but I will certainly validate.  Also, the screenshot above is my production environment, which was not part of the outage that I mentioned in my first post.  Sorry for the confusion.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2017 14:23:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295100#M56064</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-13T14:23:31Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295101#M56065</link>
      <description>&lt;P&gt;Out of curiosity, is this with virtual storage, say a SAN on the backend?&lt;/P&gt;

&lt;P&gt;Reason I ask: this is consistent across your whole indexing tier, at the same time. So I'd either look at data ingestion to see if you had a huge spike, or at whether something else occurred with the underlying infrastructure. If it's a SAN, or shared storage on a platform like VMware, perhaps there was some type of controller issue.&lt;/P&gt;

&lt;P&gt;Support should be able to help also. Keep us updated.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Mar 2017 02:28:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295101#M56065</guid>
      <dc:creator>esix_splunk</dc:creator>
      <dc:date>2017-03-13T02:28:11Z</dc:date>
    </item>
  </channel>
</rss>

