<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Why did my indexers have a large spike in io? in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295100#M56064</link>
    <description>&lt;P&gt;I am almost positive that we are on dedicated LUNs for our Splunk servers, but I will certainly validate.  Also, the screenshot above is my production environment, which was not part of the outage that I mentioned in my first post.  Sorry for the confusion.&lt;/P&gt;</description>
    <pubDate>Mon, 13 Feb 2017 14:23:31 GMT</pubDate>
    <dc:creator>paimonsoror</dc:creator>
    <dc:date>2017-02-13T14:23:31Z</dc:date>
    <item>
      <title>Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295094#M56058</link>
      <description>&lt;P&gt;Hi Folks;&lt;/P&gt;

&lt;P&gt;Wondering if someone could help me out here. I just had a big issue with Splunk: 3 of my indexers crashed for a bit (replication factor of 3). On server 1, the service crashed with a bucket replication error (I fixed this); on server 2, the service crashed and was simply restarted; server 3 hung completely and required a reboot.&lt;/P&gt;

&lt;P&gt;After taking a quick peek, all of the stats look 'normal', including CPU, physical, and storage. However, one thing jumped out at me: the iostats:&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2481i5E04C601DF8383B4/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Any particular reason this would start to happen? I just checked my forwarders and I don't see anything out of the ordinary, such as a large ramp in data ingestion.&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2482i4A2EE8DE8D145CB4/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;I am working with my Linux team to restore one of my servers, and they are stating that there was a "kernel-level CPU soft lockup".&lt;/P&gt;

&lt;P&gt;Any advice would be helpful in triaging this!&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 23:11:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295094#M56058</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-10T23:11:43Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295095#M56059</link>
      <description>&lt;P&gt;Really interesting. We recently had a similar situation.&lt;/P&gt;

&lt;P&gt;This query can help identify stress on the indexers. If you run it for the past week, it would be interesting to see the results:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=_internal group=queue blocked name=indexqueue | timechart count by host
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;In our case, the indexers' queues filled up and port 9997 on some of them was closed for a couple of days. Only bouncing the indexers reopened port 9997. You can run &lt;CODE&gt;netstat -plnt | grep 9997&lt;/CODE&gt; to check whether it is open. We also created a monitoring page for port 9997 to detect this type of situation. We increased the indexers' queue sizes and we are doing much better.&lt;/P&gt;

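&lt;P&gt;As a rough sketch of that port check (just an illustration, not our exact monitoring code; assumes &lt;CODE&gt;ss&lt;/CODE&gt; or &lt;CODE&gt;netstat&lt;/CODE&gt; is available on the indexer):&lt;/P&gt;

```shell
# Check whether the default Splunk receiving port (9997) is listening.
# Prefer ss; fall back to netstat on older distributions.
port=9997
if command -v ss >/dev/null 2>&1; then
  listeners=$(ss -plnt 2>/dev/null | grep -c ":${port} ")
else
  listeners=$(netstat -plnt 2>/dev/null | grep -c ":${port} ")
fi

if [ "${listeners}" -gt 0 ]; then
  echo "port ${port} open"
else
  echo "port ${port} CLOSED"
fi
```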
&lt;P&gt;Support recommends running &lt;CODE&gt;iostat 1 5&lt;/CODE&gt; and says that %iowait shouldn't exceed 1% consistently over time. They didn't explain the reasoning behind the 1% threshold.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 00:51:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295095#M56059</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2017-02-11T00:51:30Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295096#M56060</link>
      <description>&lt;P&gt;Thanks for the quick response as always!! I have updated my original post with the query results for both my production environment and my test environment.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 20:56:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295096#M56060</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-11T20:56:31Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295097#M56061</link>
      <description>&lt;P&gt;Couldn't add more attachments to my original post @ddrillic so hopefully this works:&lt;/P&gt;

&lt;P&gt;Test Environment (Using about 200GB of license / day)&lt;BR /&gt;
&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2478i8013D7C6667F9594/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Prod Environment (Using about 1TB of license / day)&lt;BR /&gt;
&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2479i909C33A812C58CB2/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 20:59:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295097#M56061</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-11T20:59:19Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295098#M56062</link>
      <description>&lt;P&gt;Not sure if this helps tell any more of the story, but our performance team came back with the following data showing 4 of my 5 production indexers:&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/2480i2E64A594A06297DB/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2017 13:27:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295098#M56062</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-13T13:27:44Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295099#M56063</link>
      <description>&lt;P&gt;I think the crucial question is when the excessive I/O started: before or after the first failure. If a host fails, Splunk is going to immediately begin trying to get back to the intended replication and search factors to protect your data. That could be a hard-hitting process if you have a lot of data and you're on shared disk. I imagine it could lead to a chain reaction in an extreme case. Actually, if you're on shared disk I wonder if someone else might have triggered this; what kind of storage are you working with?&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2017 14:01:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295099#M56063</guid>
      <dc:creator>jtacy</dc:creator>
      <dc:date>2017-02-13T14:01:56Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295100#M56064</link>
      <description>&lt;P&gt;I am almost positive that we are on dedicated LUNs for our Splunk servers, but I will certainly validate.  Also, the screenshot above is my production environment, which was not part of the outage that I mentioned in my first post.  Sorry for the confusion.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2017 14:23:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295100#M56064</guid>
      <dc:creator>paimonsoror</dc:creator>
      <dc:date>2017-02-13T14:23:31Z</dc:date>
    </item>
    <item>
      <title>Re: Why did my indexers have a large spike in io?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295101#M56065</link>
      <description>&lt;P&gt;Out of curiosity, is this with virtual storage, say a SAN on the backend?&lt;/P&gt;

&lt;P&gt;Reason I ask: this is consistent across your whole indexing tier, at the same time. So I'd either look at data ingestion to see if you had a huge spike, or at whether something else occurred with the underlying infrastructure. If it's a SAN, or shared storage on a platform like VMware, perhaps there was some type of controller issue.&lt;/P&gt;

&lt;P&gt;Support should be able to help also. Keep us updated.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Mar 2017 02:28:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-did-my-indexers-have-a-large-spike-in-io/m-p/295101#M56065</guid>
      <dc:creator>esix_splunk</dc:creator>
      <dc:date>2017-03-13T02:28:11Z</dc:date>
    </item>
  </channel>
</rss>

