
Why am I not getting introspection data for these indexers?


I have several indexers in my cluster that are not producing introspection data, despite being identically configured to their cluster mates. This is causing my DMC to show unknown status (question marks) for these indexers on several DMC status panels. Why?

My indexers are running on RHEL 6.

1 Solution


(Warning: Medium length troubleshooting section follows, enjoy or skip to the Teal Deer at the end)

Normally, introspection data is collected by a helper process that splunkd launches as a child at startup. The helper is a copy of splunkd running with the argument "instrument-resource-usage". On some hosts we see this process:

$ ps -ef | grep splunk | grep -v search
splunk  18731     1 36 Feb26 ?        6-03:18:49 splunkd -p 8089 restart
splunk  18732 18731  0 Feb26 ?        01:28:01 [splunkd pid=18731] splunkd -p 8089 restart [process-runner]
splunk  18957 18732  0 Feb26 ?        00:39:53 /opt/splunk/bin/splunkd instrument-resource-usage -p 8089 --with-kvstore

On some we don't:

$ ps -ef | grep splunkd | grep -v "search"
splunk  30198     1 39 02:44 ?        03:07:52 splunkd -p 8089 restart
splunk  30199 30198  0 02:44 ?        00:02:00 [splunkd pid=30198] splunkd -p 8089 restart [process-runner]

The missing process correlates exactly with the hosts that are not producing introspection data. This process should be launched every time Splunk starts, and splunkd.log confirms it was launched during our last restart:
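
If you want to sweep a cluster for the missing helper, the check above is easy to script. This is a sketch, not a Splunk tool: `check_introspection` is a hypothetical convenience function, and for remote hosts you'd wrap it in ssh.

```shell
#!/bin/sh
# check_introspection: read "ps -ef"-style output on stdin and report
# whether the instrument-resource-usage helper is present.
# (Hypothetical helper function, not part of Splunk.)
check_introspection() {
  if grep -q 'instrument-resource-usage'; then
    echo "helper running"
  else
    echo "helper MISSING"
  fi
}

# Check the local host:
ps -ef | check_introspection

# For a cluster sweep, something like (hostnames are placeholders):
#   for h in idx1 idx2 idx3; do
#     printf '%s: ' "$h"; ssh "$h" ps -ef | check_introspection
#   done
```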

$ egrep instrument-resource-usage $SPLUNK_HOME/var/log/splunk/splunkd.log
03-14-2016 02:44:10.822 -0400 INFO  ExecProcessor - New scheduled exec process: /opt/splunk/bin/splunkd instrument-resource-usage

But the process doesn't survive long after the restart; it's already gone when we look only seconds later. Let's run it by hand and see what output we get:

$  /opt/splunk/bin/splunkd instrument-resource-usage
INFO  RU_main - I-data gathering (Resource Usage) starting; period=10s

Cond 'swapSize_free <= ru._swap.get()' false; line 443, collect_hostwide__Linux()

Oof. That looks like a failed assertion, related to swap? Strange. Let's see what our swap usage looks like:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         12921      12629        291          0        237      10158
-/+ buffers/cache:       2234      10686
Swap:        14855 1759218580     256081

Okay. Weird. We have LOTS more swap free than we have swap in total, which should be impossible. How does free come up with those numbers? Time to dig out the source. (tick tock tick tock tick tock) Turns out free gets those numbers from /proc/meminfo - and that is messed up too.

$ grep Swap /proc/meminfo
SwapCached:        19784 kB
SwapTotal:      15212520 kB
SwapFree:       262221008 kB
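
The failed assertion above is essentially checking that free swap never exceeds total swap, and we can apply the same sanity check to /proc/meminfo directly. A sketch (the awk check is my own, not how splunkd implements it):

```shell
#!/bin/sh
# swap_sane: compare SwapFree against SwapTotal in a meminfo-format file.
# On a healthy kernel, SwapFree can never exceed SwapTotal.
swap_sane() {
  awk '/^SwapTotal:/ {total=$2}
       /^SwapFree:/  {free=$2}
       END {
         if (free > total) print "BROKEN: SwapFree " free " kB > SwapTotal " total " kB"
         else              print "OK"
       }' "$1"
}

# Check the live kernel counters (Linux only):
[ -r /proc/meminfo ] && swap_sane /proc/meminfo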

(And here's the TL;DR)

Now we're getting somewhere. If /proc/meminfo is wrong, then it's most likely a kernel bug (whee!). A little googling finds https://access.redhat.com/solutions/1571043. If you don't have access behind Red Hat's paywall, the gist is that kernel-2.6.32-573.1.1.el6.x86_64 has a bug that causes SwapFree to be counted wrong. A reboot clears the bad counter, but only temporarily; fixing it permanently requires updating to a newer kernel. This problem likely also occurs on CentOS at the same kernel level.
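
For quick triage, you can check whether a host is running the kernel build named in the Red Hat article. A sketch; `affected_kernel` is a hypothetical helper, and only 2.6.32-573.1.1.el6 is confirmed by that article, so treat other builds as "check Red Hat's errata" rather than known-good:

```shell
#!/bin/sh
# affected_kernel: flag the kernel build named in the Red Hat solution.
# (Hypothetical helper; only 2.6.32-573.1.1.el6 is confirmed affected.)
affected_kernel() {
  case "$1" in
    2.6.32-573.1.1.el6*) echo "affected: update the kernel" ;;
    *)                   echo "not the known-bad build" ;;
  esac
}

affected_kernel "$(uname -r)"
```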
