One of our Splunk forwarders has stopped forwarding anything to the indexer. The end of /opt/splunkforwarder/var/log/splunk/splunkd.log looks like this (after a restart):
...
08-17-2016 16:25:09.384 -0700 INFO TailingProcessor - Adding watch on path: /var/c3d/logs/prod/tunnelconnect.log.
08-17-2016 16:25:09.384 -0700 INFO BatchReader - State transitioning from 2 to 0 (initOrResume).
08-17-2016 16:25:09.386 -0700 ERROR BTree - unable to reader 4088 bytes: bytes=4080 Success
08-17-2016 16:25:09.386 -0700 ERROR TailingProcessor - Ignoring path="/opt/splunkforwarder/etc/splunk.version" due to: BTree::Exception: Node::readLE failure in Node::Node(1) node offset: 4112 order: 255 keys: { } children: { }
08-17-2016 16:25:09.389 -0700 ERROR BTree - unable to reader 4088 bytes: bytes=0 Success
08-17-2016 16:25:09.389 -0700 ERROR TailingProcessor - Ignoring path="/var/c3d/logs/dev3/install.log" due to: BTree::Exception: Node::readLE failure in Node::Node(1) node offset: 8200 order: 255 keys: { } children: { }
08-17-2016 16:25:09.436 -0700 ERROR BTree - unable to reader 4088 bytes: bytes=0 Success
08-17-2016 16:25:09.436 -0700 ERROR TailingProcessor - Ignoring path="/opt/splunkforwarder/var/log/splunk/metrics.log.3" due to: BTree::Exception: Node::readLE failure in Node::Node(1) node offset: 8200 order: 255 keys: { } children: { }
...
[many lines like this!]
...
08-17-2016 16:25:11.634 -0700 ERROR BTree - unable to reader 4088 bytes: bytes=0 Success
08-17-2016 16:25:11.634 -0700 ERROR TailingProcessor - Ignoring path="[one of our log file path here]" due to: BTree::Exception: Node::readLE failure in Node::Node(1) node offset: 8200 order: 255 keys: { } children: { }
08-17-2016 16:26:09.333 -0700 INFO TcpOutputProc - Connected to idx=[removed] using ACK.
This seems to be related to things like moving indexes around or hardware/filesystem issues. If this is a Linux-based forwarder, there is a good chance that you need to increase the open file descriptor limit from the default of 1024 to perhaps 4096, and also change the core file size (blocks) to unlimited.
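For reference, here is a minimal sketch of how to check and raise those limits on a typical Linux host. The 4096/unlimited values are just the ones suggested above, and it assumes the forwarder runs as a "splunk" user whose limits are managed via /etc/security/limits.conf:

$ ulimit -n                 # current open file descriptor limit
$ ulimit -c                 # current core file size (blocks)
$ ulimit -n 4096            # temporary, for the shell that starts the forwarder
$ ulimit -c unlimited

# persistent, in /etc/security/limits.conf:
splunk  soft  nofile  4096
splunk  hard  nofile  4096
splunk  soft  core    unlimited
splunk  hard  core    unlimited

Restart the forwarder from a session that has the new limits so splunkd actually picks them up.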
I had the same issue and resolved it by fixing the ownership of splunk_private_db.old: chown -R splunk:splunk /opt/splunk.........
This resolution worked for me. I noticed that splunk_private_db.old was owned by root, so I simply chown'd it. Fixed the problem!
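To make the ownership check above concrete, something like the following should reveal the problem; the fishbucket path shown is the usual default for a forwarder install and may differ on your system:

$ ls -l /opt/splunkforwarder/var/lib/splunk/fishbucket/
# if splunk_private_db or splunk_private_db.old is owned by root:root,
# hand it back to the user the forwarder runs as:
$ chown -R splunk:splunk /opt/splunkforwarder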
FWIW this solution also solved a similar problem we had with a different symptom. After doing some (extensive) crash testing on one of our systems with a Splunk forwarder, the forwarder started leaving crash logs (e.g. [splunkforwarder]/var/log/splunk/crash-2016-09-27-21:25:27.log) like this:
...
Backtrace:
[0x00007F1C68F88CC9] gsignal + 57 (/lib/x86_64-linux-gnu/libc.so.6)
[0x00007F1C68F8C0D8] abort + 328 (/lib/x86_64-linux-gnu/libc.so.6)
[0x00007F1C68F81B86] ? (/lib/x86_64-linux-gnu/libc.so.6)
[0x00007F1C68F81C32] ? (/lib/x86_64-linux-gnu/libc.so.6)
[0x0000000000E8440D] _ZN2bt5BTree4Node6readLEERK14FileDescriptor + 525 (splunkd)
[0x0000000000E848B1] _ZN2bt5BTree4NodeC2ERK14FileDescriptorPKNS0_3KeyEj + 97 (splunkd)
[0x0000000000E84EC4] _ZNK2bt5BTree4Node6acceptERK14FileDescriptorS4_RNS0_7VisitorEj + 148 (splunkd)
[0x0000000000E82BC2] _ZNK2bt5BTree10checkValidERNS0_6RecordE + 130 (splunkd)
[0x0000000000E8833B] _ZN2bt7BTreeCP10checkpointEv + 27 (splunkd)
[0x0000000000E8A07A] _ZN2bt7BTreeCP9addUpdateERKNS_5BTree3KeyERKNS1_6RecordE + 458 (splunkd)
[0x0000000000C3C4E6] _ZN11FileTracker9addUpdateERK5CRC_tRK10FishRecordb + 214 (splunkd)
[0x000000000099A7D8] _ZN3WTF11resetOffsetEv + 280 (splunkd)
[0x000000000099CF42] _ZN3WTF13loadFishStateEb + 1058 (splunkd)
[0x000000000098895F] _ZN10TailReader8readFileER15WatchedTailFileP11TailWatcherP11BatchReader + 191 (splunkd)
[0x0000000000988B7F] _ZN11TailWatcher8readFileER15WatchedTailFile + 127 (splunkd)
[0x000000000098ADFC] _ZN11TailWatcher11fileChangedEP16WatchedFileStateRK7Timeval + 700 (splunkd)
[0x0000000000EB8AA2] _ZN30FilesystemChangeInternalWorker15callFileChangedER7TimevalP16WatchedFileState + 114 (splunkd)
[0x0000000000EBA430] _ZN30FilesystemChangeInternalWorker12when_expiredERy + 464 (splunkd)
[0x0000000000F49B5D] _ZN11TimeoutHeap18runExpiredTimeoutsER7Timeval + 301 (splunkd)
[0x0000000000EB3CB8] _ZN9EventLoop3runEv + 744 (splunkd)
[0x000000000098916D] _ZN11TailWatcher3runEv + 141 (splunkd)
[0x000000000098E91A] _ZN13TailingThread4mainEv + 154 (splunkd)
[0x0000000000F4768E] _ZN6Thread8callMainEPv + 62 (splunkd)
[0x00007F1C6931F182] ? (/lib/x86_64-linux-gnu/libpthread.so.0)
[0x00007F1C6904C47D] clone + 109 (/lib/x86_64-linux-gnu/libc.so.6)
...
The suggested btprobe cmd fixed this too.
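As an aside, the mangled frames in that backtrace can be decoded with c++filt, which makes it clearer that the crash happens inside the fishbucket BTree code; the output below is what GNU c++filt produces for two of the frames:

$ echo '_ZN2bt5BTree4Node6readLEERK14FileDescriptor' | c++filt
bt::BTree::Node::readLE(FileDescriptor const&)
$ echo '_ZN11FileTracker9addUpdateERK5CRC_tRK10FishRecordb' | c++filt
FileTracker::addUpdate(CRC_t const&, FishRecord const&, bool)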
This sounds reasonable, as we do currently have
$ ulimit -n
1024
eryder suggested in a direct email to us that we run:
$ ./bin/splunk cmd btprobe -d [your path to...]/fishbucket/splunk_private_db -r
This appears to have fixed the problem. The errors are no longer seen after a Splunk restart.
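For anyone else landing here, the rough sequence we used looked like this; stopping the forwarder before running btprobe and restarting it afterwards was our own precaution rather than part of the suggestion, and the fishbucket path shown is the default location (substitute your own):

$ cd /opt/splunkforwarder
$ ./bin/splunk stop
$ ./bin/splunk cmd btprobe -d /opt/splunkforwarder/var/lib/splunk/fishbucket/splunk_private_db -r
$ ./bin/splunk start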
Thanks!
Thanks, I saw that issue too, but it wasn't clear to me what their solution was. Also, we have different concerns, since this error is on one of our forwarders, not the indexer. We don't have backups of the forwarder filesystem.
I was hoping that since this is a forwarder the solution might be easier. For example, re-indexing some of the data from the forwarder would be acceptable...
Really interesting, because this is the only thread I found with any reference to a BTree error.