I upgraded my distributed environment to Splunk 6.0 and now my Indexers are continually crashing. I looked in the log and the crashing thread is "FwdDataReceiverThread". Has anyone else seen this? What can be done to fix or work around the problem?
What you're seeing here is SPL-75796, a known issue. There is a workaround available for the problem. In inputs.conf on the indexer(s), you can use this value in the splunktcp stanza:
negotiateNewProtocol = [true|false]
* If set to true, allow forwarders that connect to this indexer (or specific port) to send data using the new forwarder protocol.
* If set to false, deny the use of the new forwarder protocol during connection negotiation.
* Defaults to true.
In the indexer's inputs.conf, set negotiateNewProtocol = false.
This forces the indexer to use the pre-6.0 version of the S2S protocol. There is no need to modify the negotiateNewProtocol attribute in the Windows forwarder's 6.0 outputs.conf at this time.
After making this change, a restart of the indexer is required. This setting has been used successfully to mitigate the problem in more than one production 6.0 environment.
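For reference, here is a minimal sketch of what the indexer-side change might look like, assuming a splunktcp input on port 9997 (substitute whatever receiving port(s) you actually use, or put the attribute in the global [splunktcp] stanza if you want it to cover every port):
# SPL-75796 workaround: fall back to the pre-6.0 S2S protocol
# Port 9997 is only an example; use your actual receiving port
[splunktcp://9997]
negotiateNewProtocol = false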
Yes, this is fixed in 6.1.
I don't see this bug mentioned in the 6.1 release notes. Can anybody confirm that it is fixed in 6.1?
And also in 6.0.4.
Yes, this is fixed in 6.1.
Adding negotiateNewProtocol = false to the indexers' inputs.conf reduced the crash frequency, but adding it to outputs.conf on every server talking directly to the indexers solved the rest of the problem.
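For what it's worth, a minimal sketch of that forwarder-side outputs.conf change could look like this (placing it under [tcpout] is an assumption on my part; the Windows UF answer further down uses a [default] stanza instead):
# SPL-75796 workaround on the sending side: don't negotiate the new S2S protocol
[tcpout]
negotiateNewProtocol = false
Restart each forwarder after deploying the change.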
The call stack from the crash*.log file at location $SPLUNK_HOME\var\log\splunk\ is as shown below.
Last crash-log:
[build 196940] 2014-03-18 12:58:15
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 30461 running under UID 0.
Crashing thread: FwdDataReceiverThread
Registers:
RIP: [0x0000003605432925] gsignal + 53 (/lib64/libc.so.6)
RDI: [0x00000000000076FD]
RSI: [0x0000000000007889]
RBP: [0x00000000012FE800]
RSP: [0x00007FA0BF7FE7D8]
RAX: [0x0000000000000000]
RBX: [0x00007FA0CC421000]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0xFEFEFEFEFEFEFEFF]
R9: [0x00007FA0CCAD8F60]
R10: [0x0000000000000008]
R11: [0x0000000000000202]
R12: [0x000000000125FD43]
R13: [0x00000000012FEFA0]
R14: [0x00007FA0A4A81400]
R15: [0x00007FA0AA114BB0]
EFL: [0x0000000000000202]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
Backtrace:
[0x0000003605432925] gsignal + 53 (/lib64/libc.so.6)
[0x0000003605434105] abort + 373 (/lib64/libc.so.6)
[0x000000360542BA4E] ? (/lib64/libc.so.6)
[0x000000360542BB10] _assert_perror_fail + 0 (/lib64/libc.so.6)
[0x0000000000B4653F] _ZN20PipelineInputChannel16addNumberedFieldERK10StrSegment + 159 (splunkd)
[0x0000000000C49EE3] _ZN21S2SRegisteredChannels11createModelERK21S2SChannelDescriptionR15CowPipelineData + 1043 (splunkd)
[0x0000000000BABF1D] _ZN18StreamingS2SParser5parseEPKcS1 + 8845 (splunkd)
[0x0000000000869570] _ZN16CookedTcpChannel7consumeER18TcpAsyncDataBuffer + 144 (splunkd)
[0x00000000008684F1] _ZN16CookedTcpChannel13dataAvailableER18TcpAsyncDataBuffer + 49 (splunkd)
[0x0000000000D81F95] _ZN10TcpChannel11when_eventsE18PollableDescriptor + 757 (splunkd)
[0x0000000000D06BF0] _ZN8PolledFd8do_eventEv + 128 (splunkd)
[0x0000000000D07847] _ZN9EventLoop3runEv + 1239 (splunkd)
[0x0000000000D8478D] _ZN6Thread8callMainEPv + 61 (splunkd)
[0x00000036058079D1] ? (/lib64/libpthread.so.0)
[0x00000036054E8B6D] clone + 109 (/lib64/libc.so.6)
Linux / i01.splunk.nsc.no / 2.6.32-431.el6.x86_64 / #1 SMP Fri Nov 22 03:15:09 UTC 2013 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
Cannot open manifest file inside "/data/sas/db_1395142465_1395142438_38_9E5C49A8-6A36-41A9-A5F2-0B7F5547533E/rawdata": No such file or directory
Cannot open manifest file inside "/data/syslog/db_1395142453_1395142381_54_9E5C49A8-6A36-41A9-A5F2-0B7F5547533E/rawdata": No such file or directory
Cannot open manifest file inside "/data/tvinpeaks/db_1332070440_1332070320_837_9E5C49A8-6A36-41A9-A5F2-0B7F5547533E/rawdata": No such file or directory
Cannot open manifest file inside "/data/windows/db_1395142448_1395142391_83_9E5C49A8-6A36-41A9-A5F2-0B7F5547533E/rawdata": No such file or directory
splunkd: /opt/splunk/p4/splunk/branches/6.0.2/src/framework/PipelineInputChannel.cpp:706: void PipelineInputChannel::addNumberedField(const StrSegment&): Assertion `_refcnt == 1' failed.
/etc/redhat-release: CentOS release 6.5 (Final)
glibc version: 2.12
glibc release: stable
Last errno: 0
Threads running: 58
argv: [splunkd -p 8089 start]
Thread: "FwdDataReceiverThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7fa0bf82b450:
00000000 00 f7 7f bf a0 7f 00 00 |........|
00000008
First 512 bytes of PolledFd object @0x7fa0a4a7f000:
00000000 30 80 2a 01 00 00 00 00 00 00 00 00 00 00 00 00 |0.*.............|
00000010 a0 93 7d 01 00 00 00 00 00 00 00 00 00 00 00 00 |..}.............|
00000020 70 3d 22 32 30 31 34 2d 00 00 00 00 00 00 00 00 |p="2014-........|
00000030 3a 33 38 3a 31 36 2e 32 de 00 00 00 1f 00 00 00 |:38:16.2........|
00000040 c0 70 84 bf a0 7f 00 00 1b 00 00 00 4f 22 20 6c |.p..........O" l|
00000050 00 00 00 00 00 00 00 00 65 3d 22 63 6f 6d 2e 74 |........e="com.t|
00000060 e0 cd d7 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 d0 e4 d7 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 d0 02 d8 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 00 00 00 00 15 02 00 00 00 00 00 00 |................|
000000a0 fd 52 fe 52 37 73 6f 75 72 63 65 3a 3a 43 3a 5c |.R.R7source::C:|
000000b0 69 6e 65 74 70 75 62 5c 6c 6f 67 73 5c 4c 6f 67 |inetpub\logs\Log|
000000c0 46 69 6c 65 73 5c 57 33 53 56 43 31 5c 75 5f 65 |Files\W3SVC1\u_e|
000000d0 78 31 34 30 33 31 38 2e 6c 6f 67 18 68 6f 73 74 |x140318.log.host|
000000e0 3a 3a 53 4d 53 2d 4b 55 4e 44 45 44 49 41 4c 4f |::SMS-KUNDEDIALO|
000000f0 47 30 31 10 73 6f 75 72 63 65 74 79 70 65 3a 3a |G01.sourcetype::|
00000100 69 69 73 03 35 39 12 04 64 61 74 65 04 74 69 6d |iis.59..date.tim|
00000110 65 0a 73 5f 73 69 74 65 6e 61 6d 65 0e 73 5f 63 |e.s_sitename.s_c|
00000120 6f 6d 70 75 74 65 72 6e 61 6d 65 04 73 5f 69 70 |omputername.s_ip|
00000130 09 63 73 5f 6d 65 74 68 6f 64 0b 63 73 5f 75 72 |.cs_method.cs_ur|
00000140 69 5f 73 74 65 6d 0c 63 73 5f 75 72 69 5f 71 75 |i_stem.cs_uri_qu|
00000150 65 72 79 06 73 5f 70 6f 72 74 0b 63 73 5f 75 73 |ery.s_port.cs_us|
00000160 65 72 6e 61 6d 65 04 63 5f 69 70 0d 63 73 5f 55 |ername.c_ip.cs_U|
00000170 73 65 72 5f 41 67 65 6e 74 0a 63 73 5f 52 65 66 |ser_Agent.cs_Ref|
00000180 65 72 65 72 07 63 73 5f 68 6f 73 74 09 73 63 5f |erer.cs_host.sc_|
00000190 73 74 61 74 75 73 0c 73 63 5f 73 75 62 73 74 61 |status.sc_substa|
000001a0 74 75 73 0f 73 63 5f 77 69 6e 33 32 5f 73 74 61 |tus.sc_win32_sta|
000001b0 74 75 73 0a 74 69 6d 65 5f 74 61 6b 65 6e fc 52 |tus.time_taken.R|
000001c0 f5 84 01 82 c6 c0 02 09 04 05 5f 70 61 74 68 2e |.........._path.|
000001d0 43 3a 5c 69 6e 65 74 70 75 62 5c 6c 6f 67 73 5c |C:\inetpub\logs|
000001e0 4c 6f 67 46 69 6c 65 73 5c 57 33 53 56 43 31 5c |LogFiles\W3SVC1|
000001f0 75 5f 65 78 31 34 30 33 31 38 2e 6c 6f 67 04 07 |u_ex140318.log..|
00000200
In TcpChannel 0x7fa0a4a7f000, _tcloop=0x7fa0bf8541c0, no async write data, _data._shouldKill=N, r/w_timeouts=600000/240000, timeout_count=0
SSL: inactive
rbuf: ptr=0x7fa0a4a7f0a0, size=0x2000, rptr=0x0, wptr=0x215
TcpChannelAcceptor: tcloop=0x7fa0bf8541c0, activeList=0x7fa0aa17c000, activeCount=8, idleThreads=(nil), idleThreadCount=0, activeThreadCount=0, maxActiveThreadsInTimePeriod=0, idleThreadTrimmerActive=N, disabledReasons=0
x86 CPUID registers:
0: 0000000D 756E6547 6C65746E 49656E69
1: 000206D7 04200800 1FBEE3FF BFEBFBFF
2: 76035A01 00F0B2FF 00000000 00CA0000
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000040 00000040 00000003 00021120
6: 00000077 00000002 00000009 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000000 00000000 00000000 00000000
A: 07300403 00000000 00000000 00000603
B: 00000000 00000000 000000FD 00000004
C: 00000000 00000000 00000000 00000000
D: 00000000 00000000 00000000 00000000
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000001 2C100800
80000002: 20202020 49202020 6C65746E 20295228
80000003: 6E6F6558 20295228 20555043 322D3545
80000004: 20333436 20402030 30332E33 007A4847
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000100
80000008: 0000302E 00000000 00000000 00000000
Another, more surgical approach is to disable the new protocol only on the Windows UFs. This lets the indexers keep using the 6.x Splunk-to-Splunk (S2S) protocol, which has some advantages, for non-Windows systems.
On each Windows UF's outputs.conf file, I use this setting:
#SPL-75796 Windows UF Crashing Indexers Fix
[default]
negotiateNewProtocol = false
Presently this is slated for a fix in 6.0.3, but that might get expedited.
Can we push that stanza to Linux UFs as well and be sure it won't break anything? If not, is the only solution to create two outputs.conf files (one for Windows UFs and one for Linux UFs)?
Still not fixed in 6.0.3 😞
No, it is still not fixed in 6.0.2:
Search for SPL-75796 in the release notes:
http://docs.splunk.com/Documentation/Splunk/6.0.2/ReleaseNotes/KnownIssues
Is it fixed in 6.0.2?
The diag and debug from splunkd.log on the indexer are attached to support case 154002.
@Glennpierce, I have added your comments to the bug. I haven't seen non-Windows UFs trigger the crash, but it sounds like you have.
Any more details you can provide about the circumstances of the crash would be appreciated and noted in the SPL bug too.
I tried applying the workaround suggested above, but it did not stop our indexer's splunkd process from crashing. After adding negotiateNewProtocol = false to outputs.conf on all our tier-1 intermediate Splunk servers, the search head, and the deployment server, the crashes stopped.
Note that we have no Windows UFs connecting directly to our indexer, only Linux UFs or full Splunk instances built on Linux, so I'd suggest this issue applies to more than just Windows UFs connecting to an indexer.
It is listed in 6.0.1 release notes now.
This is not fixed in 6.0.1.
It does not appear to be listed in the known issues for either version - or any of the versions I've searched, for that matter. Not sure if the number is correct. Check yourself:
http://docs.splunk.com/Documentation/Splunk/6.0.1/ReleaseNotes/KnownIssues
Does anyone know if this is fixed in Splunk 6.0.1?