How to find out why Splunk is crashing with error "Crashing thread: TcpChannelThread"?

Splunk is crashing, possibly because of data model acceleration (DMA). One of my clustered indexers keeps crashing, and the crash log is below. Any input or suggestions on where to look would be appreciated.

[build aeae3fe0c5af] 2017-07-05 15:49:56
Received fatal signal 6 (Aborted).
 Cause:
   Signal sent by PID 8568 running under UID 500.
 Crashing thread: TcpChannelThread
 Registers:
    RIP:  [0x00007F2B891B15E5] gsignal + 53 (libc.so.6 + 0x325E5)
    RDI:  [0x0000000000002178]
    RSI:  [0x00000000000023A2]
    RBP:  [0x00007F2B8C5AE7F8]
    RSP:  [0x00007F2B633FAEF8]
    RAX:  [0x0000000000000000]
    RBX:  [0x00007F2B8A722000]
    RCX:  [0xFFFFFFFFFFFFFFFF]
    RDX:  [0x0000000000000006]
    R8:  [0x0000000000000008]
    R9:  [0xFEFEFEFEFEFEFEFF]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000206]
    R12:  [0x00007F2B8C5AE848]
    R13:  [0x00007F2B8C6893A0]
    R14:  [0x0000000000000001]
    R15:  [0x0000000058F56601]
    EFL:  [0x0000000000000206]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]

OS: Linux
Arch: x86-64

Backtrace (PIC build):

  [0x00007F2B891B15E5] gsignal + 53 (libc.so.6 + 0x325E5)
  [0x00007F2B891B2DC5] abort + 373 (libc.so.6 + 0x33DC5)
  [0x00007F2B891AA70E] ? (libc.so.6 + 0x2B70E)
  [0x00007F2B891AA7D0] __assert_perror_fail + 0 (libc.so.6 + 0x2B7D0)
  [0x00007F2B8B2A65DA] _ZN14SummaryManager20readSummaries_lockedERKN12cachemanager10BucketTypeERSt3mapINS0_7CacheIdENS_7SummaryESt4lessIS5_ESaISt4pairIKS5_S6_EEE + 1546 (splunkd + 0xB2C5DA)
  [0x00007F2B8B2A7C89] _ZN14SummaryManager20handleSummaryChangesERK8JsonNode + 249 (splunkd + 0xB2DC89)
  [0x00007F2B8B2A8E5C] _ZN21SummaryManagerHandler12handleCreateER10ConfigInfo + 348 (splunkd + 0xB2EE5C)
  [0x00007F2B8B46357C] _ZN14MConfigHandler14executeHandlerER10ConfigInfo + 620 (splunkd + 0xCE957C)
  [0x00007F2B8B4739ED] _ZN14MConfigHandler2goER10ConfigInfo + 189 (splunkd + 0xCF99ED)
  [0x00007F2B8B4745B4] _ZN29AdminManagerReplyDataProvider2goEv + 804 (splunkd + 0xCFA5B4)
  [0x00007F2B8B50CDE8] _ZN33ServicesEndpointReplyDataProvider9rawHandleEv + 88 (splunkd + 0xD92DE8)
  [0x00007F2B8B50289F] _ZN18RawRestHttpHandler10getPreBodyEP21HttpServerTransaction + 31 (splunkd + 0xD8889F)
  [0x00007F2B8B943D50] _ZN32HttpThreadedCommunicationHandler11communicateER17TcpSyncDataBuffer + 272 (splunkd + 0x11C9D50)
  [0x00007F2B8AF82023] _ZN16TcpChannelThread4mainEv + 227 (splunkd + 0x808023)
  [0x00007F2B8B9CD130] _ZN6Thread8callMainEPv + 64 (splunkd + 0x1253130)
  [0x00007F2B8951AAA1] ? (libpthread.so.0 + 0x7AA1)
  [0x00007F2B89267AAD] clone + 109 (libc.so.6 + 0xE8AAD)
 Linux / jsspl3.verbosity.net / 2.6.32-642.6.1.el6.x86_64 / #1 SMP Wed Oct 5 00:36:12 UTC 2016 / x86_64
 Last few lines of stderr (may contain info on assertion failure, but also could be old):
    2017-07-05 15:45:59.290 -0400 splunkd started (build aeae3fe0c5af)
    splunkd: /home/build/build-src/kimono/src/pipeline/indexer/search/SummaryManager.cpp:63: void SummaryManager::readSummaries_locked(const cachemanager::BucketType&, SummaryManager::SummaryMap&): Assertion `fields.size() == summary_info_fields_size' failed.
    2017-07-05 15:49:34.566 -0400 splunkd started (build aeae3fe0c5af)
    splunkd: /home/build/build-src/kimono/src/pipeline/indexer/search/SummaryManager.cpp:63: void SummaryManager::readSummaries_locked(const cachemanager::BucketType&, SummaryManager::SummaryMap&): Assertion `fields.size() == summary_info_fields_size' failed.

/etc/redhat-release: CentOS release 6.8 (Final)
glibc version: 2.12
glibc release: stable
Last errno: 0
Threads running: 74
Runtime: 21.872963s
argv: [splunkd -p 8089 start]
Regex JIT disabled due to SELinux

using CLOCK_MONOTONIC

Thread: "TcpChannelThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f2b570e6b90:
00000000  00 c7 3f 63 2b 7f 00 00                           |..?c+...|
00000008
commandForThread=0, nextIdle=0x7f2b71571540, requestAfterThread=0, _tpfd=0x7f2b59ae5000, writeCorkCount=0, terminateCallback=(nil), ioError=No error, lastError=No error, terminateError=No error
giveCmd @0x7f2b570e6ce8: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, cmd=0, ok=Y, chan=0x7f2b5ed7a800
writeDataAvail @0x7f2b570e6d48: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, chan=0x7f2b5ed7a800
wbuf: ptr=0x7f2b570e6de8, size=0x8000, rptr=0x0, wptr=0x0
HttpListeningConnection: _transactionActive=Y, _haveHadTransaction=Y, _alreadyLoggedTimeout=N
HttpTcpConnection: peer=127.0.0.1, _desiredCompressionLevel=0
RestHttpServerTransaction: _restPath="admin/summaryman", namespaced=N, context=-/-, session=[user=splunk-system-user, refcnt=2, touched=1499284196, removed=N, id=aacf8b13120ab7d2c0f93e6c1ad67e4f, created=1499284184, createdBy=67C3A1EA-A1BB-4154-AB93-977DBDB331B4, sid="remote_jsspl5.verbosity.net_scheduler__nobody__f5__RMD5ed4f8c799ba150d6_at_1499283900_91172"]
HttpServerTransaction: _state=6, _shouldLog=Y, _startTime=1499284196.291574
REQUEST: POST /services/admin/summaryman HTTP/1.1
    User-Agent: Splunk/6.6.1 (Linux 2.6.32-642.6.1.el6.x86_64; arch=x86_64)
    TE: trailers, chunked
    Host: 127.0.0.1:8089
    Content-Length: 631
    Content-Type: application/x-www-form-urlencoded; charset=UTF-8
    Authorization: Splunk {value elided from dump}
  _bytesReceived=631, _maximumRequestDataSize=2147483648, _totalBytesExpectedOfRequestData=631
  _bytesLeftInRequestDataChunk=0, _requestTransferEncodingIsChunked=N, _receivingRequestDataForever=N
  _needToSetupRequestGunzip=N, _owedConsume=0, _wantSavedRequestData=N
  _100continue=0, _expectDisconnect=N, _overrideSourceState=0
POST arguments: {["name"] = "report_summaries", ["summary_changes"] = "{"f5-system_stats":{"dma":[{"summary_operation_type":"0","summary_cid":"dma|f5-system_stats~232~620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB|2A18DD68-5133-4DA5-98D6-14813FE663D7_DM_f5_bigip-tmstats-pool_member_stat","summary_path":"/opt/splunk_hw/f5-system_stats/datamodel_summary/232_620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB/2A18DD68-5133-4DA5-98D6-14813FE663D7/DM_f5_bigip-tmstats-pool_member_stat","summary_earliest_time":1499020655,"summary_latest_time":1499278827,"summary_size_on_disk":12288}]}}"}
REPLY: 200 
admin_handler="summaryman"
MConfigHandler: name=summaryman, _atomFormat=1, _customAction=
  caller args: id="report_summaries": { summary_changes -> { _dataType=string _isMultiValue=Y, _values: ["{"f5-system_stats":{"dma":[{"summary_operation_type":"0","summary_cid":"dma|f5-system_stats~232~620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB|2A18DD68-5133-4DA5-98D6-14813FE663D7_DM_f5_bigip-tmstats-pool_member_stat","summary_path":"/opt/splunk_hw/f5-system_stats/datamodel_summary/232_620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB/2A18DD68-5133-4DA5-98D6-14813FE663D7/DM_f5_bigip-tmstats-pool_member_stat","summary_earliest_time":1499020655,"summary_latest_time":1499278827,"summary_size_on_disk":12288}]}}"] } }
  _docShowEntry=Y, _didFilter=N, _didPaginate=N
  _maxCount=30, posOffset=0, _requestedAction=1
  _shouldFilter=N, _shouldReload=N, _shouldAutoList=N, _sortSpecified=N
  _strict_mode=N, _list_new=N, _force_stanza_overwite=N, _force_app_context_on_write=Y
  sort keys: ["name"]
  sort modes: ["auto"]
  supported args: ["name" type=0 (required), "summary_changes" type=0 (required)]
  Paginator: offset=0, count=30
  _customStatusCode=0, _supportedActions=0x43, hasSession=Y
  _forceBoolNormalization=N, _contextMode=0, _didCapCheck=Y
  _ranSetup=Y, _restartRequired=N, _listingOne=N
  _userName=splunk-system-user, _appName=search
ServicesEndpointReplyDataProvider: _setupState=0, _outputMode=1, _explicitOutputMode=N
GET args: {}
  _allowedMethods={GET,POST,PUT,DELETE,HEAD,OPTIONS}, _preconditionState=0
  _wantsSeparateThread=N, _alreadyBuiltHeaders=N, _needToSendBody=Y
  _bodyBytesWritten=0, _chunkedState=0, _isLastTransaction=N
  _varyBy=0x10, _redirectUrl="", _downloadFilename="", _totalScheduledLength=0
  _willSendDataLater=N, _toSendState=0, _toSendSafe=Y
  _knowCompleteLength=N, _desiredCompressionLevel=0
  _replyIsGzipCompressed=N, _cacheControl=0x10, _maxCacheSeconds=4294967295, _dontIncludeFrameOptions=N
In TcpChannel 0x7f2b59ae5000, _tcloop=0x7f2b88c42690, no async write data, _data._shouldKill=N, r/w_timeouts=5.000/300.000, timeout_count=0
SSL: version="TLSv1.2", state="SSL negotiation finished successfully", cipher="ECDHE-RSA-AES256-GCM-SHA384", compression="zlib compression"
rbuf: ptr=0x7f2b59ae50a0, size=0x2000, rptr=0x0, wptr=0x0
TcpChannelAcceptor: , tcloop=0x7f2b88c42690, _disabledReasons=0, _activeCount=16, _inflightSubordinateAccepts=0
HttpListener: ssl=Y, _maxActiveConnections=6826, _wellBelowConnectionLimit=Y, _maxThreads=2658
SplunkdHttpListener: PORT: _allowGzip=Y, bind=https://:8089
  conf: _sslopt={rootCAPath="", caCertFile="/opt/splunk/etc/auth/cacert.pem", certFile="/opt/splunk/etc/auth/server.pem", privateKeyFile="/opt/splunk/etc/auth/server.pem", privateKeyPassword_set=Y, commonNameToCheck="", altNameToCheck="", allowSslRenegotiation=Y, sslVersions="TLS1.2", cipherSuite="ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:AES256-GCM-SHA384:AES128-GCM-SHA256:AES128-SHA256", ecdhCurves="prime256v1, secp384r1, secp521r1", useCompression=Y, quietShutdown=NdhFile="", shouldVerifyClientCert=N}, _allowSslRenegotiation=Y, _frameOptionsSameOrigin=Y, _strictTransportSecurityHeader=N, _allowBasicAuth=Y, _allowCookieAuth=Y, _cookieAuthHttpOnly=Y, _cookieAuthSecure=Y
  conf: _streamInWriteTimeout=5.000, _maxContentLength=2147483648, _maxThreads=2658, _maxSockets=6826, _forceHttp10=0
_thread=0x7f2b570e6b80: commandForThread=0, nextIdle=0x7f2b71571540, requestAfterThread=0, _tpfd=0x7f2b59ae5000, writeCorkCount=0, terminateCallback=(nil), ioError=No error, lastError=No error, terminateError=No error
giveCmd @0x7f2b570e6ce8: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, cmd=0, ok=Y, chan=0x7f2b5ed7a800
writeDataAvail @0x7f2b570e6d48: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, chan=0x7f2b5ed7a800
wbuf: ptr=0x7f2b570e6de8, size=0x8000, rptr=0x0, wptr=0x0

x86 CPUID registers:

     0: 0000000B 756E6547 6C65746E 49656E69
     1: 00020651 0C010800 83B82203 0FABFBFF
     2: 55035A01 00F0B2FF 00000000 00CA0000
     3: 00000000 00000000 00000000 00000000
     4: 00000000 00000000 00000000 00000000
     5: 00000000 00000000 00000000 00000000
     6: 00000007 00000002 00000001 00000000
     7: 00000000 00000000 00000000 00000000
     8: 00000000 00000000 00000000 00000000
     9: 00000000 00000000 00000000 00000000
     A: 07300401 0000007F 00000000 00000000
     B: 00000000 00000000 000000CD 0000000C


 80000000: 80000008 00000000 00000000 00000000
  80000001: 00000000 00000000 00000001 28100800
  80000002: 65746E49 2952286C 6F655820 2952286E
  80000003: 55504320 20202020 20202020 45202020
  80000004: 35343635 20402020 30342E32 007A4847
  80000005: 00000000 00000000 00000000 00000000
  80000006: 00000000 00000000 01006040 00000000
  80000007: 00000000 00000000 00000000 00000100
  80000008: 0000302A 00000000 00000000 00000000
terminating...

Re: How to find out why Splunk is crashing with error "Crashing thread: TcpChannelThread"?

Splunk Employee

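If one of the indexers ran out of disk space, writes to .bucketSummaryManifest under the ../yourindex/summary or ../yourindex/datamodel_summary paths may have been left incomplete. The crash can then happen when Splunk reads the corrupt file, which would be consistent with the failed assertion `fields.size() == summary_info_fields_size` in SummaryManager::readSummaries_locked shown in the crash log.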

As a workaround:

  1. Run the commands below from splunk/var/lib/splunk to identify possibly corrupted files and lines:
     grep -vPn '\"[^\"]+\",\"[^\"]+\",\"[^\"]+\",\d+,\d+,\d+' */summary/.bucketSummaryManifest
     grep -vPn '\"[^\"]+\",\"[^\"]+\",\"[^\"]+\",\d+,\d+,\d+' */datamodel_summary/.bucketSummaryManifest
  2. Move the affected .bucketSummaryManifest file to a temporary folder outside Splunk, then restart the indexer.
  3. Confirm that a new .bucketSummaryManifest file is created, and keep an eye on the affected indexer for any further crashes.
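
If grep with -P is not available on the indexer, the grep check above can be approximated with a short Python script. This is only a sketch: it assumes that a well-formed manifest line looks the way the grep pattern implies (three quoted fields followed by three integers), which is inferred from the command and not from any documented file format. The function names find_suspect_lines and scan_index_root are made up for this example.

```python
import re
from pathlib import Path

# Same pattern as the grep command: three quoted fields, then three integers.
# grep matches anywhere in the line, so re.search is used rather than fullmatch.
# ASSUMPTION: this shape is inferred from the grep pattern, not from Splunk docs.
WELL_FORMED = re.compile(r'"[^"]+","[^"]+","[^"]+",\d+,\d+,\d+')


def find_suspect_lines(manifest_path):
    """Return (line_number, text) pairs for lines that do not match the pattern."""
    suspects = []
    # errors="replace" keeps the scan going even if a partial write left bad bytes.
    with open(manifest_path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if not WELL_FORMED.search(text):
                suspects.append((number, text))
    return suspects


def scan_index_root(root):
    """Check */summary and */datamodel_summary manifests directly under root."""
    results = {}
    for subdir in ("summary", "datamodel_summary"):
        pattern = f"*/{subdir}/.bucketSummaryManifest"
        for manifest in sorted(Path(root).glob(pattern)):
            suspects = find_suspect_lines(manifest)
            if suspects:
                results[str(manifest)] = suspects
    return results
```

Calling scan_index_root on the directory that holds your indexes returns a dictionary that maps each suspect manifest path to its non-matching lines; an empty dictionary means nothing was flagged. Like the grep command, it only reports and does not move or modify any file.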

Also, this issue has been fixed in 7.0.0 and later via SPL-141877.
Hope it helps.
