Splunk is crashing, possibly related to DMA (data model acceleration). I am having an issue with one of my clustered indexers crashing. The crash log follows. Any input or suggestions on where to look would be appreciated.
[build aeae3fe0c5af] 2017-07-05 15:49:56
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 8568 running under UID 500.
Crashing thread: TcpChannelThread
Registers:
RIP: [0x00007F2B891B15E5] gsignal + 53 (libc.so.6 + 0x325E5)
RDI: [0x0000000000002178]
RSI: [0x00000000000023A2]
RBP: [0x00007F2B8C5AE7F8]
RSP: [0x00007F2B633FAEF8]
RAX: [0x0000000000000000]
RBX: [0x00007F2B8A722000]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0x0000000000000008]
R9: [0xFEFEFEFEFEFEFEFF]
R10: [0x0000000000000008]
R11: [0x0000000000000206]
R12: [0x00007F2B8C5AE848]
R13: [0x00007F2B8C6893A0]
R14: [0x0000000000000001]
R15: [0x0000000058F56601]
EFL: [0x0000000000000206]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
Backtrace (PIC build):
[0x00007F2B891B15E5] gsignal + 53 (libc.so.6 + 0x325E5)
[0x00007F2B891B2DC5] abort + 373 (libc.so.6 + 0x33DC5)
[0x00007F2B891AA70E] ? (libc.so.6 + 0x2B70E)
[0x00007F2B891AA7D0] __assert_perror_fail + 0 (libc.so.6 + 0x2B7D0)
[0x00007F2B8B2A65DA] _ZN14SummaryManager20readSummaries_lockedERKN12cachemanager10BucketTypeERSt3mapINS0_7CacheIdENS_7SummaryESt4lessIS5_ESaISt4pairIKS5_S6_EEE + 1546 (splunkd + 0xB2C5DA)
[0x00007F2B8B2A7C89] _ZN14SummaryManager20handleSummaryChangesERK8JsonNode + 249 (splunkd + 0xB2DC89)
[0x00007F2B8B2A8E5C] _ZN21SummaryManagerHandler12handleCreateER10ConfigInfo + 348 (splunkd + 0xB2EE5C)
[0x00007F2B8B46357C] _ZN14MConfigHandler14executeHandlerER10ConfigInfo + 620 (splunkd + 0xCE957C)
[0x00007F2B8B4739ED] _ZN14MConfigHandler2goER10ConfigInfo + 189 (splunkd + 0xCF99ED)
[0x00007F2B8B4745B4] _ZN29AdminManagerReplyDataProvider2goEv + 804 (splunkd + 0xCFA5B4)
[0x00007F2B8B50CDE8] _ZN33ServicesEndpointReplyDataProvider9rawHandleEv + 88 (splunkd + 0xD92DE8)
[0x00007F2B8B50289F] _ZN18RawRestHttpHandler10getPreBodyEP21HttpServerTransaction + 31 (splunkd + 0xD8889F)
[0x00007F2B8B943D50] _ZN32HttpThreadedCommunicationHandler11communicateER17TcpSyncDataBuffer + 272 (splunkd + 0x11C9D50)
[0x00007F2B8AF82023] _ZN16TcpChannelThread4mainEv + 227 (splunkd + 0x808023)
[0x00007F2B8B9CD130] _ZN6Thread8callMainEPv + 64 (splunkd + 0x1253130)
[0x00007F2B8951AAA1] ? (libpthread.so.0 + 0x7AA1)
[0x00007F2B89267AAD] clone + 109 (libc.so.6 + 0xE8AAD)
Linux / jsspl3.verbosity.net / 2.6.32-642.6.1.el6.x86_64 / #1 SMP Wed Oct 5 00:36:12 UTC 2016 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
2017-07-05 15:45:59.290 -0400 splunkd started (build aeae3fe0c5af)
splunkd: /home/build/build-src/kimono/src/pipeline/indexer/search/SummaryManager.cpp:63: void SummaryManager::readSummaries_locked(const cachemanager::BucketType&, SummaryManager::SummaryMap&): Assertion `fields.size() == summary_info_fields_size' failed.
2017-07-05 15:49:34.566 -0400 splunkd started (build aeae3fe0c5af)
splunkd: /home/build/build-src/kimono/src/pipeline/indexer/search/SummaryManager.cpp:63: void SummaryManager::readSummaries_locked(const cachemanager::BucketType&, SummaryManager::SummaryMap&): Assertion `fields.size() == summary_info_fields_size' failed.
/etc/redhat-release: CentOS release 6.8 (Final)
glibc version: 2.12
glibc release: stable
Last errno: 0
Threads running: 74
Runtime: 21.872963s
argv: [splunkd -p 8089 start]
Regex JIT disabled due to SELinux
using CLOCK_MONOTONIC
Thread: "TcpChannelThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f2b570e6b90:
00000000 00 c7 3f 63 2b 7f 00 00 |..?c+...|
00000008
commandForThread=0, nextIdle=0x7f2b71571540, requestAfterThread=0, _tpfd=0x7f2b59ae5000, writeCorkCount=0, terminateCallback=(nil), ioError=No error, lastError=No error, terminateError=No error
giveCmd @0x7f2b570e6ce8: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, cmd=0, ok=Y, chan=0x7f2b5ed7a800
writeDataAvail @0x7f2b570e6d48: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, chan=0x7f2b5ed7a800
wbuf: ptr=0x7f2b570e6de8, size=0x8000, rptr=0x0, wptr=0x0
HttpListeningConnection: _transactionActive=Y, _haveHadTransaction=Y, _alreadyLoggedTimeout=N
HttpTcpConnection: peer=127.0.0.1, _desiredCompressionLevel=0
RestHttpServerTransaction: _restPath="admin/summaryman", namespaced=N, context=-/-, session=[user=splunk-system-user, refcnt=2, touched=1499284196, removed=N, id=aacf8b13120ab7d2c0f93e6c1ad67e4f, created=1499284184, createdBy=67C3A1EA-A1BB-4154-AB93-977DBDB331B4, sid="remote_jsspl5.verbosity.net_scheduler__nobody__f5__RMD5ed4f8c799ba150d6_at_1499283900_91172"]
HttpServerTransaction: _state=6, _shouldLog=Y, _startTime=1499284196.291574
REQUEST: POST /services/admin/summaryman HTTP/1.1
User-Agent: Splunk/6.6.1 (Linux 2.6.32-642.6.1.el6.x86_64; arch=x86_64)
TE: trailers, chunked
Host: 127.0.0.1:8089
Content-Length: 631
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Authorization: Splunk {value elided from dump}
_bytesReceived=631, _maximumRequestDataSize=2147483648, _totalBytesExpectedOfRequestData=631
_bytesLeftInRequestDataChunk=0, _requestTransferEncodingIsChunked=N, _receivingRequestDataForever=N
_needToSetupRequestGunzip=N, _owedConsume=0, _wantSavedRequestData=N
_100continue=0, _expectDisconnect=N, _overrideSourceState=0
POST arguments: {["name"] = "report_summaries", ["summary_changes"] = "{"f5-system_stats":{"dma":[{"summary_operation_type":"0","summary_cid":"dma|f5-system_stats~232~620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB|2A18DD68-5133-4DA5-98D6-14813FE663D7_DM_f5_bigip-tmstats-pool_member_stat","summary_path":"/opt/splunk_hw/f5-system_stats/datamodel_summary/232_620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB/2A18DD68-5133-4DA5-98D6-14813FE663D7/DM_f5_bigip-tmstats-pool_member_stat","summary_earliest_time":1499020655,"summary_latest_time":1499278827,"summary_size_on_disk":12288}]}}"}
REPLY: 200
admin_handler="summaryman"
MConfigHandler: name=summaryman, _atomFormat=1, _customAction=
caller args: id="report_summaries": { summary_changes -> { _dataType=string _isMultiValue=Y, _values: ["{"f5-system_stats":{"dma":[{"summary_operation_type":"0","summary_cid":"dma|f5-system_stats~232~620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB|2A18DD68-5133-4DA5-98D6-14813FE663D7_DM_f5_bigip-tmstats-pool_member_stat","summary_path":"/opt/splunk_hw/f5-system_stats/datamodel_summary/232_620CCE54-6DF9-47E1-B3A3-9BE55F5C66BB/2A18DD68-5133-4DA5-98D6-14813FE663D7/DM_f5_bigip-tmstats-pool_member_stat","summary_earliest_time":1499020655,"summary_latest_time":1499278827,"summary_size_on_disk":12288}]}}"] } }
_docShowEntry=Y, _didFilter=N, _didPaginate=N
_maxCount=30, posOffset=0, _requestedAction=1
_shouldFilter=N, _shouldReload=N, _shouldAutoList=N, _sortSpecified=N
_strict_mode=N, _list_new=N, _force_stanza_overwite=N, _force_app_context_on_write=Y
sort keys: ["name"]
sort modes: ["auto"]
supported args: ["name" type=0 (required), "summary_changes" type=0 (required)]
Paginator: offset=0, count=30
_customStatusCode=0, _supportedActions=0x43, hasSession=Y
_forceBoolNormalization=N, _contextMode=0, _didCapCheck=Y
_ranSetup=Y, _restartRequired=N, _listingOne=N
_userName=splunk-system-user, _appName=search
ServicesEndpointReplyDataProvider: _setupState=0, _outputMode=1, _explicitOutputMode=N
GET args: {}
_allowedMethods={GET,POST,PUT,DELETE,HEAD,OPTIONS}, _preconditionState=0
_wantsSeparateThread=N, _alreadyBuiltHeaders=N, _needToSendBody=Y
_bodyBytesWritten=0, _chunkedState=0, _isLastTransaction=N
_varyBy=0x10, _redirectUrl="", _downloadFilename="", _totalScheduledLength=0
_willSendDataLater=N, _toSendState=0, _toSendSafe=Y
_knowCompleteLength=N, _desiredCompressionLevel=0
_replyIsGzipCompressed=N, _cacheControl=0x10, _maxCacheSeconds=4294967295, _dontIncludeFrameOptions=N
In TcpChannel 0x7f2b59ae5000, _tcloop=0x7f2b88c42690, no async write data, _data._shouldKill=N, r/w_timeouts=5.000/300.000, timeout_count=0
SSL: version="TLSv1.2", state="SSL negotiation finished successfully", cipher="ECDHE-RSA-AES256-GCM-SHA384", compression="zlib compression"
rbuf: ptr=0x7f2b59ae50a0, size=0x2000, rptr=0x0, wptr=0x0
TcpChannelAcceptor: , tcloop=0x7f2b88c42690, _disabledReasons=0, _activeCount=16, _inflightSubordinateAccepts=0
HttpListener: ssl=Y, _maxActiveConnections=6826, _wellBelowConnectionLimit=Y, _maxThreads=2658
SplunkdHttpListener: PORT: _allowGzip=Y, bind=https://:8089
conf: _sslopt={rootCAPath="", caCertFile="/opt/splunk/etc/auth/cacert.pem", certFile="/opt/splunk/etc/auth/server.pem", privateKeyFile="/opt/splunk/etc/auth/server.pem", privateKeyPassword_set=Y, commonNameToCheck="", altNameToCheck="", allowSslRenegotiation=Y, sslVersions="TLS1.2", cipherSuite="ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:AES256-GCM-SHA384:AES128-GCM-SHA256:AES128-SHA256", ecdhCurves="prime256v1, secp384r1, secp521r1", useCompression=Y, quietShutdown=NdhFile="", shouldVerifyClientCert=N}, _allowSslRenegotiation=Y, _frameOptionsSameOrigin=Y, _strictTransportSecurityHeader=N, _allowBasicAuth=Y, _allowCookieAuth=Y, _cookieAuthHttpOnly=Y, _cookieAuthSecure=Y
conf: _streamInWriteTimeout=5.000, _maxContentLength=2147483648, _maxThreads=2658, _maxSockets=6826, _forceHttp10=0
_thread=0x7f2b570e6b80: commandForThread=0, nextIdle=0x7f2b71571540, requestAfterThread=0, _tpfd=0x7f2b59ae5000, writeCorkCount=0, terminateCallback=(nil), ioError=No error, lastError=No error, terminateError=No error
giveCmd @0x7f2b570e6ce8: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, cmd=0, ok=Y, chan=0x7f2b5ed7a800
writeDataAvail @0x7f2b570e6d48: _queuedOn=(nil), ran=N, wantWake=N, wantFailIfLoopDone=N, chan=0x7f2b5ed7a800
wbuf: ptr=0x7f2b570e6de8, size=0x8000, rptr=0x0, wptr=0x0
x86 CPUID registers:
0: 0000000B 756E6547 6C65746E 49656E69
1: 00020651 0C010800 83B82203 0FABFBFF
2: 55035A01 00F0B2FF 00000000 00CA0000
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000000 00000000 00000000 00000000
6: 00000007 00000002 00000001 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000000 00000000 00000000 00000000
A: 07300401 0000007F 00000000 00000000
B: 00000000 00000000 000000CD 0000000C
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000001 28100800
80000002: 65746E49 2952286C 6F655820 2952286E
80000003: 55504320 20202020 20202020 45202020
80000004: 35343635 20402020 30342E32 007A4847
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000100
80000008: 0000302A 00000000 00000000 00000000
terminating...
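If one of the indexers ran out of disk space, writes to .bucketSummaryManifest under the ../your_index/summary or ../your_index/datamodel_summary paths may have been left incomplete. The crash can then happen when Splunk reads the corrupt file, which would be consistent with the failed assertion `fields.size() == summary_info_fields_size` in SummaryManager::readSummaries_locked shown in the crash log.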
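To narrow down which manifest might be affected, something like the following rough Python sketch could help. It is not a Splunk tool, and it rests on assumptions: the on-disk format of .bucketSummaryManifest is not documented here, so the script only assumes the file is line-oriented text and that healthy lines all have the same number of delimiter-separated fields. The function names and the `delimiter` parameter are made up for illustration. Anything it reports is a hint to look closer, not proof of corruption.

```python
import os
import sys
from collections import Counter

MANIFEST_NAME = ".bucketSummaryManifest"


def find_manifests(index_root):
    """Yield the path of every manifest file found under index_root."""
    for dirpath, _dirnames, filenames in os.walk(index_root):
        if MANIFEST_NAME in filenames:
            yield os.path.join(dirpath, MANIFEST_NAME)


def check_manifest(path, delimiter=None):
    """Return a list of suspicious findings for one manifest file.

    ASSUMPTION: the manifest is line-oriented text whose lines normally
    share one field count. delimiter=None splits on whitespace; pass the
    real separator if you know it.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    if not data:
        return ["file is empty"]

    problems = []
    if b"\x00" in data:
        problems.append("contains NUL bytes")
    if not data.endswith(b"\n"):
        problems.append("no trailing newline (possible truncated write)")

    # Count fields per non-blank line, keyed by the real line number.
    text = data.decode("utf-8", errors="replace")
    counts = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if line.strip():
            counts[lineno] = len(line.split(delimiter))

    if counts:
        # Treat the most common field count as the expected one.
        expected = Counter(counts.values()).most_common(1)[0][0]
        for lineno, n in counts.items():
            if n != expected:
                problems.append(
                    f"line {lineno}: {n} fields, most lines have {expected}"
                )
    return problems


if __name__ == "__main__":
    # Usage: python check_manifests.py /path/to/your_index
    for manifest in find_manifests(sys.argv[1]):
        for problem in check_manifest(manifest):
            print(f"{manifest}: {problem}")
```

In this case it would be pointed at the index directory named in the crash log's summary_path, i.e. /opt/splunk_hw/f5-system_stats. The script only reads files, so it should be safe to run while you decide what to do next.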
As a workaround:
Also, this issue has been fixed from 7.0.0 onwards via SPL-141877.
Hope it helps.