I have an AIX 7.1 machine set up as a forwarder running Splunk 6.1.2. Splunk keeps crashing and I need help figuring out what is causing the crash.
Below is the Splunk crash log.
Received fatal signal 11 (Segmentation fault).
Cause:
No memory mapped at address [0x352D32302D31372D].
Crashing thread: MainTailingThread
Registers:
IAR: [0x0900000000520C28] ?
MSR: [0xA00000000000D032]
R0: [0x0900000000520A24]
R1: [0x0000000116A515A0]
R2: [0x09001000A0396C80]
R3: [0x352D32302D31372D]
R4: [0x352D32302D31372E]
R5: [0x0000000116A52D90]
R6: [0x0000000000000000]
R7: [0x0000000116A52DF8]
R8: [0x0000000000000028]
R9: [0x0000000116A51898]
R10: [0x0000000000000001]
R11: [0x0000000000000000]
R12: [0x09001000A0391928]
R13: [0x0000000116A5D800]
R14: [0x000000011305D0C0]
R15: [0x000000000008B4C0]
R16: [0x0000000112D17AA0]
R17: [0x0000000000000080]
R18: [0x000000011308B660]
R19: [0x000000000005CF20]
R20: [0x0000000000000000]
R21: [0x0000000000000000]
R22: [0x0000000000000000]
R23: [0x0000000000000000]
R24: [0x0000000112D17AA0]
R25: [0x0000000000000080]
R26: [0x000000011308B660]
R27: [0x00000001123A4650]
R28: [0x0000000116A53E40]
R29: [0x0000000116A53E20]
R30: [0x0000000116A52C60]
R31: [0x0000000117833030]
CR: [0x0000000044000059]
XER: [0x0000000000000008]
LR: [0x0900000000520C24]
CTR: [0x0900000000520A00]
OS: AIX
Arch: PowerPC
Backtrace:
+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0
+++LCB 1.0 Sun Aug 17 17:30:04 2014 Generated by IBM AIX 7.1
+++ID Node 0 Process 10420352 Thread 29
***FAULT "SIGSEGV - Segmentation violation"
+++STACK
TidyQ3_3std7_LFS_ON12basic_stringXTcTQ2_3std11char_traitsXTc_TQ2_3std9allocatorXTcFb@AF278_62 : 0x00000028
__dtQ3_3std7_LFS_ON12basic_stringXTcTQ2_3std11char_traitsXTc_TQ2_3std9allocatorXTcFv : 0x00000020
__dt3StrFv : 0x00000050
_Destroy3stdH3Str_P3Str_v : 0x00000018
destroyQ2_3std9allocatorXT3Str_FP3Str : 0x00000018
_DestroyQ2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFP3StrT1 : 0x00000030
insertQ2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFQ2_3std6_PtritXT3StrTlTP3StrTR3StrTP3StrTR3Str_UlRC3Str : 0x00000290
insertQ2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFQ2_3std6_PtritXT3StrTlTP3StrTR3StrTP3StrTR3Str_RC3Str : 0x00000098
push_backQ2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFRC3Str : 0x0000007c
push_back9StrVectorFRC3Str : 0x0000001c
lineBreakFRC10StrSegmentR9StrVectorR3Str : 0x00000118
getLines21FileClassifierManagerFRC8PathnameP3StrUlR9StrVectorR3StrPUlT6 : 0x00000308
_getFileType21FileClassifierManagerFP13PropertiesMapRC8PathnameR9StrVectorRbT4PC3StrUl : 0x00000a70
getFileType21FileClassifierManagerFP13PropertiesMapRC8PathnamebPC3StrUl : 0x0000009c
classifySource10TailReaderCFR15CowPipelineDataRC8PathnameR3StrN23b : 0x00000194
setupSourcetype10TailReaderFR15WatchedTailFileRQ2_7Tailing10FileStatus : 0x0000020c
readFile10TailReaderFR15WatchedTailFileP11TailWatcherP11BatchReader : 0x000001b8
readFile11TailWatcherFR15WatchedTailFile : 0x0000024c
fileChanged11TailWatcherFP16WatchedFileStateRC7Timeval : 0x00000d0c
callFileChanged30FilesystemChangeInternalWorkerFR7TimevalP16WatchedFileState : 0x00000090
when_expired30FilesystemChangeInternalWorkerFRUL : 0x00000368
runExpiredTimeouts11TimeoutHeapFR7Timeval : 0x000001ac
run9EventLoopFv : 0x00000094
run11TailWatcherFv : 0x00000118
main13TailingThreadFv : 0x0000020c
callMain_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 29
+++ID Node 0 Process 10420352 Thread 1
+++STACK
poll_FPvUll : 0x00000024
run9EventLoopFv : 0x0000016c
main10MainThreadFv : 0x000000a0
run_10MainThreadFv : 0x00000030
main : 0x00002aa0
---STACK
---ID Node 0 Process 10420352 Thread 1
+++ID Node 0 Process 10420352 Thread 2
+++STACK
poll_FPvUll : 0x00000024
run9EventLoopFv : 0x0000016c
main19ProcessRunnerThreadFv : 0x00000058
callMain_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 2
+++ID Node 0 Process 10420352 Thread 3
+++STACK
event_wait : 0x00000344
_cond_wait_local : 0x0000035c
_cond_wait : 0x000000c8
pthread_cond_timedwait : 0x00000200
wait16PthreadConditionFR14ConditionMutexRC20ConditionWaitTimeout : 0x00000114
wait16PthreadConditionFR20ScopedConditionMutexRC20ConditionWaitTimeout : 0x00000028
remove15PersistentQueueFR15CowPipelineDataRC20ConditionWaitTimeout : 0x000000a4
remove21ProducerConsumerQueueFR15CowPipelineDataRC20ConditionWaitTimeout : 0x00000044
main18QueueServiceThreadFv : 0x00000074
callMain_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 3
+++ID Node 0 Process 10420352 Thread 4
+++STACK
poll_FPvUll : 0x00000024
run9EventLoopFv : 0x0000016c
run14TcpChannelLoopFv : 0x00000014
go17SplunkdHttpServerFv : 0x00000050
go20SingleRestHttpServerFv : 0x00000020
main18HTTPDispatchThreadFv : 0x00000264
callMain_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 4
+++ID Node 0 Process 10420352 Thread 5
+++STACK
event_wait : 0x00000344
_cond_wait_local : 0x0000035c
_cond_wait : 0x000000c8
pthread_cond_timedwait : 0x00000200
wait16PthreadConditionFR14ConditionMutexRC20ConditionWaitTimeout : 0x00000114
main23HttpClientPollingThreadFv : 0x0000087c
callMain_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 5
+++ID Node 0 Process 10420352 Thread 6
Another quick way to update the ulimits is to use the chuser command. For example, "chuser fsize=-1 root" would set the max file size to unlimited. Just remember that using this method would require you to log off the specified user (assuming you are logged in as that user) and log back in.
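A minimal sketch of that approach, assuming the forwarder runs as a hypothetical "splunk" user (the values passed to chuser are the raw /etc/security/limits values, i.e. 512-byte blocks for data and rss):
# show the current limits for the user
lsuser -a fsize data rss nofiles splunk
# fsize unlimited, data 1 GB (2097152 blocks), rss 512 MB (1048576 blocks), 8192 open files
chuser fsize=-1 data=2097152 rss=1048576 nofiles=8192 splunk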
I would like to expand on this answer if I may. As Kyle mentioned, the AIX ulimit defaults are not overly generous. Typically, if your Splunk AIX instance crashes soon after startup, the first place to look for clues is $SPLUNK_HOME/var/log/splunk/splunkd.log.
Look for "Splunk may not work due to ....." errors
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: virtual address space size: unlimited
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data segment size: 134217728 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 WARN ulimit - Splunk may not work due to small data segment limit!
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: resident memory size: 33554432 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 WARN ulimit - Splunk may not work due to small resident memory size limit!
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: stack size: 33554432 bytes [hard maximum: 4294967296 bytes]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: core file size: 1073741312 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data file size: unlimited
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: open files: 4096 files [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: cpu time: unlimited
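If you don't want to scroll through the whole file, a grep along these lines (assuming a default $SPLUNK_HOME) pulls out the ulimit messages and warnings:
# all ulimit lines from the most recent startup
grep "ulimit -" $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -20
# or just the warnings themselves
grep "Splunk may not work" $SPLUNK_HOME/var/log/splunk/splunkd.log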
The Data Segment Size (ulimit -d) needs to be at least 1 GB (1073741824 bytes).
The Resident Memory Size (ulimit -m) needs to be at least:
512 MB (536870912 bytes) for a Universal Forwarder
1 GB (1073741824 bytes) for an Indexer
The Max Number of Open Files (ulimit -n) should be increased to at least 8192.
The Data File Size (ulimit -f) may be left at unlimited, as the maximum file size is dictated by the OS / filesystem.
(A quick way to check the current values is shown just after this list.)
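Run as the user that starts splunkd, a quick check of the current soft limits looks something like this (on AIX, ulimit reports data and rss in kB and fsize in 512-byte blocks):
ulimit -d    # data segment in kB: needs to be >= 1048576 (1 GB)
ulimit -m    # resident memory in kB: >= 524288 (512 MB) for a Universal Forwarder
ulimit -n    # open file descriptors: >= 8192
ulimit -f    # max file size in 512-byte blocks: unlimited is fine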
These values are set on a per user basis in /etc/security/limits (or via smit chuser)
It gets a little confusing because some of the values in /etc/security/limits are in 512-byte blocks, the values from ulimit are in kB, and the values in splunkd.log are in bytes.
Let's have a look at a worked example.
A Worked Example
Update the limits for the root user (for example via smit chuser), then save and commit the changes.
This basically just edits /etc/security/limits:
...
*
* Sizes are in multiples of 512 byte blocks, CPU time is in seconds
*
* fsize - soft file size in blocks
* core - soft core file size in blocks
* cpu - soft per process CPU time limit in seconds
* data - soft data segment size in blocks
* stack - soft stack segment size in blocks
* rss - soft real memory usage in blocks
* nofiles - soft file descriptor limit
* fsize_hard - hard file size in blocks
* core_hard - hard core file size in blocks
* cpu_hard - hard per process CPU time limit in seconds
* data_hard - hard data segment size in blocks
* stack_hard - hard stack segment size in blocks
* rss_hard - hard real memory usage in blocks
* nofiles_hard - hard file descriptor limit
*
* The following table contains the default hard values if the
* hard values are not explicitly defined:
*
* Attribute Value
* ========== ============
* fsize_hard set to fsize
* cpu_hard set to cpu
* core_hard -1
* data_hard -1
* stack_hard 8388608
* rss_hard -1
* nofiles_hard -1
*
* NOTE: A value of -1 implies "unlimited"
*
default:
fsize = 2097151
core = 2097151
cpu = -1
data = 262144
rss = 65536
stack = 65536
nofiles = 2000
root:
data = 2097152
rss = 1048576
nofiles = 8192
fsize = -1
daemon:
...
Log out and log back in as root (to pick up the changes), then check the new limits with ulimit -a:
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 1048576
stack(kbytes) 32768
memory(kbytes) 524288
coredump(blocks) 2097151
nofiles(descriptors) 8192
threads(per process) unlimited
processes(per user) unlimited
The values look correct 🙂
Start Splunk
Check $SPLUNK_HOME/var/log/splunk/splunkd.log
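Assuming a default install under $SPLUNK_HOME, that boils down to something like:
$SPLUNK_HOME/bin/splunk start
# confirm the limits splunkd actually picked up
grep "ulimit -" $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -10
The ulimit lines should now look something like this: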
....
03-31-2015 02:10:27.952 -0700 INFO LicenseMgr - Tracker init complete...
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: virtual address space size: unlimited
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: data segment size: 1073741824 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: resident memory size: 536870912 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: stack size: 33554432 bytes [hard maximum: 4294967296 bytes]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: core file size: 1073741312 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: data file size: unlimited
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: open files: 8192 files [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: cpu time: unlimited
03-31-2015 02:10:27.993 -0700 INFO loader - Splunkd starting (build 245427).
.....
Splunk is running and stable
As you can see, the values for data and rss in splunkd.log agree with the values from ulimit -a (as root) and /etc/security/limits:
Data Segment Size: 1073741824 bytes (splunkd.log) = 1048576 KiB (ulimit) = 2097152 blocks (/etc/security/limits)
Resident Memory Size 536870912 bytes (splunkd.log) = 524288 KiB (ulimit) = 1048576 blocks (/etc/security/limits)
HTH
Shaky
You should check your data segment size (ulimit -d) to make sure it is set in line with what Splunk asks for. By default on AIX systems this is set too low, and it can create issues for Splunk. Usually when this happens you will see lots of bad allocation error messages in the logs that look like the following:
ERROR PropertiesMapConfig - Failed to save stanza /var/adm/sudo.log_Mon_Sep_22_16:37:27_2014_1998275973 to app learned: bad allocation
With Splunk 4.2 and later, increase the data segment size (ulimit -d) to at least 1 GB (1073741824 bytes).
http://docs.splunk.com/Documentation/Splunk/6.1.3/Troubleshooting/ulimitErrors
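A quick sanity check, run as the user that launches splunkd (on AIX, ulimit -d reports the soft limit in kB):
ulimit -d    # should print at least 1048576 (= 1 GB)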
Is there any PowerPC hardware sizing guide for running Splunk Enterprise (all roles in distributed search)?
Or even with Red Hat on PowerPC?