Why does Splunk 6.1.2 forwarder on an AIX 7.1 machine keep crashing?

Explorer

I have an AIX 7.1 machine set up as a forwarder running Splunk 6.1.2. Splunk keeps crashing, and I need help figuring out what is causing the crash.
Below is the Splunk crash log.

Received fatal signal 11 (Segmentation fault).
Cause:
No memory mapped at address [0x352D32302D31372D].
Crashing thread: MainTailingThread
Registers:
IAR: [0x0900000000520C28] ?
MSR: [0xA00000000000D032]
R0: [0x0900000000520A24]
R1: [0x0000000116A515A0]
R2: [0x09001000A0396C80]
R3: [0x352D32302D31372D]
R4: [0x352D32302D31372E]
R5: [0x0000000116A52D90]
R6: [0x0000000000000000]
R7: [0x0000000116A52DF8]
R8: [0x0000000000000028]
R9: [0x0000000116A51898]
R10: [0x0000000000000001]
R11: [0x0000000000000000]
R12: [0x09001000A0391928]
R13: [0x0000000116A5D800]
R14: [0x000000011305D0C0]
R15: [0x000000000008B4C0]
R16: [0x0000000112D17AA0]
R17: [0x0000000000000080]
R18: [0x000000011308B660]
R19: [0x000000000005CF20]
R20: [0x0000000000000000]
R21: [0x0000000000000000]
R22: [0x0000000000000000]
R23: [0x0000000000000000]
R24: [0x0000000112D17AA0]
R25: [0x0000000000000080]
R26: [0x000000011308B660]
R27: [0x00000001123A4650]
R28: [0x0000000116A53E40]
R29: [0x0000000116A53E20]
R30: [0x0000000116A52C60]
R31: [0x0000000117833030]
CR: [0x0000000044000059]
XER: [0x0000000000000008]
LR: [0x0900000000520C24]
CTR: [0x0900000000520A00]

OS: AIX
Arch: PowerPC

Backtrace:
+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0
+++LCB 1.0 Sun Aug 17 17:30:04 2014 Generated by IBM AIX 7.1

+++ID Node 0 Process 10420352 Thread 29
***FAULT "SIGSEGV - Segmentation violation"
+++STACK
TidyQ3_3std7_LFS_ON12basic_stringXTcTQ2_3std11char_traitsXTc_TQ2_3std9allocatorXTcFb@AF278_62 : 0x00000028
__dt
Q3_3std7_LFS_ON12basic_stringXTcTQ2_3std11char_traitsXTc_TQ2_3std9allocatorXTcFv : 0x00000020
__dt
3StrFv : 0x00000050
_Destroy
3stdH3Str_P3Str_v : 0x00000018
destroy
Q2_3std9allocatorXT3Str_FP3Str : 0x00000018
_Destroy
Q2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFP3StrT1 : 0x00000030
insert
Q2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFQ2_3std6_PtritXT3StrTlTP3StrTR3StrTP3StrTR3Str_UlRC3Str : 0x00000290
insert
Q2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFQ2_3std6_PtritXT3StrTlTP3StrTR3StrTP3StrTR3Str_RC3Str : 0x00000098
push_back
Q2_3std6vectorXT3StrTQ2_3std9allocatorXT3StrFRC3Str : 0x0000007c
push_back
9StrVectorFRC3Str : 0x0000001c
lineBreak
FRC10StrSegmentR9StrVectorR3Str : 0x00000118
getLines
21FileClassifierManagerFRC8PathnameP3StrUlR9StrVectorR3StrPUlT6 : 0x00000308
_getFileType
21FileClassifierManagerFP13PropertiesMapRC8PathnameR9StrVectorRbT4PC3StrUl : 0x00000a70
getFileType
21FileClassifierManagerFP13PropertiesMapRC8PathnamebPC3StrUl : 0x0000009c
classifySource
10TailReaderCFR15CowPipelineDataRC8PathnameR3StrN23b : 0x00000194
setupSourcetype
10TailReaderFR15WatchedTailFileRQ2_7Tailing10FileStatus : 0x0000020c
readFile
10TailReaderFR15WatchedTailFileP11TailWatcherP11BatchReader : 0x000001b8
readFile
11TailWatcherFR15WatchedTailFile : 0x0000024c
fileChanged
11TailWatcherFP16WatchedFileStateRC7Timeval : 0x00000d0c
callFileChanged
30FilesystemChangeInternalWorkerFR7TimevalP16WatchedFileState : 0x00000090
when_expired
30FilesystemChangeInternalWorkerFRUL : 0x00000368
runExpiredTimeouts
11TimeoutHeapFR7Timeval : 0x000001ac
run
9EventLoopFv : 0x00000094
run
11TailWatcherFv : 0x00000118
main
13TailingThreadFv : 0x0000020c
callMain
_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 29

+++ID Node 0 Process 10420352 Thread 1
+++STACK
poll_FPvUll : 0x00000024
run
9EventLoopFv : 0x0000016c
main
10MainThreadFv : 0x000000a0
run
_10MainThreadFv : 0x00000030
main : 0x00002aa0
---STACK
---ID Node 0 Process 10420352 Thread 1

+++ID Node 0 Process 10420352 Thread 2
+++STACK
poll_FPvUll : 0x00000024
run
9EventLoopFv : 0x0000016c
main
19ProcessRunnerThreadFv : 0x00000058
callMain
_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 2

+++ID Node 0 Process 10420352 Thread 3
+++STACK
event_wait : 0x00000344
_cond_wait_local : 0x0000035c
_cond_wait : 0x000000c8
pthread_cond_timedwait : 0x00000200
wait
16PthreadConditionFR14ConditionMutexRC20ConditionWaitTimeout : 0x00000114
wait
16PthreadConditionFR20ScopedConditionMutexRC20ConditionWaitTimeout : 0x00000028
remove
15PersistentQueueFR15CowPipelineDataRC20ConditionWaitTimeout : 0x000000a4
remove
21ProducerConsumerQueueFR15CowPipelineDataRC20ConditionWaitTimeout : 0x00000044
main
18QueueServiceThreadFv : 0x00000074
callMain
_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 3

+++ID Node 0 Process 10420352 Thread 4
+++STACK
poll_FPvUll : 0x00000024
run
9EventLoopFv : 0x0000016c
run
14TcpChannelLoopFv : 0x00000014
go
17SplunkdHttpServerFv : 0x00000050
go
20SingleRestHttpServerFv : 0x00000020
main
18HTTPDispatchThreadFv : 0x00000264
callMain
_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 4

+++ID Node 0 Process 10420352 Thread 5
+++STACK
event_wait : 0x00000344
_cond_wait_local : 0x0000035c
_cond_wait : 0x000000c8
pthread_cond_timedwait : 0x00000200
wait
16PthreadConditionFR14ConditionMutexRC20ConditionWaitTimeout : 0x00000114
main
23HttpClientPollingThreadFv : 0x0000087c
callMain
_6ThreadFPv : 0x000000b4
_pthread_body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 5

+++ID Node 0 Process 10420352 Thread 6

Explorer

Another quick way to update the ulimits is to use the chuser command. For example, "chuser fsize=-1 root" sets the maximum file size for root to unlimited. Just remember that the change only takes effect at the next login, so if you are logged in as the specified user you need to log off and log back in.

Splunk Employee

I would like to expand on this answer, if I may. As Kyle mentioned, the AIX ulimit defaults are not overly generous. Typically, if your Splunk AIX instance crashes soon after startup, the first place to look for clues is $SPLUNK_HOME/var/log/splunk/splunkd.log.

Look for "Splunk may not work due to ..." warnings:

02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: virtual address space size: unlimited
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data segment size: 134217728 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 WARN ulimit - Splunk may not work due to small data segment limit!
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: resident memory size: 33554432 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 WARN ulimit - Splunk may not work due to small resident memory size limit!
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: stack size: 33554432 bytes [hard maximum: 4294967296 bytes]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: core file size: 1073741312 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data file size: unlimited
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: open files: 4096 files [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: cpu time: unlimited

The data segment size (ulimit -d) needs to be at least 1 GB (1073741824 bytes).

The resident memory size (ulimit -m) needs to be at least:
512 MB (536870912 bytes) for a universal forwarder
1 GB (1073741824 bytes) for an indexer

The maximum number of open files (ulimit -n) should be increased to at least 8192.

The data file size (ulimit -f) may be set to unlimited, as the maximum file size is dictated by the OS and filesystem.
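
As a rough illustration, the check described above could be scripted. The sketch below is not a Splunk tool: it assumes the "ulimit - Limit:" line format shown in the splunkd.log excerpt, and the names MINIMUMS, parse_limits and too_small are made up for this example. It uses the forwarder figure for resident memory.

```python
import re

# Minimums quoted in this answer (bytes, except open files).
# Resident memory uses the universal forwarder figure; an indexer needs 1 GB.
MINIMUMS = {
    "data segment size": 1073741824,
    "resident memory size": 536870912,
    "open files": 8192,
}

# Matches lines such as:
#   "... INFO ulimit - Limit: open files: 4096 files [hard maximum: unlimited]"
LIMIT_RE = re.compile(r"ulimit - Limit: (?P<name>[^:]+): (?P<value>unlimited|\d+)")


def parse_limits(lines):
    """Return {limit name: soft value}; None stands for 'unlimited'."""
    limits = {}
    for line in lines:
        match = LIMIT_RE.search(line)
        if match:
            value = match.group("value")
            limits[match.group("name")] = None if value == "unlimited" else int(value)
    return limits


def too_small(limits, minimums=MINIMUMS):
    """Names of limits below the minimums (unlimited or absent limits are not flagged)."""
    return sorted(
        name
        for name, minimum in minimums.items()
        if limits.get(name) is not None and limits[name] < minimum
    )


# Three lines copied from the splunkd.log excerpt in this answer.
SAMPLE = [
    "02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data segment size: 134217728 bytes [hard maximum: unlimited]",
    "02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: resident memory size: 33554432 bytes [hard maximum: unlimited]",
    "02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: open files: 4096 files [hard maximum: unlimited]",
]

print(too_small(parse_limits(SAMPLE)))
```

For a real check you would pass the lines of $SPLUNK_HOME/var/log/splunk/splunkd.log instead of SAMPLE.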

These values are set on a per-user basis in /etc/security/limits (or via smit chuser).
It gets a little confusing because the size values in /etc/security/limits are in 512-byte blocks, ulimit reports data, stack and memory in kilobytes, and the values in splunkd.log are in bytes.
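
To make the three units easier to compare, here is a small conversion sketch. The function names are invented for illustration; the only facts used are that a block in /etc/security/limits is 512 bytes, that ulimit's "kbytes" are 1024 bytes, and that -1 means unlimited.

```python
BLOCK_BYTES = 512  # /etc/security/limits sizes are in 512-byte blocks
KIB_BYTES = 1024   # ulimit prints data, stack and memory in kbytes


def blocks_to_bytes(blocks):
    """Convert a limits-file value to bytes (splunkd.log units); -1 means unlimited."""
    return None if blocks == -1 else blocks * BLOCK_BYTES


def blocks_to_kib(blocks):
    """Convert a limits-file value to kbytes (ulimit units); -1 means unlimited."""
    return None if blocks == -1 else blocks * BLOCK_BYTES // KIB_BYTES


def bytes_to_blocks(size_bytes):
    """Convert a byte count to 512-byte blocks, rounding up."""
    return -(-size_bytes // BLOCK_BYTES)


# The 1 GB data segment minimum expressed in each unit.
print(bytes_to_blocks(1073741824), blocks_to_kib(2097152), blocks_to_bytes(2097152))
```

The same arithmetic explains the warnings in the log excerpt: the AIX default of data = 262144 blocks is only 134217728 bytes, and rss = 65536 blocks is 33554432 bytes.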

Let's have a look at a worked example.

A Worked Example

  1. Log in as root.
  2. Run "smitty chuser" and change the following values:
       Soft DATA segment [2097152]
       Soft RSS size [1048576]
       Soft NOFILE descriptors [8192]
       Soft FILE size [-1]

Save and commit the changes.

This basically just edits /etc/security/limits:

...
*
* Sizes are in multiples of 512 byte blocks, CPU time is in seconds
*
* fsize - soft file size in blocks
* core - soft core file size in blocks
* cpu - soft per process CPU time limit in seconds
* data - soft data segment size in blocks
* stack - soft stack segment size in blocks
* rss - soft real memory usage in blocks
* nofiles - soft file descriptor limit
* fsize_hard - hard file size in blocks
* core_hard - hard core file size in blocks
* cpu_hard - hard per process CPU time limit in seconds
* data_hard - hard data segment size in blocks
* stack_hard - hard stack segment size in blocks
* rss_hard - hard real memory usage in blocks
* nofiles_hard - hard file descriptor limit
*
* The following table contains the default hard values if the
* hard values are not explicitly defined:
*
* Attribute Value
* ========== ============
* fsize_hard set to fsize
* cpu_hard set to cpu
* core_hard -1
* data_hard -1
* stack_hard 8388608
* rss_hard -1
* nofiles_hard -1
*
* NOTE: A value of -1 implies "unlimited"
*

default:
fsize = 2097151
core = 2097151
cpu = -1
data = 262144
rss = 65536
stack = 65536
nofiles = 2000

root:
data = 2097152
rss = 1048576
nofiles = 8192
fsize = -1

daemon:

...

  3. Log out as root.

  4. Log back in as root (to pick up the changes).

  5. Run ulimit -a:

    time(seconds) unlimited
    file(blocks) unlimited
    data(kbytes) 1048576
    stack(kbytes) 32768
    memory(kbytes) 524288
    coredump(blocks) 2097151
    nofiles(descriptors) 8192
    threads(per process) unlimited
    processes(per user) unlimited

The values look correct 🙂

  6. Start Splunk.

  7. Check $SPLUNK_HOME/var/log/splunk/splunkd.log:

....
03-31-2015 02:10:27.952 -0700 INFO LicenseMgr - Tracker init complete...
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: virtual address space size: unlimited
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: data segment size: 1073741824 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: resident memory size: 536870912 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: stack size: 33554432 bytes [hard maximum: 4294967296 bytes]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: core file size: 1073741312 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: data file size: unlimited
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: open files: 8192 files [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: cpu time: unlimited
03-31-2015 02:10:27.993 -0700 INFO loader - Splunkd starting (build 245427).
.....

Splunk is running and stable.

As you can see, the values for data and rss in splunkd.log agree with the values from ulimit -a (as root) and /etc/security/limits:
Data segment size: 1073741824 bytes (splunkd.log) = 1048576 KiB (ulimit) = 2097152 blocks (/etc/security/limits)
Resident memory size: 536870912 bytes (splunkd.log) = 524288 KiB (ulimit) = 1048576 blocks (/etc/security/limits)

HTH
Shaky

Splunk Employee

You should check your data segment size (ulimit -d) to make sure it is set in line with what Splunk asks for. By default on AIX systems it is set too low, which can create issues for Splunk. Usually when this happens you will see lots of "bad allocation" error messages in the logs that look like the following:

ERROR PropertiesMapConfig - Failed to save stanza /var/adm/sudo.log_Mon_Sep_22_16:37:27_2014_1998275973 to app learned: bad allocation

For the data segment size (ulimit -d) with Splunk 4.2+, increase the value to at least 1 GB (1073741824 bytes).

http://docs.splunk.com/Documentation/Splunk/6.1.3/Troubleshooting/ulimitErrors
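
If you want to see how often this is happening, a short script can count the "bad allocation" errors per logging component. This is only a sketch: it assumes the "ERROR <component> - <message>" layout of the example line above, and count_bad_allocations is a made-up name, not a Splunk utility.

```python
import re
from collections import Counter

# Assumed layout: "ERROR <component> - <message ending in 'bad allocation'>"
BAD_ALLOC_RE = re.compile(r"\bERROR\s+(?P<component>\S+)\s+-\s.*bad allocation")


def count_bad_allocations(lines):
    """Count 'bad allocation' errors per logging component."""
    counts = Counter()
    for line in lines:
        match = BAD_ALLOC_RE.search(line)
        if match:
            counts[match.group("component")] += 1
    return counts


# The example message quoted in this answer.
SAMPLE_LINE = (
    "ERROR PropertiesMapConfig - Failed to save stanza "
    "/var/adm/sudo.log_Mon_Sep_22_16:37:27_2014_1998275973 to app learned: bad allocation"
)

print(count_bad_allocations([SAMPLE_LINE]))
```

To run it against a live forwarder, pass an open file handle for $SPLUNK_HOME/var/log/splunk/splunkd.log instead of the sample list.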

Path Finder

Is there a PowerPC hardware sizing guide for running Splunk Enterprise (all roles in distributed search)?

Or even for Red Hat on PowerPC?
