Getting Data In

Why does Splunk 6.1.2 forwarder on an AIX 7.1 machine keep crashing?

Explorer

I have an AIX 7.1 machine set up as a forwarder running Splunk 6.1.2. Splunk keeps crashing, and I need help figuring out what is causing the crash.
Below is the Splunk crash log.

Received fatal signal 11 (Segmentation fault).
Cause:
No memory mapped at address [0x352D32302D31372D].
Crashing thread: MainTailingThread
Registers:
IAR: [0x0900000000520C28] ?
MSR: [0xA00000000000D032]
R0: [0x0900000000520A24]
R1: [0x0000000116A515A0]
R2: [0x09001000A0396C80]
R3: [0x352D32302D31372D]
R4: [0x352D32302D31372E]
R5: [0x0000000116A52D90]
R6: [0x0000000000000000]
R7: [0x0000000116A52DF8]
R8: [0x0000000000000028]
R9: [0x0000000116A51898]
R10: [0x0000000000000001]
R11: [0x0000000000000000]
R12: [0x09001000A0391928]
R13: [0x0000000116A5D800]
R14: [0x000000011305D0C0]
R15: [0x000000000008B4C0]
R16: [0x0000000112D17AA0]
R17: [0x0000000000000080]
R18: [0x000000011308B660]
R19: [0x000000000005CF20]
R20: [0x0000000000000000]
R21: [0x0000000000000000]
R22: [0x0000000000000000]
R23: [0x0000000000000000]
R24: [0x0000000112D17AA0]
R25: [0x0000000000000080]
R26: [0x000000011308B660]
R27: [0x00000001123A4650]
R28: [0x0000000116A53E40]
R29: [0x0000000116A53E20]
R30: [0x0000000116A52C60]
R31: [0x0000000117833030]
CR: [0x0000000044000059]
XER: [0x0000000000000008]
LR: [0x0900000000520C24]
CTR: [0x0900000000520A00]

OS: AIX
Arch: PowerPC

Backtrace:
+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0
+++LCB 1.0 Sun Aug 17 17:30:04 2014 Generated by IBM AIX 7.1

+++ID Node 0 Process 10420352 Thread 29
***FAULT "SIGSEGV - Segmentation violation"
+++STACK
TidyQ33std7LFSON12basicstringXTcTQ23std11chartraitsXTcTQ23std9allocatorXTcFb@AF27862 : 0x00000028
dtQ33std7LFSON12basicstringXTcTQ23std11chartraitsXTcTQ23std9allocatorXTcFv : 0x00000020
__dt
3StrFv : 0x00000050
Destroy3stdH3StrP3Strv : 0x00000018
destroy
Q23std9allocatorXT3StrFP3Str : 0x00000018
_Destroy
Q23std6vectorXT3StrTQ23std9allocatorXT3StrFP3StrT1 : 0x00000030
insert
Q23std6vectorXT3StrTQ23std9allocatorXT3StrFQ23std6PtritXT3StrTlTP3StrTR3StrTP3StrTR3StrUlRC3Str : 0x00000290
insertQ23std6vectorXT3StrTQ23std9allocatorXT3StrFQ23std6PtritXT3StrTlTP3StrTR3StrTP3StrTR3StrRC3Str : 0x00000098
push
backQ23std6vectorXT3StrTQ23std9allocatorXT3StrFRC3Str : 0x0000007c
pushback9StrVectorFRC3Str : 0x0000001c
lineBreak
FRC10StrSegmentR9StrVectorR3Str : 0x00000118
getLines
21FileClassifierManagerFRC8PathnameP3StrUlR9StrVectorR3StrPUlT6 : 0x00000308
_getFileType
21FileClassifierManagerFP13PropertiesMapRC8PathnameR9StrVectorRbT4PC3StrUl : 0x00000a70
getFileType
21FileClassifierManagerFP13PropertiesMapRC8PathnamebPC3StrUl : 0x0000009c
classifySource
10TailReaderCFR15CowPipelineDataRC8PathnameR3StrN23b : 0x00000194
setupSourcetype
10TailReaderFR15WatchedTailFileRQ27Tailing10FileStatus : 0x0000020c
readFile10TailReaderFR15WatchedTailFileP11TailWatcherP11BatchReader : 0x000001b8
readFile
11TailWatcherFR15WatchedTailFile : 0x0000024c
fileChanged11TailWatcherFP16WatchedFileStateRC7Timeval : 0x00000d0c
callFileChanged
30FilesystemChangeInternalWorkerFR7TimevalP16WatchedFileState : 0x00000090
whenexpired30FilesystemChangeInternalWorkerFRUL : 0x00000368
runExpiredTimeouts
11TimeoutHeapFR7Timeval : 0x000001ac
run
9EventLoopFv : 0x00000094
run
11TailWatcherFv : 0x00000118
main
13TailingThreadFv : 0x0000020c
callMain
6ThreadFPv : 0x000000b4
_pthread
body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 29

+++ID Node 0 Process 10420352 Thread 1
+++STACK
pollFPvUll : 0x00000024
run
9EventLoopFv : 0x0000016c
main10MainThreadFv : 0x000000a0
run
10MainThreadFv : 0x00000030
main : 0x00002aa0
---STACK
---ID Node 0 Process 10420352 Thread 1

+++ID Node 0 Process 10420352 Thread 2
+++STACK
pollFPvUll : 0x00000024
run
9EventLoopFv : 0x0000016c
main19ProcessRunnerThreadFv : 0x00000058
callMain
6ThreadFPv : 0x000000b4
pthreadbody : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 2

+++ID Node 0 Process 10420352 Thread 3
+++STACK
eventwait : 0x00000344
condwaitlocal : 0x0000035c
_cond
wait : 0x000000c8
pthreadcondtimedwait : 0x00000200
wait16PthreadConditionFR14ConditionMutexRC20ConditionWaitTimeout : 0x00000114
wait
16PthreadConditionFR20ScopedConditionMutexRC20ConditionWaitTimeout : 0x00000028
remove15PersistentQueueFR15CowPipelineDataRC20ConditionWaitTimeout : 0x000000a4
remove
21ProducerConsumerQueueFR15CowPipelineDataRC20ConditionWaitTimeout : 0x00000044
main18QueueServiceThreadFv : 0x00000074
callMain
6ThreadFPv : 0x000000b4
pthreadbody : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 3

+++ID Node 0 Process 10420352 Thread 4
+++STACK
pollFPvUll : 0x00000024
run
9EventLoopFv : 0x0000016c
run14TcpChannelLoopFv : 0x00000014
go
17SplunkdHttpServerFv : 0x00000050
go20SingleRestHttpServerFv : 0x00000020
main
18HTTPDispatchThreadFv : 0x00000264
callMain_6ThreadFPv : 0x000000b4
_pthread
body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 4

+++ID Node 0 Process 10420352 Thread 5
+++STACK
eventwait : 0x00000344
condwaitlocal : 0x0000035c
_cond
wait : 0x000000c8
pthreadcondtimedwait : 0x00000200
wait16PthreadConditionFR14ConditionMutexRC20ConditionWaitTimeout : 0x00000114
main
23HttpClientPollingThreadFv : 0x0000087c
callMain_6ThreadFPv : 0x000000b4
_pthread
body : 0x000000f0
---STACK
---ID Node 0 Process 10420352 Thread 5

+++ID Node 0 Process 10420352 Thread 6


Explorer

Another quick way to update the ulimits is to use the chuser command. For example, "chuser fsize=-1 root" would set the max file size to unlimited. Just remember that this method requires the specified user (assuming you are logged in as that user) to log off and log back in before the new limits take effect.
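
For reference, a minimal sketch of the same approach applied to all of the Splunk-relevant limits in one call (chuser accepts several Attribute=Value pairs; the "splunk" user name here is a placeholder, not something from the original post):

    # data and rss are in 512-byte blocks, nofiles is a descriptor count, -1 means unlimited
    # "splunk" is a placeholder account name; use whichever user runs the forwarder
    chuser data=2097152 rss=1048576 nofiles=8192 fsize=-1 splunk

    # log that user off and back on, then confirm the new soft limits
    ulimit -a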


Splunk Employee

I would like to expand on this answer if I may. As Kyle mentioned, AIX ulimit defaults are not overly generous. Typically, if your Splunk AIX instance crashes soon after startup, the first place to look for clues is $SPLUNK_HOME/var/log/splunk/splunkd.log.

Look for "Splunk may not work due to ....." warnings, for example:

02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: virtual address space size: unlimited
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data segment size: 134217728 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 WARN ulimit - Splunk may not work due to small data segment limit!
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: resident memory size: 33554432 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 WARN ulimit - Splunk may not work due to small resident memory size limit!
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: stack size: 33554432 bytes [hard maximum: 4294967296 bytes]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: core file size: 1073741312 bytes [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: data file size: unlimited
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: open files: 4096 files [hard maximum: unlimited]
02-25-2015 13:23:42.953 +0100 INFO ulimit - Limit: cpu time: unlimited
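
A quick way to pull just these limit lines out of splunkd.log (a sketch; assumes SPLUNK_HOME is set and the default log location):

    # show the ulimit values Splunk detected at its last startup
    grep 'ulimit - Limit' $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -20

    # and any of the "may not work" warnings
    grep 'Splunk may not work' $SPLUNK_HOME/var/log/splunk/splunkd.log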

The Data Segment Size (ulimit -d) needs to be at least 1 GB (1073741824 bytes).

The Resident Memory Size (ulimit -m) needs to be at least:
512 MB (536870912 bytes) for a Universal Forwarder
1 GB (1073741824 bytes) for an Indexer

The maximum number of open files (ulimit -n) should be increased to at least 8192.

The data file size (ulimit -f) may be set to unlimited, as the maximum file size is dictated by the OS/filesystem.

These values are set on a per-user basis in /etc/security/limits (or via smit chuser).
It gets a little confusing because some of the values in /etc/security/limits are in 512-byte blocks, the values from ulimit are in KB, and the values in splunkd.log are in bytes.
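
Because of that mix of units, it helps to keep the conversions handy. A small sketch of the arithmetic for the targets above (plain shell arithmetic, nothing Splunk-specific):

    # 1 GB data segment target, in the units each tool uses
    echo $((1073741824 / 1024))   # 1048576  KB      (what ulimit -d reports)
    echo $((1073741824 / 512))    # 2097152  blocks  (what /etc/security/limits uses)

    # 512 MB resident memory target
    echo $((536870912 / 1024))    # 524288   KB      (ulimit -m)
    echo $((536870912 / 512))     # 1048576  blocks  (/etc/security/limits)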

Let's have a look at a worked example.

A Worked Example

  1. Login as root
  2. Run # smitty chuser and change the values: Soft DATA segment [2097152], Soft RSS size [1048576], Soft NOFILE descriptors [8192], Soft FILE size [-1]

Save and commit changes.

This basically just edits /etc/security/limits:

...
*
* Sizes are in multiples of 512 byte blocks, CPU time is in seconds
*
* fsize - soft file size in blocks
* core - soft core file size in blocks
* cpu - soft per process CPU time limit in seconds
* data - soft data segment size in blocks
* stack - soft stack segment size in blocks
* rss - soft real memory usage in blocks
* nofiles - soft file descriptor limit
* fsize_hard - hard file size in blocks
* core_hard - hard core file size in blocks
* cpu_hard - hard per process CPU time limit in seconds
* data_hard - hard data segment size in blocks
* stack_hard - hard stack segment size in blocks
* rss_hard - hard real memory usage in blocks
* nofiles_hard - hard file descriptor limit
*
* The following table contains the default hard values if the
* hard values are not explicitly defined:
*
*       Attribute       Value
*       =========       ============
*       fsize_hard      set to fsize
*       cpu_hard        set to cpu
*       core_hard       -1
*       data_hard       -1
*       stack_hard      8388608
*       rss_hard        -1
*       nofiles_hard    -1
*
* NOTE: A value of -1 implies "unlimited"
*

default:
fsize = 2097151
core = 2097151
cpu = -1
data = 262144
rss = 65536
stack = 65536
nofiles = 2000

root:
data = 2097152
rss = 1048576
nofiles = 8192
fsize = -1

daemon:

...
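
If you prefer the command line to smitty, a sketch of the equivalent chuser call (it produces the same root: stanza shown above):

    # same values as the smitty screen: data/rss in 512-byte blocks, -1 means unlimited
    chuser data=2097152 rss=1048576 nofiles=8192 fsize=-1 root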

  3. Log out as root

  4. Log back in as root (to pick up the changes)

  5. ulimit -a

    time(seconds) unlimited
    file(blocks) unlimited
    data(kbytes) 1048576
    stack(kbytes) 32768
    memory(kbytes) 524288
    coredump(blocks) 2097151
    nofiles(descriptors) 8192
    threads(per process) unlimited
    processes(per user) unlimited

The values look correct 🙂

  6. Start Splunk

  7. Check $SPLUNK_HOME/var/log/splunk/splunkd.log

....
03-31-2015 02:10:27.952 -0700 INFO LicenseMgr - Tracker init complete...
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: virtual address space size: unlimited
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: data segment size: 1073741824 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: resident memory size: 536870912 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: stack size: 33554432 bytes [hard maximum: 4294967296 bytes]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: core file size: 1073741312 bytes [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: data file size: unlimited
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: open files: 8192 files [hard maximum: unlimited]
03-31-2015 02:10:27.987 -0700 INFO ulimit - Limit: cpu time: unlimited
03-31-2015 02:10:27.993 -0700 INFO loader - Splunkd starting (build 245427).
.....

Splunk is running and stable

As you can see, the values for data and rss in splunkd.log agree with the values from ulimit -a (as root) and /etc/security/limits:
Data Segment Size: 1073741824 bytes (splunkd.log) = 1048576 KiB (ulimit) = 2097152 blocks (/etc/security/limits)
Resident Memory Size: 536870912 bytes (splunkd.log) = 524288 KiB (ulimit) = 1048576 blocks (/etc/security/limits)

HTH
Shaky

Splunk Employee

You should check your data segment size (ulimit -d) to make sure it is set in line with what Splunk asks for. By default on AIX systems this is set too low, and it can create issues for Splunk. Usually when this happens you will see lots of bad allocation error messages in the logs that look like the following:

ERROR PropertiesMapConfig - Failed to save stanza /var/adm/sudo.logMonSep2216:37:2720141998275973 to app learned: bad allocation

For the data segment size (ulimit -d) with Splunk 4.2+, increase the value to at least 1 GB (1073741824 bytes).

http://docs.splunk.com/Documentation/Splunk/6.1.3/Troubleshooting/ulimitErrors
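
As a quick pre-flight check before starting Splunk (a sketch, assuming ksh on AIX and that SPLUNK_HOME is set):

    # ulimit -d reports the soft data segment limit in KB
    ulimit -d

    # anything below 1048576 KB (1 GB) is too small; raise it for this shell
    # (or persist it in /etc/security/limits as described above) and restart
    ulimit -d 1048576
    $SPLUNK_HOME/bin/splunk restart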

Path Finder

Is there any PowerPC hardware sizing guide for running Splunk Enterprise (all roles in distributed search)?

Or even with Red Hat on PowerPC?
