Monitoring Splunk

Why is the splunkd process consuming 100% swap memory, and why does it get killed so frequently?

Hemnaath
Motivator

We have two heavy forwarder (HF)/syslog instances running on the same server. The HFs forward the syslog data to 5 individual indexer instances, and an F5 load balancer sits in front of the two HF servers to route the traffic.

Problem - Most of the time we get an alert from the Unix team stating that the splunkd process is consuming too much CPU/swap memory. Sometimes free swap drops to almost zero and the splunkd process is killed. We get almost 20 alerts a month for this issue. Kindly guide us in overcoming this problem.

System details:
Splunk version 6.2.1
OS - RedHat 6.6
Memory - 6GB
CPU - 3
VMware

Swap usage details -

free -m
                      total     used     free   shared  buffers   cached
Mem:                  15947    15633      313        0      468     2289
-/+ buffers/cache:             12875     3072
Swap:                  3323      281     3042

Note - The above details were taken after restarting the splunkd service, because swap memory had been completely used up over a period of time.

Vmstat details -

vmstat 5

procs -----------memory----------- ---swap-- ----io---- --system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so    bi   bo    in   cs us sy id wa st
 2  1 298140 207032 463984 2290492    4    2    89  106     2    0  4  2 92  1  0
 1  2 298120 188092 471776 2294380    6    0  2021 1826  3122 1629 39 19 32 11  0
 1  2 298368 182920 474908 2296296    6   54  1485 1402  3730 1638 38 20 25 16  0
 2  1 298352 169932 476480 2299852   18    0   746 1537  3075 1330 33 17 29 21  0
 1  1 298316 157688 478100 2302140   38    0   786  322  3024 1434 33 16 29 22  0
 2  2 306260 225224 456940 2256152   27 1597 21979 6657 11180 2321 43 27 11 19  0
 2  1 306236 239776 458292 2261316   11    0   970 1317  3390 1716 39 19 22 20  0
 3  1 306204 190080 459812 2264728   29    0   958 1248  3899 1912 56 21 11 13  0
 1  2 306176 214476 461364 2268292   20    0   973  431  3636 1791 46 20 18 15  0

From the above data we can see that swap-in (si) and swap-out (so) activity is usually very low but sometimes spikes sharply.
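To make that observation repeatable, a small script can pick out the samples where swap activity spikes. The following is only a sketch: it assumes the column order of the vmstat header shown above, reuses three rows from that output as sample input, and the threshold of 100 is an arbitrary illustrative value, not a recommended limit.

```python
# Column order copied from the vmstat header in the post above.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

# Three rows taken from the vmstat output above.
SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 2 298120 188092 471776 2294380 6 0 2021 1826 3122 1629 39 19 32 11 0
1 2 298368 182920 474908 2296296 6 54 1485 1402 3730 1638 38 20 25 16 0
2 2 306260 225224 456940 2256152 27 1597 21979 6657 11180 2321 43 27 11 19 0
"""


def parse_vmstat(text):
    """Turn whitespace-separated vmstat data rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip headers and any row that does not have one integer per column.
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def swap_spikes(rows, threshold=100):
    """Return rows whose swap-in or swap-out value exceeds the threshold.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return [row for row in rows if row["si"] > threshold or row["so"] > threshold]


for row in swap_spikes(parse_vmstat(SAMPLE)):
    print("swap spike: si=%d so=%d free=%d" % (row["si"], row["so"], row["free"]))
```

Feeding it the output of a longer `vmstat 5` capture would show how often the spikes occur and whether they line up with the alerts.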

Disk Performance details -

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            4.20    0.02    2.10    1.26    0.00   92.43

Device:        tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
sda           4.99       114.81       109.53    594190278    566872984
sdb          20.57       289.52       477.08   1498448544   2469197992
sdc           5.78       123.13        49.26    637280496    254949064
dm-0          4.87        62.10        24.80    321430218    128350456
dm-1          4.52        23.58        12.62    122034944     65294368
dm-2        110.57       437.66       598.45   2265167218   3097366832

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
           37.50    0.00   17.05    9.09    0.00   36.36

Device:        tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
sda          57.00      1795.20       590.40         8976         2952
sdb          73.60      1324.80       235.20         6624         1176
sdc          50.60       376.00        81.60         1880          408
dm-0          8.20        52.80        12.80          264           64
dm-1         11.00        38.40        49.60          192          248
dm-2        273.20      3366.40       844.80        16832         4224

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
           37.97    0.07   17.15    8.24    0.00   36.57

CPU Load Detail -
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 4753 root      20   0 12.8g  11g  12m S 116.1 72.3 279:48.20 splunkd
 8892 root      20   0 56808 3188  804 R  23.6  0.0   2187:55 syslog-ng

Thanks in advance.

mahegstrom
Explorer

We are having this same issue 7 years later, on Splunk 9.04.1. Has anyone identified a solution? We have a cron job running every 6 hours which deletes all files. Additionally, we added IgnoreOlderThan = 6h to our inputs.conf, but it had no effect: we are still using 100% of swap memory instead of system memory.


abhishekdharga
Engager

We are getting the same errors on indexers where we are not monitoring any large files.
Any idea how to fix this issue?


sjohnson_splunk
Splunk Employee

What jkat54 said. If your syslog server doesn't have a good log rotation configuration, then the forwarder will be monitoring hundreds or thousands of files, and that consumes a lot of memory and CPU.

I suspect that the Linux OOM killer is killing the splunkd process.
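One way to check that suspicion is to search the kernel log (for example the output of dmesg, or /var/log/messages on RedHat) for OOM kill records. The sketch below scans log lines for such records. The sample lines are invented for illustration and are not from the poster's system, and the exact wording of the kernel message varies between kernel versions, so the pattern is an assumption that may need adjusting.

```python
import re

# Assumed pattern: the kernel logs "Killed process <pid> ... (<name>)".
# The text between the pid and the name differs across kernel versions.
OOM_KILL = re.compile(r"Killed process (\d+)[^(]*\(([^)]+)\)")

# Hypothetical sample lines, written for this example only.
SAMPLE_LOG = [
    "kernel: Out of memory: Kill process 12345 (splunkd) score 900 or sacrifice child",
    "kernel: Killed process 12345, UID 0, (splunkd) total-vm:13421772kB, anon-rss:11534336kB",
    "kernel: Killed process 222 (java) total-vm:2097152kB, anon-rss:1048576kB",
]


def find_oom_kills(log_lines, process_name="splunkd"):
    """Return the PIDs of OOM-killed processes with the given name."""
    pids = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            pids.append(int(match.group(1)))
    return pids


print(find_oom_kills(SAMPLE_LOG))
```

If the real kernel log shows splunkd being killed this way, the restarts are a symptom of memory exhaustion rather than a fault in splunkd itself.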

You should consider adding the following inputs.conf setting:

ignoreOlderThan = <nonnegative integer>[s|m|h|d]
* Causes the monitored input to stop checking files for updates if their
  modtime has passed this threshold. This improves the speed of file tracking
  operations when monitoring directory hierarchies with large numbers of
  historical files (for example, when active log files are colocated with old
  files that are no longer being written to).
  * As a result, do not select a cutoff that could ever occur for a file
    you wish to index. Take downtime into account!
    Suggested value: 14d, which means 2 weeks
* A file whose modtime falls outside this time window when seen for the first
  time will not be indexed at all.
* Default: 0, meaning no threshold.
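Before choosing a cutoff, it can help to see how many files under a monitored directory are older than the candidate threshold. The sketch below walks a directory tree and counts files on either side of a modification-time cutoff. It only approximates what the setting would skip and does not use any Splunk API. The /opt/syslogs path comes from the monitor stanzas in this thread, and the 14-day cutoff is the value suggested in the spec text above.

```python
import os
import time


def count_files_by_age(root, cutoff_seconds, now=None):
    """Walk `root` and return (recent, stale) file counts based on mtime."""
    now = time.time() if now is None else now
    recent = stale = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                age = now - os.stat(path).st_mtime
            except OSError:
                # The file was rotated or deleted while we were walking.
                continue
            if age > cutoff_seconds:
                stale += 1
            else:
                recent += 1
    return recent, stale


if __name__ == "__main__":
    # Path and cutoff are taken from this thread; adjust for your system.
    recent, stale = count_files_by_age("/opt/syslogs", 14 * 24 * 3600)
    print("recent files: %d, files older than cutoff: %d" % (recent, stale))
```

A large count in either group suggests the forwarder is tracking far more files than are actively being written.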

jkat54
SplunkTrust

Do you have any inputs that monitor large directories of files? For example:

[monitor://.../*.log]

If so, this is probably the culprit.

Hemnaath
Motivator

Yes, it reads the syslog data written by the syslog service, which runs on the same server as the HF instance.

/opt/splunk/etc/apps/ADMIN-hvy-forwarders/default/inputs.conf

[monitor:///opt/syslogs/web_access/.../Common/*.log]

[monitor:///opt/syslogs/symantec/SymantecServer/...]

[monitor:///opt/syslogs/symantec/*semp*/...]

[monitor:///opt/syslogs/symantec/.../ID.log]

[monitor:///opt/syslogs/proxy/...]
sourcetype = bluecoat_syslog

[monitor:///opt/syslogs/dns/.../*.log]
sourcetype = syslog

[monitor:///opt/syslogs/webops_security/.../*.log]
sourcetype = syslog

[monitor:///opt/syslogs/firewall/.../*.log]
sourcetype = syslog

[monitor:///opt/syslogs/esx/.../*.log]
sourcetype = syslog

[monitor:///opt/syslogs/generic/DCESX*/*.log]
sourcetype = syslog

[monitor:///opt/syslogs/generic/dcqip*/*.log]
sourcetype = syslog

[monitor:///opt/syslogs/generic/dcesx*/*.log]
sourcetype = syslog

[monitor:///opt/syslogs/generic/GTSPCFW*/*.log]
sourcetype = syslog

[monitor:///opt/syslogs/generic/.../*.log]
sourcetype = syslog

[monitor:///opt/syslogs/generic/idsmgt/*.log]

[monitor:///opt/syslogs/yammer/Messages.csv]
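With this many stanzas, a quick audit can list which monitor inputs set a given option. The sketch below is a deliberately simple parser, not Splunk's own configuration logic: it reads a single file's text and ignores [default] stanzas and settings layered in from other apps or the local directory. The first sample stanza has an ignoreOlderThan line added purely for illustration; it is not in the posted configuration.

```python
# Sample text: the second stanza is from the post; the ignoreOlderThan line
# in the first stanza is a hypothetical addition for this example.
SAMPLE_CONF = """\
[monitor:///opt/syslogs/dns/.../*.log]
sourcetype = syslog
ignoreOlderThan = 3d

[monitor:///opt/syslogs/proxy/...]
sourcetype = bluecoat_syslog
"""


def stanzas_missing_setting(conf_text, setting="ignoreOlderThan"):
    """Return monitor stanza headers that do not set `setting`."""
    stanzas = {}  # header -> set of keys seen in that stanza
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            stanzas.setdefault(current, set())
        elif current is not None and "=" in line:
            key = line.split("=", 1)[0].strip()
            stanzas[current].add(key)
    return [
        header
        for header, keys in stanzas.items()
        if header.startswith("monitor://") and setting not in keys
    ]


for header in stanzas_missing_setting(SAMPLE_CONF):
    print("no ignoreOlderThan:", header)
```

Run against the full inputs.conf above, every stanza would be reported, since none of them sets the option.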

So what is the solution to make swap memory usage stable? We have asked the Unix team to increase swap from 3 GB to 8 GB. Will this work?

We receive almost 20 GB of syslog data on each of the two heavy forwarders every 24 hours, and we have set up a log rotation/cron job to delete the files after an hour to control disk space.

Kindly advise on a permanent solution for this.


jkat54
SplunkTrust

You should probably put ignoreOlderThan = 3d or something similar on all of those inputs so that files older than the cutoff are ignored. See sjohnson_splunk's reply in this thread.
