Splunk 7.1.0 Memory and Replication Issues in Multi-Site Index Cluster

Path Finder

Upgrading from Splunk Enterprise 6.6.2 to 7.1.0 introduced two major problems:
1) Memory usage on the Indexers would grow beyond available memory (and swap space), resulting in oom-killer being invoked. oom-killer generally targets the largest memory consumer, which was usually a Splunk process. If it was a "child" process, Splunk would respawn it as needed, but sometimes it was the "parent" splunkd process, which would take down the Indexer. oom-killer was invoked multiple times per hour and took down an Indexer multiple times each day; both Indexers in the Cluster were affected. Search processes dominated on the Indexers, far outweighing all other processes. Indexing rates were low and the processing queues were almost always at 100%. The Search Heads generated many messages indicating blocked indexing due to full queues.
2) Replication between Indexers would not resolve in a timely manner. Only ONCE in two weeks was the Replication Factor met. Crashing Indexers made "out of sync" pretty common, but HOURS of replication activity were unable to bring the Indexers back in sync. The most common causes of delayed fixups were "bucket has not rolled" and "Waiting 'target_wait_time' before search factor fixup" (see answers.splunk.com/answers/660917/fixup-status-message-is-waiting-target-wait-time-b for my question on that point). It seems the "target_wait_time" message is new in 7.1; at least I did not see it after I rolled back to 6.6.
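
For anyone who wants to put numbers on the oom-killer activity described in point 1, here is a minimal Python sketch that counts kills per process name from kernel log lines. It assumes the common kernel wording "Out of memory: Kill process PID (name) ..." (older kernels) or "Out of memory: Killed process PID (name) ..." (newer kernels). The exact message text varies between kernel versions, so the pattern may need adjusting, and the function name `count_oom_kills` is just an illustrative choice.

```python
import re
from collections import Counter

# Assumed kernel log wording; covers "Kill process" and "Killed process".
# Other kernel versions (or cgroup-triggered kills) may word this differently.
OOM_KILL_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def count_oom_kills(log_lines):
    """Return a Counter mapping process name to number of oom-killer kills."""
    kills = Counter()
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            # group(1) is the PID, group(2) is the process name in parentheses
            kills[match.group(2)] += 1
    return kills
```

The function accepts any iterable of lines, so it can be fed an open log file or the captured output of `dmesg`. Which log file holds kernel messages depends on the distribution.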

Note that version 7.1.0 was applied to the Cluster Master, 2 Search Heads, and 2 Indexers, so all were on the identical version. Also note that the Search Heads did not display any Memory Management issues, and Swap usage remained at 0 (zero!) even after many days of service. Search Heads are 12-core/12GB, Indexers are 12-core/32GB and 12-core/20GB, Cluster Master is 12-core/12GB.

Downgrading the installation back to 6.6.2 immediately resolved both issues. Memory Management remained in-bounds (even when utilization reached 100%), and Swap space remained constrained to below 1 GB in the worst case. Index Replication resolved itself in about 1/2 hour.

Although both Indexers remained up and healthy through the night, with excellent Memory and Replication statistics in the morning, I still saw one oom-killer incident on one Indexer and three oom-killer incidents on the other.

I conclude that Memory Management is a HUGE challenge in Splunk 7.1, though I can't say whether that is limited to multi-site Index Clusters. Memory Management is "tolerable" in Splunk 6.6, but it still has risks. It would be nice for Splunk to issue guidance on how to tame oom-killer so that it does not target the Splunk Processes!
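
In the absence of such guidance, one generic Linux mechanism is the per-process `oom_score_adj` file under `/proc`, which accepts values from -1000 (never kill) to 1000 (kill first). The sketch below is not Splunk guidance and is untested against Splunk. It only shows how the value could be written. The function name, the `proc_root` parameter (there so the code can be tried against a scratch directory), and any value chosen are assumptions for illustration.

```python
from pathlib import Path


def set_oom_score_adj(pid, adj, proc_root="/proc"):
    """Write adj to <proc_root>/<pid>/oom_score_adj and return the value read back.

    Lowering the value on a real system normally requires root privileges.
    """
    if not -1000 <= adj <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{adj}\n")
    # Read the value back to confirm the kernel accepted it
    return int(path.read_text().strip())
```

A negative value on the parent splunkd PID would make it a less likely target. Child processes forked afterwards inherit the setting, so this protects one process at the expense of others and does not address the underlying memory growth.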

Multi-site Index Replication in Splunk 7.1 is unacceptable! I accept that it was exacerbated by the Memory Management / Crashing Indexers, but it should not take hours to sync up a few buckets.

Does anyone else have more data that can shed light on resolving these issues? Is it just multi-site clusters? Is 7.1 a memory-hog? Are these bugs?

Splunk Employee

2) Replication hasn't changed; I suspect something else is causing the issues.
Try 7.1.1, where the memory growth and crashing indexers should be out of the picture. If the replication problem still persists, please raise a case with support, provide diags, and post the case number here. Happy to take a quick peek.

Engager

Sadly, we had similar issues even with 7.1.1. I suspect that it's related to Data Models: one of our indexers that doesn't deal with any data models was fine, whereas on the rest of our indexers CPU, memory, disk, and disk I/O usage went insane to the point where they would eventually crash. Rolled back to 7.0.2 for now. Opened a case as well.

SplunkTrust

7.1.1 fixed issues:

2018-05-18 SPL-154138, SPL-154542, SPL-154544, FAST-9662: Searches with multikv extraction use too much memory: potentially orders of magnitude more than previous versions.

I suggest you upgrade to the current release ASAP (7.1.1 at the time of writing).

I'm not sure if this will resolve your second issue but it might fix your memory issue, which might relate to your other issue...