Splunk Enterprise

Memory leak in Windows Versions of Splunk Enterprise

weiss_h
Explorer

Hello community, we are currently a bit desperate because of a Splunk memory leak problem on Windows that most of you probably have as well but may not have noticed yet. Here is the history and our analysis:

We first observed a heavy memory leak on a Windows Server 2019 instance after updating Splunk Enterprise from 9.0.7 to 9.1.3. The affected server has several Splunk apps installed (Symantec, ServiceNow, MS o365, DBconnect, Solarwinds), which start a lot of Python scripts at very short intervals. After the update the server crashed every few hours due to low memory. We opened Splunk case #3416998 on February 9th.

With the MS Sysinternals tool rammap.exe we found a lot of "zombie" processes (PIDs no longer listed in Task Manager) that still use some memory (~20-32 KB each). The process names are btool.exe, python3.exe, splunk-optimiz, splunkd.exe. It seems that every time a process of one of these programs ends, it leaves such a memory footprint behind. The Splunk apps on our Windows server start and end processes very frequently, which results in thousands of zombie processes.

After this insight we downgraded Splunk on the server to 9.0.7 and the problem disappeared.

Then we installed Splunk Enterprise versions 9.1.3 and 9.0.9 on a test server. Both versions show the same issue. We opened a new Splunk case, #3428922.

On March 28th we got this information from Splunk:
.... got an update from our internal dev team on this "In Windows, after upgrading Splunk enterprise to 9.1.3 or 9.2.0 consumes more memory usage. (memory and processes are not released)" internal ticket. They investigated the diag files and seems system memory usage is high, but only Splunk running. This issue comes from the mimalloc (memory allocator). This memory issue will be fixed in the 9.1.5 and 9.2.2
..........

9.2.2 arrived on July 1st. Unfortunately, it is still the same issue: the memory leak persists. We opened a third Splunk case, #3518811 (which is still open). It is also not fixed in version 9.3.0.
Even after an online session in which we showed them the rammap.exe screen, they wanted us to provide diags again and again from our (test) servers, although they should actually be able to reproduce it in their lab.

The huge problem is: because of existing vulnerabilities in the installed (affected) versions we need to update Splunk (Heavy Forwarders) on our Windows servers, but we cannot due to the memory leak issue.


How to reproduce:
- OS tested: Windows Server 2016, 2019, 2022, Windows 10 22H2
- Splunk Enterprise Versions tested: 9.0.9, 9.1.3, 9.2.2 (Universal Forwarder not tested)
- let the default installation run for some hours (splunk service running)
- download rammap.exe from https://learn.microsoft.com/en-us/sysinternals/downloads/rammap and start it
- go to the Processes tab and sort by the Process column
- look for btool.exe, python3.exe and splunkd.exe entries with a small total memory usage of about 20-32 KB. The PIDs of these processes don't exist in the task list (see Task Manager or tasklist.exe)
- with the Splunk default installation (without any other apps) the memory usage increases only slowly because the default apps don't run their scripts very frequently
- stopping the Splunk service releases the memory (and the zombie processes disappear in rammap.exe)

- for faster results you can add an app for excessive testing with python3.exe, starting it at short (0 seconds) intervals. The test.py file doesn't need to exist! Splunk starts python3.exe anyway. Only the inputs.conf file is needed:
... \etc\apps\pythonDummy\local\inputs.conf
[script://$SPLUNK_HOME/etc/apps/pythonDummy/bin/test.py 0000]
python.version = python3
interval = 0
[script://$SPLUNK_HOME/etc/apps/pythonDummy/bin/test.py 1111]
python.version = python3
interval = 0
...............if you want, add some more stanzas, 2222, 3333 and so on .............
- the more Python script stanzas there are, the more zombie processes appear in rammap.exe and the faster they appear
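
If you want many stanzas, typing them by hand gets tedious. Here is a minimal Python sketch that generates the same inputs.conf content as shown above. The function names `build_inputs_conf` and `write_inputs_conf` are made up for this example, and the limit of 10 stanzas only exists because the suffixes follow the 0000, 1111, 2222 pattern from the post.

```python
from pathlib import Path


def build_inputs_conf(stanza_count: int) -> str:
    """Return inputs.conf text with `stanza_count` scripted-input stanzas.

    The suffix (0000, 1111, ...) only makes each stanza name unique;
    as noted in the post, test.py itself does not have to exist.
    """
    if not 1 <= stanza_count <= 10:
        raise ValueError("stanza_count must be between 1 and 10")
    blocks = []
    for digit in range(stanza_count):
        suffix = str(digit) * 4  # 0 -> "0000", 1 -> "1111", ...
        blocks.append(
            f"[script://$SPLUNK_HOME/etc/apps/pythonDummy/bin/test.py {suffix}]\n"
            "python.version = python3\n"
            "interval = 0\n"
        )
    return "".join(blocks)


def write_inputs_conf(app_local_dir: Path, stanza_count: int) -> Path:
    """Write the generated text to <app_local_dir>/inputs.conf and return the path."""
    app_local_dir.mkdir(parents=True, exist_ok=True)
    target = app_local_dir / "inputs.conf"
    target.write_text(build_inputs_conf(stanza_count), encoding="utf-8")
    return target
```

Call `write_inputs_conf` with the `local` directory of the pythonDummy app under your own Splunk installation, then restart Splunk so the new stanzas are picked up. Only do this on a test server, since the whole point of the app is to provoke the leak.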

Please share your experiences. And please open tickets with Splunk support if you also see the problem.
We hope Splunk finally reacts.

 

splunkg
Explorer

Upgraded to version 9.3.0.0, but the issue still remains.


NoSpaces
Communicator

I think you are mistaken.
According to weiss_h's message below, this issue is fixed only in the new version 9.3.1.

splunkg
Explorer

Indeed, we upgraded to version 9.3.1 and the memory leak is fixed.

No more memory issues.

weiss_h
Explorer

Good news, versions 9.1.6, 9.2.3, 9.3.1 are available now. Testing with 9.2.3 shows no more zombie processes and splunkd handle count remains low, so the memory leak seems to be fixed.

NoSpaces
Communicator

Just a day ago, I migrated from 9.1.5 to 9.1.6.
I confirm an absence of zombie processes! 😃

michaje
Explorer

Thank you for this post.  We are experiencing a similar issue, and have opened a case.  We are now testing the suggestion made by the Tech Support engineer:

Please add the following stanza to your configuration to see if it resolves the issue.
 
%SPLUNK_HOME%\etc\apps\introspection_generator_addon\local\server.conf
[introspection:generator:resource_usage]
disabled = true
acquireExtra_i_data = false
 
Would you please check/apply the workaround and let us know the output?
 
I will provide an update when I know more - and when I am back from leave.

michaje
Explorer

This workaround does work for us:

%SPLUNK_HOME%\etc\apps\introspection_generator_addon\local\server.conf
[introspection:generator:resource_usage]
disabled = true
acquireExtra_i_data = false
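
For anyone rolling this out to several servers, a small check that the stanza is really in place can be handy. The following is only a sketch: it assumes Python's `configparser` is good enough for this simple two-line file (real Splunk .conf files can contain constructs it does not understand), and the function name `workaround_applied` is made up for this example.

```python
import configparser

STANZA = "introspection:generator:resource_usage"


def workaround_applied(conf_text: str) -> bool:
    """Return True if the text contains the stanza with both workaround settings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. acquireExtra_i_data
    parser.read_string(conf_text)
    if not parser.has_section(STANZA):
        return False
    section = parser[STANZA]
    disabled = section.get("disabled", fallback="").strip().lower()
    extra = section.get("acquireExtra_i_data", fallback="").strip().lower()
    return disabled == "true" and extra == "false"
```

Read the local server.conf of introspection_generator_addon as text and pass it to the function. It only checks that one file and does not take into account settings from other apps or the default directory.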

weiss_h
Explorer

Hi again,

it's not only the introspection_generator_addon app that causes the memory leak; it's a general bug. We got this information from Splunk support last week:

Engineering have found that in the instrumentation source code, an edge condition where a process handle was opened, queried for information and failed to close the handle -- which leads to an open handle on the process (in this case with your app it more often with python -- BUT it could have been any process under the splunk subprocess tree). As a result of the open handle in the introspection code, when the process in question terminates, the OS will not release all the resources of said terminating process because the reference count to that process is not 0. It  is a bit of a racy condition depending on the state of the process.
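
To make that explanation a bit more concrete, here is a toy model in Python. It is not Windows or Splunk code: `ToyProcess` and `query_process` are invented for illustration, and the assumption that the query fails because the process exits between "open" and "query" is only our reading of the "racy condition" mentioned above.

```python
class ToyProcess:
    """Toy stand-in for an OS process object (not a real Windows API)."""

    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.running = True
        self.open_handles = 0  # references held by other processes

    @property
    def resources_released(self) -> bool:
        # The OS can free the process object only when the process has
        # exited AND nobody holds a handle to it any more.
        return not self.running and self.open_handles == 0


def query_process(proc: ToyProcess, exits_during_query: bool,
                  close_on_failure: bool) -> bool:
    """Open a handle, query the process, then close the handle.

    close_on_failure=False mimics the described bug: the handle is
    not closed when the query fails.
    """
    proc.open_handles += 1            # "open a process handle"
    if exits_during_query:
        proc.running = False          # assumed race: process ends right now
    query_ok = proc.running           # the query fails for an exited process
    if query_ok or close_on_failure:
        proc.open_handles -= 1        # "close the handle"
    return query_ok
```

With `close_on_failure=False` a process that exits during the query ends up with one handle still open, so `resources_released` stays False. That would match the zombie entries seen in rammap.exe. With `close_on_failure=True` the count goes back to 0 and the process is released.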

In our environment we have also disabled the apps python_upgrade_readiness_app, splunk_assist and splunk_secure_gateway because they also start sub-processes that leave zombie processes behind (and by the way, we observed high CPU utilization by python3.exe processes started by the splunk_assist app).

Another workaround mentioned by Splunk is to stop the splunkd.exe process that has an extremely high handle count.

If Splunk instrumentation is needed on the system, the problem can be avoided by restarting the Splunk instrumentation. This can be achieved from PowerShell with the command below, scheduled with the Windows Task Scheduler.

Get-Process | Where-Object {$_.ProcessName -eq 'splunkd'} | Where-Object {$_.HandleCount -GE 5000} | Stop-Process

Here is the output of a run on a test server:

PS C:\Windows\system32> Get-Process | Where-Object {$_.ProcessName -eq 'splunkd'}

Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
424 51 347948 193552 15.913,17 2448 0 splunkd
77846 23 57020 58536 855,06 7680 0 splunkd


PS C:\Windows\system32> Get-Process | Where-Object {$_.ProcessName -eq 'splunkd'} | Where-Object {$_.HandleCount -GE 5000} | Stop-Process

Confirm
Are you sure you want to perform the Stop-Process operation on the following item: splunkd(7680)?
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help (default is "Y"):
PS C:\Windows\system32> Get-Process | Where-Object {$_.ProcessName -eq 'splunkd'}

Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
424 51 345888 193540 15.913,70 2448 0 splunkd
309 23 36524 38020 0,58 125536 0 splunkd

Yes, this works, but we don't want to roll it out to all our servers.
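
If you only want to report the affected process instead of stopping it, the same threshold logic can be applied to saved Get-Process output. The sketch below parses the default 8-column table shown above. The function name `splunkd_over_threshold` is made up for this example, and rows with an empty CPU(s) column (fewer than 8 fields) are simply skipped.

```python
def splunkd_over_threshold(get_process_output: str, threshold: int = 5000):
    """Return (pid, handles) for each splunkd row with handles >= threshold."""
    hits = []
    for line in get_process_output.splitlines():
        fields = line.split()
        # Data rows: Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
        if len(fields) != 8 or not fields[0].isdigit() or not fields[5].isdigit():
            continue  # header, separator, prompt or incomplete row
        handles, pid, name = int(fields[0]), int(fields[5]), fields[7]
        if name == "splunkd" and handles >= threshold:
            hits.append((pid, handles))
    return hits
```

Applied to the first table above, this returns only the process with PID 7680 and 77846 handles. For the table after the Stop-Process call it returns an empty list.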

 

 


NoSpaces
Communicator

Thank you for sharing it!
I wrote a little script with simple logic using this workaround.
I have rolled it out across my environment and will monitor how it works.
But I think that this approach is quite a safe way to "remediate" the problem before the Splunk team fixes it.

function Stop-SplZombieProcess {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true)]
        [string]$HostName,

        [Parameter(Mandatory = $false)]
        [int]$Threshold = 5000,

        [Parameter(Mandatory = $false)]
        [switch]$MultiZombie
    )

    begin {
    }
    
    process {
        Write-Host "Trying to find zombies on the host '$HostName'."
        $Procs = Invoke-Command -ComputerName $HostName -ScriptBlock {Get-Process | Where-Object {$_.ProcessName -eq 'splunkd'} }
        if ($Procs.Count -eq 1) {
            Write-Host "Only one splunkd process with '$($Procs.Handles)' handles was found. Most likely it is not a zombie."
        }
        else{
            [array]$Zombies = $Procs | Where-Object { $_.Handles -ge $Threshold }
            if ($Zombies) {
                if ($Zombies.Count -eq 1) {
                    $ProcId = $Zombies.Id
                    Write-Host "Zombie was found. The number of handles is '$($Zombies.Handles)'. Trying to kill."
                    Invoke-Command -ComputerName $HostName -ScriptBlock {Stop-Process -Id $using:ProcId -Force}
                    Write-Host "The zombie process with ProcId '$ProcId' has been killed on the host '$HostName'."
                }
                elseif ($MultiZombie) {
                    Write-Host 'Performing zombie multikill.'
                    foreach ($Item in $Zombies) {
                        $ProcId = $Item.Id
                        Write-Host "Zombie was found. The number of handles is '$($Item.Handles)'. Trying to kill."
                        Invoke-Command -ComputerName $HostName -ScriptBlock {Stop-Process -Id $using:ProcId -Force}
                        Write-Host "The zombie process with ProcId '$ProcId' has been killed on the host '$HostName'."
                    }
                }
                else {
                    Write-Warning "Found more than one process with handles more than '$Threshold'. Raise the threshold value or use the 'MultiZombie' switch to kill more than one zombie."
                }
            }
            else {
                Write-Host "Zombies not found on the host '$HostName'."
            }
        }
    }

    end {
        [System.gc]::collect()
    }
}

NoSpaces
Communicator

Thank you for the workaround.
I also applied it to one of my SHC members to see the difference.
In the afternoon, I will check for zombie processes.


KeithH
Path Finder

We have also applied the same workaround and will advise in a couple of days if things are looking better.


michiel_nld
New Member

We're experiencing the same issues.

We're running version 9.3.0 with a separate indexer, search head, and license server. All of our servers are affected by the memory leak.

This started after we upgraded from version 9.1.3.

We were hoping that subsequent updates would fix it.

Is there any way we can assist to expedite your case?


weiss_h
Explorer

In the meantime Splunk support confirmed the issue and an Escalation Manager is involved. We hope to get a fixed version soon, but currently we have no statement on this.

You may also want to open a case and refer to #3518811.


NoSpaces
Communicator

I also encountered exactly the same problem on my search head cluster.
Now I'm on version 9.1.5 and still having this issue.


KeithH
Path Finder

Hi,

I support two customers who are both running Splunk on Windows and, after upgrades this year, are experiencing very similar problems. I use this search to monitor the swap memory usage:

index=_introspection swap component=Hostwide | timechart avg(data.swap_used) span=1h

As it increases we start seeing crash dumps, which I can also graph with this:

index=_internal sourcetype=splunkd_crash_log "Crash dump written to:" | timechart count

Like you, we have had this logged with Splunk for some time but there is no fix yet, though they did just say there is an internal case looking into it.

For my customers the problem builds up more slowly, so as long as they restart Splunk twice a week they have no problems. Sounds like that won't help you.

It's nice to know there are others with the same issue. Thanks for all your detail, especially re rammap.exe.

Good Luck
