We are using Citrix PVS to provision fresh XenApp servers every night, about 60 of them in total. A few dozen applications are then script-installed after boot; the UF is one of them.
This works well most of the time but almost every day one or two terminal servers experience problems right off the bat. The process "splunk-regmon.exe" crashes over and over, generating huge amounts of log messages in the local splunk.log file and the _internal index.
On one occasion, it also started spewing crash dmp files, quickly filling the disk with 11 gigs of them before the server retired itself.
INFO TailingProcessor - Ignoring file 'C:\Program Files\SplunkUniversalForwarder\var\log\splunk\C__Program Files_SplunkUniversalForwarder_bin_splunk-regmon_exe_crash-2015-02-27-07-55-53.dmp' due to: binary
ERROR ExecProcessor - message from ""C:\Program Files\SplunkUniversalForwarder\bin\splunk-regmon.exe"" splunk-regmon - run_regmon: Fail to run Registry Monitor: 'bad allocation'
We use the following command line to install Universal Forwarder:
splunkforwarder-6.1.3-220630-x64-release.msi AGREETOLICENSE=Yes RECEIVING_INDEXER="splunk.######:9997" DEPLOYMENT_SERVER="splunk.######:8089" LAUNCHSPLUNK=1
(Domain names redacted)
The terminal servers then receive apps from the deployment server. Here are the configs:
[WinEventLog://Application]
disabled = 0
index=windows
start_from = oldest
current_only = 1

[WinEventLog://Security]
disabled = 0
index=windows
start_from = oldest
current_only = 1

[WinEventLog://System]
disabled = 0
index=windows
start_from = oldest
current_only = 1

[WinEventLog://ForwardedEvents]
checkpointInterval = 5
current_only = 1
disabled = 0
start_from = oldest
index=windows

[WinEventLog://Setup]
checkpointInterval = 5
current_only = 1
disabled = 0
start_from = oldest
index=windows

[admon://default]
disabled = 1
monitorSubtree = 1

[WinRegMon://default]
disabled = 1
hive = .*
proc = .*
type = rename|set|delete|create

[WinRegMon://hkcu_run]
disabled = 1
hive = \\REGISTRY\\USER\\.*\\Software\\Microsoft\\Windows\\CurrentVersion\\Run\\.*
proc = .*
type = set|create|delete|rename

[WinRegMon://hklm_run]
disabled = 1
hive = \\REGISTRY\\MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run\\.*
proc = .*
type = set|create|delete|rename

[monitor://C:\Progra~1\Tivoli\TSM\baclient\dsmerror.log]
disabled = false
sourcetype = dsmerror
index=tsm
current_only = 1

[monitor://C:\Progra~1\Tivoli\TSM\baclient\dsmsched.log]
disabled = false
sourcetype = dsmsched
index=tsm
current_only = 1

[WinEventLog:Microsoft-Windows-PrintService/Operational]
disabled = 0
index=windows
start_from = oldest
current_only = 1

[WinEventLog:Microsoft-Windows-PrintService/Admin]
disabled = 0
index=windows
start_from = oldest
current_only = 1
As you can clearly see from the configuration files deployed, we are not using the registry monitor at all. Is there any way we can completely stop it from being executed? Why is this happening on just one or two random identical terminal servers booted off the same image? Is this a known problem with the UF version we are using?
Thanks for the tip to check out uberagent-for-splunk, but could this use case also be addressed without installing UF at all?
Did you consider collecting from these XA nodes with a remote forwarder?
We did consider this, but the need to collect other log files means we still need a local agent. The UF would fit the bill perfectly if we could just iron out the wrinkles. It's working fine on 100+ other servers so we just need to figure out what's causing random issues during nightly unattended installs.
You used to be able to disable it completely like so:
# $SPLUNK_HOME/etc/system/local
[script://$SPLUNK_HOME\bin\scripts\splunk-regmon.path]
disabled = 1
But I've not got a Windows forwarder handy to test on.
And now the problem is back. Around 7 AM, one of the terminal servers died of Disk Full Syndrome with massive amounts of coredumps. "splunk-regmon.exe" and "splunk-admon.exe" have been crashing over and over ever since the forwarder finished installing around 4 AM this morning.
All the others are working just fine. This is driving me crazy.
If you are using PVS, then the setting to disable regmon would have been wiped out on a reboot. I have 2 suggestions:
1) Use Splunk deployment server to distribute a package to the universal forwarder to disable regmon.
2) Make the universal forwarder part of your gold image, already configured the way you need it. (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Integrateauniversalforwarderontoasystemimag...)
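For suggestion 1, a minimal sketch of what such a deployment app might look like. The app name "disable_regmon", the server class name "xenapp_servers", and the hostname pattern "xa-*" are made up for illustration and would need to match your environment. The inputs.conf stanza is the one posted earlier in this thread, and I have not tested this against 6.1.3.

```ini
# On the deployment server:
# $SPLUNK_HOME/etc/deployment-apps/disable_regmon/local/inputs.conf
# (app name is hypothetical; stanza is the one quoted earlier in this thread)
[script://$SPLUNK_HOME\bin\scripts\splunk-regmon.path]
disabled = 1

# On the deployment server:
# $SPLUNK_HOME/etc/system/local/serverclass.conf
# (server class name and whitelist pattern are hypothetical)
[serverClass:xenapp_servers]
whitelist.0 = xa-*

[serverClass:xenapp_servers:app:disable_regmon]
# Restart the forwarder after the app is installed so the input change takes effect
restartSplunkd = true
stateOnClient = enabled
```

Keep in mind that on a freshly provisioned server there is a window between the forwarder starting and the app arriving from the deployment server, so the default inputs may still run briefly before this setting lands.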
1) This is what we already do. The setting is deployed, but it does not have the desired effect of preventing the crashes that fill the disk with coredumps.
2) This may be an option in the future if we can trust the UF. As of now, we can't.
Seems to do the trick. Over the weekend I have not seen a single instance of splunk-regmon.exe crashing, and it no longer appears in the process list.
Do you know a similar way to disable splunk-netmon.exe? This process also seems to crash randomly (although much less often).