All,
I have about 2658 devices checking into our deployment server (CentOS 6.6, x64, Splunk 6.41)
8 vCPU / 16 GB RAM
Overall we sit around 10-20% CPU with plenty of memory free, but the overall UI performance is becoming basically unusable. I am guessing there are some performance tweaks I need to make. I really haven't seen any guides on this.
Probably worth mentioning: 99% of the clients have a 2 hour check-in time, but about 20 servers (other Splunk servers) are set to check in every 2 minutes.
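For a sense of scale, the check-in numbers above can be turned into an average phone-home rate. This is only a back-of-the-envelope sketch: it assumes check-ins are spread evenly over each interval, and the function name is made up for illustration.

```python
# Rough phone-home load estimate from the numbers in the question.
# Assumption: check-ins are spread evenly over each interval.

def checkins_per_second(clients: int, interval_seconds: float) -> float:
    """Average check-ins per second for clients polling at a fixed interval."""
    return clients / interval_seconds

total_clients = 2658
fast_clients = 20                             # other Splunk servers, every 2 minutes
slow_clients = total_clients - fast_clients   # everything else, every 2 hours

slow_rate = checkins_per_second(slow_clients, 2 * 60 * 60)
fast_rate = checkins_per_second(fast_clients, 2 * 60)
total_rate = slow_rate + fast_rate

print(f"2-hour clients:   {slow_rate:.3f} check-ins/s")
print(f"2-minute clients: {fast_rate:.3f} check-ins/s")
print(f"total:            {total_rate:.3f} check-ins/s")
```

Under that assumption the total works out to roughly one check-in every two seconds, which fits the low CPU figures and suggests the check-in load by itself is probably not what makes the UI slow.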
top - 19:09:45 up 319 days, 1:44, 2 users, load average: 0.72, 0.63, 0.40
Tasks: 235 total, 2 running, 233 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.9%us, 7.8%sy, 0.0%ni, 69.0%id, 1.7%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 16333660k total, 15423556k used, 910104k free, 191868k buffers
Swap: 8388604k total, 375048k used, 8013556k free, 8733040k cached
Is there any update on this? I still experience the slowness in 7.2
I did not test it yet, but in the latest release 7.1.2 there seems to be a bugfix for that:
SPL-155009, SPL-153261 Slow Performance in the Deployment Server UI and sometime crash the browser
I have the same issue on Splunk 6.5.2.
The page is very slow to load.
I commented further up:
I installed 7.1.2 with the partial fix that I mentioned previously, and it has improved, but it is still incredibly slow. One environment I am working on has over 20k UFs now.
Support have logged an ER (Enhancement Request); however, it was indicated this can take months, if it is even looked into at all.
It is a best practice NOT to use the Deployment Server UI at all. Why? Because admins never enter it from the same app, and the result is that the physical serverclass.conf files are spread all over the app space, creating an upgrade and management nightmare. We always disable the UI (with a deliberate configuration in serverclass.conf) and then manage it from the CLI. That is the only sane way to do it long-term. Disabling the UI also makes it safer to use a configuration management tool to version-control the serverclass.conf file and the deployment-apps directory, which you should be doing.
@woodcock - the admin class which I took a couple of months ago teaches only the Deployment Server UI way. It was tough to swallow during the class, and I fully agree with you.
Do you have a document you can point to that states the GUI should not be used because it is not best practice?
Not sure I agree that the UI should create files outside of 'system/local' ... seems silly.
Just checked and the bulk of the definitions are in 'search/local' for some reason.
The environment has some extensive whitelists/blacklists and when making changes via the UI (after waiting minutes for it to respond) the filtering/preview feature is very handy.
I ended up creating a read only dashboard of the Forwarder Management using REST so we can at least easily view the state... making changes is the painful part now.
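As a rough idea of what a read-only view over REST could look like, here is a minimal sketch. The thread does not say which endpoint or fields the dashboard used; the endpoint path, the "entry"/"content" layout, and the "splunkVersion" key below are assumptions, and the sample data is invented for illustration.

```python
import json
from collections import Counter

# Assumed endpoint (not named in the thread); request it with output_mode=json.
CLIENTS_ENDPOINT = "/services/deployment/server/clients"

def summarise_clients(payload: dict) -> dict:
    """Count clients and group them by reported Splunk version.

    `payload` is assumed to look like {"entry": [{"content": {...}}, ...]}.
    The "splunkVersion" key is an assumption; missing keys count as "unknown".
    """
    entries = payload.get("entry", [])
    versions = Counter(
        entry.get("content", {}).get("splunkVersion", "unknown")
        for entry in entries
    )
    return {"total": len(entries), "by_version": dict(versions)}

# Hypothetical sample response, for illustration only.
sample = json.loads("""
{"entry": [
  {"name": "a", "content": {"hostname": "web01", "splunkVersion": "6.5.2"}},
  {"name": "b", "content": {"hostname": "web02", "splunkVersion": "6.5.2"}},
  {"name": "c", "content": {"hostname": "db01"}}
]}
""")
print(summarise_clients(sample))
```

Summarising on the server side like this avoids rendering the full client list in the browser, which is where the thread reports the slowdown.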
Because search is an app, like I said. It is Search and Reporting and built-in, but it is an app.
When you go to Forwarder Management through the GUI (Settings > Forwarder management) the URL is https:///manager/system/deploymentserver.
How does this relate to the inbuilt search app? To me this suggests a system context... hence 'system/local'.
I don't know what to tell you, and I don't really know exactly how it happens, but there you go. It won't just be search. Run this:
find /opt/splunk/etc/ -type f -name serverclass.conf
And what you find is exactly why we disable the GUI for DS.
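If you want the same check in a script (for example as part of a configuration audit), a minimal Python sketch of the find command could look like the following. The function name is hypothetical; the path is the one used in the command above.

```python
from pathlib import Path
from typing import List

def find_serverclass_files(etc_dir: str) -> List[str]:
    """Return every serverclass.conf under etc_dir, as paths relative to it."""
    root = Path(etc_dir)
    if not root.is_dir():
        return []  # nothing to scan, e.g. Splunk is installed elsewhere
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("serverclass.conf")
        if p.is_file()
    )

if __name__ == "__main__":
    # Same starting directory as the find command above.
    for path in find_serverclass_files("/opt/splunk/etc"):
        print(path)
```

More than one line of output means the deployment server definitions are spread across several apps, which is the situation described above.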
Working with Splunk support now, we get the config split between 'system/local' and 'search/local'.
I have run into the same issue. I started with Splunk 6.5 as the deployment server on a VM with 6 CPUs and 16 GB RAM. After restarting the Splunk service, Forwarder Management is very responsive. As it builds the client list and the number of clients climbs over 1000, you can notice considerable degradation in the response from Splunk Web (only under the Forwarder Management section; all others are fine). I have 6200 clients, and by the time the deployment server builds the complete list, the web UI response time goes into minutes (3 minutes typically).
So I upgraded from the VM to a physical box with 16 CPUs and 48 GB RAM, turned THP off, and set my ulimits as follows to remove any bottlenecks.
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 63621
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 16284
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Installed Splunk 7.1.1
My CPU utilization hovers between 0.05% and 0.15%, except when I reload the Universal Forwarder management page, at which point one CPU goes to 100% while the others stay under one percent. The gain I have seen from moving to the physical server: page load time was just over 3 minutes on the VM and is just over 1 minute on the physical server. It seems every 1000 machines add about 10 seconds of delay.
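To put that rule of thumb into numbers, here is a small sketch. Treating the delay as linear in the client count is an assumption based on this single observation on the physical server, not a measured model.

```python
# Rule of thumb quoted above: about 10 seconds of page load per 1000 clients.
# Assumption: the relationship is linear; only one observation supports it.
SECONDS_PER_1000_CLIENTS = 10

def estimated_load_seconds(clients: int) -> float:
    """Estimated Forwarder Management page load time for a given client count."""
    return clients / 1000 * SECONDS_PER_1000_CLIENTS

for clients in (1000, 6200, 20000):
    print(f"{clients:>6} clients -> ~{estimated_load_seconds(clients):.0f} s")
```

For 6200 clients this gives about 62 seconds, which matches the "just over 1 minute" seen on the physical server. If the trend held, a 20k-client environment would be looking at page loads of over three minutes.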
I verified the poor response in Chrome, Firefox, IE, Edge, and Safari, which shows that it is not browser related. Then I used the Firefox debugger to look at the different sections of the page and their load times. See the image. It is clearly a bug in Splunk which needs immediate attention.
For some reason people could not open the image. Please use the following link.
Image isn't loading for me, would be curious to see it. Can you try and reupload / imgur etc.
Hi there, are Transparent Huge Pages disabled, and were the ulimits tuned properly?
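One way to check both is a quick script like the sketch below. It assumes a Linux host; the second sysfs path is the one older RHEL/CentOS kernels use. The limits it prints are those of the process running the script, so run it as the same user that runs Splunk.

```python
import re
import resource
from pathlib import Path

THP_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",  # older RHEL/CentOS
)

def active_thp_mode(text: str) -> str:
    """Return the bracketed value, e.g. 'never' from 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else "unknown"

def report() -> None:
    """Print the THP mode (if the sysfs file exists) and two key ulimits."""
    for path in THP_PATHS:
        p = Path(path)
        if p.is_file():
            print(f"THP ({path}): {active_thp_mode(p.read_text())}")
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft} hard={hard}")
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print(f"max user processes: soft={soft} hard={hard}")

if __name__ == "__main__":
    report()
```

A THP value of "never" means Transparent Huge Pages are off; the two limits correspond to the "open files" and "max user processes" rows of ulimit -a.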
I am using Firefox, my colleague uses Chrome. We could also reproduce the effect with IE11.
Today our deployment server has 9500+ UFs, on Splunk release 6.5.2 for all servers and most of the UFs.
The slowness mostly happens while listing all the forwarders. To suffer from the slowness, it is sufficient to have the deployment server open in a browser tab while working in a different tab. It slows down the complete browser.
As soon as you close the tab with the deployment server, the browser returns to normal speed after a few seconds.
Splunk support did not believe it at first, but we showed it to them in a WebEx session ... now they are looking into it.
From my perspective, I would switch to a DMC-like GUI instead of keeping the JavaScript-like GUI of today.
Latest version (7.1.2) apparently has the partial fix:
2018-06-07 SPL-155009, SPL-153261 Slow Performance in the Deployment Server UI and sometime crash the browser
I'm yet to test this.
Having exactly the same issue - I assume this is still occurring for you?
DS with around 20k UFs (the issue was already occurring when we had 10k).
Chrome, Firefox, and IE all experience the issue.
As soon as the browser/tab is closed, the browser is responsive again.
The DS is over-specced and is barely hitting 20% resource usage during peaks.
Logged a support case.
We are seeing the same issue running Splunk 7.0.1 on a bare-metal server with "15 CPU and 15 GB RAM".
Average CPU utilization is 1%; memory used is 13 GB.
It is a dedicated server for deployment only, with 6000 clients phoning home every hour.
The deployment server is very responsive right after a restart, but after about two hours the GUI becomes painfully slow in the Forwarder Management section.
During this slowness the CPU only spikes to 20%, and there is no change in memory utilization.