Hey,
I have a 16-core Windows 2003 server running Splunk v4.3.1 which is used for both searching and indexing. On a couple of recent occasions, during daily peak indexing periods, the UI has slowed down and the splunkd daemon has become unreachable. During these periods, a lot of data is coming into Splunk via UDP.
Does anyone have any tips on what can be done to ensure that the Splunk UI remains usable during these peak indexing periods (without introducing more hardware)?
I've noticed that during these busy periods, the server only appears to be making decent use of 2 or 3 of the available 16 cores. Is this normal? If not, can it be changed?
Any tips on improving indexing throughput/efficiency would also be very useful. Thanks in advance for your help.
Ok, I have to ask: are you using SOS (Splunk On Splunk) and the Splunk Deployment Monitor? These two apps will help diagnose issues. I am not sure how familiar you are with Splunk performance requirements, and I don't know your environment.
Here are some things I would look into:
Splunk requires the dedication of one CPU core for every user logged into the system, and each search a user runs takes up an additional CPU core for the duration of the search. Overall CPU usage does not necessarily reveal performance issues: a few cores can be fully saturated while total utilization still looks low. Once all cores are occupied, all activities slow down as processing time is split between them.
To address your issue of data loss, consider an intermediate universal forwarder (UF) or heavy forwarder (HF), and possibly indexer acknowledgement. If you are receiving large amounts of streaming data, you may want to increase your input and output queues to handle disruptions in indexer response. Indexer acknowledgement does have performance implications on the forwarders, but this should help with in-flight data loss, and the intermediate forwarder will act as a buffer for streamed data; see the sketch below.
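As a rough sketch (the host name, port, and queue sizes here are hypothetical placeholders, not values from your environment), indexer acknowledgement and a larger output queue go in outputs.conf on the forwarder:

[tcpout:primary_indexers]
server = indexer.example.com:9997
# Wait for the indexer to confirm receipt before discarding sent data
useACK = true
# Larger in-memory queue to ride out short indexer stalls
maxQueueSize = 7MB

And for a UDP input, a persistent queue in inputs.conf lets bursts spill to disk instead of being dropped:

[udp://514]
sourcetype = syslog
# In-memory input queue
queueSize = 10MB
# Hypothetical size; overflow is spooled to disk when the memory queue fills
persistentQueueSize = 100MB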
Additional Reading:
How concurrent users affect Splunk performance
I hope this helps or gets you started. If this answer helps, don't forget to accept it and/or give it a thumbs up. Cheers.
I suspect the I/O on the system is becoming overloaded and causing splunkd to become unresponsive. What are the current average queue sizes your system is experiencing, and what is the peak CPU load during busy periods?
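If you want to pull the queue numbers yourself, something along these lines against the internal metrics.log should show how full each pipeline queue gets (the percentile and span are just suggestions; the fields come from the standard group=queue metrics events):

index=_internal source=*metrics.log* group=queue | eval fill_pct = round(current_size_kb / max_size_kb * 100, 2) | timechart span=10m perc90(fill_pct) by name

Queues that sit near 100% full (often indexqueue when disk I/O is the bottleneck) tell you which pipeline stage is struggling.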
It's a combination of issues: the size of some of the events (we were receiving high numbers of events containing stack traces), plus a high number of UDP in-connections during peak periods, which was causing the server to slow down.
@Ant1D, thanks for accepting my answer. Do you mind sharing what the cause and resolution were?
I have the same suspicions as @Drainy; use your metrics.log file to determine your queue sizes. I would recommend installing SOS and Deployment Monitor for your Splunk environment. In addition, I would also consider strictly defining your sources in props.conf using line-breaking and/or timestamp-extraction settings, in particular MAX_TIMESTAMP_LOOKAHEAD, which can improve indexing performance; see the sketch below. As a note, event size is determined by the number of bytes.
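As a minimal sketch, assuming single-line field="value" events with a leading timestamp (the sourcetype name and time format are hypothetical, so substitute your own), a strictly defined sourcetype in props.conf might look like:

[my_udp_sourcetype]
# Events are single lines, so skip the expensive line-merge pass
SHOULD_LINEMERGE = false
# Break events on newlines only
LINE_BREAKER = ([\r\n]+)
# Timestamp sits at the start of the event
TIME_PREFIX = ^
# Hypothetical format; match your actual timestamps
TIME_FORMAT = %Y-%m-%d %H:%M:%S
# Stop scanning for the timestamp after 19 characters
MAX_TIMESTAMP_LOOKAHEAD = 19

Pinning these down saves the indexer from guessing line breaks and scanning deep into every event looking for a timestamp.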
I am not using any of those apps. I have a decent understanding of Splunk performance; I'm just looking for some tips on how I can stop the described performance issue from recurring. The events coming in are not large (roughly 10 field="value" pairs in each event). Most dashboards have 1 or 2 searches, and concurrent searches are also minimal.
Thanks for your comments. It's a 2.27GHz 16-core server as described above, with 24GB RAM, indexing up to 7GB a day. The bulk of this comes in on a UDP port, and occasionally during busy periods (when large amounts of incoming data arrive via this port from numerous hosts) Splunk slows down and the UI can become unavailable. I would have thought that this server could handle that.
Perhaps for more reasons than one, would it not be a better solution to log the network traffic to file first? If that is possible, e.g. if you are using UDP 514, use a syslog server to create local logs and handle the network traffic; then Splunk can just read the files, which is probably quicker.
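As an illustration, assuming a syslog daemon (syslog-ng, rsyslog, or a Windows equivalent) is already writing per-host files under a local directory (the path and host_segment value below are hypothetical), the Splunk side then reduces to a simple file monitor in inputs.conf:

[monitor://D:\syslog]
sourcetype = syslog
# Hypothetical: take the host name from the second path segment, e.g. D:\syslog\<host>\messages.log
host_segment = 2

This decouples network receipt from indexing: the syslog daemon absorbs the UDP bursts by writing to disk, and Splunk tails the files at its own pace.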