Hi all. I am currently facing another problem in our distributed environment. We have 5 individual indexer instances configured to store/retrieve data, and one of the indexers is at almost 87% disk usage compared to the other four indexer instances.
We have two heavy forwarders and 1000+ universal forwarders forwarding inputs to these five indexer instances.
I have checked the outputs.conf and indexes.conf files on both heavy forwarders and all five indexer instances, and they all share the same configuration details.
Splunk Version - 6.2.1
defaultGroup = all_indexers
maxQueueSize = 1GB
server = splunk01.xxxxx.com:9997,splunk02.xxxxx.com:9997,splunk03.xxxxx.com:9997,splunk04.xxxxx.com:9997,splunk05.xxxxx.com:9997
autoLB = true
Kindly let me know how to figure out where exactly the problem is, and also what to check in splunkd.log.
Thanks in advance.
Given the information in the comments above, it appears that there are two log sources that are of high enough velocity that the forwarder may not switch indexers, because EOF is not reached when processing the input stream.
Assuming that net_firewall is a syslog data source processed via your HWF pair, somesoni2's advice is correct. You need to enable forceTimebasedAutoLB=true in outputs.conf and - optionally - reduce autoLBFrequency to something lower than the default of 30 seconds to ensure the HWF breaks the indexer connection more often and selects a new indexer.
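As a sketch, assuming your tcpout group is named all_indexers as in your posted stanza (the 15-second frequency is an example value, not a requirement), the change in outputs.conf would look something like this:

```
[tcpout:all_indexers]
server = splunk01.xxxxx.com:9997,splunk02.xxxxx.com:9997,splunk03.xxxxx.com:9997,splunk04.xxxxx.com:9997,splunk05.xxxxx.com:9997

# Force the forwarder to switch indexers on a timer, even if the
# current input stream has not reached EOF.
forceTimebasedAutoLB = true

# Switch indexers every 15 seconds instead of the default 30.
autoLBFrequency = 15
```

A forwarder restart is required for outputs.conf changes to take effect.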
In addition to that, I would recommend increasing parallelIngestionPipelines to at least 2 (more if your HWF machines have enough cores) as documented here.
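For reference, parallelIngestionPipelines lives in server.conf rather than outputs.conf; a minimal sketch, assuming your HWF machines have spare cores for a second pipeline set:

```
[general]
# Each additional pipeline set adds a full parsing/output pipeline,
# at the cost of roughly one extra core's worth of CPU load.
parallelIngestionPipelines = 2
```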
Having said that, and as an additional comment: As is best practice for consuming syslog data sources, you should consider replacing your HWFs with a pair of syslog-ng servers (behind your F5) that receive your syslog stream. Configure syslog-ng with appropriate rules to write source data to directories/files based on their origin and install Universal forwarders to process the files using [monitor://] in inputs.conf. This will give you much better control over properly sourcetyping the data streams and thus facilitate easier searching.
In addition, you will not lose any events if you have to restart forwarders due to configuration changes or upgrades. Take a read here for more details.
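A minimal sketch of that setup, with hypothetical directory names, host patterns, and ports: syslog-ng writes each source category to its own directory/file, and a Universal Forwarder monitors the resulting tree.

syslog-ng.conf fragment:

```
source s_net { udp(port(514)); };
# Hypothetical filter: firewall hosts named fw-something
filter f_firewall { host("^fw-"); };
destination d_firewall {
    file("/var/log/remote/firewall/${HOST}.log");
};
log { source(s_net); filter(f_firewall); destination(d_firewall); };
```

inputs.conf on the UF:

```
[monitor:///var/log/remote/firewall]
sourcetype = net_firewall
index = net_firewall
```

This way each file carries a clean sourcetype, and the UF picks up from where it left off after a restart.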
Thanks ssievert. I wanted to check with you whether it is safe to change the configuration file - will there be any impact, since this is a prod environment? Also, how can we confirm that this is the root cause of the problem?
We have two outputs.conf files on the HWF pair: one set is configured under system/default and another set under apps/default.
The stanza under /opt/splunk/etc/apps/ourapps/default/outputs.conf is already mentioned in the comments above, whereas the other outputs.conf, under /opt/splunk/etc/system/default/outputs.conf, contains the default stanza, in which I can see autoLBFrequency set to 30, along with other settings as shown below (partial stanza, not the complete stanza):
autoLBFrequency = 30
type = udp
priority = <13>
dropEventsOnQueueFull = -1
maxEventSize = 1024
As per the configuration precedence, Splunk first looks in system/local, then apps/local, then apps/default, and gives least preference to system/default. So we need to enable forceTimebasedAutoLB=true and reduce autoLBFrequency to 15 in /opt/splunk/etc/apps/ourapps/default.
thanks in advance.
Yes - never make any changes to system/default; always do it in your local app context.
This change is safe to make on a Universal Forwarder only if you have no events larger than 64KB.
Since your HWF will do your event parsing and hence understands event boundaries, these settings can be changed without concern for event sizes on a heavy forwarder.
The impact of the change is that your HWF will pick a new indexer to send data to more often (every 15 seconds), which will hopefully result in more even event distribution for those syslog sourcetypes that are flowing through your HWF (which appear to be the ones showing the imbalance).
If you want to find out how frequently your HWFs connect to indexers, you could run a search against index=_internal host=yourHWFhost and look for messages like this:
mm-dd-yyyy hh:mm:ss.nnn +1000 INFO TcpOutputProc - Connected to idx=126.96.36.199:9997
You should see one of these at least every 30 seconds.
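To break those connection events down per indexer, a search along these lines should work (idx may already be an extracted field, so the rex is a fallback; the host pattern is a placeholder for your HWF hosts):

```
index=_internal host=yourHWF* sourcetype=splunkd TcpOutputProc "Connected to idx"
| rex "idx=(?<idx>[^,\s]+)"
| timechart span=5m count by idx
```

A healthy round-robin should show roughly equal counts per idx over time; an indexer that dominates the counts is holding connections for a disproportionate share of the time.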
You could also look at your LB configuration and make sure the F5 is not configured with sticky sessions and properly balances across your two HWFs.
Thanks ssievert. I ran the query you mentioned in your post on the search head:
index=_internal host=ourHWF* splunk_server=splunk0*
and got some events from both HWFs.
But how do I find the difference in connection frequency between the indexer instances? Could you please share a query to find the frequency difference per indexer?
thanks in advance.
Hi ssievert, could you please help me with a query to find the connection frequency per indexer instance, or let me know which fields to compare to get the results? I am not sure what to check in the events after executing the query on the search head.
thanks in advance.
Hi ssievert, thanks for your effort on this. In our environment we have two HF servers behind an F5 load balancer; syslog data comes in via the load balancer to the HFs and then on to the indexers. We also have 1000+ UFs that forward data directly to the indexer servers without going through the HF instances.
Kindly guide us in getting this issue fixed.
Try setting forceTimebasedAutoLB under your [tcpout:all_indexers] stanza. This will force the forwarder to distribute data among indexers more uniformly.
forceTimebasedAutoLB = [true|false]
* Will force existing streams to switch to a newly elected indexer every AutoLB cycle.
* Defaults to false
We have noticed that on the splunk03 indexer instance, only two indexes are storing more data compared to the other indexes within the same instance.
The indexes consuming more data, net_firewall and srv_unix, hold terabytes of data compared to the same indexes on the rest of the indexer instances (splunk01, 02, 04 and 05).
My question: almost 30 indexes are indexing the same amount of data on splunk03 except these two (net_firewall and srv_unix), and we have checked indexes.conf and it shares the same configuration across all indexer instances. So in this case, could you please help figure out what we need to check to fix this issue?
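One way to quantify the skew is to compare event counts for those two indexes per indexer; splunk_server carries the indexer name in tstats results, and the percentage math here is just an illustrative sketch:

```
| tstats count where index=net_firewall OR index=srv_unix by splunk_server, index
| eventstats sum(count) as total by index
| eval pct=round(count/total*100,1)
```

If splunk03 shows a large pct for both indexes while the other four are roughly equal, the imbalance is in data distribution from the forwarders rather than in indexer-side retention settings.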
I have seen this when network connectivity is slow/poor. If one of the indexers is rock solid and the rest are slow to respond, to the point that they time out frequently, this is exactly what you will see.