Hi Splunkers,
I have a Splunk cluster with 1 SH, 1 CM, 1 HF, and 3 indexers. The CM is configured so that the forwarders and the SH connect to the indexers using indexer discovery. The whole setup works well when we don't have any issues. However, when the indexers are not accepting connections (sometimes, when we are overusing the license, we flip the receiving port on the indexers to xxxx so that no data from the forwarders is accepted), the network activity (read/write) on the Splunk Search Head takes a hit. The Search Head becomes completely unusable at this point.
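For context, the indexer discovery setup is roughly as below; the hostnames, group names, and the key are placeholders, not our real values.

# server.conf on the CM: enable indexer discovery
[indexer_discovery]
pass4SymmKey = <discovery_key>

# outputs.conf on the forwarders and the SH: poll the CM for the current peer list
[indexer_discovery:prod_cluster]
pass4SymmKey = <discovery_key>
master_uri = https://cm.example.local:8089

[tcpout:prod_indexers]
indexerDiscovery = prod_cluster

[tcpout]
defaultGroup = prod_indexers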
Has anyone faced a similar issue like this, or am I missing any setting during the setup of Indexer discovery?
Thanks,
Pravin
Can you explain in more detail how and when you do this “flip”? You probably know that once you have already gone over your license for the day, it doesn't really matter how much more you ingest?
When you “flip” an indexer's receiving port, the node reports the new port to indexer discovery, which updates its list. When the forwarders poll the CM, they are told the new port and update their targets based on that. If/when you have a firewall between your sources and the indexers, it will block connections on the new port and the forwarders can no longer send events to the indexers.
But if you have UFs configured with a static host+port combination, they keep trying to send to those old targets. If your SH and other infrastructure nodes use indexer discovery, they start using the new ports. Of course, if there are no firewall openings between those nodes and the indexers, traffic stops, and once the queues get full other issues will probably arise.
You should check that those “flipped” ports are open between the SH and the indexers; then your environment should work as expected.
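Once the SH's forwarding catches up again, you can also check from its internal logs whether it was failing to connect; a sketch, with the host value as an example only:

index=_internal host=sh01.example.local sourcetype=splunkd component=TcpOutputProc (log_level=WARN OR log_level=ERROR)
| timechart span=5m count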
Whether this is the best way to avoid license overuse is another story!
Hi @isoutamo ,
You may be aware that Splunk has panels that record license warnings and violations, but once the number of warnings/violations (I assume it was 5) within a 30-day window is exceeded, Splunk would cut off the data intake, and the panels become unusable.
To make sure that data isn't completely cut off, we built an app at our company that keeps track of whenever we hit the mark of 3 violations in a 30-day rolling period. Upon hitting that mark, the port flip comes into action and flips the default receiving port from 9997 to XXXX, some random letters, because indexer discovery would otherwise advertise the new port as well once the indexer is restarted.
This strategy was initially implemented as a port switch from 9997 to 9998, with the forwarders' outputs.conf configured in the usual static way, listing the indexers in <server>:<port> format, but it was later reworked to suit the indexer discovery technique.
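For reference, the old static setup looked roughly like this (hostnames are placeholders, not our real names):

# outputs.conf on the forwarders (classic static setup)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.local:9997, idx2.example.local:9997, idx3.example.local:9997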
What was strange is that we never had network issues on the search head with the classic forwarding technique, but we do notice them with the indexer discovery technique.
Also, to confirm the problem exists only with indexer discovery, I simulated the same thing in a test environment and noticed worse network usage when the indexers are not reachable, but the Search Head there was still usable. The only difference between the two environments is that production has a lot of incoming data to the indexers, and its SH also acts as the license master for a lot of other sources, whereas the test environment doesn't.
The data flow begins again when we switch the ports back to 9997 after midnight, once the new day's license period starts, and the SH returns to its normal state.
Yes, I'm aware that the MC has panels where you can see some statistics about license usage. Unfortunately, those are not aligned with the current license policy 😞
If you have an official Enterprise license, the policy is 45 violations in 60 days, not 3 in 30. And if your license size is 100GB+, violations are non-blocking for searches.
And regardless of your license, there is no blocking on the ingestion side. So even after a hard license violation, ingestion keeps working, but you cannot run any searches except against the internal indexes (so you can figure out why the violation happened)!
My expectation is that, since your LM is on your SH and you are sending your internal logs (including those the LM needs) to your indexer cluster, when you block forwarding with the wrong port information you get hit by a "missing connection to LM" problem instead of just pausing the indexing of internal logs.
In any case, you shouldn't block internal logs, as those are not counted towards your license usage. Only real data sent to indexes other than _* counts as indexed data, and usually all of that data comes from UFs/HFs, not from your SH, etc.
So my proposal is that you just switch the receiving port of the indexers to some valid port that the firewall allows only from the SH side. Then your SH (and the LM on the SH) can continue sending their internal logs to the indexers, and everything should keep working. At the same time, all UFs with static indexer information cannot send events, as the receiving port has changed. If you have any real data inputs on the SH, you should set up an HF and move those inputs there.
Of course, the real fix is to buy a big enough Splunk license....
Hi @isoutamo ,
Thanks for the answer. Could you please provide more clarity on this part?
So my proposal is that you just switch the receiving port of the indexers to some valid port that the firewall allows only from the SH side. Then your SH (and the LM on the SH) can continue sending their internal logs to the indexers, and everything should keep working. At the same time, all UFs with static indexer information cannot send events, as the receiving port has changed. If you have any real data inputs on the SH, you should set up an HF and move those inputs there.
Can an indexer have multiple receiving ports? And, if so, how do I set that up?
Thanks,
Pravin
Yes, you can define several ports if needed by adding a new receiver to those indexers through an app pushed from the CM. Did I understand correctly that you flip your receiving port to some invalid value, or something like that?
Basically, you could have a separate port reserved for internal nodes, which the firewall blocks for normal traffic from UFs etc. and allows only from the SH and similar nodes. Another receiving port is for all other UFs and IHFs (intermediate heavy forwarders). Then, when you need to block real incoming indexing traffic, just disable that port. The SH stops using it, as indexer discovery tells it that the port is closed, and it continues to use the SH-only port.
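A minimal sketch of such an app pushed from the CM; the app name and port numbers are just examples:

# On the CM: manager-apps/two_receivers/local/inputs.conf (master-apps on older versions),
# then push it to the peers with "splunk apply cluster-bundle"

# Port for UFs and IHFs - disable this stanza when you need to stop real ingestion
[splunktcp://9997]
disabled = 0

# SH-only port, allowed only from the SH by the firewall - leave this one enabled
[splunktcp://9996]
disabled = 0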
But as I said, you should update your license to cover your real ingestion needs, or remove unnecessary ingestion.
When we set up the cluster, the SH, CM, and indexers stay connected over the management port 8089 and keep sending _internal logs no matter what, while the forwarders use the receiving port 9997 to send data to the indexers.
In our case, we only flip the port to XXXX or 9998, depending on the type of forwarding setup used.
We have controlled data ingestion and always stay within limits, but sometimes unexpected testing causes a high input flow, and thus, we have to take measures to make sure we don't breach the license.
If I understood you correctly, you flip the indexer receiving port whenever you have a license overage, and then your SH becomes unresponsive.
The SH becomes unresponsive because it keeps trying to communicate with those indexers, waiting through timeouts and retries. This causes high network activity, resource exhaustion, and an unresponsive UI.
If the indexers are unreachable, the SH waits for each connection to time out, which can eventually block UI and search activity.
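If you keep the port flip anyway, the outputs.conf settings on the SH below control how long it blocks and how large its output queue can grow. This is only a sketch; the values are examples, not recommendations:

# outputs.conf on the SH
[tcpout]
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
maxQueueSize = 512KB
# -1 (the default) blocks when the queue is full; a positive value drops events after that many seconds
dropEventsOnQueueFull = -1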
I would suggest not using port flipping as a workaround; it destabilizes the cluster and the SH. Instead, address license overages at the source (reduce ingestion, increase the license, or rely on Splunk's own enforcement).
Also, a quick workaround is to restart the SH, which will clear the active connections. But the best approach is to address license issues at the source rather than blocking ingestion.
Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!
Hi @Prewin27 ,
Thanks for your response.
We used to connect the forwarders to the peers the classic way, with the same port-flip technique, but we didn't have the SH slowness. This makes me wonder whether I am missing something in the indexer discovery configuration.
Also, to get back to the same issue: I always checked whether the SH responds, but on the flip side, I never checked the indexers' UI. Should all the UIs fail due to cluster instability, and not just the SH's?
When this happened, I did try restarting the SH, but it was of no use.
As for addressing license overages at the source, reducing ingestion or increasing the license is not possible at this stage because overages are a rare, one-off scenario, but relying on Splunk's license enforcement is something I can do.
Is there a way I can cut off the data when I am approaching a license breach?
One more important thing to note is how Splunk licensing works. Splunk logs license usage through the _internal index, and the meter is based on license.log, but even if license.log is indexed late with a time delay, the license usage is still attributed to the _time of the data and not the index time.
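As a rough way to check that delay against the usage events (which, as far as I know, land in license_usage.log), something like this should work; a sketch, not verified against our environment:

index=_internal source=*license_usage.log* type=Usage
| eval delay_s = _indextime - _time
| stats avg(delay_s) as avg_delay_s max(delay_s) as max_delay_s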
Any thoughts on this process, @livehybrid @Prewin27?
Thanks,
Pravin
I don't think the SH unresponsiveness is caused by a configuration issue in your indexer discovery. With your classic approach (a static indexer list), the indexers themselves remained reachable and responsive to the SH for search and metadata operations. The SH normally becomes slow or unresponsive when it cannot communicate with the indexers for distributed search.
You can set up alerts on the license master to notify you as you approach your daily license limit.
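For example, a simple scheduled search on the license master along these lines; the quota value is a placeholder you would replace with your own license size:

index=_internal source=*license_usage.log* type=Usage earliest=@d latest=now
| stats sum(b) as bytes
| eval used_gb = round(bytes / 1024 / 1024 / 1024, 2)
| eval quota_gb = 100
| eval pct_used = round(100 * used_gb / quota_gb, 1)
| where pct_used >= 80

Schedule it to run every hour or so and trigger an alert when it returns a result.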
If you must block data, do so at the forwarder level (disable the outputs, or even disable the firewall port if possible, though that is not recommended). You can also consider using a null queue to drop data at the HF.
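A sketch of the null-queue approach on the HF; the source path is just an example:

# props.conf on the HF
[source::/var/log/noisy_app/*.log]
TRANSFORMS-drop_noisy = drop_all_events

# transforms.conf on the HF
[drop_all_events]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue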
Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!
Hi @_pravin
This isn't really a suitable way to manage your excessive license usage. If your SH is configured to send its internal logs to the indexers (which it should be), then it will start queueing data once the indexers stop accepting connections, and it sounds like this queueing could be what is slowing down your SH.
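You can check whether the SH's output queues are filling up with something like this (a sketch; adjust the host filter to your SH, and note that the SH's own metrics only become searchable once its forwarding catches up again):

index=_internal host=sh01.example.local source=*metrics.log* group=queue name=tcpout*
| eval pct_full = round(100 * current_size_kb / max_size_kb, 1)
| timechart span=5m max(pct_full) by name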
I think the best thing to do would be to focus on reducing your license ingest. Are there any data sources which are unused or could be trimmed down?
Hi @livehybrid ,
The SH sends its internal logs to the indexers as well, but earlier we used the classic way of connecting the forwarders and didn't have the SH slowness. This makes me wonder whether I am missing something in the indexer discovery configuration.
To get back to your comment: when you say the SH starts queueing data, is there an intermediate queue on the SH, or on any forwarder, that stores data when it's unable to connect to the indexers (the data layer)?
Thanks,
Pravin