Splunk Enterprise

Network issues when indexer discovery fails

_pravin
Contributor

Hi Splunkers,

I have a Splunk cluster with 1 SH, 1 CM, 1 HF, and 3 indexers. Forwarders and the SH connect to the indexers using indexer discovery through the CM. All of this works well under normal conditions. However, when the indexers are not accepting connections (sometimes, when we are overusing the license, we flip the receiving port on the indexers to xxxx so that no forwarder data is accepted), the network activity (read/write) on the Splunk Search Head takes a hit, and the Search Head becomes completely unusable.
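
For context, a minimal sketch of the kind of forwarder/SH-side outputs.conf we have for indexer discovery (the CM hostname and key below are placeholders, not the real values):

# outputs.conf on the forwarders and the SH (sketch)
[indexer_discovery:cluster1]
master_uri = https://cm.example.local:8089
pass4SymmKey = <discovery_key>

[tcpout:discovered_indexers]
indexerDiscovery = cluster1

[tcpout]
defaultGroup = discovered_indexers

The CM carries the matching [indexer_discovery] stanza and pass4SymmKey in its server.conf.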

Has anyone faced a similar issue like this, or am I missing any setting during the setup of Indexer discovery?

Thanks,
Pravin

0 Karma

isoutamo
SplunkTrust

Can you explain in a bit more detail how and when you do this “flip”? You probably know that when you have already overused your license, it doesn't matter how much you do it?

When you “flip” your indexer receiving port, indexer discovery updates that information in its list. Then, when someone asks, it hands out the new port, and the UFs simply update their targets based on it. If/when you have a firewall between your sources and the indexers, it will block those connections and they can no longer send events to the indexers.

But if your UFs are configured with a static host+port combination, they keep trying to send to those targets continuously. If your SHs and other infrastructure nodes use indexer discovery, they start using the new ports. Of course, if there are no firewall openings between those nodes and the indexers, traffic stops, and once the queues fill up, other issues will probably arise.

You should check that those “flipped” ports are open between the SHs and the indexers; then your environment should work as expected.
Whether this is the best way to avoid license overuse is another story!

_pravin
Contributor

Hi @isoutamo ,

You may be aware that Splunk has panels that record license warnings and breaches, but once the number of warnings/breaches (I believe it was 5) within the 30-day window is exceeded, Splunk cuts the data intake and the panels become unusable.

To make sure data isn't completely cut off, we built an app at our company that keeps track of when we hit the mark of 3 breaches in a 30-day rolling period. Upon hitting that mark, the port flip comes into action and changes the default receiving port from 9997 to XXXX (some random string of letters), because indexer discovery will pick up the new port as well once the indexer is restarted.

This strategy was initially implemented as a port switch from 9997 to 9998, with the forwarders' outputs.conf configured in the usual static way, listing targets in <server>:<port> format; it was later reworked to use the indexer discovery technique.
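
For reference, the earlier static forwarder-side configuration looked roughly like this (placeholder hostnames, not our exact file):

# outputs.conf, classic static target list
[tcpout:static_indexers]
server = idx1.example.local:9997, idx2.example.local:9997, idx3.example.local:9997

[tcpout]
defaultGroup = static_indexers

The rework simply replaced the static server list with the indexerDiscovery setting pointing at the CM, as shown earlier.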

What was strange about this technique is that we never had network issues on the search head with the classic forwarding technique, but we do see them with indexer discovery.

Also, to confirm that the problem only appears with indexer discovery, I simulated the same thing in a test environment and saw worse network usage when the indexers were unreachable, but the Search Head there remained usable. The only differences between the two environments are that production has a lot of incoming data to the indexers, and the production SH also acts as the license master for many other sources, whereas the test environment doesn't.

Data flow resumes once we switch the ports back to 9997 after midnight, when the new license day starts, and the SH returns to its normal state.

0 Karma

isoutamo
SplunkTrust

Yes, I'm aware that the MC (Monitoring Console) has panels where you can see some statistics about license usage. Unfortunately, those are not aligned with the current license policy 😞

If you have an official Enterprise license, the policy is 45 warnings in 60 days, not 3 in 30. And if your license size is 100GB or more, it's non-blocking for searches.

And regardless of your license, there is no blocking on the ingestion side. So even if you have a hard license breach, ingestion keeps working, but you cannot run any searches except against the internal indexes (to figure out why you breached)!

I expect that, as you have the LM on your SH and you send your internal logs (including the ones the LM needs) to your indexer cluster, then when you block indexer discovery with wrong port information you end up hit by "missing connection to LM" problems on top of not being able to index the internal logs.

Anyhow, you shouldn't flip away any internal logs, as those are not counted towards your license usage. Only real data sent to indexes other than _* is counted as indexed data, and usually all of that data comes from UFs/HFs, not from your SH etc.

So my proposal is that you just switch the receiving port of the indexers to some valid port that is allowed only from the SH side by the firewall. Then your SH (and the LM on the SH) can continue sending their internal logs to the indexers and everything should work. At the same time, all UFs with static indexer information can no longer send events, as the receiving port has changed. If you have any real data inputs on the SH, you should set up an HF and move those inputs there.
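
As a rough sketch, something like this in an inputs.conf pushed from the CM to the indexers (9998 here is only an example port, and the firewall rule is handled outside Splunk):

# disable the receiver that the UFs are pointing at
[splunktcp://9997]
disabled = 1

# receiver kept open only for SH/LM traffic; allow this port only from the SH in your FW
[splunktcp://9998]
disabled = 0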

Of course, the real fix is to buy a big enough Splunk license...

_pravin
Contributor

Hi @isoutamo ,

Thanks for the answer. Could you please provide more clarity on this part?

So my proposal is that you just switch the receiving port of the indexers to some valid port that is allowed only from the SH side by the firewall. Then your SH (and the LM on the SH) can continue sending their internal logs to the indexers and everything should work. At the same time, all UFs with static indexer information can no longer send events, as the receiving port has changed. If you have any real data inputs on the SH, you should set up an HF and move those inputs there.

Are there multiple receiving ports for an indexer? And if so, how can I set that up?

Thanks,

Pravin

0 Karma

isoutamo
SplunkTrust

Yes, you can define several ports if needed by adding a new receiver on the indexers via an app deployed from the CM. I'm not sure I understood correctly: do you flip your receiver port to some invalid value, or something like that?

Basically, you can have a separate port reserved for internal nodes, blocked by the firewall from normal UF traffic and allowed only from the SH etc. Another receiver port is for all other UFs and IHFs (intermediate heavy forwarders). Then, when you need to block real incoming indexing traffic, just disable that second port. The SH stops using it, as indexer discovery tells it that the port is closed, and it keeps using the SH-only port.
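
Roughly like this in an app pushed from the CM to all indexers (the port numbers are just examples, and I'm assuming your FW rules follow the same split):

# receiver reserved for the SH and other internal nodes; FW allows it only from the SH
[splunktcp://9996]
disabled = 0

# receiver for UFs and IHFs; set disabled = 1 here when you need to stop real indexing traffic
[splunktcp://9997]
disabled = 0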

But, as I said, you should update your license to cover your real ingestion needs, or remove unnecessary ingestion.

_pravin
Contributor

When we set up the cluster, the SH, CM and the indexers stay connected over the management port 8089 and keep sending _internal logs no matter what, while the forwarders use the receiving port 9997 to send data to the indexers.

In our case, we only flip the port to XXXX or 9998, depending on the type of forwarding setup used.

We have controlled data ingestion and normally stay within limits, but sometimes unexpected testing causes a spike in input volume, and we have to take measures to make sure we don't breach the license.

0 Karma

isoutamo
SplunkTrust

Port 8089 is used only for REST API requests and responses, not for sending logs! You need a separate port for that, like 9997 in the normal situation. It doesn't matter what the port is, just make sure it's allowed in all firewalls between the SH and the indexers.

When you flip the port to XXXX or 9998, indexer discovery tells the SH that a new receiver port has been activated, so the SH starts using it and drops the previous 9997. If there is, for example, a firewall blocking traffic from the SH to the indexers on those new ports, the SH can't work as expected and, I expect, when it later loses access to its LM logs, the other issues you have mentioned start to appear.

You should find some hints in your instances' internal logs if this is really what has happened.
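
For example, something like this on the SH side should show whether its forwarding output was struggling during the flip (just a sketch; replace <your_sh> with the real host name):

index=_internal host=<your_sh> sourcetype=splunkd component=TcpOutputProc (log_level=WARN OR log_level=ERROR)
| timechart count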

PrewinThomas
Builder

@_pravin 

If I understood you correctly, you flip the indexer receiving port whenever you have a license overage, and your SH becomes unresponsive.
The SH becomes unresponsive because it keeps trying to communicate with those indexers, waiting through timeouts and retries. This causes high network activity, resource exhaustion, or general unresponsiveness.
If the indexers are unreachable, the SH waits for each connection to time out, which can eventually block UI and search activity.

I would suggest not using port flipping as a workaround; it destabilizes the cluster and the SH. Instead, address license overages at the source (reduce ingestion, increase the license, or use Splunk's enforcement).

A quick workaround is to restart the SH, which clears the active connections, but the best approach is to address license issues at the source rather than blocking ingestion.

Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!

_pravin
Contributor

Hi @PrewinThomas ,

Thanks for your response.

We used to have the classic way of connecting the forwarders to the peers, with the same port flip technique, but we didn't have the SH slowness. This makes me wonder whether I am missing something in the configuration of indexer discovery.

Also, to get back to the same issue: I always checked whether the SH responds, but on the flip side, I never checked the indexers' UI. Should all of the UIs fail due to cluster instability, and not just the SH's?

When this happened, I tried the SH restart, but it was of no use.

Addressing license overages at the source by reducing ingestion or increasing the license is not possible at this stage, because overages are a rare one-off scenario, but Splunk's license enforcement is something I can use.

Is there a way I can cut off the data when I am approaching a license breach?

One more important thing to note is how Splunk license metering works. Splunk records license usage through the _internal index, and the meter is driven by license_usage.log, but if that file is indexed late, with a time delay, the license usage is still attributed to the _time of the data and not the index time.
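
For reference, daily metered usage can be checked against license_usage.log on the license master with something like this (a sketch, not exactly what our app runs):

index=_internal source=*license_usage.log* type=Usage
| eval GB = round(b/1024/1024/1024, 3)
| timechart span=1d sum(GB) AS daily_indexed_GB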

Any thoughts on this process, @livehybrid @PrewinThomas?

Thanks,

Pravin

0 Karma

PrewinThomas
Builder

@_pravin 

I don't think the SH unresponsiveness is caused by a configuration issue in your indexer discovery. With your classic approach (static indexer list), the indexers themselves remained reachable and responsive to the SH for search and metadata operations. The SH normally becomes slow or unresponsive when it cannot communicate with the indexers for distributed search/query.

You can set up alerts on the license master to notify you as you approach your daily license limit.
If you must block data, do so at the forwarder level (disable outputs, or even disable the FW port if possible, though that's not recommended). You can also consider using a null queue to drop data at the HF, as sketched below.
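
A minimal nullQueue sketch (the source pattern is just an example, match it to whatever data you need to drop):

# props.conf on the HF
[source::/var/log/noisy_app/*.log]
TRANSFORMS-drop_noisy = drop_all_noisy

# transforms.conf on the HF
[drop_all_noisy]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue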


Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!

0 Karma

livehybrid
Super Champion

Hi @_pravin 

This isn't really a suitable way to manage your excessive license usage. If your SHs are configured to send their internal logs to the indexers (which they should be), then they will start queueing data, and it sounds like this queueing could be what is slowing down your SHs.
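
You could confirm this on the SH with something like the following (a sketch; blocked=true on group=queue lines in metrics.log is the usual sign of full output queues):

index=_internal host=<your_sh> source=*metrics.log* group=queue blocked=true
| timechart count by name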

I think the best thing to do would be to focus on reducing your license ingest. Are there any data sources which are unused or could be trimmed down?

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing

0 Karma

_pravin
Contributor

Hi @livehybrid ,

The SH sends its internal logs to the indexers as well. We used to have the classic way of connecting the forwarders earlier and didn't have the SH slowness, which makes me wonder whether I am missing something in the configuration of indexer discovery.

To get back to your comment: when you say the SH starts queueing data, is there an intermediate queue on the SH, or in a forwarder, that stores data when it's unable to connect to the indexers (the data layer)?

Thanks,

Pravin

0 Karma