Getting Data In

HEC HttpInputAckService pending queries

beneteos
Explorer

Hello,

We set up HEC HTTP inputs for several data flows with their related tokens, and we enabled the ACK feature on this configuration (following https://docs.splunk.com/Documentation/Splunk/9.1.2/Data/AboutHECIDXAck).

We run a distributed infrastructure: one search head and two indexers (not clustered).

Everything was OK with HEC, but after some time we got our first error events:

ERROR HttpInputDataHandler [2576842 HttpDedicatedIoThread-0] - Failed processing http input, token name=XXXX [...] reply=9, events_processed=0
INFO HttpInputDataHandler [2576844 HttpDedicatedIoThread-2] - HttpInputAckService not in healthy state. The maximum number of ACKed requests pending query has been reached.

The server-busy error (reply=9) makes HEC unavailable, but only for the token(s) whose maximum number of ACKed requests pending query has been reached. Restarting the indexer is enough to clear the problem, but by then many logs have been lost.

We did some research and tried customizing a few settings, but we only managed to delay the 'server busy' problem (from one week to one month).

Has anyone experienced the same problem? How can we keep those pending-query counters from growing?

Thanks a lot for any help.

etc/system/local/limits.conf
[http_input]
# The max number of ACK channels.
max_number_of_ack_channel = 1000000
# The max number of acked requests pending query.
max_number_of_acked_requests_pending_query = 10000000
# The max number of acked requests pending query per ACK channel.
max_number_of_acked_requests_pending_query_per_ack_channel = 4000000

etc/system/local/server.conf
[queue=parsingQueue]
maxSize=10MB

maxEventSize = 20MB
maxIdleTime = 400
# Set because we use a load balancer; the same cookie is also set on the LB.
channel_cookie = AppGwAffinity

richgalloway
SplunkTrust

HEC ACKs require the client to specifically ask for the status.  Does your HEC client do that?  It can't just throw events at Splunk and hope to get an ACK.  The client has to say "did you index it, yet"?  See https://docs.splunk.com/Documentation/Splunk/9.1.2/Data/AboutHECIDXAck#Query_for_indexing_status
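To sketch what that status query looks like (helper names are mine; the request and response JSON shapes are from the HEC indexer acknowledgment docs, and the host, port, token, and channel GUID would be supplied by the caller):

```python
import json

def build_ack_query(ack_ids):
    """Body for POST https://<host>:<port>/services/collector/ack?channel=<GUID>."""
    return json.dumps({"acks": ack_ids})

def indexed_ack_ids(response_body):
    """Return the ACK IDs the indexer reports as durably indexed.

    The ack endpoint answers with e.g. {"acks": {"0": true, "1": false}};
    true means the event behind that ACK ID has been written to disk.
    """
    acks = json.loads(response_body)["acks"]
    return sorted(int(ack_id) for ack_id, done in acks.items() if done)
```

The client keeps re-polling the IDs still reported false; once an ID has been retrieved as true, Splunk drops it from its pending-query table, which is exactly what keeps that counter from growing.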

---
If this reply helps you, Karma would be appreciated.

beneteos
Explorer

Hello,

Thanks for your answer, but I read the Splunk documentation differently.
If you were right, the HEC service would be down within a few hours of startup, or less.

As explained in the Splunk documentation (see the diagram), HEC responds with an acknowledgment for each event sent, and you can additionally send a request for a particular event to verify its status: "Each time a client sends a request to the HEC endpoint using a token with indexer acknowledgment enabled (1), HEC returns an acknowledgment identifier to the client (2)."
https://docs.splunk.com/Documentation/Splunk/9.1.2/Data/AboutHECIDXAck#Query_for_indexing_status
1. Client send HEC request with event data
2. HEC acks the request once event is indexed

HEC clients don't need to ask for status for events to be indexed properly (millions each day), but after a while the indexers become busy because the maximum number of pending requests is reached. I already increased that limit, so now I need to understand why these pending queries keep accumulating.

So my problem is the pending requests and why they increase like that. I don't see any errors in the metrics, but the counters don't seem to be cumulative (because Splunk Enterprise deletes status information after clients retrieve it):

(screenshot of HEC metrics: beneteos_0-1704801192512.png)

I cannot control the HEC client's behavior beyond basic settings (for information, the client is Akamai DataStream).


richgalloway
SplunkTrust

The steps seem pretty clear in the docs.

1) Send data to HEC

2) Get an ACK *ID* in response

3) Use the ACK ID to confirm the data has been written

To verify that the indexer has indexed the event(s) contained in the request, query the [https://<host>:<port>/services/collector/ack] endpoint

Indexers get pending queries because the client has not closed them by requesting the status.
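To illustrate: with ACK enabled, the event POST in step 2 returns an ackId handle, not a confirmation of indexing. A minimal parsing sketch (function name is mine; the response shape is from the HEC docs):

```python
import json

def extract_ack_id(event_post_response):
    """Pull the ackId from a HEC event-POST response.

    With indexer acknowledgment enabled, a successful POST to
    /services/collector returns e.g.
    {"text": "Success", "code": 0, "ackId": 7}. The ackId is a handle
    for a later status query, not proof the event has been indexed.
    """
    body = json.loads(event_post_response)
    if body.get("code") != 0:
        raise RuntimeError(f"HEC rejected the request: {body}")
    return body["ackId"]
```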


beneteos
Explorer

But step 3 you mention is optional, in the sense that status requests are not required for events to be indexed (I can verify my data is present and events are logged), so I didn't expect this behavior.

Once the maximum number of pending requests is reached, the channels for the related token go into busy status, which leads to loss of logs until I restart the service.

I tried increasing max_number_of_acked_requests_pending_query, but that only lets me postpone the deadline, and setting a huge value could also have a negative impact on server health.

As I cannot control anything on the client except the channel and authorization headers, and as the client does not seem to make status requests (per the firewall logs), I will try setting maxIdleTime below 60, since the client sends data every 60 seconds.
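For the record, a sketch of what I plan to try (verify the key names and stanza location against the inputs.conf spec for your version; the values are our assumptions):

```
# etc/system/local/inputs.conf
[http]
# If supported by your version, remove ACK channels (and their
# pending queries) once they have been idle for maxIdleTime seconds
ackIdleCleanup = true
# Client pushes every 60 s, so expire idle channels just under that
maxIdleTime = 50
```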

Thanks


richgalloway
SplunkTrust

The documentation does not say step 3 is optional.  That you can see your data confirms it is present, but that is not the same thing as fetching the ACK.

Restarting the service clears the pending ACKs and re-enables reception of data.  Fetching the ACKs will also re-enable reception without a restart.

If the client cannot fetch ACKs then I suggest turning off HEC ACK.
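If you go that route, it is a per-token switch (the token stanza name below is a placeholder):

```
# etc/system/local/inputs.conf on the indexers
[http://datastream_token]
useACK = 0
```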
