Hi,
This issue relates to why "in-flight" messages keep piling up in the queue of an SQS-based S3 input for the aws:config sourcetype.
I have found a correlation between the SQS queue message backlog and a warning message in the splunk_ta_aws_aws_sqs_based_s3 logs.
The Splunk Add-on for AWS is NOT deleting messages from the SQS queue whenever there is a log entry like the one below, which indicates that the message contains no S3 files to process.
2018-02-07 16:07:35,559 level=WARNING pid=24549 tid=Thread-2 logger=splunk_ta_aws.modinputs.sqs_based_s3.handler pos=handler.py:_process:248 | start_time=1518019605 datainput="AWS-Config-debug", created=1518019655.52 message_id="67c27878-01d5-4896-bba7-4df1d9f6946c" job_id=0702de8f-9e66-4f1c-9608-644d6d3cc12b ttl=300 | message="There's no files need to be processed in this message."
However, the add-on IS deleting messages from the SQS queue whenever there is a log entry like the one below, which indicates that the message does reference an S3 object.
2018-02-07 16:07:34,815 level=INFO pid=24549 tid=Thread-1 logger=splunk_ta_aws.modinputs.sqs_based_s3.handler pos=handler.py:_index_summary:351 | start_time=1518019605 datainput="AWS-Config-debug", created=1518019654.27 message_id="b9d0d088-3e7c-40c2-99ac-a66fba6adf20" job_id=5bb3a90a-ff87-41fe-9633-91c67621e8e4 ttl=300 | message="Sent data for indexing." last_modified="2018-02-07T16:07:23Z" key="s3://mycompany-security-config-bucket-us-east-1/mycompany.IO-config-logs/AWSLogs/567090277256/Config/us-east-1/2018/2/7/ConfigSnapshot/567090277256_Config_us-east-1_ConfigSnapshot_20180207T160722Z_11fb9f0e-00b0-42f5-a5d3-3462d8814f9b.json.gz" size=48526
Why can't the AWS TA delete queue messages when no S3 object is referenced in the message?
Could this be a bug in the Splunk Add-on for AWS, or can it be alleviated by a queue tuning setting in AWS SQS and/or the Splunk AWS TA?
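For background on why undeleted messages show up as "in flight": an SQS message that a consumer receives but never deletes stays in flight until its visibility timeout expires, after which it returns to the queue. Below is a rough, generic sketch of that consume/delete pattern with boto3 (not the add-on's actual code; the queue URL and the "Records" parsing are placeholders):

# Generic illustration of SQS consume/delete semantics (NOT the add-on's code).
# A received message counts as "in flight" until it is either deleted or its
# visibility timeout lapses, at which point it returns to the queue.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-config-queue"  # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,     # long polling
    VisibilityTimeout=300,  # matches the ttl=300 shown in the logs above
)

for msg in resp.get("Messages", []):
    body = json.loads(msg["Body"])
    # S3 event notifications normally carry a "Records" list pointing at objects.
    if body.get("Records"):
        # ... fetch and index the referenced S3 objects, then acknowledge:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    # else: if the message is never deleted, it stays in flight until the
    # visibility timeout expires and it reappears in the queue -- the
    # backlog pattern described above.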
Thanks in advance for any hint!
best regards,
Shreedeep
Here is what I found; the situation could be improved by tweaking the code. By default, the add-on is configured to check its credentials every 30 seconds. Once expiry is detected, the modular input goes down ungracefully, leaving many messages "in flight".
I changed the value on line 76 below to 14400, i.e. a check every 4 hours. This reduced the in-flight count by 95+%.
etc/apps/Splunk_TA_aws/bin/splunksdc/config.py
74     def has_expired(self):
75         now = time.time()
76         if now - self._last_check > 30:
77             self._last_check = now
78             self._has_expired = self._check()
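For anyone who would rather not patch a hard-coded number, here is a minimal sketch of what a configurable check interval could look like. CREDENTIAL_CHECK_INTERVAL, the environment variable, and the ConfigMonitor class are illustrative names only, not part of the shipped TA:

# Minimal sketch of making the interval configurable rather than hard-coding it.
# CREDENTIAL_CHECK_INTERVAL, the environment variable and the ConfigMonitor class
# are illustrative names only, not part of the shipped TA.
import os
import time

CREDENTIAL_CHECK_INTERVAL = int(os.environ.get("CREDENTIAL_CHECK_INTERVAL", "14400"))

class ConfigMonitor:
    def __init__(self, check):
        self._check = check            # callable that returns True when the config has changed
        self._last_check = time.time()
        self._has_expired = False

    def has_expired(self):
        now = time.time()
        if now - self._last_check > CREDENTIAL_CHECK_INTERVAL:
            self._last_check = now
            self._has_expired = self._check()
        return self._has_expired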
Did you ever determine the cause of this issue?
nope.
But I'm observing that this problem can be avoided by reading the data source directly via a plain SQS input, where possible, instead of the SQS-based S3 input. A rough sketch of that idea is below.
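To illustrate, here is a rough sketch of consuming the queue directly: every received message is indexed from its body and deleted right away, so nothing lingers in flight. The queue URL and the index_event() helper are placeholders, not the add-on's SQS input:

# Rough sketch of a plain SQS consumer: index each message body and delete the
# message immediately, so no messages accumulate in flight.
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-config-queue"  # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")

def index_event(raw_body):
    print(raw_body)  # stand-in for handing the event to Splunk

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        index_event(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])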