I am pulling data from 30-40 log groups across 3 different regions using the Splunk Add-on for AWS. After about 10-15 minutes, I stop receiving the most up-to-date events from half of my log groups. I receive data from all log groups just fine initially, but after the add-on pulls the most recent data available at the time, it never checks for more. The delay and interval settings are at their defaults, and I've confirmed that the most current events are reaching the CloudWatch Logs service. My only clue is this event in the Splunk internal logs, which occurs for the affected log groups:
2015-12-08 17:52:22,328 INFO pid=7026 tid=Thread-298 file=aws_cloudwatch_logs.py:_do_was_job_func:130 | Previous job of the same task still running. Exit current job. region=us-west-2, log_group=syslog
This event recurs indefinitely every 10 minutes, and Splunk never pulls data from the log group again.
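If I'm reading that message right, the input seems to use a non-blocking, single-flight guard per (region, log group) task: when the previous job hasn't finished, the new one exits immediately. A rough Python sketch of that pattern, purely my reconstruction and not the add-on's actual code:

    import threading

    # One lock per (region, log_group) task. A job that cannot acquire the
    # lock assumes the previous job is still running and exits immediately.
    _task_locks = {}

    def run_job(region, log_group, do_work):
        lock = _task_locks.setdefault((region, log_group), threading.Lock())
        if not lock.acquire(False):  # non-blocking acquire
            print("Previous job of the same task still running. Exit current job. "
                  "region=%s, log_group=%s" % (region, log_group))
            return
        try:
            do_work()
        finally:
            lock.release()

If do_work() ever hangs (say, stuck on a throttled API call), the lock is never released, so every later interval logs this message forever, which would match what I'm seeing.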
Any ideas?
Updating to the latest Amazon add-on version (3.0.0) fixed the issue for me.
I was able to work around this issue by limiting the time range of the data it polls, specifically the "Only After" value. This is under the Splunk Add-on for AWS console -> Inputs -> Actions -> Edit -> Templates.
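For reference, I believe the "Only After" template value maps to a key in the input's stanza in inputs.conf, something like the sketch below. The exact key name (only_after) and the datetime format are assumptions on my part and may vary by add-on version; the input name is a placeholder:

    [aws_cloudwatch_logs://syslog]
    only_after = 2015-12-01T00:00:00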
We resolved this issue with changing from direct cloudwatch logs to Kinesis, please check http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
We also got an answer from AWS:
[...] Instead you should use the Kinesis subscription integration that Splunk apparently provides, but does not use by default. The default Splunk integration only works for very small customers. You should reach out to Splunk for support if needed on how to use Splunk with CloudWatch Logs.
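For anyone else going this route: a minimal boto3 sketch of wiring a log group up to a Kinesis stream with a subscription filter, per the doc linked above. The stream ARN, role ARN, and log group name here are placeholders for your own resources:

    import boto3

    logs = boto3.client("logs", region_name="us-west-2")

    # Placeholder ARNs/names: substitute your own Kinesis stream and an IAM
    # role that CloudWatch Logs is allowed to assume (see the AWS doc above).
    logs.put_subscription_filter(
        logGroupName="syslog",
        filterName="splunk-kinesis",
        filterPattern="",  # an empty pattern forwards every event
        destinationArn="arn:aws:kinesis:us-west-2:123456789012:stream/splunk-logs",
        roleArn="arn:aws:iam::123456789012:role/cwl-to-kinesis",
    )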
I am also seeing the same throttling alerts in 4.1.1
Can confirm, throttling errors with version 4.1.0 and only 11 CloudWatch Logs log streams.
Failure in describing cloudwatch logs streams due to throttling exception for log_group=, sleep=5.98632069244, reason=Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/cloudwatch_logs_mod/aws_cloudwatch_logs_data_loader.py", line 64, in describe_cloudwatch_log_streams
    group_name, next_token=buf["nextToken"])
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 308, in describe_log_streams
    body=json.dumps(params))
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 576, in make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'__type': u'ThrottlingException', u'message': u'Rate exceeded'}
For what it's worth, @nickpayze, I'm seeing this on 3.0.0. 😞 Same throttling exception that you saw
We have this same issue running the latest version (4.1.0). It seems to run describe_log_streams against all log_groups at the same time, which is probably what causes the throttling. This is especially a problem when you have a large set of log_groups.
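For anyone poking at this from the API side: the usual way to survive a ThrottlingException from DescribeLogStreams is exponential backoff with jitter, roughly like the sketch below. This uses boto3 rather than the boto bundled with the add-on, and the retry limits are arbitrary:

    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    logs = boto3.client("logs", region_name="us-west-2")

    def describe_streams_with_backoff(group_name, max_retries=8):
        """Page through DescribeLogStreams, backing off when throttled."""
        streams, token = [], None
        while True:
            kwargs = {"logGroupName": group_name}
            if token:
                kwargs["nextToken"] = token
            for attempt in range(max_retries):
                try:
                    resp = logs.describe_log_streams(**kwargs)
                    break
                except ClientError as e:
                    if e.response["Error"]["Code"] != "ThrottlingException":
                        raise
                    # Exponential backoff with jitter, capped at ~60 seconds.
                    time.sleep(min(60, (2 ** attempt) + random.random()))
            else:
                raise RuntimeError("still throttled after %d retries" % max_retries)
            streams.extend(resp["logStreams"])
            token = resp.get("nextToken")
            if not token:
                return streams

Staggering the per-log-group jobs instead of launching them all at once would reduce the need for retries in the first place.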
Also seeing this issue on 4.0.0
I found a ThrottlingException ERROR in the internal logs that may be another clue. Could this be the culprit?
2015-12-10 16:21:51,357 ERROR pid=24928 tid=Thread-23 file=util.py:describe_cloudwatch_log_streams:118 | Failure in describing cloudwatch logs streams due to throttling exception for log_group=kern.log, sleep=5.96629281236, reason=Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/aws_cloudwatch_logs_resources/util.py", line 108, in describe_cloudwatch_log_streams
    group_name, next_token=buf["nextToken"])
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 308, in describe_log_streams
    body=json.dumps(params))
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 576, in make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'message': u'Rate exceeded', u'__type': u'ThrottlingException'}
I'm seeing the same behavior with Splunk running on Windows 7
What OS is being used to host Splunk?
Ubuntu 14.04
Ubuntu's dash shell handles SIGTERM differently than bash, resulting in orphaned input processes. This was meant to have been resolved in TA version 2.0.1 (which is why rpille asked which version). At first glance, it appears this condition is detected but only partially handled: additional processes aren't spawned while orphaned processes exist, yet the orphaned process is never terminated. I'll file a new bug for this and explore workarounds.
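In the meantime, a quick way to check for the orphaned input processes is to look for aws_cloudwatch_logs.py processes that have been re-parented to init (PPID 1). A rough, Linux-only sketch:

    import subprocess

    # List every process as "pid ppid command". On Linux, an orphaned process
    # is re-parented to init, so PPID 1 plus the script name is the tell.
    out = subprocess.check_output(["ps", "-eo", "pid,ppid,args"])
    for line in out.decode().splitlines()[1:]:
        pid, ppid, cmd = line.strip().split(None, 2)
        if ppid == "1" and "aws_cloudwatch_logs.py" in cmd:
            print("orphaned input process:", pid, cmd)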
Hi @nickpayze, can you try adding start_by_shell = false to the [aws_cloudwatch_logs] stanza in inputs.conf and restarting Splunk?
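That is, the stanza in inputs.conf should end up looking like this:

    [aws_cloudwatch_logs]
    start_by_shell = false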
Will I have to wait until this issue is resolved in the next version of the aws add-on?
Would you turn on the debug log and double-check whether you can find the log messages "Start to describe streams **" and "Job ended. region **" for each interval? The log group name should be printed in those messages.
I do not see any "Job ended" messages for any of my log groups.
I see many "Start to describe streams" messages (every few seconds) for the log groups I am still receiving events from, and the "Previous job of the same task still running" message every 10 minutes for the log groups I stopped receiving events from.
I've added the setting, and it does get rid of the bash process that runs alongside the python process for aws_cloudwatch_logs.py. I am still getting the same behavior as before, though. 😞
What version of the add-on are you running?
version 2.0.1
Also, one thing I forgot to specify: when I restart the Splunk server, it follows the same behavior as described above. It pulls all the data from all log groups again up to the most recent events, then stops and shows that message.