Splunk Add-on for Amazon Web Services: Why do I stop receiving events from some of my CloudWatch log groups?

nickpayze
Explorer

I am pulling data from 30-40 log groups across 3 different regions using the Splunk Add-on for AWS. After about 10-15 minutes, I stop receiving the most recent events from half of my log groups. I initially receive data from all log groups just fine, but once the add-on pulls the most recent data, it never seems to check again. The delay and interval settings are at their defaults, and I've confirmed that the most current events are reaching the CloudWatch Logs service. My only clue is this event in the Splunk internal logs, which occurs for the affected log groups.

2015-12-08 17:52:22,328 INFO pid=7026 tid=Thread-298 file=aws_cloudwatch_logs.py:_do_was_job_func:130 | Previous job of the same task still running. Exit current job. region=us-west-2, log_group=syslog

This event recurs indefinitely every 10 minutes, and Splunk never pulls data from the log group again.

Any ideas?

1 Solution

nickpayze
Explorer

Updating to the latest Amazon add-on version (3.0.0) fixed the issue.


briancronrath
Contributor

I was able to get around this issue by limiting the time range for the data it polls. This is under the Splunk Add-on for AWS console -> Inputs -> Actions -> Edit -> Templates, specifically the "Only After" value.

0 Karma

henrikhuitti
New Member

We resolved this issue by switching from direct CloudWatch Logs inputs to Kinesis; please check http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html

We also got answer from AWS:

[...] Instead you should use the Kinesis subscription integration that Splunk apparently provides, but does not use by default. The default Splunk integration only works for very small customers. You should reach out to Splunk for support if needed on how to use Splunk with CloudWatch Logs.

0 Karma

nickpayze
Explorer

The latest amazon add-on version I updated to (3.0.0) has fixed the issue.

amiller100
New Member

I am also seeing the same throttling alerts in 4.1.1

0 Karma

henrikhuitti
New Member

Can confirm: throttling errors with version 4.1.0 and only 11 CloudWatch Logs log streams.

Failure in describing cloudwatch logs streams due to throttling exception for log_group=, sleep=5.98632069244, reason=Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/cloudwatch_logs_mod/aws_cloudwatch_logs_data_loader.py", line 64, in describe_cloudwatch_log_streams
    group_name, next_token=buf["nextToken"])
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 308, in describe_log_streams
    body=json.dumps(params))
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 576, in make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'__type': u'ThrottlingException', u'message': u'Rate exceeded'}
0 Karma

wsh
New Member

For what it's worth, @nickpayze, I'm seeing this on 3.0.0. 😞 Same throttling exception that you saw.

0 Karma

lcasey001
Explorer

We have this same issue running the latest 4.1.0 version. It seems to run describe_log_stream against all log groups at the same time, which is probably what causes the throttling. This is especially a problem when you have a large set of log groups.
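The tracebacks in this thread show the add-on already sleeping on a ThrottlingException (sleep=5.98...), but the usual mitigation for "Rate exceeded" on concurrent describe calls is exponential backoff with jitter. A minimal sketch of that pattern, assuming a generic callable (ThrottlingError here is a stand-in for boto's exception, not the add-on's actual class):

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for the ThrottlingException boto raises on 'Rate exceeded'."""


def call_with_backoff(fn, max_retries=5, base_delay=0.5):
    """Call fn, retrying on throttling with exponential backoff plus jitter.

    Delay grows as base_delay * 2**attempt, with a random jitter added so
    that many workers hitting the same limit do not retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Staggering the per-log-group jobs (different start offsets per input) achieves a similar effect without code changes.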

0 Karma

gsumner
Explorer

Also seeing this issue on 4.0.0

0 Karma

nickpayze
Explorer

I found a ThrottlingException ERROR in the internal logs that may be another clue. Could this be the culprit?

2015-12-10 16:21:51,357 ERROR pid=24928 tid=Thread-23 file=util.py:describe_cloudwatch_log_streams:118 | Failure in describing cloudwatch logs streams due to throttling exception for log_group=kern.log, sleep=5.96629281236, reason=Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/aws_cloudwatch_logs_resources/util.py", line 108, in describe_cloudwatch_log_streams
    group_name, next_token=buf["nextToken"])
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 308, in describe_log_streams
    body=json.dumps(params))
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 576, in make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'message': u'Rate exceeded', u'__type': u'ThrottlingException'}
0 Karma

kyleguillot
New Member

I'm seeing the same behavior with Splunk running on Windows 7

0 Karma

bwooden
Splunk Employee
Splunk Employee

What OS is being used to host Splunk?

0 Karma

nickpayze
Explorer

Ubuntu 14.04

0 Karma

bwooden
Splunk Employee
Splunk Employee

Ubuntu's default dash shell handles SIGTERM differently than bash, resulting in orphaned input processes. This was meant to have been resolved in TA version 2.0.1 (which is why rpille asked which version you are running). At first glance, it appears this condition is detected but only partially handled: additional processes aren't spawned while orphaned processes exist, yet the orphaned process is never terminated. I'll file a new bug for this and explore workarounds.

0 Karma

bwooden
Splunk Employee
Splunk Employee

Hi @nickpayze, can you try adding start_by_shell = false to the [aws_cloudwatch_logs] stanza in inputs.conf and restarting Splunk?
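For anyone following along, the resulting stanza would look something like this (a sketch based on the setting named above; your inputs.conf may use per-input stanzas instead of the bare default one):

```ini
[aws_cloudwatch_logs]
# Launch the modular input directly instead of via the shell,
# avoiding dash/bash SIGTERM handling differences on Ubuntu.
start_by_shell = false
```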

0 Karma

nickpayze
Explorer

Will I have to wait until this issue is resolved in the next version of the aws add-on?

0 Karma

azhang_splunk
Splunk Employee
Splunk Employee

Would you turn on debug logging and double-check whether you can find the log messages "Start to describe streams **" and "Job ended. region **" for each interval? The log group name should be printed in those messages.

0 Karma

nickpayze
Explorer

I do not see any "Job ended" messages for any of my log groups.

I see many "Start to describe streams" messages (every few seconds) for the log groups I am still receiving events from, and the "Previous job of the same task still running" message every 10 minutes for the log groups I stopped receiving events from.

0 Karma

nickpayze
Explorer

I've added the setting, and it does get rid of the bash process that runs alongside the python process for aws_cloudwatch_logs.py. I'm still getting the same behavior as before, though. 😞

0 Karma

rpille_splunk
Splunk Employee
Splunk Employee

What version of the add-on are you running?

0 Karma

nickpayze
Explorer

version 2.0.1

Also, one thing I forgot to mention: when I restart the Splunk server, it follows the same behavior described above. It pulls all data from all logs up to the most recent, then stops and shows that message.

0 Karma
Get Updates on the Splunk Community!

.conf24 | Day 0

Hello Splunk Community! My name is Chris, and I'm based in Canberra, Australia's capital, and I travelled for ...

Enhance Security Visibility with Splunk Enterprise Security 7.1 through Threat ...

 (view in My Videos)Struggling with alert fatigue, lack of context, and prioritization around security ...

Troubleshooting the OpenTelemetry Collector

  In this tech talk, you’ll learn how to troubleshoot the OpenTelemetry collector - from checking the ...