
Lambda Cloudwatch logs often missing due to edge case/race condition

tophercullen
Explorer

My use case:
I have several very small Lambda functions that run hourly and output ~20 lines each time. I've configured a CloudWatch input for each function's log group with a 600s polling frequency. These are the only things using the CloudWatch API.

Problem:
While the setup technically 'works', more often than not only the first log line of each invocation is indexed. The remaining lines are conspicuously missing. Occasionally it does get all the log lines.

Root cause:
I looked over the internal logs and only see normal operations. I think the issue is how the add-on tests for new logs. Each Lambda invocation creates a new log stream. See my example log stream (some data redacted for privacy):

    {
        "firstEventTimestamp": 1531317415254,
        "lastEventTimestamp": 1531317415254,
        "creationTime": 1531317406089,
        "uploadSequenceToken": "Atoken",
        "logStreamName": "MyLogStream",
        "lastIngestionTime": 1531317582675,
        "arn": "someARN",
        "storedBytes": 0
    },

This example is of a log stream whose Lambda execution has completed, meaning there are 20+ lines in there. Yet we can clearly see that the "firstEventTimestamp" and "lastEventTimestamp" attributes are identical. That's odd; let's look at the events.

{
    "nextForwardToken": "sometoken", 
    "events": [
        {
            "ingestionTime": 1531317415315, 
            "timestamp": 1531317415254, 
            "message": "START RequestId somestuff \n"
        }, 
        {
            "ingestionTime": 1531317430720, 
            "timestamp": 1531317415255, 
            "message": "Mylogline \n"
        }, 
.........
<redacted event list>
.........
        {
            "ingestionTime": 1531317597749, 
            "timestamp": 1531317582672, 
            "message": "Final log line \n"
        }, 
        {
            "ingestionTime": 1531317597749, 
            "timestamp": 1531317582673, 
            "message": "END RequestId: somestuff \n"
        }, 
        {
            "ingestionTime": 1531317597749, 
            "timestamp": 1531317582673, 
            "message": "REPORT RequestId: some more stuff"
        }
    ], 
    "nextBackwardToken": "sometoken"
}
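The mismatch is easy to demonstrate by comparing the stream metadata against the events it actually contains. Here is a minimal sketch using the (redacted) values above; the dicts stand in for real `describe_log_streams` / `get_log_events` responses:

```python
# Stand-ins for the (redacted) API responses shown above. The shapes
# mirror CloudWatch Logs describe_log_streams / get_log_events output.
stream = {
    "logStreamName": "MyLogStream",
    "firstEventTimestamp": 1531317415254,
    "lastEventTimestamp": 1531317415254,
    "lastIngestionTime": 1531317582675,
}

events = [
    {"timestamp": 1531317415254, "message": "START RequestId somestuff"},
    {"timestamp": 1531317415255, "message": "Mylogline"},
    {"timestamp": 1531317582672, "message": "Final log line"},
    {"timestamp": 1531317582673, "message": "END RequestId: somestuff"},
    {"timestamp": 1531317582673, "message": "REPORT RequestId: some more stuff"},
]

newest_event_ts = max(e["timestamp"] for e in events)

# lastEventTimestamp still points at the FIRST event, not the newest...
print(stream["lastEventTimestamp"] == events[0]["timestamp"])  # True
print(stream["lastEventTimestamp"] == newest_event_ts)         # False
# ...while lastIngestionTime has already moved past every event.
print(stream["lastIngestionTime"] >= newest_event_ts)          # True
```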

Here we can clearly see that each event's timestamp is different, and the last event's timestamp does NOT match the "lastEventTimestamp" attribute of the log stream. The "lastEventTimestamp" matches the first event. I believe the add-on is using "lastEventTimestamp" to determine whether there are more log events to index, which will cause missing logs under certain conditions; in my case, whenever the Lambda executions are still running when it indexes the log stream. Lucky me: since my hourly Lambda executions happen on the same interval as the check (*/10), I'm missing nearly all my logs.

Best I can tell, it should use the "lastIngestionTime" attribute, which actually seems to indicate some sort of change.
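A minimal sketch of that check (the function name `stream_has_new_data` and the `checkpoints` dict are hypothetical, not from the add-on's source): compare each stream's `lastIngestionTime` against the value saved at the previous poll, instead of relying on `lastEventTimestamp`:

```python
def stream_has_new_data(stream: dict, checkpoints: dict) -> bool:
    """Decide whether to fetch events for a stream.

    Keys off lastIngestionTime, which advances as soon as CloudWatch
    ingests new events, rather than lastEventTimestamp, which is only
    eventually consistent. `checkpoints` maps stream name -> the
    lastIngestionTime seen on the previous poll (hypothetical state).
    """
    seen = checkpoints.get(stream["logStreamName"], 0)
    return stream["lastIngestionTime"] > seen


# The stream from the example above, mid-execution: lastEventTimestamp
# is stale, but lastIngestionTime has already advanced.
stream = {
    "logStreamName": "MyLogStream",
    "lastEventTimestamp": 1531317415254,
    "lastIngestionTime": 1531317582675,
}

checkpoints = {"MyLogStream": 1531317415254}  # state from the previous poll

# A lastEventTimestamp-based check would skip this stream; this one doesn't.
print(stream_has_new_data(stream, checkpoints))  # True
```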

EDIT: I modified the execution interval several times to no avail. Maybe 1 in 10 streams is ever indexed properly, making this input pretty useless.

FeatureCreeep
Path Finder

We spoke with AWS support reps, and they confirmed that they are aware of this issue and plan to deploy a fix the week of the 23rd.


FeatureCreeep
Path Finder

@tophercullen Good explanation. I've found the exact same behavior. It seems like the add-on is doing the right thing by only looking at logs when CloudWatch says there is new data in them. This keeps API calls to CloudWatch to a minimum, which is important since CloudWatch will throttle calls if you make too many, and the limit is easy to hit. It seems to me that the problem is that CloudWatch is incorrectly reporting the last event time. The add-on is just believing what CloudWatch tells it, thinks there are no new logs, and so has no reason to ask for them.

Seems like a bug report needs to be made for CloudWatch due to its incorrectly reported "Last Event Time".


tophercullen
Explorer

I understand what you are saying. However, you misunderstand what is going on. The add-on tries to be efficient about log stream events by checking a log stream attribute to determine whether it should recurse into that stream. That's smart.

However, it can't do anything about the log streams themselves. The API doesn't allow you to filter the log streams of a log group in your request. You can only request all the streams in a log group, with optional sorting or pagination. If you have too many streams in a group, yes, you will run into potential problems, because it can't be smart about that; it can only ever get all the streams in a group.
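That constraint can be sketched against a stand-in for `describe_log_streams`. The `FakeLogsClient` below is a stub for illustration; with boto3 you would call `client.describe_log_streams(logGroupName=...)` and follow `nextToken` the same way, and any "has this stream changed?" filtering can only happen client-side on the returned pages:

```python
class FakeLogsClient:
    """Stub mimicking describe_log_streams pagination: you can page
    through a group's streams, but never filter them server-side."""

    def __init__(self, streams, page_size=2):
        self._streams = streams
        self._page_size = page_size

    def describe_log_streams(self, logGroupName, nextToken=None):
        start = int(nextToken) if nextToken else 0
        page = self._streams[start:start + self._page_size]
        resp = {"logStreams": page}
        if start + self._page_size < len(self._streams):
            resp["nextToken"] = str(start + self._page_size)
        return resp


def all_streams(client, group):
    """Every poll has to walk ALL streams in the group; deciding which
    ones to recurse into happens only after they are all fetched."""
    streams, token = [], None
    while True:
        resp = client.describe_log_streams(logGroupName=group, nextToken=token)
        streams.extend(resp["logStreams"])
        token = resp.get("nextToken")
        if token is None:
            return streams


client = FakeLogsClient([{"logStreamName": f"stream-{i}"} for i in range(5)])
print(len(all_streams(client, "/aws/lambda/my-fn")))  # 5
```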

That all said, what I'm suggesting doesn't change any 'efficiency'. You query log groups for their streams and then, instead of "lastEventTimestamp", you check "lastIngestionTime" to determine whether you should recurse. As far as I know, that attribute is a clearer indicator of change.

From their API docs: "lastEventTime updates on an eventual consistency basis. It typically updates in less than an hour from ingestion, but may take longer in some rare situations."
