ServerlessArchitecture#09 - My Lambda Function Times Out in 31s While the API Gateway in Front of It Times Out in 30s - A Stack-Wide Exception Handling Approach?

Let's start with a scenario: "REST API using AWS API Gateway, Lambda and DynamoDB".

Here, assume that

the Lambda function times out in 31s, while the API Gateway in front of it times out in 30s.

To unpack this statement, I will start with a few questions:

Question 01: How long does AWS API Gateway wait for a Lambda function to respond before it times out?

Question 02: What is the foremost challenge the major cloud service providers have placed in their architecture?

Question 03: What is the micro-service approach to web application architecture?

Question 04: How will you monitor and debug issues when timeouts are inevitable?

1.1 Relate the questions to the above scenario

Question	Description
Q1. How long does AWS API Gateway wait for a Lambda function to respond before it times out?	In the above scenario, AWS API Gateway waits only up to 30 seconds for a Lambda function to respond before it times out (for REST APIs the default integration timeout is 29 seconds).
Q2. What is the foremost challenge the major cloud service providers have placed in their architecture?	All three major cloud service providers place execution time limits on the order of minutes - AWS: 15 mins - Azure: 5 mins - Google: 9 mins
Q3. What is the micro-service approach to web application architecture?	As a developer, you must devise ways of handling requests in a short amount of time. - Thus the timeout limit imposed by the API Gateway in front of the Lambda function pushes the web application architecture toward the micro-service approach.
Q4. How will you monitor and debug issues when timeouts are inevitable?	Jump to Section 1.2

Not all timeouts are within our control

It's worth noting that while timeouts can be due to bugs we introduce into the system, sometimes an external service we depend on takes too long or returns incorrect data that affects our functions' execution.
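One common way to keep a slow dependency from consuming the whole time budget is to put a deadline on each external call. The following is a minimal sketch, not part of the original setup: `withTimeout` and `slowDependency` are hypothetical names introduced here for illustration, and `slowDependency` merely simulates an external service.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot linger.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a slow external service (assumption: resolves with 'ok' after `delayMs`).
function slowDependency(delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve('ok'), delayMs));
}
```

Inside a handler this could be used as `await withTimeout(slowDependency(5000), 2000, 'dependency')`, which rejects after two seconds instead of waiting for the dependency. Note that the underlying call is not cancelled; only the wait is abandoned.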

1.2 Tracking Function Timeouts Using Logs

One way to track timeouts is to watch the function logs. AWS Lambda will log a message when a function times out:

START RequestId: 6f747a63-ce41-11e7-8cfe-3d61dad3d219 Version: $LATEST
END RequestId: 6f747a63-ce41-11e7-8cfe-3d61dad3d219
REPORT RequestId: 6f747a63-ce41-11e7-8cfe-3d61dad3d219    Duration: 3001.52 ms    Billed Duration: 3000 ms     Memory Size: 128 MB    Max Memory Used: 20 MB
2017-11-20T22:23:32.998Z 6f747a63-ce41-11e7-8cfe-3d61dad3d219 Task timed out after 3.00 seconds

While the raw logs aren't easy to scan by eye, it's possible to watch them with another Lambda function or pull them into an external logging solution to notice when functions time out.

But there can be a considerable delay (on the order of a minute or more) before the logs are analyzed, no matter which way you go.

It also becomes hard to piece together the causes and symptoms of function timeouts when looking only at logs. While it's certainly possible to build a system based solely on logs, there is a better way.
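To make the log-based approach concrete, here is a small sketch of the matching step a log-watching function might perform. It assumes the timeout line has exactly the shape shown in the sample above (timestamp, request ID, then "Task timed out after N seconds"); `findTimeouts` is a hypothetical helper name, and how the log lines reach it is left out.

```javascript
// Matches the timeout line from the sample above, e.g.
// "2017-11-20T22:23:32.998Z <request-id> Task timed out after 3.00 seconds"
const TIMEOUT_LINE = /^(\S+)\s+([0-9a-f-]{36})\s+Task timed out after ([\d.]+) seconds/;

// Hypothetical helper: returns one record per timeout found in an array of log lines.
function findTimeouts(logLines) {
  const timeouts = [];
  for (const line of logLines) {
    const match = TIMEOUT_LINE.exec(line.trim());
    if (match) {
      timeouts.push({
        timestamp: match[1],
        requestId: match[2],
        seconds: Number(match[3]),
      });
    }
  }
  return timeouts;
}
```

The request ID it extracts can then be used to look up the matching START and REPORT lines, which is exactly the piecing-together work that makes the log-only approach tedious.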

1.3 Tracking Function Timeouts Using A Watchdog Timer

As a cloud developer, I always want a solution for noticing timeouts and handling them gracefully in a more real-time and flexible fashion.

To enable this, I have added a wrapper that injects a watchdog timer.

const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const lambda = new AWS.Lambda();

exports.handler = function (event, context, callback) {
  // Install the watchdog timer as the first thing you do
  setupWatchdogTimer(event, context, callback);

  // Do stuff...
}

Watchdog timers are set to an interval within which an expected amount of work should occur.

For example, if you expect a compute task to take 10 seconds you might set a watchdog timer for 15 seconds.
If the task isn't complete after 15 seconds the timer triggers other logic that handles the exceptional case.
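Outside of any cloud platform, the general idea can be sketched in a few lines. This is only an illustration of the pattern described above; `startWatchdog` is a hypothetical helper, not an API from Node.js or AWS.

```javascript
// Hypothetical helper: runs `onTimeout` unless `stop()` is called within `ms` milliseconds.
function startWatchdog(ms, onTimeout) {
  let fired = false;
  const timer = setTimeout(() => {
    fired = true;
    onTimeout();
  }, ms);
  return {
    // Call when the work finished in time; cancels the pending timer.
    stop: () => clearTimeout(timer),
    // True once the watchdog has triggered.
    hasFired: () => fired,
  };
}
```

For the 10-second task in the example, you would call `startWatchdog(15000, handleOverrun)` before starting the work and `stop()` as soon as the work completes, where `handleOverrun` stands for whatever exceptional-case logic you need.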

Mark it: Handling the exceptional cases

Note: In a traditional setting, the code that handles exceptional cases can generally take as much time as it needs.

But that's NOT the case in a serverless architecture. If we want to use watchdog timers for serverless use cases, we need to be cognizant of the amount of time spent handling the exceptional case when the timer fires, because our exception handling logic runs in the same function that has an overall time limit placed on it.
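The time budget can be written down as simple arithmetic: the watchdog delay is the time remaining minus whatever you reserve for the exception-handling code. A minimal sketch follows; `watchdogDelay` and its parameter names are hypothetical and chosen here for illustration.

```javascript
// Hypothetical helper: how long to wait before the watchdog fires.
// remainingMs: time left before the platform stops the function
//              (on AWS Lambda, context.getRemainingTimeInMillis()).
// reserveMs:   time set aside for the exception-handling code itself.
function watchdogDelay(remainingMs, reserveMs) {
  if (reserveMs < 0) {
    throw new RangeError('reserveMs must not be negative');
  }
  // Never return a negative delay; 0 means "fire as soon as possible".
  return Math.max(0, remainingMs - reserveMs);
}
```

With a 31-second function limit and a one-second reserve, `watchdogDelay(31000, 1000)` gives 30000 ms. If the exceptional-case logic needs more than the reserve, it will itself be cut off by the platform, so the reserve should be sized to the slowest step in that logic.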

But why did I install this watchdog timer? To answer this, let me explain my solution strategy.

My Solution Strategy

For example, an AWS Lambda function responding to API Gateway requests may have an API Gateway timeout of 30 seconds and a Lambda timeout of 31 seconds.

When the function starts, it sets a watchdog timer for 30 seconds (as in the code snippet above).
If that timer is triggered, we know the API Gateway response has already timed out,
but we still have one more second for the Lambda function to send off a notification message with diagnostic info.
Let's examine the code below:

const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const lambda = new AWS.Lambda();

// 1. Lambda Handler
exports.handler = function (event, context, callback) {
  // S1 - Install the watchdog timer as the first thing you do
  const watchdog = setupWatchdogTimer(event, context, callback);

  // Do stuff...
  // When the work finishes in time, call clearTimeout(watchdog) before callback(...);
  // otherwise the pending timer keeps the event loop busy and still fires.
}

function setupWatchdogTimer(event, context, callback) {

  // S2.1 - Prepare the handler: when the timer triggers, emit an event to the 'myErrorHandler'
  // function and pass along details about the original message that caused the timeout

  const timeoutHandler = () => {
    // Include original event and request ID for diagnostics
    const message = {
      message: 'Function timed out',
      originalEvent: event,
      originalRequestId: context.awsRequestId
    };

    // S3 - Invoke with InvocationType 'Event' so we don't wait for a response (asynchronous invocation)

    const params = {
      FunctionName: 'myErrorHandler', // this is another Lambda function
      InvocationType: 'Event',
      Payload: JSON.stringify(message)
    };

    // S4 - Emit the event to 'myErrorHandler', then call back with an error from this function.
    // If the invoke itself fails, report that error instead of leaving the callback uncalled.
    lambda.invoke(params).promise()
      .then(() => callback(new Error('Function timed out')))
      .catch((err) => callback(err));
  };

  // S2 - Set the timer so it triggers one second before this function would time out,
  // and return the handle so the handler can clear it
  return setTimeout(timeoutHandler, context.getRemainingTimeInMillis() - 1000);
}

When it times out, the above function emits an event to another Lambda function (i.e. myErrorHandler) that deals with exceptions.
It not only tells the other function that it timed out, it also passes along details about the original message that caused the timeout.

Now that the other function, 'myErrorHandler', knows about the timeout and has the details of the original message, what can we do with this message?

Option 01: We can send it to an error aggregation service (e.g. Sentry, Scalyr, OutSystems etc.) for further debugging.
Option 02: We can also use it for custom retry or other error handling logic.

For example, we could send an email to the affected user (based on data from the original event message) to tell them we noticed the issue and that customer service will follow up with a resolution soon.

Refresher For You: Node.js Event-Driven Process

// Node.js is well suited to event-driven applications:
// every action on a computer can be treated as an event.
/*
1- Create an event handler
2- Assign the event handler you just created to an event, giving the event a custom name
    -- use events.EventEmitter()
3- Fire the event
    -- using eventEmitter
 */

const eventsModule = require('events');
const eventEmitter = new eventsModule.EventEmitter();

// 1- Create an event handler
const myEventHandler = function () {
    console.log("Event Handler: Msg--> I hear a scream!");
};

// 2- Assign the event handler to an event by using eventEmitter
eventEmitter.on('myEventName', myEventHandler);

// 3- Fire the event 'myEventName'
eventEmitter.emit('myEventName');