ServerlessArchitecture#04 AWS Lambda Cold Starts#PART-4 How long does AWS Lambda keep your idle functions around before a cold start?


To summarize, this blog post covers:

  1. Interesting observation: Lambda functions are no longer recycled after 5 minutes of inactivity.
  2. How does language, memory and package size affect cold starts of AWS Lambda?
  3. What’s the new period of inactivity which triggers a cold start?
  4. Does memory allocation impact the idle time before a cold start?
    Hypothesis 01: There is an upper bound to how long Lambda allows your function to stay idle before reclaiming the associated resources.
    Hypothesis 02: The idle timeout is not a constant.
    Hypothesis 03: The upper bound for inactivity varies by memory allocation.

  5. Experimenting to find the upper bound for inactivity

  6. Step Functions to set up the experiment
  7. Experimental Results/Findings: Over 60% of cold starts occurred after 45 mins — before the functions reached their upper bound for inactivity.
  8. Further Investigation: For instance, the 1536 MB function exhibited very different behaviour compared to other functions. Is this a special case, or do functions with more than 1024 MB of memory all share these traits?

Selection_026.png

Using AWS Step Function to find the longest time your AWS Lambda function can idle before the resources are reclaimed

This post is a rephrasing of bit.ly/31vNoqw, following the author’s experimental guideline, which I later implemented and tested myself.

In a recent experiment, I compared the cold start times of AWS Lambda across different languages, memory allocations, and deployment package sizes.

One of the interesting observations was that functions are no longer recycled after 5 minutes of inactivity — which makes cold starts far less punishing.

1.1 How does language, memory and package size affect cold starts of AWS Lambda?

That experiment compared the cold start times of AWS Lambda across different languages, memory allocations, and deployment package sizes.

During the experiment, some of my functions didn’t experience a cold start until after 30 minutes of idle time. The longer period of inactivity is something Amazon quietly changed behind the scenes, and it is fantastic news. However, the change prompted me to ask a few follow-up questions:


  • What’s the new period of inactivity which triggers a cold start?

  • Does memory allocation impact the idle time before a cold start?


To satisfy my curiosity, I devised an experiment and hypotheses. The experiment is intended to help us glimpse into implementation details of the AWS Lambda platform.

Since AWS can, and will, change these implementation details without notice, you shouldn’t build your application around these results!


Hypothesis 01: There is an upper bound to how long Lambda allows your function to stay idle before reclaiming the associated resources.

This should be a given — it simply wouldn’t make any sense for AWS to keep idle functions around forever.

Idle functions occupy resources that can be used to help other AWS customers scale up to meet their needs. And most importantly to AWS, an inactive function is not paying the bills.



Hypothesis 02: The idle timeout is not a constant.

From a developer’s point-of-view, a consistent and published idle period before a cold start would be preferred — e.g. functions are always terminated after X mins of inactivity.

However, AWS will most likely vary the timeout to optimize for periods of higher utilization. This allows it to keep performance levels more evenly distributed across its fleet of physical servers.

For example, if there’s an elevated level of resource contention in a region, it makes sense for AWS to shorten the idle period before a cold start and terminate functions to free up resources.



Hypothesis 03: The upper bound for inactivity varies by memory allocation.

An idle function with 1536 MB of memory allocation is wasting a lot more resources than an idle function with 128 MB of memory. It makes sense for AWS to terminate idle functions with higher memory allocation earlier.

image.png


1.2 Experimenting to find the upper bound for inactivity

To find the upper bound for inactivity, I have adopted the below approach:

  • Create a Lambda function: We first need a Lambda function to act as the system-under-test and report when it has experienced a cold start.
  • Determine the upper bound, the point where cold starts are guaranteed: We need a mechanism to progressively increase the interval between invocations until we arrive at a place where each invocation is guaranteed to be a cold start. That interval is the upper bound.

    The upper bound is taken to be X minutes once ten (10) consecutive cold starts are observed with invocations X minutes apart.

To answer hypothesis 3 — the impact of memory — we will also replicate the system-under-test function with different memory allocations.
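
The post does not show the system-under-test code. One common way to detect a cold start is to keep a flag at module level, since module state is initialized once per container and then reused by warm invocations. Here is a minimal Python sketch under that assumption; the handler name and the isColdstart field are hypothetical, not taken from the original experiment.

```python
# Module-level state is initialized once per container. It survives warm
# invocations and is reset only when Lambda creates a new container.
_is_cold_start = True


def handler(event, context):
    """Report whether this invocation ran in a freshly created container."""
    global _is_cold_start
    was_cold = _is_cold_start
    _is_cold_start = False  # every later call in this container is warm
    # "isColdstart" is a hypothetical field name chosen for this sketch.
    return {"isColdstart": was_cold}
```

Deploying the same handler several times with different memory settings gives one variant per allocation, which is what hypothesis 3 needs.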

This experiment is a time-consuming process that requires discipline and a degree of precision in timing. Suffice it to say, I won’t be doing this by hand!

1.2.1 Step Functions to set up the experiment

First and Failed Approach: My first approach was to use a CloudWatch Schedule to trigger the system-under-test function, and let the function dynamically adjust the schedule based on whether it’s experienced a cold start.

This approach failed miserably. Whenever the system-under-test updates the schedule, it fires shortly thereafter instead of waiting for the newly specified interval.

Alternative and Successful Approach: AWS Step Functions allows you to create a state machine where you can invoke Lambda functions, wait for a specified amount of time, execute parallel tasks, retry, catch errors, and much more.

Below is the state machine used to carry out this experiment. The visual workflow depicts how the FindIdleTimeout state will invoke the system-under-test function. Depending on its output, it either completes the experiment or waits before recursing.

image.png
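
The full state machine definition is not included in this post, so the following is only a sketch of what the described workflow might look like in Amazon States Language, built here as a Python dict. The FindIdleTimeout and Wait state names come from the post; the IsDone and Done states, the threshold check on $.coldstarts, and the ARN are assumptions made for illustration.

```python
import json


def build_state_machine(find_idle_timeout_arn, target_coldstarts=10):
    """Return an Amazon States Language definition as a plain dict."""
    return {
        "StartAt": "FindIdleTimeout",
        "States": {
            # Invokes the find-idle-timeout function with the current input.
            "FindIdleTimeout": {
                "Type": "Task",
                "Resource": find_idle_timeout_arn,
                "Next": "IsDone",
            },
            # Hypothetical check: stop once enough cold starts were counted.
            "IsDone": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.coldstarts",
                        "NumericGreaterThanEquals": target_coldstarts,
                        "Next": "Done",
                    }
                ],
                "Default": "Wait",
            },
            # Waits for the number of seconds found in the "interval" field.
            "Wait": {
                "Type": "Wait",
                "SecondsPath": "$.interval",
                "Next": "FindIdleTimeout",
            },
            "Done": {"Type": "Succeed"},
        },
    }


# Placeholder ARN, not a real resource.
definition = build_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:find-idle-timeout"
)
print(json.dumps(definition, indent=2))
```

The loop from Wait back to FindIdleTimeout is what the post calls recursing: each pass feeds the previous output back in as the next input.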


The wait state allows you to drive the number of seconds to wait using data (see the SecondsPath parameter in the documentation for more details). The wait state allowed me to start the state machine with an input like this:

{
    "target": "when-will-i-coldstart-dev-system-under-test-128",
    "interval": 600,
    "coldstarts": 0
}
  • The input is passed to another function, find-idle-timeout, as its invocation event.
  • That function invokes the target, which is one of the variants of the system-under-test function with different memory allocations.
  • It then increases the interval if the system-under-test function doesn’t report a cold start.
  • Finally, the find-idle-timeout function returns a new piece of data for the Step Functions execution:
{
    "target": "when-will-i-coldstart-dev-system-under-test-128",
    "interval": 660,
    "coldstarts": 0
}
  • At this point, the wait state will use the interval value and wait 660 seconds before switching back to the FindIdleTimeout state.
  • It will then invoke the find-idle-timeout function again — using the previous output as input.
    "Wait": {
      "Type": "Wait",
      "SecondsPath": "$.interval",
      "Next": "FindIdleTimeout"
    },
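
The decision logic inside find-idle-timeout is described above but not shown. A minimal Python sketch of it follows. The 60-second step matches the 600 to 660 change in the example; resetting the counter after a warm invocation is an assumption based on the requirement for ten consecutive cold starts, and the injected invoke callable is hypothetical (a real version would call the Lambda Invoke API and read the target's response).

```python
def next_state(state, is_cold_start, step_seconds=60):
    """Compute the next Step Functions input from the latest invocation."""
    interval = state["interval"]
    coldstarts = state["coldstarts"]
    if is_cold_start:
        # Keep the same interval and count towards the consecutive total.
        coldstarts += 1
    else:
        # Warm invocation: wait longer next time and restart the count
        # (assumption: only consecutive cold starts should be counted).
        interval += step_seconds
        coldstarts = 0
    return {
        "target": state["target"],
        "interval": interval,
        "coldstarts": coldstarts,
    }


def find_idle_timeout(state, invoke):
    """invoke(target) is a hypothetical callable returning True on a cold start."""
    return next_state(state, invoke(state["target"]))


start = {
    "target": "when-will-i-coldstart-dev-system-under-test-128",
    "interval": 600,
    "coldstarts": 0,
}
after_warm = find_idle_timeout(start, invoke=lambda target: False)
after_cold = find_idle_timeout(after_warm, invoke=lambda target: True)
```

Keeping the decision in a pure function makes it easy to check without touching AWS: after_warm carries an interval of 660, and after_cold keeps that interval while the counter moves to 1.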
    

With this setup I’m able to kick off multiple executions, one for each memory setting. Using the Step Functions dashboard, you can observe the active executions for your state machine.
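
As a rough sketch of how one input per memory setting could be prepared, the helper below builds the JSON strings for each execution. The target naming pattern follows the example input in this post. The list contains only the two memory sizes the post names explicitly; the full set used in the experiment is not given here.

```python
import json

# Only the sizes mentioned in this post; extend with whichever variants
# you actually deploy.
MEMORY_SIZES_MB = [128, 1536]


def build_execution_inputs(memory_sizes, start_interval=600):
    """Return one JSON input string per system-under-test variant."""
    return [
        json.dumps(
            {
                "target": f"when-will-i-coldstart-dev-system-under-test-{mb}",
                "interval": start_interval,
                "coldstarts": 0,
            }
        )
        for mb in memory_sizes
    ]


inputs = build_execution_inputs(MEMORY_SIZES_MB)
```

Each string could then be passed as the input of a separate Step Functions execution, for example from the console or through the StartExecution API.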

image.png

Along the way I have plenty of visibility into what’s happening, all from the comfort of the Step Functions management console. Using the Step Functions console, you can see the current state of the state machine.

image.png

Using the Step Functions Console, you can also see the input and current output of the state machine.

image.png

The first time the target function is invoked, it is guaranteed to be a cold start. Here you can see the current cold start count is one (1).

image.png

Using the Step Functions Console, you can also see when the state transitions happened — and the relevant inputs and outputs at each transition.

image.png

1.2.2 Experimental Results

From the data, it’s clear that AWS Lambda shuts down idle functions around the hour mark. It’s interesting to note that the function with 1536 MB memory is terminated over 10 mins earlier.

This finding supports hypothesis 3 — idle functions with higher memory allocation will be terminated earlier.

image.png

To help analyze the results, I collected data on all the idle intervals where we saw a cold start and categorized them into 5-minute brackets.

This table shows the number of cold starts that occurred before each function reached its upper bound idle time.

image.png

image.png

From this chart, you can see that over 60% of cold starts occurred after 45 mins — before the functions reached their upper bound for inactivity.

image.png
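
The bracketing described above can be sketched in a few lines of Python. The sample idle times below are made up purely to show the mechanics and are not the experiment's data; the function names are likewise hypothetical.

```python
from collections import Counter


def bracket_label(idle_minutes, width=5):
    """Map an idle time to its bracket label, e.g. 47 -> '45-50'."""
    low = int(idle_minutes // width) * width
    return f"{low}-{low + width}"


def count_by_bracket(idle_minutes_list, width=5):
    """Count cold starts per bracket of `width` minutes."""
    return Counter(bracket_label(m, width) for m in idle_minutes_list)


def share_at_or_after(idle_minutes_list, threshold=45):
    """Fraction of cold starts seen at or after `threshold` idle minutes."""
    if not idle_minutes_list:
        return 0.0
    hits = sum(1 for m in idle_minutes_list if m >= threshold)
    return hits / len(idle_minutes_list)


# Made-up idle times (in minutes) at which a cold start was observed.
sample = [12, 33, 46, 47, 52, 58]
brackets = count_by_bracket(sample)
late_share = share_at_or_after(sample)
```

Running the same two functions over the recorded idle intervals for each memory setting would give both the per-bracket table and the share of cold starts beyond the 45-minute mark.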


Even though the data is seriously lacking, the little data collected still allows us to observe some high level trends:

  • Over 60% of cold starts happened after 45 mins of inactivity, prior to hitting the upper bound.
  • The function with 1536 MB memory sees significantly fewer cold starts prior to hitting the upper bound.
  • It’s worth noting that functions with 1536 MB also have a lower upper bound (48 mins) when compared to other functions.

The data seems to clearly support hypothesis 2: the idle timeout is not a constant.

There’s no way for us to figure out the reason behind these cold starts, or if there’s significance to the 45 mins barrier.


2. Conclusions

AWS Lambda will generally terminate functions after 45–60 mins of inactivity, although idle functions can sometimes be terminated a lot earlier to free up resources needed by other customers.

I hope you found this experiment interesting. It’s meant for fun and to satisfy a curious mind, and nothing more! Please don’t build applications on the assumption that these results are valid, or assume they will remain valid for the foreseeable future. You can find the source code for the experiment here.


While I answered a few questions, the results from this experiment also deserve further investigation.

For instance, the 1536 MB function exhibited very different behaviour compared to other functions.

Is this a special case, or do functions with more than 1024 MB of memory all share these traits?