Gunicorn - ‘[CRITICAL] Worker Timeout’ on App Service Linux

7 minute read | By Anthony Salemo

This blog post will quickly cover a few scenarios where ‘[CRITICAL] WORKER TIMEOUT’ may be encountered and why.

Overview

This post applies both to App Service Linux using Python - since these images use Gunicorn - and to any custom image used on Web Apps for Containers that utilizes Gunicorn to run the application.

The message seen is typically something like:

[74] [CRITICAL] WORKER TIMEOUT (pid:91)

This will typically be associated with an HTTP 500 response returned from Gunicorn. To a user, this may look like the following:

(Screenshot: HTTP 500 from Gunicorn)

In this case, [74] refers to the PID of the parent (master) Gunicorn process. Gunicorn uses a master/child process model - where child processes handle incoming requests. Depending on the error, you can potentially determine the PID of the worker process that was terminated:

[2024-06-28 20:44:32 +0000] [74] [CRITICAL] WORKER TIMEOUT (pid:91)
2024-06-28T20:44:32.8354846Z [2024-06-28 20:44:32 +0000] [91] [ERROR] Error handling request /api/sleep

In this case, [91] refers to one of the child (worker) processes for Gunicorn, while 74 is still the master process. These timeouts can come from any of the worker processes handling the failing request at that time.

Prerequisites

To see these errors, you need to have App Service Logging enabled. See Enable application logging (Linux/Container)

After enabling this, application stdout / stderr will be written to /home/LogFiles/xxxxx_default_docker.log

If you’re using a custom startup command with Gunicorn, you need to set --access-logfile '-' --error-logfile '-' in your Gunicorn startup command. This writes logs to stdout/stderr, which is picked up by App Service and subsequently written into default_docker.log.
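
For example, a custom startup command that keeps Gunicorn logging to stdout/stderr could look something like the below - app:app is just a placeholder for your own WSGI module and callable, and the port binding is an assumption based on the typical container port:

gunicorn --bind=0.0.0.0:8000 --access-logfile '-' --error-logfile '-' app:app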

You can review Configuring Gunicorn worker classes and other general settings for more information.

These logs can then be viewed in various ways:

  • Diagnose and Solve Problems -> Application Logs
  • Logstream
  • Kudu -> /home/LogFiles/
  • FTP -> /home/LogFiles/
  • Azure CLI - az webapp log tail

Potential reasons

Although this is covered in various other places in the Gunicorn and Python community, below are a few common reasons why this may happen.

Missing or inappropriate --timeout

If you’re using App Service Linux with Python - and not altering the startup command - then you’ll be using a predefined (but overridable) startup command through Gunicorn that’s called out here, which is essentially gunicorn --timeout 600 --access-logfile '-' --error-logfile '-' -c /opt/startup/gunicorn.conf.py --chdir=/tmp/<RANDOM_GUID> <wsgi_file_location>:<wsgi_callable>

Here, --timeout is already defined. What --timeout does is explained in the Gunicorn docs:

Default: 30

Workers silent for more than this many seconds are killed and restarted.

Value is a positive number or 0. Setting it to 0 has the effect of infinite timeouts by disabling timeouts for all workers entirely.

If you end up overriding this default command and omit --timeout, then you’re implicitly defaulting this back to 30 seconds. This is important to understand: since App Service is an HTTP-based PaaS platform, if a request does not complete within the allocated timeout period (which is now 30 seconds if omitted), then the Gunicorn worker will be killed by the Gunicorn parent process.

In this context, “workers silent” would equate to a long-running HTTP request (or one running longer than what --timeout is set to). This could be a slow response from something further upstream, slow logic execution, or resource contention.

Therefore, always ensure --timeout is set to an appropriate value. Additionally, App Service has an idle connection limit of 240 seconds - so if a request does not complete by that time, the connection (and request) will be terminated and cancelled regardless of this setting. To be safe, you can set --timeout back to 600 (--timeout 600), or essentially anything over 240.
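
If you’re supplying your own Gunicorn config file - similar to how the default command passes -c /opt/startup/gunicorn.conf.py - the same settings can be expressed there. The below is only a sketch using the example values from above:

# gunicorn.conf.py - example values only
# Equivalent to passing --timeout 600 on the command line
timeout = 600

# Keep access/error logs on stdout/stderr so App Service picks them up
accesslog = "-"
errorlog = "-"

This would then be referenced with something like gunicorn -c gunicorn.conf.py <wsgi_file_location>:<wsgi_callable> in a custom startup command.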

Troubleshooting:

  • If you still notice [CRITICAL] WORKER TIMEOUT even after extending --timeout, review the below scenarios for other possible reasons
  • If none of the below applies, then application logic needs to be reviewed - especially if any external dependencies are involved in the request flow, as this may play into long-running requests
  • When possible, also test this behavior locally with Gunicorn in a container (or another Linux-based environment), pointing to any external dependencies your application relies on, to mimic the environment on App Service - see the sketch after this list
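
For instance, a minimal sketch for reproducing this locally could be a small WSGI app (Flask here, as an assumption) with a deliberately slow route, run under Gunicorn with a short --timeout - the route name and sleep duration are purely illustrative:

# app.py - illustrative only
import time

from flask import Flask

app = Flask(__name__)

@app.route("/api/sleep")
def sleep():
    # Simulate a slow upstream dependency or slow logic execution
    # that runs longer than Gunicorn's --timeout
    time.sleep(60)
    return "done"

# Run locally, e.g.: gunicorn --timeout 30 --access-logfile '-' --error-logfile '-' app:app
# A request to /api/sleep should then surface [CRITICAL] WORKER TIMEOUT in the logs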

Resource contention

High CPU:

High CPU, in itself, wouldn’t cause workers to be killed due to [CRITICAL] WORKER TIMEOUT - but it can cause requests and application logic to execute slowly.

If a very high amount of CPU is seen - or even what seems to be an increase in CPU due to intensive tasks like computation - this probably needs to be the main focus.

If --timeout (explained above) is set to an appropriate value, focus on profiling the application while reproducing the issue. This can be used as a reference: Python Performance High CPU Using CProfile

NOTE: A lot of what’s described in Container Apps: Profiling Python applications for performance issues can also be used in terms of tooling
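
As a rough sketch of what that profiling can look like, Python’s built-in cProfile module can be wrapped around a suspect code path (assuming Python 3.8+ for the context-manager usage) - the function below is just a stand-in for your own slow or CPU-intensive logic:

# profile_sketch.py - illustrative only
import cProfile
import pstats

def cpu_heavy_work():
    # Stand-in for application logic suspected of consuming CPU
    return sum(i * i for i in range(10_000_000))

if __name__ == "__main__":
    with cProfile.Profile() as profiler:
        cpu_heavy_work()

    # Show the top 10 calls by cumulative time
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)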

High memory:

High memory, compared to CPU, can cause [CRITICAL] WORKER TIMEOUT. Assuming the application (e.g. one or multiple Gunicorn processes) is consuming high enough memory, the OOM killer (a Linux kernel mechanism) would kill these processes.

Tooling, such as the Diagnose and Solve Problems -> Memory Usage blade - or APMs, like App Insights, New Relic, Dynatrace, or others - can help determine if an application process is consuming high memory.

This may surface like this:

[2024-06-28 20:45:22 +0000] [74] [CRITICAL] WORKER TIMEOUT (pid:92)
2024-06-28T20:45:22.8354846Z [2024-06-28 20:45:22 +0000] [92] [ERROR] Worker (pid:92) was sent SIGKILL! Perhaps out of memory?

However, there are times when this message is misleading and memory is not actually the cause. A good first step is to investigate this from a memory perspective - at least to see if there is high memory usage (a leak, consistently high usage, or a large enough spike) - but this can also simply occur due to what’s described in the Missing or inappropriate --timeout section above.

If high memory isn’t seen, investigate this from a --timeout usage and long-running request perspective.

NOTE: A lot of what’s described in Container Apps: Profiling Python applications for performance issues can also be used in terms of tooling
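
If you want a code-level signal alongside the Memory Usage blade or an APM, one option - an assumption here, not something the platform requires - is Python’s built-in tracemalloc module to see which lines are allocating the most memory:

# memory_sketch.py - illustrative only
import tracemalloc

tracemalloc.start()

# Stand-in for a memory-heavy operation in the application
data = [b"x" * 1024 for _ in range(100_000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    # Print the top 5 allocation sites by total size
    print(stat)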

Long running requests

This is very much in the same vein as the Missing or inappropriate --timeout section above.

Long-running requests handled by a Gunicorn worker process that exceed --timeout will cause this. This could happen for almost any number of reasons, but some examples may be:

  • ML/AI-based applications - e.g. computation on large data sets that takes minutes at a time (and may be CPU-intensive too)
  • Long running database queries
  • Too much load on the application - SNAT port exhaustion, slow logic execution, higher resource consumption, etc.

As mentioned above, for this you can set --timeout back to 600 (--timeout 600), or essentially anything over 240.

Utilizing any of the profilers or APMs within Container Apps: Profiling Python applications for performance issues should be done to try and further pinpoint the issue. You can use the various metrics in the Metrics blade on App Service as a starting point for data such as incoming requests, duration, resource usage, and more.