Why is my Amazon ECS task stuck in the PENDING state?

My Amazon Elastic Container Service (Amazon ECS) task is stuck in the PENDING state.

Short description

The following scenarios commonly cause Amazon ECS tasks to get stuck in the PENDING state:

  • The Docker daemon is unresponsive.
  • The Docker image is large.
  • The Amazon ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch.
  • The Amazon ECS container agent takes a long time to stop an existing task.
  • Your Amazon Virtual Private Cloud (Amazon VPC) routing isn't configured correctly.
  • An essential container depends on non-essential containers that fail to be HEALTHY.

Resolution

To see why your task is stuck in the PENDING state, complete the following troubleshooting steps.

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

The Docker daemon is unresponsive

For CPU issues, complete the following steps:

1.    Use Amazon CloudWatch metrics to see if your container instance exceeded its maximum CPU utilization.

2.    Increase the size of your container instance as needed.

For memory issues, complete the following steps:

1.    Run the free command to see how much memory is available for your system.

2.    Increase the size of your container instance as needed.
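As a sketch, the free command with the -m flag reports memory in MiB; a low value in the "available" column indicates memory pressure that can make the Docker daemon unresponsive:

```shell
# Report memory in MiB; a low "available" value signals memory pressure
free -m
```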

For I/O issues, complete the following steps:

1.    Run the iotop command.

2.    Identify which tasks in which services use the most IOPS. Then, use task placement constraints and strategies to distribute these tasks to distinct container instances.
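As a sketch, a spread placement strategy in your service definition distributes tasks across container instances (the field value shown is one of the supported options):

```json
"placementStrategy": [
  {
    "type": "spread",
    "field": "instanceId"
  }
]
```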

-or-

Use CloudWatch to create an alarm for your Amazon Elastic Block Store (Amazon EBS) BurstBalance metrics. Then, use an AWS Lambda function or your own custom logic to balance tasks.
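For example, the following AWS CLI command sketches such an alarm; the volume ID, threshold, and SNS topic ARN are placeholders to replace with your own values:

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name ebs-burst-balance-low \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 20 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ebs-alerts
```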

The Docker image is large

Larger images take longer to download and increase the amount of time the task is in the PENDING state.

To speed up the transition time, tune the ECS_IMAGE_PULL_BEHAVIOR parameter to take advantage of image caching.

Note: For example, set the ECS_IMAGE_PULL_BEHAVIOR parameter to prefer-cached in /etc/ecs/ecs.config. If prefer-cached is specified, then the image is pulled remotely when there's no cached image. Otherwise, the cached image on the instance is used.
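For example, an /etc/ecs/ecs.config file that turns on cached-image preference might look like the following sketch (the cluster name is illustrative):

```
ECS_CLUSTER=my-cluster
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
```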

The Amazon ECS container agent lost connectivity with the Amazon ECS service in the middle of a launch

1.    To verify the status and connectivity of the Amazon ECS container agent, run the following commands on your container instance for your operating system.

Run the following commands for Amazon Linux 1:

$ sudo status ecs
$ sudo docker ps -f name=ecs-agent

Run the following commands for Amazon Linux 2:

$ sudo systemctl status ecs
$ sudo docker ps -f name=ecs-agent

Note: If the agent is running, the output shows active (running).

2.    To view metadata on running tasks in your ECS container instance, run the following command on your container instance:

$ curl http://localhost:51678/v1/metadata

You receive output that's similar to the following:

{
  "Cluster": "CLUSTER_ID",
  "ContainerInstanceArn": "arn:aws:ecs:REGION:ACCOUNT_ID:container-instance/TASK_ID",
  "Version": "Amazon ECS Agent - AGENT "
}

3.    To view information on running tasks, run the following command on your container instance:

$ curl http://localhost:51678/v1/tasks

You receive output that's similar to the following:

{
  "Tasks": [
    {
      "Arn": "arn:aws:ecs:REGION:ACCOUNT_ID:task/TASK_ID",
      "DesiredStatus": "RUNNING",
      "KnownStatus": "RUNNING",
      ... ...
    }
  ]
}

4.    If the issue relates to a disconnected agent, then restart your container agent with the commands for your operating system.

Run the following commands for Amazon Linux 1:

$ sudo stop ecs
$ sudo start ecs

Run the following commands for Amazon Linux 2:

$ sudo systemctl stop ecs
$ sudo systemctl start ecs

On Amazon Linux 1, you receive output that's similar to the following message:

ecs start/running, process xxxx

5.    To determine agent connectivity, check the following logs during the relevant time frame for keywords such as error, warn, or agent transition state:

  • Amazon ECS container agent log: /var/log/ecs/ecs-agent.log.yyyy-mm-dd-hh
  • Amazon ECS init log: /var/log/ecs/ecs-init.log
  • Docker logs: /var/log/docker

Note: You can also use the Amazon ECS logs collector to collect general operating system logs, Docker logs, and container agent logs for Amazon ECS.
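As a sketch, you can search the agent and init logs for those keywords with grep (the log paths exist only on ECS container instances, and the date suffix varies):

```shell
sudo grep -iE "error|warn|transition" /var/log/ecs/ecs-agent.log* /var/log/ecs/ecs-init.log
```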

The Amazon ECS container agent takes a long time to stop an existing task

When your container agent receives new tasks to start from Amazon ECS (moving from PENDING to RUNNING), it might still have older tasks to stop. In this case, the agent doesn't start the new tasks until the old tasks stop.

To control container stop and start timeout at the container instance level, set the following two parameters:

1.    In /etc/ecs/ecs.config, adjust the value of the ECS_CONTAINER_STOP_TIMEOUT parameter. This parameter sets the amount of time that elapses before Amazon ECS forcibly ends your containers if they don't exit normally on their own.

Note: The default value for Linux and Windows is 30s.

2.    In /etc/ecs/ecs.config, adjust the value of the ECS_CONTAINER_START_TIMEOUT parameter. This parameter sets the amount of time that elapses before the Amazon ECS container agent stops trying to start the container.

Note: The default value is 3m for Linux and 8m for Windows.
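For example, the following /etc/ecs/ecs.config lines are a sketch that raises both timeouts (the values shown are illustrative, not recommendations):

```
ECS_CONTAINER_STOP_TIMEOUT=2m
ECS_CONTAINER_START_TIMEOUT=5m
```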

If your agent version is 1.26.0 or later, then you can define the preceding stop and start timeout parameters per task in the task definition. If a dependency fails to reach its required status within the timeout, then the task might transition to a STOPPED state. For example, suppose that containerA has a dependency on containerB reaching a COMPLETE, SUCCESS, or HEALTHY status. If you specify a startTimeout value for containerB and containerB doesn't reach the desired status within that time, then containerA doesn't start.

For an example of container dependency, see Example: Container dependency on AWS GitHub.
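As a minimal sketch, the per-task settings in a task definition might look like the following; the container and image names are illustrative, and a HEALTHY condition also requires that containerB define a healthCheck:

```json
"containerDefinitions": [
  {
    "name": "containerB",
    "image": "example/sidecar:latest",
    "essential": false,
    "startTimeout": 120,
    "stopTimeout": 30
  },
  {
    "name": "containerA",
    "image": "example/app:latest",
    "essential": true,
    "dependsOn": [
      {
        "containerName": "containerB",
        "condition": "HEALTHY"
      }
    ]
  }
]
```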

Your Amazon VPC routing isn't configured correctly

Check the configuration for the VPC subnet that your Amazon ECS or Fargate tasks run in. If the subnet isn't configured correctly, then it doesn't have access to Amazon ECS or Amazon ECR. To resolve this issue, be sure that the route table for your subnet has an internet gateway or a NAT gateway. If you launch a task in a subnet that doesn't have an egress route to the internet, then use AWS PrivateLink. PrivateLink lets you privately access Amazon ECS APIs through private IP addresses.
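As a sketch, the following AWS CLI command creates one such interface endpoint for the Amazon ECR API; all IDs and the Region are placeholders to replace with your own values:

```shell
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled
```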

An essential container depends on non-essential containers that fail to be HEALTHY

If your non-essential containers fail to be in a HEALTHY state, and an essential container depends on them, then your task becomes stuck in PENDING. In this case, you see the following message:

"stoppedReason":"Service ABCXYZ: task last status remained in PENDING too long."

To resolve this issue, make sure that your dependent (non-essential) containers work as expected. If you can't resolve the underlying issue, then mark these containers as essential so that the task fails fast instead of remaining stuck in PENDING.
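For example, a container healthCheck in the task definition (a sketch with an illustrative endpoint) determines whether the dependent container reports HEALTHY:

```json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 10
}
```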

Related information

Container dependency

Amazon ECS Container Agent (AWS GitHub)
