AWS ECS

Elastic Container Service
Author

Benedict Thekkel


Core Concepts

Cluster
  • Logical grouping of services and tasks
  • Just a namespace when using Fargate; with the EC2 launch type it also manages the underlying instances
  • One cluster per environment is common (rm-prod, rm-staging)

Task Definition
  • Blueprint for your container(s) — like a docker-compose.yml
  • Versioned; each update creates a new revision
  • Defines: image, CPU/memory, env vars, secrets, port mappings, volumes, logging, IAM roles

Task
  • A running instance of a task definition
  • Ephemeral — stopped tasks are gone
  • Can be run one-off (e.g. Django migrations) or managed by a Service

Service
  • Maintains a desired count of tasks; restarts failed ones
  • Integrates with an ALB for load balancing
  • Handles rolling deploys, the deployment circuit breaker, and scaling


Launch Types

Fargate
  • Serverless — AWS manages the underlying host
  • You specify vCPU + memory at the task level; billed per second
  • No SSH access to the host (use ECS Exec for container access)
  • Available CPU/memory combos are fixed (from 0.25 vCPU / 512MB up to 16 vCPU / 120GB)
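The "fixed combos" constraint trips people up in IaC reviews, so it can help to encode it as a quick validator. The matrix below is a partial, assumed snapshot of the smaller Fargate sizes — check the ECS docs for the authoritative, current list:

```python
import math

# Partial snapshot of Fargate's CPU (units) → allowed memory (MiB) matrix.
# Assumed values for illustration; the real list is in the ECS docs.
FARGATE_COMBOS = {
    256:  [512, 1024, 2048],                 # 0.25 vCPU
    512:  [1024, 2048, 3072, 4096],          # 0.5 vCPU
    1024: list(range(2048, 8193, 1024)),     # 1 vCPU: 2–8 GB in 1 GB steps
    2048: list(range(4096, 16385, 1024)),    # 2 vCPU: 4–16 GB
    4096: list(range(8192, 30721, 1024)),    # 4 vCPU: 8–30 GB
}

def valid_fargate_size(cpu: int, memory: int) -> bool:
    """True if (cpu, memory) is an allowed Fargate task size in the snapshot."""
    return memory in FARGATE_COMBOS.get(cpu, [])
```

Catching an invalid pair here is cheaper than waiting for `RegisterTaskDefinition` to reject it at deploy time.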

EC2 Launch Type
  • You manage an ASG of EC2 instances registered to the cluster
  • The ECS Agent runs on each instance and manages task placement
  • More control (GPU, specific instance types, host networking)
  • Use the ECS-optimised AMI; use Capacity Providers to manage the ASG

External Launch Type
  • Run ECS tasks on-prem or in other clouds via ECS Anywhere
  • Niche — edge/hybrid use cases


Task Definition Deep Dive

{
  "family": "rm-django",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::...:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::...:role/rm-django-task-role",
  "containerDefinitions": [
    {
      "name": "django",
      "image": "123456789.dkr.ecr.ap-southeast-2.amazonaws.com/rm-django:latest",
      "portMappings": [{ "containerPort": 8000 }],
      "environment": [
        { "name": "DJANGO_ENV", "value": "production" }
      ],
      "secrets": [
        { "name": "DATABASE_URL", "valueFrom": "arn:aws:secretsmanager:...:secret:rm/db-url" },
        { "name": "SECRET_KEY", "valueFrom": "arn:aws:secretsmanager:...:secret:rm/secret-key" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/rm-django",
          "awslogs-region": "ap-southeast-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8000/health/ || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Two IAM roles — don’t confuse them:
  • executionRoleArn — the ECS agent uses this to pull images from ECR, fetch secrets from Secrets Manager/SSM, and write logs to CloudWatch
  • taskRoleArn — your application uses this at runtime (e.g. S3 read, SES send)


Networking

awsvpc mode (required for Fargate)
  • Each task gets its own ENI and private IP
  • Security groups are applied at the task level (not the host level)
  • Tasks in private subnets need a NAT gateway or VPC endpoints to reach ECR/Secrets Manager/CloudWatch

VPC Endpoints worth having for ECS Fargate:
  • com.amazonaws.ap-southeast-2.ecr.api
  • com.amazonaws.ap-southeast-2.ecr.dkr
  • com.amazonaws.ap-southeast-2.secretsmanager
  • com.amazonaws.ap-southeast-2.logs
  • S3 Gateway endpoint (ECR layers are stored in S3)

Without these, all Fargate traffic routes via NAT GW — adds cost and latency.


Service Configuration

resource "aws_ecs_service" "django" {
  name            = "rm-django"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.django.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.django.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.django.arn
    container_name   = "django"
    container_port   = 8000
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_controller {
    type = "ECS" # or CODE_DEPLOY for blue/green
  }
}

Deployment Strategies

Rolling (default)
  • Replaces tasks incrementally
  • minimumHealthyPercent (default 100) and maximumPercent (default 200) control the window
  • Simple; some risk of mixed versions serving traffic simultaneously
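The window arithmetic is worth internalising. A hedged sketch (rounding directions assumed from the ECS docs: the minimum rounds up, the maximum rounds down):

```python
import math

def deploy_window(desired: int, minimum_healthy_pct: int = 100,
                  maximum_pct: int = 200) -> tuple[int, int]:
    """Task-count bounds ECS respects during a rolling deploy.

    Rounding directions are assumed: minimumHealthyPercent rounds up,
    maximumPercent rounds down — verify against the current ECS docs.
    """
    low = math.ceil(desired * minimum_healthy_pct / 100)
    high = math.floor(desired * maximum_pct / 100)
    return low, high
```

With the defaults, a service at desired_count 2 briefly runs up to 4 tasks during a deploy — which matters if your task sizes push against account quotas or database connection limits.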

Blue/Green (CodeDeploy)
  • Spins up an entirely new task set; shifts traffic via ALB weighted target groups
  • Supports canary (e.g. CodeDeployDefault.ECSCanary10Percent5Minutes) and linear rollout configs
  • Full rollback capability; more complex setup

Canary via App Mesh / weighted routing
  • Advanced; use ALB weighted target groups manually


Auto Scaling

Application Auto Scaling on ECS
  • Scales the service desired count based on metrics
  • Target tracking (simplest): keep average CPU at 60%, ECS handles the math
  • Step scaling: explicit rules per threshold
  • Scheduled scaling: known traffic patterns

resource "aws_appautoscaling_target" "django" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.django.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.django.resource_id
  scalable_dimension = aws_appautoscaling_target.django.scalable_dimension
  service_namespace  = aws_appautoscaling_target.django.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60.0
  }
}
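The "math" target tracking handles is roughly proportional scaling. A sketch of the idea (the real algorithm also involves CloudWatch alarm evaluation, cooldowns, and the min/max capacity bounds):

```python
import math

def target_tracking_desired(current_tasks: int, metric_value: float,
                            target: float) -> int:
    """Approximate new desired count under target tracking.

    Scales the task count proportionally so the average metric lands near
    the target; rounded up, since scale-out is biased toward capacity.
    This is an illustration of the principle, not the exact AWS algorithm.
    """
    return math.ceil(current_tasks * metric_value / target)
```

So 4 tasks averaging 90% CPU against a 60% target suggests 6 tasks — which is why a too-low target quietly doubles your bill.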

Common Patterns for Django + Celery

Service per process type:

ECS Cluster: rm-prod
├── Service: rm-django          (2+ tasks, behind ALB, gunicorn)
├── Service: rm-celery-worker   (2+ tasks, no ALB, SQS consumer)
├── Service: rm-celery-beat     (1 task, desired_count=1, scheduler)
└── Service: rm-flower          (1 task, internal ALB only, monitoring)

All share the same task definition family but with different commands:

Service command overrides:
  • Django: gunicorn rm.wsgi:application --bind 0.0.0.0:8000
  • Celery worker: celery -A rm worker -l info -Q default,high
  • Celery beat: celery -A rm beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler

Celery Beat — desired_count: 1 always. If two beat tasks run simultaneously, every scheduled task fires twice. Consider a lock (Redlock) as a safety net.
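The lock safety net can be sketched as a TTL'd "SET if not exists". With real Redis you would use `SET key token NX PX ttl` (or the redis-py `set(..., nx=True, px=...)` call); here a tiny in-memory stand-in keeps the sketch runnable:

```python
import time
import uuid

class InMemoryStore:
    """Stand-in for a Redis client's SET NX PX semantics (illustration only)."""
    def __init__(self):
        self._data = {}  # key -> (token, expiry in monotonic seconds)

    def set_nx_px(self, key, token, ttl_ms):
        """Set key only if absent or expired; returns True on success."""
        now = time.monotonic()
        current = self._data.get(key)
        if current is None or current[1] <= now:
            self._data[key] = (token, now + ttl_ms / 1000)
            return True
        return False

def acquire_beat_lock(store, key="rm:celery-beat-lock", ttl_ms=60_000):
    """Try to become the single active beat; returns a token or None.

    The holder must refresh the lock before the TTL lapses; the TTL ensures
    a crashed beat task doesn't hold the lock forever.
    """
    token = str(uuid.uuid4())
    return token if store.set_nx_px(key, token, ttl_ms) else None
```

The key name and TTL above are hypothetical. Note this is belt-and-braces: desired_count=1 remains the primary control — ECS can briefly run two beat tasks during a deploy, which is exactly when the lock earns its keep.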


One-Off Tasks (Migrations, Management Commands)

Run as standalone tasks, not services:

aws ecs run-task \
  --cluster rm-prod \
  --task-definition rm-django:42 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx]}" \
  --overrides '{"containerOverrides":[{"name":"django","command":["python","manage.py","migrate"]}]}'

In CI/CD, run migrations as a step before updating the service. Wait for the task to exit 0 before proceeding.
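The "wait for exit 0" step means polling `aws ecs describe-tasks` until the task stops, then inspecting the container's exit code. A minimal parser for that response, assuming the standard describe-tasks JSON shape:

```python
def migration_succeeded(describe_tasks_response: dict,
                        container_name: str = "django") -> bool:
    """True iff the named container in the first task stopped with exit code 0.

    Expects the JSON shape returned by `aws ecs describe-tasks` /
    boto3's ecs.describe_tasks (shape assumed from the ECS API).
    """
    tasks = describe_tasks_response.get("tasks", [])
    if not tasks or tasks[0].get("lastStatus") != "STOPPED":
        return False
    for container in tasks[0].get("containers", []):
        if container.get("name") == container_name:
            return container.get("exitCode") == 0
    return False
```

A missing `exitCode` (e.g. the image failed to pull) counts as failure here, which is the safe default for a CI gate.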


ECR (Elastic Container Registry)

  • Private Docker registry; integrated with ECS task execution role
  • Lifecycle policies — auto-expire old images (e.g. keep last 10 tagged, delete untagged after 1 day)
  • Image scanning (Basic or Enhanced via Inspector)
  • Tag immutability — prevent overwriting tags in prod

resource "aws_ecr_repository" "django" {
  name                 = "rm-django"
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

resource "aws_ecr_lifecycle_policy" "django" {
  repository = aws_ecr_repository.django.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep last 10 images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 10
      }
      action = { type = "expire" }
    }]
  })
}

Logging

awslogs (CloudWatch) — standard
  • Logs go to a CloudWatch log group per service
  • Set retention (don’t leave it at “Never expire”)
  • Query with CloudWatch Logs Insights

FireLens — advanced
  • Sidecar container (Fluent Bit or Fluentd) acts as a log router
  • Ship to S3, OpenSearch, Datadog, etc.
  • More control over log parsing and routing


ECS Exec (Debugging)

aws ecs execute-command \
  --cluster rm-prod \
  --task <task-id> \
  --container django \
  --interactive \
  --command "/bin/bash"

Requires:
  • enableExecuteCommand: true on the service
  • SSM Session Manager permissions on the task role
  • An ssmmessages VPC endpoint or NAT access


Service Connect vs Service Discovery

Service Connect (newer, preferred)
  • Built-in service mesh; tasks find each other by name
  • Handles retries, circuit breaking, metrics
  • Define a namespace; services register automatically

Cloud Map / Service Discovery
  • DNS-based; tasks register A records
  • Simpler; no proxy overhead
  • Good for internal service-to-service traffic when you don’t need the observability features


Cost Considerations

Fargate pricing (ap-southeast-2, approximate):
  • 0.25 vCPU / 512MB ≈ $10–12/month per task (always-on)
  • 0.5 vCPU / 1GB ≈ $20–25/month per task
  • Fargate Spot — up to 70% cheaper; tasks can be interrupted; fine for Celery workers, not web

Use Spot for workers, on-demand for Django web service.


Gotchas

  • Image tag latest in prod — don’t. Use immutable SHA or semver tags; ECS won’t re-pull latest on redeploy unless forced
  • Task stops with exit code 1 immediately — check CloudWatch logs; ECS will loop-restart and you’ll burn through your deployment health check window
  • desired_count: 0 — valid way to “pause” a service without destroying it (e.g. beat in non-prod)
  • Health check grace period — set health_check_grace_period_seconds on the service (e.g. 120s) for Django startup time; otherwise ALB kills tasks before they’re ready
  • Secrets at task start only — secrets injected via secrets in task def are fetched once at task start; rotating a secret requires task replacement
  • Container dependency ordering — use dependsOn in task def if you have sidecars (e.g. wait for datadog-agent to be healthy before starting app)