AWS ECS
Core Concepts
Cluster
- Logical grouping of services and tasks
- Just a namespace when using Fargate; with the EC2 launch type it also manages the underlying instances
- One cluster per environment is common (rm-prod, rm-staging)

Task Definition
- Blueprint for your container(s), like a docker-compose.yml
- Versioned; each update creates a new revision
- Defines: image, CPU/memory, env vars, secrets, port mappings, volumes, logging, IAM role

Task
- A running instance of a task definition
- Ephemeral: stopped tasks are gone
- Can be run one-off (e.g. Django migrations) or managed by a Service

Service
- Maintains a desired count of tasks; restarts failed ones
- Integrates with ALB for load balancing
- Handles rolling deploys, circuit breaker, scaling
Launch Types
Fargate
- Serverless: AWS manages the underlying host
- You specify vCPU + memory at task level; billed per second
- No SSH access to host (use ECS Exec for container access)
- Available CPU/memory combos are fixed (from 0.25 vCPU / 512 MB up to 16 vCPU / 120 GB)

EC2 Launch Type
- You manage an ASG of EC2 instances registered to the cluster
- The ECS agent runs on each instance, registers it with the cluster, and starts/stops tasks; placement decisions are made by the ECS scheduler
- More control (GPU, specific instance types, host networking)
- Use the ECS-optimised AMI; use Capacity Providers to manage the ASG

External Launch Type
- Run ECS tasks on on-prem or other-cloud hosts via ECS Anywhere
- Niche: edge/hybrid use cases
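Because Fargate only accepts certain CPU/memory pairs, it can help to validate a size before registering a task definition. The sketch below encodes the combinations as the AWS documentation lists them to the best of my knowledge (including the larger 8 and 16 vCPU sizes); the table and the function name `is_valid_fargate_size` are illustrative, so check the current Fargate docs before relying on it.

```python
# Fargate task sizes: CPU units -> allowed memory values in MiB.
# 1 vCPU = 1024 CPU units. Table transcribed from the AWS docs as I
# understand them; treat it as an assumption and verify before use.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),     # 4 GB steps
    16384: list(range(32768, 122880 + 1, 8192)),   # 8 GB steps
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True if the CPU/memory pair is in the table above."""
    return memory_mib in FARGATE_SIZES.get(cpu_units, [])


if __name__ == "__main__":
    # The task definition in these notes uses cpu=512, memory=1024.
    print(is_valid_fargate_size(512, 1024))
    # 0.25 vCPU cannot be paired with 4 GB.
    print(is_valid_fargate_size(256, 4096))
```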
Task Definition Deep Dive
{
  "family": "rm-django",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::...:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::...:role/rm-django-task-role",
  "containerDefinitions": [
    {
      "name": "django",
      "image": "123456789.dkr.ecr.ap-southeast-2.amazonaws.com/rm-django:latest",
      "portMappings": [{ "containerPort": 8000 }],
      "environment": [
        { "name": "DJANGO_ENV", "value": "production" }
      ],
      "secrets": [
        { "name": "DATABASE_URL", "valueFrom": "arn:aws:secretsmanager:...:secret:rm/db-url" },
        { "name": "SECRET_KEY", "valueFrom": "arn:aws:secretsmanager:...:secret:rm/secret-key" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/rm-django",
          "awslogs-region": "ap-southeast-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8000/health/ || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Two IAM roles (don’t confuse them):
- executionRoleArn: ECS agent uses this to pull images from ECR, fetch secrets from Secrets Manager/SSM, write logs to CloudWatch
- taskRoleArn: your application uses this at runtime (e.g. S3 read, SES send)
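To make the split between the two roles concrete, here is a minimal sketch that builds one policy document for each. The action lists follow the responsibilities described above; the helper names, the bucket ARN, and the log group ARN are hypothetical placeholders, and a real setup would likely scope resources more tightly.

```python
import json


def policy(statements):
    """Wrap statements in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": statements}


def execution_role_policy(secret_arns, log_group_arn):
    """What ECS itself needs to start the task: image pull, secrets, logs."""
    return policy([
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"{log_group_arn}:*",
        },
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": list(secret_arns),
        },
    ])


def task_role_policy(bucket_arn):
    """What the application code needs at runtime (example: S3 read, SES send)."""
    return policy([
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": f"{bucket_arn}/*"},
        {"Effect": "Allow", "Action": ["ses:SendEmail"], "Resource": "*"},
    ])


if __name__ == "__main__":
    # All ARNs below are made-up placeholders for illustration.
    exec_doc = execution_role_policy(
        ["arn:aws:secretsmanager:ap-southeast-2:111122223333:secret:example"],
        "arn:aws:logs:ap-southeast-2:111122223333:log-group:/ecs/rm-django",
    )
    print(json.dumps(exec_doc, indent=2))
    print(json.dumps(task_role_policy("arn:aws:s3:::example-bucket"), indent=2))
```

If the application gets an AccessDenied at runtime, the task role is the one to look at; if the task never starts (image pull or secret fetch fails), look at the execution role.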
Networking
awsvpc mode (required for Fargate)
- Each task gets its own ENI and private IP
- Security groups applied at task level (not host level)
- Tasks in private subnets need a NAT GW or VPC endpoints for ECR/Secrets Manager/CloudWatch

VPC Endpoints worth having for ECS Fargate:
- com.amazonaws.ap-southeast-2.ecr.api
- com.amazonaws.ap-southeast-2.ecr.dkr
- com.amazonaws.ap-southeast-2.secretsmanager
- com.amazonaws.ap-southeast-2.logs
- S3 Gateway endpoint (ECR layers are stored in S3)

Without these, all Fargate traffic to those services routes via the NAT GW, which adds cost and latency.
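If the same endpoint set is needed in more than one region, the service names can be generated rather than copied. A small sketch follows; the function name is hypothetical, and the list only covers the endpoints named above.

```python
def fargate_endpoint_services(region: str) -> dict:
    """Return the VPC endpoint service names listed above, grouped by endpoint type."""
    interface_suffixes = ("ecr.api", "ecr.dkr", "secretsmanager", "logs")
    return {
        "Interface": [f"com.amazonaws.{region}.{s}" for s in interface_suffixes],
        # ECR stores image layers in S3, so a gateway endpoint covers layer pulls.
        "Gateway": [f"com.amazonaws.{region}.s3"],
    }


if __name__ == "__main__":
    for kind, names in fargate_endpoint_services("ap-southeast-2").items():
        for name in names:
            print(kind, name)
```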
Service Configuration
resource "aws_ecs_service" "django" {
  name            = "rm-django"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.django.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.django.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.django.arn
    container_name   = "django"
    container_port   = 8000
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_controller {
    type = "ECS" # or CODE_DEPLOY for blue/green
  }
}
Deployment Strategies
Rolling (default)
- Replaces tasks incrementally
- minimumHealthyPercent (default 100) and maximumPercent (default 200) control the window
- Simple; some risk of mixed versions serving traffic simultaneously

Blue/Green (CodeDeploy)
- Spins up an entirely new task set; shifts traffic between two ALB target groups
- Supports canary (e.g. CodeDeployDefault.ECSCanary10Percent5Minutes) and linear (e.g. CodeDeployDefault.ECSLinear10PercentEvery1Minutes) rollout configs
- Full rollback capability; more complex setup

Canary via App Mesh / weighted routing
- Advanced; use ALB weighted target groups manually
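The rolling-deploy window is easier to reason about with numbers. The sketch below computes the bounds for a given desired count, assuming the rounding the ECS documentation describes (minimum healthy rounded up, maximum rounded down); the function name is illustrative.

```python
def rolling_window(desired_count: int, min_healthy_pct: int = 100, max_pct: int = 200):
    """Return (min_running, max_total) task counts allowed during a rolling deploy.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    # Ceiling division: the lower bound is rounded up.
    min_running = -(-desired_count * min_healthy_pct // 100)
    # Floor division: the upper bound is rounded down.
    max_total = desired_count * max_pct // 100
    return min_running, max_total


if __name__ == "__main__":
    # Defaults with 2 tasks: never fewer than 2 running, up to 4 during the deploy.
    print(rolling_window(2))
    # 0% / 100% with 1 task: old task stops before the new one starts.
    print(rolling_window(1, 0, 100))
    print(rolling_window(3, 50, 150))
```

With the defaults, ECS has room to start a full new set before stopping the old one, which is why both versions can serve traffic at the same time for a short while.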
Auto Scaling
Application Auto Scaling on ECS
- Scale service desired count based on metrics
- Target tracking (simplest): keep average CPU at 60%; Application Auto Scaling creates the CloudWatch alarms and handles the math
- Step scaling: explicit rules per threshold
- Scheduled scaling: known traffic patterns
resource "aws_appautoscaling_target" "django" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.django.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.django.resource_id
  scalable_dimension = aws_appautoscaling_target.django.scalable_dimension
  service_namespace  = aws_appautoscaling_target.django.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60.0
  }
}
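As rough intuition for what the target-tracking policy above aims at, the sketch below uses a simple proportional estimate clamped to the min/max capacity. This is an approximation for capacity planning only, not the actual Application Auto Scaling algorithm, which works through CloudWatch alarms and cooldowns and scales in more conservatively than it scales out.

```python
import math


def estimate_desired_count(current_tasks: int, observed_cpu: float,
                           target_cpu: float = 60.0,
                           min_capacity: int = 2, max_capacity: int = 10) -> int:
    """Proportional estimate of the task count that brings average CPU back to target."""
    raw = math.ceil(current_tasks * observed_cpu / target_cpu)
    # Clamp to the bounds configured on the scalable target.
    return max(min_capacity, min(max_capacity, raw))


if __name__ == "__main__":
    print(estimate_desired_count(4, 90.0))   # load above target: scale out
    print(estimate_desired_count(2, 30.0))   # load below target: held at min_capacity
    print(estimate_desired_count(8, 95.0))   # would exceed max_capacity: capped
```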
Common Patterns for Django + Celery
Service per process type:
ECS Cluster: rm-prod
├── Service: rm-django (2+ tasks, behind ALB, gunicorn)
├── Service: rm-celery-worker (2+ tasks, no ALB, SQS consumer)
├── Service: rm-celery-beat (1 task, desired_count=1, scheduler)
└── Service: rm-flower (1 task, internal ALB only, monitoring)
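All share the same image, but each service needs its own task definition with a different command, because a service runs its task definition as-is (containerOverrides are only available on run-task):
| Service | Command |
|---|---|
| Django | gunicorn rm.wsgi:application --bind 0.0.0.0:8000 |
| Celery worker | celery -A rm worker -l info -Q default,high |
| Celery beat | celery -A rm beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler |
Celery Beat: always desired_count = 1. If two beat tasks run simultaneously, every scheduled task fires twice. A default rolling deploy (minimumHealthyPercent 100, maximumPercent 200) briefly runs the old and new task side by side, so set minimumHealthyPercent to 0 and maximumPercent to 100 for this service, and consider a lock (Redlock) as a safety net.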
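One way to get a different command per service from a single image is to generate one task definition per process type. The sketch below produces only the fields that differ; CPU, memory, roles, logging and the rest are omitted, and the image URI passed in is a placeholder.

```python
import shlex

# Commands taken from the table above, keyed by service name.
COMMANDS = {
    "rm-django": "gunicorn rm.wsgi:application --bind 0.0.0.0:8000",
    "rm-celery-worker": "celery -A rm worker -l info -Q default,high",
    "rm-celery-beat": (
        "celery -A rm beat -l info "
        "--scheduler django_celery_beat.schedulers:DatabaseScheduler"
    ),
}


def task_definition(family: str, image: str) -> dict:
    """Partial task definition: same image for every family, different command."""
    return {
        "family": family,
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                # ECS expects the command as a list of arguments.
                "command": shlex.split(COMMANDS[family]),
            }
        ],
    }


if __name__ == "__main__":
    image = "example.invalid/rm-django:abc1234"  # placeholder image URI
    for family in COMMANDS:
        print(task_definition(family, image))
```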
One-Off Tasks (Migrations, Management Commands)
Run as standalone tasks, not services:
aws ecs run-task \
  --cluster rm-prod \
  --task-definition rm-django:42 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx]}" \
  --overrides '{"containerOverrides":[{"name":"django","command":["python","manage.py","migrate"]}]}'

In CI/CD, run migrations as a step before updating the service. Wait for the task to exit 0 before proceeding.
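Checking for "exit 0" means inspecting the stopped task rather than just the run-task call, which returns as soon as the task is accepted. A minimal sketch of that check, operating on the parsed JSON of an `aws ecs describe-tasks` response; the function name is hypothetical, and the container name matches the override above.

```python
def migration_succeeded(describe_tasks_response: dict, container_name: str = "django") -> bool:
    """Return True only if the single described task stopped and the container exited 0."""
    if describe_tasks_response.get("failures"):
        return False
    tasks = describe_tasks_response.get("tasks", [])
    if len(tasks) != 1:
        return False
    task = tasks[0]
    if task.get("lastStatus") != "STOPPED":
        return False  # still running or provisioning
    for container in task.get("containers", []):
        if container.get("name") == container_name:
            # exitCode is missing when the container never started (e.g. image pull failure).
            return container.get("exitCode") == 0
    return False


if __name__ == "__main__":
    ok = {"tasks": [{"lastStatus": "STOPPED",
                     "containers": [{"name": "django", "exitCode": 0}]}],
          "failures": []}
    print(migration_succeeded(ok))
```

In a pipeline this would typically sit after `aws ecs wait tasks-stopped`, with the deploy step gated on the result.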
ECR (Elastic Container Registry)
- Private Docker registry; integrated with ECS task execution role
- Lifecycle policies — auto-expire old images (e.g. keep last 10 tagged, delete untagged after 1 day)
- Image scanning (Basic or Enhanced via Inspector)
- Tag immutability — prevent overwriting tags in prod
resource "aws_ecr_repository" "django" {
  name                 = "rm-django"
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

resource "aws_ecr_lifecycle_policy" "django" {
  repository = aws_ecr_repository.django.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep last 10 images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 10
      }
      action = { type = "expire" }
    }]
  })
}
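The lifecycle bullet earlier mentions two behaviours (expire untagged images after a day, keep the last 10), while the policy above has a single rule. A sketch of a two-rule policy document follows, built as plain JSON; the function name and defaults are illustrative. Note that the "any" rule counts all remaining images, not only tagged ones, and ECR expects a rule with tagStatus "any" to carry the highest rulePriority.

```python
import json


def lifecycle_policy(keep_last: int = 10, untagged_days: int = 1) -> dict:
    """Build an ECR lifecycle policy with an untagged-expiry rule and a keep-last-N rule."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images after {untagged_days} day(s)",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # Evaluated last: a tagStatus "any" rule must have the highest priority number.
                "rulePriority": 2,
                "description": f"Keep last {keep_last} images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_last,
                },
                "action": {"type": "expire"},
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(lifecycle_policy(), indent=2))
```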
Logging
awslogs (CloudWatch): standard
- Logs go to a CloudWatch log group per service
- Set retention (don’t leave at “Never expire”)
- Query with CloudWatch Logs Insights

FireLens: advanced
- Sidecar container (Fluent Bit or Fluentd) as log router
- Ship to S3, OpenSearch, Datadog, etc.
- More control over log parsing and routing
ECS Exec (Debugging)
aws ecs execute-command \
  --cluster rm-prod \
  --task <task-id> \
  --container django \
  --interactive \
  --command "/bin/bash"

Requires:
- enableExecuteCommand: true on the service
- SSM Session Manager (ssmmessages) permissions on the task role
- An ssmmessages VPC endpoint (com.amazonaws.ap-southeast-2.ssmmessages) or NAT access
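The task-role permissions for ECS Exec are a small, fixed set of ssmmessages actions. A sketch of adding them to an existing task-role policy document; the helper names are hypothetical.

```python
import copy


def ecs_exec_statement() -> dict:
    """IAM statement with the ssmmessages actions ECS Exec needs on the task role."""
    return {
        "Effect": "Allow",
        "Action": [
            "ssmmessages:CreateControlChannel",
            "ssmmessages:CreateDataChannel",
            "ssmmessages:OpenControlChannel",
            "ssmmessages:OpenDataChannel",
        ],
        "Resource": "*",
    }


def with_ecs_exec(task_role_policy: dict) -> dict:
    """Return a copy of the policy with the ECS Exec statement appended."""
    updated = copy.deepcopy(task_role_policy)
    updated.setdefault("Statement", []).append(ecs_exec_statement())
    return updated


if __name__ == "__main__":
    base = {"Version": "2012-10-17", "Statement": []}
    print(with_ecs_exec(base))
```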
Service Connect vs Service Discovery
Service Connect (newer, preferred)
- Built-in service mesh; tasks find each other by name
- Handles retries, circuit breaking, metrics
- Define a namespace; services register automatically

Cloud Map / Service Discovery
- DNS-based; tasks register A records
- Simpler; no proxy overhead
- Good for internal service-to-service when you don’t need observability features
Cost Considerations
Fargate pricing (ap-southeast-2, approximate):
- 0.25 vCPU / 512MB ≈ $10–12/month per task (always-on)
- 0.5 vCPU / 1GB ≈ $20–25/month per task
- Fargate Spot: up to 70% cheaper; tasks can be interrupted; fine for Celery workers, not web
Use Spot for workers, on-demand for Django web service.
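The monthly figures above come from simple arithmetic on per-hour vCPU and memory rates. In the sketch below the two rates are assumptions chosen to be in the right ballpark for ap-southeast-2, not authoritative prices, and the Spot discount is a parameter rather than a fixed value; check the Fargate pricing page before budgeting.

```python
HOURS_PER_MONTH = 730

# ASSUMED on-demand rates in USD; replace with the current values for your region.
VCPU_HOUR_USD = 0.04856
GB_HOUR_USD = 0.00532


def monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1,
                 spot_discount: float = 0.0) -> float:
    """Approximate always-on monthly cost in USD for `tasks` identical tasks."""
    hourly = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return round(hourly * HOURS_PER_MONTH * tasks * (1 - spot_discount), 2)


if __name__ == "__main__":
    print(monthly_cost(0.25, 0.5))                           # smallest size, one task
    print(monthly_cost(0.5, 1.0, tasks=2))                   # two Django web tasks
    print(monthly_cost(0.5, 1.0, tasks=2, spot_discount=0.7))  # workers at the best-case Spot discount
```

With these assumed rates, the two sizes quoted above land near the low end of the stated ranges.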
Gotchas
- Using the “latest” image tag in prod: don’t. Use immutable SHA or semver tags; ECS won’t re-pull latest on redeploy unless a new deployment is forced
- Task stops with exit code 1 immediately: check CloudWatch logs; ECS will loop-restart and you’ll burn through your deployment health check window
- Setting desired_count to 0 is a valid way to “pause” a service without destroying it (e.g. beat in non-prod)
- Health check grace period: set health_check_grace_period_seconds on the service (e.g. 120s) for Django startup time; otherwise tasks are replaced for failing ALB health checks before they’re ready
- Secrets at task start only: secrets injected via secrets in the task def are fetched once at task start; rotating a secret requires task replacement
- Container dependency ordering: use dependsOn in the task def if you have sidecars (e.g. wait for datadog-agent to be healthy before starting app)