The Problem

I wanted to run a personal AI agent (OpenClaw) around the clock. It's a Telegram bot backed by an LLM that summarizes articles, manages reminders, and does a bunch of other tasks I'd rather not do manually. Requirements: a persistent server, outbound internet access for API calls, and HTTPS webhook ingress.

The textbook AWS setup is an EC2 instance in a private subnet behind an Application Load Balancer. Problem: instances in a private subnet can't reach the internet without a NAT Gateway, and a NAT Gateway costs $32/month before you send a single byte. For a personal project, that's more than the compute itself.

I needed outbound internet (the agent calls OpenAI, fetches web pages, sends Telegram messages) without paying $32/month for the privilege. Here's what I ended up doing.

Architecture Overview

Nothing fancy here. An EC2 t3a.large runs in a private subnet. An Application Load Balancer (ALB) in the public subnet handles HTTPS termination and forwards traffic to the instance. nginx on the instance reverse-proxies Telegram webhooks to the agent process. For outbound traffic, a tiny fck-nat instance replaces the NAT Gateway.

Telegram → ALB (HTTPS) → nginx → OpenClaw agent → fck-nat instance → Internet (OpenAI, web, etc.)

The ALB lives in two public subnets (AWS requires this). The EC2 instance sits in the private subnet, fck-nat in the public one. Route53 points the domain at the ALB, ACM provides the TLS certificate. Everything is Terraform, so one terraform apply and you're running.

The NAT Gateway Problem

AWS NAT Gateway is a managed service that lets instances in private subnets reach the internet. Reliable, auto-scaling, zero maintenance. Also absurdly expensive for small workloads.

You pay $0.045/hour just for the gateway to exist (~$32/month), plus $0.045/GB for data processing. If you're running a production microservices cluster pushing terabytes, sure, that's fine. But for a personal AI agent making a few hundred API calls a day? The gateway alone would be roughly 60% of the total infrastructure bill.

I looked at the alternatives. NAT instances are the old-school approach but manual and annoying to maintain. VPC endpoints only cover AWS services, not the open internet. IPv6 egress-only gateways don't help when upstream APIs are IPv4-only. Then I found fck-nat.

The fck-nat Solution

fck-nat is an open-source project that replaces the AWS NAT Gateway with a single EC2 instance running iptables NAT. It ships as a stripped-down AMI based on Amazon Linux with just enough to forward packets. The name tells you everything about how the community feels about NAT Gateway pricing.

The setup is straightforward: launch a t4g.nano (ARM-based, the cheapest EC2 instance type) in a public subnet, assign it an Elastic IP, and point the private subnet's route table at it. The instance does nothing but forward packets. Cost: ~$3/month instead of $32.

Approach              Monthly cost        Managed?
AWS NAT Gateway       ~$32 + data fees    Yes
fck-nat (t4g.nano)    ~$3                 No (self-hosted)

The trade-off? No auto-recovery, no scaling, single point of failure. If the fck-nat instance dies, outbound internet stops until you reboot it. For a production e-commerce site, that's a non-starter. For my personal AI agent, I genuinely don't care. If it goes down at 3 AM, the bot doesn't reply for a few minutes. Nobody gets paged.
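If that single point of failure still bothers you, a CloudWatch alarm can auto-recover the instance when the underlying host fails its system status check. A minimal sketch; the alarm name is made up, and it assumes the fck-nat module exposes an instance_id output (it may not if you run the module's HA mode, which uses an Auto Scaling group instead of a single instance):

```hcl
# Recover the fck-nat instance automatically on a failed system
# status check. Names here are hypothetical; assumes a single
# instance (not the module's ASG-based HA mode).
resource "aws_cloudwatch_metric_alarm" "fck_nat_recover" {
  alarm_name          = "fck-nat-auto-recover"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"

  dimensions = {
    InstanceId = module.fck_nat.instance_id
  }

  # The EC2 recover action restarts the instance on healthy hardware,
  # keeping its ENI (and therefore the private route) intact.
  alarm_actions = ["arn:aws:automate:eu-central-1:ec2:recover"]
}
```

That still doesn't get you zero downtime, but it turns "reboot it yourself at 3 AM" into "AWS reboots it for you."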

Terraform Setup

This section shows key Terraform resources. Values are anonymized. The full setup is based on the loicgasser/dumpling Terraform configuration.

VPC & Subnets

The VPC has two public subnets (for the ALB and fck-nat) and one private subnet (for the agent EC2 instance). The private subnet's route table sends 0.0.0.0/0 traffic through the fck-nat instance instead of a NAT Gateway.

vpc.tf

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = { Name = "agent-vpc" }
}

resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "eu-central-1a"
  map_public_ip_on_launch = true

  tags = { Name = "public-a" }
}

resource "aws_subnet" "public_b" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "eu-central-1b"
  map_public_ip_on_launch = true

  tags = { Name = "public-b" }
}

resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.3.0/24"
  availability_zone = "eu-central-1a"

  tags = { Name = "private" }
}

# Private subnet routes through fck-nat, not a NAT Gateway
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block           = "0.0.0.0/0"
    network_interface_id = module.fck_nat.eni_id
  }

  tags = { Name = "private-rt" }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

fck-nat Instance

The fck-nat module launches a t4g.nano in the public subnet. The update_route_table option automatically points the private route table at this instance on boot, which helps with recovery if the instance is replaced.

nat.tf

module "fck_nat" {
  source = "RaJiska/fck-nat/aws"

  name      = "agent-fck-nat"
  vpc_id    = aws_vpc.main.id
  subnet_id = aws_subnet.public_a.id

  instance_type = "t4g.nano"

  update_route_table = true
  route_table_id     = aws_route_table.private.id
}

Application Load Balancer

The ALB sits in the public subnets and terminates TLS, forwarding plain HTTP to the agent instance on port 80, where nginx listens. A health check on /health ensures the target group only routes to healthy instances.

alb.tf

resource "aws_lb" "agent" {
  name               = "agent-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = [aws_subnet.public_a.id, aws_subnet.public_b.id]

  tags = { Name = "agent-alb" }
}

resource "aws_lb_target_group" "agent" {
  name     = "agent-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.agent.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate_validation.cert.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.agent.arn
  }
}

resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.agent.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

EC2 Instance & Security Groups

The agent runs on a t3a.large (2 vCPU, 8 GB RAM), which handles the agent process and occasional heavy LLM-related workloads without breaking a sweat. The security group only allows inbound traffic from the ALB on port 80. No SSH from the internet. I use AWS Systems Manager Session Manager for shell access instead, which requires no open ports at all.

ec2.tf

resource "aws_security_group" "alb" {
  name_prefix = "agent-alb-"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "agent" {
  name_prefix = "agent-ec2-"
  vpc_id      = aws_vpc.main.id

  # Only allow traffic from the ALB
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "agent" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3a.large"
  subnet_id              = aws_subnet.private.id
  vpc_security_group_ids = [aws_security_group.agent.id]
  iam_instance_profile   = aws_iam_instance_profile.ssm.name

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
  }

  tags = { Name = "openclaw-agent" }
}
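The aws_iam_instance_profile.ssm referenced above isn't shown. A minimal version, assuming the stock AmazonSSMManagedInstanceCore managed policy is all the agent needs to start with (role and profile names are placeholders):

```hcl
# IAM role the agent instance assumes, with just enough for
# SSM Session Manager shell access (no open SSH ports needed).
resource "aws_iam_role" "ssm" {
  name = "agent-ssm-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ssm" {
  name = "agent-ssm-profile"
  role = aws_iam_role.ssm.name
}
```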

Route53 & ACM

Route53 points the domain at the ALB, and ACM provides a free TLS certificate with automatic DNS validation, the same pattern I use for my static site setup.

dns.tf

resource "aws_acm_certificate" "cert" {
  domain_name       = "agent.example.ch"
  validation_method = "DNS"
}

resource "aws_route53_record" "agent" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "agent.example.ch"
  type    = "A"

  alias {
    name                   = aws_lb.agent.dns_name
    zone_id                = aws_lb.agent.zone_id
    evaluate_target_health = true
  }
}
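The HTTPS listener references aws_acm_certificate_validation.cert, which isn't shown above. The standard DNS-validation wiring looks roughly like this:

```hcl
# Create the DNS records ACM asks for, then wait for validation
# to complete before the listener can use the certificate.
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.cert.domain_validation_options :
    dvo.domain_name => {
      name   = dvo.resource_record_name
      type   = dvo.resource_record_type
      record = dvo.resource_record_value
    }
  }

  zone_id = data.aws_route53_zone.main.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = 60
  records = [each.value.record]
}

resource "aws_acm_certificate_validation" "cert" {
  certificate_arn         = aws_acm_certificate.cert.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
```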

Cost Breakdown

Here's what it actually costs per month. The EC2 price assumes a 3-year Reserved Instance (all upfront), which drops the t3a.large from ~$55/month on-demand to ~$15/month. If you know you'll run the server for years, the commitment pays for itself quickly.

Resource                     Monthly cost
EC2 t3a.large (3yr RI)       ~$15
fck-nat t4g.nano             ~$3
Application Load Balancer    ~$5
Route53                      ~$1
Total                        ~$24/month

With a NAT Gateway instead of fck-nat, the total jumps to ~$53/month. More than double. The fck-nat swap saves $29/month, or $348/year. For a side project, that adds up.

Security Considerations

The agent instance has no public IP address. You can't reach it from the internet directly, period. All inbound traffic goes through the ALB, which only forwards HTTP on port 80 to the target group. Small attack surface.

What I Learned

NAT is the hidden cost killer on AWS. It doesn't show up on the EC2 pricing page. Most "getting started" tutorials never mention it. But the moment you put an instance in a private subnet and need outbound internet, there's that $32/month line item. For personal and dev workloads, just use fck-nat.

Don't sleep on Reserved Instances. If you know a server will run for 1–3 years, the math is obvious. My t3a.large drops from $55 to $15/month with a 3-year commitment. 73% off for a server I use every single day.

Having everything in Terraform is what makes this whole setup disposable. VPC, subnets, ALB, EC2, DNS, certificates, all in code. Need to rebuild it? Move to another region? Spin up a second agent? One terraform apply. No clicking through the AWS console, no forgotten security group rules.

The best infrastructure is the one you can delete and recreate without thinking about it.

PR-Based Access: The Agent as a Teammate

Here's something I didn't expect when I started this project: once the agent lives on an EC2 instance with an IAM role, it can request its own permissions through pull requests. Need it to deploy a website? It opens a PR adding S3 and CloudFront permissions to the instance role's Terraform. You review, merge, terraform apply, and it's done.

This replaces the typical CI/CD pipeline entirely. No GitHub Actions to maintain. No AWS secrets stored in GitHub repo settings. No deploy pipeline that breaks at 2 AM. The agent already has the right IAM role — it just needs the right policy attached. And every permission change is a git commit you can audit.

The pattern scales naturally. Another S3 bucket? Another PR. SES for sending emails? Same thing. You stay in control through Terraform and PR reviews. The agent gets to do useful work without you sitting at the keyboard. One message, one PR, one review — that's the whole deploy workflow.
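To make this concrete, a PR of this kind might add something like the following to the instance role's Terraform. Everything here is hypothetical: the bucket name, the policy name, and the assumption that the agent's role is the aws_iam_role.ssm behind the instance profile:

```hcl
# Hypothetical permission bump the agent might request via PR:
# read/write access to one S3 bucket for a static site deploy.
resource "aws_iam_role_policy" "agent_site_deploy" {
  name = "agent-site-deploy"
  role = aws_iam_role.ssm.name # the agent instance's role (name assumed)

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DeployStaticSite"
      Effect = "Allow"
      Action = ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"]
      Resource = [
        "arn:aws:s3:::example-static-site",
        "arn:aws:s3:::example-static-site/*"
      ]
    }]
  })
}
```

One permission, one commit, one review trail.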