Preventing Cloud Bankruptcy: Anomaly Detection and Automatic Resource Shutdown with AWS Budgets × Lambda

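Introduction: How cloud bankruptcy happens

"I opened the invoice at month-end and it was ten times what I expected." Cloud bankruptcy tends to follow a small set of recurring failure patterns.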

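Bankruptcy patterns that occur in practice:

  Pattern A: NAT Gateway data processing charges
    All traffic from dev EC2 instances to S3 and ECR goes through a NAT Gateway
    NAT Gateway: $0.045/GB + $0.045/hour
    → In a dev environment that pulls Docker images often, this swells to $200-$500 per month

  Pattern B: Traffic spikes from DDoS attacks and crawlers
    Even behind CloudFront, a flood of requests hits the Lambda invocation count directly
    API Gateway + Lambda: billed per request
    → A malicious crawler drives 1 million Lambda invocations, $800 for the month
      (the actual total depends heavily on function memory and duration)

  Pattern C: Forgotten development resources
    A test RDS instance (db.t3.medium) is never stopped
    → $50-$150 in one month, $300-$900 over six months

  Pattern D: Accidentally public S3 bucket
    Large files end up in a publicly readable bucket
    External GET requests make data transfer explode
    → Tens of thousands of yen in transfer charges within a few days

Reacting after any of these has happened is too late. A mechanism that detects anomalies and stops resources automatically is a prerequisite for using the cloud safely.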
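The arithmetic behind pattern A can be sketched as follows. This is a minimal estimate that uses only the two rates quoted above; the 730-hour month and the function name `nat_gateway_monthly_cost` are assumptions made for illustration, and real rates vary by region.

```python
# Rates quoted in pattern A; check the current price list for your region.
HOURLY_RATE_USD = 0.045   # per NAT Gateway hour
PER_GB_RATE_USD = 0.045   # per GB processed


def nat_gateway_monthly_cost(gb_processed: float, hours: float = 730.0) -> float:
    """Estimate one NAT Gateway's monthly cost in USD (hypothetical helper)."""
    return round(hours * HOURLY_RATE_USD + gb_processed * PER_GB_RATE_USD, 2)


if __name__ == "__main__":
    # Print the estimate for a few monthly traffic volumes.
    for gb in (500, 4000, 10000):
        print(f"{gb:>6} GB/month -> ${nat_gateway_monthly_cost(gb):,.2f}")
```

Under these assumptions the fixed hourly charge is only about $33 per month, so the $200-$500 range corresponds to roughly 4-10 TB of traffic; the per-GB charge is what dominates.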

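Part 1: Overall defense architecture

The design does not rely on a single measure; it has three layers: detection, notification, and automatic shutdown.

The three defense layers:

  Layer 1: Detection (Budget alerts)
  ─────────────────────────────────
  Monitor spend against the monthly budget
  Alert at three stages: 80% / 100% / forecast overrun
  → Send notifications to an SNS topic

  Layer 2: Notification (SNS → email / Slack)
  ─────────────────────────────────
  Email notification through an SNS subscription
  Optional: a Lambda forwards to a Slack webhook

  Layer 3: Automatic shutdown (SNS → Lambda)
  ─────────────────────────────────
  The Budget's over-100% notification triggers the Lambda
  Resources tagged `auto-stop: enabled` are stopped automatically
  Resources tagged `protect: true` are never stopped
  → ECS Fargate: set desired count to 0
  → EC2: stop-instances
  → RDS: stop-db-instance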
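The optional Slack forwarding in Layer 2 could look roughly like the sketch below. It is not part of the original module: the environment variable `SLACK_WEBHOOK_URL` and the helper `build_slack_payload` are hypothetical names, and the code assumes a standard Slack incoming webhook that accepts a JSON body with a `text` field.

```python
import json
import os
import urllib.request


def build_slack_payload(sns_event: dict) -> dict:
    """Turn the first SNS record of a Lambda event into a Slack message body."""
    record = sns_event["Records"][0]["Sns"]
    # Subject can be missing or null in SNS records, so fall back to a default.
    subject = record.get("Subject") or "AWS cost alert"
    message = record.get("Message", "")
    return {"text": f"*{subject}*\n{message}"}


def lambda_handler(event, context):
    # SLACK_WEBHOOK_URL is an assumed environment variable, not defined in the article.
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    body = json.dumps(build_slack_payload(event)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"statusCode": response.status}
```

Using only the standard library keeps the deployment package to a single file, the same packaging approach the auto-stop function uses.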

Part 2: Building all resources with Terraform

SNS topic and Budget configuration

# modules/cost-guard/main.tf

variable "monthly_budget_usd" {
  type        = number
  description = "Monthly budget limit (USD)"
}

variable "alert_email" {
  type        = string
  description = "Email address that receives alert notifications"
}

variable "project_name" {
  type = string
}

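# SNS topic (notification hub)
resource "aws_sns_topic" "cost_alert" {
  name = "${var.project_name}-cost-alert"
}

# Allow AWS Budgets and Cost Anomaly Detection to publish to the topic;
# without this policy their notifications are not delivered
data "aws_iam_policy_document" "cost_alert_topic" {
  statement {
    effect    = "Allow"
    actions   = ["SNS:Publish"]
    resources = [aws_sns_topic.cost_alert.arn]

    principals {
      type        = "Service"
      identifiers = ["budgets.amazonaws.com", "costalerts.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "cost_alert" {
  arn    = aws_sns_topic.cost_alert.arn
  policy = data.aws_iam_policy_document.cost_alert_topic.json
}

# Email notification subscription
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.cost_alert.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Lambda subscription (for automatic shutdown)
resource "aws_sns_topic_subscription" "auto_stop_lambda" {
  topic_arn = aws_sns_topic.cost_alert.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.auto_stop.arn
}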

# Budget configuration
resource "aws_budgets_budget" "monthly" {
  name         = "${var.project_name}-monthly"
  budget_type  = "COST"
  limit_amount = tostring(var.monthly_budget_usd)
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # 80% reached: warning email
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alert.arn]
  }

  # 100% reached: auto-stop trigger
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alert.arn]
  }

  # Forecast exceeds 110%: act before the overrun happens
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 110
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alert.arn]
  }
}

IAM role for the Lambda function

data "aws_iam_policy_document" "auto_stop_policy" {
  # ECS: permission to set the task count to 0
  statement {
    effect    = "Allow"
    actions   = ["ecs:ListClusters", "ecs:ListServices", "ecs:DescribeServices", "ecs:UpdateService"]
    resources = ["*"]
  }

  # EC2: stop instances (not terminate)
  statement {
    effect    = "Allow"
    actions   = [
      "ec2:DescribeInstances",
      "ec2:StopInstances"
      # "ec2:TerminateInstances" is deliberately excluded (stop only, never delete)
    ]
    resources = ["*"]
  }

  # RDS: stop instances
  statement {
    effect    = "Allow"
    actions   = [
      "rds:DescribeDBInstances",
      "rds:ListTagsForResource",  # the Lambda reads tags through this API
      "rds:StopDBInstance"
    ]
    resources = ["*"]
  }

  # Write access to CloudWatch Logs
  statement {
    effect    = "Allow"
    actions   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["arn:aws:logs:*:*:*"]
  }
}

resource "aws_iam_role" "auto_stop_lambda" {
  name = "${var.project_name}-auto-stop-lambda"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "auto_stop" {
  name   = "auto-stop-policy"
  role   = aws_iam_role.auto_stop_lambda.id
  policy = data.aws_iam_policy_document.auto_stop_policy.json
}

Deploying the Lambda function

data "archive_file" "auto_stop" {
  type        = "zip"
  source_file = "${path.module}/lambda/auto_stop.py"
  output_path = "${path.module}/lambda/auto_stop.zip"
}

resource "aws_lambda_function" "auto_stop" {
  filename         = data.archive_file.auto_stop.output_path
  source_code_hash = data.archive_file.auto_stop.output_base64sha256
  function_name    = "${var.project_name}-auto-stop"
  role             = aws_iam_role.auto_stop_lambda.arn
  handler          = "auto_stop.lambda_handler"
  runtime          = "python3.12"
  timeout          = 60

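  environment {
    variables = {
      # "true" only logs the targets and stops nothing; switch to "false" after verification
      DRY_RUN      = "true"
      PROJECT_NAME = var.project_name
      # AWS_REGION is a reserved key that the Lambda runtime sets itself, so it is not defined here
    }
  }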
}

# Permission for SNS to invoke the Lambda
resource "aws_lambda_permission" "sns_invoke" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.auto_stop.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.cost_alert.arn
}

Part 3: The Lambda auto-stop logic

A tag-based protection mechanism prevents production resources from being stopped by mistake.

# modules/cost-guard/lambda/auto_stop.py

import boto3
import json
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

AWS_REGION   = os.environ.get("AWS_REGION", "ap-northeast-1")
PROJECT_NAME = os.environ.get("PROJECT_NAME", "")
DRY_RUN      = os.environ.get("DRY_RUN", "true").lower() == "true"

ecs    = boto3.client("ecs",    region_name=AWS_REGION)
ec2    = boto3.client("ec2",    region_name=AWS_REGION)
rds    = boto3.client("rds",    region_name=AWS_REGION)


def is_protected(tags: list[dict]) -> bool:
    """Resources tagged protect=true are never stopped."""
    for tag in tags:
        if tag.get("Key") == "protect" and tag.get("Value", "").lower() == "true":
            return True
    return False


def should_auto_stop(tags: list[dict]) -> bool:
    """Only resources tagged auto-stop=enabled are stop targets."""
    for tag in tags:
        if tag.get("Key") == "auto-stop" and tag.get("Value", "").lower() == "enabled":
            return True
    return False


def stop_ecs_services():
    """Set the desired count of tagged ECS services to 0."""
    clusters = ecs.list_clusters()["clusterArns"]
    for cluster_arn in clusters:
        # describe_services accepts at most 10 services per call, so page in tens
        pages = ecs.get_paginator("list_services").paginate(
            cluster=cluster_arn, PaginationConfig={"PageSize": 10}
        )
        for page in pages:
            services = page["serviceArns"]
            if not services:
                continue
            # Tags are only returned when include=["TAGS"] is passed
            details = ecs.describe_services(
                cluster=cluster_arn, services=services, include=["TAGS"]
            )["services"]
            for svc in details:
                # ECS returns lowercase key/value; normalize to the Key/Value shape
                tags = [
                    {"Key": t.get("key"), "Value": t.get("value", "")}
                    for t in svc.get("tags", [])
                ]
                if is_protected(tags):
                    logger.info(f"PROTECTED (skip): ECS {svc['serviceName']}")
                    continue
                if not should_auto_stop(tags):
                    logger.info(f"NO auto-stop tag (skip): ECS {svc['serviceName']}")
                    continue
                if svc["desiredCount"] == 0:
                    logger.info(f"Already stopped: ECS {svc['serviceName']}")
                    continue
                logger.info(f"{'[DRY-RUN] ' if DRY_RUN else ''}Stopping ECS: {svc['serviceName']}")
                if not DRY_RUN:
                    ecs.update_service(
                        cluster=cluster_arn,
                        service=svc["serviceName"],
                        desiredCount=0
                    )


def stop_ec2_instances():
    """Stop running EC2 instances that are tagged for auto-stop."""
    instance_ids = []
    # Paginate so that accounts with many instances are covered completely
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = inst.get("Tags", [])
                if is_protected(tags):
                    logger.info(f"PROTECTED (skip): EC2 {inst['InstanceId']}")
                    continue
                if not should_auto_stop(tags):
                    logger.info(f"NO auto-stop tag (skip): EC2 {inst['InstanceId']}")
                    continue
                instance_ids.append(inst["InstanceId"])

    if instance_ids:
        logger.info(f"{'[DRY-RUN] ' if DRY_RUN else ''}Stopping EC2: {instance_ids}")
        if not DRY_RUN:
            ec2.stop_instances(InstanceIds=instance_ids)


def stop_rds_instances():
    """Stop available RDS instances that are tagged for auto-stop.

    Note: RDS restarts a stopped instance automatically after 7 days.
    """
    dbs = rds.describe_db_instances()["DBInstances"]
    for db in dbs:
        if db["DBInstanceStatus"] != "available":
            continue
        # Aurora cluster members cannot be stopped with stop_db_instance
        if db.get("DBClusterIdentifier"):
            logger.info(f"Cluster member (skip): RDS {db['DBInstanceIdentifier']}")
            continue
        # RDS tags are fetched through a separate API
        arn = db["DBInstanceArn"]
        tags = rds.list_tags_for_resource(ResourceName=arn).get("TagList", [])
        if is_protected(tags):
            logger.info(f"PROTECTED (skip): RDS {db['DBInstanceIdentifier']}")
            continue
        if not should_auto_stop(tags):
            logger.info(f"NO auto-stop tag (skip): RDS {db['DBInstanceIdentifier']}")
            continue
        logger.info(f"{'[DRY-RUN] ' if DRY_RUN else ''}Stopping RDS: {db['DBInstanceIdentifier']}")
        if not DRY_RUN:
            rds.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])


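def lambda_handler(event, context):
    logger.info(f"Triggered. DRY_RUN={DRY_RUN}")
    logger.info(f"Event: {json.dumps(event)}")

    # NOTE: every message published to the subscribed SNS topic invokes this
    # handler, including the 80% warning and the forecast alert, not only the
    # 100% notification. Confirm the overrun here, or give the Lambda its own
    # topic, before running with DRY_RUN=false.
    stop_ecs_services()
    stop_ec2_instances()
    stop_rds_instances()

    return {"statusCode": 200, "body": "Auto-stop completed"}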
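Because the handler runs for any message that reaches the topic, one way to make sure it only acts on a real overrun is to ask the Budgets API for the current figures before stopping anything. The sketch below is an assumption layered on top of the article's code: `budget_exceeded` is a hypothetical helper, the client is expected to be `boto3.client("budgets")`, and the role would additionally need `budgets:ViewBudget`, which the IAM policy in Part 2 does not grant.

```python
from decimal import Decimal


def budget_exceeded(budgets_client, account_id: str, budget_name: str) -> bool:
    """Return True when actual spend is above the budget limit.

    budgets_client is expected to behave like boto3.client("budgets");
    it is injected so the check can be exercised without AWS access.
    """
    budget = budgets_client.describe_budget(
        AccountId=account_id, BudgetName=budget_name
    )["Budget"]
    # The API returns amounts as strings, so compare them as decimals.
    limit = Decimal(budget["BudgetLimit"]["Amount"])
    actual = Decimal(budget["CalculatedSpend"]["ActualSpend"]["Amount"])
    return actual > limit
```

In the handler this could be used as an early return, for example `if not budget_exceeded(client, account_id, budget_name): return`, where the account ID and budget name would come from configuration such as environment variables.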

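Part 4: Managing resources through tag design

For the Lambda to behave correctly, the tagging rules must be consistent across resources. For resources managed with Terraform, apply the tags in bulk from module variables.

# Dev ECS service (stopped automatically when the budget is exceeded)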
resource "aws_ecs_service" "dev_api" {
  name          = "dev-api-service"
  cluster       = aws_ecs_cluster.dev.id
  desired_count = 1
  # ...

  tags = {
    Environment = "dev"
    auto-stop   = "enabled"   # ← target for automatic shutdown
    protect     = "false"
    ManagedBy   = "terraform"
  }
}

# Production ECS service (must never be stopped automatically)
resource "aws_ecs_service" "prd_api" {
  name          = "prd-api-service"
  cluster       = aws_ecs_cluster.prd.id
  desired_count = 2
  # ...

  tags = {
    Environment = "prd"
    auto-stop   = "disabled"  # ← not a target
    protect     = "true"      # ← second layer of protection
    ManagedBy   = "terraform"
  }
}

Decision flow for the tag-based protection logic:

  Resource found
      ↓
  protect=true ?
  ├── YES → skip (log and finish)
  └── NO  ↓
         auto-stop=enabled ?
         ├── NO  → skip (log and finish)
         └── YES → stop (with DRY_RUN=true nothing is actually stopped)
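The flow above can be condensed into a single pure function, which makes the precedence of `protect` over `auto-stop` easy to unit-test. This is an illustrative sketch, not the module's code: `tag_value` and `decide` are hypothetical names, and the lookup accepts both the `Key`/`Value` shape used by EC2 and RDS and the lowercase `key`/`value` shape that ECS returns.

```python
from typing import Optional


def tag_value(tags: list, key: str) -> Optional[str]:
    """Return the lowercased value of a tag, or None if the tag is absent."""
    for tag in tags:
        name = tag.get("Key", tag.get("key"))
        if name == key:
            return (tag.get("Value", tag.get("value")) or "").lower()
    return None


def decide(tags: list, dry_run: bool) -> str:
    """Map a resource's tags to one of four outcomes."""
    if tag_value(tags, "protect") == "true":
        return "skip-protected"      # protection always wins
    if tag_value(tags, "auto-stop") != "enabled":
        return "skip-untagged"       # opt-in only: untagged resources are left alone
    return "dry-run" if dry_run else "stop"
```

Checking `protect` first means a resource that carries both tags by mistake is still left running, which is the safer failure mode.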

Part 5: Verifying safely in dry-run mode

On the first deployment, always verify the behavior with DRY_RUN=true.

# Invoke the Lambda manually (passing a dummy SNS event)
aws lambda invoke \
  --function-name my-project-auto-stop \
  --payload '{
    "Records": [{
      "Sns": {
        "Message": "{\"AlarmName\": \"budget-threshold\", \"NewStateValue\": \"ALARM\"}"
      }
    }]
  }' \
  --cli-binary-format raw-in-base64-out \
  response.json

cat response.json
# {"statusCode": 200, "body": "Auto-stop completed"}

# Check in CloudWatch Logs which resources would be targeted
aws logs tail /aws/lambda/my-project-auto-stop --follow
# [DRY-RUN] Stopping ECS: dev-api-service
# PROTECTED (skip): ECS prd-api-service
# NO auto-stop tag (skip): EC2 i-0abc123def456

Once the dry run confirms that only the resources that should stop are targeted, change the setting to DRY_RUN=false and deploy.
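Reading the dry-run log by eye works for a handful of resources; for larger accounts the check can be scripted. The sketch below extracts the would-be stop targets from log lines in the format the handler writes. `stop_targets` is a hypothetical helper, and it assumes the log text is unchanged from the Lambda code in Part 3.

```python
import re

# Matches "Stopping ECS: name" with or without the "[DRY-RUN] " prefix.
STOP_LINE = re.compile(r"(?:\[DRY-RUN\] )?Stopping (ECS|EC2|RDS): (.+)$")


def stop_targets(log_lines):
    """Return (service, target) pairs for every line that announces a stop."""
    targets = []
    for line in log_lines:
        match = STOP_LINE.search(line)
        if match:
            targets.append((match.group(1), match.group(2).strip()))
    return targets


if __name__ == "__main__":
    sample = [
        "[DRY-RUN] Stopping ECS: dev-api-service",
        "PROTECTED (skip): ECS prd-api-service",
        "NO auto-stop tag (skip): EC2 i-0abc123def456",
    ]
    print(stop_targets(sample))
```

For EC2 the handler logs the whole list of instance IDs on one line, so the second element of the pair is that list's printed form rather than a single ID.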

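Part 6: Combining with Cost Anomaly Detection

A Budget works from a monthly budget figure, whereas AWS Cost Anomaly Detection looks at daily spend and flags amounts that deviate from past patterns. Combining the two catches anomalies without waiting for month-end.

# Cost Anomaly Detection configuration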
resource "aws_ce_anomaly_monitor" "service" {
  name         = "${var.project_name}-anomaly-monitor"
  monitor_type = "DIMENSIONAL"

  monitor_dimension = "SERVICE"  # detect anomalies per service
}

resource "aws_ce_anomaly_subscription" "alert" {
  name      = "${var.project_name}-anomaly-alert"
  frequency = "IMMEDIATE"  # SNS subscribers require IMMEDIATE; DAILY and WEEKLY are for email subscribers

  monitor_arn_list = [aws_ce_anomaly_monitor.service.arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_alert.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["10"]   # notify on anomalies with a total impact of $10 or more
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

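Budget vs Cost Anomaly Detection, and when to use each:

  Budget alerts:
    "Cumulative spend for the month has exceeded X% of the budget"
    → Use for forecasting the month-end bill and enforcing a ceiling

  Cost Anomaly Detection:
    "Yesterday's EC2 cost was $50 higher than the average of the previous 7 days"
    → Detects deviations from the usual pattern in near real time
    → Effective for catching DDoS, misconfiguration, and unintended scale-out early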
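The kind of comparison described for Cost Anomaly Detection can be illustrated with a deliberately simplified check. This is not how the service works internally (it uses its own models); the sketch only mirrors the "yesterday versus the previous 7-day average" example above, and `is_anomalous` with its parameters is a hypothetical name.

```python
from statistics import mean


def is_anomalous(daily_costs, threshold_usd=50.0, window=7):
    """Flag the latest day if it exceeds the trailing average by the threshold.

    daily_costs is a list of daily totals in USD, oldest first.
    """
    if len(daily_costs) < window + 1:
        return False  # not enough history to form a baseline
    *history, latest = daily_costs[-(window + 1):]
    return latest - mean(history) >= threshold_usd
```

A fixed-dollar rule like this breaks down when the baseline itself grows or follows a weekly cycle, which is the gap a pattern-learning service is meant to cover.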

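Conclusion: Cost control is something you enforce with mechanisms

Relying on a human process such as "check the invoice every month" to manage cloud costs is dangerous.

Problems with a human process:
  → The check gets forgotten
  → Anomalies occur during travel or long holidays
  → By the time someone notices, the amount is already large

Defense through mechanisms:
  → The Budget warns at 80% and triggers automatic shutdown at 100%
  → Anomaly Detection flags abnormal daily spending patterns
  → The Lambda checks tags and stops resources safely
  → Behavior has been verified beforehand with a dry run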

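Measure	Risk covered	Cost
Budget alerts (email)	Detecting a monthly limit overrun	Free
Budget → Lambda auto-stop	Runaway resources after the limit is exceeded	Lambda execution cost ($0.001 per month or less)
Cost Anomaly Detection	Abnormal daily spending patterns	Free (no monitoring charge)
Tag-based protection	Preventing accidental shutdown of production resources	Free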

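"Cloud bankruptcy" is a design problem, not a matter of luck. Expressing the defenses as code and managing them with Terraform also removes the human risk that someone forgets to configure them.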