Terraform State Management: Avoiding the Mistakes That Break Production | RelientOps

Why State Matters More Than Your Terraform Code

Your .tf files are declarative wishes. Terraform's state file is the record of what actually exists. When those two diverge — through a manual console change, a failed apply mid-run, or two engineers running terraform apply simultaneously — you get state drift, and the symptoms range from harmless drift warnings to deleted production databases.

Getting state right is the foundational prerequisite for everything else in an IaC workflow.

Use a Remote Backend From Day One

Never commit terraform.tfstate to git. It contains secrets in plaintext, blocks team collaboration, and creates a single point of failure.

S3 + DynamoDB remains the gold standard on AWS:

terraform {
  backend "s3" {
    bucket         = "acme-tf-state-prod"
    key            = "platform/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock"
  }
}

The DynamoDB table provides state locking: Terraform writes a lock entry before any mutating operation and removes it when done. A concurrent apply will fail immediately with a clear message — far better than silent corruption.

Key/path naming convention: {team}/{component}/terraform.tfstate scales to hundreds of state files without collisions.

Isolate State by Blast Radius

Don't store all your infrastructure in a single root module. The rule of thumb:

| Tier | Example | Change Frequency | |---|---|---| | Foundation | VPCs, IAM | Very rare | | Platform | EKS, RDS | Rare | | Application | Deployments, ALBs | Frequent |

Each tier gets its own state file. A botched app-tier apply can never touch your VPC. Use terraform_remote_state data sources to pass outputs across tiers:

# In platform tier — reads VPC outputs from foundation state
data "terraform_remote_state" "foundation" {
  backend = "s3"
  config = {
    bucket = "acme-tf-state-prod"
    key    = "foundation/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Then reference naturally:
# data.terraform_remote_state.foundation.outputs.private_subnet_ids

Workspace Strategy: Environment Isolation

Terraform workspaces let a single set of .tf files manage multiple environments with separate state files.

terraform workspace new staging
terraform workspace new production
terraform workspace select production

State is stored at {key}/env:/{workspace}/terraform.tfstate automatically. Use terraform.workspace in your code for environment-specific values:

locals {
  env_config = {
    staging    = { instance_type = "t3.small",  min_nodes = 1 }
    production = { instance_type = "m5.xlarge", min_nodes = 3 }
  }

  cfg = local.env_config[terraform.workspace]
}

resource "aws_instance" "app" {
  instance_type = local.cfg.instance_type
}

Caveat: workspaces share the same provider configuration. For truly separate AWS accounts (the recommended security posture), use separate directories or terragrunt with account-level state paths.

Protecting State: Versioning and Deletion Protection

Enable S3 versioning on your state bucket. When an apply corrupts state, rollback is a single aws s3api restore-object call away.

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket acme-tf-state-prod \
  --versioning-configuration Status=Enabled

# Prevent accidental deletion
aws s3api put-bucket-policy \
  --bucket acme-tf-state-prod \
  --policy file://deny-delete-policy.json

The deny-delete-policy.json should deny s3:DeleteBucket and s3:DeleteObject for everyone except a break-glass IAM role.

Handling State Drift

When CloudFormation, a teammate, or an SRE manually changes infrastructure, the state no longer matches reality. Detect drift before it bites you:

# Refresh state from real infrastructure (read-only, no apply)
terraform refresh

# See what Terraform would change if applied against current state
terraform plan -refresh-only

For deliberate out-of-band changes (e.g., manually scaling an ASG), import the resource rather than recreating it:

terraform import aws_autoscaling_group.app my-asg-name

CI/CD Integration: Locking Down Who Can Apply

In a team environment, terraform apply should only run from CI:

Plan on PR — post the plan output as a PR comment (atlantis, terraform-pr-commenter, or GitHub Actions)
Apply on merge to main — only after code review approval
Restrict AWS credentials — use OIDC federated identity so CI gets short-lived tokens; no long-lived access keys

# GitHub Actions OIDC example
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/TerraformCIRole
    aws-region: us-east-1

The TerraformCIRole IAM role should have the minimum permissions required and a trust policy that only allows GitHub Actions OIDC tokens from your repo.

Quick Checklist

[ ] Remote backend configured (S3 + DynamoDB or equivalent)
[ ] State bucket has versioning enabled
[ ] State bucket has delete-protection policy
[ ] State isolated per environment and per blast-radius tier
[ ] terraform apply gated behind CI merge (no local applies in prod)
[ ] OIDC federation for CI credentials (no long-lived access keys)
[ ] Drift detection in scheduled CI job (terraform plan -refresh-only)

State management isn't glamorous, but it's the infrastructure beneath your infrastructure. Get it wrong once in production and you won't forget it.