Infrastructure-as-code has become the standard approach for managing cloud resources. The specific tools vary (Terraform, Ansible, CloudFormation, Pulumi), but the patterns for maintainable infrastructure remain consistent.
State Management
IaC tools maintain state: a record of which resources exist and how they are configured. If that record is corrupted, the tool can no longer reliably map code to real resources, and recovery is slow and error-prone.
Remote state is essential. Local state gets lost, corrupted, or diverges between team members. Cloud backends solve this.
State locking prevents concurrent applies from corrupting state. Most remote backends support it, though not all do, so confirm that your backend does and enable it.
State isolation means separate state files for separate environments and components. A single state file for everything is a liability.
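As a rough illustration of isolation, a state path can be derived from the environment and component so that no two stacks ever share a file. The `state_key` helper, the path layout, and the component names below are hypothetical, not a convention of any particular tool.

```python
def state_key(environment: str, component: str) -> str:
    """Build a distinct state path for one environment/component pair.

    The "<env>/<component>/terraform.tfstate" layout is an assumed convention.
    """
    for name in (environment, component):
        # Reject empty names and path separators so keys cannot collide or nest.
        if not name or "/" in name:
            raise ValueError(f"invalid name: {name!r}")
    return f"{environment}/{component}/terraform.tfstate"


# Every environment/component pair gets its own state file.
keys = {
    state_key(env, comp)
    for env in ("dev", "staging", "prod")
    for comp in ("network", "database")
}
print(len(keys))  # 6
```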
Regular backups matter. Even with versioned backends, independent backups enable fast recovery.
Module Design
Good modules are the difference between infrastructure that scales and infrastructure that becomes unmaintainable.
Single responsibility means a module should do one thing. A VPC module creates a VPC and core components. It doesn't also create compute clusters.
Minimal interfaces mean exposing inputs that users need to customize while hiding implementation details.
Sensible defaults mean most inputs should have defaults for common cases. Required inputs should be truly required.
Version pinning means pinning module versions explicitly. Floating versions lead to surprise breakages.
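A pinning rule like this can be checked mechanically. The sketch below assumes version constraints have already been extracted into a plain mapping of module name to constraint string, and it treats only an exact three-part version (optionally prefixed with `=`) as pinned. Both the helper names and that definition of "pinned" are assumptions for illustration.

```python
import re

# Exact pin: three numeric components, optionally prefixed with "=" (assumed syntax).
_EXACT_VERSION = re.compile(r"^=?\s*\d+\.\d+\.\d+$")


def is_pinned(constraint: str) -> bool:
    """Return True only for an exact version such as "1.4.2" or "= 1.4.2"."""
    return bool(_EXACT_VERSION.match(constraint.strip()))


def floating_modules(constraints: dict[str, str]) -> list[str]:
    """List the modules whose constraint is a range, a prefix match, or missing."""
    return sorted(name for name, c in constraints.items() if not is_pinned(c))


# Hypothetical module names and constraints.
print(floating_modules({"vpc": "1.4.2", "eks": "~> 2.0", "rds": ">= 1.0.0"}))
# ['eks', 'rds']
```

A check like this could run in CI so that a floating constraint fails review rather than surprising someone at apply time.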
Directory Structure
Structure affects navigability and blast radius. A typical layout has modules as reusable components, environments for environment-specific configs like dev, staging, and prod, and global for shared resources.
Each environment directory should be independently applicable. Changes to dev shouldn't risk prod.
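One way to reason about blast radius under this layout is to map changed file paths to the environments they can touch. The sketch below assumes top-level directories named `modules`, `environments`, and `global`, and conservatively assumes that any change to shared code can reach every environment. With pinned module versions the real radius may be smaller.

```python
from pathlib import PurePosixPath


def affected_environments(
    changed_files: list[str],
    all_envs: tuple[str, ...] = ("dev", "staging", "prod"),
) -> set[str]:
    """Return the environments a set of changed paths could affect (assumed layout)."""
    affected: set[str] = set()
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if len(parts) >= 2 and parts[0] == "environments" and parts[1] in all_envs:
            # A change under environments/<env>/ is scoped to that environment.
            affected.add(parts[1])
        elif parts and parts[0] in ("modules", "global"):
            # Shared code: treat as the widest possible blast radius.
            affected.update(all_envs)
    return affected


print(affected_environments(["environments/dev/main.tf"]))  # {'dev'}
```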
Change Management
Infrastructure changes are inherently risky, so process matters.
Plan before apply means always running plan, reviewing output, then applying. PRs should show plan output before merge.
Small changes are easier to review and roll back than large changesets. Break big changes into smaller PRs.
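Staged rollouts mean applying to dev first and verifying, then staging and verifying, then prod. A failure at any stage stops the rollout before it reaches the next one.
Change windows mean scheduling risky changes during low-traffic periods, which limits user impact and leaves time for recovery.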
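The dev, staging, prod sequence can be sketched as a loop that stops at the first failed verification. The `apply` and `verify` callables are placeholders supplied by the caller; nothing here invokes a real tool.

```python
from typing import Callable, Iterable


def staged_rollout(
    stages: Iterable[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
) -> list[str]:
    """Apply to each stage in order and halt as soon as one fails verification."""
    completed: list[str] = []
    for stage in stages:
        apply(stage)
        if not verify(stage):
            # Later stages are never touched once a stage fails.
            break
        completed.append(stage)
    return completed


# Simulated run in which staging fails verification.
applied: list[str] = []
done = staged_rollout(
    ["dev", "staging", "prod"],
    apply=applied.append,
    verify=lambda stage: stage != "staging",
)
print(done)     # ['dev']
print(applied)  # ['dev', 'staging']
```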
Secrets
Secrets should never be committed to IaC repositories.
External secrets managers like AWS Secrets Manager, Vault, or GCP Secret Manager allow referencing secrets by path rather than value.
For CI/CD, inject secrets as environment variables and never echo them in logs.
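A minimal sketch of that pattern: read the secret from the environment, fail loudly if it is absent, and redact known secret values before anything is logged. The function names and the `DB_PASSWORD` variable in the usage note are hypothetical.

```python
import os


def read_secret(name: str) -> str:
    """Fetch a secret injected by the CI system as an environment variable."""
    value = os.environ.get(name)
    if not value:
        # The error names the variable but never includes a value.
        raise RuntimeError(f"missing required secret: {name}")
    return value


def redact(message: str, secrets: list[str]) -> str:
    """Replace any known secret value in a log message with a placeholder."""
    for secret in secrets:
        if secret:
            message = message.replace(secret, "***")
    return message
```

For example, after `password = read_secret("DB_PASSWORD")`, a log line would be passed through `redact(line, [password])` before being printed. Many CI systems also mask registered secrets on their own, but redacting in code does not depend on that.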
Automate secret rotation where possible and log secret access.
Testing
Infrastructure code can be tested, and at several levels.
Static analysis tools like tflint, checkov, and tfsec catch common mistakes before apply.
Plan validation asserts that plans contain expected changes and don't contain unexpected ones.
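Plan validation might look like the sketch below. It assumes a Terraform-style JSON plan (the kind produced by `terraform show -json` on a saved plan) with a `resource_changes` list whose entries carry an `address` and `change.actions`. The `check_plan` helper and the resource addresses are hypothetical.

```python
import json


def check_plan(plan_json: str, expected: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Compare planned actions to expectations.

    Returns (missing, unexpected): addresses whose expected actions are absent,
    and addresses with planned actions that were not expected.
    """
    plan = json.loads(plan_json)
    seen: dict[str, set[str]] = {}
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"]) - {"no-op"}
        if actions:
            seen[rc["address"]] = actions
    missing = sorted(a for a, acts in expected.items() if not acts <= seen.get(a, set()))
    unexpected = sorted(a for a, acts in seen.items() if not acts <= expected.get(a, set()))
    return missing, unexpected


# Hypothetical plan: one intended create, one surprise replacement, one no-op.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    ]
})
print(check_plan(sample_plan, {"aws_s3_bucket.logs": {"create"}}))
# ([], ['aws_db_instance.main'])
```

Here the database replacement is flagged because nothing in the expectations allowed it, which is exactly the kind of change a reviewer would want to see before apply.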
Integration tests spin up real infrastructure in isolated accounts, run tests, and tear down.
Compliance tests verify deployed infrastructure matches requirements.
Documentation
Infrastructure code is not self-documenting.
Each module needs a README explaining what it does, how to use it, and what variables mean.
Architecture decision records document significant decisions and rationale.
Runbooks explain how to perform common operations.
Diagrams provide visual architecture documentation that gets updated when architecture changes.
Drift Detection
Infrastructure drifts. Manual console changes happen. Automated processes modify resources.
Regular drift detection runs plan on a schedule and alerts on unexpected differences.
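For Terraform specifically, a scheduled check can lean on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below wraps that; the function names are hypothetical, and alerting is left to the caller.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == 0:
        return "in_sync"  # no differences between code and real infrastructure
    if code == 2:
        return "drift"    # the plan contains changes
    return "error"        # exit code 1, or anything else unexpected


def detect_drift(workdir: str) -> str:
    """Run plan in `workdir` and report its status (requires terraform on PATH)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduler would call `detect_drift` for each environment directory and alert on `"drift"` or `"error"`. Note that a pending but unapplied code change also shows up as a difference, so this works best against a branch that has already been applied.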
When drift is detected, decide which side is correct: either update the code to match the real infrastructure, or re-apply the code to revert the out-of-band change.
Policies that prevent console modifications to IaC-managed resources help prevent drift.
CI/CD Integration
Manual applies don't scale. The workflow should be: developer creates PR with infrastructure change, CI runs plan and posts output to PR, reviewer approves based on plan output, merge triggers apply to target environment, and post-apply tests verify success.
For production, manual approval of apply steps may be appropriate even after merge.
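The gating logic in this workflow is simple enough to state as code. The sketch below is a hypothetical model, not any CI system's API: apply requires a posted plan, an approval, and a merge, and protected environments additionally require a manual approval of the apply step.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical record of where a change stands in the pipeline."""
    environment: str
    plan_posted: bool
    approved: bool
    merged: bool
    manual_apply_approval: bool = False


def may_apply(change: ChangeRequest, protected: tuple[str, ...] = ("prod",)) -> bool:
    """Decide whether the pipeline is allowed to run apply for this change."""
    if not (change.plan_posted and change.approved and change.merged):
        return False
    if change.environment in protected:
        # Protected environments need an explicit go-ahead even after merge.
        return change.manual_apply_approval
    return True


print(may_apply(ChangeRequest("dev", True, True, True)))   # True
print(may_apply(ChangeRequest("prod", True, True, True)))  # False
```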
Conclusion
Infrastructure-as-code is a discipline, not just a tool. State management, modular design, change management, and testing matter more than specific tool choice. Get the patterns right and managing large infrastructure becomes tractable.