How to Recover a Failed Terraform Deployment Without Breaking Production
Terraform apply failed: what should you do?
Infrastructure automation using Terraform is powerful, but sometimes deployments fail in the middle of execution. When this happens, your infrastructure may be partially created, and Terraform state may become inconsistent with the actual cloud resources.
Many DevOps engineers panic at this point and try random fixes, which can damage production infrastructure.
In this guide, you will learn how to safely recover Terraform infrastructure when terraform apply fails halfway.
Problem:
A Terraform deployment stops in the middle while creating or updating infrastructure.
Some resources are created successfully while others fail.
Now you have a dangerous situation:
Terraform state ≠ Actual infrastructure
This means Terraform and the cloud provider no longer agree on the current infrastructure state.
This problem commonly occurs when working with cloud providers such as Azure, which the error examples below come from.
Example Error Messages
Typical Terraform failure messages look like this:
AuthorizationFailed — service principal missing role
The service principal running Terraform doesn't have permission to perform the ARM action. Most common when applying across subscription scopes or creating role assignments.
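As a rough sketch, the missing role could be granted with a role assignment like the one below. It has to be applied by an identity that is already allowed to assign roles, not by the failing service principal itself. The variable name terraform_sp_object_id and the choice of the built-in "Contributor" role are assumptions for illustration, and the azurerm provider block is omitted.

```hcl
# Hypothetical input: object ID of the service principal that runs Terraform.
variable "terraform_sp_object_id" {
  type        = string
  description = "Object ID of the service principal that runs terraform apply"
}

# The subscription that the current credentials are working in.
data "azurerm_subscription" "current" {}

# Grants the service principal a built-in role at subscription scope.
# "Contributor" is an assumption; it does not allow creating role
# assignments, which needs a role such as "User Access Administrator".
resource "azurerm_role_assignment" "terraform_sp" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Contributor"
  principal_id         = var.terraform_sp_object_id
}
```

Scoping the assignment to a single resource group instead of the whole subscription is usually the safer choice when the deployment allows it.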
OperationNotAllowed — vCPU quota exceeded
Your Azure subscription has hit its regional vCPU limit. This fires mid-apply when creating VMs, AKS node pools, or VMSS. Resources created before this point remain live.
ParentResourceNotFound — missing depends_on
Terraform tried to create a child resource (SQL database, subnet, diagnostic setting) before its parent finished provisioning, so Azure's ARM API returns a 404 for the parent reference.
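A minimal sketch of how the parent and child ordering can be made explicit, assuming a virtual network and a subnet as the parent and child pair. All resource names, the region, and the address ranges are hypothetical, and the azurerm provider block is omitted.

```hcl
resource "azurerm_resource_group" "main" {
  name     = "rg-example"
  location = "westeurope"
}

# Parent resource.
resource "azurerm_virtual_network" "main" {
  name                = "vnet-example"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  address_space       = ["10.0.0.0/16"]
}

# Child resource.
resource "azurerm_subnet" "app" {
  name                = "snet-app"
  resource_group_name = azurerm_resource_group.main.name

  # Referencing the parent's attribute instead of the literal string
  # "vnet-example" gives Terraform an implicit dependency, so the
  # subnet is only created after the virtual network exists.
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]

  # Explicit alternative for cases where no attribute reference exists:
  # depends_on = [azurerm_virtual_network.main]
}
```

An attribute reference is generally preferable to depends_on, because Terraform derives the ordering from the reference itself and the dependency cannot silently go stale.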
Or Terraform may simply stop with:
Error: failed to create resource
When this happens, the infrastructure might be partially created.
Why Do Partial Failures Happen?
Terraform does not run an apply as a single transaction. There is no:
Undo
Rollback
Atomic execution
Terraform follows a simple philosophy: