Sunday, 15 March 2026

How to Recover a Failed Terraform Deployment Without Breaking Production




Terraform apply failed: what should you do?

Infrastructure automation with Terraform is powerful, but deployments sometimes fail in the middle of execution. When this happens, your infrastructure may be only partially created, and the Terraform state may no longer match the actual cloud resources.

Many DevOps engineers panic at this point and try random fixes, which can damage production infrastructure.

In this guide, you will learn how to safely recover Terraform infrastructure when terraform apply fails halfway.



Problem:

A Terraform deployment stops in the middle while creating or updating infrastructure.

Some resources are created successfully while others fail.

Now you have a dangerous situation:

Terraform state ≠ Actual infrastructure

This means Terraform and the cloud provider no longer agree on the current infrastructure state.
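To make the mismatch concrete, here is a minimal Python sketch that compares the resource addresses Terraform has recorded with the resources that actually exist. Both lists are made-up illustrative data. In practice the first could come from terraform state list and the second from your cloud provider's inventory. The function name diff_state and all resource names are hypothetical.

```python
def diff_state(in_state, in_cloud):
    """Compare what Terraform tracks with what actually exists."""
    state, cloud = set(in_state), set(in_cloud)
    return {
        # Exists in the cloud, but Terraform does not know about it.
        "untracked": sorted(cloud - state),
        # Recorded in state, but missing from the cloud.
        "missing": sorted(state - cloud),
        # Both sides agree on these.
        "in_sync": sorted(state & cloud),
    }


# Hypothetical example data for a half-finished apply.
tracked = ["azurerm_resource_group.rg", "azurerm_virtual_network.vnet"]
actual = [
    "azurerm_resource_group.rg",
    "azurerm_virtual_network.vnet",
    "azurerm_subnet.app",  # assumed: created in the cloud, never recorded in state
]

report = diff_state(tracked, actual)
print("untracked:", report["untracked"])
print("missing:", report["missing"])
```

Anything listed under "untracked" or "missing" is a place where Terraform and the cloud provider disagree and needs attention before the next apply.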

This problem commonly occurs when working with cloud providers. The examples in this guide come from Azure.

Example Error Messages

Typical Terraform failure messages look like this:

AuthorizationFailed — service principal missing role

The service principal running Terraform doesn't have permission to perform the ARM action. Most common when applying across subscription scopes or creating role assignments.

OperationNotAllowed — vCPU quota exceeded

Your Azure subscription has hit its regional vCPU limit. This fires mid-apply when creating VMs, AKS node pools, or VMSS. Resources before this point remain live.

ParentResourceNotFound — missing depends_on

A child resource (SQL database, subnet, diagnostic setting) was deployed before its parent finished provisioning. Azure's ARM API returns 404 on the parent reference.  

Or Terraform may simply stop with:

Error: failed to create resource

When this happens, the infrastructure might be partially created.
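When triaging a failed run, it can help to scan the captured output for the error codes described above. The following is a small sketch of that idea, not a complete parser. The KNOWN_ERRORS table and the classify function are hypothetical names, and the sample string is made up for illustration rather than copied from real Terraform output.

```python
import re

# Error codes discussed above, mapped to a short hint taken from this article.
KNOWN_ERRORS = {
    "AuthorizationFailed": "service principal is missing a role",
    "OperationNotAllowed": "regional vCPU quota exceeded",
    "ParentResourceNotFound": "child deployed before its parent; check depends_on",
}


def classify(output):
    """Return (code, hint) pairs for every known error code found in the text."""
    found = []
    for code, hint in KNOWN_ERRORS.items():
        # Match the code as a whole word anywhere in the captured output.
        if re.search(rf"\b{code}\b", output):
            found.append((code, hint))
    return found


# Made-up sample text, not real Terraform output.
sample = 'Error: creating VM: Code="OperationNotAllowed" Message="quota exceeded"'
for code, hint in classify(sample):
    print(f"{code}: {hint}")
```

A generic message such as "Error: failed to create resource" matches none of the known codes, so classify returns an empty list and you will need to read the full output yourself.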

Why Partial Failures Happen

Terraform builds a dependency graph and walks it concurrently. When one node fails, Terraform stops scheduling new work but does not roll back the nodes that have already completed. In simple terms, Terraform is declarative, but it is not transactional.

There is no:

Undo

Rollback

Atomic execution
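The behavior described above can be illustrated with a toy model. This is a simplified sketch, not Terraform's actual scheduler: it processes each ready batch one node at a time instead of concurrently, and the resource names, the graph, and the apply function are all hypothetical. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def apply(graph, failing):
    """Walk the graph in dependency order; stop scheduling after a failure.

    graph maps each resource to the set of resources it depends on.
    Returns (created, failed, skipped). Nothing in `created` is undone.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    created, failed, seen = [], [], set()
    # Keep pulling ready batches until the graph is done or something failed.
    while sorter.is_active() and not failed:
        for node in sorted(sorter.get_ready()):
            seen.add(node)
            if node in failing:
                failed.append(node)  # never marked done, so dependents stay blocked
            else:
                created.append(node)
                sorter.done(node)
    # Every node in this example is a key of `graph`.
    skipped = sorted(set(graph) - seen)
    return created, failed, skipped


# Hypothetical deployment: the subnet fails halfway through.
graph = {
    "rg": set(),
    "vnet": {"rg"},
    "storage": {"rg"},
    "subnet": {"vnet"},
    "vm": {"subnet"},
}
created, failed, skipped = apply(graph, failing={"subnet"})
print("created:", created)
print("failed:", failed)
print("skipped:", skipped)
```

In this run the resource group, storage account, and virtual network stay live even though the subnet failed and the VM was never attempted. That leftover set is exactly the partial infrastructure you have to reconcile.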

Terraform follows a simple philosophy:

Author Details

Hi, I'm Prashant — a full-time software engineer with a passion for automation, DevOps, and sharing what I learn. I started Py-Bucket to document my journey through tools like Docker, Kubernetes, Azure DevOps, and PowerShell scripting — and to help others navigate the same path. When I’m not coding or writing, I’m experimenting with side projects, exploring productivity hacks, or learning how to build passive income streams online. This blog is my sandbox — and you're welcome to explore it with me. Get in touch or follow me for future updates!