Sunday, 15 March 2026

How to Recover a Failed Terraform Deployment Without Breaking Production




Terraform apply failed: what to do?

Infrastructure automation using Terraform is powerful, but sometimes deployments fail in the middle of execution. When this happens, your infrastructure may be partially created, and Terraform state may become inconsistent with the actual cloud resources.

Many DevOps engineers panic at this point and try random fixes, which can damage production infrastructure.

In this guide, you will learn how to safely recover Terraform infrastructure when terraform apply fails halfway.



Problem:

A Terraform deployment stops in the middle while creating or updating infrastructure.

Some resources are created successfully while others fail.

Now you have a dangerous situation:

Terraform state ≠ Actual infrastructure

This means Terraform and the cloud provider no longer agree on the current infrastructure state.

This problem can occur with any major cloud provider, including AWS, Azure, and GCP.

Example Error Message

Typical Terraform failure messages look like this (the examples below come from Azure):

AuthorizationFailed — service principal missing role

The service principal running Terraform doesn't have permission to perform the ARM action. Most common when applying across subscription scopes or creating role assignments.

OperationNotAllowed — vCPU quota exceeded

Your Azure subscription has hit its regional vCPU limit. This fires mid-apply when creating VMs, AKS node pools, or VMSS. Resources before this point remain live.

ParentResourceNotFound — missing depends_on

A child resource (SQL database, subnet, diagnostic setting) was deployed before its parent finished provisioning. Azure's ARM API returns 404 on the parent reference.  

Or Terraform may simply stop with:

Error: failed to create resource

When this happens, the infrastructure might be partially created.

Why Partial Failures Happen

Terraform builds a dependency graph and walks it concurrently. When one node fails, Terraform stops scheduling new work but does not roll back the nodes that already completed. In simple terms, Terraform is declarative, but it is not transactional.

There is no:

  • Undo
  • Rollback
  • Atomic execution

Terraform follows a simple philosophy:

Fix the problem
Reconcile state
Move forward

Common causes of half-failed deployments include:

  • Service principal lacks Microsoft.Authorization/roleAssignments/write
  • Subscription-level vCPU or networking quotas hit mid-apply
  • ARM race conditions — child resources start provisioning before parent is Succeeded
  • azurerm provider version mismatch with your Terraform version
  • Missing subscription_id in provider block (required since azurerm v4.0)
  • Azure AD replication lag — newly created service principal not yet visible to ARM
  • Resource group locked or in Deleting state from prior failed operation

Step-by-Step Fix (Safe Recovery Method)

Step 1 — Stop All Terraform Applies

First rule:

Never run terraform apply repeatedly while debugging.

Each extra run against an inconsistent state can create duplicate resources or trigger destructive changes.

Step 2 — Capture Evidence

Before making any changes, collect the failure details.

Check:

  • Full Terraform logs

  • The last resource Terraform attempted

  • Provider error messages

Example failure output:

Error creating IAM role
AccessDenied: user not authorized

This helps identify the root cause.
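The evidence-gathering step can be sketched as a small shell helper. This is a minimal sketch, not a definitive procedure: the function name capture_evidence and the default directory tf-evidence are made up for illustration, and it assumes you run it from the directory where the apply failed. It only calls read-only Terraform commands (terraform version, terraform providers, terraform state pull).

```shell
# Collect a read-only snapshot of the Terraform context after a failed apply.
# capture_evidence and the tf-evidence directory name are hypothetical.
capture_evidence() {
  out_dir="${1:-tf-evidence}"   # directory that will hold the evidence files
  mkdir -p "$out_dir" || return 1

  # Terraform and provider versions (useful when a version mismatch is suspected)
  terraform version > "$out_dir/version.txt" 2>&1
  terraform providers > "$out_dir/providers.txt" 2>&1

  # Raw state exactly as the backend holds it right now; this also serves as a backup
  terraform state pull > "$out_dir/state-backup.tfstate" || return 1

  echo "$out_dir"
}

# For any later plan, detailed logs can be written to a file with:
#   TF_LOG=DEBUG TF_LOG_PATH=./tf-evidence/plan-debug.log terraform plan
```

Keeping the state copy next to the version files means you can always see what Terraform believed at the moment of failure, even after later steps change the state.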

Step 3 — Inspect Terraform State

Now check what Terraform believes exists.

Run:

terraform state list

Inspect specific resources:

    terraform state show <resource_address>

Look for:

  • Missing attributes
  • Incomplete resource data
  • Unexpected instances from count or for_each (missing or extra indexes)

This step helps understand Terraform’s view of infrastructure.
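When the state holds many resources, it can help to inspect only the ones related to the failure. The sketch below combines the two commands above; the function name inspect_state is hypothetical, and the pattern is whatever matches the addresses you care about.

```shell
# Print the tracked attributes of every state entry whose address matches a pattern.
# inspect_state is a hypothetical helper name.
inspect_state() {
  # $1: extended regular expression matched against resource addresses
  terraform state list | grep -E -- "$1" | while IFS= read -r addr; do
    echo "=== $addr"
    # stdin is redirected so the command cannot consume the address list
    terraform state show "$addr" < /dev/null
  done
}
```

For example, inspect_state 'subnet' would show every tracked resource with "subnet" in its address.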

Step 4 — Inspect Actual Cloud Infrastructure

Now compare Terraform state with actual cloud resources.

Use:

  • AWS Console

  • Azure Portal

  • GCP Console

  • CLI tools

Check:

  • Does the resource exist?

  • Is it partially configured?

  • Are dependencies missing?

Now you have a State vs Reality comparison.
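One way to make the State vs Reality comparison concrete is to diff two plain lists of resource IDs. The sketch below assumes you have already saved one ID per line into two files, for example IDs copied from terraform state show output and IDs exported from your provider's CLI (on Azure, az resource list --resource-group <name> --query "[].id" -o tsv is one option). The function name compare_ids is hypothetical. The comparison is case-sensitive, which may matter for providers that report IDs in inconsistent casing.

```shell
# Compare resource IDs tracked in Terraform state with IDs reported by the cloud.
# Each input file holds one ID per line; order does not matter.
compare_ids() {
  state_file="$1"   # IDs Terraform tracks
  cloud_file="$2"   # IDs the cloud provider reports
  tmp_state=$(mktemp) || return 1
  tmp_cloud=$(mktemp) || return 1

  # comm requires sorted input; -u also drops duplicate lines
  sort -u "$state_file" > "$tmp_state"
  sort -u "$cloud_file" > "$tmp_cloud"

  echo "In state but not in cloud:"
  comm -23 "$tmp_state" "$tmp_cloud"
  echo "In cloud but not in state:"
  comm -13 "$tmp_state" "$tmp_cloud"

  rm -f "$tmp_state" "$tmp_cloud"
}
```

The first group maps to Case 2 in the next step and the second group to Case 1.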

Step 5 — Fix State Mismatches

Now resolve inconsistencies.

    Case 1 — Resource Exists in Cloud but Not in Terraform State

         Terraform will try to create it again, and the provider usually rejects the request because the resource already exists.

        Fix Option 1 — Import the Resource

        Fix Option 2 — Delete the Resource
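For the import option, the command is terraform import <resource_address> <resource_id>. The sketch below wraps it in a guard so that an address already present in state is not imported twice. The function name import_if_untracked and the example address are hypothetical; the ID format depends on the provider and resource type.

```shell
# Import an existing cloud resource into state unless the address is already tracked.
# import_if_untracked is a hypothetical helper name.
import_if_untracked() {
  addr="$1"   # Terraform resource address, e.g. azurerm_resource_group.main
  id="$2"     # provider-specific resource ID

  # -F: fixed string, -x: whole line, so "a.b" does not match "a.b2" or "a.b[0]"
  if terraform state list | grep -Fxq -- "$addr"; then
    echo "already tracked: $addr"
    return 0
  fi
  terraform import "$addr" "$id"
}
```

On Azure, a resource group call might look like import_if_untracked azurerm_resource_group.main "/subscriptions/<subscription-id>/resourceGroups/<name>", with the placeholders filled in from the portal or CLI.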

    Case 2 — Resource Exists in State but Not in Cloud

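        Terraform thinks the resource exists, but it doesn't.

        In many cases the next terraform plan refreshes state, notices that the resource is gone, and proposes to create it again. If the stale entry still causes errors, remove it from state:

        terraform state rm <resource_address>

        Terraform will recreate it during the next apply.

        Another option, useful when the resource exists but was left half-created, is to force recreation:

        terraform apply -replace=<resource_address>

        On Terraform older than 0.15.2, terraform taint <resource_address> has the same effect. It is deprecated in newer versions.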
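Because removing entries from state is hard to undo by hand, a cautious version of the state rm step might back up the state and preview the removal first. This is a sketch under assumptions: the function name safe_state_rm and the backup file name are hypothetical, while terraform state pull and the -dry-run option of terraform state rm are standard Terraform CLI features.

```shell
# Remove a stale entry from state, with a backup and a preview first.
# safe_state_rm and the default backup name are hypothetical.
safe_state_rm() {
  addr="$1"                                # address of the stale entry
  backup="${2:-state-before-rm.tfstate}"   # where to save the pre-change state

  # Keep a copy of the current state so the change can be reviewed or reverted
  terraform state pull > "$backup" || return 1

  # Preview what would be removed; stop if Terraform reports an error
  terraform state rm -dry-run "$addr" || return 1

  terraform state rm "$addr"
}
```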

    Case 3 — Resource Exists but Configuration Is Wrong

        Fix the Terraform configuration: update the .tf code so that it describes the intended settings, and let the next plan show the correction.

        Avoid manual changes in the cloud console unless it's an emergency.

Step 6 — Fix the Root Cause

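Before applying again, fix the underlying issue that caused the failure, for example by granting the missing role, requesting a quota increase, adding the missing depends_on, or pinning a compatible provider version. Otherwise the next apply will fail at the same point.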
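For the two most common Azure causes listed earlier (quota and missing role), the Azure CLI can confirm the diagnosis before you change anything. The sketch below is only an illustration: the function name check_azure_causes is hypothetical, and it assumes the az CLI is installed and logged in to the affected subscription.

```shell
# Print compute usage against limits for a region, plus the role assignments
# of the identity that runs Terraform. check_azure_causes is a hypothetical name.
check_azure_causes() {
  location="$1"   # Azure region, e.g. eastus
  assignee="$2"   # object ID or application ID of the service principal

  echo "--- compute usage vs limits in $location"
  az vm list-usage --location "$location" --output table

  echo "--- role assignments for $assignee"
  az role assignment list --assignee "$assignee" --all --output table
}
```

If the usage table shows a vCPU family at its limit, or the expected role is missing from the assignment list, you have found the root cause to fix before the next plan.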

Step 7 — Run a Safe Terraform Plan

Now check the proposed changes.

Run:

terraform plan

 

Look for dangerous actions like:

  • unexpected resource destruction

  • duplicate resources

  • incorrect replacements

If anything looks suspicious, fix the configuration before applying.
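The check for dangerous actions can be partly automated by saving the plan to a file and scanning its human-readable form. This is a rough sketch: the function name review_plan is hypothetical, and the scan relies on the phrases "will be destroyed" and "must be replaced" that Terraform prints in plan output, so treat it as a first filter and still read the full plan yourself.

```shell
# Save a plan to a file and flag destroy or replace actions in it.
# review_plan is a hypothetical helper name.
review_plan() {
  plan_file="${1:-tfplan}"   # where the saved plan is written

  terraform plan -out="$plan_file" > /dev/null || return 1

  # Render the saved plan without color codes and keep only the risky lines
  risky=$(terraform show -no-color "$plan_file" | grep -E 'will be destroyed|must be replaced' || true)

  if [ -n "$risky" ]; then
    echo "Risky actions found:"
    echo "$risky"
    return 2
  fi
  echo "No destroy or replace actions in plan"
}
```

Applying the saved file afterwards with terraform apply tfplan ensures that what runs is exactly what was reviewed.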

Essential Commands Quick Reference

Command                            When to Use
terraform state list               Get a full list of all resources Terraform is tracking
terraform state show <addr>        Inspect a specific resource's tracked attributes
terraform import <addr> <id>       Pull an existing cloud resource into state
terraform state rm <addr>          Remove a ghost resource from state without deleting it
terraform taint <addr>             Mark resource for forced recreation on next apply (deprecated since TF 0.15.2)
terraform apply -replace=<addr>    Force recreation (modern alternative to taint, TF ≥ 0.15.2)
terraform plan -target=<addr>      Scope plan/apply to one resource for surgical fixes
terraform state pull               Download and inspect raw state JSON for debugging
terraform refresh                  Sync state with real infrastructure (use carefully; newer versions prefer terraform apply -refresh-only)

Conclusion

A Terraform deployment failing halfway is not a disaster, but it requires careful recovery.

Remember the correct process:

1. Stop Terraform
2. Capture evidence
3. Inspect state
4. Inspect cloud infrastructure
5. Fix mismatches
6. Correct root cause
7. Run terraform plan
8. Apply safely

Terraform is designed to converge infrastructure to the desired state, not to roll back changes.

If you follow the recovery process described above, you can safely repair infrastructure without causing downtime or resource loss.


Terraform Recovery FAQ (Common DevOps Questions)

Can Terraform roll back an apply?

No. Terraform does not support automatic rollback. Resources created before the failure remain in place. Recovery requires manual state reconciliation using terraform import, terraform state rm, or terraform taint.

How do I recover Terraform state after accidental terraform destroy?

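A destroy removes the real resources, so restoring an old state file alone does not bring them back. Run terraform apply again to recreate them from your configuration. If the state file itself was lost or corrupted, restore the previous state version from your remote backend (recommended), for example through S3 bucket versioning or the state history in Terraform Cloud. For local state, restore from a backup. Then use terraform import to register any existing resources that the restored state does not track.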

What is terraform state rm and when should I use it?

terraform state rm removes a resource from Terraform's state file without deleting the actual cloud resource. Use it when a resource exists in state but has been deleted in the cloud, or when you want to stop managing a resource with Terraform.

How do I prevent Terraform deployment failures in production?

Use remote state with locking, run terraform plan in CI before every apply, reserve -target for exceptional surgical fixes rather than routine deployments, apply IAM least privilege to your Terraform role, and always test in a staging environment first.
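The "plan in CI" advice can be sketched with the -detailed-exitcode option, which makes terraform plan exit with 0 when there are no changes, 1 on error, and 2 when changes are pending. The function name ci_plan_gate and the log file name are hypothetical, and how your pipeline reacts to each case is up to you.

```shell
# Run a non-interactive plan and translate its exit code into a CI decision.
# ci_plan_gate and the default log name are hypothetical.
ci_plan_gate() {
  log_file="${1:-plan.log}"   # full plan output is kept here for reviewers
  rc=0
  terraform plan -input=false -detailed-exitcode > "$log_file" 2>&1 || rc=$?

  case "$rc" in
    0) echo "no changes" ;;
    2) echo "changes pending review" ;;
    *) echo "plan failed, see $log_file"; return 1 ;;
  esac
}
```

A pipeline could stop on the failure case and require a manual approval whenever changes are pending.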


Related links
Step-by-step guide: Create Linux VM using Terraform

Create Windows VM using Terraform


Author Details

Hi, I'm Prashant — a full-time software engineer with a passion for automation, DevOps, and sharing what I learn. I started Py-Bucket to document my journey through tools like Docker, Kubernetes, Azure DevOps, and PowerShell scripting — and to help others navigate the same path. When I’m not coding or writing, I’m experimenting with side projects, exploring productivity hacks, or learning how to build passive income streams online. This blog is my sandbox — and you're welcome to explore it with me. Get in touch or follow me for future updates!