Kelvin Tay


Primer

At my current company, we provision each customer's infrastructure logically within an Oracle Cloud Infrastructure (OCI) Compartment.

More specifically, we manage this provisioning through internal Terraform modules.

When a customer has churned, we want to delete all resources, including the compartment itself (i.e., off-boarding).

I was recently placed in charge of automating this through our CI/CD pipelines.

Surprise 1: Implicit dependencies not managed in Terraform

Since we spin up the resources with Terraform (terraform apply), deleting them should logically mean terraform destroy, right?

Indeed, this was the first direction I took. However, after many failed tries, I noted that OCI attaches implicit resources, such as Cloud Guard targets and security zones, to the compartment, and these are not managed in our Terraform state.

So, in order to terraform destroy an OCI compartment, I would first need to delete all such implicit resources, like the Cloud Guard targets.

Otherwise, the terraform destroy simply fails with an error explaining this requirement.

I solved this by running OCI CLI commands before Terraform:

// Groovy (Jenkinsfile)

rootSecurityZoneId = getRootSecurityZone()
rootCloudTargetReportingRegion = getReportingRegion()

// NOTE: we need to delete any cloud guard targets linked to our compartment,
// otherwise, we cannot remove the compartment from cloud guard security zone.
sh """
    for target_id in \$(oci cloud-guard target list --all --compartment-id ${compartmentId} --region ${rootCloudTargetReportingRegion} | jq -r '.data.items|map(.id)|join(\" \")'); do
        echo \"Deleting target: \$target_id\"
        oci cloud-guard target delete --target-id \$target_id --region ${rootCloudTargetReportingRegion} --force --debug
    done
"""

sh "oci cloud-guard security-zone remove --compartment-id ${compartmentId} --security-zone-id ${rootSecurityZoneId} --region ${rootCloudTargetReportingRegion} --debug"

// after removing the compartment from the security zone, OCI may still attach a new cloud guard target.
// we need to check for, and delete, any such targets again.
sh """
    for target_id in \$(oci cloud-guard target list --all --compartment-id ${compartmentId} --region ${rootCloudTargetReportingRegion} | jq -r '.data.items|map(.id)|join(\" \")'); do
        echo \"Deleting target: \$target_id\"
        oci cloud-guard target delete --target-id \$target_id --region ${rootCloudTargetReportingRegion} --force --debug
    done
"""
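As an aside, the jq filter in those loops flattens the CLI's JSON response into a space-separated list of target OCIDs for the for-loop. With a stubbed response (the IDs below are made up), it behaves like this:

```shell
# Stub of an `oci cloud-guard target list` response (IDs are made up).
response='{"data": {"items": [{"id": "ocid1.cloudguardtarget.oc1..aaa"}, {"id": "ocid1.cloudguardtarget.oc1..bbb"}]}}'

# Same filter as in the pipeline: collect the ids and join them with spaces.
ids=$(echo "$response" | jq -r '.data.items|map(.id)|join(" ")')
echo "$ids"   # ocid1.cloudguardtarget.oc1..aaa ocid1.cloudguardtarget.oc1..bbb
```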

Surprise 2: Terraform Destroy took forever

I patted myself on the back as we blitzed past this hurdle. Onwards to terraform destroy everything!

Alas, I continued to see the deletion of the compartment take forever, even after all other resources had been deleted.

The command eventually timed out after 90 minutes, the default timeout for deletion. It took me a full day of fruitless retries, across different environments, before I realized this default existed.

We had not set any deletion timeout for the compartment in our module. Trying to patch this would be rather nightmarish, given the number of customers and the different versions of our Terraform module.

hcledit to the rescue!

With hcledit, I was able to patch the compartment's deletion timeout during the CI/CD pipeline just before the terraform destroy command.

# add timeouts {} block
hcledit block append module.tenant.oci_identity_compartment.customer_compartment timeouts --newline -u -f ./resources.tf

# insert delete = "4h" inside timeouts {} block
hcledit attribute append module.tenant.oci_identity_compartment.customer_compartment.timeouts.delete '"4h"' -u -f ./resources.tf

# the commands above will generate the following within our compartment resource:
#  timeouts {
#    delete = "4h"
#  }

# NOTE: you will need to `terraform apply -target <compartment>`
# to persist this to state, before `terraform destroy`

Off we go! I was rather proud of myself at this stage.

Yet, the command still timed out. OCI took more than 4 hours on many occasions. In fact, it took 22 hours once. πŸ€¦β€β™‚οΈ

Side note: the 90-minute default was based on OCI's p90 benchmark in 2021.

There is just no good limit we can set here. Setting it too high also means the CI/CD job runner idles while polling for the deletion status.

To be clear, we have a self-hosted Jenkins as our CI/CD, so the idle compute-time was less of a concern. If we were using cloud-hosted runners (i.e., paying for duration), this would have been an impossible choice.

I really wanted to carry on making terraform destroy work here, so that users could rest assured that everything was definitely deleted when the pipeline run completed.

Surprise 3: Retries spat in my face

β€œHey, we can just add a retry. This way, we can re-run terraform destroy to check on the compartment deletion, right?”

note: This is likely a bad idea if you are running with cloud-hosted runners. You will end up paying for idle time.

I congratulated my inner voice at that time.

Alas, OCI's API did not play nice.

Retrying the deletion on a compartment simply threw an HTTP 403, stating that the compartment was still being deleted.

module.tenant.oci_identity_compartment.customer_compartment: Destroying... [id=ocid1.compartment.oc1..<redacted>]

Error: 403-NotAllowed, Deleting non active compartment or Listing active resources is not allowed. Compartment Id: ocid1.compartment.oc1..<redacted>
Suggestion: Please retry or contact support for help with service: Identity Compartment
Documentation: https://registry.terraform.io/providers/oracle/oci/latest/docs/resources/identity_compartment 
API Reference: https://docs.oracle.com/iaas/api/#/en/identity/20160918/Compartment/DeleteCompartment 
...

Ultimately, with the OCI Terraform provider, I could not retry the terraform destroy smartly.
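For illustration, one could side-step the provider and poll the compartment's state directly via the OCI CLI (`oci iam compartment get` returns a `lifecycle-state` such as DELETING or DELETED). A minimal sketch, with a stubbed response since the OCID below is made up:

```shell
# Stubbed `oci iam compartment get` response (OCID is made up).
# In a real pipeline, this would instead come from:
#   response=$(oci iam compartment get --compartment-id "$compartment_id")
response='{"data": {"id": "ocid1.compartment.oc1..example", "lifecycle-state": "DELETING"}}'

# Extract the lifecycle state; a polling loop would sleep and retry
# until this reads DELETED.
state=$(echo "$response" | jq -r '.data."lifecycle-state"')
echo "$state"   # DELETING
```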

Eventual Solution

I eventually settled for the solution below:

  1. Run terraform state rm <compartment> to remove the compartment from Terraform state
  2. Run terraform destroy to delete everything else first
  3. Run the OCI CLI commands mentioned above to delete implicit resources on the compartment
  4. Rename the compartment, via the OCI CLI, to add a DELETE. prefix (marker)
  5. Delete the compartment asynchronously, via the OCI CLI
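Steps 1 and 2 boil down to the following sketch (the resource address is from our module layout; check yours with terraform state list):

```shell
# 1. Drop the compartment from Terraform state, so `destroy` skips it.
terraform state rm 'module.tenant.oci_identity_compartment.customer_compartment'

# 2. Destroy everything else that is still in state.
terraform destroy -auto-approve
```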

If you are curious, here is the script for steps 4 and 5:

// Groovy (Jenkinsfile)

// NOTE: when operating on compartments, we must point the request endpoint to the home region of our OCI tenancy
tenantHomeRegion = sh(
    returnStdout: true,
    script: "oci iam region-subscription list --tenancy-id ${tenancyId} | jq -r '.data|map(select(.\"is-home-region\" == true))|.[0].\"region-name\"'"
).trim()


// rename compartment first, to denote for deletion
renamed = "DELETE.${customerName}"
// include Jenkins job URL in description
desc = "Deleted. Triggered from ${env.BUILD_URL}"

sh """
    oci iam compartment update --compartment-id ${compartmentId} --region ${tenantHomeRegion} --name \"${renamed}\" --description \"${desc}\" --force --debug
"""

// NOTE: Deletion request is async.
sh """
    oci iam compartment delete --compartment-id ${compartmentId} --region ${tenantHomeRegion} --force --debug
"""

Step 4 allowed us to (1) denote the compartment's fate while the async deletion takes its time, and (2) leave room to recreate a compartment with the same name, just in case.
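For reference, the home-region lookup above hinges on a jq filter over the region-subscription list. With a stubbed response (the region data below is made up), it selects the home region's name:

```shell
# Stubbed `oci iam region-subscription list` response (made-up data).
response='{"data": [{"region-name": "us-phoenix-1", "is-home-region": false}, {"region-name": "us-ashburn-1", "is-home-region": true}]}'

# Same filter as in the pipeline: keep only the home region, take its name.
home_region=$(echo "$response" | jq -r '.data|map(select(."is-home-region" == true))|.[0]."region-name"')
echo "$home_region"   # us-ashburn-1
```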

Epilogue

Deleting an OCI compartment via Terraform was not easy. Hopefully, if you are in the same boat, my experience here can guide you on what not to try.

Otherwise, if I can offer further advice, you may be better off:

  1. Deleting everything else but leaving the compartment as-is (right now, compartments do not incur any costs), or
  2. Simply deleting it outside of Terraform, if you have to.

#terraform #oraclecloud #cicd #pain
