Kelvin Tay


Primer

At my current company, we provision each customer's infrastructure logically within an Oracle Cloud Infrastructure (OCI) Compartment.

More specifically, we manage this provisioning through internal Terraform modules.

When a customer has churned, we want to delete all resources, including the compartment itself (i.e., off-boarding).

I was recently placed in charge of automating this through our CI/CD pipelines.

Surprise 1: Implicit dependencies not managed in Terraform

Since we spin up the resources with Terraform (terraform apply), deleting them should logically mean terraform destroy, right?

Indeed, this was the first direction I took. However, after many failed tries, I noted that OCI attaches implicit resources to the compartment, such as Cloud Guard targets, which are not managed in our Terraform state.

So, in order to terraform destroy an OCI compartment, I would first need to delete these implicit Cloud Guard targets (and the like) myself.

Otherwise, the terraform destroy simply fails, with an error explaining this requirement.

For this, I solved it via running commands with the OCI CLI before Terraform:

// Groovy (Jenkinsfile)

rootSecurityZoneId = getRootSecurityZone()
rootCloudTargetReportingRegion = getReportingRegion()

// NOTE: we need to delete any cloud guard targets linked to our compartment,
// otherwise, we cannot remove the compartment from cloud guard security zone.
sh """
    for target_id in \$(oci cloud-guard target list --all --compartment-id ${compartmentId} --region ${rootCloudTargetReportingRegion} | jq -r '.data.items|map(.id)|join(\" \")'); do
        echo \"Deleting target: \$target_id\"
        oci cloud-guard target delete --target-id \$target_id --region ${rootCloudTargetReportingRegion} --force --debug
    done
"""

sh "oci cloud-guard security-zone remove --compartment-id ${compartmentId} --security-zone-id ${rootSecurityZoneId} --region ${rootCloudTargetReportingRegion} --debug"

// After removing the compartment from the security zone, OCI may still attach a new Cloud Guard target.
// We need to check again, and delete any remaining targets if so.
sh """
    for target_id in \$(oci cloud-guard target list --all --compartment-id ${compartmentId} --region ${rootCloudTargetReportingRegion} | jq -r '.data.items|map(.id)|join(\" \")'); do
        echo \"Deleting target: \$target_id\"
        oci cloud-guard target delete --target-id \$target_id --region ${rootCloudTargetReportingRegion} --force --debug
    done
"""

Surprise 2: Terraform Destroy took forever

I patted myself on the back as we blitzed past this hurdle. Onwards to terraform destroy everything!

Alas, I continued to see the deletion of the compartment take forever, even after all other resources had been deleted.

The command eventually timed out after 90 minutes, which turned out to be the default timeout for deletion. It took me a full day of fruitless retries, across different environments, before I realized this default existed.

We had not set any deletion timeout for the compartment in our module. Trying to patch this would be rather nightmarish, given the number of customers and the different versions of our Terraform module.

hcledit to the rescue!

With hcledit, I was able to patch the compartment's deletion timeout during the CI/CD pipeline just before the terraform destroy command.

# add timeouts {} block
hcledit block append module.tenant.oci_identity_compartment.customer_compartment timeouts --newline -u -f ./resources.tf

# insert delete = "4h" inside timeouts {} block
hcledit attribute append module.tenant.oci_identity_compartment.customer_compartment.timeouts.delete '"4h"' -u -f ./resources.tf

# the commands above will generate the following within our compartment resource:
#  timeouts {
#    delete = "4h"
#  }

# NOTE: you will need to `terraform apply -target <compartment>`
# to persist this to state, before `terraform destroy`
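
For clarity, the follow-up sequence looks roughly like this (a sketch; -auto-approve assumes a non-interactive CI/CD run, and the resource address is the compartment from our module):

# persist the patched timeout to state, then destroy
terraform apply -target='module.tenant.oci_identity_compartment.customer_compartment' -auto-approve
terraform destroy -auto-approve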

Off we go! I was rather proud of myself at this stage.

Yet, the command still timed out. OCI took more than 4 hours on many occasions. In fact, it took 22 hours once. 🤦‍♂️

(Side note: the 90-minute default was based on OCI's p90 benchmark from 2021.)

There is just no good limit we can set here. Setting it too high would also mean the CI/CD job runner idles while polling for the deletion status.

To be clear, we have a self-hosted Jenkins as our CI/CD, so the idle compute-time was less of a concern. If we were using cloud-hosted runners (i.e., paying for duration), this would have been an impossible choice.

I really wanted to carry on and make terraform destroy work here, so that users could rest assured that everything was definitely deleted once the pipeline run completed.

Surprise 3: Retries spat in my face

“Hey, we can just add a retry. This way, we can re-run terraform destroy to check on the compartment deletion, right?”

(Note: this is likely a bad idea if you are running on cloud-hosted runners; you will end up paying for idle time.)

I congratulated my inner voice at that time.

Alas, OCI's API did not play nice.

Retrying the deletion on a compartment simply threw an HTTP 403, stating that the compartment was still being deleted.

module.tenant.oci_identity_compartment.customer_compartment: Destroying... [id=ocid1.compartment.oc1..<redacted>]

Error: 403-NotAllowed, Deleting non active compartment or Listing active resources is not allowed. Compartment Id: ocid1.compartment.oc1..<redacted>
Suggestion: Please retry or contact support for help with service: Identity Compartment
Documentation: https://registry.terraform.io/providers/oracle/oci/latest/docs/resources/identity_compartment 
API Reference: https://docs.oracle.com/iaas/api/#/en/identity/20160918/Compartment/DeleteCompartment 
...

Ultimately, with the OCI Terraform provider, I could not retry the terraform destroy smartly.

Eventual Solution

I eventually settled for the solution below:

  1. Run terraform state rm <compartment> to remove the compartment from Terraform state
  2. Run terraform destroy to delete everything else first (steps 1 and 2 are sketched below)
  3. Run the OCI CLI commands mentioned above to delete the implicit resources on the compartment
  4. Rename the compartment, via the OCI CLI, to add a DELETE. prefix as a marker
  5. Delete the compartment asynchronously, via the OCI CLI
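
A minimal sketch of steps 1 and 2 (the compartment address matches the one from our module, and -auto-approve assumes a non-interactive pipeline):

# step 1: forget the compartment, so Terraform no longer tries to delete it itself
terraform state rm 'module.tenant.oci_identity_compartment.customer_compartment'

# step 2: delete everything else still managed by Terraform
terraform destroy -auto-approve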

If you are curious, here is the script for steps 4 and 5:

// Groovy (Jenkinsfile)

// NOTE: when operating on compartment, we need to point the request endpoint to the home region of our OCI tenant
tenantHomeRegion = sh(
    returnStdout: true,
    script: "oci iam region-subscription list --tenancy-id ${tenancyId} | jq -r '.data|map(select(.\"is-home-region\" == true))|.[0].\"region-name\"'"
).trim()


// rename compartment first, to denote for deletion
renamed = "DELETE.${customerName}"
// include Jenkins job URL in description
desc = "Deleted. Triggered from ${env.BUILD_URL}"

sh """
    oci iam compartment update --compartment-id ${compartmentId} --region ${tenantHomeRegion} --name \"${renamed}\" --description \"${desc}\" --force --debug
"""

// NOTE: Deletion request is async.
sh """
    oci iam compartment delete --compartment-id ${compartmentId} --region ${tenantHomeRegion} --force --debug
"""

Step 4 allowed us to (1) mark the compartment's fate while the async deletion takes its time, and (2) free up the name so a compartment with the same name can be recreated, just in case.

Epilogue

Deleting an OCI compartment via Terraform was not easy. Hopefully, if you are in the same boat, my experience here can guide you on what not to try.

Otherwise, if I can offer further advice, you may be better off:

  1. Deleting everything else but leaving the compartment as-is (compartments currently do not incur any costs). Or,
  2. Simply deleting the compartment outside of Terraform, if you have to.

#terraform #oraclecloud #cicd #pain


As a staff-level support engineer, one of my responsibilities is to empower my teammates to better reproduce customer environments.

CircleCI does offer an on-prem solution, CircleCI Server, that comes as a Helm chart.

Beyond a Kubernetes cluster (e.g., AWS EKS), you would also need to provision external object stores (e.g., AWS S3), IAM entities (e.g., AWS IAM users/roles), and so on.

My original goal was to provision everything within a single Terraform module. By everything, I mean the EKS cluster, the S3 bucket, the Helm release, and so on.

However, as the design progressed, I realized this was not ideal in several ways:

  1. Helm releases concern application deployments, while Terraform applies concern infrastructure deployments. Piggybacking a Helm release onto infrastructure changes felt odd to me. (This article, specifically anti-pattern 4, describes this conflict better than I can.)
  2. Terraform's philosophy requires that resources managed by Terraform be managed strictly within Terraform. As an administrator, this means any update I want to make to my Helm release has to again be done through the Terraform module. The documented notes on upgrades suggest that any detected drift can produce unintended changes if the administrator is not careful.
  3. I still want to define my EKS cluster with eksctl and YAML. There is indeed an eksctl provider, but it still requires the eksctl binary on the host machine.

The various discussions on Reddit (1, 2) also convinced me it was better to avoid shoehorning all the setup into a single Terraform module.

I ended up splitting the setup such that:

  • All non-EKS resources (e.g., the S3 bucket, IAM users) are managed via Terraform
  • The administrator creates the EKS cluster via eksctl, but the YAML config is first generated by Terraform's local_file resource
  • Similarly, the administrator manages the Helm release via helm commands, but the YAML values files are generated by Terraform's local_sensitive_file resource
  • The full commands to run for eksctl and helm are shown via Terraform outputs (see the sketch below)
  • Each installation phase is its own Terraform module, and we read a previous module's outputs as inputs (e.g., AWS tags) via the terraform_remote_state data source
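
To make the flow concrete, here is a rough sketch of what the administrator runs end to end (the module directory names and output names are hypothetical, not the exact ones in our setup):

# 1. provision the non-EKS prerequisites (S3 bucket, IAM users, etc.)
cd aws-prereqs && terraform init && terraform apply

# 2. render the eksctl cluster config (local_file), then create the cluster
cd ../eks-cluster && terraform init && terraform apply
eksctl create cluster -f "$(terraform output -raw eksctl_config_path)"

# 3. render the Helm values (local_sensitive_file), then install the chart
cd ../server-install && terraform init && terraform apply
helm upgrade --install circleci-server <chart> --namespace circleci-server -f "$(terraform output -raw helm_values_path)"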

This means the administrator has to manage about four or more Terraform modules instead of one. However, I feel this is easier to manage and reason about.

#terraform #helm #kubernetes #iac


I have released support for the CircleCI Runner resource-class and token in my unofficial Terraform provider for CircleCI, as of v0.10.3.

Developers can now manage the provisioning (and teardown) of CircleCI self-hosted Runners within Terraform. You can explore an example here.

This was a fun challenge, and I wanted to document my journey on this work.

Investigation

Unlike other resources, self-hosted runners are not manageable under the official V2 API. Developers had to use the CircleCI CLI to manage resource-classes and tokens instead.
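
For context, that CLI-based flow looks something like this (the namespace and names below are made up for illustration):

# create a resource-class under your organization's namespace
circleci runner resource-class create example-namespace/linux-runners "Self-hosted Linux runners"

# create a token for that resource-class
circleci runner token create example-namespace/linux-runners ci-token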

To port this into my Terraform provider, I was hoping an HTTP API was available. That way, I could continue my approach of abstracting the HTTP API behind a Go SDK.

I assumed (wrongly) that the CLI was using GraphQL under the hood for Runner operations, as it does for many other operations (e.g., for Orbs).

Digging into the source code, I then realized Runner resource-classes and tokens can be managed via an HTTP API; it was simply not publicly documented yet.

Glue

After the legwork mentioned above, I documented the HTTP APIs in an OpenAPI (Swagger 2.0) document and tested them against it. This enabled me to generate a Go SDK for the Runner APIs.
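
If you are curious, generating a client from such a document can look something like this with go-swagger (one possible generator; I am not claiming it is the exact toolchain used here, and the paths are placeholders):

# generate a Go client package from the Swagger 2.0 document
swagger generate client -f ./runner-api.yaml -A runner -t ./gen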

Why are the Go SDKs separated? Wouldn't it be easier to keep it all in one?

That was something I mulled over for some time indeed. I have documented my reasons for keeping them separate.

Assembly

With the Go SDK published, I “simply” had to expose the Runner resource-classes and tokens in the Terraform provider codebase.

The main work was done within a pull request here, which also included acceptance tests and examples.

I also noted that self-hosted (machine) runners are available to CircleCI's Server customers (i.e., self-hosted CircleCI), so we wanted to make sure this addition can be used by platform teams running CircleCI Server.

It turned out that the Runner API needed a small adjustment to work against CircleCI Server installations.

Thankfully, this was a quick patch. I was also able to verify this fix against my own CircleCI Server instance.

Things I learnt

This feature was satisfying for me to build, and I had many learning points along the journey.

  1. Read the code: This feature would not have been completed if I did not dig deeper into the publicly-available source code 📖

  2. Keep trying: I am still learning (and failing) at Go. However, I think it is important to keep trying and learning. Keeping this source code open-sourced also forces me to stay honest about my lack of knowledge. For fellow engineers out there, let's keep at it! 🤓

#terraform #circleci #runner #go


How Terraform works under the hood

I had recently published a Terraform provider for CircleCI.

I was motivated to better understand what is happening behind all that magic when we execute Terraform commands.

Here is my attempt to summarize how Terraform works under the hood.

Behind the magic

We can see Terraform as an ecosystem: there is Terraform core (including its CLI), and a registry of providers and modules.

Ultimately, you can think of Terraform core as a state machine. It stores the current state of your stack, and syncs that state against the cloud, via the right providers, based on your Terraform .tf files.

To support this sync, every provider needs to implement the CRUD functions for its resources (so that Terraform can create, read, update, and delete an AWS EC2 instance, for example).

If you are a cloud service provider that already exposes a RESTful API, your services are very much “Terraform-able”.

Terraform core talks to providers via gRPC. This is one of the many reasons why HashiCorp recommends writing providers in Go.
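
You can actually observe this conversation yourself: running any plan with trace logging surfaces the provider plugin start-up and the gRPC calls Terraform core makes to it (a rough sketch; the grep pattern is just for readability):

# surface the provider plugin and gRPC activity in Terraform's trace logs
TF_LOG=TRACE terraform plan 2>&1 | grep -E 'GRPCProvider|plugin'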

#terraform #provider #summary

Infrastructure as Code (IaC) is not a new concept for many. Prior to joining CircleCI, I had been wrangling with AWS CloudFormation long enough (two years) to feel some of its pain points.

Since joining CircleCI, I have been exposed to Terraform through internal and external projects. So far, I really like that Terraform is not tied to specific cloud providers.

I also like that any service can contribute a Terraform provider, thereby allowing users to define that service's resources as code. (Any cloud service that exposes a CRUD/RESTful API is a good candidate for Terraform.)

Recently, I have been wishing many things to be “Terraform-able”.

As a support engineer, I use Zendesk daily. In particular, we create Zendesk macros for canned responses. Over time, these macros grow, and it becomes hard to manage or validate changes to them. I do wish Zendesk had a Terraform provider, so we could manage our macros as code.

Recently, I also wished that I could Terraform my resume. Then, I snapped out of it.

When you have a powerful hammer like Terraform, everything looks like a nail.

#terraform #infrastructureascode #declarative
