Kelvin Tay

Primer

At my current company, we provision each customer's infrastructure logically within an Oracle Cloud Infrastructure (OCI) Compartment.

More specifically, we manage this provisioning through internal Terraform modules.

When a customer has churned, we want to delete all resources, including the compartment itself (i.e., off-boarding).

I was recently placed in charge of automating this through our CI/CD pipelines.

Surprise 1: Implicit dependencies not managed in Terraform

Since we spin up the resources with Terraform (terraform apply), deleting them should logically mean terraform destroy, right?

Indeed, this was the first direction I took. However, after many failed tries, I noted that OCI attaches implicit resources to the compartment, such as Cloud Guard targets and its security zone association, and these are not managed in our Terraform state.

So, in order to terraform destroy an OCI compartment, I would first need to delete these implicit Cloud Guard targets and the like.

Otherwise, the terraform destroy simply fails, with an error explaining this requirement.

I solved this by running OCI CLI commands before Terraform:

// Groovy (Jenkinsfile)

rootSecurityZoneId = getRootSecurityZone()
rootCloudTargetReportingRegion = getReportingRegion()

// NOTE: we need to delete any cloud guard targets linked to our compartment,
// otherwise, we cannot remove the compartment from cloud guard security zone.
sh """
    for target_id in \$(oci cloud-guard target list --all --compartment-id ${compartmentId} --region ${rootCloudTargetReportingRegion} | jq -r '.data.items|map(.id)|join(\" \")'); do
        echo \"Deleting target: \$target_id\"
        oci cloud-guard target delete --target-id \$target_id --region ${rootCloudTargetReportingRegion} --force --debug
    done
"""

sh "oci cloud-guard security-zone remove --compartment-id ${compartmentId} --security-zone-id ${rootSecurityZoneId} --region ${rootCloudTargetReportingRegion} --debug"

// after removing the compartment from the security zone, OCI may still attach a new Cloud Guard target.
// we need to check for, and delete, any such targets again.
sh """
    for target_id in \$(oci cloud-guard target list --all --compartment-id ${compartmentId} --region ${rootCloudTargetReportingRegion} | jq -r '.data.items|map(.id)|join(\" \")'); do
        echo \"Deleting target: \$target_id\"
        oci cloud-guard target delete --target-id \$target_id --region ${rootCloudTargetReportingRegion} --force --debug
    done
"""

Surprise 2: Terraform Destroy took forever

I patted myself on the back as we blitzed past this hurdle. Onwards to terraform destroy everything!

Alas, I continued to see the deletion of the compartment taking forever, even after all other resources had been deleted.

The command eventually timed out after 90 minutes, as this was the default timeout for deletion. It took me a full day of fruitless retries, across different environments, before realizing this default.

We had not set any deletion timeout for the compartment in our module. Trying to patch this would be rather nightmarish, given the number of customers and the different versions of our Terraform module.

hcledit to the rescue!

With hcledit, I was able to patch the compartment's deletion timeout during the CI/CD pipeline just before the terraform destroy command.

# add timeouts {} block
hcledit block append module.tenant.oci_identity_compartment.customer_compartment timeouts --newline -u -f ./resources.tf

# insert delete = "4h" inside timeouts {} block
hcledit attribute append module.tenant.oci_identity_compartment.customer_compartment.timeouts.delete '"4h"' -u -f ./resources.tf

# the commands above will generate the following within our compartment resource:
#  timeouts {
#    delete = "4h"
#  }

# NOTE: you will need to `terraform apply -target <compartment>`
# to persist this to state, before `terraform destroy`
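With the timeouts block patched in, the remaining steps are roughly as follows (a minimal sketch; -auto-approve assumes the pipeline runs non-interactively):

# persist the patched timeouts block into Terraform state first
terraform apply -target 'module.tenant.oci_identity_compartment.customer_compartment' -auto-approve

# then attempt the full teardown with the longer timeout in effect
terraform destroy -auto-approve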

Off we go! I was rather proud of myself at this stage.

Yet, the command still timed out. OCI took more than 4 hours on many occasions. In fact, it took 22 hours once. 🤦‍♂️

Side note: the 90-minute default was based on OCI's p90 benchmark in 2021.

There is just no good limit we can set here. Setting it too high would also mean the CI/CD job runner idles while polling for the deletion status.

To be clear, we have a self-hosted Jenkins as our CI/CD, so the idle compute-time was less of a concern. If we were using cloud-hosted runners (i.e., paying for duration), this would have been an impossible choice.

I really wanted to keep pushing to make terraform destroy work here, so that users could rest assured that everything was definitely deleted once the pipeline run completed.

Surprise 3: Retries spat in my face

“Hey, we can just add a retry. This way, we can re-run terraform destroy to check on the compartment deletion, right?”

note: This is likely a bad idea if you are running with cloud-hosted runners. You will end up paying for idle time.

I congratulated my inner voice at that time.

Alas, OCI's API did not play nice.

Retrying the deletion on a compartment simply threw an HTTP 403, stating that the compartment was still being deleted.

module.tenant.oci_identity_compartment.customer_compartment: Destroying... [id=ocid1.compartment.oc1..<redacted>]

Error: 403-NotAllowed, Deleting non active compartment or Listing active resources is not allowed. Compartment Id: ocid1.compartment.oc1..<redacted>
Suggestion: Please retry or contact support for help with service: Identity Compartment
Documentation: https://registry.terraform.io/providers/oracle/oci/latest/docs/resources/identity_compartment 
API Reference: https://docs.oracle.com/iaas/api/#/en/identity/20160918/Compartment/DeleteCompartment 
...

Ultimately, with the OCI Terraform provider, I could not retry the terraform destroy smartly.

Eventual Solution

I eventually settled for the solution below:

  1. Run terraform state rm <compartment> to remove the compartment from Terraform state
  2. Run terraform destroy to delete everything else first
  3. Run the OCI CLI commands mentioned above to delete implicit resources on the compartment
  4. Rename the compartment, via the OCI CLI, to add a delete. prefix (marker)
  5. Delete the compartment asynchronously, via the OCI CLI
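For steps 1 and 2, here is a minimal sketch of the commands (reusing the compartment's address from earlier; -auto-approve assumes a non-interactive pipeline):

# step 1: drop the compartment from Terraform state, so destroy no longer waits on it
terraform state rm 'module.tenant.oci_identity_compartment.customer_compartment'

# step 2: delete everything else still managed by Terraform
terraform destroy -auto-approve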

If you are curious, here is the script for steps 4 and 5:

// Groovy (Jenkinsfile)

// NOTE: when operating on compartments, we need to point the request endpoint to the home region of our OCI tenancy
tenantHomeRegion = sh(
    returnStdout: true,
    script: "oci iam region-subscription list --tenancy-id ${tenancyId} | jq -r '.data|map(select(.\"is-home-region\" == true))|.[0].\"region-name\"'"
).trim()


// rename compartment first, to denote for deletion
renamed = "DELETE.${customerName}"
// include Jenkins job URL in description
desc = "Deleted. Triggered from ${env.BUILD_URL}"

sh """
    oci iam compartment update --compartment-id ${compartmentId} --region ${tenantHomeRegion} --name \"${renamed}\" --description \"${desc}\" --force --debug
"""

// NOTE: Deletion request is async.
sh """
    oci iam compartment delete --compartment-id ${compartmentId} --region ${tenantHomeRegion} --force --debug
"""

Step 4 allowed us to (1) denote the compartment's fate while the async deletion takes its time, and (2) allow a compartment with the same name to be recreated, just in case.

Epilogue

Deleting an OCI compartment via Terraform was not easy. Hopefully, if you are in the same boat, my experience here can guide you on what not to try.

Otherwise, if I can offer further advice, you may be better off:

  1. Deleting everything else but leaving the compartment as-is; compartments currently do not incur any costs. Or,
  2. Simply deleting the compartment outside of Terraform, if you have to.

#terraform #oraclecloud #cicd #pain

Many CI/CD providers, like GitHub Actions and CircleCI, offer the option to run your CI/CD jobs using Docker images today.

This is a useful feature since you can ensure your job is always running with the same pre-installed dependencies. Teams may choose to bring their own Docker image, or use readily-available community images on Docker Hub for instance.

One challenge is that the Docker image in use may not have been intended for use in CI/CD automation. Your team may thus find itself debugging puzzles like:

  • I'm pretty sure XYZ is installed. Why does the CI/CD job fail to find XYZ?
  • Why is the FOOBAR environment variable different from what we have defined in the Docker image?

Docker image 101

When you execute the docker container run ... command, Docker will, by default, run a Docker container as a process based on the ENTRYPOINT and CMD definitions of your image. Docker will also load the environment variables declared in the ENV definitions.

Your ENTRYPOINT and CMD may be defined to run a long-running process (e.g., web application), or a short process (e.g., running Speccy to validate your OpenAPI spec). This will depend on the intended use of your image.

In addition, Docker images are generally designed to come with just enough tools for their intended purpose. For example, the wework/speccy image understandably does not come with git or curl installed (see Dockerfile).

Docker images may also be published for specific OS architectures only (e.g., linux/amd64). You will want to confirm which OS and architecture the image can be run on.
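You can confirm this locally before relying on an image (a quick check, reusing the example image from the cheatsheet below):

# show the OS/architecture that the pulled image targets
docker image inspect docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01 | jq -r '.[0].Os + "/" + .[0].Architecture'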

These are important pieces of context when designing CI/CD jobs around Docker images.

Understanding Docker images for CI/CD

Generally, for CI/CD automation, your job will run a series of shell commands in the build environment.

CI/CD providers like GitLab CI and CircleCI achieve this by overriding your Docker image's entrypoint with /bin/sh or /bin/bash when executing it as a container.

This is why you would want to use the -debug tag variant of Kaniko's Docker image when using it in CircleCI, for instance.
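You can emulate this override locally to see how an image behaves when its entrypoint is replaced (a sketch, using the same example image as the cheatsheet below):

# run the image with its entrypoint overridden by a shell, as a CI/CD provider would
docker container run --rm -it --entrypoint /bin/sh docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01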

Additionally, your Docker image may not come with the required tools for your CI/CD automation. For example, you would require git in order to clone the repository as part of the CI/CD job steps.

Debug Cheatsheet

With this information in mind, here is a list of commands you can run locally to debug your chosen Docker image.

# inspect the "built-in" environment variables of an image
$ docker image inspect docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01 | jq ".[0].Config.Env"
[
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  "LANG=en_US.UTF-8",
  "PYSPARK_PYTHON=python3",
  "SPARK_HOME=/home/glue_user/spark",
  "SPARK_CONF_DIR=/home/glue_user/spark/conf",
  "PYTHONPATH=/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.7-src.zip:/home/glue_user/spark/python/",
  "PYSPARK_PYTHON_DRIVER=python3",
  "HADOOP_CONF_DIR=/home/glue_user/spark/conf"
]

# check the default entrypoint
$ docker image inspect docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01 | jq ".[0].Config.Entrypoint"
[
  "bash",
  "-lc"
]

# check the default cmd
$ docker image inspect docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01 | jq ".[0].Config.Cmd" 
[
  "pyspark"
]

# check tools installed
$ docker container run --rm docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01 "git --version"
...
git version 2.37.1

$ docker container run --rm docker.io/amazon/aws-glue-libs:glue_libs_2.0.0_image_01 "python --version"
...
Python 2.7.18

You can also find an example of this debugging for a CircleCI use-case here: https://github.com/kelvintaywl-cci/docker-executor-explore/blob/main/.circleci/config.yml

#docker #cicd #debug #cheatsheet

Preface

In a CircleCI pipeline, workflows run independently of one another. As such, there is no built-in feature to ensure workflow B runs after workflow A.

However, you can still achieve ordering, through some trickery.

💡 You can control the sequence of jobs within a workflow. I recommend first considering whether you can combine/merge workflow B into workflow A itself. This article is for when you require separate workflows.

How

To achieve ordering, we simply set an approval job as the first job for workflow B.

# contrived snippet of a .circleci/config.yaml

workflows:
  aaa:
    jobs:
      - one
      - two
  bbb:
    jobs:
      - start:
          type: approval
      - next:
          requires:
            - start

Subsequent jobs in workflow B will only run when the approval job is approved. As such, you can “force” a wait, and only approve this job when workflow A is completed.

Note that this requires manual intervention, of course.

However, a benefit in this approach is that your team can take the time to confirm the outcomes of workflow A. For example, workflow A has deployed some infrastructure changes (e.g., terraform apply), and you prefer inspecting these changes before running workflow B.

One Step Further

You can automate this approval, at the end of workflow A, via the Approve a job API.

Specifically, you would need to create a job that does the following:

  1. Find workflow B's ID from the current pipeline.
  2. Find the approval job's ID from the invoked workflow B.
  3. Approve the job.

jobs:
  ...
  approve-workflow:
    parameters:
      workflow-name:
        type: string
        description: workflow name
      job-name:
        type: string
        description: name of approval job in workflow
    docker:
      - image: cimg/base:current
    steps:
      - run:
          name: Find Workflow ID for << parameters.workflow-name >>
          command: |
            curl -H "Circle-Token: $CIRCLE_TOKEN" https://circleci.com/api/v2/pipeline/<< pipeline.id >>/workflow > workflows.json
            WORKFLOW_ID=$(jq -r '.items | map(select(.name == "<< parameters.workflow-name >>")) | .[0].id' workflows.json)
            echo "export WORKFLOW_ID='${WORKFLOW_ID}'" >> $BASH_ENV
      - run:
          name: Find Job ID for << parameters.job-name >>
          command: |
            curl -H "Circle-Token: $CIRCLE_TOKEN" "https://circleci.com/api/v2/workflow/${WORKFLOW_ID}/job" > jobs.json
            APPROVAL_JOB_ID=$(jq -r '.items | map(select(.name == "<< parameters.job-name >>" and .type == "approval")) | .[0].id' jobs.json)
            echo "export APPROVAL_JOB_ID='${APPROVAL_JOB_ID}'" >> $BASH_ENV
      - run:
          name: Approve job
          command: |
            curl -X POST -H "Circle-Token: $CIRCLE_TOKEN" "https://circleci.com/api/v2/workflow/${WORKFLOW_ID}/approve/${APPROVAL_JOB_ID}" | jq .

In the spirit of sharing, I have created a CircleCI Orb that codifies the above job for your convenience.

https://circleci.com/developer/orbs/orb/kelvintaywl/control-flow

I hope this article and the Orb will be useful. Keep on building, folks!

#circleci #cicd #workflow

Before Diving in

This is an attempt to explain and explore how teams can use Docker Buildx for delivering Docker images.

Since we will not be covering all features around Docker Buildx, this is a wide snorkel rather than a deep dive.

This is a quick article for developers who have yet to use Docker Buildx but are curious about its use-cases.

What is Docker Buildx?

Let's take a few steps back before plunging in.

We use Docker Build to build Docker images from Dockerfiles.

Since Docker 18.09, BuildKit has been available as an improved version of the legacy builder. As an example, we can mount secrets when building our images with BuildKit. BuildKit will also ensure that these secrets are not exposed within the built image's layers.
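As a small illustration of the secrets feature (a sketch; the secret id, file path, and Dockerfile step are hypothetical):

# Dockerfile step (shown here as a comment): the secret is mounted only for this RUN instruction
#   RUN --mount=type=secret,id=npm_token \
#       NPM_TOKEN=$(cat /run/secrets/npm_token) npm ci

# build with BuildKit enabled, sourcing the secret from a local file;
# the token never ends up in an image layer
DOCKER_BUILDKIT=1 docker build --secret id=npm_token,src=./npm_token.txt .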

Buildx builds (no pun intended) on top of BuildKit. It comes with more operations besides image-building, as you can see from its available commands. Importantly, Buildx provides features for caching and cross-platform image builds.

Why should we use Docker Buildx?

For software teams shipping Docker images often, Docker Buildx can be an important tool in the box.

Caching image layers ensures that subsequent rebuilds of the image will be faster.

Previously, teams would need separate machines for each platform they built images for. For example, we would need an ARM64 machine to build a Docker image for ARM64 architectures.

With Docker Buildx's cross-platform feature, we can now use the same AMD64 machine to build both AMD64 and ARM64 Docker images.

Why is it relevant in CI/CD?

Many teams are building Docker images as part of their CI/CD pipelines. Hence, they can lean on the build cache and cross-platform capabilities of Docker Buildx to build various images faster and cheaper.

Let's discuss these two features a little further.

Caching

This pertains to the cache-from and cache-to options with the docker buildx build command.

Docker Buildx allows you to choose your caching strategy (e.g., inline, local, registry), and each comes with its pros and cons.
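For instance, with the registry strategy, a build can pull cache from and push cache to an image tag in your registry (a sketch; the registry and tag names are placeholders):

docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  --tag registry.example.com/myapp:latest \
  --push .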

Your choice will depend largely on your team's philosophy and the CI/CD provider.

For example, you can leverage GitHub's Cache service when running Docker Buildx on GitHub Actions.

For CircleCI users, you may find my exploratory project here useful.

Cross-platform

When building an ARM64 Docker image in a CI/CD pipeline without Buildx, you would need to do so on an ARM64-based machine runner.

Depending on your CI/CD provider, there may not be ARM64 support.

This can be worked around if your CI/CD provider allows you to “bring your own runners” (also known as self-hosted runners). GitHub Actions and CircleCI support self-hosted runners. However, it does mean someone on your team now has to manage these runners on your infrastructure.

With Docker Buildx, we can now build cross-platform images within any arbitrary machine runner.
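Here is roughly what that looks like on a single AMD64 runner (a sketch; the builder name and image tag are placeholders, and QEMU emulation is assumed to be set up via the tonistiigi/binfmt image):

# register QEMU emulators so the AMD64 host can execute ARM64 build steps
docker run --privileged --rm tonistiigi/binfmt --install arm64

# create and select a buildx builder backed by BuildKit
docker buildx create --name multiarch --use

# build (and push) images for both platforms in one invocation
docker buildx build --platform linux/amd64,linux/arm64 --tag registry.example.com/myapp:latest --push .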

This can be a big win for teams that prefer not to own additional infrastructure.

Resurfacing to Shore

We have explored the appeal of Docker Buildx, particularly in a CI/CD context here. As mentioned, it is ultimately a tool. For teams building Docker images in their CI/CD pipelines, I do encourage you to look into Docker Buildx if you have not!

#docker #buildx #cicd #performance
