Automatically deleting old Cloudinary uploads

February 23, 2022

tl;dr

Use a GitHub action on a cron schedule to make requests to the Cloudinary API, retrieve images based on a set of conditions, and delete those images. Each run deletes a maximum of 100 images, so adjust the cron schedule accordingly.

You'll end up with an action that looks like this one, from the Inscryber repo.

Background

The case study for this feature is Inscryber, a personal project that lets users build custom Inscryption cards. Cloudinary has such a powerful API for transforming images that it was the clear choice for supporting this app.

One step in the card creation process is that users can upload custom card art. This gets stored on Cloudinary in a specific folder (think Inscryber/Uploads) and then used in transformations. As more cards get created, more images are thrown into this folder, and they're completely unused after the card gets generated. There's no point keeping them, as they are just wasting space.

Cloudinary has no built-in feature to remove these automatically. The discussion on it spans 2012-2021, and ends with an unsatisfying “this is under consideration”. Don't take this as criticism of the Cloudinary team – it is a brilliant service, and Inscryber could not exist without it. But it was important to automate the cleanup, as a manual process would be forgotten.

Prerequisites & Limitations

You should have a Cloudinary account set up.
You will need a repo on GitHub.

Every run of this action will destroy at most 100 images, oldest first. The limiting factor is the Admin API, which accepts a maximum of 100 IDs per request.

To delete more images per day, run the job more frequently by adjusting the cron schedule. Inscryber's runs every 6 hours, for a maximum of 400 images deleted per day.
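
As a rough sketch of that adjustment, the schedule below runs the job every 3 hours instead. It is an assumed example rather than Inscryber's actual configuration: 8 runs per day gives a ceiling of 800 deleted images per day.

```yaml
on:
  schedule:
    # Every 3 hours, on the hour: 8 runs per day x 100 images = up to 800 per day
    - cron: "0 */3 * * *"
  # Keeps the manual trigger available from the Actions tab
  workflow_dispatch:
```

GitHub does not guarantee that scheduled runs start exactly on time, so treat the daily figure as an upper bound.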

Tool justification

So, for this specific situation:

The frontend is a Next.js app
This is hosted on Netlify
The code repo is on GitHub

Which presented two possible solutions for running code on a schedule:

1. Netlify scheduled functions

At time of writing, Netlify have recently announced the beta for their scheduled functions. These run JavaScript code on a standard cron schedule.

Positives

We'll be using JavaScript, so the language will be consistent with the rest of the app and it'll be easier to write complex code.
It'll be easier to debug code, as generic console.logs will render and we can use the Netlify dev server to run this locally.

Negatives

It adds a dependency to Netlify, and means switching hosts will break this functionality.
Scheduled functions are a beta feature, with very little documentation or support.
Further to the previous point, the Netlify team “strongly recommend not using it in any production and/or critical workflows”.

2. GitHub Actions

GitHub Actions are GitHub's answer to CI/CD. In personal experience they are generally non-standard, needlessly confusing, and difficult to test. But one action made this a viable contender for this use case: @fjogeleit/http-request-action

This gives the GH action the power to make full-featured HTTP requests and inspect the response to them.

Positives

GitHub Actions is already used to run automated tests. This will keep the process consistent with the existing Inscryber codebase.
There are mounds of documentation and blog posts from the community teaching people how to write them... Which is a little meta.
As we run on Ubuntu, we have access to suites of command line tools that we can use to manipulate data. For example, jq.

Negatives

It is significantly less readable than JS code.
As always, it means locking Inscryber to GitHub. Moving to BitBucket becomes significantly more difficult... Which provides an excellent excuse to not move to BitBucket 😄
There is still little good support for testing locally. Writing GH actions generally involves a pull request with 200 commits, 1 file changed.

So, for Inscryber...

GitHub Actions was used, with one such pull request.

Setting up authentication secrets

The Cloudinary API and the GH action we use both support Basic Auth. Your credentials for this are shown when you log into Cloudinary, on the “Dashboard” tab. We will need the Cloud Name, API Key, and API Secret so keep this tab open.

To store these securely in a way that we can use, we are going to use GitHub Action secrets. Navigate to https://github.com/{GH-username}/{GH-repo}/settings/secrets/actions, or Your Repo > Settings > Secrets > Actions via the UI.

For each of the three secrets, click “Create New Secret”, give it a logical name, and copy-paste the value from the Cloudinary dashboard into “Value”. The full action in this post uses CLOUDINARY_CLOUD_NAME, CLOUDINARY_API_KEY, and CLOUDINARY_API_SECRET_KEY. Now we can access these values in the action with the syntax ${{ secrets.SECRET_NAME }}

If you do some tests with these, you'll see that GH always masks the values in logs, showing them as *** instead of printing them. Don't worry, the actual values are used when needed.
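
To check the secrets are wired up without ever printing them, a throwaway step along these lines can help. It is only a sketch: the secret names are the ones used in the full action at the end of this post, so substitute whatever names you chose.

```yaml
      - name: Check Cloudinary secrets are set
        env:
          CLOUD_NAME: ${{ secrets.CLOUDINARY_CLOUD_NAME }}
          API_KEY: ${{ secrets.CLOUDINARY_API_KEY }}
          API_SECRET: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}
        run: |
          # Fail the job early if any secret is empty; the values are never echoed
          for name in CLOUD_NAME API_KEY API_SECRET; do
            if [ -z "${!name}" ]; then
              echo "Missing secret for $name"
              exit 1
            fi
          done
          echo "All three Cloudinary secrets are present"
```

A secret that does not exist evaluates to an empty string, which is what the check relies on.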

The base action

This is pretty typical for most GH actions; it's included here for completeness. The code in the following sections is added as new steps.

# Runs every six hours, ie every day @ midnight, 6am, midday, and 6pm
# Each run deletes a maximum of 100 images
name: Cloudinary Cleanup

on:
  schedule:
    - cron: "0 0,6,12,18 * * *"
  # Enables manual run from the Actions tab in GH
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:

The only notable things here:

This runs every day at midnight, 6am, midday, and 6pm (GitHub evaluates cron schedules in UTC)
The job can be run manually from the GH UI. This is great for debugging, as you can run it from a specific branch

Retrieving image public IDs

With our secrets set up, it's time to query the API for the first time. We are using the Cloudinary Search API, which lets us perform complex queries with its expression parameter.

So, we add the first step:

      - name: Retrieve files in cloudinary
        id: get-request
        uses: fjogeleit/http-request-action@master
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/search
          data: '{ "expression": "folder=Uploads AND uploaded_at<=6h", "max_results": 100, "sort_by": [{"uploaded_at": "asc"}] }'
          method: 'POST'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}

Notably:

id is a required field to access the response to this request, as per the Request action docs
The url is standard. The username and password are how we configure the basic auth credentials for this
The exciting part of this is data.
- The expression argument is a query for all uploads within the Uploads folder that were uploaded 6 hours or more ago.
- max_results could theoretically be increased to 500; however, this will then be unusable with the Admin API later, so we reduce the limit.
- Setting the sort_by argument ensures we always delete the oldest images.

The search API is incredibly powerful, so examine the documentation for it and customise your expression to match your needs.

Remember, here we are trying to select images that will be deleted.
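
Before wiring up any deletion, it is worth confirming what the expression actually matches. Here is a sketch of a logging step that could follow the request. It assumes the response carries the total_count, public_id, and uploaded_at fields shown in Cloudinary's documented sample response, and RESPONSE is a variable name chosen for illustration.

```yaml
      - name: LOG matching images before deleting anything
        env:
          # Hypothetical variable name; holds the raw JSON response string
          RESPONSE: ${{ steps.get-request.outputs.response }}
        run: |
          # Number of assets that matched the expression
          jq '.total_count' <<< "$RESPONSE"
          # One public ID and upload timestamp per line
          jq -r '.resources[] | "\(.public_id)  \(.uploaded_at)"' <<< "$RESPONSE"
```

jq comes preinstalled on GitHub-hosted Ubuntu runners, so no setup step is needed.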

Cloudinary documents a sample response. It provides a huge amount of information for us, but all we need is the public ID for each image. We need to construct an array of ["public_id_1", "public_id_2"].

Format IDs

Therefore, we strip the excess data from the response using jq.

Through lots of trial and even more error, a jq expression like the following was built:

      - name: Filter output with jq
        id: jq-filter
        run: echo ::set-output name=RESOURCES::$(echo ${{ toJSON(steps.get-request.outputs.response) }} | jq '.resources | [.[].public_id]')

Once again, we need to specify an id so we can access the result of this step
The echo ::set-output name=RESOURCES:: element of the run step stores the result of this query as a step output named RESOURCES, so we can use it in later steps. It's a bit of a weird process, but has some useful guidance.
${{ toJSON(steps.get-request.outputs.response) }} is used to retrieve the output of the previous step. toJSON wraps the response in quotes and escapes it, so the shell's echo receives it intact
| jq '.resources | [.[].public_id]') is where we pipe the result into a jq expression to tidy it. AFAIK the expression evaluates to:
- .resources steps us into the resources key of the response, ie where image metadata for all returned images is kept
- .[].public_id iterates over every item in that array and outputs only its “public_id” value
- Wrapping the previous step in [] collects those values into an array

🤷🏻‍♂️ if you know better, get in touch.
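
GitHub has since deprecated the ::set-output command in favour of appending name=value lines to the file at $GITHUB_OUTPUT. A sketch of the same step in that style follows. Passing the response through an environment variable (RESPONSE is a name chosen here for illustration) is an assumption of this sketch rather than what Inscryber does.

```yaml
      - name: Filter output with jq
        id: jq-filter
        env:
          # Hypothetical variable name; holds the raw JSON response string
          RESPONSE: ${{ steps.get-request.outputs.response }}
        run: |
          # .resources     -> the array of image objects in the response
          # .[].public_id  -> emits the public_id of each object in that array
          # [ ... ]        -> gathers the emitted values into a single array
          # -c keeps the array on one line, as name=value outputs are single-line
          IDS=$(jq -c '.resources | [.[].public_id]' <<< "$RESPONSE")
          echo "RESOURCES=$IDS" >> "$GITHUB_OUTPUT"
```

For a response containing two images with the made-up public IDs Uploads/a and Uploads/b, RESOURCES would hold ["Uploads/a","Uploads/b"]. Later steps read it exactly as before, via steps.jq-filter.outputs.RESOURCES.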

Delete images associated with these IDs

With this argument constructed, we query the Cloudinary Admin API to bulk delete records.

      - name: Delete selected files
        uses: fjogeleit/http-request-action@master
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/image/upload
          data: "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"
          contentType: 'application/x-www-form-urlencoded'
          method: 'DELETE'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}

Most of this is pretty typical and constant, as with the previous request. Things get tricky when we consider the data attribute.

We need to provide an array of data. This is a messy & imperfect solution, and suggestions for improving it are absolutely welcome.

One way to format an array argument is key[]=value1&key[]=value2, the same convention used in URL query strings, here sent as the form-encoded request body. That's the string we're building here. Building outwards...

fromJSON(steps.jq-filter.outputs.RESOURCES) retrieves the formatted list of public IDs
join(..., '&public_ids[]=') creates any internal copies of public_ids[]= that we need. For example, just this would turn [1, 2] into 1&public_ids[]=2
We precede this with another instance of public_ids[]= to write the key before the first value.
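
To see the string this produces without touching Cloudinary, a throwaway step can run the same expression over a hard-coded list. The two IDs are made up for illustration.

```yaml
      - name: DEMO form-encoded body built from a fixed list
        run: |
          # fromJSON parses the literal into an array, join glues the items together
          echo "public_ids[]=${{ join(fromJSON('["Uploads/a","Uploads/b"]'), '&public_ids[]=') }}"
          # prints: public_ids[]=Uploads/a&public_ids[]=Uploads/b
```

Nothing here URL-encodes the IDs, so this approach assumes public IDs contain no characters such as & or = that have special meaning in a form-encoded body.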

And... that's it.

Conclusion

The full action at time of writing is here:

# Lists all images in the cloudinary upload folder & deletes those at least 1 day old.
# When no files are found to be deleted, the delete request returns "not found"

# Runs every six hours, ie every day @ midnight, 6am, midday, and 6pm
# Each run deletes a maximum of 100 images
name: Cloudinary Cleanup

on:
  schedule:
    - cron: "0 0,6,12,18 * * *"
  # Enables manual run from the Actions tab in GH
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # The "expression" body argument is built using the Cloudinary search API
      # It translates to "find images in the Inscryption/Uploads folder that are
      # at least 1 day old"
      # https://cloudinary.com/documentation/search_api#expressions
      - name: Retrieve files in cloudinary
        id: get-request
        uses: fjogeleit/http-request-action@master
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/search
          data: '{ "expression": "folder=Inscryption/Uploads AND uploaded_at<=1d", "max_results": 100, "sort_by": [{"uploaded_at": "asc"}] }'
          method: 'POST'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}

      # Output of this response returns full metadata for each image. We don't need
      # most of it, so use command line tool JQ to build an array of public IDs.
      # This is stored as an output so we can use in later steps
      - name: Filter output with jq
        id: jq-filter
        run: echo ::set-output name=RESOURCES::$(echo ${{ toJSON(steps.get-request.outputs.response) }} | jq '.resources | [.[].public_id]')

      - name: LOG all IDs to be deleted
        run: echo ${{ steps.jq-filter.outputs.RESOURCES }}

      - name: LOG joined public IDs to be sent to DELETE request
        run: echo "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"

      # Send request with ids to delete as 'application/x-www-form-urlencoded' content.
      # IDs are joined with the `public_ids[]` key.
      - name: Delete selected files
        uses: fjogeleit/http-request-action@master
        id: delete-request
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/image/upload
          data: "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"
          contentType: 'application/x-www-form-urlencoded'
          method: 'DELETE'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}

      - name: LOG response from DELETE request
        run: echo ${{ steps.delete-request.outputs.response }}

(Live version should be in the Inscryber repo)

A few comments and debug steps have been added just to make life easier, but fundamentally it is the three steps discussed in this post.
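
The comment at the top of the action notes that the delete request returns "not found" when no files match. If that noise is unwanted, one option is to skip the delete step on empty results. This is a sketch, assuming the filter step outputs exactly [] when the search matches nothing:

```yaml
      - name: Delete selected files
        id: delete-request
        # Skip the request when the jq filter produced an empty array
        if: steps.jq-filter.outputs.RESOURCES != '[]'
        uses: fjogeleit/http-request-action@master
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/image/upload
          data: "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"
          contentType: 'application/x-www-form-urlencoded'
          method: 'DELETE'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}
```

When the step is skipped its response output is empty, so the final LOG step simply prints a blank line.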

Congrats on making it this far!