Automatically deleting old Cloudinary uploads
tl;dr
Use a GitHub action on a cron schedule to make requests to the Cloudinary API, retrieve images based on a set of conditions, and delete those images. Every request deletes a maximum of 100 images, so adjust the cron schedule accordingly.
You'll end up with an action that looks like the one in the Inscryber repo.
Background
The case study for this feature is Inscryber, a personal project that lets users build custom Inscryption cards. Cloudinary has such a powerful API for transforming images that it was the clear choice for supporting this app.
One step in the card creation process is that users can upload custom card art. This gets stored on Cloudinary in a specific folder (think Inscryber/Uploads) and then used in transformations. As more cards get created, more images are thrown into this folder, and they're completely unused after the card gets generated. There's no point keeping them, as they are just wasting space.
Cloudinary has no built-in feature to remove these automatically. The discussion on it spans 2012-2021, and ends with a dissatisfying “this is under consideration”. Don't take this as criticism of the Cloudinary team: it is a brilliant service, and Inscryber could not exist without it. But it was important to automate this cleanup for Inscryber, as a manual process would be forgotten.
Prerequisites & Limitations
- You should have a Cloudinary account set up.
- You will need a repo on GitHub.
Every run of this action will destroy up to 100 of the oldest matching images. The limiting factor for this is the Admin API, which accepts a maximum of 100 IDs per request.
Therefore, adjust the number of deleted images per day by adjusting the cron schedule and running the job more frequently if needed; Inscryber's runs every 6 hours, for a maximum of 400 images deleted per day.
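As a rough sanity check on that arithmetic, here is a minimal Python sketch that counts the daily runs implied by a cron line and multiplies by the 100-ID batch limit. The function names are hypothetical, and it assumes the minute and hour fields are plain comma-separated numbers with the remaining fields left as `*` (no ranges or step values).

```python
ADMIN_API_BATCH_LIMIT = 100  # maximum public IDs the Admin API accepts per request


def runs_per_day(cron: str) -> int:
    """Count daily runs for a cron line with simple minute and hour lists.

    Assumption: only comma-separated numbers are supported here, so
    expressions like "*/6" or "0-12" are rejected rather than guessed at.
    """
    minute, hour, *_rest = cron.split()
    for field in (minute, hour):
        if not all(part.isdigit() for part in field.split(",")):
            raise ValueError(f"unsupported cron field: {field!r}")
    return len(minute.split(",")) * len(hour.split(","))


def max_deleted_per_day(cron: str) -> int:
    """Upper bound on images deleted per day: one batch per run."""
    return runs_per_day(cron) * ADMIN_API_BATCH_LIMIT


print(max_deleted_per_day("0 0,6,12,18 * * *"))  # prints 400
```

If uploads regularly outpace that ceiling, adding more hours to the schedule raises it by 100 per extra run.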
Tool justification
For this specific situation, there were two possible solutions for running code on a schedule:
1. Netlify scheduled functions
At time of writing, Netlify have recently announced the beta for their scheduled functions. These run JavaScript code on a standard cron schedule.
Positives
- We'll be using JavaScript, so the language will be consistent with the rest of the app and it'll be easier to write complex code.
- It'll be easier to debug code, as generic console.log calls will render and we can use the netlify dev server to run this locally.
Negatives
- It adds a dependency to Netlify, and means switching hosts will break this functionality.
- Scheduled functions are a beta feature, with very little documentation or support.
- Further to the previous point, the Netlify team “strongly recommend not using it in any production and/or critical workflows”.
2. GitHub Actions
GitHub Actions are GitHub's answer to CI/CD. In personal experience they are generally non-standard, needlessly confusing, and difficult to test. But one action made this a viable contender for this use case: @fjogeleit/http-request-action
This gives the GH action the power to make full-featured HTTP requests and inspect the response to them.
Positives
- GitHub Actions is already used to run automated tests. This will keep the process consistent with the existing Inscryber codebase.
- There are mounds of documentation and blog posts from the community teaching people how to write them... Which is a little meta.
- As we run on ubuntu, we have access to suites of command line tools that we can use to manipulate data. For example, jq.
Negatives
- It is significantly less readable than JS code.
- As always, it means locking Inscryber to GitHub. Moving to Bitbucket becomes significantly more difficult... Which provides an excellent excuse to not move to Bitbucket 😄
- There is still little good support for testing locally. Writing GH actions generally involves a pull request with 200 commits, 1 file changed.
So, for Inscryber...
GitHub Actions was used, with one such pull request.
Setting up authentication secrets
The Cloudinary API and the GH action we use both support Basic Auth. Your credentials for this are shown when you log into Cloudinary, on the “Dashboard” tab. We will need the Cloud Name, API Key, and API Secret, so keep this tab open.
To store these securely in a way that we can use, we are going to use GitHub Action secrets. Navigate to https://github.com/{GH-username}/{GH-repo}/settings/secrets/actions, or Your Repo > Settings > Secrets > Actions via the UI.
For each of the three secrets, click “Create New Secret”, give them a logical name, and copy-paste from the Cloudinary dashboard into “Value”. Now we can access these values in the action with the syntax ${{ secrets.SECRET_NAME }}.
If you do some tests with these, you'll see that GH always masks the values in logs, showing them as *** instead of actually printing them. Don't worry, the actual values are used when needed.
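For anyone curious what Basic Auth amounts to on the wire, here is a small Python sketch of the header built from those credentials: the API Key acts as the username and the API Secret as the password. The function name and the placeholder values are made up for illustration; in the workflow itself the HTTP request action handles this step.

```python
import base64


def basic_auth_header(api_key: str, api_secret: str) -> dict:
    """Build an HTTP Basic Auth header from a key/secret pair.

    Basic Auth is "username:password", base64-encoded, prefixed with "Basic ".
    """
    raw = f"{api_key}:{api_secret}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials only; never hard-code real ones.
header = basic_auth_header("my-api-key", "my-api-secret")
print(header["Authorization"].startswith("Basic "))  # prints True
```

Base64 is an encoding, not encryption, which is why the values belong in secrets rather than in the workflow file.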
The base action
This is pretty typical for most GH actions, but it's included here for posterity. Future additions to this code will be added as new steps.
# Runs every six hours, ie every day @ midnight, 6am, midday, and 6pm
# Each run deletes a maximum of 100 images
name: Cloudinary Cleanup
on:
  schedule:
    - cron: "0 0,6,12,18 * * *"
  # Enables manual run from the Actions tab in GH
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
The only notable things here:
- This runs every day at midnight, 6am, midday, and 6pm (in UTC, which is what GitHub uses for schedules).
- The job can be run manually from the GH UI. This is great for debugging, as you can run it from a specific branch.
Retrieving image public IDs
With our secrets set up, it's time to query the API for the first time. We are using the Cloudinary Search API, which lets us perform complex queries with its expression parameter.
So, we add the first step:
- name: Retrieve files in cloudinary
  id: get-request
  uses: fjogeleit/http-request-action@master
  with:
    url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/search
    data: '{ "expression": "folder=Uploads AND uploaded_at<=6h", "max_results": 100, "sort_by": [{"uploaded_at": "asc"}] }'
    method: 'POST'
    username: ${{ secrets.CLOUDINARY_API_KEY }}
    password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}
Notably:
- id is a required field to access the response to this request, as per the Request action docs.
- The url is standard. The username and password are how we configure the basic auth credentials for this.
- The exciting part of this is data.
  - The expression argument is a query for all uploads within the Uploads folder that were uploaded 6 hours or more ago.
  - max_results could theoretically be increased to 500; however, this will then be unusable with the Admin API later, so we reduce the limit.
  - Setting the sort_by argument ensures we always delete the oldest images.
The search API is incredibly powerful, so examine the documentation for it and customise your expression to match your needs.
Remember, here we are trying to select images that will be deleted.
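If you want to experiment with different expressions before committing them to the workflow, a small Python sketch like the one below can generate the same JSON body. The function name and its parameters are hypothetical; only the three body keys (expression, max_results, sort_by) come from the request above.

```python
import json


def build_search_body(folder: str, min_age: str, max_results: int = 100) -> str:
    """Build the JSON body for the search request used in the workflow step.

    min_age uses the search API's relative syntax, e.g. "6h" or "1d".
    max_results is capped at 100 here so the result always fits in a
    single Admin API delete request.
    """
    if not 1 <= max_results <= 100:
        raise ValueError("max_results must be between 1 and 100")
    body = {
        "expression": f"folder={folder} AND uploaded_at<={min_age}",
        "max_results": max_results,
        # Oldest first, so each run removes the oldest uploads
        "sort_by": [{"uploaded_at": "asc"}],
    }
    return json.dumps(body)


print(build_search_body("Uploads", "6h"))
```

The printed string can be pasted into the data field of the step, wrapped in single quotes as in the example above.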
Cloudinary documents a sample response. It provides a huge amount of information for us, but all we need is the public ID for each image. We need to construct an array of ["public_id_1", "public_id_2"].
Format IDs
Therefore, we strip the excess data from the response using jq.
Through lots of trial and even more error, a jq expression like the following was built:
- name: Filter output with jq
  id: jq-filter
  run: echo ::set-output name=RESOURCES::$(echo ${{ toJSON(steps.get-request.outputs.response) }} | jq '.resources | [.[].public_id]')
- Once again, we need to specify an ID so we can access the result of this step.
- The echo ::set-output name=RESOURCES element of the run step is a way of storing the response of this query to use in future actions. It's a bit of a weird process, but it has some useful guidance.
- ${{ toJSON(steps.get-request.outputs.response) }} is used to retrieve the output of the previous step.
- | jq '.resources | [.[].public_id]') is where we pipe the result into a jq expression to tidy it. AFAIK the expression evaluates to:
  - .resources steps us into the resources key of the response, ie where image metadata for all returned images is kept.
  - .[].public_id iterates over every entry in that list and emits only its “public_id” value, dropping all the other keys.
  - Wrapping the previous step in [] collects those values into an array.
🤷🏻♂️ if you know better, get in touch.
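For readers more at home in Python than jq, the filter is roughly equivalent to the list comprehension below. This is only an illustrative sketch: the function name is hypothetical and the sample response is a made-up, heavily trimmed stand-in for what the search API returns.

```python
import json

# Made-up, trimmed example of a search response
SAMPLE_RESPONSE = """
{
  "total_count": 2,
  "resources": [
    {"public_id": "Uploads/card-art-1", "format": "png"},
    {"public_id": "Uploads/card-art-2", "format": "jpg"}
  ]
}
"""


def extract_public_ids(response_text: str) -> list:
    """Mirror of the jq filter '.resources | [.[].public_id]'."""
    response = json.loads(response_text)
    # .resources -> step into the list of image records
    # .[].public_id -> take the public_id of each record
    # [ ... ] -> collect the results into a single list
    return [resource["public_id"] for resource in response["resources"]]


print(extract_public_ids(SAMPLE_RESPONSE))
# prints ['Uploads/card-art-1', 'Uploads/card-art-2']
```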
Delete images associated with these IDs
With this argument constructed, we query the Cloudinary Admin API to bulk delete records.
- name: Delete selected files
  uses: fjogeleit/http-request-action@master
  with:
    url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/image/upload
    data: "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"
    contentType: 'application/x-www-form-urlencoded'
    method: 'DELETE'
    username: ${{ secrets.CLOUDINARY_API_KEY }}
    password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}
Most of this is pretty typical and constant, as with the previous request. Things get tricky when we consider the data attribute.
We need to provide an array of data. This is a messy & imperfect solution, and suggestions for improving it are absolutely welcome.
One way to format an array argument is key[]=value1&key[]=value2, as you would in the query string of a URL. A form-urlencoded request body uses the same format, and that's the string we're building here. Building outwards...
- fromJSON(steps.jq-filter.outputs.RESOURCES) retrieves the formatted list of public IDs.
- join(..., '&public_ids[]=') creates any internal copies of public_ids[]= that we need. For example, just this would turn [1, 2] into 1&public_ids[]=2.
- We precede this with another instance of public_ids[]= to write the key before the first value.
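The same string-building trick can be sketched in a few lines of Python, which may make the join easier to follow. The function name is hypothetical, and like the workflow expression it does no URL-encoding of the IDs, so it assumes they contain no characters such as & or =.

```python
def build_delete_body(public_ids: list) -> str:
    """Mirror of: "public_ids[]=" + join(ids, '&public_ids[]=')."""
    separator = "&public_ids[]="
    # join() only writes the key *between* values, so the first key is
    # prepended by hand, exactly as in the workflow's data field.
    return "public_ids[]=" + separator.join(public_ids)


print(build_delete_body(["Uploads/a", "Uploads/b"]))
# prints public_ids[]=Uploads/a&public_ids[]=Uploads/b
```

With an empty list this yields just public_ids[]=, so the delete request is still sent even when the search found nothing.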
And... that's it.
Conclusion
The full action at time of writing is here:
# Lists all images in the cloudinary upload folder & deletes those more than 1 day old.
# When no files are found to be deleted, the delete request returns "not found"
# Runs every six hours, ie every day @ midnight, 6am, midday, and 6pm
# Each run deletes a maximum of 100 images
name: Cloudinary Cleanup
on:
  schedule:
    - cron: "0 0,6,12,18 * * *"
  # Enables manual run from the Actions tab in GH
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # The "expression" body argument is built using the Cloudinary search API
      # It translates to "find images in the Inscryption/Uploads folder that are
      # more than 1 day old"
      # https://cloudinary.com/documentation/search_api#expressions
      - name: Retrieve files in cloudinary
        id: get-request
        uses: fjogeleit/http-request-action@master
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/search
          data: '{ "expression": "folder=Inscryption/Uploads AND uploaded_at<=1d", "max_results": 100, "sort_by": [{"uploaded_at": "asc"}] }'
          method: 'POST'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}
      # Output of this response returns full metadata for each image. We don't need
      # most of it, so use command line tool JQ to build an array of public IDs.
      # This is stored as an output so we can use in later steps
      - name: Filter output with jq
        id: jq-filter
        run: echo ::set-output name=RESOURCES::$(echo ${{ toJSON(steps.get-request.outputs.response) }} | jq '.resources | [.[].public_id]')
      - name: LOG all IDs to be deleted
        run: echo ${{ steps.jq-filter.outputs.RESOURCES }}
      - name: LOG joined public IDs to be sent to DELETE request
        run: echo "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"
      # Send request with ids to delete as 'application/x-www-form-urlencoded' content.
      # IDs are joined with the `public_ids[]` key.
      - name: Delete selected files
        uses: fjogeleit/http-request-action@master
        id: delete-request
        with:
          url: https://api.cloudinary.com/v1_1/${{ secrets.CLOUDINARY_CLOUD_NAME }}/resources/image/upload
          data: "public_ids[]=${{ join(fromJSON(steps.jq-filter.outputs.RESOURCES), '&public_ids[]=') }}"
          contentType: 'application/x-www-form-urlencoded'
          method: 'DELETE'
          username: ${{ secrets.CLOUDINARY_API_KEY }}
          password: ${{ secrets.CLOUDINARY_API_SECRET_KEY }}
      - name: LOG response from DELETE request
        run: echo ${{ steps.delete-request.outputs.response }}
(Live version should be in the Inscryber repo)
A few comments and debug steps have been added just to make life easier, but fundamentally it is the three steps discussed in this post.
Congrats on making it this far!