Capturing webhooks whilst your app is undergoing maintenance

""

Recently, we needed to upgrade our Postgres database. This required a period of downtime during which we had to prevent writes to the database. Easy enough; maintenance windows exist for a reason. The problem? Our app receives hundreds of webhooks from dozens of sources every minute, each containing important data that we don’t want to lose.

This left us with several challenges to solve – more on those in a moment.

We ended up using a “soft” maintenance mode: instead of blocking incoming requests, we continued to receive them but processed them differently – serving a generic “We are undergoing maintenance” page & stopping database write operations. Crucially, this let us re-send webhook requests in their entirety to a separate server running code designed to receive and save them. Here is the before & after of our architecture:

""

Before: Whilst out of maintenance mode, webhooks are received & processed on-the-fly

""

After: Webhooks are received. Instead of being processed, they are forwarded to a separate server where they are stored until the maintenance is finished. They are then sent back to the production server for processing.

So how did we get here…?

Designing a solution

As mentioned, our application receives a lot of webhooks from a lot of sources, many of which are outside of our control. If we were to update the webhook URL configured at each of those sources, it would be impossible to do so atomically; it would be a manual & labour-intensive process, prone to human error & difficult to roll back quickly. It was better to preserve the URL that receives webhooks than to risk firing webhooks into the void and losing customer data.

So the problems to solve…

  1. Where do we store our webhooks to replay them later?
  2. How do we redirect webhook traffic (and webhook traffic only) to this store?

Let’s tackle them one at a time.

Choosing a home for our webhooks

At first we investigated off-the-shelf solutions to solve this for us – and we found one, Hookdeck. It gives you a new URL to point webhooks at; it saves them & lets you replay them or forward them to a different server. It would sit in our system like so:

""

Webhooks are fired to Hookdeck instead of our production server. Once the server exits maintenance, they are replayed from Hookdeck back to production

However, using this would require us to list it as a data processor under GDPR – a 45-day legal process – which made it untenable for our deadline. This also excluded other off-the-shelf solutions, so we were left needing to build something ourselves.

Bootstrapping a whole new application with a datastore felt like overkill; we just needed a temporary, production-like environment to which we could deploy some code and store data during the maintenance window.

We also had a staging environment sat right there, running a recent version of our full Ruby on Rails application, ready to be taken over…

We modified the code on our staging server & introduced a new model called QueuedWebhook, and... well, it didn’t do much; its entire schema was:

CREATE TABLE queued_webhooks (
    id SERIAL PRIMARY KEY, 
    -- Used to rebuild the request
    body JSON, 
    headers JSON, 
    params JSONB, 
    path VARCHAR, 

    -- Debugging utilities
    processed_at TIMESTAMP, 
    retry_count INTEGER DEFAULT 0, 
    error_message VARCHAR
);

We updated the controllers that received webhooks accordingly – instead of processing webhooks automatically, we extracted the information we needed & saved them to the DB.
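
As an illustration, the change in one of those staging controllers might have looked roughly like this – the controller name, the ALLOWED_HEADERS list & the CSRF skip are assumptions for the sketch, not our exact code:

class GithubWebhooksController < ApplicationController
  # Only the headers we'd need to rebuild the request later (illustrative list)
  ALLOWED_HEADERS = %w[Content-Type X-GitHub-Event X-Hub-Signature-256].freeze

  # Webhook senders can't supply CSRF tokens
  skip_before_action :verify_authenticity_token

  def create
    QueuedWebhook.create!(
      path: request.path,
      body: request.raw_post, # the raw body, so we can replay it verbatim
      headers: ALLOWED_HEADERS.index_with { |name| request.headers[name] }.compact,
      params: params.to_unsafe_h
    )

    head :ok # acknowledge quickly so the sender doesn't retry
  end
end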

""

Webhooks are fired to our staging server instead, which saves them. Once the production server exits maintenance they are replayed from staging back into production.

That put puzzle piece #1 – “where do we store the webhooks?” – in place. We deployed it to our staging environment, and then had to actually get our webhooks there.

How do we redirect webhook traffic?

💡 Did you know? If you redirect traffic with a 301 (permanent redirect) or 302 (temporary redirect), most clients will convert the request method to GET. Their equivalents, 307 & 308, preserve the HTTP verb.

Our application sits behind Cloudflare, which lets us manage traffic in interesting ways; for example, they have a concept called “Redirect Rules”, which can seamlessly intercept traffic heading to server A and redirect it to server B. We implemented one of these, and our architecture was updated accordingly:

""

Webhook traffic is intercepted by Cloudflare & redirected to our staging server, allowing the webhook processing architecture to operate

However, these only support 301 or 302 status codes and, as we found out a little too late, this butchered our HTTP verbs.

💡 Did you also know? Some webhook senders – GitHub included – will not follow redirects at all.

We then added a full Cloudflare Worker to redirect traffic with a 307, only to find that some webhook senders treated this as an error too. Since we could not rely on senders to consistently follow redirects, we adjusted our approach: instead of redirecting requests, we needed to receive them ourselves and then resend them exactly as they were.

… Sounds familiar, no?

We reused pretty much everything we’d written for staging, with one difference – instead of creating QueuedWebhooks, we just instantiated them and fired them straight away; in Ruby this was simply the difference between .create and .new.forward_webhook. This version of the code lived on production, so the webhook lifecycle became:

  1. A webhook is fired at our production server as usual.
  2. Production instantiates a QueuedWebhook (without saving it) and immediately forwards the request, unchanged, to our staging server.
  3. Staging stores the webhook as a QueuedWebhook record.
  4. Once the maintenance is over, staging replays the stored webhooks back to production, where they are processed as normal.

The above process represented visually
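
For illustration, a forward_webhook method along those lines might look roughly like this – the WEBHOOK_FORWARD_HOST env var & the use of Net::HTTP are assumptions rather than our exact implementation (our webhooks are all POST requests):

require "net/http"

class QueuedWebhook < ApplicationRecord
  # Re-send the captured request, unchanged, to the host configured in the
  # (assumed) WEBHOOK_FORWARD_HOST env var – e.g. our staging server.
  def forward_webhook
    uri = URI.join(ENV.fetch("WEBHOOK_FORWARD_HOST"), path)

    request = Net::HTTP::Post.new(uri)
    headers.each { |name, value| request[name] = value }
    request.body = body

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
  end
end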

A neat solution that helped us move fast and get the upgrade done!

What improvements could be made?

After the maintenance was complete, we discussed whether it would be useful to have this functionality always available in production. If for whatever reason the production database became unable to perform writes (e.g. due to hardware failure or overload, or more maintenance), we could instantly forward webhooks to an alternate server that could store and eventually send back the webhooks once production could perform writes again. This would be a more predictable solution than failing the webhooks and requiring the webhook senders to retry them using their own backoff logic. We envisioned a solution that would take advantage of Rails middleware to control what happens to webhooks, based on feature flags and environment variables:

# frozen_string_literal: true

module Middleware
  class QueueWebhooksMiddleware < Middleware::Base
    # Fill this in as needed to allow all request headers
    # needed for your webhooks
    VALID_HEADERS = %w[]

    def call(env)
      request = Rack::Request.new(env)

      if request.path =~ %r{^/webhooks/}
        http_headers = permit_headers(request, VALID_HEADERS)
        body = request.body.read
        request.body.rewind # so the rest of the app can still read the request body

        queued_webhook = QueuedWebhook.new(
          path: request.path,
          body: body,
          headers: http_headers
        )

        if Flipper.enabled?(:DANGER__receive_webhooks)
          queued_webhook.save!
        elsif Flipper.enabled?(:DANGER__forward_webhooks)
          queued_webhook.forward_webhook
        end
      end

      @app.call(env)
    end

    private

    def permit_headers(request, allowed_headers)
      # Filter the request's headers down to allowed_headers only
      # ...
    end
  end
end

This would combine the code running on both the production & staging servers, and allow us to manage the state of different servers using feature flags and environment variables, instead of needing to deploy changes.
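
Flipping a server into “receive” or “forward” mode would then just be a matter of toggling the feature flags used in the middleware above, for example:

# On the standby server: start storing incoming webhooks
Flipper.enable(:DANGER__receive_webhooks)

# On the production server: start forwarding incoming webhooks to the standby
Flipper.enable(:DANGER__forward_webhooks)

# And back to normal once the maintenance is over
Flipper.disable(:DANGER__forward_webhooks)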

The URL to forward webhooks to would be set in env vars, and the queued webhooks could be relayed back to the main server using an admin UI or just the Rails console.
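
A replay loop from the console could be as simple as something like this – a sketch, where processed_at, retry_count & error_message are the “debugging utilities” columns from the schema above:

# Re-send every webhook that hasn't been processed yet
QueuedWebhook.where(processed_at: nil).find_each do |webhook|
  begin
    webhook.forward_webhook
    webhook.update!(processed_at: Time.current)
  rescue StandardError => e
    # Record the failure so it can be retried or investigated later
    webhook.update!(retry_count: webhook.retry_count + 1, error_message: e.message)
  end
end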

Conclusion

I’m happy to report our database upgrade went well 🥰 But it also sparked a lot of conversation and ideas for a future system we could use should we need to put our app into maintenance again. If you have solved similar problems in different ways – or if you end up using an approach like ours – I’d love to hear about your experiences.