Hunting down a Postgres JIT error

This post chronicles my investigation on a Just-In-Time (JIT) error seen on a customer's Postgres server.

Given the complexity of the setup (Kubernetes), I thought it would be fun, and hopefully educational for anyone hitting similar errors.

Quick context

We provide our self-hosted solution as a Kubernetes (k8s) application. By default, the Postgres database server is implemented as a Stateful Set.

I had been using Postgres for > 5 years, but have never dug into Postgres's JIT feature.

Hello, Unexpected Error

Our customer noticed some data were not refreshed. We noticed the k8s CronJob responsible for refreshing data had been failing for a while.

Specifically, the Pod failed with the following error:

ERROR:  could not load library "/opt/bitnami/postgresql/lib/llvmjit.so": libLLVM-11.so.1: cannot open shared object file: No such file or directory

As a first troubleshooting step, we tried to confirm if the /opt/bitnami/postgresql/lib/llvmjit.so file was missing:

$ kubectl -n <ns> exec -it po/<Postgres pod> -- ls -lah /opt/bitnami/postgresql/lib/llvmjit.so

The file was there, so the next step was to Google what the error could mean.

Examples from the Interweb suggest the error is likely related to libLLVM-11.so.1 (example post).

Make it fail

As per Rule 2 of Debugging Rules, I elected to reproduce the error then.

At this point, I must share that no other customers reported this error. I and others in the Support team also did not experience the same error on our setup on the same application version.

We need to first understand what Postgres JIT is then, and try to reproduce the error.

It turns out Postgres >= v12 has JIT enabled by default.

Depending on the SQL query, Postgres may trigger JIT if it concludes the JIT compilation will speed up the query. You can read up the decision details here.

This makes reproduction an interesting challenge; We need to run a query that deterministically forces Postgres to trigger JIT compilation then.

Given that our k8s application sets up the Postgres server with defaults from Postgres v12, I was able to reduce the setup to testing with the underlying custom Postgres Docker image.

We can reproduce the same error when we run SELECT pg_jit_available(); query.

Implementing a workaround

I noted that we can disable JIT as a workaround. You can choose to disable JIT at the system level, or for a specific database.

ALTER SYSTEM SET jit = off;
-- or
ALTER DATABASE [db] SET jit = off;

To confirm this workaround works, we need to:

  1. Reproduce the error with an SQL query that triggers JIT deterministically.
  2. Apply the workaround, and show that the same SQL query passed without the error.

I was able to generate a computationally-expensive SQL query that Postgres triggered JIT compilation:

-- context: characters table has < 8000 rows
-- see https://github.com/kelvintaywl-cci/test-postgresql-sidecar/blob/5f9658ed646a17abd5abad1133a83ef270685f92/schema.sql#L7

SELECT AVG(c1.strength * c2.strength)
FROM characters AS c1 CROSS JOIN characters AS c2;

Using EXPLAIN ANALYZE would also reveal if Postgres did trigger JIT.

Below is the output when using docker.io/library/postgres:12.6 (the official Postgres 12.6 image). We can see JIT was utilized.

                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1107700.43..1107700.44 rows=1 width=32) (actual time=9992.650..9992.651 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..791290.30 rows=63282025 width=8) (actual time=91.183..7270.120 rows=62710561 loops=1)
         ->  Seq Scan on characters c1  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.008..0.891 rows=7919 loops=1)
         ->  Materialize  (cost=0.00..162.32 rows=7955 width=4) (actual time=0.012..0.339 rows=7919 loops=7919)
               ->  Seq Scan on characters c2  (cost=0.00..122.55 rows=7955 width=4) (actual time=91.163..91.833 rows=7919 loops=1)
 Planning Time: 0.191 ms
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.580 ms, Inlining 42.983 ms, Optimization 30.411 ms, Emission 17.599 ms, Total 91.572 ms
 Execution Time: 10040.688 ms
(11 rows)

When using our custom Postgres image, we can reproduce the error.

ERROR:  could not load library "/opt/bitnami/postgresql/lib/llvmjit.so": libLLVM-11.so.1: cannot open shared object file: No such file or directory

We then apply the workaround, and rerun the SQL query.

ALTER DATABASE
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1107700.43..1107700.44 rows=1 width=32) (actual time=10792.502..10792.503 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..791290.30 rows=63282025 width=8) (actual time=0.014..7371.175 rows=62710561 loops=1)
         ->  Seq Scan on characters c1  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.005..1.247 rows=7919 loops=1)
         ->  Materialize  (cost=0.00..162.32 rows=7955 width=4) (actual time=0.000..0.305 rows=7919 loops=7919)
               ->  Seq Scan on characters c2  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.003..0.738 rows=7919 loops=1)
 Planning Time: 0.193 ms
 Execution Time: 10792.636 ms
(7 rows)

The SQL query returned successfully, and without using JIT as expected from the workaround.

Our custom Postgres image was meant to not support LLVM and JIT. Hence, the workaround here has the same effect as the eventual fix (patch here).

buy Kelvin a cup of coffee