Kelvin Tay

enjoys technical writing, and a cheeky drink 🥃

This post chronicles my investigation into a Just-In-Time (JIT) error seen on a customer's Postgres server.

Given the complexity of the setup (Kubernetes), I thought it would be fun to write up, and hopefully educational for anyone hitting similar errors.

Quick context

We provide our self-hosted solution as a Kubernetes (k8s) application. By default, the Postgres database server is implemented as a Stateful Set.

I have used Postgres for more than 5 years, but had never dug into Postgres's JIT feature.

Hello, Unexpected Error

Our customer noticed some data were not being refreshed. We found that the k8s CronJob responsible for refreshing the data had been failing for a while.

Specifically, the Pod failed with the following error:

ERROR:  could not load library "/opt/bitnami/postgresql/lib/llvmjit.so": libLLVM-11.so.1: cannot open shared object file: No such file or directory

As a first troubleshooting step, we tried to confirm if the /opt/bitnami/postgresql/lib/llvmjit.so file was missing:

$ kubectl -n <ns> exec -it po/<Postgres pod> -- ls -lah /opt/bitnami/postgresql/lib/llvmjit.so

The file was there, so the next step was to Google what the error could mean.

Examples from the Interweb suggest the error is likely related to libLLVM-11.so.1 (example post).
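One way to confirm this kind of diagnosis is to ask the dynamic linker which dependencies of llvmjit.so it cannot resolve. Below is a minimal sketch, assuming the ldd tool is available inside the Postgres container; the helper names (missing_libraries, check_library) are hypothetical and not part of the original investigation.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved dependencies appear as: "<name> => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_library(path):
    """Run ldd on the given shared object and list its unresolved dependencies.

    Assumes ldd exists on the machine (or container) where this runs.
    """
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling check_library("/opt/bitnami/postgresql/lib/llvmjit.so") inside the affected container would then list any libraries, such as libLLVM-11.so.1, that the image does not ship.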

Make it fail

As per Rule 2 of Debugging Rules, I elected to reproduce the error next.

At this point, I must share that no other customers had reported this error. Neither I nor others on the Support team had hit it on our own setups running the same application version.

So we first need to understand what Postgres JIT is, and then try to reproduce the error.

It turns out Postgres >= v12 has JIT enabled by default.

Depending on the SQL query, Postgres may trigger JIT if it estimates that JIT compilation will speed up the query. You can read up on the decision details here.
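To make that decision concrete, here is a rough sketch of the cost-based check. It assumes the documented default settings (jit_above_cost = 100000, jit_inline_above_cost = 500000, jit_optimize_above_cost = 500000). The function names are mine, and this is a simplification rather than the planner's actual code.

```python
# Documented PostgreSQL defaults; a negative value disables that threshold.
JIT_ABOVE_COST = 100_000
JIT_INLINE_ABOVE_COST = 500_000
JIT_OPTIMIZE_ABOVE_COST = 500_000


def exceeds(cost, threshold):
    """True when the threshold is enabled and the plan cost is above it."""
    return threshold >= 0 and cost > threshold


def jit_decision(total_plan_cost, jit_enabled=True):
    """Return which JIT steps would be requested for an estimated plan cost."""
    use_jit = jit_enabled and exceeds(total_plan_cost, JIT_ABOVE_COST)
    return {
        "jit": use_jit,
        # Inlining and optimization only matter if JIT is used at all.
        "inlining": use_jit and exceeds(total_plan_cost, JIT_INLINE_ABOVE_COST),
        "optimization": use_jit and exceeds(total_plan_cost, JIT_OPTIMIZE_ABOVE_COST),
    }
```

For instance, the cross-join plan shown later in this post has an estimated total cost of 1107700.44, which is above all three defaults. That is consistent with the "Inlining true, Optimization true" line in its EXPLAIN ANALYZE output.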

This makes reproduction an interesting challenge: we need to run a query that deterministically forces Postgres to trigger JIT compilation.

Given that our k8s application sets up the Postgres server with defaults from Postgres v12, I was able to reduce the setup to testing with the underlying custom Postgres Docker image.

We can reproduce the same error by running the SELECT pg_jit_available(); query.

Implementing a workaround

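I noted that we can disable JIT as a workaround. You can choose to disable JIT at the system level, or for a specific database. Note that ALTER SYSTEM only takes effect once the configuration is reloaded (e.g., via SELECT pg_reload_conf();), while ALTER DATABASE applies to new sessions on that database.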

ALTER SYSTEM SET jit = off;
-- or
ALTER DATABASE [db] SET jit = off;

To confirm this workaround works, we need to:

  1. Reproduce the error with an SQL query that triggers JIT deterministically.
  2. Apply the workaround, and show that the same SQL query passes without the error.

I was able to come up with a computationally expensive SQL query for which Postgres triggered JIT compilation:

-- context: characters table has < 8000 rows
-- see https://github.com/kelvintaywl-cci/test-postgresql-sidecar/blob/5f9658ed646a17abd5abad1133a83ef270685f92/schema.sql#L7

SELECT AVG(c1.strength * c2.strength)
FROM characters AS c1 CROSS JOIN characters AS c2;

Running the query with EXPLAIN ANALYZE also reveals whether Postgres triggered JIT.
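If you want to check this from a script rather than by eye, a small helper can look for the JIT section in the plan text. This is only a sketch: the function name is hypothetical, and it assumes the default text format of EXPLAIN ANALYZE, where a used JIT is reported under a line reading "JIT:".

```python
def plan_used_jit(plan_text):
    """Return True if text-format EXPLAIN ANALYZE output has a JIT section."""
    return any(line.strip() == "JIT:" for line in plan_text.splitlines())
```

Feeding it the psql output captured before and after the workaround should return True and False respectively.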

Below is the output when using docker.io/library/postgres:12.6 (the official Postgres 12.6 image). We can see JIT was utilized.

                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1107700.43..1107700.44 rows=1 width=32) (actual time=9992.650..9992.651 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..791290.30 rows=63282025 width=8) (actual time=91.183..7270.120 rows=62710561 loops=1)
         ->  Seq Scan on characters c1  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.008..0.891 rows=7919 loops=1)
         ->  Materialize  (cost=0.00..162.32 rows=7955 width=4) (actual time=0.012..0.339 rows=7919 loops=7919)
               ->  Seq Scan on characters c2  (cost=0.00..122.55 rows=7955 width=4) (actual time=91.163..91.833 rows=7919 loops=1)
 Planning Time: 0.191 ms
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.580 ms, Inlining 42.983 ms, Optimization 30.411 ms, Emission 17.599 ms, Total 91.572 ms
 Execution Time: 10040.688 ms
(11 rows)

When using our custom Postgres image, we can reproduce the error.

ERROR:  could not load library "/opt/bitnami/postgresql/lib/llvmjit.so": libLLVM-11.so.1: cannot open shared object file: No such file or directory

We then apply the workaround, and rerun the SQL query.

ALTER DATABASE
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1107700.43..1107700.44 rows=1 width=32) (actual time=10792.502..10792.503 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..791290.30 rows=63282025 width=8) (actual time=0.014..7371.175 rows=62710561 loops=1)
         ->  Seq Scan on characters c1  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.005..1.247 rows=7919 loops=1)
         ->  Materialize  (cost=0.00..162.32 rows=7955 width=4) (actual time=0.000..0.305 rows=7919 loops=7919)
               ->  Seq Scan on characters c2  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.003..0.738 rows=7919 loops=1)
 Planning Time: 0.193 ms
 Execution Time: 10792.636 ms
(7 rows)

The SQL query returned successfully without using JIT, as expected from the workaround.

Our custom Postgres image was never meant to support LLVM and JIT. Hence, the workaround here has the same effect as the eventual fix (patch here).

buy Kelvin a cup of coffee

It’s New Year’s Eve morning here in Tokyo, Japan. I’m typing this while cradling the baby at a Starbucks. Staring at the morning rain, I thought it was a good time to reflect on 2023 while welcoming 2024.

Gratitude

  1. Flexibility in a remote-first company. Laundry, accompanying the wife for her doctor appointments, etc. Thankful for the team’s understanding.

  2. 3-month parental leave. It takes a village to raise a child. I think it would be too difficult for just one parent to navigate the first 3 months alone.

  3. Promotion. The new role encourages me to continue/start mentoring, glue work, technical leadership.

Failings

  1. Acted on things when I was not the best candidate to. I need to be more aware of saying “no” to others and myself. Sometimes, you can become less helpful with your answers.

  2. Did not listen to the body. Our bodies will likely signal to us when we are overworked and stressed. Thankfully, it is better now, as I juggle priorities (family), but I wish I had been better at this from the start.

Hey 2024

  1. Greater technical depth in the devops space. Metrics, k8s, networking.

  2. Champion others; would love to help others step up / get promoted.

If you stumble upon this post, I hope you and your family enjoy a good 2024 ahead!

Best wishes!


After 8+ years as a software engineer, I made a career switch in 2021. I joined CircleCI as a support engineer.

As I enter my 3rd year with CircleCI, I thought it may be good to reflect on this career change so far.

I noticed there is plenty of literature for support engineers looking to switch to software engineering / development. However, there aren't many opinions on the other direction.

I hope this can be helpful for folks deliberating on similar career changes.

Why I switched

I will be honest here. One of the major reasons I switched was the salary.

I was drawing a comfortable salary at my previous job. However, my wife and I were hoping to start a family and considered purchasing a home. I was able to negotiate for a better salary in this switch. (I know this is likely a rare case in many countries. For context, I am based in Tokyo, Japan.)

I also wanted to try something (slightly) different.

Covid-19 was much less of a pandemic by 2021, but it prompted me to take another look at life and its (un)certainties. One of the interviewers asked me “where do you see yourself in 3 years?”. I took the opposite view and considered what had happened in the previous 2 years, from 2019 to 2021. So much had changed across societies as we battled Covid-19. For me, the unexpected Covid-19 did spur me to try something different. Life is too short to be doing the same thing for a lifetime.

Critically, I also enjoyed using CircleCI as a customer. Don't get me wrong; my past adventures were with companies I did believe in. However, being a heavy and happy customer of CircleCI, I found it empowering to be in a position to share my enthusiasm with the customers.

How has the journey been

Fortunately, CircleCI is a product for software teams. This meant that, as a support engineer, I am tackling problems with another software engineer on the other side. I can empathize with the customer.

I have found it really rewarding to be able to draw on my past experience, and share best practices and advice with the customers.

Being part of the JAPAC (Japan + Asia Pacific) team, I also get opportunities to brush up my Japanese. To be honest, my wife and DeepL helped me a lot. I have never aspired to reach near-native level in Japanese, but this has been a positive bonus.

One of the major perks is also learning new technologies. I get to learn from both our engineers and our customers (e.g., their tech stack). Prior to this job, I had never dug deep into Packer, Nomad, Git LFS, or even Windows PowerShell, to name a few. (You can find my public repos for testing here.)

Would I recommend it?

In my case, I've enjoyed the transition so far. Thankfully, I still get to utilize much of my software engineering knowledge to troubleshoot with customers.

However, like any choice made, technical support as a career comes with challenges too.

I think being able to empathize with customers is critical. Beyond the technical know-how, you'd need to know how to handle delicate communications at times (e.g., system-wide incident).

If you are a software engineer reading this, I would encourage you to consider the switch if:

(Traits)

  1. You enjoy communicating with people as much as digging deep into code / configurations.
  2. You enjoy debugging, or chasing down root causes as much as implementing new features.

(Strengths)

  1. You can empathize with users, and can advocate on their behalf. I think folks contributing to open-source projects may be good candidates.
  2. You are patient and can negotiate tough communications.

(Motivations)

  1. You would like to be closer to customers and the product.
  2. You enjoy opportunities to share best practices with customers.
  3. You believe in and enjoy the product.

These are just my opinions based on my experience so far. Your mileage may vary.

However, remember that switching does not mean you cannot go back. You can and should re-evaluate this decision along the way.


Here are some random facts about me, in case this ever becomes useful.

  1. I am blessed with an identical twin brother. He's the better one!
  2. I write with my right hand, but am left-footed football-wise.
  3. I became a father to a baby boy late 2023.


A good friend educated me on how he set up a web application to inform the parents minute-by-minute re: the arrival of his baby. 👶

I thought that was a brilliant idea! This way, new parents can push updates, and avoid the thundering of “has the baby arrived yet?” questions. 💡

So, I set about trying to build one, with the following criteria:

  • I don't want to spend time and money maintaining any infrastructure. 💰
  • I would like to be able to limit the site's visibility to folks I share it with. ㊙️
  • I don't need updates to be “real-time”; some delay or latency is fine. 📠
  • The updates should also be in Chinese and Japanese. 🤖
  • Building it should be fun! 🎮

“Why don't you just post it on Twitter/X?”

I also wanted to avoid social media for such announcements, if possible.

How did it go?

I ended up with a user experience where:

  1. Folks visit the site hosted on https://write.as with password protection.
  2. I can post updates via email + image attachments.

In terms of the “tech stack”, this is pretty low-code. I've set it up via:

pipedream workflow

You can use DeepL to translate for free (within limits). I am also using a free account on Pipedream, so I'm really thankful for such services.

I spent 2 weekends experimenting before tying this up. However, I really enjoyed building this simple workflow.

sample post on blog


I am not a prolific contributor to open-source work in any sense. It has also been a while since I received any issues or pull requests on my work.

I recently noticed an issue filed by a user on the (unofficial) CircleCI Terraform provider.

The issue raised was something I was aware of but chose not to fix. I was not sure if the impacted feature would be used enough for the bug to be noticed.

That someone reported it signalled to me that users depend on this project and feature. (Thank you, @nagendrasanthosh)

I took a Saturday morning to release the fix, and it was a fun occasion.

I'd like to keep to this policy.

I am not going to jump into fixing any project if I think it may take more than 1 weekend to resolve. It may sound selfish, but while the project is Free, my time is not; family will come first. 👪

What is your policy or strategy on protecting your time on OSS work?


My wife and I are expecting our first child this September.

Having been convinced of the magic of a topponcino during a parenting class, we decided to make one instead of buying it.

“Oh, it's almost $100 to buy one? Why don't we make one ourselves?”

We thought this (1) would be more fun, (2) would allow us to customize as we like, and (3) would give us a greater sense of ownership!

I am hoping I become good at this. No, not parenting, I meant sewing (I kid).

Fabrication

I was introduced to Yuzawaya by the wife on a Sunday afternoon. Here in Japan, Yuzawaya is the hobby shop, where you can find reasonably priced fabric, and then of course, Marimekko fabric as well.

Base Cover
topponcino base topponcino cover

It felt empowering knowing you get to choose what design and material goes into the product. We also had fun people-watching – guessing what others may be creating for their weekend project.

If you are looking to make other items like bibs, Yuzawaya also offers blueprints (they call them recipes). Unfortunately, there were no blueprints for topponcinos.

Sizing things up

Thankfully, the Interweb is an amazing place. I was able to find a blueprint from this organic cotton shop: https://www.fuwarico.com/topponcino/

Still, this presents a bit of a challenge, since we would have to print it on A1-size paper.

The convenience stores here allow prints up to A3 size. Photo-printing services like Kinko's also offer easy A2 prints.

However, I think I would need to fork out $40+ to print in A1. Most services assume you would be printing colour posters at that scale.

Being an engineer, and mainly because I'm skint, I decided to:

  1. crop the blueprint to only the intricate portions (within A3 sizes)
  2. draw the rest (mainly regular lines) myself onto the fabric

top of topponcino

Why pay that $40, if most of the to-be-printed portions can be drawn by myself? I may be imagining things, but I assume my Asian grandparents would be proud.

thinking

(Of course, some may argue I could have simply drawn all parts by myself.)

Here's hoping things go well! We'll share the finished product soon, hopefully.

#parenting #sewing


I have released support for the CircleCI Runner resource-class and token in my unofficial Terraform provider for CircleCI, as of v0.10.3.

Developers can now manage the provisioning (and teardown) of CircleCI self-hosted Runners within Terraform. You can explore an example here.

This was a fun challenge, and I wanted to document my journey on this work.

Investigation

Unlike other resources, self-hosted runners are not manageable under the official V2 API. Developers had to use the CircleCI CLI to manage resource-classes and tokens instead.

To port this into my Terraform provider, I was hoping there was an HTTP API available. This way, I could continue using my approach of abstracting the HTTP API away to a Go SDK.

I assumed (wrongly) the CLI was using GraphQL under the hood for Runner operations, as with many others (e.g., for Orbs).

Digging into the source code, I then realized Runner resource-classes and tokens can be managed via an HTTP API; it was simply not publicly documented yet.

Glue

After the legwork mentioned above, I tested the HTTP APIs with an OpenAPI (Swagger 2.0) document. This enabled me to generate a Go SDK for Runner APIs.

Why are the Go SDKs separated? Wouldn't it be easier to keep it all in one?

That was something I mulled over for some time indeed. I have documented my reasons for keeping them separate.

Assembly

With the Go SDK published, I “simply” had to expose the Runner resource-classes and tokens in the Terraform provider codebase.

The main work was done within a pull request here. This also included acceptance tests, and examples.

Ultimately, I noted self-hosted (machine) runners are also available for CircleCI's Server customers (i.e., self-hosted CircleCI). We would want to extend and ensure this addition can be used by platform teams using CircleCI Server.

It turned out that the Runner API:

Thankfully, this was a quick patch. I was also able to verify this fix against my own CircleCI Server instance.

Things I learnt

This feature was satisfying for me to build, and I had many learning points along the journey.

  1. Read the code: This feature would not have been completed if I did not dig deeper into the publicly-available source code 📖

  2. Keep trying: I am still learning (and failing) at Go. However, I think it is important to keep trying and learning. Keeping this source code open-sourced forces me to keep myself honest too about my lack of knowledge. For fellow engineers out there, let's keep at it! 🤓

#terraform #circleci #runner #go


knock them dead

Is it really though?

I apologize for the misleading title. I've done that intentionally, much like how DHH and Fireship may have titled their responses to the Amazon Prime Video team's recent engineering blog post (Mar 2023).

If you skimmed through these pieces, you may conclude hastily that going Serverless is going to hurt your team and product.

Having worked with serverless technologies in a past role, I have seen its pros and cons. This is much like any other approach, whether monoliths or micro-services.

Let's take a walk.

Would AWS really tell us that “We are totally wrong about Serverless all along” if they still have much at stake to sell you serverless technologies?

In addition, the Amazon Prime Video team had to re-architect just one of their services (the monitoring service, specifically).

Over time, business and technical challenges can evolve. Logically, as software engineers, we should also reevaluate our solutions over time (e.g., architecture) to verify if they still meet the requirements well. I think it is important not to be married to one idea or paradigm as an engineer.

Also, Amazon Prime Video is a huge product, in terms of their user traffic. The scaling challenge with their monitoring service will match that.

Unless your product is similar in scale, I'd suggest calming down before hurrying to re-architect anything. Watch something on Amazon Prime, perhaps :)

If anything, I do hope this experience from the Amazon Prime Video engineering team helped provide some useful feedback to AWS internally.


I have no underlying data to back this up, but the consensus seems to be that a technical support career is not that coveted.

Most aspiring software engineers would prefer to jump into software development rather than start off in technical support. I was also rejected by a coding bootcamp when I offered to give a guest talk about a career in technical support. (It's a shame.)

I can understand that.

I am considering my next steps in becoming a staff support engineer. I would like to remain on the technical path if possible.

I have tried looking up resources for career advancement in customer-facing engineering roles. However, this has been challenging indeed.

If you have any recommendations for resources similar to https://staffeng.com/book/ but for technical support engineers, please let me know!

#individualcontributor #career #engineering
