Kelvin Tay

My memorable fail

July 5, 2024

It was past 6pm on a Friday in 2018. I hit enter, and watched the script run the tests for the database migration. My team-mates were busy enjoying the Friday beers while leadership delivered the week's highlights.

The terminal output showed all tests passed. I smirked before glancing at the production database. Usernames like Erlich Bachman and Richard Hendricks stood out on my screen.

These fictional characters from Silicon Valley... how did they end up being our customers?

Panic.

DROP TABLE users

Against a backdrop of tipsy engineers, I sobered up to the realisation that my tests had just run against production.

All user records had been dropped and replaced with dummy ones, just like that.

“Hey guys, I think I might have just dropped the user database on production...”

Context on my mistake

Put simply, I had used the wrong environment file, and ended up pointing at production instead of our development environment.

My main project then was to migrate the Firebase database to Postgres (GCP CloudSQL) in phases. I had been context-switching between development, staging and production environments prior to the mistake.
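In hindsight, a small guard in the test entry point could have caught the wrong environment file before anything was dropped. Below is a minimal sketch of that idea, not what we actually had in place. The variable names APP_ENV and DATABASE_URL, and the "prod" substring check, are assumptions for illustration only.

```python
# Hypothetical setting names; they are not from the original project.
ALLOWED_TEST_ENVS = {"development", "test"}


def assert_safe_for_tests(env):
    """Raise RuntimeError unless the loaded settings look like a safe test target.

    `env` is any mapping of setting names to values, e.g. os.environ
    after the environment file has been loaded.
    """
    app_env = env.get("APP_ENV", "").lower()
    if app_env not in ALLOWED_TEST_ENVS:
        raise RuntimeError(f"refusing to run destructive tests: APP_ENV={app_env!r}")

    # Second line of defence: a crude check on the connection string itself.
    db_url = env.get("DATABASE_URL", "").lower()
    if "prod" in db_url:
        raise RuntimeError("refusing to run destructive tests: DATABASE_URL looks like production")

# Usage (at the very top of the test runner):
#   import os
#   assert_safe_for_tests(os.environ)
```

The check fails closed: if APP_ENV is missing entirely, the tests refuse to run rather than assuming a safe default.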

Murphy's Law strikes

“Thank goodness we have scheduled backups on the Firebase database. We can restore from there!” my colleague rushed to my rescue.

We restored the latest backup, and watched in horror as random usernames showed up again.

It turned out that we did have regular backups. However, these backups were mistakenly made against staging, not production.

Luck

Luck came in the form of a local backup of the production database I had on my machine. This was (1) not compliant, as I should not have had a local copy, and (2) a few weeks old, so some users were still missing.

It also happened to be a 3-day weekend. This allowed us more time to try to recover the data over the weekend.

We were able to recover about 80% of the users, and recreate user records for the “missing” customers (our account managers helped a lot).

Mistakes and corrections thereafter

All engineers had “god mode” across all environments back then. We tightened our IAM permissions, and also moved to role-based accounts.
Our scheduled backups had never been verified until then. We started performing regular checks on our database restore process.

Aftermath

Today, many of us have left that startup but still catch up from time to time. We often reminisce about this incident, in good spirits.

For me, this remains my go-to “Tell me about your toughest time” story.

#failure #production #horror

buy Kelvin a cup of coffee ☕

Hunting down a Postgres JIT error

January 21, 2024

This post chronicles my investigation into a Just-In-Time (JIT) error seen on a customer's Postgres server.

Given the complexity of the setup (Kubernetes), I thought it would be fun to write up, and hopefully educational for anyone hitting similar errors.

Quick context

We provide our self-hosted solution as a Kubernetes (k8s) application. By default, the Postgres database server is deployed as a StatefulSet.

I have been using Postgres for more than 5 years, but had never dug into Postgres's JIT feature.

Hello, Unexpected Error

Our customer noticed some data were not refreshed. We noticed the k8s CronJob responsible for refreshing data had been failing for a while.

Specifically, the Pod failed with the following error:

ERROR:  could not load library "/opt/bitnami/postgresql/lib/llvmjit.so": libLLVM-11.so.1: cannot open shared object file: No such file or directory

As a first troubleshooting step, we tried to confirm if the /opt/bitnami/postgresql/lib/llvmjit.so file was missing:

$ kubectl -n <ns> exec -it po/<Postgres pod> -- ls -lah /opt/bitnami/postgresql/lib/llvmjit.so

The file was there, so the next step was to Google what the error could mean.

Examples from the Interweb suggest the error is likely related to libLLVM-11.so.1 (example post).

Make it fail

As per Rule 2 of Debugging Rules, I elected to reproduce the error then.

At this point, I must share that no other customers reported this error. Others in the Support team and I also did not experience the same error on our own setups with the same application version.

So we first need to understand what Postgres JIT is, and then try to reproduce the error.

It turns out Postgres >= v12 has JIT enabled by default.

Depending on the SQL query, Postgres may trigger JIT if it concludes the JIT compilation will speed up the query. You can read up on the decision details here.
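As far as I understand it, the decision comes down to comparing the planner's estimated total cost against a few cost thresholds. Here is a minimal Python sketch of that comparison, assuming the stock defaults for jit_above_cost, jit_inline_above_cost and jit_optimize_above_cost. The function and class names are mine, for illustration; this is a model of the behaviour, not Postgres source code.

```python
from dataclasses import dataclass

# Stock defaults for the Postgres settings that gate JIT.
# A negative value disables that stage.
JIT_ABOVE_COST = 100_000
JIT_INLINE_ABOVE_COST = 500_000
JIT_OPTIMIZE_ABOVE_COST = 500_000


@dataclass
class JitDecision:
    perform: bool       # is JIT compilation used at all?
    inlining: bool      # are small functions and operators inlined?
    optimization: bool  # are the expensive optimizations applied?


def _exceeds(cost, threshold):
    """A stage applies only if its threshold is enabled and the cost is above it."""
    return threshold >= 0 and cost > threshold


def jit_decision(total_cost, jit_enabled=True):
    """Model the JIT decision for a plan with the given estimated total cost."""
    perform = jit_enabled and _exceeds(total_cost, JIT_ABOVE_COST)
    return JitDecision(
        perform=perform,
        inlining=perform and _exceeds(total_cost, JIT_INLINE_ABOVE_COST),
        optimization=perform and _exceeds(total_cost, JIT_OPTIMIZE_ABOVE_COST),
    )
```

For a plan with an estimated total cost of about 1.1 million, such as the cross join used later in this post, all three flags come out true. A cheap query with a cost of 50,000 would not trigger JIT at all.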

This makes reproduction an interesting challenge: we need to run a query that deterministically forces Postgres to trigger JIT compilation.

Given that our k8s application sets up the Postgres server with defaults from Postgres v12, I was able to reduce the setup to testing with the underlying custom Postgres Docker image.

We can reproduce the same error when we run the SELECT pg_jit_available(); query.

Implementing a workaround

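I noted that we can disable JIT as a workaround. You can choose to disable JIT at the system level, or for a specific database. Note that ALTER SYSTEM only takes effect after a configuration reload (for example, SELECT pg_reload_conf();), while ALTER DATABASE applies to new sessions on that database.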

ALTER SYSTEM SET jit = off;
-- or
ALTER DATABASE [db] SET jit = off;

To confirm this workaround works, we need to:

Reproduce the error with an SQL query that triggers JIT deterministically.
Apply the workaround, and show that the same SQL query passes without the error.

I was able to come up with a computationally-expensive SQL query for which Postgres triggered JIT compilation:

-- context: characters table has < 8000 rows
-- see https://github.com/kelvintaywl-cci/test-postgresql-sidecar/blob/5f9658ed646a17abd5abad1133a83ef270685f92/schema.sql#L7

SELECT AVG(c1.strength * c2.strength)
FROM characters AS c1 CROSS JOIN characters AS c2;

Using EXPLAIN ANALYZE would also reveal if Postgres did trigger JIT.

Below is the output when using docker.io/library/postgres:12.6 (the official Postgres 12.6 image). We can see JIT was utilized.

                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1107700.43..1107700.44 rows=1 width=32) (actual time=9992.650..9992.651 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..791290.30 rows=63282025 width=8) (actual time=91.183..7270.120 rows=62710561 loops=1)
         ->  Seq Scan on characters c1  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.008..0.891 rows=7919 loops=1)
         ->  Materialize  (cost=0.00..162.32 rows=7955 width=4) (actual time=0.012..0.339 rows=7919 loops=7919)
               ->  Seq Scan on characters c2  (cost=0.00..122.55 rows=7955 width=4) (actual time=91.163..91.833 rows=7919 loops=1)
 Planning Time: 0.191 ms
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.580 ms, Inlining 42.983 ms, Optimization 30.411 ms, Emission 17.599 ms, Total 91.572 ms
 Execution Time: 10040.688 ms
(11 rows)

When using our custom Postgres image, we can reproduce the error.

ERROR:  could not load library "/opt/bitnami/postgresql/lib/llvmjit.so": libLLVM-11.so.1: cannot open shared object file: No such file or directory

We then apply the workaround, and rerun the SQL query.

ALTER DATABASE
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1107700.43..1107700.44 rows=1 width=32) (actual time=10792.502..10792.503 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..791290.30 rows=63282025 width=8) (actual time=0.014..7371.175 rows=62710561 loops=1)
         ->  Seq Scan on characters c1  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.005..1.247 rows=7919 loops=1)
         ->  Materialize  (cost=0.00..162.32 rows=7955 width=4) (actual time=0.000..0.305 rows=7919 loops=7919)
               ->  Seq Scan on characters c2  (cost=0.00..122.55 rows=7955 width=4) (actual time=0.003..0.738 rows=7919 loops=1)
 Planning Time: 0.193 ms
 Execution Time: 10792.636 ms
(7 rows)

The SQL query returned successfully, and without using JIT as expected from the workaround.
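If you want to automate this before/after comparison, the check can be as simple as looking for the JIT section in the text output of EXPLAIN ANALYZE. Below is a minimal sketch. The function name is hypothetical, and it assumes the default text format shown in the plans above (not the JSON or YAML formats).

```python
def plan_used_jit(explain_output):
    """Return True if EXPLAIN ANALYZE text output contains a JIT section.

    In the default text format, the JIT summary starts with a line that
    reads "JIT:" once surrounding whitespace is stripped.
    """
    return any(line.strip() == "JIT:" for line in explain_output.splitlines())

# Usage:
#   plan_used_jit(plan_before_workaround)  # expected to be True
#   plan_used_jit(plan_after_workaround)   # expected to be False
```

Matching the whole stripped line avoids false positives from table or column names that merely contain the letters "JIT".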

Our custom Postgres image was meant to not support LLVM and JIT. Hence, the workaround here has the same effect as the eventual fix (patch here).

Penciling career goals for 2024

December 30, 2023

It’s New Year’s Eve morning here in Tokyo, Japan. I’m typing this while cradling the baby at a Starbucks. Staring at the morning rain, I thought it was a good time to reflect on 2023 while welcoming 2024.

Gratitude

Flexibility in a remote-first company. Laundry, accompanying the wife for her doctor appointments, etc. Thankful for the team’s understanding.
3-month parental leave. It takes a village to raise a child. I think it would be too difficult for just one parent to navigate the first 3 months alone.
Promotion. The new role encourages me to continue/start mentoring, glue work, technical leadership.

Failings

Acted on things when I was not the best candidate to do so. I need to be more aware of saying “no” to others and myself. Sometimes, you can become less helpful with your answers.
Did not listen to the body. Our bodies will likely signal to us when we are overworked and stressed. Thankfully, it is better now, as I juggle priorities (family) but I wish I was better at this from the start.

Hey 2024

Greater technical depth in the devops space. Metrics, k8s, networking.
Champion others; would love to help others step up / get promoted.

If you stumble upon this post, I hope you and your family enjoy a good 2024 ahead!

Best wishes!

Reflections: From Software Engineering to Technical Support

October 19, 2023

After 8+ years as a software engineer, I made a career switch in 2021. I joined CircleCI as a support engineer.

As I enter my 3rd year with CircleCI, I thought it may be good to reflect on this career change so far.

I noticed there is plenty of literature for support engineers looking to switch to software engineering / development. However, there aren't many opinions on the other direction.

I hope this can be helpful for folks deliberating on similar career changes.

Why I switched

I will be honest here. One of the major reasons I switched was the salary.

I was drawing a comfortable salary at my previous job. However, my wife and I were hoping to start a family and considered purchasing a home. I was able to negotiate for a better salary in this switch. (I know this is likely a rare case in many countries. For context, I am based in Tokyo, Japan.)

I also wanted to try something (slightly) different.

Covid-19 was much less of a pandemic by 2021, but it prompted me to take a fresh look at life and its (un)certainties. One of the interviewers asked me “where do you see yourself in 3 years?”. I took the opposite view and considered what happened in the last 2 years, from 2019 to 2021. So much has changed across societies as we battled Covid-19. For me, the unexpected Covid-19 did spur me to try something different. Life is too short to be doing the same thing for a lifetime.

Critically, I also enjoyed using CircleCI as a customer. Don't get me wrong; my past adventures were with companies I did believe in. However, being a heavy and happy customer of CircleCI, I found it empowering to be in a position to share my enthusiasm with the customers.

How has the journey been

Fortunately, CircleCI is a product for software teams. This means that, as a support engineer, I am tackling problems with another software engineer on the other side. I can empathize with the customer.

I have found it really rewarding to be able to draw on my past experience, and share best practices and advice with the customers.

Being part of the JAPAC (Japan + Asia Pacific) team, I also get opportunities to brush up my Japanese. To be honest, my wife and DeepL helped me a lot. I have never aspired to reach near-native level in Japanese, but this has been a positive bonus.

One of the major perks is also learning new technologies. I get to learn from both our engineers and our customers (e.g., their tech stack). Prior to this job, I had never dug deep into Packer, Nomad, Git LFS, or even Windows PowerShell, to name a few. (You can find my public repos for testing here.)

In my case, I've enjoyed the transition so far. Thankfully, I still get to utilize much of my software engineering knowledge to troubleshoot with customers.

However, like any choice made, technical support as a career comes with challenges too.

I think being able to empathize with customers is critical. Beyond the technical know-how, you'd need to know how to handle delicate communications at times (e.g., system-wide incident).

Would I recommend it?

If you are a software engineer reading this, I would encourage you to consider the switch if:

(Traits)

You enjoy communicating with people as much as digging deep into code / configurations.
You enjoy debugging, or chasing down root causes as much as implementing new features.

(Strengths)

You can empathize with users, and can advocate on their behalf. I think folks contributing to open-source projects may be good candidates.
You are patient and can negotiate tough communications.

(Motivations)

You would like to be closer to customers and the product.
You enjoy opportunities to share best practices with customers.
You believe in and enjoy the product.

These are just my opinions based on my experience so far. Your mileage may vary.

However, remember that switching does not mean you cannot go back. You can and should re-evaluate this decision along the way.

Trivia

September 7, 2023

Here are some random facts about me, in case this ever becomes useful.

I am blessed with an identical twin brother. He's the better one!
I write with my right hand, but am left-footed football-wise.
I became a father to a baby boy in late 2023.

Baby Birth Announcements on Serverless

August 28, 2023

A good friend educated me on how he set up a web application to inform the parents minute-by-minute re: the arrival of his baby. 👶

I thought that was a brilliant idea! This way, new parents can push updates, and avoid the thundering of “has the baby arrived yet?” questions. 💡

So, I set about trying to build one, with the following criteria:

I don't want to spend time and money maintaining any infrastructure. 💰
I would like to be able to limit the site's visibility to folks I share it with. ㊙️
I don't need updates to be “real-time”; some delay or latency is fine. 📠
The updates should also be in Chinese and Japanese. 🤖
Building it should be fun! 🎮

“Why don't you just post it on Twitter/X?”

I also wanted to avoid social media for such announcements, if possible.

How did it go?

I ended up with a user experience where:

Folks visit the site hosted on https://write.as with password protection.
I can post updates via email + image attachments.

In terms of the “tech stack”, this is pretty low-code. I've set it up via:

Orchestrating all logic within a Pipedream workflow
Using an email trigger to capture email body and attached images
Leveraging DeepL's API to translate EN content to ZH and JA
Uploading images via the Snap.As API
Publishing via the WriteAs API
Putting the business logic within a 106-line Python function
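To give a flavour of the business logic, here is a minimal sketch of the step that assembles one post from an email. It is not the actual 106-line function. The function name is hypothetical, the translate argument stands in for whatever wraps the translation API, and the Markdown image syntax is an assumption about how the published post embeds the uploaded images.

```python
def build_post_body(text_en, image_urls, translate, languages=("ZH", "JA")):
    """Assemble a multilingual post body from the English email text.

    `translate` is a caller-supplied function (text, target_language) -> text,
    so this sketch makes no real API calls itself.
    `image_urls` are the URLs returned after the attachments were uploaded.
    """
    sections = [text_en.strip()]
    for lang in languages:
        sections.append(translate(text_en, lang).strip())

    # Assumption: images are embedded using Markdown image syntax.
    images = [f"![]({url})" for url in image_urls]
    return "\n\n".join(sections + images)
```

Passing the translator in as an argument keeps the function easy to test with a fake, which was handy while experimenting on a free plan with usage limits.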

pipedream workflow

You can use DeepL to translate for free (within limits). I am also using a free account on Pipedream, so I'm really thankful for such services.

I spent 2 weekends experimenting before tying this up. However, I really enjoyed building this simple workflow.

sample post on blog

Protecting my time on OSS projects

August 6, 2023

I am not a prolific contributor to open-source work in any sense. It has also been a while since I received any issues or pull-requests on my work.

I recently noticed an issue filed by a user on the (unofficial) CircleCI Terraform provider.

The issue raised was something I was aware of but chose not to fix. I was not sure if the impacted feature was used enough for the bug to be noticed.

That someone reported it signalled to me that this project and feature are depended upon by users. (Thank you @nagendrasanthosh)

I took a Saturday morning to release the fix, and it was a fun occasion.

I'd like to keep at this policy.

I am not going to jump into fixing any projects if I think it may take more than 1 weekend to resolve. It may sound selfish, but while the project is Free, my time is not; family will come first. 👪

What is your policy or strategy on protecting your time on OSS work?

Coverstory

July 30, 2023

This is the 2nd and final part of my adventures making a topponcino. You can read the 1st part here.

I will also try to stitch a few puns around this topic, because why not? 🤡

Equipment Needed

This was the minimal list of items I got to complete the topponcino. Hopefully, this gets you covered:

Item	Remarks
Sewing machine	We bought a portable one
Cotton	100% cotton would be ideal, I feel
Fabric	Purchased at Yuzawaya
Bobbins	For bottom thread on the sewing machine
Thread	You can also find some at 100-yen shops like CanDo
Long Ruler	Really recommended
Fabric Pen	For marking seam lines on the fabric
Seam Ripper	Your life-saver for undoing your wrongs

T(h)reading a fine line

My Japanese is still terrible, so I had a hard time watching the set-up videos from the sewing machine manufacturer. My wife helped a lot, but I was still fumbling.

Thankfully, it wasn't a radically-different machine, so I was also able to refer to English content such as Glory Allen's. Highly recommended, unless you prefer annoying your partner like I did.

As a Singaporean, I did learn some stitching during Home Economics classes in the early 2000s. 🪡 However, I don't think we ever learned how to operate a sewing machine. My wife, who had her education in Japan, was practically showing off her life skills. 🇯🇵 1 – 0 🇸🇬.

Navigating the sharp curves for the seam was challenging. Fortunately, the sewing machine came with a foot pedal switch; I was able to control the speed via pressure on the pedal! This seamed too good to be true. 🧶

Alas, like a beginner, I overestimated my rhythm when I got the hang of it. Having a seam ripper definitely helps, as you can undo your embarrassment.

when things don't line up

Here is the sewn base, after a week of procrastination.

topponcino base sewn

If one looks closely, there are irregular seam lines. However, I opted against perfecting them, since I reasoned it looks more “handmade” that way.

You reap what you sew. Well, the baby is using it, anyways 👶

Tying it up

We finally completed our first topponcino for the baby!

topponcino final

I think it turned out pretty well.

I'm hoping we will work on more sewing projects over time. I hope this inspires anyone to also try sewing as a new hobby.

You don't know how much fun you're ミシン (missing) out!

#parenting #sewing #hobby

Being Creative While Skint

July 16, 2023

My wife and I are expecting our first child this September.

Having been convinced of the magic of a topponcino during a parenting class, we decided to make one instead of buying it.

“Oh, it's almost $100 to buy one? Why don't we make one ourselves?”

We thought this would (1) be more fun, (2) allow us to customize as we like, and (3) give us a greater sense of ownership!

I am hoping I become good at this. No, not parenting, I meant sewing (I kid).

Fabrication

I was introduced to Yuzawaya by the wife on a Sunday afternoon. Here in Japan, Yuzawaya is the hobby shop, where you can find reasonably-priced fabric, and then of course, Marimekko fabric as well.

Base	Cover

It felt empowering knowing you get to choose what design and material goes into the product. We also had fun people-watching – guessing what others may be creating for their weekend project.

If you are looking to make other items like bibs, Yuzawaya also offers blueprints (they call them recipes). Unfortunately, there were no blueprints for topponcinos.

Sizing things up

Thankfully, the Interweb is an amazing place. I was able to find a blueprint from this organic cotton shop: https://www.fuwarico.com/topponcino/

Still, this presents a bit of a challenge, since we would have to print it on A1-size paper.

The convenience stores here allow prints up to A3 size. Photo-printing services like Kinko's also offer easy A2 prints.

However, I think I would need to fork out $40+ to print in A1. Most services assume you would be printing colour posters at that scale.

Being an engineer, and mainly because I'm skint, I decided to:

crop the blueprint to only the intricate portions (within A3 sizes)
draw the rest (mainly regular lines) myself onto the fabric

top of topponcino

Why pay that $40, if most of the to-be-printed portions can be drawn by myself? I may be imagining things, but I assume my Asian grandparents would be proud.

thinking

(Of course, some may argue I could have simply drawn all parts by myself.)

Here's hoping things go well! We'll share the finished product soon, hopefully.

#parenting #sewing

How I added CircleCI Runner support for Terraform provider

June 25, 2023

I have released support for the CircleCI Runner resource-class and token in my unofficial Terraform provider for CircleCI, as of v0.10.3.

Developers can now manage the provisioning (and teardown) of CircleCI self-hosted Runners within Terraform. You can explore an example here.

This was a fun challenge, and I wanted to document my journey on this work.

Investigation

Unlike other resources, self-hosted runners are not manageable under the official V2 API. Developers had to use the CircleCI CLI to manage resource-classes and tokens instead.

To port this into my Terraform provider, I was hoping there was an HTTP API available. This way, I could continue using my approach of abstracting the HTTP API away to a Go SDK.

I assumed (wrongly) the CLI was using GraphQL under the hood for Runner operations, as with many others (e.g., for Orbs).

Digging into the source code, I then realized Runner resource-classes and tokens can be managed via an HTTP API; it was simply not publicly documented yet.

Glue

After the legwork mentioned above, I tested the HTTP APIs with an OpenAPI (Swagger 2.0) document. This enabled me to generate a Go SDK for Runner APIs.

Why are the Go SDKs separated? Wouldn't it be easier to keep it all in one?

That was something I mulled over for some time indeed. I have documented my reasons for keeping them separate.

Assembly

With the Go SDK published, I “simply” had to then expose the Runner resource-classes and tokens in the Terraform provider codebase.

The main work was done within a pull request here. This also included acceptance tests, and examples.

Ultimately, I noted self-hosted (machine) runners are also available for CircleCI's Server customers (i.e., self-hosted CircleCI). We would want to extend and ensure this addition can be used by platform teams using CircleCI Server.

It turned out that the Runner API:

is via https://runner.circleci.com for CircleCI Cloud
is via https://YOUR-SELF-HOSTED-DOMAIN for CircleCI Server
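The rule above boils down to a small lookup. The provider itself is written in Go; the sketch below uses Python purely to illustrate the logic. The function name and the idea of treating "circleci.com" as the default host value are assumptions, not the provider's actual configuration.

```python
CLOUD_HOST = "circleci.com"  # assumed default host value, for illustration
CLOUD_RUNNER_URL = "https://runner.circleci.com"


def runner_base_url(host=CLOUD_HOST):
    """Return the Runner API base URL for CircleCI Cloud or a Server install."""
    # Normalise input such as "https://ci.example.com/" to "ci.example.com".
    host = host.strip().rstrip("/")
    if host.startswith("https://"):
        host = host[len("https://"):]

    if host.lower() == CLOUD_HOST:
        # Cloud serves the Runner API from a dedicated subdomain.
        return CLOUD_RUNNER_URL
    # Server serves the Runner API from the self-hosted domain itself.
    return f"https://{host}"
```

With no argument this resolves to the Cloud endpoint, so existing Cloud users would not need to configure anything extra.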

Thankfully, this was a quick patch. I was also able to verify this fix against my own CircleCI Server instance.

Things I learnt

This feature was satisfying for me to build, and I had many learning points along the journey.

Read the code: This feature would not have been completed if I did not dig deeper into the publicly-available source code 📖
Keep trying: I am still learning (and failing) at Go. However, I think it is important to keep trying and learning. Keeping this source code open also forces me to stay honest about my lack of knowledge. For fellow engineers out there, let's keep at it! 🤓

#terraform #circleci #runner #go
