Friday night, the first cold snap of a New Zealand winter, somewhere around twenty to eleven. I had a cup of tea and zero plans involving a laptop. Then I checked my phone and FairwayPlan was down. The browser was blunt about it: invalid SSL certificate.
I have spent most of my career as a data scientist, and more recently as a data engineer. My production incidents have a particular flavour. A pipeline fails quietly at 3am. A dashboard shows yesterday's numbers. A model turns a column to null and nobody notices until a meeting on Tuesday. Data does not page you. It just sits there being subtly wrong until a human stumbles into it.
Web infrastructure is not so polite. It pages you. At 10:42pm. On a Friday.
526
The site sits behind Cloudflare, and the error my browser showed was hiding a more specific one underneath: Cloudflare error 526. The certificate the whole world sees, the one Cloudflare serves at the edge, was perfectly healthy. The certificate on my own little server, the one Cloudflare quietly checks behind the scenes, had expired. Exactly ninety days after it was issued, which is precisely how long a Let's Encrypt certificate is supposed to last.
Those certificates are meant to renew themselves. Mine had been failing to renew for the entire ninety days, silently, in a log file I was never going to open. Which is, I admit, the most data-engineering failure mode imaginable.
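A check along these lines would turn that silent failure into a loud one. It is a minimal sketch in Python, not anything FairwayPlan actually runs. The function names and the 30-day threshold are assumptions made for illustration. The date string is in the format Python's ssl module reports for a certificate's notAfter field.

```python
import ssl
from datetime import datetime, timezone

# Assumed threshold: complain once fewer than 30 days remain.
WARN_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the form "Jun 14 10:42:00 2024 GMT", which
    ssl.cert_time_to_seconds converts to seconds since the epoch.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_attention(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate is inside the warning window (or already expired)."""
    return days_until_expiry(not_after, now) < warn_days
```

Run daily against the origin server's own certificate, rather than the one Cloudflare serves at the edge, something like this would have started complaining about two months before the outage instead of leaving the news in a log file.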
The corner I never look at
I can happily reason about the solver's objective function. I can walk you through a weather lookup that falls back across three tiers when the forecast runs out. But the TLS handshake, the ACME challenge, the question of which process is allowed to hold port 80: that lives in a corner of the stack I set up exactly once, by following a guide, and then never thought about again.
It worked. Working things become invisible. Invisible things expire.
It was never going to renew
Here is the part that made me laugh, eventually. The renewal needed port 80 to prove I owned the domain. My web server already owns port 80. So every renewal attempt for ninety days reached for a door that was already taken, gave up, and wrote itself a note nobody read. And on top of that, Cloudflare was redirecting the exact request the renewal depended on before it ever reached my server.
Two independent reasons it could never have worked. It was never going to renew. It was always going to expire. The only open question was which evening it would choose.
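The first of those two reasons is easy to reproduce. The sketch below is an illustration in Python, not the renewal client's own logic, and port_is_free is a hypothetical helper name. It simply tries to bind a port and reports whether anything else got there first.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process can bind `port` on `host` right now.

    Returns False when another process already holds the port. It also
    returns False when the bind is refused for lack of privileges, which
    is common for ports below 1024 such as 80.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True
```

On a server where a web server is already listening on port 80, port_is_free(80) comes back False, which is the same closed door every renewal attempt kept walking into.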
The fix
I swapped the whole thing for a Cloudflare Origin certificate, which is valid for fifteen years instead of ninety days. I will quite possibly have moved on from this project, and maybe from this entire career, before it next expires. One nginx restart later, plus a brief detour through a Docker quirk I will spare you, and the site came back. Inside half an hour, most of it spent rereading my own logs in disbelief.
Nobody was even there
And this is the genuinely funny bit. Almost nobody uses FairwayPlan. It is a toy. I built it to plan golf trips and to get to be a software engineer for a few hours a week instead of a data person. On a cold June Friday, the number of people trying to plan a summer golf trip through my hobby site was, generously, zero.
The outage hurt no one. And yet the moment it broke, there I was, tea going cold, SSH'd into a server, grepping certbot logs as if my rent depended on it.
A project nobody was using took itself offline, and I still dropped everything on a Friday night to bring it back.
Maybe that is the real lesson, and it has nothing to do with certificates. The things you build for fun earn the same 10:42pm loyalty as the things you build for work, whether or not they have done anything to deserve it. The certificate will look after itself for the next fifteen years. I, apparently, remain on call indefinitely.