
In the Dark

5 April 2026 · 5 min read

A user told me the regions page wasn't loading. Not a vague “something feels slow” report. A clear, specific “the page does not work.” I checked. They were right. The entire backend had been down, and I had no idea for how long.

The homepage loaded fine. The blog loaded fine. Everything served by the frontend was working perfectly. Everything that needed the API was not. And nothing in my setup told me.

What happened

A database credential mismatch caused the backend to crash on startup. Docker Compose restarted it. It crashed again. Restarted. Crashed. The restart policy kept it in a tight loop, invisible to anyone not watching the logs.

From the outside, the site looked alive. Nginx was up. The frontend was serving pages. It just couldn't do the one thing it existed to do.

Fixing the credential got the backend running again, but the site still returned 502s. Nginx had cached the backend's old container IP from before the crash loop. A reload fixed it. Two layers of failure, each independently invisible, each taking longer to find than it should have because nothing was pointing me in the right direction.
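
The stale-IP half of that has a well-known workaround, sketched below rather than taken from my actual config: proxying to a variable forces nginx to re-resolve the container name at request time instead of pinning the IP it saw at the last reload.

```nginx
# Sketch of the usual workaround (not necessarily what FairwayPlan runs).
# With a variable in proxy_pass, nginx looks the name up again per request
# via Docker's embedded DNS instead of caching the IP from the last reload.
resolver 127.0.0.11 valid=10s;

location /api/ {
    set $upstream http://backend:8000;
    proxy_pass $upstream;
}
```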

What I thought I had

I had convinced myself the infrastructure was reasonably covered. There are health checks in the Docker Compose config. The database has a readiness probe. Services declare dependencies so they start in the right order. There is a scheduler that runs weekly to keep course data fresh.
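
Roughly the shape of it, as a simplified sketch rather than the real compose file:

```yaml
# Illustrative only: service names and details are not the actual config.
services:
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  backend:
    build: ./backend
    restart: unless-stopped          # keeps restarting a crash loop, silently
    depends_on:
      db:
        condition: service_healthy   # guarantees start order, not that it stays up
```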

None of that is monitoring. Health checks tell Docker whether to restart a container. They do not tell you that a container is being restarted in an infinite loop. Service dependencies ensure startup order. They do not ensure the service stays up. I had built a system that could recover from transient failures automatically and fail permanently in silence.

The worst kind of resilience: just enough to hide the problem.

Why this project exists

My background is in data. Building pipelines, designing systems, and more recently leading a team that does both. I have shipped production systems before. But the systems I know have observability baked into the culture: pipeline run logs, data quality checks, and row-count monitors. That muscle memory is specific to the domain. It did not transfer automatically.

FairwayPlan started as a way to see how far AI-assisted development could take someone building outside their core domain. The answer has been: surprisingly far, surprisingly fast. But “surprisingly far” is not the same as “far enough.” Gaps in experience don't disappear because the tooling is good. They show up later, in the things you didn't think to build.

In a data pipeline, I would never ship without a monitor on the output. It is reflexive. I have been burned enough times to know what happens when you don't. But I did not apply that same instinct here. The domain was different enough that the equivalent check (“Is the API actually responding?”) did not feel as obvious as it should have.

What it taught me about my actual job

Running this project has made me better at the work I am paid to do. When someone on my team raises an issue, I have a slightly different frame for it now. Not just “what went wrong” but “what would I have missed if this were my system and I were the only one looking at it.”

There is a difference between reviewing someone else's outage and living through your own, even a small one, even on a side project with no real stakes. It changes the questions you ask. Not just “does this handle the failure?” but “who gets told when it fails?”

I have sat in enough incident reviews to know that the root cause is almost never the thing that broke. It is the thing that should have caught it and didn't. Now I have my own example.

The monitoring strategy was a stranger on the internet

The whole point of this project is to build things I have not built before, to run into problems I have not run into before, and to come away with intuition I did not have before. This week the lesson was about monitoring. Next week it will be something else.

I built a weather pipeline with three tiers of fallback. I built a solver with a greedy backup when the MIP times out. I thought carefully about failure modes inside the application. My strategy for knowing whether the application itself was up was, apparently, waiting for someone to tell me.

Someone did. So I set up HetrixTools, a free external uptime monitor that pings /health every minute from four locations and emails me when it gets anything other than a 200. The health endpoint now actually verifies the database connection instead of blindly returning OK. Fifteen monitors, one-minute intervals, no credit card. Setting it up took ten minutes. That is all it took to close the gap that had left the site silently broken.
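
The shape of that change, as a rough sketch assuming a FastAPI backend and SQLAlchemy (which may not match the real stack), with a placeholder connection string:

```python
from fastapi import FastAPI, Response
from sqlalchemy import create_engine, text

app = FastAPI()
engine = create_engine("postgresql://user:pass@db:5432/fairwayplan")  # placeholder DSN

@app.get("/health")
def health(response: Response):
    # Only return 200 when the database actually answers, instead of
    # reporting OK while the backend can't do the one thing it exists to do.
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception:
        response.status_code = 503
        return {"status": "degraded", "database": "unreachable"}
    return {"status": "ok", "database": "reachable"}
```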

New Zealand daylight saving ended today. The clocks went back, the sun set an hour earlier, and winter officially arrived. The FairwayPlan tool, which disables late-afternoon tee times in winter because it knows the light will run out, did not know that all this time it was the one in the dark.