Sleep — Southcore Labs Blog

I logged into my database dashboard a couple of weeks ago and felt my stomach drop. The compute usage was up. Not a little up. A lot up. And there was no good reason for it.

There'd been no traffic spike. No runaway query. No customer hammering anything. No fire to put out. Just a small monitoring service doing exactly what I'd built it to do, every minute of every day, quietly running up a bill while I was off building other things and feeling rather good about myself.

I'll be honest with you up front, because there's no version of this story where I'm the hero. This was a problem I created. I wrote the code that did it, I deployed it, and then I forgot about it for months while it sat there ticking away, costing me a little more every single day. What follows is me finding it, owning it, and fixing it. The fix is the fun part. The mistake is the instructive part.

Let me back up and explain what's actually running.

What Pulse Is

Pulse is the uptime monitor inside Southcore Labs. It's the thing that pings my clients' sites every minute, writes down what it saw, opens an incident when something falls over, and fires off an email and a WhatsApp when it comes back. If you've ever used Better Stack, UptimeRobot, or Pingdom, you already know the shape of it.

It runs on Neon. If you haven't come across Neon before, it's serverless Postgres. That word "serverless" is doing a lot of work, so let me unpack it. It's a normal Postgres database, except the compute that runs your queries isn't on all the time. When nothing's talking to the database, the compute spins down to zero and you stop paying for it. When a query comes in, it spins back up. You pay for compute by the hour, but only for the hours the thing is actually awake. It's a genuinely beautiful idea, and it's the whole reason the bill is supposed to stay small for a service like mine that sits idle most of the time.

Supposed to.

The Mistake

Here's how Pulse worked. Every minute, a cron job woke up and asked Postgres a simple question: which monitors are due for a check? It ran the checks, then wrote the results back. For every monitor, every minute, it did two or three writes straight to the database.

// the old hot path: writes to Postgres, every monitor, every minute
await safe("updateMonitor", monitor.id, () =>
  updateMonitor({
    monitorId: monitor.id,
    lastStatus: domain.newStatus,
    lastCheckedAt: now,
    lastResponseTime: result.responseTime,
    consecutiveFailures: domain.consecutiveFailures,
  }),
);

// always log a failed check, but only sample roughly one in five of the rest
const shouldLog = result.status === "down" || Math.random() < 0.2;
if (shouldLog) {
  await safe("createCheck", monitor.id, () =>
    createCheck({ monitorId: monitor.id /* ...status, timing, code */ }),
  );
}
// ...and one more upsert to keep the daily uptime average fresh

That looks reasonable, right? Two or three small writes a minute per monitor. A tiny price for honest data.

Except go back and read what I said about Neon. The compute sleeps when nothing's happening. And a write every sixty seconds is, by any definition, something happening. So the database never slept. Not once. Every minute, on the minute, my cron reached out and poked it awake, did a few tiny writes, and left it spinning. The meter just ran. And ran. And ran.

I'd accidentally built the one access pattern that's perfectly designed to defeat the entire point of serverless Postgres. I was paying for a database to stay awake 24 hours a day so it could do a few seconds of real work an hour.
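To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes a simple model where every query wakes the compute and it stays awake until an idle timeout passes with no further queries. The 60-second timeout and the 2 seconds of work are placeholders I picked for illustration, not Neon's actual settings or Pulse's measured values, so check your own project before trusting the output.

```typescript
// Simplified model of a scale-to-zero database (an assumption, not Neon's billing logic):
// each poke wakes the compute, which stays up for the work plus an idle timeout.
function awakeSecondsPerDay(
  pokeIntervalSec: number,
  idleTimeoutSec: number,
  workSec: number,
): number {
  const daySec = 24 * 60 * 60;
  // If the next poke lands before the idle timer runs out, the compute never suspends.
  if (pokeIntervalSec <= workSec + idleTimeoutSec) return daySec;
  const pokesPerDay = Math.floor(daySec / pokeIntervalSec);
  return pokesPerDay * (workSec + idleTimeoutSec);
}

// Placeholder values: 60 s idle timeout, 2 s of real work per wake-up.
const everyMinuteHours = awakeSecondsPerDay(60, 60, 2) / 3600; // 24 hours a day
const everyHourHours = awakeSecondsPerDay(3600, 60, 2) / 3600; // about 0.41 hours a day

console.log({ everyMinuteHours, everyHourHours });
```

The exact figures depend entirely on the timeout you plug in. The shape is the point: once the gap between pokes is shorter than the timeout, the answer is always "all day".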

Why It Took Me So Long To See It

There's a specific flavour of shame that comes with writing slow, wasteful code and not knowing. It doesn't show up in a stack trace. Nothing turns red. No test fails. The site works. The monitors monitor. It just quietly bleeds money in the background while you walk around certain that everything's fine. I'm honestly more at peace with the bugs that scream. It's the ones that whisper that get me.

The Fix: Buffer, Then Flush

The fix is the kind that's obvious only in hindsight, which I've decided is the most humiliating kind of obvious there is.

Almost none of those writes need to be true the instant they happen. There's exactly one that does. When a site goes down, I want that incident in Postgres now, this second, because it's the audit trail and because the alerts key off it. Everything else is just bookkeeping. Whether the dashboard thinks a monitor was last checked at 14:32:05 or 14:32:08 changes nothing, as long as I get to the right answer eventually.

So here's the move. Don't write it. Hold it somewhere cheap. Flush it to Postgres later, all at once.

I already had Upstash Redis in the project for rate limiting, so adding a second tenant was nearly free. Now the hot path writes to Redis instead of Postgres. Same shape, same call site, different destination.

// the new hot path: write to Redis, not Postgres
if (cacheEnabled) {
  await safe("setMonitorState", monitor.id, () =>
    setMonitorState(monitor.id, {
      lastStatus: domain.newStatus,
      lastCheckedAt: now.toISOString(),
      lastResponseTime: result.responseTime,
      consecutiveFailures: domain.consecutiveFailures,
    }),
  );
}

The monitor's latest status goes into a Redis hash. Individual checks pile up in a list. The daily counters tick along as Redis integers. Every write also drops the monitor's id into a "dirty" set, so that later, the flush knows exactly what it has to drain and nothing more.

Then, at the end of every tick, the cron asks one more question: how long's it been since I last flushed?

export async function shouldFlush(): Promise<boolean> {
  if (!redis) return false;
  const last = await redis.get<number>(KEY.flushedAt);
  if (!last) return true;
  const elapsed = Math.floor(Date.now() / 1000) - last;
  return elapsed >= intervalSeconds(); // defaults to 3600, i.e. one hour
}

If it's been an hour, it drains everything out of Redis and writes it to Postgres in batches. One transaction for all the monitor states. One bulk insert for all the buffered checks. One upsert per monitor for the daily aggregates. Then it stamps the clock and goes quiet again. Which means the database gets poked awake once an hour instead of sixty times an hour, does one slightly bigger chunk of work, and goes back to sleep.
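The buffer-and-drain shape described above can be sketched without any infrastructure at all. This is a minimal in-memory stand-in, not Pulse's real code: the `CheckBuffer` class, the `flush` function, and the injected writer are all hypothetical names, a `Map`, an array, and a `Set` stand in for the Redis hash, list, and "dirty" set, and the daily counters are left out to keep it short.

```typescript
// In-memory stand-in for the Redis structures (hypothetical names throughout).
type MonitorState = {
  lastStatus: "up" | "down";
  lastCheckedAt: string;
  lastResponseTime: number;
  consecutiveFailures: number;
};
type Check = { monitorId: string; status: "up" | "down"; responseTime: number };
type Batch = { states: Array<[string, MonitorState]>; checks: Check[] };

class CheckBuffer {
  private states = new Map<string, MonitorState>(); // stands in for the per-monitor hash
  private checks: Check[] = []; // stands in for the list of buffered checks
  private dirty = new Set<string>(); // stands in for the "dirty" set

  record(monitorId: string, state: MonitorState, check?: Check): void {
    this.states.set(monitorId, state); // the latest state overwrites the previous one
    if (check) this.checks.push(check);
    this.dirty.add(monitorId);
  }

  // Hand back everything that changed since the last drain, then reset.
  drain(): Batch {
    const states = [...this.dirty].map(
      (id): [string, MonitorState] => [id, this.states.get(id)!],
    );
    const checks = this.checks;
    this.checks = [];
    this.dirty.clear();
    return { states, checks };
  }
}

// The writer is injected so the sketch runs without a database.
function flush(buffer: CheckBuffer, write: (batch: Batch) => void): number {
  const batch = buffer.drain();
  if (batch.states.length === 0 && batch.checks.length === 0) return 0;
  write(batch); // one call per flush, so the database is woken once
  return batch.states.length;
}

// Sixty ticks of one monitor, then a single flush.
const demoBuffer = new CheckBuffer();
let dbWrites = 0;
for (let minute = 0; minute < 60; minute++) {
  demoBuffer.record("monitor-1", {
    lastStatus: "up",
    lastCheckedAt: new Date(Date.UTC(2024, 0, 1, 0, minute)).toISOString(),
    lastResponseTime: 120,
    consecutiveFailures: 0,
  });
}
const flushedMonitors = flush(demoBuffer, () => {
  dbWrites += 1;
});
console.log({ flushedMonitors, dbWrites });
```

Sixty updates to the same monitor collapse into one state row because only the latest value matters. A real version would do the drain against Redis and would need to think about what happens if the Postgres write fails halfway through.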

Incidents stay synchronous. They're rare, they matter, and they go straight to Postgres the moment they happen. There's even a nice side effect: when an incident fires, it forces a flush at the end of that tick, so nobody's ever staring at an hour-old dashboard in the middle of a real outage.
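Put together with the hourly check, the end-of-tick decision might look something like this. It's a sketch under assumptions: `flushDue` and its input shape are hypothetical, and the only behaviour taken from the description above is that an incident forces a flush regardless of how recently the last one ran.

```typescript
// Hypothetical helper: decides whether this tick should end with a flush.
type FlushInput = {
  incidentOpened: boolean; // did any monitor open an incident this tick?
  lastFlushedAtSec: number | null; // unix seconds of the last flush, null if never
  nowSec: number;
  intervalSec: number; // one hour (3600) in the default setup
};

function flushDue(input: FlushInput): boolean {
  if (input.incidentOpened) return true; // real outage: bring Postgres up to date now
  if (input.lastFlushedAtSec === null) return true; // nothing flushed yet
  return input.nowSec - input.lastFlushedAtSec >= input.intervalSec;
}

// Ten minutes after the last flush: quiet tick versus a tick with an incident.
const quietTick = flushDue({
  incidentOpened: false,
  lastFlushedAtSec: 1000,
  nowSec: 1600,
  intervalSec: 3600,
});
const incidentTick = flushDue({
  incidentOpened: true,
  lastFlushedAtSec: 1000,
  nowSec: 1600,
  intervalSec: 3600,
});
console.log({ quietTick, incidentTick }); // { quietTick: false, incidentTick: true }
```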

The Math I Almost Got Wrong

There's one part of this I nearly shipped broken, and I'm writing it down mostly so the next person doesn't have to find it at 11pm on a Tuesday. That next person is, of course, me.

Pulse keeps a daily average response time per monitor. When you're folding in one sample at a time, the running mean is easy:

// old: fold in a single new sample
avgResponseTime: sql`(${avg} * ${count} + ${newValue}) / (${count} + 1)`,

But the moment you batch it, you can't just swap that 1 for the batch size and call it a day. You need two things from the batch, the sum of its samples and how many samples it holds, and you still weight the existing average by its own existing count:

// new: fold in a whole batch (respSum = sum of the batch, total = batch size)
avgResponseTime: sql`(${avg} * ${count} + ${respSum}) / (${count} + ${total})`,

It's a couple of lines of arithmetic and it's so easy to get subtly, silently wrong. The kind of wrong that doesn't crash, it just quietly reports numbers that are a little bit off, forever. Write the formula in a comment right next to the SQL. Future you will be grateful, and a little smug.
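A quick way to convince yourself is to run both versions outside SQL and check that folding a batch gives the same answer as folding its samples one at a time. The helper names below (`foldOne`, `foldBatch`, `foldBatchWrong`) are made up for this sketch, and `foldBatchWrong` is just one plausible way of getting it wrong: it bumps the denominator by the batch size but still adds only a single sample.

```typescript
type Running = { avg: number; count: number };

// Fold in a single new sample (the original one-at-a-time formula).
function foldOne(r: Running, sample: number): Running {
  return { avg: (r.avg * r.count + sample) / (r.count + 1), count: r.count + 1 };
}

// Fold in a whole batch: respSum is the sum of the batch, total is its size.
function foldBatch(r: Running, samples: number[]): Running {
  const total = samples.length;
  if (total === 0) return r; // nothing buffered, nothing to change
  const respSum = samples.reduce((acc, s) => acc + s, 0);
  return { avg: (r.avg * r.count + respSum) / (r.count + total), count: r.count + total };
}

// One plausible mistake: the "1" became the batch size, but only one sample was added.
function foldBatchWrong(r: Running, samples: number[]): Running {
  const total = samples.length;
  if (total === 0) return r;
  const lastSample = samples[total - 1];
  return { avg: (r.avg * r.count + lastSample) / (r.count + total), count: r.count + total };
}

const existing: Running = { avg: 100, count: 4 };
const newSamples = [200, 300];

const sequential = newSamples.reduce(foldOne, existing); // avg 150, count 6
const batched = foldBatch(existing, newSamples); // avg 150, count 6
const wrong = foldBatchWrong(existing, newSamples); // avg is 700 / 6, roughly 116.67

console.log({ sequential, batched, wrong });
```

The wrong version doesn't throw and doesn't look absurd. It just drags the average down every hour, which is exactly why it's worth a five-line check like this before it goes anywhere near production.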

Keeping The Dashboard Honest

There's a wrinkle with buffering. The database now lags reality by up to an hour. That's completely fine for the cron. It's not at all fine for a customer who opens the console two minutes after their site died and sees a calm green light.

So the dashboard doesn't trust Postgres alone. When it loads a monitor, it also reads the buffered state out of Redis and lays the fresh fields over the stale row. Live where it needs to be live, lazy everywhere else. From the customer's side, the status is always current. From Postgres's side, it still only gets written once an hour. Everyone goes home happy.
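The overlay itself is small. Here's a minimal sketch, assuming the buffered state carries the same four fields the hot path writes. The `overlayBufferedState` name and the row shape are hypothetical, and the "only if the buffer is newer" guard is my own assumption rather than something Pulse is known to do.

```typescript
type MonitorRow = {
  id: string;
  name: string;
  lastStatus: "up" | "down";
  lastCheckedAt: string; // ISO timestamp
  lastResponseTime: number;
  consecutiveFailures: number;
};

// The fields the hot path buffers; everything else still comes from Postgres.
type BufferedState = Pick<
  MonitorRow,
  "lastStatus" | "lastCheckedAt" | "lastResponseTime" | "consecutiveFailures"
>;

function overlayBufferedState(row: MonitorRow, buffered: BufferedState | null): MonitorRow {
  if (!buffered) return row; // nothing buffered, or the cache is switched off
  // Assumed guard: ignore a buffer entry that is older than the row it would overwrite.
  if (Date.parse(buffered.lastCheckedAt) <= Date.parse(row.lastCheckedAt)) return row;
  return { ...row, ...buffered };
}

// Postgres still says "up" from the last flush; the buffer knows the site went down.
const staleRow: MonitorRow = {
  id: "monitor-1",
  name: "Client site",
  lastStatus: "up",
  lastCheckedAt: "2024-01-01T14:00:00.000Z",
  lastResponseTime: 120,
  consecutiveFailures: 0,
};
const freshState: BufferedState = {
  lastStatus: "down",
  lastCheckedAt: "2024-01-01T14:32:05.000Z",
  lastResponseTime: 0,
  consecutiveFailures: 2,
};

const mergedRow = overlayBufferedState(staleRow, freshState);
console.log(mergedRow.lastStatus, mergedRow.name); // down Client site
```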

What It Actually Cost Me

Here's the part that stings, and the part that makes the whole thing worth writing about.

Before the fix, by the 15th of the month, I'd already have burned through around 90 compute hours, and the graph was still climbing. Compute had quietly become my single most expensive habit, and I wasn't getting anything for it except a database that flat out refused to nap.

Today is the 25th. This month, I've used 16.33 compute hours. Not by the 15th. Total. The whole month, nearly done, for less than a fifth of what I used to spend in half a month. Same monitors. Same checks. Same data. Same alerts. The only thing that changed is that I stopped writing things down the very instant I learned them.

Closing Thoughts… On Small Things That Cost A Lot

If money had been no object, I'd have written the obvious version of this, watched it work, and never looked at it again. It would've been fine. It would also have been quietly, invisibly wasteful forever, and I'd have been none the wiser. The cost is the only reason I went looking. The cost is what turned a vague "I should optimise this someday" into an afternoon of real work. And the cheap version and the clean version turned out to be the same version. I only found that out because the expensive one was sitting on my conscience.

I write about this a lot in other places, usually about life and not databases. This fear I have of the ordinary, of the small unglamorous things, of the boring middle part of everything. A few tiny writes a minute is about as small and unglamorous as a problem gets. And it taught me more than the big rewrites ever do. I keep relearning the same lesson in different costumes.

The thrill of building real software was never the moment the feature ships. It's this. It's two months later, when the thing you made turns around and tells you something about itself you didn't know when you wrote it. When the system gets to disagree with you. That's when it gets interesting.

So. My database takes naps now. My bill came back down to earth. And I don't think I'll ever write another cron without first asking the boring, expensive little question: does this actually need to be written down right now?

Ntsako