Closing the Estimation Loop: When the Rule Isn't Enough

Two weeks ago I wrote about quoting two estimates instead of one: wall-clock and effort, with the delegation strategy named out loud. I felt very wise. I added it to the Playbook as a new subsection in Phase 1, Step 7. I wrote a Field Notes entry about it. Then I went back to work and kept quoting estimates, and a few days later I realized something embarrassing.

I had no idea if any of my numbers were getting better.

I had a format. I did not have a measurement loop. A rule without a loop is a vibe. I was telling people to trust the system while not collecting the data the system needs to actually be a system.

So this week I closed the loop. Three small additions to Phase 2, all in the templates folder.

The Kickoff Side: One Row in the Sprint Tracker

Every sprint already has a tracker doc that gets filled in at kickoff. Goal, dates, demo scenario, that kind of thing. I added one row.

Estimate — N sessions / ~N hours (log actuals at close; copy row into docs/estimation-tracker.md)

That is the entire change. Five seconds at kickoff. The point of putting it in the tracker is that the number now has to exist in writing before any work starts. Verbal estimates don't survive contact with reality. Written ones at least force you to commit to a thing you can later be wrong about.

"Sessions" is the unit I care most about. A session is one focused working block, anchored to me being at the keyboard. They are countable in a way that hours aren't. Hours are vibes once you account for tool waits, idle time, context switches, and the general lossiness of measuring yourself.

The Actuals Side: A Tracker That Outlives Any One Sprint

The estimation tracker is a separate doc that lives at the project level, not the sprint level. It accumulates across slices. The columns are the boring ones you'd guess: date closed, slice name, type of work, estimated sessions and hours, actual sessions and hours, variance, notes.

Variance is just actual divided by estimate. Greater than 1 means it took longer than I thought. Less than 1 means it was faster. I am mostly interested in the median variance and the variance broken out by type of work, because I already know my biases shift by category. Design work over-estimates by 3-10× when delegated. Debug work under-estimates by 2-3×. The rule was already in memory from the first piece. What I was missing was the data behind it.

The trigger I built into the doc is: recompute aggregate ratios after every five to ten data points. Not after every slice. The point isn't to chase the variance number, it's to detect drift over a window large enough to be real.

The Recalibration Side: The Retro Pulls a Variance Summary

The last piece is the connecting tissue. The retro template now has a section called Estimation Variance, which pulls a few rows out of the tracker for that sprint and asks one question:

Calibration action: update wall-clock claim in memory, adjust default estimates in scoping docs, or no change yet because we need more data.

That last option is the important one. Most retros, the right answer is "no change yet." Three data points is not enough to update a multiplier. Five is borderline. Ten starts to mean something. The retro just makes sure I look at the variance every sprint instead of waiting for a vague future moment when I'll get around to recalibrating, which is to say, never.

Sessions Are Tight, Hours Are Loose, On Purpose

I want to be honest about one design choice. I am intentionally not measuring exact hours. I tried to in an earlier draft of this and quit by the second slice.

Real precise time tracking is expensive. You have to start a clock, stop the clock, account for the time you spent staring at a build that took eight minutes to finish, decide whether the bathroom break counts, all that. The friction makes you stop logging. Stopping logging is worse than rough numbers.

So I keep sessions tight (you can count those without thinking) and hours rough (a feel number you fill in at slice close, plus or minus 20%). Both numbers are useful, but for different things. Sessions tell me how chunked the work was. Hours tell me, very approximately, how much time it cost. The variance ratio works on either column.

What This Fixes and What It Doesn't

This doesn't make my next estimate accurate. One estimate is still a guess. What it does is make my next ten estimates honest, because after ten of them I'll know whether my multiplier is real or whether I'm flattering myself.

It also doesn't solve the cold-start problem on a new project. The first five slices on a project, I'm flying blind by definition. The tracker just makes sure I'm collecting the data those first five slices generate, instead of throwing it away.

The thing I keep reminding myself: the goal isn't precision, it's drift detection. I want to notice when my biases shift, not predict the future. The Playbook only stays useful if the methodology adapts when reality contradicts it. Writing the rule was the easy half. The boring half, the one with the spreadsheet, is the half that makes it true.