Does our code-health score actually predict bugs? A leakage-free benchmark
Repowise ships a code-health score. It's one number per file, from 0 to 10, built deterministically from 25 biomarkers, with no model and no LLM, so the same inputs always give the same output. People put numbers like this in dashboards and PR gates and act on them, so the question I couldn't stop asking was the uncomfortable one: does it actually predict where bugs show up, or is it just a well-dressed line count?
This is a writeup of the benchmark I built to answer that, including the parts where the answer was "not as much as I'd hoped." If you only remember one number, let it be that the score reaches a cross-project mean ROC AUC of 0.737 at predicting which files get bug-fixed over the following six months. If you remember two, let the second be that on raw discrimination it ties a file's line count, and the rest of this post is about why that is both true and not the whole story.
The one mistake that makes every benchmark like this a lie
Before any numbers, here is the methodology, because it's where almost everyone (including an earlier version of me) goes wrong.
The naive way to test a defect predictor goes like this. You take a repo at HEAD, score every file, label a file "defective" if it got a "fix:" commit in the last six months, and compute how well the score ranks the buggy files above the clean ones. It's easy, it's what a lot of published tooling quietly does, and it leaks.
Here is the leak. Several of the strongest biomarkers are evolutionary, meaning they read git history over a recent window: how much a file churned, how many people touched it, how scattered its co-changes are, how chaotic its commit pattern looks (change entropy). Now think about what a bug-fix commit actually is. It is a commit, it touches files, and it bumps their churn, their author count, and their entropy. When you score at HEAD and label from the same recent window, the bug-fix commits you are trying to predict are the very activity your biomarker "detects," so the score isn't predicting the future, it's reading the answer key.
The fix is to score in the past. For every repo I check out the last commit on or before a fixed date, which I'll call T0 (2025-11-23 for the headline), in a detached worktree, then index and score that. The labels come strictly afterward: a file is defective if a "fix:" commit lands on it in the window (T0, HEAD]. The measurement now precedes the labels, so nothing in the future can reach back and inflate it.
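As a rough illustration of that split, here is a minimal Python sketch. It is not Repowise's benchmark code. The helper names, the "@@commit@@" marker, and the regex that also accepts scoped subjects like "fix(parser):" are assumptions made for the example. The git commands themselves (rev-list, worktree add --detach, log --name-only) are standard.

```python
import re
import subprocess
from datetime import date

MARK = "@@commit@@"  # hypothetical marker used to find commit subjects in the log
# Assumption: Conventional Commits subjects such as "fix:", "fix(scope):", "fix!:".
FIX_SUBJECT = re.compile(r"^fix(\([^)]*\))?!?:")


def git(repo: str, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def snapshot_commit(repo: str, t0: date) -> str:
    """Last commit reachable from HEAD dated on or before the end of day T0 (local time)."""
    before = f"--before={t0.isoformat()} 23:59:59"
    return git(repo, "rev-list", "-1", before, "HEAD").strip()


def add_snapshot_worktree(repo: str, sha: str, path: str) -> None:
    """Check the T0 commit out into a detached worktree, leaving the main checkout alone."""
    git(repo, "worktree", "add", "--detach", path, sha)


def log_after(repo: str, sha: str) -> str:
    """Subjects and touched files for commits in (T0 commit, HEAD]."""
    return git(repo, "log", f"{sha}..HEAD", "--name-only", f"--pretty=format:{MARK}%s")


def defective_files(log_text: str) -> set[str]:
    """Files touched by at least one fix commit in the given log output."""
    defective: set[str] = set()
    in_fix = False
    for line in log_text.splitlines():
        if line.startswith(MARK):
            in_fix = bool(FIX_SUBJECT.match(line[len(MARK):]))
        elif in_fix and line.strip():
            defective.add(line.strip())
    return defective
```

The scoring step runs only on the worktree created from snapshot_commit, and the labels come only from log_after on the same commit, so the two never share a commit.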
There is a subtle trap inside the trap. Repowise's windowed biomarkers compute "the last 90 days" relative to the repo's HEAD, so if you check out a worktree six months in the past, that window points at its head, an empty span, and every windowed biomarker silently never fires while the whole evolutionary half of the score goes dark without erroring. The benchmark has to anchor those windows back to the real repo head (REPOWISE_GIT_WINDOW_ANCHOR=head). I lost an afternoon to a suspiciously flat result before I found that. The thing worth noticing is that scoring at T0 rather than HEAD moved my supposed accuracy down, which is the tell that it was the honest thing to do.
The corpus
21 open-source repositories, covering every one of Repowise's nine Full-tier languages: Python, TypeScript, JavaScript, Rust, Go, Java, Kotlin, C++, and C#. That comes to 2,826 source files, 379 of them touched by a bug-fix in the window. Selection was criteria-driven rather than cherry-picked, in that a repo had to index fast enough to score at T0 inside a budget, use Conventional Commits cleanly, and produce at least five defect-bearing source files in the window. Dead or dormant repos that yield close to zero fixes were dropped as all-negative noise, and that exclusion was decided before scoring, which matters.
[Figure: Per-repository ROC AUC across the full 21-repo corpus]
Each dot is one repo's AUC, the probability that a randomly chosen buggy file scores worse than a randomly chosen clean one. The orange band is the cross-project mean and its bootstrap 95% confidence interval, resampling repos rather than files, because the unit you actually generalize to is a new repository, not a new file in a repo you've already seen.
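To make those two definitions concrete, here is a small Python sketch under stated assumptions: "risk" is any number where higher means worse (for a 0-to-10 health score that could be 10 minus the score), ties count as half a win, and the confidence interval is a plain percentile bootstrap. The function names and defaults are illustrative, not the benchmark's actual code.

```python
import random
from statistics import mean


def auc(risk, buggy):
    """Probability that a random buggy file has higher risk than a random clean one.

    Ties count as half a win. Raises if either class is empty.
    """
    pos = [r for r, b in zip(risk, buggy) if b]
    neg = [r for r, b in zip(risk, buggy) if not b]
    if not pos or not neg:
        raise ValueError("AUC needs at least one buggy and one clean file")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_mean_ci(per_repo_auc, n_boot=2000, seed=0, alpha=0.05):
    """Cross-project mean AUC with a percentile bootstrap CI.

    The resampling unit is the repository: each replicate draws whole repos
    with replacement and averages their AUCs.
    """
    rng = random.Random(seed)  # seeded so the interval is reproducible
    values = list(per_repo_auc)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(values), lo, hi
```

Resampling files instead would treat files from the same repo as independent draws, which typically gives a narrower interval than the repo-to-repo spread justifies.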
There are a few things I want to point at honestly. zod at 0.90 is a gift, since it's small, clean, and structurally varied. axios at 0.55 is a punishment, because it's a micro-library where so much of the surface got touched in the window that there is almost nothing clean left to discriminate against. The spread from 0.55 to 0.90 is real, and it's why I don't quote any single repo. The headline is the resampled cross-project number, 0.737 with a CI of [0.683, 0.787].
Is it just file size in a trench coat?
This is the question every defect-prediction result has to survive, because in this field file size is the single strongest predictor there is. Big files have more code, more code has more bugs, and a "predictor" that's secretly just counting lines is worthless, since you already have wc -l.
There are two answers, and they point in opposite directions, which is the interesting part.
[Figure: Health vs trivial baselines on AUC and Popt]
On raw AUC, the score ties a pure line-count baseline, 0.737 against 0.742, with a paired DeLong test giving p = 0.92. If I stopped here you would be right to be unimpressed. But raw AUC has a bias baked in, in that it rewards size, because big files genuinely carry more bugs, so "always guess the big file" scores well on it. The metric that charges for that is Popt, an effort-aware measure that gives you a fixed inspection budget in lines of code and asks how many defects you catch. Under Popt, "just read the big files" is expensive and gets penalized, and the health score beats line count by +0.134 [0.080, 0.198], which is a clean and significant win.
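For readers who haven't met Popt, the sketch below shows one common formulation in Python. It is an assumption that the benchmark uses exactly this variant. Files are inspected in descending risk order, the curve plots the fraction of defects found against the fraction of lines read, and the area is normalized between the best and worst possible orderings.

```python
def _area(order, loc, bugs):
    """Trapezoidal area under the curve of defects found vs. lines inspected."""
    total_loc = sum(loc)
    total_bugs = sum(bugs)
    x = y = area = 0.0
    for i in order:
        nx = x + loc[i] / total_loc
        ny = y + bugs[i] / total_bugs
        area += (nx - x) * (y + ny) / 2
        x, y = nx, ny
    return area


def popt(risk, loc, bugs):
    """Effort-aware Popt: 1.0 matches the optimal ordering, 0.0 the worst.

    risk: higher means inspect sooner; loc: lines per file (> 0);
    bugs: defect count or 0/1 flag per file.
    """
    if sum(bugs) == 0:
        raise ValueError("Popt is undefined without any defects")
    idx = list(range(len(loc)))
    density = [bugs[i] / loc[i] for i in idx]
    model = sorted(idx, key=lambda i: -risk[i])
    optimal = sorted(idx, key=lambda i: -density[i])  # densest defects first
    worst = sorted(idx, key=lambda i: density[i])     # densest defects last
    a_model = _area(model, loc, bugs)
    a_opt = _area(optimal, loc, bugs)
    a_worst = _area(worst, loc, bugs)
    if a_opt == a_worst:
        raise ValueError("every ordering costs the same; Popt is undefined")
    return 1 - (a_opt - a_model) / (a_opt - a_worst)
```

With three hypothetical files of 10, 100, and 50 lines where only the 10-line and 50-line files are buggy, passing line count as the risk gives popt(loc, loc, bugs) of 0.0, because it spends most of the budget on the clean 100-line file first. That is the sense in which the metric charges for size.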
So the honest statement is that the score is not a size proxy. It carries signal beyond size, confirmed separately by a size-controlled partial Spearman correlation of −0.156 whose CI excludes zero, but it is correlated with size, because size is real signal you don't want to throw away.
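A size-controlled partial Spearman correlation can be sketched in a few lines of Python. This is the textbook first-order partial correlation applied to rank correlations, offered as an illustration rather than the benchmark's implementation. Here x would be the health score, y the defect label, and z the line count.

```python
from math import sqrt


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def partial_spearman(x, y, z):
    """Rank correlation between x and y with the rank effect of z removed."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

A negative value is the expected sign here, since a higher health score should go with fewer defects once size is held fixed.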
Against the process baselines the score wins outright on discrimination: it beats recent churn by +0.10 AUC (DeLong p = 5e-10) and beats prior-defect history by +0.12 (p = 3e-15), separating buggy from clean files better than "what changed lately" or "what broke before."
And here is the result I think most tools would bury. For pure triage ordering, prior-defect history beats the health score on Popt, 0.609 against 0.524. "Re-inspect whatever broke before" is a brutally effective and nearly free heuristic, and the health score does not beat it at raw bug-finding-per-line. What the score adds is discrimination plus an attributable, structural explanation, in that it tells you which of 25 measurable things are wrong with a file, not just that the file has a rap sheet. Those are different jobs, and I would rather say that plainly than pretend one number wins every axis.
Where it fails, in detail
A score is only trustworthy if you know where it breaks, so I sliced the errors.
[Figure: ROC AUC computed strictly within file-size bands]
This is the most important chart in the post and the least flattering. If you compute AUC within a single file-size band, so the size advantage is held constant, most of the signal evaporates. On large files (the top two quartiles) the score discriminates fine, at around 0.67. On the smallest files it's weak, and on the 29-to-68-line band it actually inverts, dropping below random (an AUC under 0.5). The worst false negatives are all small files with zero findings, because every biomarker gate needs some size or activity to fire, and a tidy 40-line file that is nonetheless buggy gives the score nothing to grab. It is structurally blind to small files.
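The banding itself is simple to reproduce. The Python sketch below assumes quartile bands cut on line count and uses the pairwise definition of AUC. The names are illustrative, and the benchmark's exact cut points may differ.

```python
from bisect import bisect_left
from statistics import quantiles


def auc(risk, buggy):
    """Pairwise AUC with ties counted as half; None if a class is missing."""
    pos = [r for r, b in zip(risk, buggy) if b]
    neg = [r for r, b in zip(risk, buggy) if not b]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_size_band(risk, buggy, loc, n_bands=4):
    """AUC computed separately inside each line-count band (band 0 = smallest files)."""
    cuts = quantiles(loc, n=n_bands)  # n_bands - 1 cut points
    bands = {}
    for r, b, size in zip(risk, buggy, loc):
        band = bisect_left(cuts, size)
        bands.setdefault(band, ([], []))
        bands[band][0].append(r)
        bands[band][1].append(b)
    return {band: auc(*bands[band]) for band in sorted(bands)}
```

Because every file in a band has roughly the same size, a score that only tracks size will hover near 0.5 inside each band, even if its pooled AUC looks healthy.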
I dug into the Q2 inversion, the second size quartile that covers that 29-to-68-line band. Two biomarkers, primitive_obsession and dry_violation, fire constantly on small modules where that shape is idiomatic and harmless, and on those files they are anti-correlated with bugs, while the genuine small-file predictors barely fire there. That diagnosis led to a narrow gate fix, where primitive_obsession now only fires in modules of 60 or more non-blank lines, which nudges the Q2 band back above random without regressing anything. It was evidence-driven and validated by re-scoring the cached findings rather than guessed.
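In code, a gate of that kind might look like the sketch below. This is a hypothetical illustration in Python, not Repowise's implementation. Only the 60-line threshold and the biomarker name come from the text above. The function names and the idea of passing in a raw detector result are assumptions.

```python
MIN_NONBLANK_LINES = 60  # threshold stated in the post


def nonblank_line_count(source: str) -> int:
    """Number of lines that contain something other than whitespace."""
    return sum(1 for line in source.splitlines() if line.strip())


def gate_primitive_obsession(raw_hit: bool, source: str) -> bool:
    """Keep a primitive_obsession finding only when the module is large enough.

    raw_hit is the ungated detector result (hypothetical input).
    """
    return raw_hit and nonblank_line_count(source) >= MIN_NONBLANK_LINES
```

Keeping the gate separate from the detector is what makes it possible to re-score cached findings with a new threshold without re-running the analysis.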
I'm showing you this because "where does it stop working" is the most useful thing a benchmark produces, and "big files, mostly" is the honest boundary.
Did I just get lucky with the date?
A single T0 could be a fluke window, so I re-ran the entire leakage-free pipeline at three rolling six-month start dates, re-indexing every repo at each one.
[Figure: Cross-project mean AUC across three rolling T0 windows]
The three came out at 0.771, 0.703, and 0.754. Their mean is 0.743, and the individual values bracket the full-corpus 0.737. Individual repos wander between windows (hono swings from 0.54 to 0.72 as its in-window bug set changes), which is exactly why I report the cross-project mean over a diverse corpus and not any single cell. The aggregate is stable, so it isn't a lucky date.
I also re-ran everything under a completely different labeling strategy, leakage-free SZZ, which uses git blame to trace each fix back to the commit that introduced the buggy line and only counts files whose bug already existed at T0. SZZ strips 17% of the keyword labels as noise, yet the measured accuracy barely moves (mean AUC 0.744 against 0.734) and every significance verdict reproduces. The score predicts "where bugs originate" about as well as "where fixes land," and that robustness mattered more to me than any single headline.
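The filtering rule at the heart of that labeling can be sketched in Python as follows. The blame step is left out, so each record is assumed to already carry the date of the commit that blame attributed the fixed line to. The BlamedFix type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BlamedFix:
    """One fixed line, traced back to the commit that introduced it (hypothetical record)."""
    path: str
    fixed_on: date        # date of the fix commit
    introduced_on: date   # date of the commit that blame attributes the line to


def szz_defective_files(fixes, t0: date, head: date) -> set[str]:
    """Files with a fix in (T0, HEAD] whose buggy line already existed at T0."""
    return {
        f.path
        for f in fixes
        if t0 < f.fixed_on <= head and f.introduced_on <= t0
    }
```

A fix for a bug that was both introduced and repaired after T0 is dropped, because no score computed at T0 could have seen the code that caused it.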
One out-of-distribution check
Everything above is my corpus and my labels. To place the score against the field, I ran it on a dataset it has never seen and I didn't build: the PROMISE/Jureczko jEdit benchmark, the canonical CK-metrics-plus-post-release-bugs set that hundreds of defect-prediction papers use. On a single-release snapshot there is no git history, so only the structural half of the score runs, and the strongest evolutionary biomarkers contribute nothing. Even so it lands at AUC 0.76 and 0.78 across two releases, within about 0.03 of the dataset's own cross-validated full-CK-metric model, and again beats line count on Popt. It is one project and structural-only, so I won't oversell it, but it's a real external check rather than a home-field number.
What I actually believe now
The score is a useful triage signal rather than an oracle. Mean AUC around 0.74 is roughly the ceiling for file-level defect prediction from static and process signals, and anyone quoting 0.95 is either leaking, overfitting one repo, or selling something. Within that ceiling, the score earns its place: it beats every cheap process baseline on discrimination, it carries signal beyond file size, it's stable across time and across two independent label definitions, and, the part I care about most, it's attributable. When it flags a file it can tell you which biomarkers fired and why.
It is also honestly size-correlated, near-blind on small files, and beaten by "inspect what broke before" if all you want is the cheapest possible review queue. Those aren't bugs in the benchmark, they're the shape of the problem, and I would rather hand you a score whose failure modes I can draw than one whose marketing I can.
The whole study reproduces from a committed cache with one command, and every number here carries a seeded bootstrap CI and an n. If you want to see what the score looks like on a real codebase rather than in aggregate, the live examples run it end to end, and the architecture page shows where it sits in the pipeline.
One thread I left dangling on purpose is that the benchmark also told me which of the 25 biomarkers carry the prediction, and the answer surprised me enough to write it up separately. The strongest predictors aren't the complexity metrics everyone reaches for, they're the evolutionary ones, and that's the subject of the next post.


