Does our code-health score actually predict bugs? A leakage-free benchmark

Raghav Chamadiya·June 3, 2026·14 min read

code health scoredefect predictionsoftware metrics benchmarkROC AUC code qualitystatic analysis bugsSZZ

0.74 AUC across 21 repos and 9 languages. That is how often a deterministic code-health score, computed six months before the bugs landed, ranks a file that later gets bug-fixed below one that stays clean.

I couldn't stop asking the uncomfortable version of the question first. A score like this is one number per file, 0 to 10, built from 25 markers with no model in the loop, the same inputs always giving the same output. People put numbers like that in dashboards and PR gates and act on them. So does it actually predict where bugs show up, or is it a well-dressed line count?

This is the benchmark I built to find out, including the parts where the answer was "not as much as I'd hoped." One number to keep: cross-project mean ROC AUC 0.737, at predicting which files get bug-fixed over the following six months. A second, if you keep two: on raw discrimination it ties a file's line count, and most of this post is why that is both true and not the whole story. For the full picture, see the complete code-health guide.

The one mistake that makes every benchmark like this a lie

The methodology comes before any numbers, because it is where almost everyone, including an earlier version of me, goes wrong.

The naive way to test a defect predictor goes like this. You take a repo at HEAD, score every file, label a file "defective" if it got a fix: commit in the last six months, and compute how well the score ranks the buggy files above the clean ones. It's easy, it's what a lot of published tooling quietly does, and it leaks.

The leak is structural. Several of the strongest markers are evolutionary, meaning they read git history over a recent window: how much a file churned, how many people touched it, how scattered its co-changes are, how chaotic its commit pattern looks (change entropy). Now think about what a bug-fix commit actually is. It is a commit, it touches files, and it bumps their churn, their author count, and their entropy. When you score at HEAD and label from the same recent window, the bug-fix commits you are trying to predict are the very activity your marker "detects," so the score isn't predicting the future, it's reading the answer key.

The fix is to score in the past. For every repo I check out the last commit on or before a fixed date, which I'll call T0 (2025-11-23 for the headline), in a detached worktree, then index and score that. The labels come strictly afterward: a file is defective if a fix: commit lands on it in the window (T0, HEAD]. The measurement now precedes the labels, so nothing in the future can reach back and inflate it.

There is a subtle trap inside the trap. The windowed markers compute "the last 90 days" relative to the repo's HEAD, so if you check out a worktree six months in the past, that window points at its head, an empty span, and every windowed marker silently never fires while the whole evolutionary half of the score goes dark without erroring. The benchmark has to anchor those windows back to the real repo head (REPOWISE_GIT_WINDOW_ANCHOR=head). I lost an afternoon to a suspiciously flat result before I found that. Scoring at T0 instead of HEAD moved my accuracy down. That is the tell it was the honest thing to do.

The corpus

21 open-source repositories, covering nine languages: Python, TypeScript, JavaScript, Rust, Go, Java, Kotlin, C++, and C#. That comes to 2,826 source files, 379 of them touched by a bug-fix in the window. Selection was criteria-driven rather than cherry-picked, in that a repo had to index fast enough to score at T0 inside a budget, use Conventional Commits cleanly, and produce at least five defect-bearing source files in the window. Dead or dormant repos that yield close to zero fixes were dropped as all-negative noise, and that exclusion was decided before scoring, which matters.

Per-repository ROC AUC across the full 21-repo corpus

Each dot is one repo's AUC, the probability that a randomly chosen buggy file scores worse than a randomly chosen clean one. The orange band is the cross-project mean and its bootstrap 95% confidence interval, resampling repos rather than files, because the unit you actually generalize to is a new repository, not a new file in a repo you've already seen.

There are a few things I want to point at honestly. zod at 0.90 is a gift, since it's small, clean, and structurally varied. axios at 0.55 is a punishment, because it's a micro-library where so much of the surface got touched in the window that there is almost nothing clean left to discriminate against. The spread from 0.55 to 0.90 is real, and it's why I don't quote any single repo. The headline is the resampled cross-project number, 0.737 with a CI of [0.683, 0.787].

Is it just file size in a trench coat?

This is the question every defect-prediction result has to survive, because in this field file size is the single strongest predictor there is. Big files have more code, more code has more bugs, and a "predictor" that's secretly just counting lines is worthless, since you already have wc -l.

There are two answers, and they point in opposite directions, which is the interesting part.

Health vs trivial baselines on AUC and Popt

On raw AUC, the score ties a pure line-count baseline, 0.737 against 0.742, with a paired DeLong test giving p = 0.92. If I stopped here you would be right to be unimpressed. But raw AUC has a bias baked in, in that it rewards size, because big files genuinely carry more bugs, so "always guess the big file" scores well on it. The metric that charges for that is Popt, an effort-aware measure that gives you a fixed inspection budget in lines of code and asks how many defects you catch. Under Popt, "just read the big files" is expensive and gets penalized, and the health score beats line count by +0.134 [0.080, 0.198], which is a clean and significant win.

So the honest statement is that the score is not a size proxy. It carries signal beyond size, confirmed separately by a size-controlled partial Spearman correlation of −0.156 whose CI excludes zero, but it is correlated with size, because size is real signal you don't want to throw away.

Against the process baselines the score wins outright on discrimination: it beats recent churn by +0.10 AUC (DeLong p = 5e-10) and beats prior-defect history by +0.12 (p = 3e-15), separating buggy from clean files better than "what changed lately" or "what broke before."

I expected the score to beat "inspect what broke before" on every axis. It doesn't. For pure triage ordering, prior-defect history wins on Popt, 0.609 against 0.524. Re-inspecting whatever broke before is a brutally effective and nearly free heuristic, and the score does not beat it at raw bug-finding-per-line. What the score adds is discrimination and an attributable explanation: it tells you which of 25 measurable things are wrong with a file. A rap sheet only tells you the file has broken before. Those are different jobs, and I would rather say that plainly than pretend one number wins every axis.

Where it fails, in detail

A score is only trustworthy if you know where it breaks, so I sliced the errors.

ROC AUC computed strictly within file-size bands

This is the most important chart in the post and the least flattering. If you compute AUC within a single file-size band, so the size advantage is held constant, most of the signal evaporates. On large files (the top two quartiles) the score discriminates fine, at around 0.67. On the smallest files it's weak, and on the 29-to-68-line band it actually inverts below random. The worst false negatives are all small files with zero findings, because every marker gate needs some size or activity to fire, and a tidy 40-line file that is nonetheless buggy gives the score nothing to grab. It is structurally blind to small files.

I dug into the Q2 inversion. Two markers, primitive_obsession and dry_violation, fire constantly on small modules where that shape is idiomatic and harmless, and on those files they are anti-correlated with bugs, while the genuine small-file predictors barely fire there. That diagnosis led to a narrow gate fix, where primitive_obsession now only fires in modules of 60 or more non-blank lines, which nudges the Q2 band back above random without regressing anything. It was evidence-driven and validated by re-scoring the cached findings rather than guessed.

I'm showing you this because "where does it stop working" is the most useful thing a benchmark produces, and "big files, mostly" is the honest boundary.

Did I just get lucky with the date?

A single T0 could be a fluke window, so I re-ran the entire leakage-free pipeline at three rolling six-month start dates, re-indexing every repo at each one.

Cross-project mean AUC across three rolling T0 windows

The three came out at 0.771, 0.703, and 0.754, a mean of 0.743 that brackets the full-corpus 0.737. Individual repos wander between windows (hono swings from 0.54 to 0.72 as its in-window bug set changes), which is exactly why I report the cross-project mean over a diverse corpus and not any single cell. The aggregate is stable, so it isn't a lucky date.

I also re-ran everything under a completely different labeling strategy, leakage-free SZZ, which git blames each fix back to the commit that introduced the buggy line and only counts files whose bug already existed at T0. SZZ strips 17% of the keyword labels as noise, yet the measured accuracy barely moves (mean AUC 0.744 against 0.734) and every significance verdict reproduces. The score predicts "where bugs originate" about as well as "where fixes land," and that robustness mattered more to me than any single headline.

One out-of-distribution check

Everything above is my corpus and my labels. To place the score against the field, I ran it on a dataset it has never seen and I didn't build: the PROMISE/Jureczko jEdit benchmark, the canonical CK-metrics-plus-post-release-bugs set that hundreds of defect-prediction papers use. On a single-release snapshot there is no git history, so only the structural half of the score runs, and the strongest evolutionary markers contribute nothing. Even so it lands at AUC 0.76 and 0.78 across two releases, within about 0.03 of the dataset's own cross-validated full-CK-metric model, and again beats line count on Popt. It is one project and structural-only, so I won't oversell it, but it's a real external check rather than a home-field number.

What I actually believe now

The score is a useful triage signal rather than an oracle. Mean AUC around 0.74 is roughly the ceiling for file-level defect prediction from static and process signals, and anyone quoting 0.95 is either leaking, overfitting one repo, or selling something. Within that ceiling, the score earns its place: it beats every cheap process baseline on discrimination, it carries signal beyond file size, it's stable across time and across two independent label definitions, and, the part I care about most, it's attributable. When it flags a file it can tell you which markers fired and why.

It is also honestly size-correlated, near-blind on small files, and beaten by "inspect what broke before" if all you want is the cheapest possible review queue. Those aren't bugs in the benchmark, they're the shape of the problem, and I would rather hand you a score whose failure modes I can draw than one whose marketing I can.

The whole study reproduces from a committed cache with one command, and every number here carries a seeded bootstrap CI and an n. If you want to see what the score looks like on a real codebase rather than in aggregate, the live examples run it end to end, and the architecture page shows where it sits in the pipeline.

One thread I left dangling on purpose is that the benchmark also told me which of the 25 markers carry the prediction, and the answer surprised me enough to write it up separately. The strongest predictors turned out to be the evolutionary ones. The complexity metrics everyone reaches for came second. That is the next post.

Does our code-health score actually predict bugs? A leakage-free benchmark

The one mistake that makes every benchmark like this a lie

The corpus

Is it just file size in a trench coat?

Where it fails, in detail

Did I just get lucky with the date?

One out-of-distribution check

What I actually believe now

Does Code Health Predict Bugs? 21 Repos, 9 Languages, ROC AUC 0.74

Is AI-written code buggier than human code? We blamed 112,000 commits to find out

Process metrics beat structural metrics for predicting defects

Try repowise on your repo