CODE HEALTH GUIDE

Code Health: The Complete Guide to Defect-Validated Code Quality

How repowise scores every file from 1 to 10 with 25 deterministic biomarkers, why the score predicts real bugs, and how to turn it into a ranked worklist you act on the same day.

  • 0.74: cross-project ROC AUC across 21 repos and 9 languages, up to 0.90 per repo
  • 2.3x: more defects caught than a leading commercial tool at equal review budget
  • 25: deterministic biomarkers, zero LLM, reproducible on your repo
  • <30s: to score a 3,000-file repo, fully local

By Raghav Chamadiya · Updated June 2026 · 14 min

TL;DR

repowise scores every file in your repo from 1 to 10 using 25 deterministic biomarkers, with no LLM and no cloud, reproducible byte-for-byte. The weights are calibrated against a real defect corpus, so the score reaches 0.74 cross-project ROC AUC across 21 repos and 9 languages and catches 2.3x more defects than a leading commercial tool at the same review budget. It runs in under 30 seconds on a 3,000-file repo and turns the score into a ranked worklist of what to fix first.

DEFINITION

Code health is a per-file score from 1 (worst) to 10 (best) that estimates how likely a file is to harbor defects. repowise computes it from 25 deterministic biomarkers over tree-sitter ASTs and git history, with weights calibrated against a real defect corpus. It is reproducible: the same code always yields the same score, with zero LLM calls.

[Figure: repowise code health dashboard showing a 1-10 code health score per file with defect-risk KPIs]
One defect-validated number per file, with the three repo-level KPIs above it.

Why does code health matter?

Most code-quality tools hand you a score and ask you to trust it, without showing whether the score actually finds the bugs.

A maintainability number is only useful if the files it flags are the files that break. Left unmeasured, that gap costs real money: review time spent on the wrong files, regressions in code nobody flagged, and "tech debt" arguments with no evidence behind them.

  • A score you cannot defend in review is a score the team ignores.
  • The files that break are rarely the files that feel risky, so gut feel mis-ranks effort.
  • Without calibration, a quality score measures what is easy to measure, not what predicts bugs.

How does code health work?

repowise treats "does this score find bugs?" as a measurable claim and publishes the answer. The mechanism is deterministic end to end.

1. Index. repowise parses your repo into a graph and reads its git history. No code is sent anywhere.

2. Score with 25 biomarkers. Each file starts at 10.0; biomarker findings deduct from the score, capped per category so no single category dominates.

Category | Biomarkers
Structural complexity | brain_method, nested_complexity, bumpy_road, complex_conditional, complex_method, large_method, primitive_obsession
Cohesion & size | low_cohesion (LCOM4), god_class
Duplication | dry_violation (native Rabin-Karp clone detection)
Error handling | error_handling (empty catches, swallowed errors, bare panics)
Test coverage | untested_hotspot, coverage_gap, coverage_gradient
Test quality | large_assertion_block, duplicated_assertion_block
Organizational / git | developer_congestion, knowledge_loss, hidden_coupling, function_hotspot, code_age_volatility, ownership_risk, churn_risk, change_entropy, co_change_scatter, prior_defect
Performance | io_in_loop (N+1), string_concat_in_loop, blocking_sync_in_async, resource_construction_in_loop, serial_await_in_loop, and more
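
The start-at-10.0, deduct, cap-per-category rule can be sketched as below. This is a minimal illustration, not repowise's implementation: the weights, the caps, the category map, and the floor at 1.0 are all hypothetical values chosen for the example, since the guide does not publish the learned constants.

```python
from collections import defaultdict

# Hypothetical per-finding deductions and per-category caps. The real
# constants are learned offline and are not given in this guide.
WEIGHTS = {"nested_complexity": 0.8, "brain_method": 0.5, "churn_risk": 1.2}
CATEGORY = {
    "nested_complexity": "structural",
    "brain_method": "structural",
    "churn_risk": "git",
}
CAPS = {"structural": 1.0, "git": 3.0}


def score_file(findings, weights=WEIGHTS, caps=CAPS, category=CATEGORY):
    """Start at 10.0, subtract capped per-category deductions, floor at 1.0."""
    per_category = defaultdict(float)
    for biomarker in findings:
        per_category[category[biomarker]] += weights.get(biomarker, 0.0)
    # Cap each category's total so no single category dominates the score.
    total = sum(min(amount, caps[cat]) for cat, amount in per_category.items())
    return round(max(1.0, 10.0 - total), 2)
```

With these made-up numbers, two structural findings that sum to 1.3 are capped at 1.0, so the file lands at 9.0 rather than 8.7.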

3. Calibrate, don't hand-tune. Each file is scored at the commit before a 6-month defect window (T0, so the measurement cannot leak the future). An L2-regularized logistic regression with file size (NLOC) as an explicit control fits each biomarker's defect lift beyond size. The strongest calibrated predictors: co_change_scatter, change_entropy, ownership_risk, nested_complexity. Only the learned constants ship.
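
To make the calibration step concrete, here is a rough sketch of an L2-regularized logistic regression with file size as an explicit control, written as plain gradient descent so it has no dependencies. The toy data, the hyperparameters, and the choice of standardized log(NLOC) as the size feature are assumptions for illustration only; the guide does not describe the actual fitting pipeline at this level.

```python
import math


def fit_logistic_l2(rows, labels, l2=0.1, lr=0.5, epochs=2000):
    """Batch gradient descent on L2-penalized log loss; returns (bias, weights)."""
    n, d = len(rows), len(rows[0])
    bias, weights = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            grad_b += p - y
            for j in range(d):
                grad_w[j] += (p - y) * x[j]
        bias -= lr * grad_b / n
        # The L2 penalty shrinks weights toward zero; the bias is not penalized.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
    return bias, weights


# Toy T0 snapshot: (biomarker fired?, NLOC, defect in the following window?).
# Invented for illustration; this is not the corpus behind the published numbers.
files = [
    (1, 900, 1), (1, 850, 1), (1, 120, 1), (1, 100, 0),
    (0, 950, 1), (0, 800, 0), (0, 110, 0), (0, 90, 0),
]
log_nloc = [math.log(nloc) for _, nloc, _ in files]
mean = sum(log_nloc) / len(log_nloc)
std = (sum((v - mean) ** 2 for v in log_nloc) / len(log_nloc)) ** 0.5
rows = [[float(flag), (v - mean) / std] for (flag, _, _), v in zip(files, log_nloc)]
labels = [defect for _, _, defect in files]

# w_biomarker is the biomarker's association with defects after the size
# feature has absorbed what file size alone explains.
bias, (w_biomarker, w_size) = fit_logistic_l2(rows, labels)
```

Because size sits in the model as its own feature, a biomarker that only fires on large files gets little weight of its own, which is the point of controlling for NLOC.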

4. Split into three co-equal signals. One shared scoring kernel runs against three independent weight/cap tables. They never feed back into each other; the surfaced score stays exactly the defect score (a golden test locks this byte-for-byte).

Pillar | What it scores
Defect risk | The headline 1-10 number, calibrated against the defect corpus (Alert files carry roughly 17x the defect rate of Healthy files).
Maintainability | Smells that the defect calibration floors (low_cohesion, brain_method, primitive_obsession, dry_violation, error_handling), scored at full weight in the pillar where they belong.
Performance risk | Static N+1 / IO-in-loop shapes that waste work: high-precision, low-recall, always advisory.

How does code health help you?

The score is the start. repowise turns it into something you act on the same day.

Refactoring prioritization

repowise ranks the files worth fixing by impact for effort, so you spend a refactoring budget where it pays back most.

Guardrail: this is a worklist, not an auto-refactor — repowise ranks and explains, it never rewrites your code.

  • Each target carries a score, the driving biomarker, an impact estimate, and an effort bucket (S / M / L / XL).
  • Ranking is deterministic and rule-based, with no LLM, so the worklist is reproducible across runs.
  • High-impact / low-effort targets float to the top; you can mark findings acknowledged, resolved, or false_positive.

[Figure: repowise refactoring prioritization worklist ranking files by impact per effort with S / M / L / XL effort buckets]
Impact-for-effort, ranked: the top rows buy the most health for the least work.
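
A deterministic impact-for-effort ranking could look roughly like the sketch below. The Target fields mirror the bullets above, but the relative cost assigned to each effort bucket and the tie-break on path are assumptions made for this example, not repowise's actual rule.

```python
from dataclasses import dataclass

# Hypothetical relative cost per effort bucket.
EFFORT_COST = {"S": 1, "M": 2, "L": 4, "XL": 8}


@dataclass(frozen=True)
class Target:
    path: str
    score: float      # current 1-10 health score
    biomarker: str    # the driving biomarker
    impact: float     # estimated health gained by fixing it
    effort: str       # "S", "M", "L", or "XL"


def rank_targets(targets):
    """Highest impact per unit of effort first; ties broken by path so runs agree."""
    return sorted(
        targets,
        key=lambda t: (-t.impact / EFFORT_COST[t.effort], t.path),
    )
```

No randomness or model call is involved, so the same inputs always give the same order.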

Performance risk

The third pillar flags code whose structure does redundant I/O, not measured runtime.

Guardrail: this is a static performance-risk pillar, not an APM or profiler — no runtime, no traces.

  • The headline detector is io_in_loop (the N+1): a db / network / filesystem / subprocess call that runs once per loop iteration, resolved through a shared I/O-boundary classifier so it only fires on a real round-trip.
  • A bounded-depth (≤3 hops) call-graph walk catches the interprocedural case no file-local linter sees; cross-function findings carry the resolved caller → … → sink path.
  • Hand-label validated across an 11-repo OSS corpus: Go 96.7%, TypeScript 100%, Python 96.2% precision. A language without a dialect emits no perf findings, never a wrong one.
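
For readers unfamiliar with the N+1 shape, the sketch below shows it next to a batched rewrite, using Python's built-in sqlite3 module. It illustrates the pattern only; whether repowise flags this exact code depends on its I/O-boundary classifier, which is not described here, and the table and function names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin"), (3, "sam")]
)


def names_n_plus_one(conn, user_ids):
    """One query per loop iteration: the io_in_loop (N+1) shape."""
    names = []
    for user_id in user_ids:
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        names.append(row[0])
    return names


def names_batched(conn, user_ids):
    """One round-trip for the whole batch, then reorder in memory."""
    if not user_ids:
        return []
    placeholders = ",".join("?" for _ in user_ids)
    rows = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})",
        list(user_ids),
    ).fetchall()
    by_id = dict(rows)
    return [by_id[user_id] for user_id in user_ids]
```

Both functions return the same result; the difference is that the first issues one query per ID while the second issues one query in total.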

Coverage to untested hotspots

Point repowise at a coverage report, such as an LCOV or Cobertura file, and it intersects coverage with churn-complexity to find the files that are both risky and untested.

Guardrail: coverage here is an input that sharpens risk, not a coverage dashboard.

  • repowise health --coverage cov.lcov lights up untested_hotspot, coverage_gap, and coverage_gradient biomarkers.
  • Accepts LCOV, Cobertura, Clover, or normalized JSON.
  • The output is "fix these untested risky files," not a wall of per-line coverage percentages.

[Figure: repowise untested hotspots view intersecting low test coverage with high churn-complexity risk]
Coverage as a risk input: untested and risky rises to the top; plain coverage does not.
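
The intersection of coverage with risk can be sketched as follows. The parser reads only the SF, DA, and end_of_record records of the LCOV tracefile format; the risk values, both thresholds, and the choice to treat files absent from the report as 0% covered are assumptions for this example, not repowise's documented behavior.

```python
def parse_lcov(text):
    """Return {path: line coverage in [0, 1]} from LCOV SF/DA records."""
    coverage, path, hit, total = {}, None, 0, 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("SF:"):
            path, hit, total = line[3:], 0, 0
        elif line.startswith("DA:") and path is not None:
            # DA:<line number>,<execution count>[,<checksum>]
            fields = line[3:].split(",")
            total += 1
            hit += int(fields[1]) > 0
        elif line == "end_of_record" and path is not None:
            coverage[path] = hit / total if total else 0.0
            path = None
    return coverage


def untested_hotspots(coverage, risk, max_coverage=0.5, min_risk=0.7):
    """Paths that are both risky and poorly covered, riskiest first.

    `risk` is a hypothetical {path: churn-complexity risk in [0, 1]} map.
    Files missing from the coverage report are treated as 0% covered.
    """
    flagged = [
        path
        for path, file_risk in risk.items()
        if file_risk >= min_risk and coverage.get(path, 0.0) <= max_coverage
    ]
    return sorted(flagged, key=lambda path: (-risk[path], path))
```

A well-covered risky file and a poorly covered low-risk file both stay off the list; only the overlap is reported.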

Dead code

Before a cleanup sprint, repowise surfaces unreachable files, unused exports, and zombie packages, tiered by confidence.

  • Findings tier high (≥0.8) / medium / low, with per-directory and per-owner rollups.
  • safe_only returns deletion-ready findings with no runtime-load risk.
  • Available over MCP via get_dead_code and in the dashboard.

Health trends

A rolling 50-row snapshot history powers Declining Health and Predicted Decline alerts, so decay surfaces before it compounds.

  • repowise health --trend shows snapshots plus declining and predicted-decline alerts.
  • The Repowise PR Bot comments deterministically when health regresses, with zero LLM calls.
  • Trend data is surfaced in get_health and on the dashboard.
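
One way to picture a rolling snapshot history with the two alert types is the sketch below. Only the 50-row window comes from the text; the alert rules (a fixed drop across the window, and a least-squares slope projected a few snapshots ahead) and every threshold are assumptions for illustration, not repowise's actual logic.

```python
from collections import deque


class HealthTrend:
    """Rolling window of health scores with two illustrative alert rules."""

    def __init__(self, max_rows=50):
        # A deque with maxlen drops the oldest snapshot automatically.
        self.snapshots = deque(maxlen=max_rows)

    def record(self, score):
        self.snapshots.append(score)

    def slope(self):
        """Least-squares slope of score against snapshot index."""
        n = len(self.snapshots)
        if n < 2:
            return 0.0
        mean_x = (n - 1) / 2
        mean_y = sum(self.snapshots) / n
        cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(self.snapshots))
        var = sum((i - mean_x) ** 2 for i in range(n))
        return cov / var

    def alerts(self, drop=0.5, horizon=10, floor=4.0):
        """Hypothetical rules: thresholds here are made up for the example."""
        if len(self.snapshots) < 2:
            return []
        found = []
        latest = self.snapshots[-1]
        if self.snapshots[0] - latest >= drop:
            found.append("declining")
        slope = self.slope()
        if slope < 0 and latest + slope * horizon < floor:
            found.append("predicted_decline")
        return found
```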

Walkthrough: from score to fix

Step 1 — Read the dashboard. Open code health and read the three KPIs: Hotspot Health (NLOC-weighted over hotspot files), Average Health (all files), and Worst Performer (the single lowest file).

[Figure: code health dashboard KPIs: Hotspot Health, Average Health, Worst Performer]
Start here: the three KPIs frame the whole repo.
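
The three KPIs reduce to simple aggregates, sketched below. The input shape and the is_hotspot flag are hypothetical (the guide does not say how a file qualifies as a hotspot), and Average Health is assumed to be an unweighted mean over all files.

```python
def repo_kpis(files):
    """files: list of dicts with 'path', 'score', 'nloc', and 'is_hotspot'."""
    hotspots = [f for f in files if f["is_hotspot"]]
    hotspot_nloc = sum(f["nloc"] for f in hotspots)
    # Hotspot Health: mean score over hotspot files, weighted by NLOC.
    hotspot_health = (
        sum(f["score"] * f["nloc"] for f in hotspots) / hotspot_nloc
        if hotspot_nloc
        else None
    )
    # Average Health: assumed here to be a plain mean over every file.
    average_health = sum(f["score"] for f in files) / len(files)
    # Worst Performer: the single lowest-scoring file.
    worst = min(files, key=lambda f: f["score"])
    return {
        "hotspot_health": hotspot_health,
        "average_health": average_health,
        "worst_performer": worst["path"],
    }
```

NLOC weighting means a large unhealthy hotspot pulls Hotspot Health down further than a small one with the same score.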

Step 2 — Open a low-scoring file. Click into a file to see its biomarker findings, each tagged with a dimension (defect / maintainability / performance) so you can filter by pillar.

Step 3 — Check the worklist. Switch to refactoring targets, sorted by impact for effort. Pick the highest-impact S or M rows first.

[Figure: refactoring worklist ranked by impact per effort in repowise]
The first row is the best trade you can make this week.

Step 4 — Add coverage. Run repowise health --coverage cov.lcov to surface untested hotspots and re-rank with test gaps factored in.

Step 5 — Watch the trend. Run repowise health --trend (or let the PR bot do it) so any file's decline raises an alert before it ships.

Proof: does the score predict real bugs?

Every claim below is reproducible on your own repo: the heuristics are open source under AGPL-3.0.

Result | Value
Cross-project mean ROC AUC | 0.74 [95% CI 0.68-0.79], up to 0.90 per repo
Defects caught vs a leading commercial tool, equal budget | 2.3x (recall 0.173 vs 0.074, Popt 0.607 vs 0.462)
Head-to-head discrimination (ROC AUC) | 0.731 vs 0.705, on the same 2,770 files / 9 languages
Survives controlling for file size | partial Spearman ρ = −0.16
Beats recent-churn baseline | +0.10 AUC (DeLong p < 1e-9)
Beats prior-defect baseline | +0.12 AUC
External, never-seen dataset (PROMISE/jEdit) | AUC 0.76-0.78
Biomarkers | 25, deterministic, zero LLM, reproducible
Speed | < 30s on a 3,000-file repo

Full methodology and reproduction steps live in the defect-prediction validation study.
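
For readers who want to see what the two headline metrics measure, here is a small sketch of ROC AUC and of recall at a fixed review budget. It assumes risk is taken as 10 minus the health score and that the budget is a fraction of total NLOC reviewed; the validation study may define both differently, so treat this as an explanation of the metrics rather than a reproduction of the published numbers.

```python
def roc_auc(risk, defective):
    """Chance that a random defective file outranks a random clean one.

    Ties count as half a win. `risk` is higher-is-riskier, for example
    10 minus the health score.
    """
    pos = [r for r, d in zip(risk, defective) if d]
    neg = [r for r, d in zip(risk, defective) if not d]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_budget(files, budget_fraction):
    """Share of defects found when reviewing lowest-health files first.

    files: list of (health, nloc, defects). Review stops at the first file
    that would push reviewed NLOC past budget_fraction of the total.
    """
    total_nloc = sum(nloc for _, nloc, _ in files)
    total_defects = sum(defects for _, _, defects in files)
    spent = caught = 0
    for health, nloc, defects in sorted(files, key=lambda f: f[0]):
        if spent + nloc > budget_fraction * total_nloc:
            break
        spent += nloc
        caught += defects
    return caught / total_defects
```

Two tools can be compared at equal budget by ranking the same files with each tool's score and calling recall_at_budget with the same budget_fraction.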

Try it on your repo

See whether the score finds the bugs in your repo: every heuristic is open source and runs locally.

pip install repowise
repowise health                       # KPIs + lowest-scoring files
repowise health --coverage cov.lcov   # untested-hotspot detection
repowise health --refactoring-targets # ranked by impact / effort
repowise health --trend               # snapshots + decline alerts

FREQUENTLY ASKED

Questions, answered

Does code health predict bugs?

Yes, that is the whole design. The biomarker weights are calibrated against a real defect corpus at a leakage-free commit, and the score reaches 0.74 cross-project ROC AUC across 21 repos and 9 languages (up to 0.90 per repo). On the same 2,770 files, ranking by repowise health surfaces 2.3x the defects of a leading commercial tool at the same review budget.

Does code health check performance?

Yes, as a static performance-risk pillar, not an APM or profiler, with no runtime and no traces. It flags code whose structure does redundant I/O: N+1 / io_in_loop, string-concat-in-loop, blocking-sync-in-async, and more, resolved through an I/O-boundary classifier and a bounded call-graph walk. It is high-precision and low-recall (Go 96.7%, TypeScript 100%, Python 96.2% precision), so it under-reports rather than cries wolf.

Does repowise refactor my code for me?

No. Refactoring targets are a worklist, not an auto-refactor: repowise ranks and explains, it never rewrites your code. Each target carries an impact estimate, an effort bucket (S/M/L/XL), and the driving biomarker, ranked deterministically by impact for effort so you decide what to fix.

Can code health use my test coverage?

Yes. repowise ingests LCOV, Cobertura, Clover, or normalized JSON via repowise health --coverage. Coverage is an input that sharpens risk, not a coverage dashboard: it intersects low coverage with high churn-complexity to surface untested hotspots through the untested_hotspot, coverage_gap, and coverage_gradient biomarkers.

How is code health different from CodeScene or SonarQube?

Both score code quality, but repowise is open source so every heuristic is inspectable and reproducible on your own repo, and the score is defect-validated and published head-to-head against a leading commercial tool (2.3x more defects at equal budget). The score also ships inside a broader platform (architecture wiki, git intelligence, decisions, and MCP tools), and repowise does not offer AI auto-refactoring.

Does code health use an LLM?

No. Scoring is fully deterministic: 25 biomarkers with weights calibrated offline against a defect corpus. Only the learned constants ship, so the same code always produces the same score in under 30 seconds on a 3,000-file repo, with no cloud, no API calls, and no drift.

Last reviewed: June 2026
