Case Study: We Build repowise With repowise

Raghav Chamadiya · July 4, 2026 · 7 min read

repowise case study, dogfooding, ai coding agent tokens, mcp codebase context, codebase intelligence benchmark

Every line of repowise is written in Claude Code sessions that run against a repowise index of the repowise codebase itself. That has been true since the first month of the project. This post is the case study we would want to read before adopting a tool like this: what the setup looks like, what we measured, and where the honest limits are.

The headline number, measured rather than estimated: loading context through repowise's get_context tool took 2,391 tokens per commit where naive file loading took 64,039. That is 96% fewer tokens, a reduction of roughly 27x, with answer quality at parity in the paired agent runs. The rest of this post explains exactly where that number comes from and what it looks like in daily use.

The setup

The repowise monorepo (Python core, TypeScript UI, ~1,500 source files across the architectural layers) is indexed by repowise. Four things sit on top of that index in every working session:

A generated CLAUDE.md. repowise init writes the orientation block that Claude Code loads on session start: architecture summary, key modules, entry points, hotspots with owners, and current health scores. When the index updates, the block updates. This is the same file the CLAUDE.md generator produces for any repo.
MCP tools. The agent queries the index instead of walking the tree: get_answer for "how does X work", get_context for triage cards on files it is about to edit, get_symbol for verified source ranges, search_codebase for hybrid search.
Output distillation. Noisy commands (test runs, git log, builds) run through repowise distill, which keeps every error line and compresses the rest.
The PR bot and health gate. Every pull request gets a deterministic risk and health check before a human looks at it.
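To make the distillation idea concrete, here is a minimal sketch of "keep every error line, compress the rest". It is not the repowise distill implementation: the error pattern, the `distill` function name, and the omission marker are all assumptions made for illustration.

```python
import re

# Assumption: a line counts as an "error line" if it matches one of these markers.
ERROR_PATTERN = re.compile(r"error|fail|exception|traceback", re.IGNORECASE)


def distill(output: str) -> str:
    """Keep error-like lines verbatim; collapse each run of other lines into a count."""
    kept = []
    skipped = 0
    for line in output.splitlines():
        if ERROR_PATTERN.search(line):
            if skipped:
                kept.append(f"[... {skipped} line(s) omitted ...]")
                skipped = 0
            kept.append(line)
        else:
            skipped += 1
    if skipped:
        kept.append(f"[... {skipped} line(s) omitted ...]")
    return "\n".join(kept)


# Hypothetical test-runner output, not taken from a real run.
sample = "\n".join([
    "tests/test_a.py::test_one PASSED",
    "tests/test_a.py::test_two PASSED",
    "tests/test_b.py::test_three FAILED",
    "E   AssertionError: expected 3, got 4",
    "tests/test_c.py::test_four PASSED",
])
distilled = distill(sample)
print(distilled)
```

The property that matters is that nothing matching the error pattern is dropped, so the agent loses volume but not the lines it has to act on.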

What we measured

The numbers below come from the public benchmark repository. They are run on public codebases (Flask and scikit-learn), not on repowise itself, precisely so anyone can reproduce them. Each benchmark pairs two identical agents on identical tasks; the only variable is whether repowise's MCP tools are available.

Token efficiency: understanding a commit

Measured on the 30 most recent non-merge commits of pallets/flask. The question: how many tokens does each strategy need to give a model enough context to understand the change?

Strategy	Tokens per commit
Naive (full contents of changed files)	64,039
git diff only	14,888
repowise get_context	2,391

Against naive loading that is a 26.8x pooled reduction (209x mean across commits, 1,214x in the best case). Against git diff it is still 6.2x pooled.
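The gap between the pooled and mean figures is easier to see in code. The sketch below assumes "pooled" means total baseline tokens divided by total get_context tokens, and "mean" means the average of per-commit ratios. The three-commit sample is hypothetical and is not benchmark data; only the per-commit averages come from the table above.

```python
def pooled_ratio(baseline, reduced):
    """Total baseline tokens divided by total reduced tokens."""
    return sum(baseline) / sum(reduced)


def mean_ratio(baseline, reduced):
    """Average of the per-commit reduction ratios."""
    ratios = [b / r for b, r in zip(baseline, reduced)]
    return sum(ratios) / len(ratios)


# Reported per-commit averages from the table above.
reported_vs_naive = 64_039 / 2_391
reported_vs_diff = 14_888 / 2_391
print(f"vs naive: {reported_vs_naive:.1f}x, vs git diff: {reported_vs_diff:.1f}x")

# Hypothetical three-commit sample (NOT benchmark data) showing how a few
# commits with very large ratios pull the mean far above the pooled figure.
naive_tokens = [100_000, 50_000, 2_000]
context_tokens = [500, 2_500, 1_000]
print(pooled_ratio(naive_tokens, context_tokens))  # 38.0
print(mean_ratio(naive_tokens, context_tokens))    # 74.0
```

The pooled figure is the more conservative of the two, because commits where the baseline was already small weigh in with their full token counts instead of being averaged as ratios.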

Paired agent runs: 48 tasks on Flask, 48 on scikit-learn

Both configurations ran SWE-QA tasks with the same model (claude-sonnet-4-6), the same prompts, the same budget, and the same LLM judge. C0 is the bare agent (Read, Grep, Glob, Bash); C2 adds four repowise MCP tools.

Metric (mean per task)	Flask: C0	Flask: C2	scikit-learn: C0	scikit-learn: C2
Cost	$0.1396	$0.0890 (-36.2%)	$0.1180	$0.0834 (-29.3%)
Tool calls	7.4	3.8 (-49.2%)	8.1	2.4 (-70.5%)
Files read	1.9	0.2 (-89.0%)	1.8	0.6 (-69.3%)
Judge score (0-10)	8.82	8.81	8.72	8.23

On Flask, 32 of 48 tasks were cheaper with repowise; on scikit-learn, 33 of 48. Judge scores were tied on Flask (8.82 vs 8.81) and within half a point on scikit-learn (8.72 vs 8.23).
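The cost deltas in the table can be recomputed from the published means, as in the sketch below. The helper names are illustrative, and the per-task cost lists are hypothetical, included only to show how a "tasks cheaper" count would be taken over paired runs. Recomputing the other rows from the rounded means can differ from the published percentages in the last digit, presumably because those were calculated from unrounded values.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from the C0 value to the C2 value, in percent."""
    return (after - before) / before * 100


def tasks_cheaper(c0_costs, c2_costs) -> int:
    """Count paired tasks where the C2 run cost less than the C0 run."""
    return sum(1 for c0, c2 in zip(c0_costs, c2_costs) if c2 < c0)


# Mean cost per task, taken from the table above.
flask_delta = pct_change(0.1396, 0.0890)
sklearn_delta = pct_change(0.1180, 0.0834)
print(f"Flask: {flask_delta:.1f}%, scikit-learn: {sklearn_delta:.1f}%")

# Hypothetical per-task costs (NOT benchmark data).
c0 = [0.10, 0.20, 0.15]
c2 = [0.08, 0.22, 0.09]
print(tasks_cheaper(c0, c2))  # 2
```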

Where the cost win actually comes from

A follow-up run split the story in two, and this part matters if you care about agent economics:

On short Q&A tasks, the win depends on a curated tool surface. Advertising all nine MCP tools costs 4,520 schema tokens; the core profile (four tools) costs 1,884, a 58% cut. With the full surface the cost saving was only 4%; with the lean surface it was 25%.
On long investigation tasks, the win comes from distillation. Piping the floods of command output through repowise distill cut cache-read tokens by 41% and cost by 26%.
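The schema overhead is simple arithmetic, and a small sketch shows why it hits short tasks hardest. The two schema sizes come from the figures above. The 5,000-token task size is a hypothetical value, and the model assumes the schema is counted once per task, which may not match how a given client bills or caches it.

```python
FULL_SCHEMA_TOKENS = 4_520  # all nine MCP tools, as reported above
CORE_SCHEMA_TOKENS = 1_884  # four-tool core profile, as reported above

saved_tokens = FULL_SCHEMA_TOKENS - CORE_SCHEMA_TOKENS
cut = saved_tokens / FULL_SCHEMA_TOKENS
print(f"{saved_tokens} tokens saved, a {cut:.0%} cut")


def overhead_share(schema_tokens: int, task_tokens: int) -> float:
    """Fraction of a task's input that is tool schema rather than task content."""
    return schema_tokens / (schema_tokens + task_tokens)


# Hypothetical short Q&A task with 5,000 tokens of actual content.
task_tokens = 5_000
full_share = overhead_share(FULL_SCHEMA_TOKENS, task_tokens)
core_share = overhead_share(CORE_SCHEMA_TOKENS, task_tokens)
print(f"full surface: {full_share:.0%} of input, core profile: {core_share:.0%}")
```

The shorter the task, the larger the share of input taken by a fixed schema cost, which is consistent with the lean profile mattering most on quick questions.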

What it changes day to day

The benchmarks measure single tasks. Day to day on our own repo, the compounding effects are the point:

Orientation is free. A new session starts already knowing that the complexity walker is the highest-churn file in the analysis package and who owns it. Nobody re-derives the architecture each morning.
Risk shows up before the edit. The change-risk signals flag hotspot files and likely co-changes when the agent proposes touching them, which is when that information is cheap to act on.
The health score gates our own PRs. The same defect-validated score we ship (cross-project ROC AUC 0.74 on a 21-repo, 9-language corpus) runs on every repowise pull request. We feel the false positives before users do.

The honest limits

Three caveats we would want stated if we were evaluating this from outside. The efficiency numbers are measured on Flask and scikit-learn, chosen for reproducibility; your repo will differ, and the benchmark harness is public so you can run it on your own codebase. Quality is at parity in these runs, not better; the claim is that the agent does the same work with far less reading, not that it becomes smarter. And the short-task cost win requires the lean tool profile; a fully loaded MCP surface eats most of the navigation saving on quick questions.

Reproduce it

Everything above is in the benchmark repository: harness, configs, raw results, and the report the numbers are quoted from. Point repowise at your own repo with repowise init, or browse pre-indexed open-source repos to see the output on a codebase you already know.

FAQ

Are the 96% token numbers from the repowise codebase itself?

No, and that is deliberate. The token-efficiency and paired-agent benchmarks run on public repositories (Flask and scikit-learn) so the results are reproducible by anyone. The repowise codebase is where we use the tool daily; the public corpora are where we measure it.

What does "quality at parity" mean?

Both agent configurations were scored by the same LLM judge on the same tasks. On Flask the mean scores were 8.82 (bare) vs 8.81 (with repowise); on scikit-learn, 8.72 vs 8.23. repowise reduces the work an agent does to reach an answer, measured as tokens, file reads, and tool calls. It does not claim to improve the answer.

Can I run the benchmark on my own repository?

Yes. The harness, configurations, and analysis scripts are in the public benchmark repository, and the token-efficiency script takes any git repo as input.

Does this require Claude Code?

No. The index is exposed over the Model Context Protocol, so it works with Claude Code, Cursor, Cline, Codex, and any other MCP client. One index serves every agent.