The Big Picture

Repowise takes a codebase and produces a living wiki: AI-generated documentation that stays current as the code changes. It is built in layers (ingestion, generation, persistence, and a frontend behind an API), each with a single responsibility and a clear boundary.

Single responsibility per layer

Ingestion, generation, and persistence are fully decoupled.

Swap the LLM provider without touching core

The LLM abstraction is isolated behind a provider interface.
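A minimal sketch of what such a provider boundary could look like. The names LLMProvider, EchoProvider, complete, and generate_page are hypothetical and chosen for illustration; the actual Repowise interface may differ.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Hypothetical provider interface (an assumption, not Repowise's API)."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider for tests: returns the prompt in upper case."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def generate_page(provider: LLMProvider, context: str) -> str:
    # Core code depends only on the interface, never on a concrete provider.
    return provider.complete(f"Document this file:\n{context}")


page = generate_page(EchoProvider(), "def login(): ...")
```

Because generate_page only sees the interface, swapping EchoProvider for a real backend would not require changes to the calling code.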

Change the database without touching generation

Persistence is accessed through a repository pattern.

Replace the frontend without touching persistence

The API is the only contract between UI and data.

Ingestion Pipeline

The ingestion pipeline turns raw source code into structured data. Five components run in sequence; four of them are described below.

FileTraverser

Walks the directory tree and applies three filter layers to find only documentable files.

ASTParser

Extracts functions, classes, imports, and exports from 9+ languages using tree-sitter queries with unified capture names.

@symbol.def      — full definition node (provides line numbers)
@symbol.name     — the name identifier
@symbol.params   — parameter list
@symbol.modifiers — decorators, visibility keywords
@import.statement — full import node
@import.module   — the module path being imported

Adding a new language requires only one .scm query file and one entry in LANGUAGE_CONFIGS.

GitIndexer

Extracts ownership, churn, co-change relationships, and significant commits from git history using a single git log pass, so the number of git processes spawned stays constant, O(1), instead of growing with the number of files, O(N).

Category	Metric	What it tells you
Timeline	commit_count_90d, age_days	How active the file is
Ownership	primary_owner, bus_factor	Who knows this code, single points of failure
Churn	churn_percentile, is_hotspot	How volatile — risky to change
Co-change	co_change_pairs	What files change together (invisible coupling)
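One way to derive co-change pairs is to count how often two files appear in the same commit. The sketch below assumes the commit history has already been parsed into lists of file paths; the function name, the min_count threshold, and the sample data are illustrative, not taken from Repowise.

```python
from collections import Counter
from itertools import combinations


def co_change_pairs(commits, min_count=2):
    """Count how often two files are touched by the same commit."""
    counts = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) land on the same key.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    # Keep only pairs that changed together at least min_count times.
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Made-up history: each inner list is the set of files in one commit.
commits = [
    ["src/auth.py", "src/session.py"],
    ["src/auth.py", "src/session.py", "README.md"],
    ["README.md"],
]
pairs = co_change_pairs(commits)
```

Here only src/auth.py and src/session.py pass the threshold, which is the kind of coupling that an import graph alone would not reveal.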

Bus factor = the minimum number of contributors needed to account for 80% of commits. A bus factor of 1 means a single point of failure: if that person leaves, institutional knowledge is lost.
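The bus factor definition can be sketched as follows, assuming the input is one author name per commit. The function name and the sample authors are hypothetical.

```python
from collections import Counter


def bus_factor(commit_authors, threshold=0.8):
    """Smallest number of top contributors covering `threshold` of commits."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk contributors from most to fewest commits until the share is reached.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= threshold * total:
            return rank
    return len(counts)


solo = bus_factor(["alice"] * 9 + ["bob"])
shared = bus_factor(["alice"] * 4 + ["bob"] * 3 + ["carol"] * 3)
```

In the first case one author wrote 90% of the commits, so the bus factor is 1; in the second, all three authors are needed to reach 80%.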

ChangeDetector

After the initial indexing, ChangeDetector determines what to regenerate on each update run.

Generation Pipeline

The generation pipeline takes ingested data and produces wiki pages using LLMs.

Page Types & Levels

Repowise generates 10 page types in 8 ordered levels. Later levels depend on earlier ones — a module page references the file pages inside it.

Level	Page Type	Description
0	api_contract	API definitions (OpenAPI, Proto, GraphQL)
1	symbol_spotlight	Individual important symbols
2	file_page	Individual code files (most common)
3	scc_page	Circular dependency documentation
4	module_page	Directory-level summaries
5	cross_package	Inter-package boundaries (monorepos)
6	repo_overview	Repo-wide summary
6	architecture_diagram	Mermaid dependency diagram
7	infra_page	Dockerfiles, Makefiles, Terraform
7	diff_summary	Change summaries
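The level ordering in the table can be sketched as a lookup plus a stable sort. The level numbers come from the table above; the page ID format ("page_type:path") follows the checkpoint example later in this document, and the function name is an assumption.

```python
from itertools import groupby

# Level per page type, copied from the table above.
PAGE_LEVELS = {
    "api_contract": 0,
    "symbol_spotlight": 1,
    "file_page": 2,
    "scc_page": 3,
    "module_page": 4,
    "cross_package": 5,
    "repo_overview": 6,
    "architecture_diagram": 6,
    "infra_page": 7,
    "diff_summary": 7,
}


def batches_by_level(page_ids):
    """Group page IDs into batches, lowest level first."""

    def level(page_id):
        return PAGE_LEVELS[page_id.split(":", 1)[0]]

    ordered = sorted(page_ids, key=level)  # stable: keeps order within a level
    return [(lvl, list(group)) for lvl, group in groupby(ordered, key=level)]


batches = batches_by_level([
    "module_page:src",
    "file_page:src/auth.py",
    "file_page:src/db.py",
    "api_contract:openapi.yaml",
])
```

Processing the batches in order means every file page exists before the module page that references it is generated.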

ContextAssembler

Each page type has a ContextAssembler method that builds everything the LLM needs to know. Context is rendered into Jinja2 templates and token-budgeted to fit the LLM window.

FilePageContext
├── file_path, language
├── symbols (public first, then private)
│   └── name, kind, signature, docstring, visibility
├── imports, exports
├── dependencies + dependents
├── file_source_snippet (trimmed to token budget)
├── pagerank_score, betweenness_score, community_id
├── git_metadata (ownership, churn, significant commits)
├── co_change_pages
├── dead_code_findings
├── depth: "minimal" | "standard" | "thorough"
└── rag_context (related pages from vector search)
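As a rough illustration of trimming file_source_snippet to a token budget, the sketch below keeps whole leading lines until an estimated budget is used up. The four-characters-per-token estimate and the function name are assumptions; a real implementation would more likely count tokens with the provider's tokenizer.

```python
def trim_to_budget(source, max_tokens, chars_per_token=4):
    """Keep whole leading lines that fit an estimated token budget."""
    budget = max_tokens * chars_per_token  # crude character-based estimate
    kept, used = [], 0
    for line in source.splitlines():
        cost = len(line) + 1  # +1 accounts for the newline
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


snippet = trim_to_budget("aaaa\nbbbb\ncccc", max_tokens=3)
```

Cutting on line boundaries keeps the snippet syntactically readable, at the cost of leaving part of the budget unused.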

Job System

Generation for large repos may take hours. The job system checkpoints after every page, so a crashed run can resume from where it left off.

{
  "status": "running",
  "total_pages": 300,
  "completed_pages": 150,
  "completed_page_ids": ["file_page:src/auth.py", "..."],
  "current_level": 2
}
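Resuming from a checkpoint of this shape could work roughly as below. The field names come from the JSON above; the function name, the planned page list, and the sample values are made up for illustration.

```python
import json


def remaining_pages(checkpoint_json, planned_page_ids):
    """Return the planned pages that the checkpoint has not completed yet."""
    state = json.loads(checkpoint_json)
    done = set(state["completed_page_ids"])
    # Preserve the planned order so level ordering is not disturbed.
    return [page_id for page_id in planned_page_ids if page_id not in done]


checkpoint = json.dumps({
    "status": "running",
    "total_pages": 3,
    "completed_pages": 1,
    "completed_page_ids": ["file_page:src/auth.py"],
    "current_level": 2,
})
planned = ["file_page:src/auth.py", "file_page:src/db.py", "module_page:src"]
todo = remaining_pages(checkpoint, planned)
```

Only the two unfinished pages are returned, so a restarted job does not pay for pages that were already generated.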

Persistence Layer

Repowise writes to three independent stores (SQL, vector store, and graph), each optimized for a different query pattern.

The repowise doctor --repair command checks SQL↔Vector↔FTS consistency and auto-repairs orphaned entries — re-embedding missing vectors, re-indexing missing FTS entries.
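The consistency check can be thought of as set differences over page IDs. This sketch assumes the SQL store is the source of truth and that each store can list its IDs; both assumptions, the function name, and the result keys are illustrative and do not describe how repowise doctor is implemented.

```python
def find_inconsistencies(sql_ids, vector_ids, fts_ids):
    """Compare page IDs across stores, treating SQL as the source of truth."""
    sql_ids, vector_ids, fts_ids = set(sql_ids), set(vector_ids), set(fts_ids)
    return {
        "missing_vectors": sorted(sql_ids - vector_ids),   # need re-embedding
        "missing_fts": sorted(sql_ids - fts_ids),          # need re-indexing
        "orphaned_vectors": sorted(vector_ids - sql_ids),  # safe to delete
        "orphaned_fts": sorted(fts_ids - sql_ids),         # safe to delete
    }


report = find_inconsistencies(
    sql_ids=["a", "b", "c"],
    vector_ids=["a", "b", "x"],
    fts_ids=["a", "c"],
)
```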

Graph Algorithms

Every graph metric is computed from the import dependency graph — a directed graph where each node is a file and each edge is an import.

PageRank

Answers “How important is this file?” A file is important if many other important files depend on it.

score(B) = (1 - α) / N  +  α × Σ [ score(A) / out_degree(A) ]
                                        for each A that imports B

α = 0.85 (damping factor: the probability of following an edge rather than teleporting); N = total number of files in the graph
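The formula can be evaluated by power iteration, as in the sketch below. The input format (a dict from each file to the files it imports), the fixed iteration count, and the handling of files with no imports (their score is spread evenly over all files, a common convention that the formula above does not specify) are assumptions.

```python
def pagerank(imports, alpha=0.85, iterations=50):
    """Power-iteration PageRank; `imports` maps a file to the files it imports."""
    nodes = set(imports) | {b for targets in imports.values() for b in targets}
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleport term: (1 - alpha) / N for every file.
        new = {node: (1 - alpha) / n for node in nodes}
        for a in nodes:
            targets = imports.get(a, [])
            if targets:
                # A passes score(A) / out_degree(A) to each file it imports.
                share = alpha * score[a] / len(targets)
                for b in targets:
                    new[b] += share
            else:
                # Assumption: a file with no imports spreads its score evenly.
                share = alpha * score[a] / n
                for b in nodes:
                    new[b] += share
        score = new
    return score


scores = pagerank({"a.py": ["utils.py"], "b.py": ["utils.py"], "utils.py": []})
```

In this three-file example utils.py ends up with the highest score because both other files import it.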

Doc priority: highest PageRank files get wiki pages generated first
Generation depth: low PageRank + low churn → “minimal” docs (saves LLM tokens)
CLAUDE.md: files sorted by PageRank descending — most important appears first for AI assistants

Betweenness Centrality

Answers “Is this file a bridge between subsystems?” High-betweenness files sit on the shortest paths connecting many pairs of files. Removing one would lengthen or cut the paths between parts of the codebase.

PageRank says “I'm important.” Betweenness says “I'm a bottleneck.” A bridge file with low PageRank still gets a Level 2 wiki page because it's a coupling point — changes to it ripple across subsystem boundaries.
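Betweenness on an unweighted graph is commonly computed with Brandes' algorithm. The sketch below is a plain-Python version for a directed import graph given as a dict of adjacency lists without duplicate edges. Whether Repowise treats the graph as directed here, or normalizes the scores, is not stated, so both choices are assumptions.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness (Brandes) for a directed, unweighted graph."""
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    bc = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = dict.fromkeys(nodes, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


bc = betweenness({"ui.py": ["api.py"], "api.py": ["db.py"], "db.py": []})
```

In this chain api.py is the only file that lies on a shortest path between two other files, so it is the only one with a nonzero score.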

Community Detection

The Louvain algorithm, run on the undirected version of the import graph, finds natural clusters: files that are more connected to each other than to the rest. These clusters become subsystem labels in the wiki and color groups in the dependency graph visualization.

Dead Code Detection

Files with in-degree = 0 (nothing imports them) are dead code candidates. Repowise filters out entry points, test files, config files, and framework-loaded files before flagging, then scores confidence using git activity.

Confidence	Condition
1.0	No imports + no commits in 90d + file age > 180d
0.7	No imports + no commits in 90d
0.4	No imports but has recent commits (dynamic loading?)
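The confidence table maps directly onto a small function. The thresholds are taken from the table; the function name and parameter names (modelled on the commit_count_90d and age_days metrics above) are hypothetical, and the function assumes entry points, tests, and config files were filtered out beforehand.

```python
def dead_code_confidence(in_degree, commit_count_90d, age_days):
    """Return a confidence score, or None if the file is imported somewhere."""
    if in_degree > 0:
        return None  # something imports it, so it is not a candidate
    if commit_count_90d > 0:
        return 0.4   # unreferenced but still changing: possibly loaded dynamically
    if age_days > 180:
        return 1.0   # unreferenced, inactive, and old
    return 0.7       # unreferenced and inactive, but not old enough to be sure


stale = dead_code_confidence(in_degree=0, commit_count_90d=0, age_days=365)
active = dead_code_confidence(in_degree=0, commit_count_90d=3, age_days=365)
```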

MCP Tools

Nine MCP tools expose all of Repowise's intelligence to AI coding assistants: Claude Code, Cursor, Cline, and any other MCP-compatible editor.

get_overview()

Architecture summary, module map, entry points, tech stack.

When to use: First call when exploring an unfamiliar codebase.

get_answer(question)

One-call RAG Q&A over the wiki — returns a cited answer with a confidence score.

When to use: First call on any code question.

get_context(targets=[...])

Docs, ownership, history, decisions, freshness for files or symbols.

When to use: Before reading or modifying specific code.

get_symbol(symbol_id)

Raw source bytes for one indexed symbol with exact line bounds.

When to use: When you need the body of one function or class.

search_codebase(query="...")

Semantic search over the full wiki.

When to use: When you don't know where something lives.

get_risk(targets=[...])

Hotspot score, dependents, co-change partners, risk summary.

When to use: Before modifying files, to assess the blast radius.

get_why(query="...")

Decisions search, path-based lookup, or supersession lineage.

When to use: Before architectural changes, to understand existing intent.

get_dead_code()

Unreachable files, unused exports, zombie packages.

When to use: Before cleanup or refactoring.

get_health(targets=[...])

Per-file health scores, worst files, and refactoring targets.

When to use: Before refactoring, to find the riskiest files first.

Critical Analysis

An honest assessment of what Repowise solves well, where it can fail, and what's been addressed.

What it solves well

Graph-first architecture captures structural dependencies that keyword search misses
Temporal intelligence (git history, co-change) adds a dimension pure static analysis can't provide
Bounded cascade with PageRank-sorted budgets ensures important docs stay fresh
Multi-source decision extraction mines architectural intent from commits, comments, and documentation
MCP integration makes intelligence immediately useful inside AI coding workflows

Failure probability matrix

Failure	Probability	Impact	Recovery
Three-store inconsistency	Medium	Low–Medium	repowise doctor --repair
LLM hallucination in docs	Medium	Medium	Re-generate page
Documentation staleness	High	Medium	Run repowise update
Import resolution gaps (dynamic imports)	Medium–High	Medium	Partial (framework edges added)
Scale limits (>50K files)	Low–High (repo-dependent)	High	Architecture changes needed
Decision mining false positives	Medium	Low	Dismiss via CLI

What has been implemented

doctor --repair

Three-store consistency check and auto-repair

Symbol cross-check

LLM output validated against AST symbols

Framework edges

Synthetic edges for pytest, Django, FastAPI, Flask

Adaptive cascade budget

Scales 10–50 based on change magnitude

Generation report

Cost/quality summary after every update run