The Big Picture
Repowise takes a codebase and produces a living wiki — AI-generated documentation that stays current as the code changes. Each layer has a single responsibility and a clear boundary.
Single responsibility per layer
Ingestion, generation, and persistence are fully decoupled.
Swap the LLM provider without touching core
The LLM abstraction is isolated behind a provider interface.
Change the database without touching generation
Persistence is accessed through a repository pattern.
Replace the frontend without touching persistence
The API is the only contract between UI and data.
Ingestion Pipeline
The ingestion pipeline turns raw source code into structured data. Five components run in sequence.
FileTraverser
Walks the directory tree and applies three filter layers to find only documentable files.
ASTParser
Extracts functions, classes, imports, and exports from 9+ languages using tree-sitter queries with unified capture names.
@symbol.def: full definition node (provides line numbers)
@symbol.name: the name identifier
@symbol.params: parameter list
@symbol.modifiers: decorators, visibility keywords
@import.statement: full import node
@import.module: the module path being imported
Adding a new language requires only one .scm query file and one entry in LANGUAGE_CONFIGS.
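As a rough illustration of that extension point, the sketch below pairs file extensions with a query file. Only the name LANGUAGE_CONFIGS and the .scm file come from the text above; the LanguageConfig fields, the queries/ paths, and the config_for helper are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical registry entry: one per supported language."""
    name: str
    extensions: Tuple[str, ...]
    query_file: str  # path to the tree-sitter .scm query for this language


# Assumed shape of the registry; the real structure may differ.
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "python": LanguageConfig("python", (".py",), "queries/python.scm"),
    "go": LanguageConfig("go", (".go",), "queries/go.scm"),
}


def config_for(path: str) -> Optional[LanguageConfig]:
    """Return the config whose extensions match the file, or None."""
    suffix = PurePosixPath(path).suffix
    for config in LANGUAGE_CONFIGS.values():
        if suffix in config.extensions:
            return config
    return None
```

Under this layout, supporting another language would mean adding one dictionary entry and dropping the matching query file next to the others.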
GitIndexer
Extracts ownership, churn, co-change relationships, and significant commits from git history using a single git log pass — O(1) processes instead of O(N).
| Category | Metric | What it tells you |
|---|---|---|
| Timeline | commit_count_90d, age_days | How active the file is |
| Ownership | primary_owner, bus_factor | Who knows this code, single points of failure |
| Churn | churn_percentile, is_hotspot | How volatile — risky to change |
| Co-change | co_change_pairs | What files change together (invisible coupling) |
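To make the single-pass idea concrete, here is a minimal sketch that derives commit counts, a primary owner, and co-change pairs from one block of log text. It assumes the text came from something like git log --name-only --format="@@%H|%an"; the "@@" marker, the index_log function, and the sample data are illustrative and not taken from Repowise.

```python
from collections import Counter
from itertools import combinations

# Sample text in the assumed format: a marker line per commit, then its files.
SAMPLE_LOG = """\
@@c1|alice
src/auth.py
src/session.py

@@c2|bob
src/auth.py

@@c3|alice
src/auth.py
src/session.py
"""


def index_log(log_text):
    """Walk the log once and collect per-file and per-pair statistics."""
    commits = Counter()   # file -> number of commits touching it
    authors = {}          # file -> Counter of author names
    pairs = Counter()     # (file_a, file_b) -> commits touching both
    files, author = [], None

    def flush(changed):
        # Every pair of files in the same commit counts as one co-change.
        for a, b in combinations(sorted(set(changed)), 2):
            pairs[(a, b)] += 1

    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("@@"):
            flush(files)
            files = []
            author = line[2:].split("|", 1)[1]
        elif line:
            files.append(line)
            commits[line] += 1
            authors.setdefault(line, Counter())[author] += 1
    flush(files)  # do not forget the last commit

    owners = {f: c.most_common(1)[0][0] for f, c in authors.items()}
    return commits, owners, pairs
```

Because every metric is accumulated while reading the same text, the cost in git processes does not grow with the number of files.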
ChangeDetector
After the initial indexing, ChangeDetector determines what to regenerate on each update run.
Generation Pipeline
The generation pipeline takes ingested data and produces wiki pages using LLMs.
Page Types & Levels
Repowise generates 10 page types in 8 ordered levels. Later levels depend on earlier ones — a module page references the file pages inside it.
| Level | Page Type | Description |
|---|---|---|
| 0 | api_contract | API definitions (OpenAPI, Proto, GraphQL) |
| 1 | symbol_spotlight | Individual important symbols |
| 2 | file_page | Individual code files (most common) |
| 3 | scc_page | Circular dependency documentation |
| 4 | module_page | Directory-level summaries |
| 5 | cross_package | Inter-package boundaries (monorepos) |
| 6 | repo_overview | Repo-wide summary |
| 6 | architecture_diagram | Mermaid dependency diagram |
| 7 | infra_page | Dockerfiles, Makefiles, Terraform |
| 7 | diff_summary | Change summaries |
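The level ordering above can be sketched as a simple batching step: group requested pages by level and emit the batches in ascending order, so that dependencies are generated first. The level numbers are copied from the table; the plan_batches function and the (page_type, target) tuples are assumptions for illustration.

```python
from collections import defaultdict

# Level per page type, as listed in the table.
PAGE_LEVELS = {
    "api_contract": 0,
    "symbol_spotlight": 1,
    "file_page": 2,
    "scc_page": 3,
    "module_page": 4,
    "cross_package": 5,
    "repo_overview": 6,
    "architecture_diagram": 6,
    "infra_page": 7,
    "diff_summary": 7,
}


def plan_batches(pages):
    """Group (page_type, target) tuples into batches ordered by level."""
    by_level = defaultdict(list)
    for page_type, target in pages:
        by_level[PAGE_LEVELS[page_type]].append((page_type, target))
    return [by_level[level] for level in sorted(by_level)]
```

Pages inside one batch share a level, so in this sketch they could be generated in any order or in parallel.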
ContextAssembler
Each page type has a ContextAssembler method that builds everything the LLM needs to know. Context is rendered into Jinja2 templates and token-budgeted to fit the LLM window.
FilePageContext
├── file_path, language
├── symbols (public first, then private)
│   └── name, kind, signature, docstring, visibility
├── imports, exports
├── dependencies + dependents
├── file_source_snippet (trimmed to token budget)
├── pagerank_score, betweenness_score, community_id
├── git_metadata (ownership, churn, significant commits)
├── co_change_pages
├── dead_code_findings
├── depth: "minimal" | "standard" | "thorough"
└── rag_context (related pages from vector search)
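One way the source snippet might be trimmed to a token budget is sketched below. The four-characters-per-token estimate stands in for a real tokenizer, and both function names are hypothetical; Repowise's actual budgeting logic is not described here.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def trim_to_budget(source: str, budget: int) -> str:
    """Keep whole lines from the top of the file until the budget is spent."""
    kept = []
    used = 0
    for line in source.splitlines():
        cost = estimate_tokens(line + "\n")
        if used + cost > budget:
            # The marker itself is not counted against the budget.
            kept.append("# ... truncated ...")
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```

Cutting on line boundaries keeps the snippet syntactically readable, at the price of slightly under-using the budget.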
Job System
Generation for large repos may take hours. The job system checkpoints after every page, so a run interrupted by a crash can resume where it left off.
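A minimal sketch of the resume logic, assuming the checkpoint is a dictionary with the field names used in this section. The helper names and the "completed" status value are hypothetical.

```python
def pending_pages(all_page_ids, checkpoint):
    """Return the pages still to generate, preserving the planned order."""
    done = set(checkpoint["completed_page_ids"])
    return [page_id for page_id in all_page_ids if page_id not in done]


def mark_done(checkpoint, page_id):
    """Record one finished page; call this after each page is persisted."""
    checkpoint["completed_page_ids"].append(page_id)
    checkpoint["completed_pages"] = len(checkpoint["completed_page_ids"])
    if checkpoint["completed_pages"] >= checkpoint["total_pages"]:
        checkpoint["status"] = "completed"  # assumed status value
    return checkpoint
```

On restart, a job would load the last checkpoint, call pending_pages, and continue from the first unfinished page. The checkpoint record itself has this shape: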
{
"status": "running",
"total_pages": 300,
"completed_pages": 150,
"completed_page_ids": ["file_page:src/auth.py", "..."],
"current_level": 2
}
Persistence Layer
Repowise writes to three independent stores (SQL, vector store, and graph), each optimized for a different query pattern.
The repowise doctor --repair command checks SQL↔Vector↔FTS consistency and auto-repairs orphaned entries, re-embedding missing vectors and re-indexing missing FTS entries.
Graph Algorithms
Every graph metric is computed from the import dependency graph — a directed graph where each node is a file and each edge is an import.
PageRank
Answers "How important is this file?" A file is important if many other important files depend on it.
score(B) = (1 - α) / N + α × Σ [ score(A) / out_degree(A) ]
           for each A that imports B
α = 0.85 (damping factor: probability of following an edge vs. teleporting)
- Doc priority: highest PageRank files get wiki pages generated first
- Generation depth: low PageRank + low churn → "minimal" docs (saves LLM tokens)
- CLAUDE.md: files sorted by PageRank descending, so the most important appear first for AI assistants
Betweenness Centrality
Answers "Is this file a bridge between subsystems?" High-betweenness files sit on the shortest paths connecting many file pairs. Removing one would lengthen or cut many of those paths.
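For an unweighted import graph, betweenness can be computed with Brandes' algorithm. Whether Repowise uses this algorithm or a library routine is not stated, so treat the following as a sketch with a hypothetical betweenness function and unnormalized scores.

```python
from collections import deque


def betweenness(graph):
    """graph maps each file to the files it imports (directed edges)."""
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    score = {node: 0.0 for node in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {node: [] for node in nodes}
        sigma = {node: 0 for node in nodes}
        dist = {node: -1 for node in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in nodes}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score
```

In a graph where two UI files reach two database files only through one API file, the API file collects all four shortest paths and scores 4.0, while every other file scores 0.0.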
Community Detection
Louvain algorithm on the undirected graph finds natural clusters — files that are more connected to each other than to the rest. These clusters become subsystem labels in the wiki and color groups in the dependency graph visualization.
Dead Code Detection
Files with in-degree = 0 (nothing imports them) are dead code candidates. Repowise filters out entry points, test files, config files, and framework-loaded files before flagging, then scores confidence using git activity.
| Confidence | Condition |
|---|---|
| 1.0 | No imports + no commits in 90d + file age > 180d |
| 0.7 | No imports + no commits in 90d |
| 0.4 | No imports but has recent commits (dynamic loading?) |
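The confidence table translates directly into a small function. The name dead_code_confidence and its parameters are illustrative, and the sketch assumes entry points, tests, config files, and framework-loaded files were filtered out beforehand.

```python
from typing import Optional


def dead_code_confidence(in_degree: int, commits_90d: int, age_days: int) -> Optional[float]:
    """Score a dead-code candidate, or return None if something imports it."""
    if in_degree > 0:
        return None   # imported somewhere, so not a candidate
    if commits_90d > 0:
        return 0.4    # unused but recently touched (dynamic loading?)
    if age_days > 180:
        return 1.0    # unused, untouched for 90 days, and older than 180 days
    return 0.7        # unused and untouched for 90 days
```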
MCP Tools
Eight MCP tools expose all of Repowise's intelligence to AI coding assistants: Claude Code, Cursor, Cline, and any MCP-compatible editor.
get_overview()
Architecture summary, module map, entry points, tech stack.
when to use: First call when exploring an unfamiliar codebase.
get_context(targets=[...])
Docs, ownership, history, decisions, freshness for files or symbols.
when to use: Before reading or modifying specific code.
get_risk(targets=[...])
Hotspot score, dependents, co-change partners, risk summary.
when to use: Before modifying files — assess blast radius.
get_why(query="...")
Decisions search, path-based lookup, or health dashboard.
when to use: Before architectural changes — understand existing intent.
search_codebase(query="...")
Semantic search over the full wiki.
when to use: When you don't know where something lives.
get_dependency_path(source=..., target=...)
Connection path between two files.
when to use: Understand how two things are connected.
get_dead_code()
Unreachable files, unused exports, zombie packages.
when to use: Before cleanup or refactoring.
get_architecture_diagram(scope=...)
Mermaid diagram for the full repo or a module scope.
when to use: For documentation, presentations, or onboarding.
Critical Analysis
An honest assessment of what Repowise solves well, where it can fail, and what's been addressed.
What it solves well
- Graph-first architecture captures structural dependencies that keyword search misses
- Temporal intelligence (git history, co-change) adds a dimension pure static analysis can't provide
- Bounded cascade with PageRank-sorted budgets ensures important docs stay fresh
- Multi-source decision extraction mines architectural intent from commits, comments, and documentation
- MCP integration makes intelligence immediately useful inside AI coding workflows
Failure probability matrix
| Failure | Probability | Impact | Recovery |
|---|---|---|---|
| Three-store inconsistency | Medium | Low–Medium | repowise doctor --repair |
| LLM hallucination in docs | Medium | Medium | Re-generate page |
| Documentation staleness | High | Medium | Run repowise update |
| Import resolution gaps (dynamic imports) | Medium–High | Medium | Partial (framework edges added) |
| Scale limits (>50K files) | Low–High (repo-dependent) | High | Architecture changes needed |
| Decision mining false positives | Medium | Low | Dismiss via CLI |
What has been implemented
doctor --repair
Three-store consistency check and auto-repair
Symbol cross-check
LLM output validated against AST symbols
Framework edges
Synthetic edges for pytest, Django, FastAPI, Flask
Adaptive cascade budget
Scales 10–50 based on change magnitude
Generation report
Cost/quality summary after every update run
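As an illustration of the symbol cross-check listed above, the sketch below flags names that a generated page mentions but the parser never reported. The unknown_symbols function and the convention that pages reference symbols as backticked name() are assumptions; the actual validation rules are not described in this document.

```python
import re


def unknown_symbols(page_markdown, ast_symbols):
    """Return backticked call-like names in the page that the AST never reported."""
    mentioned = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_]*)\(\)`", page_markdown))
    return sorted(mentioned - set(ast_symbols))
```

A non-empty result would suggest the page references a symbol that does not exist in the file, which is one signal of LLM hallucination.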