Architecture

A complete guide to how Repowise works — ingestion, generation, persistence, graph algorithms, and MCP tools.


The Big Picture

Repowise takes a codebase and produces a living wiki — AI-generated documentation that stays current as the code changes. Each layer has a single responsibility and a clear boundary.

Single responsibility per layer

Ingestion, generation, and persistence are fully decoupled.

Swap the LLM provider without touching core

The LLM abstraction is isolated behind a provider interface.

Change the database without touching generation

Persistence is accessed through a repository pattern.

Replace the frontend without touching persistence

The API is the only contract between UI and data.

Ingestion Pipeline

The ingestion pipeline turns raw source code into structured data. Five components run in sequence.

FileTraverser

Walks the directory tree and applies three filter layers to find only documentable files.
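This guide does not name the three filter layers, so the sketch below assumes one plausible split: ignored directories, an extension allowlist, and a size cap. `traverse`, `IGNORED_DIRS`, `DOCUMENTABLE_EXTS`, and `MAX_BYTES` are hypothetical names used only for illustration.

```python
import os
from pathlib import Path
from typing import Iterator

# Assumed filter settings; the real FileTraverser's layers may differ.
IGNORED_DIRS = {".git", "node_modules", "__pycache__", "dist"}
DOCUMENTABLE_EXTS = {".py", ".ts", ".go", ".rs", ".java"}
MAX_BYTES = 512 * 1024


def traverse(root: Path) -> Iterator[Path]:
    """Yield files under root that pass all three (assumed) filter layers."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Layer 1: prune ignored directories in place so os.walk skips them.
        dirnames[:] = sorted(d for d in dirnames if d not in IGNORED_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            # Layer 2: keep only extensions we know how to document.
            if path.suffix not in DOCUMENTABLE_EXTS:
                continue
            # Layer 3: skip oversized files (likely generated or vendored).
            if path.stat().st_size > MAX_BYTES:
                continue
            yield path
```

Pruning `dirnames` in place keeps the walk from ever descending into ignored trees, which tends to matter more for speed than the per-file checks.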

ASTParser

Extracts functions, classes, imports, and exports from 9+ languages using tree-sitter queries with unified capture names.

@symbol.def      — full definition node (provides line numbers)
@symbol.name     — the name identifier
@symbol.params   — parameter list
@symbol.modifiers — decorators, visibility keywords
@import.statement — full import node
@import.module   — the module path being imported

Adding a new language requires only one .scm query file and one entry in LANGUAGE_CONFIGS.
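As a rough illustration of why unified capture names keep per-language work small, the sketch below folds a stream of captures into language-agnostic symbol records. The capture tuple shape `(capture_name, text, line)`, the `LanguageConfig` fields, and the query file paths are assumptions for this example, not tree-sitter's or Repowise's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LanguageConfig:
    extensions: Tuple[str, ...]
    query_file: str  # path to the language's .scm query file (hypothetical)


# One entry per language; the paths are placeholders.
LANGUAGE_CONFIGS = {
    "python": LanguageConfig((".py",), "queries/python.scm"),
    "typescript": LanguageConfig((".ts", ".tsx"), "queries/typescript.scm"),
}


@dataclass
class Symbol:
    name: str = ""
    line: int = 0
    params: str = ""
    modifiers: List[str] = field(default_factory=list)


Capture = Tuple[str, str, int]  # (capture_name, text, line): assumed shape


def language_for(path: str) -> Optional[str]:
    """Pick a language by file extension, or None if unsupported."""
    for lang, cfg in LANGUAGE_CONFIGS.items():
        if path.endswith(cfg.extensions):
            return lang
    return None


def group_symbols(captures: List[Capture]) -> List[Symbol]:
    """Fold captures into symbols; assumes symbol.def precedes its parts."""
    symbols: List[Symbol] = []
    for capture_name, text, line in captures:
        if capture_name == "symbol.def":
            symbols.append(Symbol(line=line))  # the def node gives the line
        elif not symbols:
            continue  # e.g. import captures before the first symbol
        elif capture_name == "symbol.name":
            symbols[-1].name = text
        elif capture_name == "symbol.params":
            symbols[-1].params = text
        elif capture_name == "symbol.modifiers":
            symbols[-1].modifiers.append(text)
    return symbols
```

Because `group_symbols` only looks at capture names, the same function can serve every language whose query file emits those names.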

GitIndexer

Extracts ownership, churn, co-change relationships, and significant commits from git history using a single git log pass — O(1) processes instead of O(N).

| Category | Metric | What it tells you |
| --- | --- | --- |
| Timeline | commit_count_90d, age_days | How active the file is |
| Ownership | primary_owner, bus_factor | Who knows this code, single points of failure |
| Churn | churn_percentile, is_hotspot | How volatile (risky to change) |
| Co-change | co_change_pairs | What files change together (invisible coupling) |

Bus factor is the minimum number of contributors needed to account for 80% of commits. A bus factor of 1 means a single point of failure: if that person leaves, institutional knowledge is lost.
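The bus factor definition above can be sketched directly from a list of commit authors. The function name and the input shape (one author string per commit) are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def bus_factor(commit_authors: Iterable[str], threshold: float = 0.8) -> int:
    """Smallest number of top contributors covering `threshold` of commits."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0  # no history, nothing to measure
    covered = 0
    # Walk contributors from most to least active until coverage is reached.
    for rank, (_author, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= threshold * total:
            return rank
    return len(counts)
```

For a file with nine commits by one author and one by another, this returns 1, which is the single-point-of-failure case described above.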

ChangeDetector

After the initial indexing, ChangeDetector determines what to regenerate on each update run.

Generation Pipeline

The generation pipeline takes ingested data and produces wiki pages using LLMs.

Page Types & Levels

Repowise generates 10 page types in 8 ordered levels. Later levels depend on earlier ones — a module page references the file pages inside it.

| Level | Page Type | Description |
| --- | --- | --- |
| 0 | api_contract | API definitions (OpenAPI, Proto, GraphQL) |
| 1 | symbol_spotlight | Individual important symbols |
| 2 | file_page | Individual code files (most common) |
| 3 | scc_page | Circular dependency documentation |
| 4 | module_page | Directory-level summaries |
| 5 | cross_package | Inter-package boundaries (monorepos) |
| 6 | repo_overview | Repo-wide summary |
| 6 | architecture_diagram | Mermaid dependency diagram |
| 7 | infra_page | Dockerfiles, Makefiles, Terraform |
| 7 | diff_summary | Change summaries |
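One way to picture the level ordering is as a batch plan: pages are grouped by level, and each batch runs only after the previous one finishes. The mapping mirrors the table above; `plan_levels` and the `(page_type, target)` input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Level assignments copied from the page type table.
PAGE_LEVELS: Dict[str, int] = {
    "api_contract": 0,
    "symbol_spotlight": 1,
    "file_page": 2,
    "scc_page": 3,
    "module_page": 4,
    "cross_package": 5,
    "repo_overview": 6,
    "architecture_diagram": 6,
    "infra_page": 7,
    "diff_summary": 7,
}


def plan_levels(pages: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group (page_type, target) pairs into batches ordered by level."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for page_type, target in pages:
        batches[PAGE_LEVELS[page_type]].append(f"{page_type}:{target}")
    # Lower levels first, so later pages can reference earlier ones.
    return [batches[level] for level in sorted(batches)]
```

Pages within one batch do not depend on each other in this sketch, so a batch could in principle be generated concurrently.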

ContextAssembler

Each page type has a ContextAssembler method that builds everything the LLM needs to know. Context is rendered into Jinja2 templates and token-budgeted to fit the LLM window.

FilePageContext
├── file_path, language
├── symbols (public first, then private)
│   └── name, kind, signature, docstring, visibility
├── imports, exports
├── dependencies + dependents
├── file_source_snippet (trimmed to token budget)
├── pagerank_score, betweenness_score, community_id
├── git_metadata (ownership, churn, significant commits)
├── co_change_pages
├── dead_code_findings
├── depth: "minimal" | "standard" | "thorough"
└── rag_context (related pages from vector search)
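Trimming `file_source_snippet` to a token budget could look roughly like the sketch below. It assumes a crude estimate of about four characters per token in place of a real tokenizer, and the function names and truncation marker are illustrative only.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def trim_to_budget(source: str, budget_tokens: int) -> str:
    """Keep whole lines from the top of the file until the budget is spent."""
    kept = []
    used = 0
    for line in source.splitlines():
        cost = estimate_tokens(line + "\n")
        if used + cost > budget_tokens:
            # Signal to the LLM that the snippet is incomplete.
            kept.append("[... truncated ...]")
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```

Cutting on line boundaries keeps the snippet syntactically readable, at the cost of slightly under-using the budget.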

Job System

Generation for large repos may take hours. The job system writes a checkpoint after every page, so a crashed run can resume from where it left off.

{
  "status": "running",
  "total_pages": 300,
  "completed_pages": 150,
  "completed_page_ids": ["file_page:src/auth.py", "..."],
  "current_level": 2
}
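A minimal sketch of how a checkpoint with this shape might drive resumption. The helper names, the atomic-write approach, and the `"completed"` status value are assumptions, not Repowise's actual job API.

```python
import json
import os
from pathlib import Path
from typing import List


def remaining_pages(checkpoint: dict, planned_page_ids: List[str]) -> List[str]:
    """Pages still to generate, preserving the planned order."""
    done = set(checkpoint["completed_page_ids"])
    return [pid for pid in planned_page_ids if pid not in done]


def record_completion(checkpoint: dict, page_id: str) -> dict:
    """Mark one page done and update the counters in place."""
    done = checkpoint["completed_page_ids"]
    if page_id not in done:
        done.append(page_id)
    checkpoint["completed_pages"] = len(done)
    if checkpoint["completed_pages"] >= checkpoint["total_pages"]:
        checkpoint["status"] = "completed"  # assumed status name
    return checkpoint


def save_checkpoint(checkpoint: dict, path: Path) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(checkpoint))
    os.replace(tmp, path)
```

On restart, a runner would load the saved file, call `remaining_pages`, and continue from the first page in that list.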

Persistence Layer

Repowise writes to three independent stores (SQL, vector store, and graph), each optimized for a different query pattern.

The repowise doctor --repair command checks SQL↔Vector↔FTS consistency and auto-repairs orphaned entries — re-embedding missing vectors, re-indexing missing FTS entries.
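Conceptually, the consistency check comes down to set differences between the page IDs each store knows about. The sketch below assumes SQL is treated as the source of truth, which this guide does not state; `diff_stores` and its result keys are hypothetical.

```python
from typing import Dict, Iterable, List


def diff_stores(
    sql_ids: Iterable[str],
    vector_ids: Iterable[str],
    fts_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare page IDs across stores, treating SQL as the reference (assumed)."""
    sql, vec, fts = set(sql_ids), set(vector_ids), set(fts_ids)
    return {
        "missing_vectors": sorted(sql - vec),   # candidates for re-embedding
        "missing_fts": sorted(sql - fts),       # candidates for re-indexing
        "orphaned_vectors": sorted(vec - sql),  # vectors with no SQL page
        "orphaned_fts": sorted(fts - sql),      # FTS rows with no SQL page
    }
```

A repair step would then re-embed or re-index the missing entries and delete the orphaned ones.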

Graph Algorithms

Every graph metric is computed from the import dependency graph — a directed graph where each node is a file and each edge is an import.

PageRank

Answers "How important is this file?" A file is important if many other important files depend on it.

score(B) = (1 - α) / N  +  α × Σ [ score(A) / out_degree(A) ]
                                        for each A that imports B

α = 0.85 (damping factor — probability of following an edge vs. teleporting)
  • Doc priority: highest PageRank files get wiki pages generated first
  • Generation depth: low PageRank + low churn → "minimal" docs (saves LLM tokens)
  • CLAUDE.md: files sorted by PageRank descending — most important appears first for AI assistants
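The formula above can be evaluated by power iteration. This is a minimal sketch rather than Repowise's implementation: the input shape (file mapped to the files it imports) is assumed, and files that import nothing spread their score uniformly, a common convention the formula itself does not specify.

```python
from typing import Dict, List


def pagerank(
    imports: Dict[str, List[str]],
    alpha: float = 0.85,
    iterations: int = 100,
    tol: float = 1e-9,
) -> Dict[str, float]:
    """Power-iteration PageRank; an edge A -> B means file A imports file B."""
    nodes = set(imports) | {t for targets in imports.values() for t in targets}
    n = len(nodes)
    if n == 0:
        return {}
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Score held by files with no imports is redistributed evenly.
        dangling = sum(score[v] for v in nodes if not imports.get(v))
        nxt = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for a, targets in imports.items():
            if targets:
                share = alpha * score[a] / len(targets)  # score(A) / out_degree(A)
                for b in targets:
                    nxt[b] += share
        delta = sum(abs(nxt[v] - score[v]) for v in nodes)
        score = nxt
        if delta < tol:
            break
    return score
```

Sorting files by the returned score in descending order gives the generation priority described in the list above.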

Betweenness Centrality

Answers "Is this file a bridge between subsystems?" High betweenness files sit on shortest paths connecting many file pairs. Removing them would disconnect parts of the codebase.

PageRank says "I'm important." Betweenness says "I'm a bottleneck." A bridge file with low PageRank still gets a Level 2 wiki page because it's a coupling point — changes to it ripple across subsystem boundaries.
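Betweenness on an unweighted directed graph is commonly computed with Brandes' algorithm. The sketch below is one such implementation, offered as an illustration; the guide does not say which algorithm or normalization Repowise uses. Scores here are unnormalized counts of shortest paths passing through each file, and each import list is assumed to contain no duplicates.

```python
from collections import deque
from typing import Dict, List


def betweenness(graph: Dict[str, List[str]]) -> Dict[str, float]:
    """Unnormalized betweenness centrality via Brandes' algorithm."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    bc = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Phase 1: BFS from s, counting shortest paths (sigma) to each node.
        order = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        dist = dict.fromkeys(nodes, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: walk back from the farthest nodes, accumulating dependency.
        delta = dict.fromkeys(nodes, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

In a graph where two entry files reach two storage files only through one shared module, that module scores 4 (one for each source and target pair) while every other file scores 0.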

Community Detection

The Louvain algorithm, run on the undirected version of the graph, finds natural clusters: files that are more connected to each other than to the rest. These clusters become subsystem labels in the wiki and color groups in the dependency graph visualization.

Dead Code Detection

Files with in-degree = 0 (nothing imports them) are dead code candidates. Repowise filters out entry points, test files, config files, and framework-loaded files before flagging, then scores confidence using git activity.

| Confidence | Condition |
| --- | --- |
| 1.0 | No imports + no commits in 90d + file age > 180d |
| 0.7 | No imports + no commits in 90d |
| 0.4 | No imports but has recent commits (dynamic loading?) |
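The confidence tiers above translate into a small decision function. The function name and parameters are illustrative; the inputs correspond to the graph in-degree and the `commit_count_90d` and `age_days` git metrics, and the sketch assumes entry points, tests, and config files have already been filtered out.

```python
from typing import Optional


def dead_code_confidence(
    in_degree: int, commit_count_90d: int, age_days: int
) -> Optional[float]:
    """Return a confidence score, or None if the file is not a candidate."""
    if in_degree > 0:
        return None  # something imports it, so it is not a candidate
    if commit_count_90d > 0:
        return 0.4  # unreferenced but active: possibly loaded dynamically
    if age_days > 180:
        return 1.0  # unreferenced, inactive, and old
    return 0.7  # unreferenced and inactive, but relatively new
```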

MCP Tools

Eight MCP tools expose Repowise's intelligence to AI coding assistants: Claude Code, Cursor, Cline, and any MCP-compatible editor.

get_overview()

Architecture summary, module map, entry points, tech stack.

when to use: First call when exploring an unfamiliar codebase.

get_context(targets=[...])

Docs, ownership, history, decisions, freshness for files or symbols.

when to use: Before reading or modifying specific code.

get_risk(targets=[...])

Hotspot score, dependents, co-change partners, risk summary.

when to use: Before modifying files — assess blast radius.

get_why(query="...")

Decisions search, path-based lookup, or health dashboard.

when to use: Before architectural changes — understand existing intent.

search_codebase(query="...")

Semantic search over the full wiki.

when to use: When you don't know where something lives.

get_dependency_path(source=..., target=...)

Connection path between two files.

when to use: Understand how two things are connected.

get_dead_code()

Unreachable files, unused exports, zombie packages.

when to use: Before cleanup or refactoring.

get_architecture_diagram(scope=...)

Mermaid diagram for the full repo or a module scope.

when to use: For documentation, presentations, or onboarding.

Critical Analysis

An honest assessment of what Repowise solves well, where it can fail, and what's been addressed.

What it solves well

  • Graph-first architecture captures structural dependencies that keyword search misses
  • Temporal intelligence (git history, co-change) adds a dimension pure static analysis can't provide
  • Bounded cascade with PageRank-sorted budgets ensures important docs stay fresh
  • Multi-source decision extraction mines architectural intent from commits, comments, and documentation
  • MCP integration makes intelligence immediately useful inside AI coding workflows

Failure probability matrix

| Failure | Probability | Impact | Recovery |
| --- | --- | --- | --- |
| Three-store inconsistency | Medium | Low–Medium | repowise doctor --repair |
| LLM hallucination in docs | Medium | Medium | Re-generate page |
| Documentation staleness | High | Medium | Run repowise update |
| Import resolution gaps (dynamic imports) | Medium–High | Medium | Partial (framework edges added) |
| Scale limits (>50K files) | Low–High (repo-dependent) | High | Architecture changes needed |
| Decision mining false positives | Medium | Low | Dismiss via CLI |

What has been implemented

doctor --repair

Three-store consistency check and auto-repair

Symbol cross-check

LLM output validated against AST symbols

Framework edges

Synthetic edges for pytest, Django, FastAPI, Flask

Adaptive cascade budget

Scales 10–50 based on change magnitude

Generation report

Cost/quality summary after every update run