Semantic Search Over Your Codebase: LanceDB + pgvector in Practice

repowise team · 10 min read

Tags: semantic code search · vector search codebase · lancedb code search · pgvector code · code search tool

Every developer has experienced "grep fatigue." You're navigating a 100k+ LOC codebase, trying to find where a specific piece of business logic—say, the grace period for subscription cancellations—is implemented. You grep for "grace period," "cancellation," and "subscription," only to be met with hundreds of log lines, test mocks, and variable declarations that have nothing to do with the actual logic.

Traditional text-based search is a blunt instrument. It relies on literal string matching, which fails the moment a developer uses a synonym or follows a different naming convention. To truly understand a codebase, we need to move beyond strings and toward concepts. This is the promise of semantic code search. By leveraging vector embeddings and specialized databases like LanceDB and pgvector, we can query our codebase using natural language, finding relevant logic even when the exact words don't match.

Beyond grep: Why Semantic Search Changes Everything

In the traditional development workflow, search is lexical. If you search for get_user_balance, the engine looks for those exact characters. If the function is actually named fetch_account_equity, grep will never find it.

grep Finds Strings, Semantic Search Finds Concepts

Semantic search operates on the "meaning" of the code. It uses embedding models (typically derived from Large Language Models) to transform code snippets or documentation into high-dimensional vectors (embeddings). In this vector space, pieces of code with similar functionality are placed close together.

When you search for "How do we handle user money?", the system doesn't look for the word "money." It looks for the vector closest to that concept, which might lead it directly to AccountService.java or balance.py. This transition from keyword matching to conceptual mapping is the core of a modern code search tool.

Natural Language Queries Over Code

With semantic search, your queries look like questions you'd ask a senior developer:

  • "Where do we validate JWT tokens?"
  • "Find the logic that calculates shipping costs for international orders."
  • "Show me how we handle database retries in the worker service."

This lowers the cognitive load for onboarding developers and speeds up debugging for veterans who might have forgotten the specific naming conventions of a module written six months ago.

One of the most powerful aspects of semantic search is its ability to find "neighboring" concepts. If you search for "authentication," a semantic engine will likely surface "authorization," "login," "session management," and "OAuth providers." It understands the relationship between these terms because they frequently appear in similar contexts within the training data of the embedding model.

Lexical vs Semantic Search Comparison

How Semantic Code Search Works

Building a semantic search engine for a codebase involves a multi-step pipeline that transforms raw text into a searchable mathematical space.

Step 1: Generate Embeddings From Documentation

The first step is turning your code into numbers. However, raw code is often noisy. It contains boilerplate, imports, and syntax that can dilute the "meaning." A more effective approach—and the one we use at repowise—is to first generate high-quality documentation for every file and function, and then embed that documentation.

Documentation provides a high-level summary that is much closer to the natural language queries users actually type. We use models like text-embedding-3-small (OpenAI) or local models via Ollama to generate these vectors.
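To make the embedding step concrete, here is a minimal, self-contained sketch. It uses a deterministic stand-in embed() function (a hash-based pseudo-vector) so the example runs without an API key; in a real pipeline you would call an embedding API such as OpenAI's text-embedding-3-small, which returns 1536-dimensional vectors.

```python
import hashlib
import math

def embed(text: str, dims: int = 8) -> list[float]:
    """Stand-in embedder: a deterministic pseudo-vector built from a hash.
    In production, replace this with a real embedding API call, e.g.
    OpenAI's text-embedding-3-small (1536 dimensions)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dims]]
    # Normalize to unit length so cosine similarity reduces to a dot product.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Embed the generated documentation, not the raw source code.
doc_summary = "Validates JWT access tokens and refreshes expired sessions."
vector = embed(doc_summary)
print(len(vector))  # 8
```

The same embed() function must later be applied to user queries, so that documents and queries live in the same vector space.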

Step 2: Store in a Vector Database

Once you have these vectors (typically arrays of 768 or 1536 floating-point numbers), you need a place to store them. Standard relational databases aren't optimized for "nearest neighbor" searches across high-dimensional space. This is where vector databases like LanceDB and extensions like pgvector come in. They use specialized indexes (like HNSW or IVFFlat) to make searching millions of vectors nearly instantaneous.

Step 3: Query With Natural Language

When a user types a query, that query is also converted into a vector using the same embedding model.

Step 4: Rank by Relevance

The database then performs a "cosine similarity" or "Euclidean distance" calculation to find the vectors in the database that are most similar to the query vector. The results are returned as a ranked list, often with a confidence score indicating how closely the result matches the intent.
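The ranking step above can be sketched in a few lines of plain Python. The 3-dimensional "embeddings" and file names below are toy values invented for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would use an approximate-nearest-neighbour index rather than a full scan.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy index: file path -> embedding of its generated documentation.
index = {
    "auth.ts":    [0.9, 0.1, 0.0],  # authentication logic
    "billing.py": [0.1, 0.9, 0.1],  # payment handling
    "utils.go":   [0.3, 0.3, 0.3],  # generic helpers
}
query = [0.8, 0.2, 0.1]  # embedding of "where do we validate tokens?"

# Rank every document by similarity to the query vector.
ranked = sorted(index.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print([name for name, _ in ranked])  # auth.ts ranks first
```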

Vector Database Options

Choosing the right storage layer depends on your infrastructure and team size.

LanceDB: Embedded, Zero-Config

LanceDB is an open-source, embedded vector database: it runs inside your application process (like SQLite), with no server to deploy or manage.

  • Pros: Zero management, extremely fast for local development, stores data in an efficient columnar format (Lance).
  • Best for: Individual developers, CLI tools, or self-hosted instances where you want to avoid the overhead of a separate database server.

pgvector: PostgreSQL Extension for Teams

If your organization already uses PostgreSQL, pgvector is often the logical choice. It adds a vector data type and distance operators to Postgres.

  • Pros: Leverages existing Postgres reliability, backups, and security. Allows you to join vector search results with regular relational data (e.g., "Find docs related to 'auth' where the file was modified in the last 30 days").
  • Best for: Enterprise environments and teams that want a unified data stack.
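For illustration, here is what that joined relational-plus-vector query might look like in SQL. This is a hedged sketch, not a production schema: the code_docs table is hypothetical, and the vectors are 3-dimensional for readability (a real column for text-embedding-3-small would be vector(1536)). The <=> operator is pgvector's cosine-distance operator.

```sql
-- Enable the extension (once per database).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE code_docs (
    id          bigserial PRIMARY KEY,
    file_path   text NOT NULL,
    summary     text NOT NULL,
    modified_at timestamptz NOT NULL,
    embedding   vector(3)  -- use vector(1536) for text-embedding-3-small
);

-- HNSW index for fast approximate nearest-neighbour search.
CREATE INDEX ON code_docs USING hnsw (embedding vector_cosine_ops);

-- Top 5 matches for a query embedding, restricted to recently changed files.
SELECT file_path, summary
FROM code_docs
WHERE modified_at > now() - interval '30 days'
ORDER BY embedding <=> '[0.8, 0.2, 0.1]'
LIMIT 5;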

When to Use Which

Feature          | LanceDB                     | pgvector (Postgres)
-----------------|-----------------------------|-----------------------------
Architecture     | Embedded (serverless)       | Client-server
Setup Complexity | Low (pip install)           | Medium (extension required)
Data Persistence | Local disk / S3             | Relational database
Query Language   | Python / JS API             | SQL
Ideal Use Case   | Local AI agents, CLI tools  | Shared team knowledge bases

Building Semantic Search with repowise

At repowise, we've integrated these technologies to provide a "turnkey" codebase intelligence experience. You don't need to write the indexing pipelines yourself; the platform handles the heavy lifting.

When you point repowise at a repository, it doesn't just read the files. It performs a deep analysis:

  1. Parsing: It builds a dependency graph across 10+ languages.
  2. Summarization: It uses LLMs to generate a "Wiki" for the repo. You can see auto-generated docs for FastAPI to see the level of detail provided.
  3. Embedding: It takes these generated summaries, which contain the "why" and "how" of the code, and stores them in a vector store.

What Gets Embedded (Docs, Not Raw Code)

Embedding raw code often leads to "hallucinations" in search results because the model might get distracted by a variable name like temp_var. By embedding the LLM-generated documentation, we ensure the search index is populated with high-signal, descriptive text. This significantly improves the accuracy of a vector search codebase implementation.

Using search_codebase() MCP Tool

One of the most powerful ways to interact with this search is through the Model Context Protocol (MCP). Repowise exposes a search_codebase() tool that AI agents (like Claude Code or Cursor) can call.

# Example of how an agent might use the tool internally
search_codebase(query="How is the rate limiting implemented for the API?")

The agent receives the most relevant documentation chunks, allowing it to answer questions or write code with much higher context than a simple file-open command would provide. You can see all 8 MCP tools in action to understand how this fits into the broader agentic workflow.

The Repowise Semantic Indexing Pipeline

Practical Examples

How does this look in practice? Let's look at three common scenarios where semantic search outperforms traditional tools.

"Where is authentication handled?"

Grep result: Thousands of matches for auth in node_modules, test files, and CSS classes.
Semantic result: Points directly to src/middleware/auth.ts and src/services/identity_provider.go because the documentation for those files explicitly mentions "handling user authentication and session validation."

"Find the rate limiting logic"

Grep result: Might find nothing if the developer used the term "throttling" or "request quotas."
Semantic result: Correctly identifies the RateLimiter class or the Redis-based counter logic because the embedding model understands that "rate limiting" and "throttling" are semantically identical in a software context.

"How does the billing system work?"

Grep result: Too broad. Returns every file that mentions invoice, price, subscription, or stripe.
Semantic result: Returns the get_overview() summary for the billing module, providing a high-level architectural explanation of the data flow between the checkout UI and the backend webhook handlers.

Optimizing Search Quality

Semantic search is not a "set it and forget it" feature. To get the best results from your lancedb code search or pgvector code setup, consider these three factors:

Documentation Quality Affects Search Quality

The "Garbage In, Garbage Out" rule applies here. If your documentation is sparse, your search results will be poor. This is why repowise focuses so heavily on generating "freshness-scored" documentation. When the code changes, the docs (and the embeddings) must change with it.

Embedding Model Choice

text-embedding-3-small is excellent for cost and speed. However, for massive codebases, text-embedding-3-large or specialized models like voyage-code-2 can provide better nuance for technical terminology. Repowise allows you to configure your provider, whether it's OpenAI, Anthropic, or a local Ollama instance.

Chunking Strategy

You can't just embed a 5,000-line file as one vector. The "meaning" gets lost. You must break the documentation into meaningful chunks—usually by module, class, or function. Repowise uses its knowledge of the code's AST (Abstract Syntax Tree) to chunk documentation logically, ensuring that each vector represents a discrete, understandable unit of logic.
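A simplified sketch of that idea: repowise chunks using the code's AST, but the same principle can be approximated by splitting generated documentation on its section headings, producing one chunk (and later one vector) per function. The heading marker and sample doc below are assumptions for illustration.

```python
def chunk_by_heading(doc: str, marker: str = "## ") -> list[str]:
    """Split generated documentation into per-section chunks.
    A real pipeline would chunk via the code's AST (per function,
    class, or module); splitting on headings approximates that here."""
    chunks: list[str] = []
    current: list[str] = []
    for line in doc.splitlines():
        if line.startswith(marker) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = """## RateLimiter.acquire
Blocks until a token is available in the bucket.

## RateLimiter.refill
Adds tokens back to the bucket at a fixed interval.
"""
chunks = chunk_by_heading(doc)
print(len(chunks))  # 2 chunks, one per documented function
```

Each chunk is then embedded separately, so a query about token refill logic matches the refill section rather than a diluted whole-file vector.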

LanceDB vs pgvector Technical Specs

Comparison: Semantic Search vs Code Search vs grep

Feature          | grep / ripgrep | GitHub Code Search | Semantic Search (Repowise)
-----------------|----------------|--------------------|---------------------------
Matching         | Exact string   | N-gram / regex     | Conceptual / vector
Synonym Support  | No             | Limited            | Yes (excellent)
Natural Language | No             | No                 | Yes
Context Aware    | No             | Partially          | Yes (via docs & graph)
Speed            | Fast (local)   | Fast (cloud)       | Fast (indexed)

Key Takeaways

  1. Semantic search is about intent, not strings. It allows you to find code based on what it does, not just what it is named.
  2. Embed documentation, not just code. Raw code contains too much noise for high-quality semantic mapping. Using LLM-generated summaries (like those in repowise) provides a much cleaner signal.
  3. Choose your DB based on your scale. Use LanceDB for local, zero-config projects and pgvector for team-wide, persistent knowledge bases.
  4. Integrate with AI Agents. Semantic search is the "eyes" of an AI agent. Using the search_codebase() MCP tool allows tools like Claude or Cursor to navigate your repo with the precision of a senior engineer.

If you're tired of digging through grep results, it's time to index your codebase. Check our architecture page to understand how we build these indexes, or see what repowise generates on real repos in our live examples.

FAQ

Q: Does semantic search replace grep? A: No. Grep is still superior for finding specific variable names or specific error strings. Semantic search is a complementary tool for architectural exploration and high-level discovery.

Q: Is my code sent to an LLM? A: If you use OpenAI or Anthropic for embeddings, yes. However, repowise supports local models via Ollama, allowing you to run the entire semantic search pipeline on your own hardware for maximum privacy.

Q: How often should I re-index? A: Ideally, every time your main branch is updated. Repowise tracks "freshness" to ensure that your search results stay in sync with your actual implementation. You can see how this looks in the hotspot analysis demo.

Try repowise on your repo

One command indexes your codebase.