Practical RAG with Marten and pgvector, Part 2: A Q&A Assistant for the Marten Docs

Part 1 built semantic search and a small RAG pipeline over movie plots, all on plain Postgres: Marten with the Marten.PgVector package for storage and vector search, and Ollama running nomic-embed-text and llama3.2 locally. Movies were a fun way to show that semantic search beats keyword search. This part points the exact same machinery at something I actually reach for: the Marten documentation, building an assistant you can ask “how do I enable pgvector in Marten?” and get a grounded, cited answer.

This post reads on its own, but it leans on the pieces Part 1 builds in depth (the IEmbeddingProvider, VectorSearchAsync, the HNSW index as a Marten schema object). I will recap those quickly and spend the words here on the one thing docs RAG needs that movie plots did not: chunking.

Why docs need chunking and movie plots did not

A movie plot is a couple of paragraphs. It embeds into one vector cleanly, and a search returns the whole thing. A documentation page is different: it can run to thousands of words covering installation, configuration, half a dozen APIs, and a troubleshooting section. Embed that whole page into a single 768-dimensional vector and you get mush: the vector is an average of every topic on the page, so it matches everything weakly and nothing strongly. Worse, when you feed a 4,000-word page into the LLM as “context”, you bury the two relevant sentences in noise and pay for the tokens.

The fix is to split each page into chunks, embed each chunk separately, and retrieve at the chunk level. Get chunking wrong and retrieval quality collapses no matter how good your embedding model is. So this is where the work goes.

The corpus: vendored VitePress docs

Marten publishes its docs as VitePress markdown. Rather than make the sample clone the repo, I vendored a copy of just the markdown into a single archive, corpus/docs.zip, laid out as marten/<path>.md. A small script refreshes it from local clones:

# scripts/vendor-docs.sh (abridged): source markdown only, then zip it up
rsync -a --prune-empty-dirs \
  --exclude='.vitepress/' --exclude='dist/' \
  --include='*/' --include='*.md' --exclude='*' \
  "$MARTEN_DOCS/" "$stage/marten/"
( cd "$stage" && zip -q -r -X corpus/docs.zip marten )

One gotcha worth calling out: a VitePress docs folder often contains a .vitepress/dist build directory with a second copy of every page. Include it by accident and your corpus is half duplicates, which then show up as near-identical search hits. Excluding .vitepress/ and dist/ roughly halves the corpus, leaving the 156 source pages we actually want, about 476 KB zipped. Keeping it as one archive means the repo carries a single artifact instead of hundreds of loose files, and the loader reads markdown straight out of the zip with ZipArchive, so there is no extraction step.
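To make the "no extraction step" point concrete, here is a minimal sketch of reading markdown entries straight out of an archive with ZipArchive. It is not the sample's DocsLoader: the ReadMarkdown and AddEntry helpers are names I made up for illustration, and the demo builds a tiny in-memory zip instead of opening corpus/docs.zip.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper (not the sample's DocsLoader): yields (path, markdown) for each
// .md entry under "<source>/", reading straight from the archive with no extraction.
static IEnumerable<(string Path, string Markdown)> ReadMarkdown(Stream zipStream, string source)
{
    using var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true);
    var prefix = source + "/";

    // Sort by name so the order of pages is the same on every run.
    foreach (var entry in archive.Entries.OrderBy(e => e.FullName, StringComparer.Ordinal))
    {
        if (!entry.FullName.StartsWith(prefix, StringComparison.Ordinal)) continue;
        if (!entry.FullName.EndsWith(".md", StringComparison.OrdinalIgnoreCase)) continue;

        using var reader = new StreamReader(entry.Open());
        yield return (entry.FullName[prefix.Length..], reader.ReadToEnd());
    }
}

// Writes one text entry into an archive that is open in Create mode.
static void AddEntry(ZipArchive target, string name, string text)
{
    using var writer = new StreamWriter(target.CreateEntry(name).Open());
    writer.Write(text);
}

// Demo: build a tiny archive in memory so the sketch runs without corpus/docs.zip.
var buffer = new MemoryStream();
using (var demoZip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
{
    AddEntry(demoZip, "marten/postgres/pgvector.md", "# pgvector Support\n\nbody");
    AddEntry(demoZip, "marten/logo.png", "not markdown");
}
buffer.Position = 0;

var pages = ReadMarkdown(buffer, "marten").ToList();
foreach (var page in pages)
    Console.WriteLine($"{page.Path}: {page.Markdown.Length} chars");
```

Stripping the source prefix here leaves paths like postgres/pgvector.md, the shape the Path field of a chunk expects.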

The stack, in one breath

Nothing about the storage or model layer changes from Part 1. Postgres with the vector extension and Ollama run in two containers; Marten is configured with opts.UsePgVector(); embeddings come from the same OllamaEmbeddingProvider hitting Ollama’s /api/embed. If any of that is unfamiliar, Part 1 walks through it slowly. The only additions here are a new document type and a chunking loader.

The chunk document

Each retrievable passage is a Marten document. Alongside the text and its embedding, it carries enough provenance to cite the source: which library, which file, the page title, the section heading, and the published URL.

public class DocChunk
{
    public string Id { get; set; } = "";        // "{source}:{path}#{index}"
    public string Source { get; set; } = "";     // e.g. "marten"
    public string Path { get; set; } = "";       // "postgres/pgvector.md"
    public string Url { get; set; } = "";         // published doc URL, for citations
    public string Title { get; set; } = "";       // page H1
    public string Heading { get; set; } = "";     // nearest "## " heading
    public string Content { get; set; } = "";     // the chunk text
    public float[]? Embedding { get; set; }       // 768-dim, null until ingested

    // "pgvector Support > Vector similarity search"
    public string Breadcrumb => Heading.Length > 0 ? $"{Title} > {Heading}" : Title;
}

Register it next to the Part 1 Movie type, and give it its own HNSW index (the HnswVectorIndex schema object from Part 1 is parameterised by table and column, so it is a one-liner):

opts.RegisterDocumentType<DocChunk>();
opts.Storage.ExtendedSchemaObjects.Add(new HnswVectorIndex(
    "public", "mt_doc_docchunk", "embedding", EmbeddingDimensions));

Chunking markdown

The loader reads each .md entry from the zip and does three things: clean the VitePress markup, find the page title, and split the body into section-sized chunks.

Cleaning. VitePress markdown is not plain markdown. It has YAML frontmatter, custom containers (::: tip), and most importantly snippet machinery: the actual code is wrapped in <!-- snippet: ... --> / <a id='snippet-...'></a> markers, with a “snippet source” footer, and some pages pull whole files in with <<< @/.... None of that belongs in an embedding. We strip the markers but keep the code fences, since the code is exactly what people search for.

foreach (var line in md.Split('\n'))
{
    var t = line.TrimEnd();
    if (title.Length == 0 && t.StartsWith("# ")) { title = t[2..].Trim(); continue; }

    if (t.StartsWith("<!-- snippet:") || t.StartsWith("<!-- endSnippet") ||
        t.StartsWith("<a id='snippet-") || t.StartsWith("<<< @") ||
        t.StartsWith(":::") || t == "[[toc]]" || t.Contains("snippet source | anchor"))
        continue;

    sb.Append(t).Append('\n');
}
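The loop above handles the line-level markers, but the YAML frontmatter mentioned earlier is a multi-line block and needs its own pass. A minimal sketch of one way to do it, assuming frontmatter only counts when the file's very first line is a --- fence; StripFrontmatter is a hypothetical helper name, not something taken from the sample:

```csharp
using System;

// Hypothetical helper: drops a leading YAML frontmatter block, if the page has one.
static string StripFrontmatter(string md)
{
    var lines = md.Replace("\r\n", "\n").Split('\n');

    // Frontmatter only counts when the very first line is the "---" fence.
    if (lines[0].TrimEnd() != "---") return md;

    for (var i = 1; i < lines.Length; i++)
    {
        // The next "---" line closes the block; keep everything after it.
        if (lines[i].TrimEnd() == "---")
            return string.Join('\n', lines[(i + 1)..]);
    }

    return md; // no closing fence: leave the page untouched
}

var sample = "---\ntitle: pgvector\n---\n# pgvector Support\n\nbody";
Console.WriteLine(StripFrontmatter(sample));
```

Running this before the line loop means a title: key in the frontmatter can never be mistaken for body text.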

Splitting. The natural seam in a doc page is the ## heading: each section is about one topic, which is exactly the granularity we want. So the primary split is on ## lines, with each chunk tagged by its heading. A few sections are still too long (a big reference section can run past a thousand words), so any section over ~1,500 characters is packed into sub-chunks on paragraph boundaries, with a one-paragraph overlap so a fact that straddles a boundary still lands whole in at least one chunk.

private static IEnumerable<(string Heading, string Content)> Chunk(string body)
{
    var heading = "";
    var section = new StringBuilder();

    foreach (var line in body.Split('\n'))
    {
        if (line.StartsWith("## "))
        {
            foreach (var c in Emit(heading, section.ToString())) yield return c;
            heading = line[3..].Trim().TrimStart('#', ' ');
            section.Clear();
        }
        else section.Append(line).Append('\n');
    }
    foreach (var c in Emit(heading, section.ToString())) yield return c;
}

Emit yields the section as-is when it fits, or packs paragraphs up to the cap with overlap when it does not. The choice of ~1,500 characters is a deliberate middle ground: small enough that a chunk is about one idea (good for retrieval precision), large enough to keep a code sample and its explanation together (good for the answer). Tune it to your docs.
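For readers who want to see the packing spelled out, here is one way Emit could look. Treat it as a sketch under my own assumptions rather than the sample's code: paragraphs are split on blank lines (so a code fence containing a blank line can be split too), the cap is soft (separators are not counted and a single oversized paragraph is kept whole), and the maxChars parameter is an addition for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A sketch of Emit, not the sample's implementation. maxChars is a soft cap.
static IEnumerable<(string Heading, string Content)> Emit(
    string heading, string section, int maxChars = 1500)
{
    var text = section.Trim();
    if (text.Length == 0) yield break;            // nothing under this heading

    if (text.Length <= maxChars)                  // fits: emit the section as-is
    {
        yield return (heading, text);
        yield break;
    }

    var paragraphs = text.Split("\n\n",
        StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

    var current = new List<string>();
    var size = 0;
    var hasNew = false;   // true once the chunk holds more than the carried-over paragraph

    foreach (var p in paragraphs)
    {
        if (hasNew && size + p.Length > maxChars)
        {
            yield return (heading, string.Join("\n\n", current));

            // One-paragraph overlap: the next chunk starts with the previous chunk's last paragraph.
            var overlap = current[^1];
            current = new List<string> { overlap };
            size = overlap.Length;
            hasNew = false;
        }

        current.Add(p);
        size += p.Length;
        hasNew = true;
    }

    if (hasNew) yield return (heading, string.Join("\n\n", current));
}

// Four 600-character paragraphs pack into three chunks, each sharing a paragraph with its neighbour.
var demo = string.Join("\n\n", new[] { 'a', 'b', 'c', 'd' }.Select(ch => new string(ch, 600)));
foreach (var chunk in Emit("Installation", demo))
    Console.WriteLine($"{chunk.Heading}: {chunk.Content.Length} chars");
```

The hasNew flag is there so a chunk is never emitted when it contains only the carried-over paragraph, which would just duplicate the tail of the previous chunk.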

Citations. The published URL is derived from the file path, following VitePress’s routing: index.md maps to the directory root, everything else to a .html page.

private static string ToUrl(string source, string path)
{
    var baseUrl = BaseUrls[source];   // e.g. "marten" -> "https://martendb.io/"
    var p = path.EndsWith(".md") ? path[..^3] : path;
    if (p == "index") return baseUrl;
    if (p.EndsWith("/index")) return baseUrl + p[..^5];   // drop "index", keep the trailing "/"
    return baseUrl + p + ".html";
}

Ingesting

Ingestion mirrors Part 1 (batch-embed, accumulate, then BulkInsertAsync with OverwriteExisting), with one refinement: we embed the breadcrumb (title and heading) together with the chunk body. That gives the vector a little context about where the passage sits, so a chunk under “pgvector Support > Installation” is nudged toward installation queries even if the word never appears in the body.

var chunks = new List<DocChunk>();

foreach (var batch in DocsLoader.Load(zip, source, filter).Take(limit).Chunk(64))
{
    var inputs = batch
        .Select(c => OllamaEmbeddingProvider.DocumentPrefix + c.Breadcrumb + "\n\n" + c.Content)
        .ToArray();

    var vectors = await embedder.GenerateEmbeddingsAsync(inputs);
    for (var i = 0; i < batch.Length; i++)
        batch[i].Embedding = vectors[i].Memory.ToArray();

    chunks.AddRange(batch);
}

await store.BulkInsertAsync(chunks, BulkInsertMode.OverwriteExisting);

The CLI takes --source and --filter so you can ingest a slice while developing instead of waiting on the whole corpus. To embed just the Postgres docs:

dotnet run -- docs-ingest --filter postgres
# Embedded 63 chunks ...
# Bulk inserting 63 chunks ...
# Done.
dotnet run -- index    # adds the HNSW index for the new table

The full Marten docs are about 1,500 chunks; as Part 1 noted, embedding throughput on a CPU is the constraint, not Postgres (the bulk insert at the end is instant), so budget a few minutes or filter down while iterating.

Retrieving from the docs: all the relevant content, not a fixed top-K

Like movie-search in Part 1, docs-search retrieves first and answers second; this section is the retrieval half. For movies, returning the top 5 was fine. For docs it is wrong: one question might be answered by two passages and another by twelve, so a fixed K either starves the answer or pads it with noise. Docs retrieval instead returns every chunk within a cosine-distance threshold. VectorSearchAsync gives the nearest N but not their distances, so we pull a candidate pool, recompute the distance client-side (plain arithmetic on the two vectors), and keep everything inside the relevance bar:

var candidates = await session.VectorSearchAsync<DocChunk>(
    x => x.Embedding, queryVector, limit: maxCandidates, distance: DistanceFunction.Cosine);

var q = queryVector.Memory.ToArray();
var relevant = candidates
    .Select(c => (Chunk: c, Distance: CosineDistance(q, c.Embedding!)))
    .Where(t => t.Distance <= threshold)     // all relevant, not a fixed K
    .OrderBy(t => t.Distance)
    .ToList();
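The CosineDistance helper is not shown above, so here is a minimal sketch of the arithmetic: one minus the cosine similarity, which is the same quantity pgvector's <=> operator computes. The span-based signature and the zero-vector handling are my own choices for illustration, not necessarily what the sample does.

```csharp
using System;

// Cosine distance = 1 - (a . b) / (|a| * |b|). Ranges from 0 (same direction) to 2 (opposite).
static double CosineDistance(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        // Accumulate in double so 768 float products do not lose precision.
        dot   += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }

    // Assumption: treat a zero vector as unrelated (distance 1) rather than returning NaN.
    if (normA == 0 || normB == 0) return 1.0;

    return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

Console.WriteLine(CosineDistance(new float[] { 1, 0 }, new float[] { 1, 0 }));   // same direction
Console.WriteLine(CosineDistance(new float[] { 1, 0 }, new float[] { 0, 1 }));   // orthogonal
Console.WriteLine(CosineDistance(new float[] { 1, 0 }, new float[] { -1, 0 }));  // opposite
```

A float[] converts implicitly to ReadOnlySpan<float>, so a call shaped like CosineDistance(q, c.Embedding!) works with this signature.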

Before the model sees anything, retrieval gathers every passage within the threshold. After it answers, docs-search prints those passages as a Sources list, so every claim is traceable to a page. For the pgvector question it retrieves ten:

Sources:
  [1] pgvector Support - https://martendb.io/postgres/pgvector.html
  [2] pgvector Support > Installation - https://martendb.io/postgres/pgvector.html
  [3] pgvector Support > Bring-your-own embeddings - https://martendb.io/postgres/pgvector.html
  [4] pgvector Support > Enabling pgvector on a store - https://martendb.io/postgres/pgvector.html
  [5] Modeling documents > Storing Documents - https://martendb.io/tutorials/modeling-documents.html
  [6] Storing Documents - https://martendb.io/documents/storing.html
  [7] pgvector Support > Storing vectors on a document - https://martendb.io/postgres/pgvector.html
  [8] pgvector Support > Notes & limitations - https://martendb.io/postgres/pgvector.html
  [9] Bootstrapping Marten > Register DocumentStore with AddMarten() - https://martendb.io/configuration/hostbuilder.html
  [10] PostGIS Spatial Support > Installation - https://martendb.io/postgres/postgis.html

Ten distinct sections, not an arbitrary five: mostly the pgvector page, plus the document-storage and bootstrapping pages that genuinely bear on the question. The threshold is the knob: on this corpus with nomic-embed-text, genuinely relevant chunks land below ~0.30 while loosely related Marten pages pack in just past it, so a 0.30 threshold (MOVIERAG_DOCS_THRESHOLD, or --threshold) captures the relevant set without dragging in noise. Calibrate it against your own docs by reading the distances off a few real queries and cutting where the relevant cluster ends. The count is deliberately not fixed, which raises the question the next section answers: a broad query legitimately matches dozens of chunks, so how many do you actually hand to the model?

Answering: retrieve widely, feed the model narrowly

Here is the lesson I learned the hard way. My first cut passed every relevant chunk to the model. For a tight question that was ten passages and the answer was great. For a broad one like “how do I register Marten with DI” the threshold matched 36 chunks, and llama3.2 fell apart: it answered a different question and invented an IMongoContext that exists nowhere in Marten. A 3-billion-parameter model does not get more accurate when you bury the two relevant passages under thirty-four others; it gets lost.

So the rule is retrieve widely but feed narrowly: take all the relevant chunks, then hand the model only the nearest --context of them (default 10). The nearest few carry the answer, and the cap keeps the prompt focused.

var relevant = await RetrieveRelevantDocsAsync(store, qv, threshold, max);
var hits = relevant.Take(contextLimit).ToList();   // retrieve widely, feed narrowly

The rest is Part 1’s generation, with a docs-tuned prompt: answer only from the excerpts, and cite them by number. After streaming the answer, we print the sources so every claim is traceable to a doc page.

var context = string.Join("\n\n", hits.Select((h, i) =>
    $"[{i + 1}] {h.Chunk.Breadcrumb} ({h.Chunk.Url})\n{h.Chunk.Content}"));

var prompt =
    $"""
    You are a precise documentation assistant for the Marten .NET library.
    Answer the question using ONLY the documentation excerpts below. Cite the excerpts you
    use by their bracketed number, like [2]. If the excerpts do not contain the answer, say
    you don't know rather than guessing.

    EXCERPTS:
    {context}

    QUESTION: {question}

    ANSWER:
    """;

Asked a real question, llama3.2 answers from the retrieved passages and grounds the code in them:

$ dotnet run -- docs-search "How do I enable pgvector in Marten and store an embedding on a document?"
10 relevant chunk(s); using the top 10. Asking llama3.2 ...

To enable pgvector in Marten and store an embedding on a document, you need to
follow these steps:

1. Add the Marten.PgVector NuGet package to your project.
2. Ensure your local PostgreSQL instance has the `vector` extension installed.
3. Call UsePgVector() when creating your DocumentStore.
4. Create a document type with a vector property:

    public class ProductWithVector
    {
        public Guid Id { get; set; }
        public float[]? Embedding { get; set; }   // stored in JSONB, cast to vector() at query time
    }

[... followed by the Sources list shown above ...]

The steps and the code are correct, drawn straight from the retrieved passages, and the Sources list above makes every page traceable. The “answer only from these excerpts” instruction is doing real work: with a small local model especially, grounding is what stops it from inventing a plausible-but-wrong API. One honest wrinkle: llama3.2 does not reliably honour the “cite by bracketed number” request inline, which is one more reason the Sources list earns its keep, and one more thing a sharper hosted model does better.

What carries over for free

Because this is the same stack, everything Part 1 built applies without change:

The HNSW index is the same ISchemaObject, pointed at mt_doc_docchunk. Marten creates and tracks it like any other schema object.
Swapping to a hosted model is still a one-line change. For a real internal docs assistant this is where I would reach for a frontier model: doc Q&A rewards a model that synthesises across several retrieved sections and refuses cleanly when the docs do not cover something, which is exactly where small local models struggle.

A few things I would tune for production

Chunk size and overlap are the biggest levers on quality. Too small and a code sample gets cut from its explanation; too large and retrieval goes mushy again. Measure on real questions.
Embed the breadcrumb. Prefixing the title and heading to each chunk’s embedding text is cheap and noticeably improves retrieval on short, keyword-light passages.
Re-ingest on doc changes. The chunk Id is {source}:{path}#{index}, so re-running ingestion upserts a page’s chunks in place. A docs site that ships often wants this wired to a build step.
Mix in versioning if your docs are versioned: a Version field on DocChunk plus a filter keeps answers from blending two major versions.

Wrapping up

The jump from “search movie plots” to “answer questions from real documentation” turned out to be small: one new document type and a markdown chunker. Storage, embeddings, vector search, indexing, caching, and generation all came straight from Part 1. The lesson is that the hard, interesting part of docs RAG is not the database or the model; it is turning your source material into well-sized, well-labelled chunks. Postgres and Marten handle the rest.

The full project, including the docs loader, the vendored corpus/docs.zip, and the docs-ingest and docs-search commands, is on GitHub at mysticmind/marten-pgvector-rag. Point it at your own docs and you have a cited Q&A assistant over them in an afternoon.