Practical RAG with Marten and pgvector, Part 1: Semantic Search over Movie Plots

There is a reflex, the moment someone says “vector search”, to reach for a brand new piece of infrastructure: a dedicated vector database, a managed embeddings API, a bill. If you are already running Postgres, you usually need none of that. The pgvector extension turns Postgres into a perfectly good vector store, and the Marten.PgVector companion package wires it into Marten so you keep working with plain C# documents.

This post builds a small but complete RAG pipeline with exactly three moving parts, all of them free and local:

Marten + Marten.PgVector: document storage and vector similarity search, on top of Postgres
pgvector: the nearest-neighbour math, inside Postgres
Ollama: nomic-embed-text for embeddings and llama3.2 for generation, running on your machine

Running the models locally keeps the getting-started cost at zero and nothing leaves the machine, which is ideal for learning the moving parts. (In production you would likely point the generation step at a hosted frontier model for sharper answers, which is a one-line swap we will get to.) The one thing that stays constant throughout is the storage layer: no second database. Your vectors live in the Postgres you already run.

To make it concrete we will search the CMU Movie Summary Corpus, about 42,000 plot summaries, and the payoff is the kind of query keyword search simply cannot answer:

dotnet run -- movie-search "which film involves a treasure map and bandits in the desert?"

movie-search runs the two halves of RAG behind one command: it first retrieves the plots most similar in meaning to the query (matches that need not contain any of the query’s words), then hands those plots to a local LLM to answer in prose. We will build it in exactly those two steps, retrieval then generation, because the first half is where the “semantic search beats keyword search” magic lives. The full, runnable code is on GitHub at mysticmind/marten-pgvector-rag.

The shape of a RAG pipeline

RAG (Retrieval-Augmented Generation) is less exotic than the acronym suggests. It is two steps:

Retrieve the handful of documents most relevant to a question. We do this with vector similarity: convert text to a vector (“embedding”) such that similar meanings land near each other, then find the nearest neighbours of the question vector.
Generate an answer by stuffing those documents into an LLM prompt as context.

Everything hard about step 1, generating embeddings, storing them, searching them quickly, is what Marten.PgVector and Ollama take off your plate. Step 2 is a single HTTP call.

Concern	Handled by
Turn text into a 768-dim vector	Ollama + `nomic-embed-text`
Store the vector next to its document	Marten (it lives in the JSONB)
`CREATE EXTENSION vector`, type mapping	`opts.UsePgVector()`
Nearest-neighbour query	`session.VectorSearchAsync(...)`
Generate the final answer	Ollama + `llama3.2`

Standing up the infrastructure

Two containers. pgvector publishes an official Postgres image with the extension already compiled in, so there is nothing to build:

# docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: movierag
    ports: ["5432:5432"]
    volumes: ["pgdata:/var/lib/postgresql/data"]

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes: ["ollama:/root/.ollama"]

volumes:
  pgdata:
  ollama:

Bring it up and pull the two models into the Ollama container:

docker compose up -d

docker compose exec ollama ollama pull nomic-embed-text   # ~274 MB, 768-dim embeddings
docker compose exec ollama ollama pull llama3.2            # ~2 GB, the generator

nomic-embed-text is a deliberately boring choice, and that is the point: it is small, fast on a CPU, produces 768-dimensional vectors, and is good enough that the retrieval quality will not be your bottleneck. Swap in mxbai-embed-large or all-minilm later by changing one string and one number.

The dataset

The CMU Movie Summary Corpus ships as two tab-separated files joined on a Wikipedia movie id: movie.metadata.tsv (title, year, genres as a Freebase JSON dict) and plot_summaries.txt (id, then the plot). A small script fetches and unpacks them:

curl -L http://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz | tar -xz

Parsing is unremarkable C#: split on tabs, pull the genre names out of the Freebase dictionary, and keep the plot:

public static IEnumerable<Movie> Load(string dataDir)
{
    var metadata = LoadMetadata(Path.Combine(dataDir, "movie.metadata.tsv"));

    foreach (var line in File.ReadLines(Path.Combine(dataDir, "plot_summaries.txt")))
    {
        var tab = line.IndexOf('\t');
        if (tab <= 0 || !int.TryParse(line.AsSpan(0, tab), out var wikiId)) continue;
        if (!metadata.TryGetValue(wikiId, out var meta)) continue;

        var plot = line[(tab + 1)..].Trim();
        if (plot.Length == 0) continue;

        yield return new Movie
        {
            Id = wikiId,            // reuse the Wikipedia id so re-ingest is an upsert
            Title = meta.Title,
            Year = meta.Year,
            Genres = meta.Genres,
            Plot = plot.Length > 6000 ? plot[..6000] : plot   // keep within the model's window
        };
    }
}

A quick movie-peek over the parsed output is reassuring: title, year and genres line up with the plot text:

#31186339  The Hunger Games (2012)  genres=[Action/Adventure, Science Fiction, Action, Drama]
   The nation of Panem consists of a wealthy Capitol and twelve poorer districts...

#595909  A Cry in the Dark (1988)  genres=[Crime Fiction, Drama, Docudrama, World cinema, Courtroom Drama]
   Seventh-day Adventist Church pastor Michael Chamberlain, his wife Lindy...

Generating embeddings with Ollama

Ollama exposes embeddings at POST /api/embed. It takes a model and either a single string or an array of strings, and returns one vector per input, so we can embed a whole batch in one round trip.

Marten.PgVector does not ship a default embedding model; it asks you to implement a tiny IEmbeddingProvider interface and bring whatever you like. Here it is, backed by Ollama:

public sealed class OllamaEmbeddingProvider(
    HttpClient http, string model = "nomic-embed-text", int dimensions = 768)
    : IEmbeddingProvider
{
    // nomic-embed-text is trained with task prefixes. Using them measurably
    // improves retrieval: documents under "search_document:", queries under
    // "search_query:". This is the single most common thing people get wrong.
    public const string DocumentPrefix = "search_document: ";
    public const string QueryPrefix    = "search_query: ";

    public int Dimensions { get; } = dimensions;

    public async Task<Vector[]> GenerateEmbeddingsAsync(string[] texts, CancellationToken ct = default)
    {
        if (texts.Length == 0) return [];

        using var resp = await http.PostAsJsonAsync(
            "/api/embed", new EmbedRequest(model, texts), ct);
        resp.EnsureSuccessStatusCode();

        var body = await resp.Content.ReadFromJsonAsync<EmbedResponse>(ct)
                   ?? throw new InvalidOperationException("Ollama returned an empty embeddings response.");

        return body.Embeddings.Select(e => new Vector(e)).ToArray();
    }

    private sealed record EmbedRequest(
        [property: JsonPropertyName("model")] string Model,
        [property: JsonPropertyName("input")] string[] Input);

    private sealed record EmbedResponse(
        [property: JsonPropertyName("embeddings")] float[][] Embeddings);
}

Vector here is Pgvector.Vector, the type Marten.PgVector round-trips through Npgsql.

Two details that matter in practice and are easy to miss:

The task prefixes. nomic-embed-text was trained so that documents and queries get different prefixes. Embed your corpus with search_document: and your queries with search_query:. Skip this and search still “works”, but noticeably worse; it is the most common own-goal with this model.
Dimensions is load-bearing. It must equal the model’s output width (768 for nomic-embed-text) and the vector(N) column width. Changing it later is a migration, not a config tweak: pgvector cannot resize a vector column in place.

Wiring up Marten

The document is an ordinary POCO. The embedding is just a float[] property; Marten stores it inside the document’s JSONB and casts it to vector(768) at query time.

public class Movie
{
    public int Id { get; set; }          // Wikipedia movie id
    public string Title { get; set; } = "";
    public int? Year { get; set; }
    public string[] Genres { get; set; } = [];
    public string Plot { get; set; } = "";
    public float[]? Embedding { get; set; }   // 768-dim, null until ingested
}

Enabling pgvector is a single call inside AddMarten/DocumentStore.For:

public static DocumentStore Build() =>
    DocumentStore.For(opts =>
    {
        opts.Connection(ConnectionString);

        // 1. CREATE EXTENSION IF NOT EXISTS vector (on every database Marten manages)
        // 2. NpgsqlDataSourceBuilder.UseVector() so Pgvector.Vector round-trips
        opts.UsePgVector();

        opts.RegisterDocumentType<Movie>();

        opts.AutoCreateSchemaObjects = AutoCreate.CreateOrUpdate;  // demo convenience
    });

UsePgVector() does the two annoying things you would otherwise do by hand: it adds the CREATE EXTENSION to Marten’s schema management (correctly, even for per-tenant databases, historically a sharp edge), and it registers the pgvector type handler on the Npgsql data source so Vector serialises without ceremony.

Ingesting: embed, then store

Ingestion is a loop: pull a batch of movies, embed their plots in one call, attach each vector to its document, accumulate them, and then load the whole set with Marten’s bulk insert. Store plus SaveChanges is fine for a handful of documents, but for thousands BulkInsertAsync is the right tool: it streams through COPY instead of issuing per-row upserts. BulkInsertMode.OverwriteExisting makes a re-run an idempotent upsert, and since the document id is the Wikipedia id, re-ingesting updates rows rather than duplicating them.

var movies = new List<Movie>();

foreach (var batch in CmuDatasetLoader.Load(dataDir).Take(limit).Chunk(64))
{
    var inputs = batch
        .Select(m => OllamaEmbeddingProvider.DocumentPrefix + m.Plot)
        .ToArray();

    var vectors = await embedder.GenerateEmbeddingsAsync(inputs);
    for (var i = 0; i < batch.Length; i++)
        batch[i].Embedding = vectors[i].Memory.ToArray();

    movies.AddRange(batch);
}

await store.BulkInsertAsync(movies, BulkInsertMode.OverwriteExisting);

Embedding is the slow part: it is CPU-bound model inference, not database work. Batching (64 plots per /api/embed call here) keeps the HTTP overhead negligible, and the bulk insert at the end is a rounding error next to the model time; the wall-clock is dominated by embedding. On a laptop CPU, budget a few minutes per 10k movies. Start with --limit 2000 to get a feel for it before committing to the full corpus.

Retrieval: the first half of a search

movie-search starts by retrieving the most relevant plots, and that retrieval is one method call. VectorSearchAsync takes a selector for the vector property, the query vector, a limit, and a distance function:

var queryVector = (await embedder.GenerateEmbeddingsAsync(
    [OllamaEmbeddingProvider.QueryPrefix + query]))[0];

await using var session = store.QuerySession();

var hits = await session.VectorSearchAsync<Movie>(
    x => x.Embedding,
    queryVector,
    limit: 5,
    distance: DistanceFunction.Cosine);

The three distance functions map onto pgvector’s three operator classes. For text embeddings you almost always want cosine distance, which is the default:

`DistanceFunction`	pgvector operator	Use
`Cosine` (default)	`<=>`	Text embeddings (what you want here)
`L2`	`<->`	Euclidean distance
`InnerProduct`	`<#>`	Inner product

Every movie-search lists the films it retrieved under a Movies: heading (title, year, genre, and a plot snippet), so you can always see what the answer was grounded in. For the phrase “a heist that goes wrong” (run against a 500-movie slice, movie-ingest --limit 500, so the demo runs in a few minutes; the full 42k set surfaces more recognisable titles), the retrieved films are crime capers ranked by semantic closeness, and crucially, not one of them contains the word “heist”:

Movies:
  [1] Next Day Air (2009)  [Crime Fiction, Comedy of Errors, Action]
      Leo works for Next Day Air, a package delivery company, but is going to get fired for any more mistakes. While delivering a package addressed to Jesus…
  [2] The Good, the Bad, the Weird (2008)  [Adventure, World cinema, Costume Adventure]
      In the desert wilderness of 1930s Manchuria, The Bad, a bandit and hitman, is hired to acquire a treasure map from a Japanese official traveling by train…
  [3] Lucky Losers (1950)  [Crime Fiction, Comedy]
      Slip and Sach's boss, David J. Thurston, has allegedly committed suicide. Slip finds a book of matches with the name of a local nightclub…
  [4] Big Nothing (2006)  [Crime Fiction, Crime Comedy, Action/Adventure]
      The film is set in a small Oregon town, where a brutal serial killer nicknamed the 'Oregon undertaker' has been murdering young women…

Why this beats `LIKE '%heist%'`

A keyword search for '%heist%' returns nothing here; none of these plots use the word. Semantic search finds them anyway, because their meaning sits near the query’s meaning in vector space: a botched delivery job, a train robbery for a treasure map, a crew of small-time crooks. That is the entire value proposition: you are searching concepts, not strings. Synonyms, paraphrases and fuzzy intent all just work, with no synonym lists or full-text configuration to maintain.

Making it fast: the HNSW index

By default VectorSearchAsync casts the JSONB-stored array to vector(N) at query time and scans. That is fine for thousands of rows and fine for trying things out. For tens of thousands and up, add an HNSW index so the search is index-accelerated.

There is one wrinkle worth understanding: because the vector lives inside the JSONB document, the index has to be built on the same cast expression the query uses, not on a bare column. The DDL we want is:

CREATE INDEX IF NOT EXISTS mt_doc_movie_embedding_hnsw
ON public.mt_doc_movie
USING hnsw (((data ->> 'embedding')::vector(768)) vector_cosine_ops);

The vector_cosine_ops operator class must match the distance function you query with (Cosine maps to vector_cosine_ops).

Letting Marten manage the index

You could just run that statement once and move on. But as of 9.9 Marten.PgVector has no first-class API to declare a vector index: there is no IndexMethod.Hnsw, and Marten’s Index() builder only covers btree/gin/gist/brin/hash on members or computed columns, none of which can express USING hnsw (...vector_cosine_ops) over a cast-to-vector(N) expression.

If you create the index with a one-off CREATE INDEX, Marten never learns it is supposed to exist. That has real consequences: the index is absent from the DDL Marten generates for your schema, Marten’s schema diff does not know to account for it, and ApplyAllConfiguredChangesToDatabaseAsync() will not recreate it when you stand the database up somewhere new. You end up maintaining that one index by hand, out of step with every other schema object Marten owns.

The fix is to hand Marten the index as a schema object it owns, through the same opts.Storage.ExtendedSchemaObjects hook that Marten.PgVector itself uses for its projection tables. An ISchemaObject is mostly mechanical: write the create/drop DDL, and tell Marten how to detect whether the index already exists so it knows whether to create it:

public sealed class HnswVectorIndex(
    string schema, string table, string member, int dimensions,
    string opsClass = "vector_cosine_ops") : ISchemaObject
{
    private readonly string _indexName = $"{table}_{member}_hnsw";
    public DbObjectName Identifier => new(schema, _indexName);
    public IEnumerable<DbObjectName> AllNames() => [Identifier];

    private string CreateSql =>
        $"CREATE INDEX IF NOT EXISTS {_indexName} ON {schema}.{table} " +
        $"USING hnsw (((data ->> '{member}')::vector({dimensions})) {opsClass});";

    public void WriteCreateStatement(Migrator m, TextWriter w) => w.WriteLine(CreateSql);
    public void WriteDropStatement(Migrator m, TextWriter w) =>
        w.WriteLine($"DROP INDEX IF EXISTS {schema}.{_indexName};");

    // Marten runs this during schema diffing; the presence of a row decides the delta.
    public void ConfigureQueryCommand(DbCommandBuilder b) => b.Append(
        $"select 1 from pg_indexes where schemaname = '{schema}' and indexname = '{_indexName}';");

    public async Task<ISchemaObjectDelta> CreateDeltaAsync(DbDataReader reader, CancellationToken ct)
    {
        var exists = false;
        while (await reader.ReadAsync(ct)) exists = true;
        return new Delta(this, exists ? SchemaPatchDifference.None : SchemaPatchDifference.Create);
    }
    // Delta.WriteUpdate emits CreateSql only when the difference is Create; rollback drops it.
}

opts.Storage.ExtendedSchemaObjects.Add(new HnswVectorIndex(
    "public", "mt_doc_movie", "embedding", dimensions: 768));

Now the index is a first-class citizen of the schema: ApplyAllConfiguredChangesToDatabaseAsync() creates it when missing, the diff reports None once it exists (re-running is a clean no-op), and it travels with your generated migration scripts. One genuine caveat: the index depends on mt_doc_movie existing, so the document table has to be created first; in practice you ingest (which creates the table) before applying the index, and on a fresh database a single schema apply handles both.

Honestly, this is enough boilerplate that it makes the case for the feature living in the library. A natural enhancement to Marten.PgVector would be a declarative option, something like opts.RegisterDocumentType<Movie>().VectorIndex(x => x.Embedding, DistanceFunction.Cosine), that emits exactly this ISchemaObject for you. Until then, the class above is a drop-in.

If you expect a genuinely large, write-heavy corpus, Marten.PgVector also offers a VectorProjection base class that maintains a dedicated embedding table off your event stream and skips re-embedding unchanged content, a good next step but more than this dataset needs.

Closing the loop: generation

That retrieval is most of the work. The second half of movie-search adds the generation: take the hits it just retrieved, drop their plots into a prompt as context, and ask the LLM to answer using only that context. Ollama’s /api/generate streams the response token by token:

var hits = await RetrieveByVectorAsync(store, queryVector, k);

var context = string.Join("\n\n", hits.Select((m, i) =>
    $"[{i + 1}] {m.Title} ({m.Year}) [{Genres(m)}]\n{Snippet(m.Plot, 600)}"));

var prompt =
    $"""
    You are a film buff working only from the movie summaries below.
    - If the input is a question, answer it and name the movies you used.
    - If it is a topic or search phrase, name the movies that best match it and say in a line why.
    Use ONLY these summaries. Do not refuse just because the exact words are absent; match on meaning.

    MOVIES:
    {context}

    INPUT: {question}

    RESPONSE:
    """;

await foreach (var token in chat.StreamAsync(prompt))
    Console.Write(token);

That two-rule prompt matters: a query like “a heist that goes wrong” is a phrase, not a question, so a strict “answer the question or say you don’t know” prompt would make the model refuse even when retrieval nailed it. Telling it to match on meaning and name the closest films keeps a search-style query useful.

Two Ollama options matter as much as the wording, and both are easy to miss. Send temperature: 0 so a grounded answer is deterministic and faithful instead of wandering from run to run (the sample client does this). And raise num_ctx: Ollama’s default context window is 2048 tokens, which silently truncates a multi-chunk prompt and quietly drops the very passages you retrieved, so the sample sets it to 8192 to make sure the model actually sees them.

So a single dotnet run -- movie-search "which film involves a treasure map and bandits in the desert?" retrieves the nearest plots, has llama3.2 ground its answer in them, and then lists the movies it drew on with their year, genre, and a snippet:

Retrieved 3 movies. Asking llama3.2 ...

The film that involves a treasure map and bandits in the desert is "The Good, the Bad,
the Weird" (2008). I used this movie because it specifically mentions a Japanese official
traveling by train with a treasure map that the bandit tries to acquire.

Movies:
  [1] The Good, the Bad, the Weird (2008)  [Adventure, World cinema, Costume Adventure]
      In the desert wilderness of 1930s Manchuria, The Bad, a bandit and hitman, is hired to
      acquire a treasure map from a Japanese official traveling by train...
  [2] Merupu Daadi (1984)  [Western, Action, Drama]
      Professor Varma discovers a hidden treasure belonging to the Ratnagiri Dynasty by
      researching ancient books and maps...
  [3] Carib Gold (1956)  [Drama]
      A shrimping boat skipper calls in to shore with news of a good catch...

The instruction to answer only from the supplied summaries is what keeps the model honest: it is grounded in your retrieved data instead of free-associating from its training. That grounding is the whole reason RAG exists.

In production, reach for a hosted model

Local Ollama is how I built and demoed this (zero setup, zero cost, nothing leaves the machine), and that makes it perfect for development and for the privacy-sensitive cases where data genuinely cannot go out. But do not read the demo’s model choices as a recommendation. For anything user-facing, a hosted API model is usually the better call, and it moves both of RAG’s accuracy dials:

Retrieval quality: the embedding model. This decides whether the right documents even reach the prompt. nomic-embed-text is fine for a demo; a hosted embedding model (OpenAI’s text-embedding-3-large, Voyage, Cohere) ranks subtler matches better and lifts recall on hard queries. Swapping it is a new IEmbeddingProvider implementation and a re-embed; nothing else changes. (Remember the dimension count changes with the model.)
Answer quality: the generation model. llama3.2 is a small local model; it paraphrases a single retrieved plot competently but stumbles on multi-document synthesis, nuance, and faithfully refusing when the answer is not in the context. A frontier hosted model, whether Claude (Opus 4.8 or Sonnet 4.6), a GPT-class model, or Gemini, is markedly sharper, with far fewer “confidently wrong” moments. That is a single HTTP-client swap; the retrieval half of the pipeline is untouched.

None of this disturbs the central pitch of the post: the vector store is still just Postgres. You are choosing where the model runs, which is orthogonal to whether you run a separate vector database (you do not). Marten and pgvector do not care which model produced the vectors or the prose, so start hosted, drop to local for dev or privacy, and switch back, all without touching your storage layer.

Things worth knowing before production

A few characteristics of this stack that are better learned here than in an incident:

Vector dimensions are forever. The vector(N) width is fixed at table creation. Switching embedding models almost always changes N, which means a new column and a full re-embed. Decide on a model before you ingest millions of rows.
VectorSearchAsync is raw SQL, not LINQ. It runs straight through the session’s connection and does not go through Marten’s LINQ provider or compiled-query cache. Documents still deserialise through the store’s serializer, so you get real Movie objects back, but you cannot tack on a .Where(...) and expect Marten to translate it.
Only simple member access is supported in the vector selector: x => x.Embedding, nothing fancier.
Multi-tenancy is respected. In a conjoined multi-tenant store the search automatically adds the tenant_id filter, so a tenant only ever sees its own vectors.
Embedding throughput is your real constraint, not Postgres. If ingestion feels slow, it is the model, not the database: batch aggressively, and consider a GPU or a smaller embedding model if you are embedding at scale.

Wrapping up

The headline is how little there is. One package, one UsePgVector() call, one IEmbeddingProvider implementation, and one VectorSearchAsync call gave us real semantic search over 42,000 documents, and the RAG layer on top was a single prompt and an HTTP stream. Going to production from here is incremental, not a rewrite: swap the generation step to a hosted frontier model for sharper answers, and add an HNSW index when the corpus grows. Every one of those changes leaves the storage layer alone: because it is all Postgres, it backs up, replicates, and operates exactly like the database you already run. No new datastore to learn, secure, and pay for.

The complete project (Docker compose, dataset loader, ingestion, search, and the RAG command) is on GitHub at mysticmind/marten-pgvector-rag. Clone it, point it at your own corpus, and you have a working semantic search in an afternoon.

In Part 2 we point exactly this machinery at something more useful than movie trivia: the Marten documentation itself, building a question-answering assistant that cites the doc pages it draws from. The new ingredient there is chunking, turning long markdown pages into retrievable pieces, which is the part most docs-RAG attempts get wrong.