How Sant Chat AI reads a WordPress site. The RAG architecture behind the plugin, priced at under a tenth of a cent per page ingested.
19 April 2026 · 11 min read
Ingesting a 50 page WordPress site end to end, summarised, chunked, and embedded ready for retrieval, costs Sant Chat roughly three US cents. A 200 page site, which is the hard cap in the current ingest pipeline, costs roughly eleven US cents. A single chat turn, once the site is ingested, costs a fraction of a cent. These are real numbers from the production pipeline, not estimates.
The numbers are worth stating up front because they shape the architecture. Sant Chat was designed around the assumption that a WordPress site worth reading is a WordPress site worth reading in full, and reading in full has to be cheap enough that cost becomes irrelevant to the pricing conversation. The architecture that gets there is not the standard RAG recipe. The difference is the order of operations.
Most RAG pipelines chunk raw content, embed the chunks, and retrieve at query time. Sant Chat summarises each page first, then chunks the summary, then embeds. The summarise step runs on gpt-4o-mini. The embed step runs on text-embedding-3-small. The store is Supabase pgvector. Retrieval is a Postgres RPC call using cosine distance. This post walks through the decisions behind that pipeline, what they cost, and where they pay back.
Why a sitemap is the right entry point for WordPress
The first question any RAG product has to answer is what to read. Sant Chat reads the sitemap.
Sitemaps are canonical. A reasonably configured WordPress site publishes one, usually at /sitemap.xml, and it reflects what the site owner wants discoverable. Scraping the whole site is the alternative and it is worse in every way. Scraping is aggressive, brittle against theme changes, and it picks up pages the site owner never intended to expose, which is the opposite of what a customer facing chatbot should do.
Using the sitemap is also a respect signal. When a WordPress site marks a page as noindex or excludes it from the sitemap, the site owner has made a decision about visibility. A chatbot that reads the sitemap inherits that decision. A scraper ignores it.
The practical implementation is straightforward. The ingest pipeline accepts a sitemap URL, walks the URL list, and fetches each page. Sant Chat caps the ingest at 200 pages per sync to keep latency and cost predictable. For sites larger than 200 pages, the priority is set by sitemap order, which is typically most recent first.
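The walk described above can be sketched in a few lines. This is illustrative rather than the production ingest code: the function name and the regex based parsing are assumptions, and a real implementation would also handle sitemap index files that point at child sitemaps.

```typescript
// Sketch of the sitemap entry point (illustrative, not the production code).
// Extracts <loc> entries in document order and applies the 200 page cap.
const MAX_PAGES = 200;

function extractSitemapUrls(sitemapXml: string, cap: number = MAX_PAGES): string[] {
  const urls: string[] = [];
  // <loc> values are plain URLs; tolerate whitespace padding inside the tag.
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(sitemapXml)) !== null && urls.length < cap) {
    urls.push(match[1]);
  }
  return urls;
}
```

Because the cap is applied while walking the list in document order, sites over 200 pages are truncated exactly as the post describes: sitemap order decides priority.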
Summarise before you embed, not after
This is the architectural choice that defines the pipeline.
A WordPress page as rendered includes a lot that is not useful answer material. Navigation bars. Footer links. Related post widgets. Call to action blocks repeated on every page. Template chrome. If you embed the raw content, you embed the chrome alongside the signal, and retrieval has to work through noise to find the actual answer.
The conventional solutions are chunking with small overlaps, careful prompt engineering at query time, and a reranking step that reorders retrieved chunks before generation. Sant Chat takes a different route. Each page is sent through a summarisation pass on gpt-4o-mini before anything is embedded. The summary is a structured fact sheet that captures what the page is about, what the reader would ask about it, and what the useful answers are. The fact sheet is what gets chunked and embedded.
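The shape of that summarisation pass can be sketched as a request builder. The prompt wording, the helper name, and the token budget here are assumptions for illustration, not the production prompt; only the model choice and the fact sheet structure come from the post.

```typescript
// Illustrative sketch of the summarise pass request. The prompt text and
// max_tokens value are assumptions, not the production configuration.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  max_tokens: number;
}

function buildFactSheetRequest(pageUrl: string, pageText: string): ChatRequest {
  return {
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "Summarise the page into a compact fact sheet: what the page is about, " +
          "what a visitor would ask about it, and the useful answers. Plain text.",
      },
      { role: "user", content: `URL: ${pageUrl}\n\n${pageText}` },
    ],
    max_tokens: 400, // fact sheets compress hard; ~300 output tokens is typical
  };
}
```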
The cost is where this gets interesting.
For a typical 2,000 word WordPress page, the summarisation pass consumes around 2,500 input tokens and around 300 output tokens on gpt-4o-mini. At the current OpenAI pricing of fifteen cents per million input tokens and sixty cents per million output tokens, that works out to roughly 0.0555 US cents per page. The embedding step on the same page, which happens next, costs another 0.0005 US cents. The total per page ingest cost is under 0.06 US cents.
Extrapolating to a full site: a 50 page WordPress site costs roughly three US cents to fully ingest. The 200 page hard cap costs roughly eleven US cents. A full re ingest of a 200 page site costs less than most other operations Sant Chat performs on its own infrastructure.
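The arithmetic is simple enough to check directly. This back of the envelope sketch uses the token counts and prices quoted above; the embed token count of 250 is an estimate from the fact sheet size, not a measured figure.

```typescript
// Per page ingest cost from the figures quoted in the post.
// Prices are in USD per million tokens.
const PRICE_IN = 0.15;    // gpt-4o-mini input
const PRICE_OUT = 0.60;   // gpt-4o-mini output
const PRICE_EMBED = 0.02; // text-embedding-3-small

function ingestCostUsd(inputTokens: number, outputTokens: number, embedTokens: number): number {
  return (inputTokens * PRICE_IN + outputTokens * PRICE_OUT + embedTokens * PRICE_EMBED) / 1_000_000;
}

// Typical 2,000 word page: ~2,500 summarise input tokens, ~300 output
// tokens, and a fact sheet of roughly 250 tokens to embed (an estimate).
const perPage = ingestCostUsd(2500, 300, 250);
// perPage is about $0.00056, i.e. under 0.06 US cents;
// 200 pages is about $0.112, roughly eleven US cents.
```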
The summarise step compresses hard. The fixture page used to generate these numbers, a 1,922 word agency services page, produced a 962 character fact sheet. That fact sheet is the thing that gets chunked and embedded. For most WordPress pages, the downstream effect is that each page ends up as a single embedded chunk, not many. That sounds like a limitation and it is actually a feature. Retrieval at query time is testing the summary against the query, not a fragment of the raw content. The signal to noise ratio is higher by design.
Chunk the summary, not the raw page
The chunking function sits downstream of the summarisation pass, not upstream.
The function is called semanticChunk. It takes a text blob, a maximum chunk size of 1,000 characters, a minimum chunk size of 100 characters, and it splits first on paragraph boundaries, falling back to sentence boundaries only when a single paragraph exceeds the budget. It produces no overlap between chunks.
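A paragraph first chunker with that behaviour can be sketched as follows. This is an illustration of the described contract, not the production semanticChunk; the sentence regex and the handling of a trailing fragment below the minimum size are assumptions.

```typescript
// Sketch of a paragraph first chunker: 1,000 character budget, 100 character
// minimum, sentence fallback only when a single paragraph exceeds the budget,
// no overlap between chunks. Illustrative, not the production semanticChunk.
function semanticChunk(text: string, maxSize = 1000, minSize = 100): string[] {
  const paragraphs = text.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);

  // A paragraph is the unit unless it alone exceeds the budget, in which
  // case it is broken into sentences.
  const units: string[] = [];
  for (const p of paragraphs) {
    if (p.length <= maxSize) {
      units.push(p);
    } else {
      const sentences = p.match(/[^.!?]+[.!?]+["')\]]*\s*|[^.!?]+$/g) ?? [p];
      units.push(...sentences.map((s) => s.trim()).filter(Boolean));
    }
  }

  // Greedy packing with no overlap.
  const chunks: string[] = [];
  let current = "";
  for (const unit of units) {
    if (current && current.length + unit.length + 1 > maxSize) {
      chunks.push(current);
      current = unit;
    } else {
      current = current ? current + " " + unit : unit;
    }
  }
  if (current) chunks.push(current);

  // Fold a trailing fragment below the minimum into the previous chunk.
  if (chunks.length > 1 && chunks[chunks.length - 1].length < minSize) {
    const tail = chunks.pop() as string;
    chunks[chunks.length - 1] += " " + tail;
  }
  return chunks;
}
```

For a typical sub 1,000 character fact sheet, this returns a single chunk, which matches the one embedding per page behaviour the post describes.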
No overlap is a deliberate choice. Overlap exists in standard RAG pipelines because raw content chunks lose context at the boundaries, and overlap patches the seams. Summarised content does not have that problem. The summary is already the coherent unit. Chunking it is a pagination step, not a context preservation step.
Because the typical summary fits in a single chunk, most pages produce one embedding each. The rare case is a long form article where the fact sheet exceeds 1,000 characters and gets split into two or three chunks. That is fine. The chunking function handles it. Retrieval treats all chunks from a page as candidates, and the generation step can pull more than one chunk per page when a query calls for it.
The chunk size budget is measured in characters, not tokens. A thousand characters is roughly 250 tokens at English prose averages. That size was chosen to sit well below the retrieval context limits while being large enough to carry a complete idea. Boundaries in tokens would be more precise, but the small amount of slack that characters produce has not caused issues in production.
The vector store is the database
Sant Chat does not use a separate vector database. Embeddings live in the same Postgres instance as the rest of the application data, stored in a vector(1536) column in the site_documents table. The indexing strategy is ivfflat with vector_cosine_ops. Retrieval is a Postgres RPC called match_site_documents, which takes a query embedding and returns the top matching chunks ranked by cosine distance.
This is Supabase pgvector in a fairly standard configuration. The choice to use it rather than Pinecone, Qdrant, or a dedicated vector service is operational. Running one Postgres instance for everything is simpler than running one Postgres and one vector service. Joins across the customer data and the embeddings happen in the same engine. Backups are one system, not two. The operational surface area stays small, which matters when the product is maintained by a small team.
The embedding dimensions are 1536, which is the default for text-embedding-3-small. OpenAI supports dimension reduction on this model, but Sant Chat does not use it. The storage savings at customer data volumes are negligible. The schema is explicit about 1536. Reducing the dimensions later is a migration that can be done if scale ever justifies the work. It does not today.
Retrieval is a single RPC and a cosine distance
When a visitor sends a message, the message is embedded with the same model used for ingest. The embedding is passed to match_site_documents as an RPC call. The RPC returns a ranked list of chunks. The top matches, along with the business context the site owner has configured, are passed to gpt-4o-mini as the context for the generation step.
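Conceptually, the RPC is doing nothing more exotic than a cosine distance sort. The sketch below shows the same ranking in plain TypeScript; in production this runs inside Postgres via pgvector, and the row shape and function signature here are assumptions for illustration.

```typescript
// What match_site_documents does conceptually: rank stored chunk embeddings
// by cosine distance to the query embedding and return the top matches.
// In production this runs inside Postgres via pgvector's cosine operator.
interface ChunkRow {
  id: number;
  content: string;
  embedding: number[];
}

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function matchSiteDocuments(query: number[], rows: ChunkRow[], matchCount = 5): ChunkRow[] {
  return [...rows]
    .sort((x, y) => cosineDistance(query, x.embedding) - cosineDistance(query, y.embedding))
    .slice(0, matchCount);
}
```

The ivfflat index makes the production version approximate rather than exhaustive, but the ranking criterion is the same.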
Real numbers from production, across the 39 chat turns logged so far, show an average input of 691 tokens and an average output of 28 tokens per chat turn. The output is short because the model is instructed to answer the question, not to extrapolate. The input is lean because the RAG pipeline delivers compressed context. At current gpt-4o-mini pricing, the per turn cost is in the neighbourhood of 0.01 US cents. Per thousand chat turns, that is roughly ten US cents.
Voice chat turns, which run a separate path that includes Whisper for speech to text and a text to speech model for responses, average 302 input tokens and 13 output tokens across the 12 logged voice turns to date. The shorter figures reflect that voice questions are typically shorter than typed questions. The full voice path carries additional cost for the audio pipeline itself, which is a separate story and one this post leaves alone.
Hash based change detection for incremental sync
Full re ingest of a 200 page site is cheap, but doing it on every sync cycle is wasteful. Sant Chat uses hash based change detection to decide which pages to re ingest.
Each page fetched during a sync has its content hashed. The hash is compared against the hash stored from the last sync. If the hash is the same, the page is unchanged and the existing summary and embeddings are kept. If the hash has changed, the page goes through the full pipeline again.
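The comparison step can be sketched as follows. The hash algorithm and the storage shape are assumptions for illustration; the post only specifies that a per page content hash is compared against the previous sync.

```typescript
import { createHash } from "node:crypto";

// Sketch of hash based change detection (hash algorithm and storage shape
// are assumptions). Unchanged pages skip the summarise and embed steps.
function contentHash(pageContent: string): string {
  return createHash("sha256").update(pageContent, "utf8").digest("hex");
}

function pagesToReingest(
  fetched: Map<string, string>, // url -> freshly fetched page content
  stored: Map<string, string>,  // url -> content hash from the last sync
): string[] {
  const changed: string[] = [];
  for (const [url, content] of fetched) {
    // New pages have no stored hash and fall through to a full ingest.
    if (stored.get(url) !== contentHash(content)) changed.push(url);
  }
  return changed;
}
```

A force sync is then just the degenerate case where the stored hashes are discarded and every page falls through to the full pipeline.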
Schedules run hourly, daily, weekly, or monthly, configured per site. For most customers the weekly cadence is the right default. A marketing site does not change hourly. A news site often does. The schedule is a configuration choice rather than a product opinion, and the site owner is trusted to know their own publishing rhythm.
Force sync is a first class option, not a workaround. Sometimes a structural change to the summarisation prompt or the chunking logic means the stored embeddings are no longer comparable to a fresh query. Force sync rebuilds from scratch. The cost is knowable because the per page numbers are knowable.
Sant ships WordPress plugins, websites, and applications through the Launch phase of the Sant framework. The RAG pipeline inside Sant Chat is one expression of how Sant thinks about AI product engineering. The interesting decision is usually the one upstream of where the industry is looking. Most RAG discourse in 2026 focuses on the retrieval step and the generation step. The summarise then embed choice happens earlier, and it is the one that determines whether the downstream steps work well at all.
For agencies commissioning AI products, or founders shipping an AI feature inside a larger product, the lesson generalises. The default architecture is the first thing to question. The second thing to question is whether the cheapest possible ingest is cheap enough to ingest in full. In Sant Chat's case, the answer is yes, and that answer reshapes what is possible downstream.
The companion piece to this post is the open source compliance toolkit Sant shipped alongside Sant Chat. The architecture described above and the compliance workflow are two halves of the same shipping story. The architecture decisions made this pipeline cheap to run. The compliance workflow made the plugin reviewable by WordPress.org and safe to install. Those same decisions matter more in regulated contexts, which deserves its own treatment.
Frequently asked questions
What happens if a WordPress page is very long?
The summarisation step is not length bounded in the way chunking is. A 10,000 word page produces a longer fact sheet than a 2,000 word page, but the fact sheet is still compressed. The chunking step then splits the fact sheet as needed. Most pages, regardless of source length, end up as one or two chunks post summarisation. Very long pages might produce three or four.
Why summarise before embedding? Does the step not lose information?
It compresses information, which is the point. The question is whether compression costs retrieval quality. In production the answer has been no, because WordPress pages carry a lot of non answer material and compression removes that noise before it reaches the embedding space. Retrieval against a summary is retrieval against the useful part of the page. Retrieval against raw chunks is retrieval against the useful part and the noise, with the noise sometimes winning.
Why Supabase pgvector instead of a dedicated vector database?
Operational simplicity. Customer data and embeddings live in the same Postgres instance, backups are one system, and joins across tables work without cross service coordination. Dedicated vector services are the right call at scales Sant Chat is not operating at today. If scale changes that, migrating is possible, and the abstraction layer sits behind one RPC.
Does Sant Chat support sites without a sitemap?
A reasonably configured WordPress site publishes a sitemap. Yoast, Rank Math, All in One SEO, and the core WordPress XML sitemap all generate one. Sites without a sitemap are rare and almost always indicate a larger problem the site owner should fix independently of Sant Chat. The pipeline can accept a manual URL list as a fallback, but the sitemap is the default.
What happens when the site changes?
The sync schedule picks up changes on its next run. Hash based change detection means only changed pages run through the pipeline again. Force sync is available for structural changes that require a full rebuild. Site owners who publish frequently can move to an hourly schedule. Site owners with stable content can sit on weekly or monthly.
Most RAG pipelines spend their architectural attention on the retrieval and generation steps. Sant Chat moves the interesting decision earlier, to the moment where page content becomes embedding material. The payoff is a pipeline that reads a full WordPress site for pennies, returns answers grounded in the published content rather than approximate matches to fragments of it, and stays operationally simple enough for a small team to maintain.
The same decisions that made this architecture workable also made the plugin safe to install and compliant with WordPress.org's review standards. That is a different post. If the full plugin is the ask rather than the architecture, the Sant Launch conversation is the shorter path. If the interest is the toolkit behind the submission, the open source compliance checker is the practical entry point.