Every ecommerce vendor is shipping AI features right now. Most of them are forgettable. A handful are genuinely useful. The difference between the two is rarely about the underlying model — it’s about how the AI is grounded in the merchant’s actual data. This post is about retrieval-augmented generation (RAG) for ecommerce, when it’s the right answer, and the design choices that decide whether you ship something that helps customers or something that hallucinates them into the support queue.
What RAG actually is, in plain terms
A large language model on its own knows a lot of things, but it doesn’t know your catalog, your shipping policy, or your return window. If a customer asks “do you ship to Canada in under 3 days?”, a vanilla LLM will guess — and getting that wrong costs you trust.
Retrieval-augmented generation fixes this by inserting a small retrieval step before the model answers. The flow:
- Take the customer’s question.
- Search a vector database of your real content (FAQ entries, product descriptions, policy pages) for the most relevant snippets.
- Pass those snippets to the LLM as context, with instructions to answer using only the context provided.
- The LLM composes the answer in natural language but grounds every claim in your actual content.
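Stripped of any particular vendor, the whole loop fits on one page. The sketch below fakes the embedding step with a bag-of-words vector so it runs standalone; a real system would call an embedding model in `embed` and send `prompt` to an LLM at the end — everything else is the same shape.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so this sketch runs standalone;
    # a real system calls an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    # Step 2: search your real content for the most relevant snippets.
    q = embed(question)
    return sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question: str, snippets: list[str]) -> str:
    # Step 3: hand the snippets to the LLM with grounding instructions.
    context = "\n\n".join(snippets)
    return (
        "Answer using ONLY the context below. If the context does not "
        f"contain the answer, say so.\n\nContext:\n{context}\n\n"
        f"Question: {question}"
    )

corpus = [
    "Shipping policy: we ship to Canada in 2-3 business days via express courier.",
    "Return window: items can be returned within 30 days of delivery.",
    "Gift cards never expire and can be used on any product.",
]
question = "Do you ship to Canada in under 3 days?"
snippets = retrieve(question, corpus)
prompt = build_prompt(question, snippets)  # step 4: send this to the LLM
```

Swap in a real embedding model and vector store and the structure doesn't change — only the quality of the retrieval does.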
It’s the difference between an AI that sounds smart and an AI that’s actually useful for your customers.
When RAG is the right answer
RAG shines for ecommerce in three specific patterns:
Customer support deflection. Most support tickets are answers to questions you’ve already documented somewhere. A RAG bot grounded in your help docs can deflect a meaningful percentage of inbound tickets while keeping the answers accurate.
Product discovery. “I need a gift for my sister, she’s into hiking, budget is around $80” is a real question customers actually ask. A RAG bot grounded in your product catalog can navigate this conversationally in a way that no faceted search filter quite manages.
Pre-purchase qualification. For high-AOV or B2B products, customers have specific questions that gate the buy decision (compatibility, lead time, custom options). A RAG bot can answer these from the relevant product spec sheets without escalating to sales.
When RAG is the wrong answer
RAG is overkill for a few patterns it nevertheless keeps getting reached for:
Simple search. If the customer just needs to find a known product, a good search bar is faster and cheaper than a chat bot. Don’t make people type a sentence to get to a product page.
Pure transaction flows. Adding to cart, checking out, applying a discount code — these are mechanical. Don’t put an LLM in the middle of them.
Anything requiring strong factual guarantees. Pricing in a regulated context. Medical or legal advice. Returns on a specific order. Use deterministic flows for these — RAG is good but not bulletproof, and the cost of being wrong is too high.
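One way to enforce this is a thin router in front of the bot: transactional and high-stakes intents match deterministic handlers first, and only the remainder falls through to RAG. The patterns and handler names below are illustrative (production routers often use a small classifier rather than regexes):

```python
import re

# Transactional / high-stakes intents go to deterministic flows,
# never through the LLM. Patterns here are illustrative.
ROUTES = [
    (re.compile(r"\b(track|where is) my order\b", re.I), "order_status_flow"),
    (re.compile(r"\b(return|refund) (my )?order\b", re.I), "returns_flow"),
    (re.compile(r"\bdiscount code\b", re.I), "promo_flow"),
]

def route(message: str) -> str:
    for pattern, handler in ROUTES:
        if pattern.search(message):
            return handler
    return "rag_bot"  # everything else falls through to retrieval
```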
The design choices that matter
Building a RAG system that ships is mostly about choices you make outside the model.
What you index. A common mistake is indexing the entire site, including blog posts and outdated content. The bot will then cite a 2019 blog post when it shouldn’t. Index curated content with explicit ownership and revision dates. Decide what’s authoritative, and only index that.
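In practice that means attaching ownership and review metadata to every candidate document and filtering before anything gets embedded. A sketch, with made-up fields and a fixed "today" so it runs deterministically:

```python
from datetime import date

# Candidate documents with ownership and review metadata.
# The field names are illustrative; the point is filtering BEFORE indexing.
docs = [
    {"url": "/help/shipping", "owner": "support",
     "reviewed": date(2025, 6, 1), "authoritative": True},
    {"url": "/blog/2019-holiday-tips", "owner": None,
     "reviewed": date(2019, 11, 2), "authoritative": False},
    {"url": "/help/returns", "owner": "support",
     "reviewed": date(2025, 5, 12), "authoritative": True},
]

def indexable(doc: dict, today: date, max_age_days: int = 365) -> bool:
    # Only owned, authoritative, recently reviewed content gets embedded.
    age = (today - doc["reviewed"]).days
    return doc["authoritative"] and doc["owner"] is not None and age <= max_age_days

corpus = [d["url"] for d in docs if indexable(d, date(2025, 7, 1))]
```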
How you chunk. A 10,000-word policy page indexed as one chunk is useless — the model will get back a giant blob and can’t pick out the relevant sentence. Chunk by section, with overlap. For product data, one chunk per product is usually right; for long docs, one per heading.
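A minimal version of that strategy: split on headings first so each chunk stays on one topic, then cut long sections into overlapping word windows so a sentence cut at one boundary still appears whole in the neighbouring chunk. The window sizes are arbitrary starting points, not tuned values:

```python
def split_sections(doc: str) -> list[str]:
    # One section per markdown-style heading.
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def chunk(text: str, size: int = 120, overlap: int = 20) -> list[str]:
    # Overlapping word windows; `size` and `overlap` are illustrative.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

doc = "# Shipping\nWe ship worldwide.\n# Returns\n30 day window."
sections = split_sections(doc)
parts = chunk(" ".join(str(i) for i in range(300)))  # a 300-word "section"
```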
How you cite. Every answer should include the source it drew from. This isn’t just for trust — it’s for the merchant team to see what content the bot is leaning on, so they can tune the corpus over time.
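Structurally, that just means the answer object carries its sources and something tallies them over time. A sketch, assuming a made-up `(snippet, source_url)` shape for retrieval results:

```python
from collections import Counter

source_usage = Counter()  # running tally: what content is the bot leaning on?

def answer_with_sources(answer_text: str, retrieved: list[tuple[str, str]]) -> dict:
    # `retrieved` holds (snippet, source_url) pairs from the vector
    # store; the shape is illustrative, not any particular library's.
    sources = sorted({url for _, url in retrieved})
    source_usage.update(sources)  # merchant team reviews this to tune the corpus
    return {"answer": answer_text, "sources": sources}

reply = answer_with_sources(
    "We ship to Canada in 2-3 business days.",
    [("...", "/help/shipping"), ("...", "/help/shipping")],
)
```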
How you handle “I don’t know.” The bot must be willing to say “I don’t have information about that” instead of guessing. This is a system-prompt discipline question more than a model question. A bot that confidently makes things up is worse than no bot at all.
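Two layers enforce this: a system prompt that forbids guessing, and a retrieval-score gate that refuses before the model is ever called when nothing relevant was found. The threshold and prompt wording below are illustrative, not tuned values:

```python
SYSTEM_PROMPT = (
    "You are a support assistant for an online store. Answer using ONLY "
    "the provided context. If the context does not contain the answer, "
    'reply exactly: "I don\'t have information about that." Never guess.'
)
FALLBACK = "I don't have information about that."

def guarded_answer(question, hits, call_llm, min_score=0.75):
    # `hits` holds (score, snippet) pairs from retrieval, best first.
    # If retrieval found nothing relevant, refuse without calling the
    # model at all; the system prompt covers the cases that get through.
    if not hits or hits[0][0] < min_score:
        return FALLBACK
    context = "\n\n".join(snippet for _, snippet in hits)
    return call_llm(SYSTEM_PROMPT, context, question)
```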
How you escalate. Every RAG bot needs a human escalation path. Some customer questions deserve a human, and the bot should know which. We typically wire a “talk to a human” button into every conversation, plus an automatic escalation when the bot fails to answer twice in a row.
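The escalation rule is simple enough to sketch as state on the conversation. The two-miss threshold matches the policy described above, but it's a knob to calibrate, not a law:

```python
class Conversation:
    MAX_MISSES = 2  # escalate after two consecutive unanswered questions

    def __init__(self):
        self.misses = 0
        self.escalated = False

    def record(self, bot_answered: bool, human_requested: bool = False):
        # A "talk to a human" click escalates immediately; otherwise
        # escalate when the bot fails to answer twice in a row.
        if human_requested:
            self.escalated = True
            return
        self.misses = 0 if bot_answered else self.misses + 1
        if self.misses >= self.MAX_MISSES:
            self.escalated = True
```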
The infrastructure choices
For most ecommerce RAG builds, the stack is small:
- Embedding model — OpenAI’s text-embedding-3-small or Voyage AI’s models are fine for English, well-priced, and don’t need fine-tuning.
- Vector store — for small corpora (under 100K chunks), pgvector on whatever Postgres you already have is enough. For larger, Pinecone or similar.
- LLM — Claude or GPT-4-class for the final answer. The smaller models are tempting but the answer quality drop is noticeable.
- Hosting — the bot itself can run on whatever your storefront runs on. We deploy as lightweight scripts embedded into Shopify and WordPress storefronts.
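For the pgvector path, the whole store is one table and one query. The table and column names below are illustrative; `vector(1536)` matches text-embedding-3-small’s output dimension, and `<=>` is pgvector’s cosine-distance operator:

```python
# Schema and query for a pgvector-backed chunk store, held as strings
# to show the SQL; run them against Postgres with your driver of choice.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,   -- URL the chunk came from, for citations
    content   text NOT NULL,
    embedding vector(1536)     -- dimension of text-embedding-3-small
);
"""

# <=> is cosine distance in pgvector; the parameter is the embedded question.
TOP_K_QUERY = """
SELECT source, content
FROM chunks
ORDER BY embedding <=> %s
LIMIT 5;
"""
```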
None of this is exotic. The defining work is the corpus curation, the chunking strategy, and the prompt design — not the infrastructure.
What we measure post-launch
A RAG bot ships in a draft state. It becomes useful through measurement:
- Conversation count, daily.
- Resolution rate (did the customer take a positive next action — view a product, add to cart, close the chat without escalating).
- Escalation rate (where did the bot punt to a human, and why).
- Most-frequent questions the corpus didn’t answer well — these become the next batch of content to add.
- Most-frequent products surfaced — both to validate the bot is recommending the right things, and to catch over-recommendation patterns.
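All of those roll up from per-conversation logs. A sketch of the rollup, assuming a made-up log shape:

```python
from collections import Counter

def summarize(conversations: list[dict]) -> dict:
    # Each log entry: {"resolved": bool, "escalated": bool,
    # "unanswered": [questions the corpus missed]}. Shape is illustrative.
    n = len(conversations)
    resolved = sum(c["resolved"] for c in conversations)
    escalated = sum(c["escalated"] for c in conversations)
    misses = Counter(q for c in conversations for q in c["unanswered"])
    return {
        "conversations": n,
        "resolution_rate": resolved / n,
        "escalation_rate": escalated / n,
        "top_gaps": misses.most_common(3),  # next batch of content to write
    }

logs = [
    {"resolved": True, "escalated": False, "unanswered": []},
    {"resolved": False, "escalated": True, "unanswered": ["do you gift wrap?"]},
    {"resolved": True, "escalated": False, "unanswered": ["do you gift wrap?"]},
    {"resolved": False, "escalated": False, "unanswered": []},
]
report = summarize(logs)
```

The `top_gaps` list is the actionable output: each entry is a question the corpus failed on and how often, i.e. the next batch of content to add.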
The first month after launch is mostly tuning. The corpus needs adjustments. The prompts need tightening. The escalation thresholds need calibration. Plan for this — a RAG bot that’s “shipped” but not iterated on for three months is degrading in quality, not improving.
What good looks like
A well-built ecommerce RAG bot:
- Cites its sources every time.
- Says “I don’t know” cleanly when it doesn’t.
- Hands off to humans without friction.
- Improves measurably month over month as the corpus gets tuned.
- Quietly deflects a meaningful percentage of inbound questions while never producing a wrong answer that costs the merchant a customer.
That’s the bar. Anything below it is novelty. Above it is genuine operational leverage.
Thinking about an AI assistant for your store? See our AI services or get in touch.