RAG in the Real World: A Candid Q&A on Building Smarter AI Chatbots

We just wanted a chatbot. Then came RAG, routing logic, and lots of late-night debugging


This article began as a casual chat with a former colleague who currently works in a platform team - the folks who build shared tools and infrastructure used across all product teams in his company.

His company had recently rolled out an AI platform. The platform gave access to multiple LLMs and a built-in chatbot interface, along with enterprise features such as access control and common integrations with enterprise tools for document ingestion.

The company was encouraging all employees to use the AI platform and build tools to make themselves more productive. He had a simple goal:

I want to build an internal chatbot that can help my team - and devs on other teams - quickly find answers from internal docs, wikis, and changelogs when they want to integrate with the platform.

On a recent weekend, he reached out to me to brainstorm how he could do this. He had heard about RAG and had a rough idea that he needed to use his company's AI platform and upload documents specific to his platform. He had tried a few things, but the results were not what he expected. So we got talking, and what I have captured here is the conversation we had about RAG and how he should use it in his project.

Why even use RAG? Can't the LLM just answer things?

Language models like GPT are amazing, but they're trained on static datasets and often don't know your internal product details or recent changes. They hallucinate when they're unsure, confidently making up answers.

RAG fixes that by letting the model look things up in real-time. It retrieves relevant content (like docs, help articles, code), then feeds it to the LLM as context. So the model's answer is grounded in your actual data - not guesswork.
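
Here is a minimal sketch of that retrieve-then-generate loop. The retrieve and ask_llm callables are placeholders for whatever vector store and LLM client your platform exposes - the names are illustrative, not a real API:

```python
# A minimal sketch of the retrieve-then-generate loop. `retrieve` and `ask_llm`
# are placeholders for whatever vector store and LLM client your platform exposes.

def answer_with_rag(question, retrieve, ask_llm, top_k=4):
    """Retrieve relevant chunks, then ask the LLM with them as context."""
    chunks = retrieve(question, top_k=top_k)          # e.g. a vector-store similarity search
    context = "\n\n".join(c["text"] for c in chunks)  # stitch the chunks into one context block

    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)                            # grounded answer, not guesswork
```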

What if the product is super niche and the model has never seen it?

That's where RAG really earns its keep.

Even if the model has never "seen" the product during training, it can still generate accurate answers - as long as you give it the right context through retrieval.

Think of it like this: you're asking the model a question and handing it a cheat sheet at the same time. It reads the cheat sheet and gives you an answer based on that, not just its pre-trained knowledge.



Is it possible to give the model too much context?

Yes, just like people get overwhelmed with too many browser tabs open, LLMs can get confused if you dump in a wall of irrelevant or noisy text. You'll:

  • Waste precious token space
  • Dilute the important info
  • Increase the chances of the model hallucinating or rambling

The goal isn't to give it everything - it's to give it only what it needs. Think signal over noise. The best RAG setups retrieve and inject just a few high-quality chunks, not the entire knowledge base.
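
One way to enforce that is to cap both the number of chunks and the context size. The sketch below assumes your vector store returns (chunk, similarity score) pairs; the threshold and budget numbers are just illustrative starting points:

```python
# A hedged sketch of keeping only a few high-quality chunks. Assumes the vector
# store hands back (chunk_text, similarity_score) pairs; numbers are illustrative.

def select_context(scored_chunks, top_k=4, min_score=0.75, max_tokens=1500):
    """Keep the best-scoring chunks, drop weak matches, and respect a token budget."""
    kept, used_tokens = [], 0
    for chunk, score in sorted(scored_chunks, key=lambda x: x[1], reverse=True):
        if score < min_score or len(kept) >= top_k:
            break
        approx_tokens = len(chunk.split())  # rough proxy; swap in a real tokenizer if you have one
        if used_tokens + approx_tokens > max_tokens:
            break
        kept.append(chunk)
        used_tokens += approx_tokens
    return kept
```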

We have 3 products. Should we build one RAG system or one per product?

We debated this one for a while. The answer is pretty clear:

Go with one RAG system per product.

Why?

  • Cleaner separation of content
  • More accurate retrieval
  • Less chance of mixing up concepts from different products
  • Easier to debug and improve over time

The only time you might want to combine them is if your products are tightly integrated - or you've built a solid routing layer to figure out which product the question is about. Depending on the strategy you pick, the quality of answers from the RAG/LLM pipeline can swing wildly.



How do we decide which product's RAG system to use for a query?

This is where routing comes in. You've got a few options:

  • Keyword matching - Quick and dirty. Works if people mention product names, but brittle if they don't.
  • Embedding similarity - Use vector embeddings to compare the query to each product's description and see which one it's closest to.
  • LLM-based routing - Ask the LLM to decide which product the question is about. Slower, but smarter.

In practice, a hybrid approach works best (sketched in the code after this list):

  • Try keyword match first (fast)
  • Fallback to embeddings
  • If still unsure, ask the user: "Are you asking about Product A or B?"
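
Here is a rough sketch of that hybrid routing in Python. The product names, keyword lists, embed helper, and similarity threshold are all assumptions you would replace with your own:

```python
# A rough sketch of hybrid routing: keywords first, embeddings as fallback,
# and "ask the user" when neither is confident. All names here are illustrative.

PRODUCT_KEYWORDS = {
    "ProductA": ["product a", "ingestion api"],
    "ProductB": ["product b", "billing service"],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def route_query(query, embed, product_embeddings, threshold=0.6):
    """Return the product to route to, or None if we should ask the user."""
    q = query.lower()

    # 1. Cheap keyword match first
    for product, keywords in PRODUCT_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return product

    # 2. Fall back to embedding similarity against each product's description
    q_vec = embed(query)
    best_product, best_score = None, 0.0
    for product, p_vec in product_embeddings.items():
        score = cosine_similarity(q_vec, p_vec)
        if score > best_score:
            best_product, best_score = product, score

    # 3. Still unsure? Hand it back to the user.
    return best_product if best_score >= threshold else None
```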

Does one RAG system = one vector table?

Not necessarily, but it often helps to think that way. You can either:

  • Create a separate vector index/collection per product (cleaner)
  • Store all chunks in one table but tag them with metadata like "product": "ProductA" and filter at query time

Both approaches work. The first keeps things simple. The second is more flexible if you want to support cross-product search later.
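
Here is an illustrative sketch of the second option - one shared store, with a product tag on every chunk. The store object and its similarity_search/filter interface stand in for whatever vector database you are actually using:

```python
# Illustrative only: one shared collection, every chunk tagged with its product.
# `store` and its similarity_search(..., filter=...) interface stand in for
# whatever vector database you actually use.

chunk = {
    "text": "How to authenticate against the ingestion API...",
    "metadata": {"product": "ProductA"},  # the tag that keeps products separable
}

def search_product(store, query, product, top_k=4):
    """Query the shared index, but only return chunks for the given product."""
    return store.similarity_search(
        query,
        k=top_k,
        filter={"product": product},  # metadata filter applied at query time
    )
```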



How should I chunk my docs before embedding them?

Chunking is so underrated - get this wrong and everything else falls apart.

Here's what works:

  • Use semantic chunking - break docs by headers, sections, or bullet points instead of fixed lengths
  • Keep chunks between 100–300 tokens for best balance
  • Include headings or section titles in the chunk - it improves retrieval accuracy

Most modern RAG frameworks support smart chunking out of the box (LangChain, LlamaIndex, Haystack, etc.).
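
If you want to see what header-aware chunking amounts to without any framework, here is a minimal, illustrative splitter for Markdown-style docs (a real framework handles far more edge cases):

```python
# A minimal, framework-free sketch of header-aware chunking for Markdown docs.
# Real frameworks (LangChain, LlamaIndex, etc.) ship more robust splitters.

def chunk_by_headers(markdown_text, max_words=250):
    """Split on Markdown headings and keep the heading attached to each chunk."""
    chunks, current_header, buffer = [], "", []

    def flush():
        if buffer:
            # Prefix the section title so retrieval has extra signal
            chunks.append((current_header + "\n" + " ".join(buffer)).strip())
            buffer.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):          # a new section starts
            flush()
            current_header = line.strip()
        else:
            buffer.extend(line.split())
            if len(buffer) >= max_words:  # keep chunks roughly in the 100-300 token range
                flush()
    flush()
    return chunks
```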

Should I fine-tune the LLM or just improve my retrieval?

Unless you have a very specific use case (like legal summaries or structured formats), you probably don't need fine-tuning.

Fine-tuning:

  • Is expensive and time-consuming
  • Requires a lot of carefully labeled examples
  • Breaks easily when things change

Start by improving your retrieval pipeline:

  • Better chunking
  • Smarter ranking
  • Clearer metadata

You'll get most of the performance gains there - no GPU cluster required.
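
As one illustration of "smarter ranking", you can cheaply rerank the vector hits by how much they actually overlap with the query terms - a toy sketch, not a substitute for a proper reranker:

```python
# One cheap take on "smarter ranking": blend vector similarity with keyword
# overlap between query and chunk. Purely illustrative; weights are arbitrary.

def rerank(query, scored_chunks, keyword_weight=0.2):
    """Sort (chunk, vector_score) pairs by a blended score, highest first."""
    query_terms = set(query.lower().split())

    def blended(item):
        chunk, vec_score = item
        chunk_terms = set(chunk.lower().split())
        overlap = len(query_terms & chunk_terms) / max(len(query_terms), 1)
        return vec_score + keyword_weight * overlap

    return sorted(scored_chunks, key=blended, reverse=True)
```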



How do we test whether our multi-RAG setup is working?

Treat each product's RAG system like a mini product in itself. For each one:

  • Write 10–20 realistic user questions
  • See what chunks are being retrieved - are they relevant?
  • Compare model answers with and without context injected

Track things like:

  • How often retrieval hits the right content
  • Whether the model gives helpful answers
  • If it hallucinates less with RAG than without

You can use LangChain's evals, LlamaIndex evals, or just a simple spreadsheet + eyeballs.
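
Even the spreadsheet-and-eyeballs version can be a tiny script. The sketch below assumes each chunk carries a doc_id and that you already have retrieve and answer functions wired up - all the names and test cases are hypothetical:

```python
# A bare-bones eval harness in the spirit of "spreadsheet + eyeballs".
# Test cases, `retrieve`, and `answer` are placeholders for your own setup.

test_cases = [
    {"question": "How do I rotate the ingestion API key?", "expected_doc": "auth-guide"},
    {"question": "What triggers a billing webhook?", "expected_doc": "webhooks"},
]

def evaluate(test_cases, retrieve, answer):
    hits = 0
    for case in test_cases:
        chunks = retrieve(case["question"])
        retrieved_docs = {c["doc_id"] for c in chunks}
        hit = case["expected_doc"] in retrieved_docs   # did retrieval find the right doc?
        hits += hit
        print(f"{case['question'][:40]:40} retrieval_hit={hit}")
        print("  answer:", answer(case["question"], chunks)[:120])
    print(f"Retrieval hit rate: {hits}/{len(test_cases)}")
```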

How do we keep the RAG system updated over time?

Your docs and knowledge base aren't static - your RAG system shouldn't be either.

To keep things fresh:

  • Auto-ingest updates from product docs, release notes, changelogs, FAQs
  • Detect changes via hashes or timestamps
  • Re-chunk and re-embed changed content
  • Rebuild or refresh the vector index as needed

Think of RAG as a living knowledge layer, not a one-time dump.
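
The change-detection part can be as simple as hashing each document and comparing against what you ingested last time - a sketch, assuming you keep the seen hashes in whatever state store your ingestion job already uses:

```python
# A sketch of hash-based change detection, so you re-chunk and re-embed only
# what actually changed. Where `seen_hashes` lives is up to your ingestion job.

import hashlib

def changed_docs(docs, seen_hashes):
    """Yield (doc_id, text) for documents whose content hash is new or different."""
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            seen_hashes[doc_id] = digest   # remember the new version
            yield doc_id, text             # re-chunk and re-embed just this doc
```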



Final Takeaways

If you're building internal tooling with RAG:

  • Keep each product's knowledge base clean and scoped
  • Don't flood the model with irrelevant text - retrieval quality > quantity
  • Build a smart routing layer if you're dealing with multiple products
  • Focus on chunking, ranking, and context formatting before jumping to fine-tuning

It's not about making the LLM "smarter" - it's about feeding it the right stuff at the right time.


