Introduction
Over the last year, “AI agents” went from a buzzword to something people actually try to ship. And if you’ve tried to build one beyond a demo, you’ve probably hit the same walls I did:
- The model sounds confident and is still wrong
- It knows nothing about your data
- You end up duct-taping APIs together with prompts
- Costs and latency quietly get out of hand
This post is a write-up of how I dealt with those problems in a real project. I’ll walk through how I built an AI agent that:
- Grounds answers in real documents using RAG
- Talks to external systems in a structured way using MCP
- Runs entirely locally with Ollama
This isn’t a theoretical overview or a framework comparison. It’s the architecture, trade-offs, and lessons learned from something I actually run.
Why “just an LLM” isn’t enough
A plain LLM is impressive, but it hits three hard limits very quickly:
- Static knowledge: it only knows what it was trained on
- No source of truth: it can't verify its own answers
- No real integration: APIs and databases are glued on ad hoc
When people say “agent,” what they usually mean (whether they realize it or not) is an LLM plus:
- Memory
- Tools
- Access to external, up-to-date data
That’s where RAG, MCP, and Ollama fit in.
Core technologies (in plain terms)
RAG (Retrieval-Augmented Generation)
RAG is often explained in abstract diagrams. Here’s the practical version:
Before asking the model to answer, you look up relevant information and then force the model to use it.
In concrete terms:
- Split documents into chunks
- Convert each chunk into an embedding
- Store those embeddings (in memory or a vector store)
- For every question:
  - Embed the question
  - Find the most similar chunks
  - Inject them into the prompt
The main benefit isn’t “smarter answers.” It’s bound answers. The model is constrained by the context you give it.
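Here's a minimal sketch of that pipeline in TypeScript, assuming a local Ollama server and its embeddings endpoint; the helper names and the in-memory store are illustrative, not the exact code from the project:

```ts
// Minimal ingestion sketch: chunks in, embeddings kept in memory.
// Assumes Ollama is running on its default port; names are illustrative.
type Chunk = { text: string; embedding: number[] };

const store: Chunk[] = [];

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = (await res.json()) as { embedding: number[] };
  return embedding;
}

// Split → embed in parallel → keep in memory.
async function indexChunks(chunks: string[]): Promise<void> {
  const embeddings = await Promise.all(chunks.map(embed));
  chunks.forEach((text, i) => store.push({ text, embedding: embeddings[i] }));
}
```

Query time is the same idea in reverse: embed the question, score it against the store, and hand the winners to the prompt.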
When RAG shines
- Internal documentation
- Product catalogs
- Wikis and knowledge bases
- Anything that changes frequently
When RAG struggles
- Very small datasets (the overhead isn’t worth it)
- Poor chunking strategies
- Vague or underspecified questions
Chunking and retrieval quality matter more than model choice. By a lot.
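If you're starting from scratch, a naive fixed-size splitter with overlap is a reasonable baseline; the numbers below are placeholders to tune against your own documents, not recommendations:

```ts
// Naive fixed-size chunking with overlap. Tune size/overlap per corpus;
// these defaults are only a starting point.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```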
MCP (Model Context Protocol)
MCP is the least talked-about piece, but it’s what makes the agent feel alive.
Think of MCP as a contract between your agent and the outside world. Instead of writing prompts like:
“If the user asks about products, call this API…”
You expose structured endpoints that the agent can consume reliably.
In my setup, MCP handles:
- Fetching data from services
- Returning normalized JSON
- Acting as a stable boundary between AI and business logic
That boundary matters. Prompts shouldn’t know how your database works.
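The exact wiring depends on which MCP SDK you use; the sketch below is a hand-rolled illustration of the contract idea rather than the official SDK: typed data in, normalized JSON out, explicit errors.

```ts
// Illustration of the contract idea only (not the official MCP SDK);
// the Product shape and backing service URL are hypothetical.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

interface Product {
  id: string;
  name: string;
  price: number;
}

// The agent only ever sees this shape; how the data is fetched stays hidden.
async function getProducts(): Promise<ToolResult<Product[]>> {
  try {
    const res = await fetch("http://localhost:4000/products"); // hypothetical service
    if (!res.ok) return { ok: false, error: `upstream returned ${res.status}` };
    return { ok: true, data: (await res.json()) as Product[] };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}
```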
Ollama (local LLM runtime)
I deliberately chose Ollama for three reasons:
- Privacy: data never leaves my machine
- Predictable cost: zero API usage
- Fast iteration: swap models in seconds
I use Ollama for:
- Text generation (llama3.2)
- Embeddings (nomic-embed-text)
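For reference, the generation side is just a small fetch against Ollama's HTTP API. A non-streaming sketch with error handling omitted (the embeddings call mirrors the embed() helper shown earlier):

```ts
// Text generation with llama3.2 via Ollama's HTTP API (non-streaming).
async function generate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.2", prompt, stream: false }),
  });
  const { response } = (await res.json()) as { response: string };
  return response;
}
```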
Would a hosted model perform better? Sometimes. But for internal tools and knowledge agents, local inference is more than good enough.
High-level architecture
This is the mental model I used while building the system:
Client → API → Agent
├─ RAG (documents)
├─ MCP (external data)
└─ Ollama (reasoning)

The agent itself is intentionally thin. It orchestrates components, but it doesn't "own" knowledge.
That decision paid off later when I needed to change retrieval logic without touching the API layer.
Project structure and setup
The project is written in TypeScript, using Hono for the HTTP layer. Nothing exotic, just fast startup and a clean API surface.
Configuration lives in one place so models, ports, chunk sizes, and limits can be tweaked without digging through the codebase.
This sounds boring, but once you start tuning retrieval and context windows, you’ll be glad it’s centralized.
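Something as small as this goes a long way; the fields and values here are examples of what's worth centralizing, not the project's actual config:

```ts
// Centralized config sketch: models, ports, and retrieval knobs in one place.
// Field names and defaults are illustrative.
export const config = {
  ollamaUrl: "http://localhost:11434",
  chatModel: "llama3.2",
  embeddingModel: "nomic-embed-text",
  port: 3000,
  chunkSize: 500,
  chunkOverlap: 50,
  topK: 3,
};
```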
Implementing RAG (what actually matters)
Most RAG tutorials jump straight to vector databases. I didn’t.
For an initial version, in-memory embeddings are:
- Easier to reason about
- Easier to debug
- Surprisingly effective
The RAG service focuses on three things:
- Generating embeddings in parallel
- Scoring similarity with cosine distance
- Returning only the most relevant chunks
The main takeaway: retrieval quality matters more than generation quality.
A smaller model with good context beats a larger model’s guessing.
Semantic search: the quiet workhorse
Cosine similarity is boring, and that’s a good thing.
At query time:
- The question becomes a vector
- Every document becomes a score
- The highest scores win
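The scoring itself fits in a few lines. A self-contained sketch (the Chunk type mirrors the ingestion sketch above), including score logging around retrieval:

```ts
// Cosine similarity plus top-k ranking over the in-memory chunks.
type Chunk = { text: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(query: number[], chunks: Chunk[], k = 3): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk, score }) => {
      // Log which chunks were selected and how strongly they matched.
      console.log(`score=${score.toFixed(3)} chunk="${chunk.text.slice(0, 60)}..."`);
      return chunk;
    });
}
```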
I added logging around retrieval early on. Seeing which chunks were selected helped me fix bad chunking faster than any benchmark.
If your agent feels dumb, inspect the retrieved context first.
MCP: keeping the agent honest
Instead of letting the model invent API calls, MCP enforces structure.
From the agent’s point of view:
- “Products” are just JSON
- “Customers” are just JSON
- Errors are explicit, not hallucinated
From the system’s point of view:
- Business logic stays out of prompts
- APIs remain testable
- AI failures don’t corrupt the state
This separation alone reduced prompt complexity more than anything else I tried.
The agent loop (what ties everything together)
Every request follows the same flow:
- Receive user input
- Retrieve relevant documents (RAG)
- Fetch external data if needed (MCP)
- Assemble a bounded context
- Ask the model to respond
- Store conversation state
The most important design choice: context is assembled deterministically.
The model never decides what to include.
That single rule eliminated an entire class of hallucinations.
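Sketched in TypeScript, the loop looks roughly like this; the helpers are placeholders for the RAG, MCP, storage, and Ollama pieces described above, declared here only so the sketch type-checks:

```ts
// Placeholder declarations for the components this loop orchestrates.
declare function retrieve(question: string): Promise<string[]>;            // RAG
declare function fetchExternalData(question: string): Promise<unknown[]>;  // MCP
declare function buildPrompt(q: string, docs: string[], data: unknown[]): string;
declare function generate(prompt: string): Promise<string>;                // Ollama
declare function saveTurn(sessionId: string, q: string, a: string): Promise<void>;

async function handleMessage(sessionId: string, message: string): Promise<string> {
  const docs = await retrieve(message);                 // 1. relevant chunks
  const external = await fetchExternalData(message);    // 2. structured external data
  const prompt = buildPrompt(message, docs, external);  // 3. deterministic context assembly
  const answer = await generate(prompt);                // 4. model responds
  await saveTurn(sessionId, message, answer);           // 5. store conversation state
  return answer;
}
```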
API layer and sessions
The HTTP API is intentionally boring:
- Load data
- Chat
- Inspect history
- Inspect RAG state
Sessions are tracked explicitly by ID, not by “magic memory.” This makes debugging easier and avoids ghost-context bugs.
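With Hono, that boils down to routes that take the session ID explicitly. A trimmed-down sketch (route shapes and field names are illustrative):

```ts
import { Hono } from "hono";

// Explicit sessions: the client supplies a session ID; nothing is implicit.
type Turn = { role: "user" | "assistant"; content: string };

const app = new Hono();
const sessions = new Map<string, Turn[]>();

app.post("/chat", async (c) => {
  const { sessionId, message } = await c.req.json();
  const history = sessions.get(sessionId) ?? [];
  history.push({ role: "user", content: message });
  sessions.set(sessionId, history);
  // ...retrieve context, call the model, push the assistant turn (see the loop above)...
  return c.json({ sessionId, turns: history.length });
});

// Inspect history for a given session.
app.get("/history/:id", (c) => c.json(sessions.get(c.req.param("id")) ?? []));

export default app;
```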
A real request walkthrough
When a user asks:
“What products do we have?”
The system:
- Embeds the question
- Retrieves product-related chunks
- Injects them into the prompt
- Asks Ollama to answer using only that context
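The "only that context" part is just a prompt convention. A bounded prompt can be as simple as this (the wording is illustrative):

```ts
// Bounded prompt: the model only sees what retrieval put here.
function buildPrompt(question: string, chunks: string[]): string {
  return [
    "Answer the question using only the context below.",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    ...chunks.map((c, i) => `[${i + 1}] ${c}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}
```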
The result isn’t impressive because it’s clever.
It’s impressive because it’s correct.
What worked well
- Local models were more than sufficient
- RAG drastically reduced hallucinations
- MCP kept the system maintainable
- Simple cosine similarity was enough
What I’d change next time
- Persist embeddings instead of keeping everything in memory
- Stream responses
- Cache embeddings aggressively
- Add observability around retrieval scores
The architecture already supports these changes, which is a good sign.
Who this approach is for
This setup works well if you’re:
- Building internal tools
- Working with private data
- Maintaining a wiki or knowledge base
- Shipping something small but real
If you need internet-scale reasoning or massive throughput, you’ll need different trade-offs.
Final thoughts
AI agents get a lot of hype, but most of the real value comes from boring engineering decisions:
- Clear boundaries
- Deterministic context
- Grounded answers
RAG, MCP, and Ollama aren’t magic on their own. Together, they form a system that’s understandable, debuggable, and, most importantly, useful.
If your agent feels fragile, slow down and inspect the context. That’s where most problems start.
Happy building.