Building a Practical AI Agent with RAG, MCP, and Ollama

How I built (and iterated on) a production‑ready AI agent that actually works with real data

Introduction

Over the last year, “AI agents” went from a buzzword to something people actually try to ship. And if you’ve tried to build one beyond a demo, you’ve probably hit the same walls I did:

  • The model sounds confident but is still wrong
  • It knows nothing about your data
  • You end up duct-taping APIs together with prompts
  • Costs and latency quietly get out of hand

This post is a write-up of how I dealt with those problems in a real project. I’ll walk through how I built an AI agent that:

  • Grounds answers in real documents using RAG
  • Talks to external systems in a structured way using MCP
  • Runs entirely locally with Ollama

This isn’t a theoretical overview or a framework comparison. It’s the architecture, trade-offs, and lessons learned from something I actually run.


Why “just an LLM” isn’t enough

A plain LLM is impressive, but it hits three hard limits very quickly:

  1. Static knowledge: it only knows what it was trained on
  2. No source of truth: it can’t verify its own answers
  3. No real integration: APIs and databases are glued on ad hoc

When people say “agent,” what they usually mean (whether they realize it or not) is an LLM plus:

  • Memory
  • Tools
  • Access to external, up-to-date data

That’s where RAG, MCP, and Ollama fit in.


Core technologies (in plain terms)

RAG (Retrieval-Augmented Generation)

RAG is often explained in abstract diagrams. Here’s the practical version:

Before asking the model to answer, you look up relevant information and then force the model to use it.

In concrete terms:

  1. Split documents into chunks
  2. Convert each chunk into an embedding
  3. Store those embeddings (in memory or a vector store)
  4. For every question:
    • Embed the question
    • Find the most similar chunks
    • Inject them into the prompt

The main benefit isn’t “smarter answers.” It’s bounded answers. The model is constrained by the context you give it.

When RAG shines

  • Internal documentation
  • Product catalogs
  • Wikis and knowledge bases
  • Anything that changes frequently

When RAG struggles

  • Very small datasets (the overhead isn’t worth it)
  • Poor chunking strategies
  • Vague or underspecified questions

Chunking and retrieval quality matter more than model choice. By a lot.


MCP (Model Context Protocol)

MCP is the least talked-about piece, but it’s what makes the agent feel alive.

Think of MCP as a contract between your agent and the outside world. Instead of writing prompts like:

“If the user asks about products, call this API…”

You expose structured endpoints that the agent can consume reliably.

In my setup, MCP handles:

  • Fetching data from services
  • Returning normalized JSON
  • Acting as a stable boundary between AI and business logic

That boundary matters. Prompts shouldn’t know how your database works.
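To make the idea concrete, here’s a sketch of what such a contract can look like from the agent’s side. This is not the actual MCP SDK; the get_products tool, its fields, and the catalog URL are hypothetical stand-ins for the idea of a typed, named endpoint with the business logic hidden behind it.

```typescript
// Illustrative shape of an MCP-style tool contract:
// a name, a typed input, and normalized JSON out.
interface ToolDefinition<In, Out> {
  name: string;
  description: string;
  handler: (input: In) => Promise<Out>;
}

type ProductQuery = { category?: string };
type Product = { id: string; name: string; price: number };

const getProducts: ToolDefinition<ProductQuery, Product[]> = {
  name: "get_products",
  description: "Fetch products from the catalog service, optionally filtered by category.",
  handler: async ({ category }) => {
    // Business logic lives here, behind the boundary, never in the prompt.
    const url = `http://localhost:4000/products${category ? `?category=${category}` : ""}`;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`catalog service returned ${res.status}`);
    return res.json();
  },
};
```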


Ollama (local LLM runtime)

I deliberately chose Ollama for three reasons:

  1. Privacy: data never leaves my machine
  2. Predictable cost: zero API usage
  3. Fast iteration: swap models in seconds

I use Ollama for:

  • Text generation (llama3.2)
  • Embeddings (nomic-embed-text)

Would a hosted model perform better? Sometimes. But for internal tools and knowledge agents, local inference is more than good enough.
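For reference, this is roughly what a non-streaming call to a local Ollama instance looks like, assuming the default port (11434) and the llama3.2 model; the generate helper name is mine.

```typescript
// Ask the local Ollama instance for a completion, given a system and user message.
async function generate(system: string, user: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    body: JSON.stringify({
      model: "llama3.2",
      stream: false,
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}
```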


High-level architecture

This is the mental model I used while building the system:

Client → API → Agent
              ├─ RAG (documents)
              ├─ MCP (external data)
              └─ Ollama (reasoning)

The agent itself is intentionally thin. It orchestrates components, but it doesn’t “own” knowledge.

That decision paid off later when I needed to change retrieval logic without touching the API layer.
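One way to express that thinness is to have the agent depend only on narrow interfaces. The names below are illustrative; the concrete sketches later in this post are one possible implementation of them.

```typescript
// The agent orchestrates three narrow dependencies and owns no knowledge itself.
interface Retriever {
  retrieve(question: string, topK: number): Promise<string[]>; // RAG
}

interface ExternalData {
  call(tool: string, args: unknown): Promise<unknown>; // MCP
}

interface LanguageModel {
  complete(system: string, user: string): Promise<string>; // Ollama
}

// Everything the agent is allowed to touch, in one place.
interface AgentDeps {
  rag: Retriever;
  mcp: ExternalData;
  llm: LanguageModel;
}
```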


Project structure and setup

The project is written in TypeScript, using Hono for the HTTP layer. Nothing exotic: just fast startup and a clean API surface.

Configuration lives in one place so models, ports, chunk sizes, and limits can be tweaked without digging through the codebase.

This sounds boring, but once you start tuning retrieval and context windows, you’ll be glad it’s centralized.
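As an example of what that can look like (the specific values here are illustrative defaults, not the project’s real numbers):

```typescript
// config.ts: one place for everything worth tuning during retrieval experiments.
export const config = {
  port: 3000,
  ollamaUrl: "http://localhost:11434",
  chatModel: "llama3.2",
  embeddingModel: "nomic-embed-text",
  chunkSize: 500,        // characters per chunk
  topK: 4,               // chunks injected per question
  maxContextChars: 6000, // hard cap on assembled context
} as const;
```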


Implementing RAG (what actually matters)

Most RAG tutorials jump straight to vector databases. I didn’t.

For an initial version, in-memory embeddings are:

  • Easier to reason about
  • Easier to debug
  • Surprisingly effective

The RAG service focuses on three things:

  1. Generating embeddings in parallel
  2. Scoring chunks with cosine similarity
  3. Returning only the most relevant chunks
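A sketch of the first two, reusing the embed helper from earlier; the math is standard cosine similarity, nothing project-specific.

```typescript
// Embed chunks in parallel; Promise.all is all the infrastructure this needs.
async function embedAll(chunks: string[]): Promise<number[][]> {
  return Promise.all(chunks.map((chunk) => embed(chunk)));
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```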

The main takeaway: retrieval quality matters more than generation quality.

A smaller model with good context beats a larger model that has to guess.


Semantic search: the quiet workhorse

Cosine similarity is boring, and that’s a good thing.

At query time:

  • The question becomes a vector
  • Every stored chunk becomes a score
  • The highest scores win

I added logging around retrieval early on. Seeing which chunks were selected helped me fix bad chunking faster than any benchmark.

If your agent feels dumb, inspect the retrieved context first.
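Here’s what that looks like in the sketch, building on the store, embed, and cosineSimilarity helpers above; the topK default and the log format are arbitrary.

```typescript
// Embed the question, score every stored chunk, keep the best few, and log them.
async function retrieve(question: string, topK = 4): Promise<Chunk[]> {
  const queryEmbedding = await embed(question);

  const scored = store
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // If the agent feels dumb, this is the first thing to read.
  for (const { chunk, score } of scored) {
    console.log(`[rag] ${score.toFixed(3)} :: ${chunk.text.slice(0, 60)}...`);
  }

  return scored.map((s) => s.chunk);
}
```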


MCP: keeping the agent honest

Instead of letting the model invent API calls, MCP enforces structure.

From the agent’s point of view:

  • “Products” are just JSON
  • “Customers” are just JSON
  • Errors are explicit, not hallucinated

From the system’s point of view:

  • Business logic stays out of prompts
  • APIs remain testable
  • AI failures don’t corrupt the state

This separation alone reduced prompt complexity more than anything else I tried.
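One small pattern that helps here is wrapping every call in a tagged result, so failures become data the agent can reason about rather than something it papers over. The shape below is illustrative.

```typescript
// Every MCP call comes back as explicit success or explicit failure.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

async function callTool<T>(fn: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    return { ok: true, data: await fn() };
  } catch (err) {
    // The model never sees a stack trace or a guessed answer, just a clear error string.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```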


The agent loop (what ties everything together)

Every request follows the same flow:

  1. Receive user input
  2. Retrieve relevant documents (RAG)
  3. Fetch external data if needed (MCP)
  4. Assemble a bounded context
  5. Ask the model to respond
  6. Store conversation state

The most important design choice: context is assembled deterministically.
The model never decides what to include.

That single rule eliminated an entire class of hallucinations.
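Put together, the loop looks roughly like this. It reuses the helpers sketched earlier (retrieve, callTool, getProducts, generate, config), all of which are illustrative names; the point is that every line of context assembly is ordinary code, not a model decision.

```typescript
// In-memory conversation state, keyed by session id.
const history = new Map<string, { question: string; answer: string }[]>();

async function handleMessage(sessionId: string, question: string): Promise<string> {
  // 1-2. Receive input, retrieve relevant chunks (RAG)
  const chunks = await retrieve(question, config.topK);

  // 3. Fetch external data (MCP), unconditionally here for brevity
  const products = await callTool(() => getProducts.handler({}));

  // 4. Assemble a bounded context, deterministically
  const context = [
    "## Documents",
    ...chunks.map((c) => c.text),
    "## Products",
    products.ok ? JSON.stringify(products.data) : `(unavailable: ${products.error})`,
  ].join("\n").slice(0, config.maxContextChars);

  // 5. Ask the model to respond using only that context
  const answer = await generate(
    `Answer using only the context below. If the answer is not in it, say so.\n\n${context}`,
    question,
  );

  // 6. Store conversation state
  const turns = history.get(sessionId) ?? [];
  turns.push({ question, answer });
  history.set(sessionId, turns);

  return answer;
}
```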


API layer and sessions

The HTTP API is intentionally boring:

  • Load data
  • Chat
  • Inspect history
  • Inspect RAG state

Sessions are tracked explicitly by ID, not by “magic memory.” This makes debugging easier and avoids ghost-context bugs.
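The Hono surface for the chat and history endpoints is only a few lines. Route paths and payload shapes here are illustrative, and handleMessage / history come from the loop sketch above.

```typescript
import { Hono } from "hono";

const app = new Hono();

// Chat: one message in, one grounded answer out.
app.post("/chat", async (c) => {
  const { sessionId, message } = await c.req.json();
  const answer = await handleMessage(sessionId, message);
  return c.json({ answer });
});

// Inspect history for a session.
app.get("/history/:sessionId", (c) => {
  return c.json(history.get(c.req.param("sessionId")) ?? []);
});

export default app;
```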


A real request walkthrough

When a user asks:

“What products do we have?”

The system:

  1. Embeds the question
  2. Retrieves product-related chunks
  3. Injects them into the prompt
  4. Asks Ollama to answer using only that context

The result isn’t impressive because it’s clever.
It’s impressive because it’s correct.
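From the outside, that whole walkthrough is one HTTP call (the port and payload match the illustrative Hono sketch above):

```typescript
// Ask the running agent a question and print the grounded answer.
const res = await fetch("http://localhost:3000/chat", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ sessionId: "demo", message: "What products do we have?" }),
});

const { answer } = await res.json();
console.log(answer);
```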


What worked well

  • Local models were more than sufficient
  • RAG drastically reduced hallucinations
  • MCP kept the system maintainable
  • Simple cosine similarity was enough

What I’d change next time

  • Persist embeddings instead of keeping everything in memory
  • Stream responses
  • Cache embeddings aggressively
  • Add observability around retrieval scores

The architecture already supports these changes, which is a good sign.


Who this approach is for

This setup works well if you’re:

  • Building internal tools
  • Working with private data
  • Maintaining a wiki or knowledge base
  • Shipping something small but real

If you need internet-scale reasoning or massive throughput, you’ll need different trade-offs.


Final thoughts

AI agents get a lot of hype, but most of the real value comes from boring engineering decisions:

  • Clear boundaries
  • Deterministic context
  • Grounded answers

RAG, MCP, and Ollama aren’t magic on their own. Together, they form a system that’s understandable, debuggable, and, most importantly, useful.

If your agent feels fragile, slow down and inspect the context. That’s where most problems start.

Happy building.
