Introduction
Over the last year, “AI agents” went from a buzzword to something people actually try to ship. And if you’ve tried to build one beyond a demo, you’ve probably hit the same walls I did:
- The model sounds confident and is still wrong
- It knows nothing about your data
- You end up duct-taping APIs together with prompts
- Costs and latency quietly get out of hand
This post is a write-up of how I dealt with those problems in a real project. I’ll walk through how I built an AI agent that:
- Grounds answers in real documents using RAG
- Talks to external systems in a structured way using MCP
- Runs entirely locally with Ollama
This isn’t a theoretical overview or a framework comparison. It’s the architecture, trade-offs, and lessons learned from something I actually run.
Why “just an LLM” isn’t enough
A plain LLM is impressive, but it hits three hard limits very quickly:
- Static knowledge: it only knows what it was trained on
- No source of truth: it can't verify its own answers
- No real integration: APIs and databases are glued on ad hoc
When people say “agent,” what they usually mean (whether they realize it or not) is an LLM plus:
- Memory
- Tools
- Access to external, up-to-date data
That’s where RAG, MCP, and Ollama fit in.
Core technologies (in plain terms)
RAG (Retrieval-Augmented Generation)
RAG is often explained in abstract diagrams. Here’s the practical version:
Before asking the model to answer, you look up relevant information and then force the model to use it.
In concrete terms:
- Split documents into chunks
- Convert each chunk into an embedding
- Store those embeddings (in memory or a vector store)
- For every question:
  - Embed the question
  - Find the most similar chunks
  - Inject them into the prompt
The main benefit isn’t “smarter answers.” It’s bound answers. The model is constrained by the context you give it.
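Here's a minimal sketch of that pipeline in TypeScript, assuming a local Ollama server and its embeddings endpoint; the helper names and the in-memory store are illustrative, not the exact code from the project:

```ts
// Minimal ingestion sketch: chunks in, embeddings kept in memory.
// Assumes Ollama is running on its default port; names are illustrative.
type Chunk = { text: string; embedding: number[] };

const store: Chunk[] = [];

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = (await res.json()) as { embedding: number[] };
  return embedding;
}

// Split → embed in parallel → keep in memory.
async function indexChunks(chunks: string[]): Promise<void> {
  const embeddings = await Promise.all(chunks.map(embed));
  chunks.forEach((text, i) => store.push({ text, embedding: embeddings[i] }));
}
```

Query time is the same idea in reverse: embed the question, score it against the store, and hand the winners to the prompt.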
When RAG shines
- Internal documentation
- Product catalogs
- Wikis and knowledge bases
- Anything that changes frequently
When RAG struggles
- Very small datasets (the overhead isn’t worth it)
- Poor chunking strategies
- Vague or underspecified questions
Chunking and retrieval quality matter more than model choice. By a lot.
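If you're starting from scratch, a naive fixed-size splitter with overlap is a reasonable baseline; the numbers below are placeholders to tune against your own documents, not recommendations:

```ts
// Naive fixed-size chunking with overlap. Tune size/overlap per corpus;
// these defaults are only a starting point.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```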
MCP (Model Context Protocol)
MCP is the least talked-about piece, but it’s what makes the agent feel alive.
Think of MCP as a contract between your agent and the outside world. Instead of writing prompts like:
“If the user asks about products, call this API…”
You expose structured endpoints that the agent can consume reliably.
In my setup, MCP handles:
- Fetching data from services
- Returning normalized JSON
- Acting as a stable boundary between AI and business logic
That boundary matters. Prompts shouldn’t know how your database works.
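The exact wiring depends on which MCP SDK you use; the sketch below is a hand-rolled illustration of the contract idea rather than the official SDK: typed data in, normalized JSON out, explicit errors.

```ts
// Illustration of the contract idea only (not the official MCP SDK);
// the Product shape and backing service URL are hypothetical.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

interface Product {
  id: string;
  name: string;
  price: number;
}

// The agent only ever sees this shape; how the data is fetched stays hidden.
async function getProducts(): Promise<ToolResult<Product[]>> {
  try {
    const res = await fetch("http://localhost:4000/products"); // hypothetical service
    if (!res.ok) return { ok: false, error: `upstream returned ${res.status}` };
    return { ok: true, data: (await res.json()) as Product[] };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}
```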
Ollama (local LLM runtime)
I deliberately chose Ollama for three reasons:
- Privacy: data never leaves my machine
- Predictable cost: zero API usage
- Fast iteration: swap models in seconds
I use Ollama for:
- Text generation (llama3.2)
- Embeddings (nomic-embed-text)
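For reference, the generation side is just a small fetch against Ollama's HTTP API. A non-streaming sketch with error handling omitted (the embeddings call mirrors the embed() helper shown earlier):

```ts
// Text generation with llama3.2 via Ollama's HTTP API (non-streaming).
async function generate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.2", prompt, stream: false }),
  });
  const { response } = (await res.json()) as { response: string };
  return response;
}
```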
Would a hosted model perform better? Sometimes. But for internal tools and knowledge agents, local inference is more than good enough.
High-level architecture
This is the mental model I used while building the system:
Client → API → Agent
├─ RAG (documents)
├─ MCP (external data)
└─ Ollama (reasoning)

The agent itself is intentionally thin. It orchestrates components, but it doesn't "own" knowledge.
That decision paid off later when I needed to change retrieval logic without touching the API layer.
Project structure and setup
The project is written in TypeScript, using Hono for the HTTP layer. Nothing exotic, just fast startup and a clean API surface.
Configuration lives in one place so models, ports, chunk sizes, and limits can be tweaked without digging through the codebase.
This sounds boring, but once you start tuning retrieval and context windows, you’ll be glad it’s centralized.
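Something as small as this goes a long way; the fields and values here are examples of what's worth centralizing, not the project's actual config:

```ts
// Centralized config sketch: models, ports, and retrieval knobs in one place.
// Field names and defaults are illustrative.
export const config = {
  ollamaUrl: "http://localhost:11434",
  chatModel: "llama3.2",
  embeddingModel: "nomic-embed-text",
  port: 3000,
  chunkSize: 500,
  chunkOverlap: 50,
  topK: 3,
};
```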
Implementing RAG (what actually matters)
Most RAG tutorials jump straight to vector databases. I didn’t.
For an initial version, in-memory embeddings are:
- Easier to reason about
- Easier to debug
- Surprisingly effective
The RAG service focuses on three things:
- Generating embeddings in parallel
- Scoring similarity with cosine distance
- Returning only the most relevant chunks
The main takeaway: retrieval quality matters more than generation quality.
A smaller model with good context beats a larger model’s guessing.
Semantic search: the quiet workhorse
Cosine similarity is boring, and that’s a good thing.
At query time:
- The question becomes a vector
- Every document becomes a score
- The highest scores win
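The scoring itself fits in a few lines. A self-contained sketch (the Chunk type mirrors the ingestion sketch above), including score logging around retrieval:

```ts
// Cosine similarity plus top-k ranking over the in-memory chunks.
type Chunk = { text: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(query: number[], chunks: Chunk[], k = 3): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk, score }) => {
      // Log which chunks were selected and how strongly they matched.
      console.log(`score=${score.toFixed(3)} chunk="${chunk.text.slice(0, 60)}..."`);
      return chunk;
    });
}
```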
I added logging around retrieval early on. Seeing which chunks were selected helped me fix bad chunking faster than any benchmark.
If your agent feels dumb, inspect the retrieved context first.
MCP: keeping the agent honest
Instead of letting the model invent API calls, MCP enforces structure.
From the agent’s point of view:
- “Products” are just JSON
- “Customers” are just JSON
- Errors are explicit, not hallucinated
From the system’s point of view:
- Business logic stays out of prompts
- APIs remain testable
- AI failures don’t corrupt the state
This separation alone reduced prompt complexity more than anything else I tried.
The agent loop (what ties everything together)
Every request follows the same flow:
- Receive user input
- Retrieve relevant documents (RAG)
- Fetch external data if needed (MCP)
- Assemble a bounded context
- Ask the model to respond
- Store conversation state
The most important design choice: context is assembled deterministically.
The model never decides what to include.
That single rule eliminated an entire class of hallucinations.
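Sketched in TypeScript, the loop looks roughly like this; the helpers are placeholders for the RAG, MCP, storage, and Ollama pieces described above, declared here only so the sketch type-checks:

```ts
// Placeholder declarations for the components this loop orchestrates.
declare function retrieve(question: string): Promise<string[]>;            // RAG
declare function fetchExternalData(question: string): Promise<unknown[]>;  // MCP
declare function buildPrompt(q: string, docs: string[], data: unknown[]): string;
declare function generate(prompt: string): Promise<string>;                // Ollama
declare function saveTurn(sessionId: string, q: string, a: string): Promise<void>;

async function handleMessage(sessionId: string, message: string): Promise<string> {
  const docs = await retrieve(message);                 // 1. relevant chunks
  const external = await fetchExternalData(message);    // 2. structured external data
  const prompt = buildPrompt(message, docs, external);  // 3. deterministic context assembly
  const answer = await generate(prompt);                // 4. model responds
  await saveTurn(sessionId, message, answer);           // 5. store conversation state
  return answer;
}
```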
API layer and sessions
The HTTP API is intentionally boring:
- Load data
- Chat
- Inspect history
- Inspect RAG state
Sessions are tracked explicitly by ID, not by “magic memory.” This makes debugging easier and avoids ghost-context bugs.
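With Hono, that boils down to routes that take the session ID explicitly. A trimmed-down sketch (route shapes and field names are illustrative):

```ts
import { Hono } from "hono";

// Explicit sessions: the client supplies a session ID; nothing is implicit.
type Turn = { role: "user" | "assistant"; content: string };

const app = new Hono();
const sessions = new Map<string, Turn[]>();

app.post("/chat", async (c) => {
  const { sessionId, message } = await c.req.json();
  const history = sessions.get(sessionId) ?? [];
  history.push({ role: "user", content: message });
  sessions.set(sessionId, history);
  // ...retrieve context, call the model, push the assistant turn (see the loop above)...
  return c.json({ sessionId, turns: history.length });
});

// Inspect history for a given session.
app.get("/history/:id", (c) => c.json(sessions.get(c.req.param("id")) ?? []));

export default app;
```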
A real request walkthrough
When a user asks:
“What products do we have?”
The system:
- Embeds the question
- Retrieves product-related chunks
- Injects them into the prompt
- Asks Ollama to answer using only that context
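The "only that context" part is just a prompt convention. A bounded prompt can be as simple as this (the wording is illustrative):

```ts
// Bounded prompt: the model only sees what retrieval put here.
function buildPrompt(question: string, chunks: string[]): string {
  return [
    "Answer the question using only the context below.",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    ...chunks.map((c, i) => `[${i + 1}] ${c}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}
```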
The result isn’t impressive because it’s clever.
It’s impressive because it’s correct.
What worked well
- Local models were more than sufficient
- RAG drastically reduced hallucinations
- MCP kept the system maintainable
- Simple cosine similarity was enough
What I’d change next time
- Persist embeddings instead of keeping everything in memory
- Stream responses
- Cache embeddings aggressively
- Add observability around retrieval scores
The architecture already supports these changes, which is a good sign.
Who this approach is for
This setup works well if you’re:
- Building internal tools
- Working with private data
- Maintaining a wiki or knowledge base
- Shipping something small but real
If you need internet-scale reasoning or massive throughput, you’ll need different trade-offs.
Final thoughts
AI agents get a lot of hype, but most of the real value comes from boring engineering decisions:
- Clear boundaries
- Deterministic context
- Grounded answers
RAG, MCP, and Ollama aren’t magic on their own. Together, they form a system that’s understandable, debuggable, and, most importantly, useful.
If your agent feels fragile, slow down and inspect the context. That’s where most problems start.
Happy building.