Technical Challenges in Building Voice Agents

Learn techniques such as Token Optimization, Function Calling, RAG, and Fine-Tuning

In our previous blog post, Creating Your First Voice Agent, we showed you how to create your first voice agent with Pipecat and took a brief look at VAPI and Telnyx. If you missed that post, we recommend reading it first, though this one stands on its own!

Regardless of the tool you choose, there are several common technical challenges you’ll likely encounter when building your voice agent. Let’s take a look at some of them:

Token Optimization

One important technical detail you’ll face when working with Pipecat and OpenAI relates to how the system “remembers” the conversation. Unless you’re using the Realtime API, which tends to be significantly more expensive, you’ll likely rely on the Chat Completions API to orchestrate the dialogue.

This means that for the voice agent to “remember” what was said, the entire conversation (or at least a substantial portion of it) must be resent with each new interaction. It’s as if you had to reread the entire script of a play every time an actor delivers a new line.
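The play-script analogy above can be sketched in a few lines. This is a minimal illustration of the Completions-style message format, not real network code: the `remember_turn` helper and the sample dialogue are invented, and in a real agent the full `history` list would be sent to the model (e.g. via an OpenAI client call) on every turn.

```python
# A conversation "remembered" by resending everything: the whole
# message history grows each turn and is what gets sent to the LLM.
history = [
    {"role": "system", "content": "You are a helpful voice agent."}
]

def remember_turn(user_text: str, assistant_text: str) -> list:
    """Append one exchange; the entire list is resent on the next request."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history

remember_turn("Hi, I'd like to book a flight.", "Sure! Where to?")
remember_turn("To Lisbon, next Friday.", "Got it, Lisbon next Friday.")

# Every new request carries all previous turns:
print(len(history))  # 5 messages: 1 system prompt + 2 full exchanges
```

Notice that the cost of each request grows with the conversation, which is exactly the problem the strategies below address.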

The more tokens you send and receive, the more expensive and potentially slower each request becomes, not to mention that every model enforces its own rate limits. But don’t worry, it’s not the end of the world! There are smart strategies to mitigate this.

One of them is context summarization: instead of resending everything, older parts of the conversation are condensed so that only a concise summary of what has already been discussed is included. Another approach is to set clear limits on the size of the context you maintain, discarding less relevant parts of the conversation. After all, no one wants to listen to a monologue as a response or have to remember every single detail of a long discussion. You can also use prompt caching: by storing frequently used parts of prompts, the ones common across conversations, you avoid resending the same information repeatedly.

Prompting

Another major challenge involves controlling the agent’s behavior during a conversation. How does it manage to give the “right” answers? Well, it’s important to remember that “right” and “wrong” are often subjective. To guide the agent effectively, you need to craft a well-structured prompt, also known as a system prompt, that clearly defines the desired behavior.

Improving your instructions through prompt engineering helps you create more effective prompts. Think of it like giving detailed instructions to a friend who’s helping you with something: it’s not just about throwing in a few words. The structure of the prompt matters a lot! A well-crafted prompt acts like a clear script, guiding the AI on what to do, when to ask, how to respond, and how to behave.

Creating effective prompts means providing all the necessary context so the agent fully understands your intent. The more “intelligent” your prompt, the more intelligent and useful your agent’s responses will be. These instructions are directed at the LLM, but prompting can also help guide other parts of your application, for example the TTS system, to format information correctly, such as reading a phone number as digits or in a more natural spoken format.

The same applies to STT, where you might want to avoid transcribing special characters or make list interpretations sound more natural. In practice, this can turn a robotic response like “One, do this action; Two, do that action” into something more conversational, such as “First, perform this action. Then, you should do the next one.”
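Putting the ideas above together, here is one hedged sketch of what such a system prompt might look like. The agent’s name, role, and every rule in it are invented for illustration; the useful part is the structure: behavior rules first, then voice-specific formatting rules for the TTS output.

```python
# An illustrative system prompt for a voice agent. The persona and
# rules are made up; adapt them to your own use case.
SYSTEM_PROMPT = """\
You are Ava, a friendly phone support agent for an online store.

Behavior:
- Greet the caller once, then get straight to their issue.
- Ask one question at a time; never offer more than two options at once.
- If you don't know an answer, say so and offer to transfer the call.

Formatting (your reply will be read aloud by a TTS engine):
- Read phone numbers digit by digit, e.g. "five five five, one two three four".
- Avoid special characters, bullet points, and numbered lists;
  say "first", "then", and "finally" instead.
- Keep replies under three sentences.
"""

# The prompt becomes the first message of every conversation:
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
```

Keeping the voice-formatting rules in the prompt is what turns “1. Do this; 2. Do that” into the conversational phrasing described above.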

RAG and Knowledge Bases

But what happens when your agent needs to be more than just conversational? What if it needs to provide accurate answers about highly specific information, like product details or your company’s internal policies? Even more challenging, what if that information changes frequently, such as product prices or business hours that vary on holidays? That’s where Knowledge Bases and RAG (Retrieval-Augmented Generation) come into play.

Think of a Knowledge Base as a custom-built library for your agent. It can include internal company documents, product manuals, FAQs, articles, or any data you’ve collected, across multiple formats such as text, spreadsheets, links, images, and more. With this “library,” your agent no longer has to guess or rely solely on its general training data. Instead, it now has a source of truth to consult.

Among the many techniques available, one particularly powerful approach involves using embeddings to retrieve relevant information. In this setup, instead of the LLM “remembering” everything, it queries an external knowledge base and retrieves only the specific pieces of information relevant to the current question. This is the essence of RAG, and it’s a true game changer! Think of it like performing a Google search: you don’t read the entire internet; you only read the most relevant results for your query.

In practice, RAG allows the agent to search for relevant information in your Knowledge Base before generating a response, and then use that information to produce a much more accurate and contextually rich answer. This helps prevent the notorious “hallucinations” (when AI makes things up) and ensures that responses are aligned with your actual data.
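The retrieval step can be illustrated with a toy example. In a real system the vectors would come from an embedding model and live in a vector database; here, tiny hand-made vectors and an invented three-document knowledge base stand in, so only the ranking logic is real.

```python
import math

# Each document is paired with a toy "embedding" vector. Real
# embeddings have hundreds or thousands of dimensions.
knowledge_base = [
    ("Store hours: 9am-6pm, closed on public holidays.", [0.9, 0.1, 0.0]),
    ("Returns are accepted within 30 days with a receipt.", [0.1, 0.9, 0.1]),
    ("Premium plan costs $29/month, billed annually.", [0.0, 0.2, 0.9]),
]

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list, top_k: int = 1) -> list:
    """Return the top_k documents most similar to the query embedding."""
    ranked = sorted(knowledge_base,
                    key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# A query like "When are you open?" would embed close to the first doc:
print(retrieve([0.8, 0.2, 0.1]))
```

The retrieved snippet is then pasted into the prompt so the LLM answers from your data instead of guessing.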

A great example of a RAG-based tool that integrates easily with Pipecat is Mem0. Mem0 handles the entire RAG process in a particularly elegant way, giving your agent a kind of “memory” that evolves as conversations unfold. For instance, if a user tells the agent they purchased a travel package to a certain destination on a specific date for a given price, Mem0 organizes, classifies, and stores that information in its database. Later, when relevant, it can recall those details for the assistant. So, if the same user returns and asks, “How much have I spent on travel this year?” or “When did I travel to that country last year?”, the agent can query Mem0’s database and provide an accurate, data-driven response.

Tools / Function Calling

Although it may seem like a voice agent’s main job is to hold a conversation, in many cases it also needs to take action; it needs to do things! The capability that enables this comes from tools.

In real-world scenarios, your agent might need to transfer a call to another department when the topic becomes too complex, send a confirmation SMS, email order details, or even schedule a reminder for you. In a voice context, it might also need to end the call gracefully to ensure you’re not billed for extra time.

To use tools, also known as function calling, you must instruct your agent to use them intelligently. That means defining clear rules, such as which actions are valid under specific conditions (for example, “when the user says goodbye, end the call”) and what restrictions should be enforced to ensure everything works safely and reliably (for example, “if the user says they still have questions, don’t hang up!”).

It’s worth noting that tool usage can range from extremely simple to highly complex. The agent receives instructions about which tools it can access, when to use them, and how, but the actual implementation of these tools can vary greatly depending on your goals. Keep in mind that the code for each tool must run somewhere, either within the same project as your agent or on external servers via APIs. With that in mind, it’s important to plan your agent’s architecture carefully, as you may need to handle authentication, security, middleware, processing queues, and other technical details that can complicate development. You’ll also need to account for potential exceptions, failures, and errors that may occur during tool execution.
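To make this concrete, here is a hedged sketch of the two halves involved: the tool schema the model sees (OpenAI-style function-calling format), and a local dispatcher that runs the matching function while guarding against the failures mentioned above. The `end_call` tool, its fields, and the error-handling policy are all illustrative.

```python
import json

# The schema the LLM receives: what the tool does, when to use it,
# and which arguments it takes. The description encodes the rules
# ("only after the user confirms") discussed above.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "end_call",
        "description": "Hang up politely. Only call this after the "
                       "user confirms they have no more questions.",
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
}]

def end_call(reason: str) -> str:
    """Illustrative implementation; a real one would hang up the call."""
    return f"Call ended: {reason}"

HANDLERS = {"end_call": end_call}

def dispatch(tool_name: str, arguments_json: str) -> str:
    """Run the tool the model requested, guarding against bad input."""
    try:
        args = json.loads(arguments_json)
        return HANDLERS[tool_name](**args)
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Feed the error back to the model instead of crashing the call.
        return f"Tool error: {exc}"

print(dispatch("end_call", '{"reason": "user said goodbye"}'))
```

Returning errors as strings lets the model recover in-conversation, which matters on a live phone call where an unhandled exception means dead air.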

Fine-Tuning

Once your agent is fully functional with all its capabilities defined, you can make it even better! Let’s say the only thing missing is its ability to understand your company’s jargon or handle complex commands specific to your application.

That’s where fine-tuning comes in! Think of it as specialized training, you take an already intelligent language model and teach it to become even smarter, making it more accurate and fluent in tasks unique to your project. How? By feeding it custom datasets, collections of examples that reflect how your users typically interact. This allows your agent not only to understand but also to respond with clarity and relevance that generic models simply can’t achieve in specialized contexts, taking the user experience to a whole new level.

Fine-tuning can be a complex process that requires a significant amount of data and computational resources. However, the results can be well worth it, especially if your goal is to build a voice agent that truly stands out for its precision and responsiveness in specific scenarios.

For example, consider a video game tech support company. When dealing with console issues, the agent might struggle to respond effectively, after all, the problem could involve a PlayStation, Xbox, or Nintendo system, each with its own quirks. Now imagine a user reports a display issue. Each console handles that differently, and since these problems are common, the same questions are often repeated: “Have you tried restarting it? Is the cable properly connected? Have you tested it on another TV?” With fine-tuning, you can train your voice agent to recognize these recurring patterns of questions and responses, enabling it to provide quick, accurate solutions, saving time for both the user and the support team.
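Training data like the console-support dialogues described above is typically prepared as chat-format JSONL: one JSON object per line, each containing a full example conversation. The dialogues below are invented to match the example; a real dataset would need many more examples than this.

```python
import json

# Two invented training examples in chat-format JSONL for fine-tuning:
# each line teaches the model one question-and-answer pattern.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a console support agent."},
        {"role": "user", "content": "My Xbox shows no display."},
        {"role": "assistant", "content": "Let's check the basics first: "
            "is the HDMI cable firmly connected, and have you tried "
            "another TV or another port?"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a console support agent."},
        {"role": "user", "content": "My PlayStation screen is black."},
        {"role": "assistant", "content": "Try restarting in safe mode, "
            "then confirm the cable works on another screen."},
    ]},
]

# One JSON object per line is the usual upload format:
jsonl = "\n".join(json.dumps(e) for e in examples)
print(len(jsonl.splitlines()))  # 2 training examples
```

The resulting file is what you upload to the fine-tuning job; the model then learns to reproduce these recurring diagnostic patterns on its own.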

Conclusion

Building an AI-powered voice agent is both exciting and challenging, so mastering techniques such as token optimization, prompt engineering, RAG, tools and function calling, and even fine-tuning will help you design increasingly sophisticated voice agents capable of meeting your users’ specific needs.

The key takeaway here is: always keep learning and exploring new tools and techniques! Don’t assume that the first solution you find is the best or only way to do something.

We want to work with you. Check out our Services page!

Lucas Geron, Back-end Developer at CodeMiner42