Tool or Agent? The impact of AI on your code and on your wallet

It all boils down to math again!

Hello again! It has been some time since I last posted here. I have been hoarding more knowledge about state-of-the-art LLMs and all models currently available on the market, so I may be more helpful.

Recently, a survey on AI usage by Stack Overflow caught my attention, and I thought, “Why not put my input here with everything I’ve been learning so far?”

So let’s start!

AI as an Agent vs AI as a dev tool

First of all, it's important to understand that using Claude or Copilot while developing a feature in your project, or even vibe coding an entire app, doesn't mean you are incorporating an LLM into your code.

What's an LLM? It stands for Large Language Model: an AI trained on large amounts of data to generate text, code, or even images and audio.

There are differences in the final product when you resort to using an AI Coding Assistant versus when you actually implement a Large Language Model into your code.

It’s important to distinguish between them, so I’ll break that down for you.

→ What is an Agent?

It is an AI entity that acts autonomously, making decisions based on a specific task.

Basically, the literal implementation of an AI into your code to deal with tasks in your project, whatever they may be.

Let's say you are building an endpoint that uses one of OpenAI's models to create a chatbot, for example. You can use that same completion capability to power other features of your application.

For instance, if you need a flexible way to validate request input, you can have the AI Agent receive the request data as a string and interpret it based on rules you've given it to work with. Instead of throwing an immediate error when the request comes in incomplete, the Agent could fill in the missing pieces and send it forward in your application, or deal with it accordingly.

// groqClient, config, and personJsonSchema are assumed to be defined
// elsewhere in the project (a configured SDK client, your app config,
// and the JSON schema for the incoming request, respectively).
type CreateChatParameters = {
  model?: string;
  tools?: ChatCompletionTool[];
  messages: ChatCompletionMessageParam[];
};

export const createChat = async ({
  messages,
  model = config.groq.model,
  tools,
}: CreateChatParameters) => {
  const response = await groqClient.chat.completions.create({
    model,
    messages: [
      {
        // The system prompt is where the Agent's rules live.
        role: "system",
        content: `
        ## Filter Request: Whenever you receive a request with the following schema 
        ${JSON.stringify(personJsonSchema, null, 2)}
        infer the first name from the user email and attach it to a JSON object. 
        ONLY output a JSON object.

        ## Confirmation:
        Call tool "savePerson" ONLY if all fields are filled. 
        If the rules aren't met, respond with "Not enough information"
        `,
      },
      ...messages,
    ],
    tools,
    // Let the model decide when to call a tool, but only if tools were passed.
    tool_choice: tools ? "auto" : undefined,
    temperature: 0.7,
    top_p: 1,
    stream: false,
  });

  return response;
};
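
To give an idea of how this helper could be wired up, here is a rough sketch. The personJsonSchema and savePerson tool below are hypothetical stand-ins for whatever your project actually defines, not part of the snippet above:

// Hypothetical schema and tool definition, just to illustrate the wiring.
const personJsonSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    email: { type: "string" },
  },
  required: ["name", "email"],
};

const savePersonTool = {
  type: "function" as const,
  function: {
    name: "savePerson",
    description: "Persist a validated person record",
    parameters: personJsonSchema,
  },
};

// Forward the raw request body to the Agent and let it decide whether to
// call the tool or answer "Not enough information".
const response = await createChat({
  messages: [
    { role: "user", content: JSON.stringify({ email: "jane.doe@example.com" }) },
  ],
  tools: [savePersonTool],
});

const choice = response.choices[0];
if (choice.message.tool_calls?.length) {
  // The Agent inferred the missing fields and asked to call savePerson.
  console.log(choice.message.tool_calls[0].function.arguments);
} else {
  console.log(choice.message.content); // e.g. "Not enough information"
}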

It could be a kind of gatekeeper for your requests, or for any number of things, really.
You’re just limited by your imagination.

Pretty neat, huh? As the survey suggests, 52% of users report an increase in productivity. After all, Agents can be pretty reliable, and they cut down on annoying tasks that would usually demand a lot of steps.

Of course, every time your application hits the endpoint of a paid model, the request is converted into tokens and billed; if not well planned, you will have a very hefty bill by the end of the week.

What is a token? Tokens are a breakdown of words: a way of encoding full sentences as small pieces of numbers.

→ What’s an AI as a dev tool?

They are literal tools that can help your productivity in many ways, but they are powered by an LLM. Yes, they are powered by an AI Agent!

Some are more generalist, like ChatGPT, which can cover an amalgamation of things, from helping you understand a complex algorithm to creating an image of a flying cow. Others are very task-specific, like GitHub Copilot, which interprets code and generates the continuation of the input based on the most likely outcome.

[Image: GitHub Copilot generating code from a prompt]

So, while you may be using an AI dev tool to help you do something faster, the tool itself is not implemented directly in the final product, even if it has an Agent working behind it. In that case, it's an Agent for the dev tool, not for your project.

They still have monthly payments, also ultimately based on token usage, but that cost is a direct consequence of the user rather than of the project itself, and it's easier to control if you know what you're doing. In most cases, you simply pay the monthly fee to use them, without worrying about the context window or anything else.


Now, why am I bringing these concepts here, and why are they important?

According to the survey, developers in all categories seem to use and approve of AI at various levels of software development. 47% use AI daily, and more than 50% are at least favourable about its usage.

Code-based assistants have been in development since 2013, starting with Bing Code Search, a Visual Studio add-in capable of searching for code snippets on Stack Overflow, but the first real implementation of an AI code completer was made by Tabnine in 2014.

Baby steps, still, since it didn’t have the same capabilities as the ones we have today.

The actual implementation of LLMs capable of writing complete projects came between 2021 and 2022 with GitHub Copilot, which wasn't just capable of auto-completion: it could also create small boilerplates from a single prompt, and use its context window to understand the files inside the project, generate the continuation of the code, make reviews, or correct bugs.

What is a prompt? It’s a text inputted by you, the user, so the AI can ‘continue’ it.

According to the survey, opinions on AI development tools lean somewhat negative: only 30% seem to somewhat trust the tools, which, in my opinion, is a good thing.

Every answer coming from an AI is a prediction, so it's important to keep in mind that it will make mistakes, especially about abstract things. If you don't know exactly what you need from the AI, it will be as lost as you are, because it has to continue your input.

They can be quick search tools, and they can help you catch things you end up ignoring after spending way too long writing code, because they work in patterns. Who has never made a simple mistake just because you know how the code should work, but you've read it so many times that your brain can no longer register small slips, like a missing bracket?

But there’s the downside of it, too.

GenAIs and Output Generation: Behind Generative Models

What is a Generative Model (GenAI)? It's an AI that creates new content based on something given to it.

Quick recap here: As I commented in my previous post here, AIs are trained on large datasets to learn and generalize data, turning it into small arrays of numbers and then predicting the answer based on the input fed to them.

The same is done for GenAIs, and thus the state-of-the-art models we have today, like Claude, Llama, OpenAI's GPT models, Copilot, and Mistral, are all trained on a large set of data to learn syntax and how a certain input should be continued.

What often differentiates them is their architecture and the way they tokenize their input.

What is tokenization? It’s a way of turning words into tokens. There are many ways of doing it; each model works with one.
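
Just to make the idea concrete, here is a deliberately naive sketch. Real tokenizers use subword algorithms such as BPE, so this is not how any production model actually does it, but the word-to-ID mapping works conceptually like this:

// Toy tokenizer: NOT how production models tokenize (they use subword
// algorithms like BPE), but it shows the word -> token ID idea.
const vocabulary = new Map<string, number>();

const tokenize = (text: string): number[] =>
  text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => {
      if (!vocabulary.has(word)) {
        vocabulary.set(word, vocabulary.size); // assign the next free ID
      }
      return vocabulary.get(word)!;
    });

console.log(tokenize("the cat sat on the mat"));
// -> [0, 1, 2, 3, 0, 4]  ("the" maps to the same ID both times)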

And this is where the problem lies, too.

Knowing where a GenAI falls short is important if you intend to use it in the best way, and to accept that it will never be exactly what you want it to be.

Context Window

A thing most people who have any interest in AI will hear a lot about is the context window.

So, what is that?

When you input a prompt to a GenAI, there's a limit on how many token IDs are kept in the current conversation history. If there's a token limit of 4,000, then only that quantity will be temporarily held as a buffer memory, and older tokens will be discarded to make room for newer ones.

What is a token ID? It is just the identification/index of a given token; in other words, a way for the model to identify which word is being talked about. When these sequential IDs are fed to the model, it extracts meaning from them in context.

So, basically, the context window is how many token IDs the model can use to remember what you are talking about, so it can continue the conversation.

As a conversation history grows, depending on the model and the memory mechanisms implemented, the user might need to keep reminding the model of the context of the conversation, since the older tokens will be discarded.
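
One common way to handle this in your own Agent code is to trim the oldest messages before each call. Here is a minimal sketch, using a rough "four characters per token" estimate, since the real count depends on each model's tokenizer:

type Message = { role: "system" | "user" | "assistant"; content: string };

// Very rough heuristic; the real count depends on the model's tokenizer.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Keep the system prompt and drop the oldest turns until the history fits
// the context budget, much like the model's own buffer would.
const fitToContextWindow = (messages: Message[], maxTokens: number): Message[] => {
  const [system, ...history] = messages;
  const countTokens = (msgs: Message[]) =>
    msgs.reduce((sum, m) => sum + estimateTokens(m.content), estimateTokens(system.content));

  const trimmed = [...history];
  while (trimmed.length > 1 && countTokens(trimmed) > maxTokens) {
    trimmed.shift(); // discard the oldest message first
  }
  return [system, ...trimmed];
};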

According to the survey, 66% find that the answers given by a model are frustratingly almost there, but not quite, and a model's context window is of ultimate importance for the AI's understanding of the problem, since it's what gives the AI context for meaning.

If you have way too many tokens to deal with, the model may get lost, and the probability of it hallucinating is higher, especially if your prompt is ambiguous.

Why? Again, to generate a prediction, the model has to have a clear view of what is being asked of it. If you leave it open, it will try to predict the most likely scenarios based on the dataset it was trained on. AIs don't have the capability of thinking, even if they are reasoning models. Those just take extra steps, but it boils down to the same generating logic in the end.

What is a reasoning model? They are AIs capable of creating smaller prompts based on the inputted prompt and solving each piece of the problem separately until it reaches a ‘conclusion’.
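
Purely to illustrate that "extra steps" idea (this is not how reasoning models are implemented internally), you could emulate something similar on top of the createChat helper from earlier, breaking a problem into sub-questions and answering each one:

// Toy illustration of decompose-then-solve, reusing createChat from above.
const answerStepByStep = async (problem: string) => {
  const plan = await createChat({
    messages: [
      { role: "user", content: `Break this problem into 3 short sub-questions, one per line:\n${problem}` },
    ],
  });

  const subQuestions = (plan.choices[0].message.content ?? "").split("\n").filter(Boolean);

  const answers: string[] = [];
  for (const question of subQuestions) {
    const step = await createChat({
      messages: [{ role: "user", content: question }],
    });
    answers.push(step.choices[0].message.content ?? "");
  }

  return answers.join("\n");
};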

An important thing to remember with dev tools such as GitHub Copilot or Claude Code is that the AI doesn't exactly distinguish which files you are working on. It only receives a bunch of characters as one huge string in its context window, and from that, it generates what you're asking of it. So meaning can be lost along the text, and things that are important to you specifically may not be mathematically important for the AI, unless you point that out in your prompt. That's why it's important to understand the context window and its impact on how the AI will spit out an answer.

Overall, the bigger the context window that the model can handle, the better the model will work with the files you have on your project, thus generating better outputs when taking into consideration the global context of the project.

Sure, that’s not the only thing to look out for.

Why Architecture Matters

Here’s where things get interesting. All generative models work under the same principle, but their architectures and context windows are what really define how useful the model is for a given task.

The architecture is how the model deals with the data given to it. Some are specific for matching results, some work better at extracting meaning from the input, and some are more general when outputting to the user.

Some models (Like Claude) have optimized architecture for longer context windows, being able to infer meaning from huge prompts (Even if some information may get lost in the middle), which makes them better at handling multi-file projects without getting lost, but they may not be that accurate for specific tasks. Others are smaller and cheaper, like Mistral or Llama, and are good for specific tasks rather than encompassing too much, but they get lost in larger contexts. Choosing the right architecture is not just about raw power — it’s about trade-offs between speed, cost, and reliability.

The architecture also affects the way the model generates answers: some lean towards being more creative (hallucinating more), others towards matter-of-fact answers, but possibly with the wrong data. It's important to keep these differences in mind when you're choosing which dev tool, or which model for an AI Agent, you want to use, because this will impact the accuracy of its output.

What is the hallucination of a GenAI? It’s when the model creates something incorrect. In other words, the model generates a response as if it knows what it’s talking about, even if the answer makes no sense, logic-wise.

So whenever you are choosing a tool or a model to turn it into an Agent, keep in mind what the model’s architecture is focused on. Is it better for specific tasks? Does it handle longer context windows better?

The architecture will also have an impact on the benchmarking of the model. In other words, whether the model generates overall good answers for specific tasks.

And what does all of this translate to?

Tool and AI Agents’ Cost

The thing is, when you implement an Agent in your code, or in your pipeline somehow, it will all come with a cost.

Sure, dev tools are great, and they have either a monthly fee or an annual one, which keeps costs down beautifully. But they also have their limitations, and as with anything generalist, there are certain things, certain patterns, that your dev tool may not be able to cover just the way you like.

Why? When you depend on something that generalizes, the nicher or more specific the task gets, the less able the model will be to answer, because it falls outside the probable responses.

Think of it like this: as I explained about AIs in this post, the model will usually output the most probable answer based on what you wrote to it, and since it was trained on a dataset of the most likely responses, if nothing in its parameters is close to what you're asking, it will probably make a very wrong assumption.

[Image: example entries from a customer service chatbot dataset]

For example, say the dataset of a customer service bot contains about two thousand examples like the one above. If you ask the bot something like "How is your mom?", when nothing in the dataset gets even remotely close to that input, it will not know how to answer correctly.
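
For illustration, the entries in such a dataset are usually simple input/output pairs, something along these lines (a hypothetical example):

// Hypothetical customer service dataset entries (illustrative only).
const customerServiceDataset = [
  { prompt: "Where is my order?", completion: "You can track your order under Account > Orders." },
  { prompt: "How do I request a refund?", completion: "Refunds can be requested within 30 days from the order page." },
  // ...thousands more pairs like these
];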

That's where AI Agents come into play. You can mold the Agent's behaviour to your liking, lessening the need for annoying tasks or over-the-top complicated data handling, which can now be solved by just feeding inputs to the Agent and waiting for its response, based on the rules you've set for the model to follow.

Be it code patterns, memory sharing, or request gatekeeping, Agents can help with a multitude of things, and depending on how they were made, they can be pretty reliable.

The thing is, they can come with a very steep cost.

After all, you’ll be paying both for the input tokens as well as for the output tokens, which can be summed as:

Total = (Input Tokens * Input Price) + (Output Tokens * Output Price)

The input price and the output price often differ between models, but it all boils down to this simple equation.

Also, each model may have a different tokenization method, so they may process more or fewer tokens, depending on the method.
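
As a concrete sketch of that equation, you could estimate a monthly bill like this. The prices below are made-up placeholders, so swap in the real per-million-token rates of whatever provider you use:

// Hypothetical pricing, expressed per million tokens. Replace with the
// real rates of whatever model/provider you are using.
const PRICE_PER_MILLION_INPUT = 0.5;   // USD, placeholder
const PRICE_PER_MILLION_OUTPUT = 1.5;  // USD, placeholder

const estimateCost = (inputTokens: number, outputTokens: number) =>
  (inputTokens / 1_000_000) * PRICE_PER_MILLION_INPUT +
  (outputTokens / 1_000_000) * PRICE_PER_MILLION_OUTPUT;

// e.g. 10k requests/day, ~1,200 input and ~300 output tokens each, for 30 days
const monthly = estimateCost(10_000 * 1_200 * 30, 10_000 * 300 * 30);
console.log(`~$${monthly.toFixed(2)} per month`); // about $315 with these placeholder rates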

According to this and these journals, an Agent can cost anywhere from a few thousand dollars up to six figures. And Agents are used for specific, pointed tasks, so your project may be working with one model or many, applying the equation above to each one.

My take on this (Even if no one asked)

Does any of the above mean that I am against implementing AI in your code?

Oh, absolutely not. I’ll always be the AI #1 fangirl. After all, it’s a great field of study, and I just love working with it. There’s just so much you can do, and you’re really just limited by your imagination!

But it’s important to understand that you have to know at least the basics if you’re interested in adding an Agent to your project. Or even, sometimes, choosing the right AI-powered dev tool.

You can even make your own model from scratch, without depending on other models' monetization! Though that is quite a lot of work.

There is a bit of a polarized opinion amongst developers about AI usage and implementation, but if I am to give an opinion on that, it's this: as with any other approach in programming, AI needs to be studied before being implemented.

There are a number of problems that can come up when you use them wrongly, like:

  • Security Vulnerabilities: AI agents may unintentionally expose sensitive data or generate insecure code if their behavior isn’t carefully managed, and misconfigured access to APIs or datasets can lead to leaks or breaches.
  • Incorrect or Unreliable Output: Models can produce misleading, biased, or plain wrong results, especially if prompts or context are misunderstood. Blindly trusting AI-generated code or analysis can introduce subtle bugs.
  • Inefficient Resource Usage: Unoptimized queries or repeated agent calls can increase costs significantly, especially in cloud-based or token-billed services. Poor pipeline design can lead to slower performance and unnecessary computation.
  • Integration Errors: Without understanding how the AI interacts with your system, agents can cause runtime errors, break APIs, or corrupt data. Misalignment between model outputs and application logic can create unexpected behavior.
  • Overfitting to Prompting or Hardcoding: Over-reliance on prompts may mask the need for proper algorithmic or architectural solutions. Agents may “memorize” or misinterpret specific instructions.
  • Regulatory and Compliance Risks: Using AI incorrectly can result in non-compliance with GDPR, HIPAA, or other legal requirements, especially when handling personal data.
  • Hidden Biases and Ethical Issues: Agents can reinforce harmful biases or produce unfair outcomes if datasets or prompts are flawed.
  • Maintenance and Debugging Complexity: AI-driven pipelines can be opaque; debugging generated code or reasoning steps is harder than traditional code.
  • False Sense of Expertise: Relying on AI without understanding fundamentals can create overconfidence in the system’s correctness, which can lead to deploying solutions that fail under real-world conditions.

Understanding how the model processes the prompt and the ways it highlights its meaning is going to save you a lot of headaches, both cost-wise and code-wise, in the long run.

So keep this in mind, and read a lot. At the end of the day, using tools and agents is up to preference and company policies, but knowledge is never too much.

