Introduction to instrumentation with OpenTelemetry for Developers – Key Concepts

A (maybe not so) quick introduction to the universe of instrumentation and observability.

Introduction

I’m a Rails web developer – why in the hell should I bother instrumenting my app if I’m already using Honeybadger? Or should I just spend dozens of hours losing my sanity while staring at an unstructured log to debug an issue?

Good point, and I think that’s honestly why the practice of observability exists.

Nowadays, many apps are built to run in distributed setups, whether that means multiple environments hosting the same app for redundancy or entirely different services working together. But one thing’s for sure – with such architectures, debugging and monitoring can become a gruesome task. Developers often take for granted environments that provide solid tooling to track operations, until we have to work in one that DOESN’T have any kind of tool set for that whatsoever.

That’s why practices like instrumentation and observability are so critical: they give us the visibility we need to troubleshoot issues, gain insights, and keep things running as smoothly as possible.

But what is instrumentation after all? What’s the difference between instrumentation and observability? Which tools could help me with those matters? Why are we talking about observability in this intro when the title says instrumentation with OpenTelemetry!? What’s the meaning of life?

Fear not, this post aims to answer some of these questions. In this post, we’ll:

  • Show the difference between instrumentation, observability, and telemetry.
  • Introduce an agnostic tool that solves some of the problems that come with instrumentation and its ecosystem.
  • Give an overview of OpenTelemetry, the problem it aims to solve, and how it does so.
  • Explain the key concepts we need in order to use OpenTelemetry in our applications.
  • Understand the meaning of life.

What this post DOESN’T aim to do:

  • Show the best setup for your instrumentation with OpenTelemetry.
  • Show the best observability solution for your needs.
  • Show the best vendor to manage the metrics, logs, or traces collected by OpenTelemetry.

That said, first, let’s clarify some concepts before we dive into this new universe:

Glossary

Resources

Resources are the logical or physical components an operation depends on. A logical resource could be a Service Object that performs a specific business rule, while a physical resource could be RAM, CPU, an EC2 instance, etc.

Transactions

A transaction is an operation that’s triggered by something or someone towards your application, consuming its resources, such as a request that’s triggered by the end user to buy a book on a bookstore website.

Telemetry

Telemetry is data that describes what your application’s doing, what it’s consuming, etc. It isn’t just about knowing how much CPU or RAM is being consumed – it goes way further than that. For instance, an online multiplayer game might collect telemetry such as frames per second and network latency, two key metrics that reflect the application’s current state.

That makes it clear that telemetry isn’t just Performance Telemetry, which is about gathering data on resource usage – it can also be User Telemetry, which is often called Profiling.

Signals

A signal is simply a type of telemetry data. Frames per Second can be a signal, network latency could be a signal, and so on. In this post, we’ll focus on three specific signals: Traces, Logs, and Metrics. There’s a reason for that, which we’ll get into shortly.

Every signal relies on two practices: Instrumentation and Transmission.

Instrumentation

Instrumentation is the practice of gathering telemetry data from an application. With that definition, you might ask:

Then what’s the difference between telemetry and instrumentation?

Telemetry is the data itself (a signal being a type of telemetry data), while instrumentation is the practice of collecting that data.

Transmission

That is pretty straightforward – it is the way we export the instrumented data.

Observability

Last but not least, observability is the practice of analyzing the instrumented data to extract value, mainly through correlations made based upon this data. That value could mean insights, clues that help identify bugs, alerts, etc.

The three pillars of Observability

As mentioned earlier, we’ll focus specifically on three signals – Logs, Traces, and Metrics, and there’s a reason for it. Historically, these three signals were given primacy, which led to the concept now known as The Three Pillars of Observability. Logs and traces help to recreate the picture of what happened during a transaction, and metrics help to identify key events related to a transaction, such as performance issues.

Logs are timestamped messages that record an operation’s events. Traces model the operations performed in a system and can be easily aggregated – think of them as a structured collection of logs. And metrics are numeric measurements collected from a system at runtime.

When looking at them in an isolated manner, we have a problem – most of the value that can be retrieved from these signals comes from correlations made across them.

Remember – every signal is composed of instrumentation and transmission. Now imagine these three signals in a scenario where managing each one requires a different tool. And worse – imagine dealing with multiple different ways to export these signals to an analysis tool, or maybe even using a different tool to analyze each one, forcing you to open multiple browser tabs just to debug a specific transaction. Chaotic, right? In this scenario, each signal would have a vertical integration, meaning that every signal has its own instrumentation, transmission, and maybe even analysis tool sets to deal with.

That’s one of the problems that OpenTelemetry aims to solve.

OpenTelemetry itself

OpenTelemetry is an open-source project that aims to solve two problems:

  • It creates an agnostic interface to instrument applications, removing the need to learn about multiple interfaces.
  • It is compatible with a lot of different analysis tools, allowing users to build an ecosystem to practice observability.

It makes analysis easier by intertwining traces, logs, and metrics into a "single braid of data", where these three signals are not treated separately, but are kept close enough that we can extract the most value out of them through correlations.

This might not be very intuitive at first, but OpenTelemetry builds this kind of model by letting us enrich data so that we can literally relate one signal to another – a trace can be tied to a specific log or metric, and so forth.

In short, OpenTelemetry provides:

  • An agnostic interface for analysis tools to consume.
  • A standard transmission protocol called OpenTelemetry Protocol (OTLP), to efficiently export the signals.
  • Features to enrich the signals, giving more info on what is happening in the application and where it is happening.

The OpenTelemetry solution to instrumentation could be visualized as this:

[Diagram: the OpenTelemetry approach to instrumentation]

It’s important to make it clear that when we talk about products like Grafana, Prometheus, Jaeger, New Relic, Datadog, and many others, we’re talking about observability tools: although they each provide their own solution for dealing with these signals (e.g., Jaeger is a tool for traces, while Datadog and New Relic are cloud-based platforms that deal with traces, logs, and metrics), they all support observability. There are a lot of different observability tools, and each has its own solution and architecture for the problem it intends to solve – it’s another universe of its own.

And OpenTelemetry brings consistency on HOW these tools can collect or receive data to operate properly.

How it solves the problem

OpenTelemetry solves these problems with some practices:

Hard and Soft contexts

Context is basically useful metadata that’s added to a signal, helping to identify and describe the relationship between a transaction and the system. This is the most common way to enrich data in OpenTelemetry.

Hard contexts are metadata that don’t change throughout an operation and can uniquely identify a signal. Good examples of hard context are trace_ids, the service name, the host IP, etc.

Soft contexts are metadata that can be gathered from the different resources involved throughout a transaction. Good examples of soft context are request-specific data, such as the user ID, HTTP method, HTTP response code, etc.

Although it may look simple, hard context can identify a signal, helping us correlate one signal with another (e.g., a log can carry a trace_id related to it, or vice-versa), as well as filter and aggregate data. By adding labels to traces we could, for instance, group all traces from a specific service that threw a specific type of error.

And with soft context, we could see useful information such as which user is related to that transaction, what went wrong or not, and many other things.
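To make this concrete, here’s a minimal Ruby sketch of enriching a span with both kinds of context. It assumes a configured opentelemetry-sdk; the attribute names, the current_user helper, and the Rails.logger call are purely illustrative:

```ruby
require 'opentelemetry/sdk'

tracer = OpenTelemetry.tracer_provider.tracer('bookstore')

tracer.in_span('checkout') do |span|
  # Hard context: identifiers that stay fixed for the whole operation.
  trace_id = span.context.hex_trace_id

  # Soft context: request-specific data gathered along the way.
  # ('current_user' is a hypothetical Rails helper.)
  span.set_attribute('app.user.id', current_user.id)
  span.set_attribute('http.response.status_code', 200)

  # Attaching the trace_id to a log line lets us correlate log and trace later.
  Rails.logger.info("checkout finished trace_id=#{trace_id}")
end
```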

Telemetry Layering

Telemetry layering is the practice of structuring instrumented data by building layers on top of it. For example, we might first collect the data in a convenient format, then enrich it with context, convert it to JSON, and finally transmit it using a protocol. Each of these steps represents a different layer. This concept will become much clearer once we see how OpenTelemetry works in practice.

Remember—telemetry always involves a signal’s instrumentation and transmission. With layering, several concerns arise: What’s the best format for this signal? What if we want to transform or convert it? How do we perform operations on this data in a pipeline? And so on. All these considerations must account for the overhead introduced, since processing data always comes at a cost. That’s why telemetry layering is so important.

Semantic Telemetry

Semantic telemetry provides naming conventions for data added to signals, such as HTTP method, HTTP response, request duration, and more. Why is this important? Primarily because it avoids ambiguity and establishes a standard way to label key metadata and describe transactions. It makes metadata easy to understand, reducing confusion. Even better, many analysis tools follow these conventions, which simplifies sending data to them and performing operations on it—like creating metrics in Prometheus.

In short, it is useful to give meaning to telemetry metadata.
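As a hedged illustration (assuming a span in scope, as in the earlier sketch), using attribute keys taken from OpenTelemetry’s HTTP semantic conventions instead of made-up names looks like this:

```ruby
# Conventional attribute names mean any compliant backend
# (Prometheus, Jaeger, Datadog, ...) knows how to interpret these values.
span.add_attributes(
  'http.request.method'       => 'POST',
  'http.response.status_code' => 201,
  'url.path'                  => '/books'
)

# A made-up key like 'my_http_verb' would still be stored, but tools
# wouldn't know it means the same thing as 'http.request.method'.
```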

Overview

Now we’ll see how OpenTelemetry works, and how these practices fit together.

Here’s an overview on OpenTelemetry architecture at its highest level:

[Diagram: OpenTelemetry architecture overview]

This diagram was adapted from the book Learning OpenTelemetry – by Austin Parker and Ted Young

Traces, metrics, and logs are enriched with context through metadata called attributes and resources (we’ll cover this in more detail soon), and all of them are typically exported to a Collector via OTLP. It’s important to note that using a Collector isn’t mandatory, but it’s considered best practice because it provides convenient ways to manage telemetry data.

The OpenTelemetry Collector is a component that offers an agnostic way to process and export data. It handles tasks like retries, batching, and horizontal scaling to prevent backpressure, which occurs when data arrives faster than the system can process it. The Collector can export data in multiple protocols and formats, with gRPC and HTTP being the most commonly used.
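As a hedged sketch (assuming the opentelemetry-sdk and opentelemetry-exporter-otlp gems), pointing a Ruby app at a local Collector over OTLP/HTTP could look like this – 4318 is the Collector’s default OTLP/HTTP port, 4317 the gRPC one:

```ruby
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'bookstore'

  # Batch spans in memory and ship them to the Collector via OTLP/HTTP.
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: 'http://localhost:4318/v1/traces'
      )
    )
  )
end
```

The same endpoint can also be set through the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable instead of code.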

For more details on the Collector, see https://opentelemetry.io/docs/collector/

Attributes and Resources

These are the metadata that can be attached to a signal. They are, literally, a hash data structure containing key–value pairs, where each value can be either a primitive type (integer, floating point, string, boolean) or an array of primitive types. Attributes and Resources are part of the Context concept inside OpenTelemetry.

Semantic conventions

Semantic conventions are a set of common attribute names that can be used inside signals; they embody the practice of Semantic Telemetry.


I know that I’ve already defined these signals before, but let’s contextualize them better inside OpenTelemetry:

Traces

Traces are perhaps the most important signal to work with in web applications, because each trace represents an operation (transaction) made against the application. It describes the data flow throughout the application – from its entry point to every service called while processing that operation.

A trace is made up of one or more spans, which are the smallest unit of an operation. A span can also have child spans. For example, a single trace might include a span for the endpoint entry point, another for the controller, one for the service layer, another for a third-party API call, and finally one for database queries.
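A hedged Ruby sketch of that hierarchy (the span names are made up; in a real Rails app most of these spans would come from auto-instrumentation rather than manual code):

```ruby
require 'opentelemetry/sdk'

tracer = OpenTelemetry.tracer_provider.tracer('bookstore')

# Each in_span block opens a child of the currently active span,
# so the spans below end up nested inside a single trace.
tracer.in_span('GET /books/:id') do
  tracer.in_span('BooksController#show') do
    tracer.in_span('SELECT books') do |db_span|
      db_span.set_attribute('db.system', 'postgresql')
      # ... run the query here ...
    end
  end
end
```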

Metrics

Metrics in OpenTelemetry define important events that happened in the codebase and give a better view of resource usage. Want to check how long an endpoint takes to return a response? We can record the duration of a request as a metric.

As the OpenTelemetry docs say:

A metric is a measurement of a service captured at runtime. The moment of capturing a measurement is known as a metric event, which consists not only of the measurement itself, but also the time at which it was captured and associated metadata.

That’s why tools that handle metrics with a backend rely on a Time Series Database—a database optimized specifically for this type of data.

The most powerful feature of metrics is aggregation. Metrics offer statistical information in aggregate, which gives us the ability to group this statistical data and retrieve insights.

OpenTelemetry captures measurements through instruments. These instruments come in different types according to the nature of the measurement, and each one has a name, kind, unit, and description (the last two being optional).

Kinds of Instruments

Histogram:

  • Histograms are numeric aggregations of values, used to measure how those values are distributed over time. Each aggregation is split into buckets. A histogram defines boundaries that separate values into intervals, and each recorded value falls into one of those buckets. These boundaries are usually configured inside a metric’s view — we’ll look at that concept shortly.
  • A histogram typically provides the following attributes:
    • Min – the smallest recorded value.
    • Max – the largest recorded value.
    • Sum – the total sum of all recorded values.
    • Count – the total number of recorded values.
    • Buckets – the set of intervals defined by the boundaries, along with the count of values that fall into each bucket.

Not clear enough? Imagine we wanted to aggregate how many queries took less than 10ms, between 10 and 100ms, between 100 and 1000ms, and more than 1000ms. This would be a histogram with the boundaries [10, 100, 1000]. Each time we record a value into the histogram, it gets slotted into the right bucket, and the histogram updates all its data — count, sum, min, max, and bucket counts.
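In Ruby, recording into that histogram could look roughly like the sketch below. Note that metrics support in Ruby is still experimental at the time of writing (the opentelemetry-metrics-api and opentelemetry-metrics-sdk gems), and the [10, 100, 1000] boundaries themselves would be configured through a View on the SDK side, so treat this as a sketch rather than a recipe:

```ruby
meter = OpenTelemetry.meter_provider.meter('bookstore')

# The histogram instrument; name, unit, and description are illustrative.
query_duration = meter.create_histogram(
  'db.query.duration',
  unit: 'ms',
  description: 'Duration of database queries'
)

query_duration.record(7,    attributes: { 'db.system' => 'postgresql' }) # < 10ms bucket
query_duration.record(250,  attributes: { 'db.system' => 'postgresql' }) # 100-1000ms bucket
query_duration.record(1500, attributes: { 'db.system' => 'postgresql' }) # > 1000ms bucket
```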

Counter:

  • A numeric value that increases over time, such as the number of requests processed in a day.

UpDown Counter:

  • A numeric value that can either increase or decrease over time, such as the number of reservations available in a day.

Gauge:

  • A current value captured at the time it is read – it is non-additive, meaning that summing its values doesn’t produce anything meaningful. The temperature of a room managed by a smart thermostat is a good example of a Gauge.

Note that these instruments record values synchronously, as the measured events happen – but not every metric can be obtained like that, which is why each of the previously mentioned types has an asynchronous counterpart: instead of collecting the data continuously, it is collected once for each export of that data.

Logs

Logs are treated a bit differently. While OpenTelemetry collects Traces and Metrics directly, it doesn’t try to replace existing, well-established logging frameworks. Instead, it integrates with them: OTel can take the logs they produce, enrich those logs with additional context (like trace IDs), and then export them in a consistent format — often JSON, since it’s structured and easy to process.

Some logging frameworks already provide native support for exporting log records through OTLP. Because of this, it’s common to see approaches where the logging framework itself handles exporting, instead of OpenTelemetry. In these cases, it’s still possible to correlate a log record with a trace or a span through its trace id or span id.
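A hedged sketch of that correlation in a Rails app – grabbing the current trace and span IDs from OpenTelemetry and attaching them to an ordinary structured log line (the logger is whatever your app already uses):

```ruby
require 'json'
require 'opentelemetry/sdk'

# SpanContext of whatever span is currently active.
span_context = OpenTelemetry::Trace.current_span.context

Rails.logger.info(
  {
    message:  'order created',
    trace_id: span_context.hex_trace_id, # correlates this line with the trace
    span_id:  span_context.hex_span_id   # ...and with the exact span
  }.to_json
)
```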

OTLP – Open Telemetry Protocol

OTLP is a protocol whose payloads can be encoded as binary (Protobuf) or text (JSON). The data can be written to a file or even sent to an event queue such as Kafka, and it can be easily translated into other formats (such as Prometheus’s).

Finally – Instrumentation

Sorry for taking so long, but I had to explain the necessary concepts before actually showing the instrumentation part. But rest assured, if the concepts aren’t clear enough in your mind right now, they will be soon.

Setting up for instrumentation

In short, to instrument our application we’ll need two major components: the OpenTelemetry SDK and the instrumentation libraries.

OpenTelemetry SDK

OpenTelemetry offers an API to interact with all of its features, such as recording traces, metrics, and logs. But the API by itself is a no-op, meaning it won’t do anything if you call it directly. To make the API actually work, we need an SDK (Software Development Kit) that implements its features.
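In Ruby, for example, that means installing the opentelemetry-sdk gem and registering it once at boot – a minimal sketch (in a Rails app this would typically live in an initializer; the file path is illustrative):

```ruby
# config/initializers/opentelemetry.rb (hypothetical path)
require 'opentelemetry/sdk'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'bookstore'
end

# From here on, API calls such as OpenTelemetry.tracer_provider.tracer(...)
# return real, recording objects instead of no-ops.
```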

Instrumentation library

Most web applications nowadays use some kind of framework or library to deal with things such as web servers, routing, database connections and object-relational mapping, among many other things. These are all important resources (remember, resources can be logical too!) that should be instrumented properly. If it’s not clear enough, think of it this way: we’re literally instrumenting lines of code. Imagine a traditional Rails app, for instance – your data will flow through many of its conventions, such as controllers, middlewares, and database accesses (usually through ActiveRecord), which makes them a vital part of your application. That’s why it’s so important to instrument the core libraries used by the application. We’re not talking about getting CPU usage metrics when the isEven lib checks whether a number is even (although it would be really cool to have a benchmark for it).

That said, there are two ways to instrument a library – either it has support to zero-code instrumentation or code-based instrumentation.

Zero-code, as the name implies, doesn’t require you to do much beyond installing a package that automatically instruments your app; code-based is the opposite – it involves writing code.

The languages that support zero-code instrumentation are the following:

  • .NET
  • Go
  • Java
  • JavaScript
  • PHP
  • Python

If your language isn’t supported, then you’ll need to write some code to do the work. But what isn’t obvious is that you don’t necessarily need to write everything from scratch, since most well-known frameworks and libraries already have support for auto-instrumentation – OpenTelemetry packages that instrument the framework for you, with no manual instrumentation required. For example, in Ruby we have the gem opentelemetry-instrumentation-all, which offers instrumentation for a lot of different libraries such as Rails, Sinatra, Rack, ActiveRecord, etc.
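Sticking with that Ruby example, a hedged sketch of wiring the gem up in the same initializer as the SDK:

```ruby
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'bookstore'

  # Enable every instrumentation the gem ships that matches a library
  # present in the app (Rails, Rack, ActiveRecord, Net::HTTP, ...).
  c.use_all

  # Or opt in selectively:
  # c.use 'OpenTelemetry::Instrumentation::Rails'
  # c.use 'OpenTelemetry::Instrumentation::ActiveRecord'
end
```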


So in short: To instrument our app, we’ll need to:

1 – Install the OpenTelemetry SDK, which will interact with the API.
2 – Instrument the important libraries your app uses. If your language supports zero-code instrumentation, congrats! If it doesn’t, either you do the instrumentation from scratch or you install a package that will automatically instrument that lib for you. Again, I’ll leave the links in the references section.

One key point to know is that, currently, support for metrics and logs is still in development for some languages. You can check whether your language of choice has stable support for traces, metrics, or logs here.

Now we should be ready to finally understand – how can we manually create traces, metrics and logs using the SDK?

We can do this by using an OpenTelemetry abstraction called Providers.

Providers

Providers are structures that serve as an entry point to create traces, logs, and metrics. The idea is simple: we register a provider in the application’s entry point (for example, in Rails apps this might be an initializer, in Express.js apps it’s usually index.js, and so on), export it, and then reuse it anywhere we want to instrument our code.
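In practice that often boils down to a small constant defined once and reused everywhere – a hedged sketch (names are illustrative):

```ruby
# Defined once at the app's entry point (e.g. a Rails initializer)...
APP_TRACER = OpenTelemetry.tracer_provider.tracer('bookstore', '1.0.0')

# ...and reused anywhere we want to instrument code:
APP_TRACER.in_span('Checkout::Create') do |span|
  span.set_attribute('app.cart.size', 3)
end
```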

Each Provider has its own architecture. For now, you don’t have to know every specific of it (besides, this is introductory content) – just enough to implement an initial instrumentation of an application. So, for simplicity’s sake, I’ll summarize the key concepts for understanding the Tracer, Meter, and Logger Providers.

The most important concept to grasp is that every Provider uses Processors. Processors are responsible for handling the collected data:

  • One type of processor is used to transform or modify data. You can even chain these together, where one processor modifies the data and passes it to another.
  • The other type is the export processor, which sends the data to an Exporter. Together, these form a pipeline: data is collected, passed through processors, and finally sent out by an exporter.

An Exporter is simply responsible for exporting the data. Why does this matter? Because OpenTelemetry is designed to be agnostic. Different exporters exist so you can send your telemetry to many different backends — from your local shell for debugging, all the way to OTLP-compatible observability tools.
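For example, during local development we could skip the Collector entirely and print spans straight to the terminal – a hedged sketch using the SDK’s built-in console exporter behind a simple (non-batching) processor:

```ruby
require 'opentelemetry/sdk'

OpenTelemetry::SDK.configure do |c|
  # Pipeline: span -> simple processor -> console exporter (printed to stdout).
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::SimpleSpanProcessor.new(
      OpenTelemetry::SDK::Trace::Export::ConsoleSpanExporter.new
    )
  )
end
```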

Tracer Provider

Tracer Providers are composed of three components:

Samplers

Samplers are responsible for sampling traces instead of recording every collected trace, and this is done through different, configurable sampling algorithms. Since this implies data loss – not every trace generated by every operation will be saved – samplers should be used carefully: if you don’t know why or how to use one, simply don’t.
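If you do decide to sample, the least intrusive route is usually the standard SDK environment variables rather than code – a hedged sketch that keeps roughly 10% of traces while respecting the caller’s sampling decision:

```ruby
# Read by the SDK at configure time (standard OpenTelemetry env vars):
#
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.1
#
# The same values can be set from Ruby before OpenTelemetry::SDK.configure runs:
ENV['OTEL_TRACES_SAMPLER']     ||= 'parentbased_traceidratio'
ENV['OTEL_TRACES_SAMPLER_ARG'] ||= '0.1' # keep ~10% of root traces
```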

Span Processors

Span Processors are the kind of processors responsible for collecting and processing data.

Batch Processor

The batch processor is the default and the last processor in the pipeline. That means you either use it to send data directly to an exporter, without any other span processor before it, or the other span processors pass data along a pipeline to the batch processor, which then sends it to an exporter.

Meter Provider

Meter providers are composed again by three components:

Views

Views let you customize a metric’s output according to your needs – the bucket boundaries of a histogram, for example. Such customization is, of course, optional, and there are tons of ways to customize an output. We won’t cover this in depth, but it’s good to know where metric outputs can be customized.

Metric Readers

These readers have the same purpose as the span processors – they collect and process metrics.

Metric Producers

Producers are used to import metrics from third-party applications into the SDK. So instead of the metrics being produced through the OpenTelemetry API, they are imported from another application, and OpenTelemetry knows how to deal with them using these producers.

Logger Provider

Log Record Processors

These processors have the same purpose as the span processors and metric readers.

Batch Processors

This is the default processor, responsible for sending the logs to the Exporter.

Log Record Exporter

Exports the logs to a convenient format, OTLP being the default format.

So, to summarize

[Summary diagram of the Providers and their components]

What to instrument?

Usually, you can separate your app into important parts:

  • The main libraries that are used inside your app
  • Your own code
  • External components that are crucial for your app

Libraries

Saying you should instrument the crucial libraries used by your application might sound too vague – does that mean I should instrument my npm package that checks whether a number is even? Obviously not, especially if that lib doesn’t have an auto-instrumentation package. Now think about things such as ORMs, query builders, database connection adapters, web frameworks such as Rails, Spring, and Nest.JS, and HTTP libraries such as Fastify, Express, and Sinatra… These are libraries that you should instrument – they give you the foundation to deal with important external interfaces such as web requests and database queries.

And fortunately, thanks to the effort of many devs, most of these popular frameworks and libraries already have an auto-instrumentation package available, meaning that, for those libraries, we won’t need to instrument everything from scratch. All you have to do in that case is install the desired package and register it inside your app’s entry point.

If the library you’re using doesn’t have such a thing, or if it is a proprietary, internal, private library, then bad news – you’ll have to instrument that lib from scratch.

Your own code

Instrumenting your own code depends on the data you find important to analyze, aggregate, and monitor. For example, you might want to track the average duration of incoming requests, or link an operation to the user who triggered it by adding their user ID to the span context. There are many ways to enrich telemetry, so always think about what data really matters for your case, adding both soft context (business information, like user IDs) and hard context (technical details, like response times or error codes).
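As a hedged sketch of what that can look like around a plain service object (the class name, attribute keys, and process_payment call are all illustrative):

```ruby
class Checkout
  TRACER = OpenTelemetry.tracer_provider.tracer('bookstore')

  def call(user, cart)
    TRACER.in_span('Checkout#call') do |span|
      # Soft context: business data we care about when debugging.
      span.set_attribute('app.user.id', user.id)
      span.set_attribute('app.cart.size', cart.items.count)

      process_payment(cart) # hypothetical business logic
    rescue => e
      # Mark the span as failed and keep the exception details on it.
      span.record_exception(e)
      span.status = OpenTelemetry::Trace::Status.error(e.message)
      raise
    end
  end
end
```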

Start with what’s important, instead of guessing what might be useful someday. Keep it simple, start narrow, and then expand as your needs grow. Always take care not to overload your infrastructure with too much recorded data.

A good rule of thumb is to think in terms of breadth-first search rather than depth-first search: spread your instrumentation across key parts of the system first, instead of going too deep into one long chain of spans. This gives you a clearer, more balanced picture of your system without drowning in detail.

Conclusion

I know it looks like it’s too much, but that’s only the tip of the iceberg when it comes to instrumentation and observability. I appreciate your patience and enthusiasm for sticking with me until the end.

Most of the information contained here was found in the book "Learning OpenTelemetry" by Ted Young and Austin Parker, and in the OpenTelemetry documentation. So if you want to read further or even try OpenTelemetry yourself, I’ll leave some references below.

Until next time!

References and interesting reads

Primary sources
