In the previous articles, we talked about observability as a way to understand a system from the outside, and OpenTelemetry as a standard for producing telemetry data such as traces, metrics, and logs.

But once an application produces traces, the next questions appear:

  • where do we view those traces?
  • how do we read the journey of a request?
  • how do we know which service is slow, failing, or causing a bottleneck?

This is where Jaeger becomes useful.

Jaeger is a distributed tracing tool that helps us visualize the journey of a request inside a distributed system. When a request passes through an API gateway, backend service, database, cache, message queue, or external service, Jaeger helps turn that journey into a trace that humans can actually read.

Quick Recap: From Observability to OpenTelemetry

Observability helps us answer questions like:

  • what is happening inside the system?
  • why is this request slow?
  • which service is failing?
  • which dependency is becoming a bottleneck?
  • is the issue in the application, network, database, or another service?

OpenTelemetry acts as a standard and toolkit for collecting telemetry data from applications. With OpenTelemetry, applications can produce traces, metrics, and logs in a more consistent format.

But OpenTelemetry itself is not usually the main dashboard for reading traces. OpenTelemetry is more commonly used as an instrumentation layer and telemetry pipeline. To read traces visually, we need a tracing backend such as Jaeger.

The Problem Jaeger Tries to Solve

In a simple system, debugging is usually still manageable. We can open the application logs, find the error, and fix the broken part.

But in modern systems, a single request often passes through many components.

For example:

User
→ API Gateway
→ Auth Service
→ Backend Service
→ Database
→ External Payment API
→ Message Queue
→ Worker

If the request is slow, one log line is usually not enough to explain the root cause.

The issue might be a slow database query, a slow external API, a full worker queue, or an internal service taking too long to process data. The request might also fail because of a dependency somewhere in the middle of the flow.

Jaeger helps answer questions like:

  • which services did this request pass through?
  • which service took the longest time?
  • which span failed?
  • which dependency caused the latency?
  • is the bottleneck consistent or occasional?

Without distributed tracing, we often only know that a request is slow. With Jaeger, we can see where it is slow.

What Is Jaeger?

Jaeger is a distributed tracing platform used to collect, store, and visualize traces from applications or services.

With Jaeger, we can view a single request as a trace. Inside that trace, there are multiple spans that represent smaller units of work performed while the request is being processed.

Examples of spans include:

  • receiving an HTTP request
  • querying a database
  • calling another service
  • publishing a message to a queue
  • running a worker process
  • receiving a response from an external API

Each span has duration, metadata, and a relationship with other spans. These relationships allow us to understand the complete structure of a request.

In simple terms:

Trace = the full journey of a request
Span  = a smaller unit of work inside that request
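
To make that concrete, here is a minimal sketch of what producing a span looks like in code. The OpenTelemetry Python SDK is used here only as an example; the same idea applies in any language with an OpenTelemetry SDK, and the service and span names are placeholders.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider once at application startup.
# ConsoleSpanExporter simply prints spans, which is enough to see the idea.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payments-api")

# The first span created without an active parent starts a new trace.
with tracer.start_as_current_span("POST /payments"):
    ...  # handle the request; the span ends when the block exits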

Core Concepts You Should Understand

Trace

A trace represents the journey of one request from start to finish.

For example, when a user opens an invoice page, the request might pass through the frontend, backend, database, and payment service. All of that journey can be grouped into one trace.

A trace helps us see the big picture of a request.

Span

A span is a unit of work inside a trace.

If a trace is the full journey, then spans are checkpoints along the way.

Examples of spans:

  • a span for an HTTP request
  • a span for a database query
  • a span for a call to another service
  • a span for request validation
  • a span for publishing an event

Each span usually has a start time, end time, duration, and metadata.
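
The timing normally comes from the SDK rather than from our own code: the start time is recorded when the span is created, the end time when it is finished, and the duration is the difference. A small sketch, again assuming the OpenTelemetry Python SDK:

from opentelemetry import trace

tracer = trace.get_tracer("payments-api")

# Explicit form: the start time is recorded here...
span = tracer.start_span("get user from database")
try:
    ...  # run the query
finally:
    span.end()  # ...and the end time here; duration = end - start

# The context-manager form (start_as_current_span) calls end() automatically.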

Parent-child Span

Spans can have parent-child relationships.

For example, the main span could be a request to a backend service. Inside it, there could be child spans for a database query and an external API call.

This relationship helps us understand the sequence and dependencies between operations.

A simple example:

POST /payments
├── validate request
├── get user from database
├── insert payment record
├── call payment provider
└── publish payment.created event

With a structure like this, we can understand what happened inside one request without manually jumping between many different logs.
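
In code, this parent-child structure usually comes from nesting spans. A short sketch that mirrors the tree above, assuming the OpenTelemetry Python SDK and hypothetical span names:

from opentelemetry import trace

tracer = trace.get_tracer("payments-api")

with tracer.start_as_current_span("POST /payments"):            # parent span
    with tracer.start_as_current_span("validate request"):      # child span
        ...
    with tracer.start_as_current_span("get user from database"):
        ...
    with tracer.start_as_current_span("call payment provider"):
        ...
    with tracer.start_as_current_span("publish payment.created event"):
        ...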

Tags or Attributes

Tags or attributes are additional metadata attached to a span.

Examples:

  • HTTP method
  • HTTP status code
  • endpoint
  • database statement
  • service name
  • error message
  • environment

This metadata is important because it helps with filtering and debugging.

For example, we might want to find all traces with status code 500, or all traces from a specific service. Attributes make investigation much faster.
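
In code, attributes are usually set on the active span. The sketch below assumes the OpenTelemetry Python SDK; the attribute keys follow the spirit of the OpenTelemetry semantic conventions, but treat the exact names and the handle_payment helper as illustrative:

from opentelemetry import trace

tracer = trace.get_tracer("payments-api")

with tracer.start_as_current_span("POST /payments") as span:
    span.set_attribute("http.request.method", "POST")
    span.set_attribute("url.path", "/payments")
    span.set_attribute("environment", "production")

    status_code = handle_payment()  # hypothetical handler
    span.set_attribute("http.response.status_code", status_code)

In the Jaeger UI these attributes appear as span tags, which is what makes searches such as "all traces with status code 500" possible.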

Duration and Latency

One of the strongest benefits of distributed tracing is the ability to see the duration of each span.

From this, we can identify which part of the request consumed the most time. Sometimes a request is slow not because the main application is slow, but because an external dependency takes too long to respond.

For example:

POST /payments                 2000ms
├── validate request              5ms
├── get user from database       30ms
├── insert payment record        40ms
├── call payment provider      1800ms
├── publish payment event        20ms
└── return response               5ms

From this example, the bottleneck is clearly the call payment provider span, not the database or the main application logic.

How Jaeger Works in a Modern Architecture

At a high level, the flow looks like this:

Application
→ OpenTelemetry SDK
→ OpenTelemetry Collector
→ Jaeger
→ Jaeger UI

The application produces traces through instrumentation. This instrumentation can be manual or automatic using the OpenTelemetry SDK.

The trace data is then sent to the OpenTelemetry Collector. The Collector receives, processes, and forwards telemetry data to a backend.

One of the backends that can receive those traces is Jaeger.

Once traces are available in Jaeger, we can open Jaeger UI to inspect request details, span duration, errors, and service relationships.
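
Wiring this flow into an application usually looks something like the sketch below. It assumes the OpenTelemetry Python SDK with the OTLP/gRPC exporter and a Collector reachable at otel-collector:4317; the service name and endpoint are examples, not fixed values.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The service name is what Jaeger shows in its service list.
resource = Resource.create({"service.name": "payments-api"})
provider = TracerProvider(resource=resource)

# Send spans to the OpenTelemetry Collector over OTLP/gRPC.
# The Collector is then configured to forward them to Jaeger.
exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))

trace.set_tracer_provider(provider)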

Jaeger and OpenTelemetry: How Are They Related?

OpenTelemetry and Jaeger often appear in the same conversation, but they play different roles.

OpenTelemetry acts as the instrumentation standard and telemetry pipeline.

Jaeger acts as the tracing backend and UI for reading traces.

In other words:

OpenTelemetry produces and sends traces.
Jaeger stores and visualizes traces.

In a modern setup, applications do not always need to send traces directly to Jaeger. A more flexible approach is to send traces to the OpenTelemetry Collector first, then let the Collector forward them to Jaeger.

This approach makes the system easier to change. If the tracing backend changes in the future, the application does not need major changes. We only need to update the Collector configuration.

A Simple Request Flow Example

Imagine a request to create a payment transaction.

POST /payments

The request might go through this flow:

API Service
→ validate request
→ check user
→ create payment record
→ call payment provider
→ publish event
→ return response

In Jaeger, this journey can appear as one trace with multiple spans:

POST /payments
├── validate request
├── get user from database
├── insert payment record
├── call payment provider
├── publish payment.created event
└── return response

If the request takes 2 seconds, Jaeger can help show which part took the longest.

Without tracing, we might only see a log saying that the request was slow. With Jaeger, we can see where the time was spent.

This is why distributed tracing becomes very useful in systems with many dependencies.
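
As a rough sketch of what the instrumented handler could look like, including how a failing dependency shows up as a failed span, again assuming the OpenTelemetry Python SDK. The validate, insert_payment, charge, and publish_event helpers are hypothetical:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("payments-api")

def create_payment(payload):
    with tracer.start_as_current_span("POST /payments"):
        with tracer.start_as_current_span("validate request"):
            validate(payload)                       # hypothetical helper

        with tracer.start_as_current_span("insert payment record"):
            payment = insert_payment(payload)       # hypothetical helper

        with tracer.start_as_current_span("call payment provider") as span:
            try:
                charge(payment)                     # hypothetical helper
            except Exception as exc:
                # The error is recorded on exactly this span,
                # so Jaeger can show which step failed.
                span.record_exception(exc)
                span.set_status(Status(StatusCode.ERROR))
                raise

        with tracer.start_as_current_span("publish payment.created event"):
            publish_event("payment.created", payment)  # hypothetical helper

        return payment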

When Jaeger Becomes Useful

Jaeger becomes very useful when a system starts to have many services or dependencies.

Common use cases include:

  • debugging latency between microservices
  • finding bottlenecks in a specific request
  • understanding dependencies between services
  • analyzing errors that happen in the middle of a request flow
  • comparing performance before and after a deployment
  • understanding request flows that are too complex to read from logs alone

Jaeger is not a replacement for logs or metrics. It complements them.

Metrics help us see aggregate patterns. Logs help us read detailed events. Traces help us understand the journey of a request.

All three are important parts of observability.

Things to Watch Out For

Jaeger is powerful, but there are a few things to keep in mind.

First, tracing can generate a lot of data. If every request is traced without sampling, storage usage and cost can increase quickly.
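
One common way to keep the volume under control is sampling at the SDK level. A minimal sketch, assuming the OpenTelemetry Python SDK; the 10% ratio is only an example:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces; child spans follow the decision made
# for their parent, so sampled traces stay complete instead of partial.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))

Sampling can also be applied later in the OpenTelemetry Collector, which keeps the decision out of application code.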

Second, instrumentation needs to be designed carefully. Too few spans make traces less useful. Too many spans make traces harder to read.

Third, span metadata should be useful but must not leak sensitive data. Avoid storing tokens, passwords, personal data, or sensitive payloads as attributes.

Fourth, service names should be consistent. If service names keep changing, traces become harder to filter and analyze.

Fifth, Jaeger becomes much more useful when trace context is propagated correctly between services. If context propagation does not work, traces can become fragmented and difficult to read as one complete request journey.
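
Propagation usually means carrying the trace context in request headers. A sketch with the OpenTelemetry Python API and the requests library; the URL and handler are placeholders:

import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("payments-api")

# Calling service: copy the current trace context into outgoing headers
# (by default as the W3C traceparent header).
with tracer.start_as_current_span("call payment provider"):
    headers = {}
    inject(headers)
    requests.post("https://provider.example.com/charge", headers=headers)

# Receiving service: continue the same trace from the incoming headers.
def handle_charge(request_headers):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("process charge", context=ctx):
        ...

In practice, auto-instrumentation for common HTTP clients and frameworks does this injection and extraction automatically; the sketch only shows what happens underneath.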

Closing

Jaeger helps make distributed tracing easier to read.

If OpenTelemetry is the way applications produce and send telemetry data, Jaeger is where we visually inspect the journey of a request.

With Jaeger, we can understand which services are involved, which operations are slow, and which dependencies may be causing problems.

At the beginning, the goal of using Jaeger is not to build a perfect observability system. The goal is to make debugging clearer.

Because in a distributed system, the hardest part is often not the error itself, but not knowing where it actually happened.