What Is Distributed Tracing?

Definition

Distributed tracing is a way of following a single request as it travels through all the services that handle it, recording the path it took and how long each step took, so you can see the whole journey in one view. In a system made of many services that call each other, a single user action might pass through a dozen services before a response comes back. Distributed tracing stitches together the work each of those services did on behalf of that one request into a single connected trace, so that instead of looking at each service in isolation, you can see the complete end-to-end story of what happened to that request.

The need for tracing comes directly from the move to distributed architectures. When an application is a single program, understanding a request is straightforward, because everything happens in one place and a stack trace or a log shows the whole flow. When that application is split into many services that communicate over the network, the flow of a single request is scattered across many machines and many logs, and no single service has the whole picture. Distributed tracing exists to reassemble that scattered picture, which is why it became essential exactly as microservices became common. It is the answer to the question microservices make hard: what actually happened to this one request.

A trace is built from spans. A span represents one unit of work, such as one service handling the request, one database query, or one call to another service, and it records when that work started, how long it took, and details about what it did. Spans are linked together by a shared trace identifier and by parent-child relationships, so the trace forms a tree that shows which work happened inside which other work and in what order. When you look at a trace, you see this tree of spans laid out over time, which immediately shows you where the time went and which service did what, for that specific request.

Distributed tracing is one of the three commonly cited pillars of observability, alongside metrics and logs, and it answers a question the other two cannot. Metrics tell you that something is wrong in aggregate, that latency is up or error rates are climbing, but not which request or which service. Logs tell you what individual services did, but not how their work connected into a single request's journey. Tracing connects the dots, showing the path and timing of individual requests across services, which is exactly what you need to debug problems that span service boundaries. The three pillars work together, and tracing is the one that handles the cross-service story.

This page covers what distributed tracing is, why distributed systems need it, how traces and spans actually work, how teams use tracing to debug and understand their systems, and the practices that make tracing useful rather than just expensive. The specific tools, from OpenTelemetry to the many tracing backends, will keep evolving. The underlying idea, following a single request across all the services that handle it so you can see its complete journey, is durable and increasingly necessary as systems grow more distributed.

Key Takeaways

Distributed tracing follows a single request across all the services that handle it, assembling the scattered work into one connected end-to-end view.
The need comes from distributed architectures, where a single request's flow is scattered across many services and no one service has the whole picture.
A trace is built from spans, each representing one unit of work with timing and detail, linked by a shared trace ID into a tree.
Tracing is the observability pillar that connects the dots across services, answering what happened to one request, which metrics and logs cannot.
Tracing is most valuable for debugging cross-service problems, finding where latency goes, and understanding how requests actually flow through the system.

Why Distributed Systems Need Tracing

In a single-process application, understanding a request is easy because everything happens in one place. A stack trace shows the chain of function calls, and logs from that one process tell the whole story, because there is only one story to tell. The flow is contained, sequential, and visible from a single vantage point. This is why monolithic applications rarely needed anything as elaborate as distributed tracing; the information was already together, and ordinary debugging tools were enough to follow what a request did from start to finish.

Microservices break that containment. When a request passes through many services running on different machines, each service sees only its own slice, logs only its own slice, and knows nothing about the slices before or after it. A user reports that something was slow, and the team faces a system where the request touched ten services, and the slowness could be in any of them, or in the network between them, with no single place that shows the whole flow. Correlating logs from ten services by hand to reconstruct one request is painful and often impossible at scale, which is the gap tracing fills.

The problems that tracing solves are specifically the cross-service ones, which are exactly the problems that distributed architectures create. A request that is slow because one service deep in the chain is slow, an error that originates in one service but surfaces in another, a request that takes an unexpected path through the system: these are problems you cannot understand by looking at any single service. They are problems of how services interact, and tracing is the tool built to make those interactions visible. The more services a request touches, the more valuable tracing becomes, because the harder it is to see the flow any other way.

There is also a comprehension benefit that goes beyond debugging. In a large microservices system, no one fully knows how requests actually flow, because the architecture is complex and changes constantly, and the real call paths often differ from what anyone expects. Tracing shows the real flow as it actually happens, which services call which, in what order, how often, and how long each takes, giving teams an accurate picture of their own system. This is valuable for finding inefficiencies, understanding dependencies, and onboarding people to a system too complex to hold in one head, and it is a reason to adopt tracing even before a specific incident demands it.

How Traces and Spans Work

The trace identifier is the thread that ties everything together. When a request enters the system, it is assigned a unique trace ID, and that ID is passed along to every service the request touches, traveling with the request through each network call. Every piece of work done on behalf of that request records the same trace ID, so later, all the scattered work can be found and assembled by matching trace IDs. This propagation of the trace ID across service boundaries, called context propagation, is the mechanism that makes distributed tracing possible, and it is the part that requires services to cooperate.
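As a minimal sketch of context propagation, the following assumes the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags) as the carrier. The text above does not name a specific header format, so that choice is an assumption, and the helper names `new_trace_context`, `inject`, `extract`, and `handle_request` are hypothetical.

```python
import secrets

def new_trace_context():
    """Start a new trace at the edge of the system: fresh trace ID and span ID."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def inject(ctx, headers):
    """Write the context into outgoing request headers (traceparent layout)."""
    # "00" is the format version; "01" is the flags field marking the trace as sampled.
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers

def extract(headers):
    """Read the context from incoming headers; return None if missing or malformed."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return {"trace_id": parts[1], "parent_span_id": parts[2]}

def handle_request(incoming_headers):
    """A downstream service: continue the caller's trace, or start a new one."""
    parent = extract(incoming_headers)
    if parent is None:
        return new_trace_context()
    # Same trace ID as the caller, but a new span ID for this service's own work.
    return {"trace_id": parent["trace_id"], "span_id": secrets.token_hex(8)}

# Service A starts a trace and calls service B with the header attached.
ctx_a = new_trace_context()
outgoing = inject(ctx_a, {})
ctx_b = handle_request(outgoing)
print(ctx_a["trace_id"] == ctx_b["trace_id"])
```

Both services end up recording the same trace ID, which is what lets a backend reassemble their work later. If service A never called `inject`, service B would silently start an unrelated trace.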

Spans are the units that make up the trace. Each span records one operation: a service receiving and handling the request, a call to a database, an outbound call to another service, or any other meaningful unit of work. A span captures when it started, how long it lasted, and attributes describing what it did, such as which database it queried or what status it returned. Spans also record their parent, the span that caused them, so the spans form a tree that mirrors the actual structure of the work: the top-level span for the whole request, with child spans for each service it called, and their children for the work those services did in turn.
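The span structure described above can be sketched as a small data class plus a function that groups spans by parent. The field names (`start_ms`, `duration_ms`, and so on), the example services, and the `db.system` attribute value are illustrative assumptions, not a particular backend's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    start_ms: int              # offset from the start of the trace
    duration_ms: int
    attributes: dict = field(default_factory=dict)

def build_tree(spans):
    """Group spans by parent ID so the trace can be walked as a tree."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return children

def render(children, parent_id=None, depth=0):
    """Return indented lines, one per span, parents before their children."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines

# Hypothetical spans for one request, in arbitrary arrival order.
spans = [
    Span("t1", "d", "b", "payment-service", 55, 45),
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "c", "b", "db: SELECT orders", 10, 40, {"db.system": "postgresql"}),
    Span("t1", "b", "a", "order-service", 5, 100),
]
print("\n".join(render(build_tree(spans))))
```

Spans usually reach a backend out of order and from different machines, so the tree is reconstructed from the parent links rather than from arrival order.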

Laid out visually, a trace becomes a timeline that makes the request's behavior obvious at a glance. Each span appears as a bar positioned by when it happened and sized by how long it took, nested to show parent-child relationships. This view immediately reveals where the time went, because the long bars stand out, and it shows whether work happened in sequence or in parallel. For example, the trace of a two-second request might show clearly that one particular database query inside one particular service took 1.8 of those seconds, which is precisely the kind of insight that would be nearly impossible to extract from logs scattered across services.
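To make the timeline view concrete, here is a rough text rendering of a two-second request in which one database query takes 1.8 seconds. The span names and timings are invented for illustration, and real tracing backends draw this graphically.

```python
# (name, depth, start_ms, duration_ms): hypothetical spans for one trace
spans = [
    ("GET /report", 0, 0, 2000),
    ("report-service", 1, 20, 1950),
    ("db: SELECT totals", 2, 60, 1800),
    ("render", 2, 1870, 90),
]

def timeline(spans, width=40):
    """Draw each span as a bar placed by start time and sized by duration."""
    total = max(start + dur for _, _, start, dur in spans)
    rows = []
    for name, depth, start, dur in spans:
        offset = start * width // total          # where the bar begins
        length = max(1, dur * width // total)    # at least one character wide
        label = ("  " * depth + name).ljust(26)
        bar = (" " * offset + "#" * length).ljust(width)
        rows.append(f"{label}|{bar}| {dur * 100 // total}%")
    return rows

rows = timeline(spans)
print("\n".join(rows))
```

The long bar for the query sits inside the bars of its parents and accounts for 90% of the total, which is the signal a reader looks for first.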

Getting spans recorded requires instrumentation, which is the part teams have to invest in. Code has to be instrumented to create spans, attach attributes, and propagate the trace context across calls, and while frameworks and libraries automate much of this, some effort is always involved. As of 2026, OpenTelemetry is the dominant standard for this instrumentation, providing a vendor-neutral way to generate traces that can be sent to any compatible backend, which frees teams from being locked into one vendor's agents. The instrumentation produces the spans, the backend stores and displays them, and the standard in between is what lets the two parts be chosen independently.
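The following is a standard-library-only sketch of what instrumentation does under the hood: open a span, make it the current one so nested work becomes its child, and hand it to an exporter on exit. It is not the OpenTelemetry API; `start_span`, `finished_spans`, and `load_order` are hypothetical names chosen for this example.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # stand-in for an exporter that ships spans to a backend

@contextmanager
def start_span(name, **attributes):
    """Create a span, make it current for nested work, and record it on exit."""
    parent = _current_span.get()
    span = {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": attributes,
        "status": "ok",
    }
    token = _current_span.set(span)
    started = time.perf_counter()
    try:
        yield span
    except Exception:
        span["status"] = "error"   # mark the span, then let the error continue upward
        raise
    finally:
        span["duration_ms"] = (time.perf_counter() - started) * 1000
        _current_span.reset(token)
        finished_spans.append(span)

def load_order(order_id):
    # The child span picks up its parent from the context automatically.
    with start_span("db.query", table="orders"):
        return {"id": order_id}

with start_span("handle_checkout", route="/checkout"):
    load_order(42)

print([s["name"] for s in finished_spans])
```

Child spans finish before their parents, so they are exported first. Automatic instrumentation in frameworks wraps request handlers and client libraries in much the same way, which is why application code often needs few manual spans.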

How Teams Use Tracing

The most common use is debugging slow requests, where tracing earns its keep immediately. When a request is slow, the trace shows exactly where the time was spent, span by span, so instead of guessing which of many services is the culprit, the team looks at the trace and sees the one span that dominates the duration. A slow database query, a slow downstream call, or an unexpected retry all show up plainly. This turns a cross-service performance investigation, which can otherwise take hours of correlating logs, into a quick look at a timeline, which is why latency debugging is the use case teams reach for first.

Tracing is equally valuable for understanding errors that cross service boundaries. When a request fails, the trace shows where the failure originated, which is often not where it surfaced, because an error deep in the chain can propagate up and appear as a generic failure at the top. The trace lets the team follow the failure back to its source, seeing which span actually failed and what it was doing, rather than starting at the symptom and working backward through unconnected logs. This is especially useful for the confusing cases where the service that reports the error is not the service that caused it, which are common in distributed systems and hard to debug otherwise.
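One way to express "follow the failure back to its source" in code is to look for failed spans that have no failed children. This is a simplified heuristic under the assumption that every span carries a `status` field, and the span data below is invented for illustration.

```python
def error_origin(spans):
    """Return failed spans with no failed children: where the error began."""
    failed = [s for s in spans if s["status"] == "error"]
    failed_parents = {s["parent_id"] for s in failed}
    return [s for s in failed if s["span_id"] not in failed_parents]

# Hypothetical trace: the gateway reports a vague error that started three levels down.
spans = [
    {"span_id": "a", "parent_id": None, "service": "web-gateway",
     "status": "error", "message": "downstream call failed"},
    {"span_id": "b", "parent_id": "a", "service": "orders",
     "status": "error", "message": "downstream call failed"},
    {"span_id": "c", "parent_id": "a", "service": "catalog",
     "status": "ok", "message": ""},
    {"span_id": "d", "parent_id": "b", "service": "inventory",
     "status": "error", "message": "timed out connecting to database"},
]

for span in error_origin(spans):
    print(span["service"], "-", span["message"])
```

The gateway and the orders service only relayed the failure. The inventory span is the one with the specific cause attached.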

Beyond individual requests, teams use tracing to understand the system as a whole. Aggregated across many traces, the data shows which services call which, how often, and how the latency is distributed, revealing the real architecture and its hot spots. This helps teams find services that are called more than expected, dependencies they did not realize existed, and patterns of slowness that only emerge in aggregate. It supports capacity planning, dependency mapping, and the ongoing work of understanding a system that is too large and too dynamic for anyone to hold in their head, which is a quieter but substantial benefit of having traces.
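Aggregation across traces can be sketched as counting caller-to-callee edges and averaging their latency. The `service` field on each span and the service names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def service_edges(traces):
    """Count caller -> callee edges across many traces, with mean callee latency."""
    calls = Counter()
    total_ms = defaultdict(int)
    for spans in traces:
        service_of = {s["span_id"]: s["service"] for s in spans}
        for s in spans:
            caller = service_of.get(s["parent_id"])
            # Only count spans whose parent belongs to a different service.
            if caller is not None and caller != s["service"]:
                edge = (caller, s["service"])
                calls[edge] += 1
                total_ms[edge] += s["duration_ms"]
    return {edge: (n, total_ms[edge] / n) for edge, n in calls.items()}

def make_trace(web_ms, orders_ms, flags_ms):
    """Build one hypothetical trace: web -> orders -> feature-flags."""
    return [
        {"span_id": "a", "parent_id": None, "service": "web", "duration_ms": web_ms},
        {"span_id": "b", "parent_id": "a", "service": "orders", "duration_ms": orders_ms},
        {"span_id": "c", "parent_id": "b", "service": "feature-flags", "duration_ms": flags_ms},
    ]

edges = service_edges([make_trace(100, 80, 30), make_trace(60, 50, 20)])
for (caller, callee), (count, mean_ms) in edges.items():
    print(f"{caller} -> {callee}: {count} calls, mean {mean_ms:.1f} ms")
```

A map like this is how a dependency that nobody expected, such as a flag service called on every request, becomes visible.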

Tracing also connects to the rest of observability, and the connections are where a lot of the value comes from. A trace can link to the logs emitted during each span, so you can jump from seeing that a span was slow to reading exactly what that service logged while it was slow. Metrics that show a problem in aggregate can lead to traces that show it in detail for specific requests. The trace becomes the connective tissue that ties metrics, logs, and the request flow together, which is why mature observability setups integrate all three rather than treating tracing as a separate tool, and why tracing is most useful as part of a whole rather than alone.
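Linking logs to traces usually comes down to stamping each log line with the active trace ID. The sketch below does this with Python's standard `logging` module; the `current_trace_id` context variable and the `TraceIdFilter` class are hypothetical names, and the trace ID value is an arbitrary example.

```python
import contextvars
import io
import logging

# Set by the tracing instrumentation when a request begins (assumed here).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

# Captured in memory so the result can be inspected; a real service
# would log to stdout or a file instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("charging card")

output = stream.getvalue()
print(output, end="")
```

With the ID in every line, a log search for that trace ID returns exactly what each service logged while handling that request.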

Costs, Sampling, and Practical Trade-offs

Tracing is not free, and the cost is real enough that teams have to manage it deliberately. Generating, transmitting, storing, and querying spans for every request in a high-traffic system produces an enormous volume of data, which costs money and adds some overhead to the services being traced. A busy system can generate billions of spans a day, and storing and indexing all of them is expensive. This is the central practical tension in tracing: the data is most useful when you have the trace you need, but keeping every trace from every request is costly, so teams have to decide what to keep.
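A back-of-the-envelope calculation shows how quickly span volume grows. All three inputs below (requests per second, spans per request, bytes per span) are assumed figures chosen for illustration, not measurements.

```python
SECONDS_PER_DAY = 86_400

def daily_span_volume(requests_per_second, spans_per_request, bytes_per_span):
    """Estimate spans and raw bytes produced per day if every request is traced."""
    spans_per_day = requests_per_second * spans_per_request * SECONDS_PER_DAY
    bytes_per_day = spans_per_day * bytes_per_span
    return spans_per_day, bytes_per_day

# Assumed workload: 5,000 requests/s, 10 spans each, about 500 bytes per span.
spans_per_day, bytes_per_day = daily_span_volume(5_000, 10, 500)
print(f"{spans_per_day:,} spans/day, {bytes_per_day / 1e12:.2f} TB/day")
```

Under these assumptions the system produces about 4.3 billion spans and roughly 2 TB of raw span data per day, before indexing or replication.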

Sampling is how teams manage that cost, and it is the most important practical decision in tracing. Rather than keeping every trace, the system keeps a sample, and the strategy for choosing which traces to keep matters a great deal. Head-based sampling decides at the start of a request whether to trace it, which is simple but might miss the rare slow or failed requests that are most worth keeping. Tail-based sampling decides after the request finishes, so it can keep all the slow or erroneous traces and a sample of the normal ones, which is more useful but more complex to operate. The choice trades cost against the likelihood of having the trace you need when something goes wrong.
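The two strategies can be contrasted in a short sketch. The hashing scheme for head-based sampling and the one-second slowness threshold for tail-based sampling are illustrative assumptions. Real samplers differ in detail.

```python
def head_sample(trace_id, rate):
    """Decide at request start, from the trace ID alone, so every service agrees."""
    bucket = int(trace_id[:8], 16) / 2**32   # map the ID prefix to [0, 1)
    return bucket < rate

def tail_sample(spans, rate, slow_ms=1000):
    """Decide after the trace completes: keep errors and slow traces, sample the rest."""
    root = next(s for s in spans if s["parent_id"] is None)
    if any(s["status"] == "error" for s in spans):
        return True
    if root["duration_ms"] >= slow_ms:
        return True
    return head_sample(root["trace_id"], rate)

def one_span_trace(status, duration_ms):
    """Hypothetical single-span trace whose ID falls outside a 10% head sample."""
    return [{"trace_id": "ff" * 16, "parent_id": None,
             "status": status, "duration_ms": duration_ms}]

print(head_sample("ff" * 16, 0.1))                     # head sampling drops it
print(tail_sample(one_span_trace("ok", 80), 0.1))      # fast and healthy: dropped
print(tail_sample(one_span_trace("ok", 2500), 0.1))    # slow: kept
print(tail_sample(one_span_trace("error", 80), 0.1))   # failed: kept
```

The head-based decision is cheap because it needs nothing but the trace ID, but it cannot know the request will turn out slow. The tail-based decision needs the whole trace buffered somewhere until it completes, which is where the operational complexity comes from.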

The instrumentation effort is the other practical cost, and it is ongoing rather than one-time. Services have to be instrumented to produce spans and propagate context, and while automatic instrumentation from frameworks and OpenTelemetry covers a lot, gaps remain, especially at the boundaries between services and in custom code. Context propagation is the fragile part: if even one service in the chain fails to pass the trace ID along, the trace breaks at that point and the downstream work is orphaned. Keeping instrumentation complete and correct across a changing system takes continuous attention, and broken traces are a common frustration that comes from instrumentation gaps.
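A dropped trace ID typically shows up as a trace that appears to start in the middle of the system. One simple check, sketched here, flags traces whose root span belongs to a service that should never be an entry point. The `ENTRY_SERVICES` set and all service names are hypothetical.

```python
from collections import defaultdict

ENTRY_SERVICES = {"web-gateway"}  # hypothetical: where requests legitimately start

def suspicious_roots(spans):
    """Find traces that begin at an internal service, a sign that an
    upstream caller did not pass the trace context along."""
    by_trace = defaultdict(list)
    for s in spans:
        by_trace[s["trace_id"]].append(s)
    broken = []
    for trace_id, members in by_trace.items():
        for s in members:
            if s["parent_id"] is None and s["service"] not in ENTRY_SERVICES:
                broken.append((trace_id, s["service"]))
    return broken

# legacy-billing records its own span but calls ledger without forwarding the
# context, so ledger starts a separate trace (t2).
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "web-gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "orders"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "legacy-billing"},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "service": "ledger"},
    {"trace_id": "t2", "span_id": "y", "parent_id": "x", "service": "ledger-db"},
]
print(suspicious_roots(spans))
```

Running a check like this periodically over stored traces is one way to notice instrumentation gaps before an incident depends on the missing link.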

The value of tracing depends on having it before you need it, which shapes how teams should approach the trade-offs. You cannot trace a request that already happened if it was not being traced at the time, so the decision about what to instrument and what to sample has to be made in advance, betting on what will be useful later. This argues for instrumenting broadly and sampling intelligently rather than tracing narrowly, because the cost of missing the one trace you needed during an incident is high. The practical art of tracing is keeping enough of the right data to be useful during incidents without keeping so much that the cost becomes unjustifiable, and that balance is specific to each system.

Examples of Tracing in Practice

A slow checkout shows the latency use case clearly. Customers report that checkout is occasionally slow, and the checkout request passes through a web service, an order service, a payment service, an inventory service, and several databases. Without tracing, the team would correlate logs across all of these to find the slow step, which is tedious and uncertain. With tracing, they open a slow trace and see immediately that one span, a call to an external payment provider, took several seconds while everything else was fast. The investigation that could have taken hours takes minutes, because the trace points straight at the slow span.

A confusing error shows the cross-service debugging use case. A request fails with a generic error at the top-level service, and the logs there say only that a downstream call failed, with no indication of why. The trace shows the failure originated three services deep, in a service that timed out connecting to a database, and the timeout propagated up as a vague error at each layer. The trace lets the team follow the failure to its true source rather than starting at the misleading symptom, which is exactly the kind of problem, an error that surfaces far from where it originated, that distributed systems produce constantly.

A system-understanding example shows the aggregate use case. A team inherits a large microservices system and is unsure how requests actually flow through it, because the documentation is stale and the architecture has drifted. By aggregating traces, they build an accurate map of which services call which, how often, and where the latency concentrates, and they discover that a service everyone thought was rarely used is actually called on every request and contributing significant latency. This insight, invisible in any single trace and impossible to get from logs, comes from tracing data viewed in aggregate, and it changes how they prioritize their work.

These examples share a common thread: the problem is about how services interact, and tracing makes that interaction visible. The slow checkout, the propagating error, and the misunderstood architecture are all problems you cannot see by looking at one service alone, because the information that solves them is in the connections between services. Seeing the pattern across latency, errors, and comprehension makes clear why tracing is specifically a distributed-systems tool: it exists to answer questions about cross-service behavior, and those are exactly the questions that distributed architectures make hard and that other observability tools cannot fully answer.

Best Practices

Adopt OpenTelemetry for instrumentation so traces are vendor-neutral and you can choose or change your tracing backend without re-instrumenting everything.
Instrument broadly and ensure context propagates across every service boundary, since one service that drops the trace ID breaks the trace at that point.
Use intelligent sampling, often tail-based, so you keep the slow and failed traces that matter most rather than only a blind fraction of all requests.
Connect traces to logs and metrics so you can move from a slow span to the logs that explain it, treating tracing as one part of a complete observability setup.
Add meaningful attributes to spans, such as identifiers and parameters, so traces carry the context needed to understand and group them later.

Common Misconceptions

"Tracing is the same as logging." Logs show what one service did, while tracing connects the work of many services into one request's end-to-end journey.
"Tracing replaces metrics and logs." It is one observability pillar that complements the others, answering the cross-service question they cannot.
"You can trace everything cheaply." The data volume is large and costly, which is why sampling is a central and unavoidable practical decision.
"Instrumentation is a one-time setup." Context propagation and span coverage need continuous attention as services change; otherwise traces break and become incomplete.
"You can trace a request after the fact." If it was not being traced when it happened, the trace does not exist, so instrumentation and sampling decisions have to be made in advance.
