Distributed tracing is a way of following a single request as it travels through all the services that handle it, recording the path it took and how long each step took, so you can see the whole journey in one view. In a system made of many services that call each other, a single user action might pass through a dozen services before a response comes back. Distributed tracing stitches together the work each of those services did on behalf of that one request into a single connected trace, so that instead of looking at each service in isolation, you can see the complete end-to-end story of what happened to that request.
The need for tracing comes directly from the move to distributed architectures. When an application is a single program, understanding a request is straightforward, because everything happens in one place and a stack trace or a log shows the whole flow. When that application is split into many services that communicate over the network, the flow of a single request is scattered across many machines and many logs, and no single service has the whole picture. Distributed tracing exists to reassemble that scattered picture, which is why it became essential exactly as microservices became common. It is the answer to the question microservices make hard: what actually happened to this one request.
A trace is built from spans. A span represents one unit of work, such as one service handling the request, one database query, or one call to another service, and it records when that work started, how long it took, and details about what it did. Spans are linked together by a shared trace identifier and by parent-child relationships, so the trace forms a tree that shows which work happened inside which other work and in what order. When you look at a trace, you see this tree of spans laid out over time, which immediately shows you where the time went and which service did what, for that specific request.
Distributed tracing is one of the three commonly cited pillars of observability, alongside metrics and logs, and it answers a question the other two cannot. Metrics tell you that something is wrong in aggregate, that latency is up or error rates are climbing, but not which request or which service. Logs tell you what individual services did, but not how their work connected into a single request's journey. Tracing connects the dots, showing the path and timing of individual requests across services, which is exactly what you need to debug problems that span service boundaries. The three pillars work together, and tracing is the one that handles the cross-service story.
This page covers what distributed tracing is, why distributed systems need it, how traces and spans actually work, how teams use tracing to debug and understand their systems, and the practices that make tracing useful rather than just expensive. The specific tools, from OpenTelemetry to the many tracing backends, will keep evolving. The underlying idea, following a single request across all the services that handle it so you can see its complete journey, is durable and increasingly necessary as systems grow more distributed.
In a single-process application, understanding a request is easy because everything happens in one place. A stack trace shows the chain of function calls, and logs from that one process tell the whole story, because there is only one story to tell. The flow is contained, sequential, and visible from a single vantage point. This is why monolithic applications rarely needed anything as elaborate as distributed tracing; the information was already together, and ordinary debugging tools were enough to follow what a request did from start to finish.
Microservices break that containment. When a request passes through many services running on different machines, each service sees only its own slice, logs only its own slice, and knows nothing about the slices before or after it. A user reports that something was slow, and the team faces a system where the request touched ten services, and the slowness could be in any of them, or in the network between them, with no single place that shows the whole flow. Correlating logs from ten services by hand to reconstruct one request is painful and often impossible at scale, which is the gap tracing fills.
The problems that tracing solves are specifically the cross-service ones, which are exactly the problems that distributed architectures create. Consider a request that is slow because one service deep in the chain is slow, an error that originates in one service but surfaces in another, or a request that takes an unexpected path through the system. None of these can be understood by looking at any single service. They are problems of how services interact, and tracing is the tool built to make those interactions visible. The more services a request touches, the more valuable tracing becomes, because the harder it is to see the flow any other way.
There is also a comprehension benefit that goes beyond debugging. In a large microservices system, no one fully knows how requests actually flow, because the architecture is complex and changes constantly, and the real call paths often differ from what anyone expects. Tracing shows the real flow as it actually happens, which services call which, in what order, how often, and how long each takes, giving teams an accurate picture of their own system. This is valuable for finding inefficiencies, understanding dependencies, and onboarding people to a system too complex to hold in one head, and it is a reason to adopt tracing even before a specific incident demands it.
The trace identifier is the thread that ties everything together. When a request enters the system, it is assigned a unique trace ID, and that ID is passed along to every service the request touches, traveling with the request through each network call. Every piece of work done on behalf of that request records the same trace ID, so later, all the scattered work can be found and assembled by matching trace IDs. This propagation of the trace ID across service boundaries, called context propagation, is the mechanism that makes distributed tracing possible, and it is the part that requires services to cooperate.
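As a rough illustration of context propagation, the sketch below passes a trace ID from one service to the next in a request header. The header layout is modeled on the W3C Trace Context `traceparent` format (version, 32-hex trace ID, 16-hex parent span ID, flags); the function names and the dictionary-based context are assumptions made for this example, not any library's API.

```python
import secrets

def new_trace_context():
    """Start a new trace: random 16-byte trace ID and 8-byte span ID, hex-encoded."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def inject(context, headers):
    """Write the trace context into outgoing request headers."""
    headers["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"
    return headers

def extract(headers):
    """Read the trace context from incoming headers; None if absent or malformed."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return {"trace_id": parts[1], "parent_span_id": parts[2]}

def handle_request(headers):
    """A downstream service: continue the caller's trace, or start a new one."""
    incoming = extract(headers)
    if incoming is None:
        # No context arrived, so this service would begin an unrelated trace.
        return new_trace_context()
    return {
        "trace_id": incoming["trace_id"],          # same trace as the caller
        "span_id": secrets.token_hex(8),           # new span for this service's work
        "parent_span_id": incoming["parent_span_id"],
    }

# The calling service injects its context; the called service extracts it.
root = new_trace_context()
outgoing_headers = inject(root, {})
downstream = handle_request(outgoing_headers)
```

The downstream context carries the caller's trace ID and records the caller's span as its parent, which is what later lets a backend reassemble the two pieces of work into one trace.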
Spans are the units that make up the trace. Each span records one operation: a service receiving and handling the request, a call to a database, an outbound call to another service, or any other meaningful unit of work. A span captures when it started, how long it lasted, and attributes describing what it did, such as which database it queried or what status it returned. Spans also record their parent, the span that caused them, so the spans form a tree that mirrors the actual structure of the work: the top-level span for the whole request, with child spans for each service it called, and their children for the work those services did in turn.
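A minimal sketch of that structure, assuming a simplified span record: the field names and the sample spans below are illustrative and do not follow any particular backend's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]      # None marks the top-level span
    name: str
    start_ms: int
    duration_ms: int
    attributes: dict = field(default_factory=dict)

def build_tree(spans):
    """Group spans by parent ID so the tree can be walked from the root."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return children

def outline(children, parent_id=None, depth=0):
    """Return the span names indented by their depth in the tree."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span.name)
        lines.extend(outline(children, span.span_id, depth + 1))
    return lines

spans = [
    Span("t1", "a", None, "web: GET /checkout", 0, 120),
    Span("t1", "b", "a", "orders: create order", 10, 90),
    Span("t1", "c", "b", "db: INSERT orders", 20, 40, {"db.system": "postgresql"}),
    Span("t1", "d", "a", "inventory: reserve", 100, 15),
]
tree_lines = outline(build_tree(spans))
print("\n".join(tree_lines))
```

Because each span stores only its own parent, the spans can arrive from different services in any order and still be assembled into the same tree.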
Laid out visually, a trace becomes a timeline that makes the request's behavior obvious at a glance. Each span appears as a bar positioned by when it happened and sized by how long it took, nested to show parent-child relationships. This view immediately reveals where the time went, because the long bars stand out, and it shows whether work happened in sequence or in parallel. For example, the trace of a request that took two seconds might show that one particular database query inside one particular service accounted for 1.8 of those seconds, which is precisely the kind of insight that would be nearly impossible to extract from logs scattered across services.
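The timeline view can be approximated in a few lines of text rendering. This is only a sketch: spans are given as hypothetical `(name, depth, start_ms, duration_ms)` tuples, and the numbers mirror the two-second request described above.

```python
def render_timeline(spans, width=40):
    """Draw each span as a bar positioned by start time and sized by duration."""
    total = max(start + duration for _, _, start, duration in spans)
    rows = []
    for name, depth, start, duration in spans:
        offset = round(start / total * width)
        length = max(1, round(duration / total * width))   # never draw an empty bar
        label = "  " * depth + name
        rows.append(f"{label:<28}|{' ' * offset}{'#' * length}")
    return rows

# Illustrative spans for one request: (name, nesting depth, start ms, duration ms).
spans = [
    ("web: GET /report", 0, 0, 2000),
    ("reports: build", 1, 50, 1900),
    ("db: SELECT", 2, 100, 1800),
    ("cache: lookup", 1, 1960, 20),
]
rows = render_timeline(spans)
print("\n".join(rows))
```

Even in this crude form, the database bar is nearly as long as the whole request while the cache lookup is a single character, which is the contrast a real trace viewer makes visible.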
Getting spans recorded requires instrumentation, which is the part teams have to invest in. Code has to be instrumented to create spans, attach attributes, and propagate the trace context across calls, and while frameworks and libraries automate much of this, some effort is always involved. As of 2026, OpenTelemetry is the dominant standard for this instrumentation. It provides a vendor-neutral way to generate traces that can be sent to any compatible backend, which frees teams from being locked into one vendor's agents. The instrumentation produces the spans, the backend stores and displays them, and the standard in between is what lets the two parts be chosen independently.
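To make concrete what instrumentation does, here is a standard-library stand-in for a tracer. It is not OpenTelemetry's API; `start_span`, `finished_spans`, and the example functions are hypothetical names chosen for this sketch, and a real instrumentation library would export spans to a backend rather than append them to a list.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []   # stands in for an exporter that ships spans to a backend

@contextmanager
def start_span(name, **attributes):
    """Create a span, make it the current one, and record it when the block exits."""
    parent = _current_span.get()
    span = {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": dict(attributes),
    }
    token = _current_span.set(span)
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span["attributes"]["error"] = repr(exc)   # note the failure, then re-raise
        raise
    finally:
        span["duration_ms"] = (time.perf_counter() - started) * 1000
        _current_span.reset(token)                # restore the parent as current
        finished_spans.append(span)

def charge_card(amount):
    with start_span("payments: charge", amount=amount):
        time.sleep(0.01)                          # placeholder for real work

def checkout():
    with start_span("web: POST /checkout") as span:
        span["attributes"]["cart_items"] = 3
        charge_card(42)

checkout()
```

The nested call picks up its parent from the context variable without the caller passing anything explicitly, which is roughly the convenience that automatic instrumentation provides inside a single process.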
The most common use is debugging slow requests, where tracing earns its keep immediately. When a request is slow, the trace shows exactly where the time was spent, span by span, so instead of guessing which of many services is the culprit, the team looks at the trace and sees the one span that dominates the duration. A slow database query, a slow downstream call, an unexpected retry, all show up plainly. This turns a cross-service performance investigation, which can otherwise take hours of correlating logs, into a quick look at a timeline, which is why latency debugging is the use case teams reach for first.
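One way to find the span that dominates a trace is to compute each span's self time, meaning its duration minus the time covered by its direct children. The sketch below assumes children run one after another rather than in parallel, and the span dictionaries and durations are made up for illustration.

```python
def self_times(spans):
    """Time each span spent on its own work, excluding its direct children.

    Assumes sibling spans do not overlap in time; parallel children would
    need interval merging instead of a simple sum.
    """
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        s["span_id"]: s["duration_ms"] - child_total.get(s["span_id"], 0)
        for s in spans
    }

def slowest_span(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s["span_id"]])

spans = [
    {"span_id": "web", "parent_id": None, "name": "web: POST /checkout", "duration_ms": 3200},
    {"span_id": "ord", "parent_id": "web", "name": "orders: create", "duration_ms": 3150},
    {"span_id": "pay", "parent_id": "ord", "name": "payments: charge", "duration_ms": 3050},
    {"span_id": "ext", "parent_id": "pay", "name": "payments: call external provider", "duration_ms": 3000},
    {"span_id": "inv", "parent_id": "ord", "name": "inventory: reserve", "duration_ms": 40},
    {"span_id": "db", "parent_id": "ord", "name": "db: INSERT orders", "duration_ms": 25},
]
culprit = slowest_span(spans)
print(culprit["name"])
```

Ranking by self time rather than raw duration matters because every ancestor of a slow span is also slow; the self time isolates the span where the time was actually spent.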
Tracing is equally valuable for understanding errors that cross service boundaries. When a request fails, the trace shows where the failure originated, which is often not where it surfaced, because an error deep in the chain can propagate up and appear as a generic failure at the top. The trace lets the team follow the failure back to its source, seeing which span actually failed and what it was doing, rather than starting at the symptom and working backward through unconnected logs. This is especially useful for the confusing cases where the service that reports the error is not the service that caused it, which are common in distributed systems and hard to debug otherwise.
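Following a failure back to its source can be sketched as a search for failed spans that have no failed children. The span fields and the example trace here are assumptions for illustration; real backends record status in their own formats.

```python
def error_origins(spans):
    """Return failed spans none of whose children failed.

    Ancestors that merely relayed a child's failure are filtered out,
    leaving the deepest points where something actually went wrong.
    """
    parents_of_failures = {s["parent_id"] for s in spans if s["error"]}
    return [s for s in spans if s["error"] and s["span_id"] not in parents_of_failures]

spans = [
    {"span_id": "gw", "parent_id": None, "name": "gateway: GET /profile",
     "error": True, "message": "downstream call failed"},
    {"span_id": "usr", "parent_id": "gw", "name": "users: load profile",
     "error": True, "message": "downstream call failed"},
    {"span_id": "acc", "parent_id": "usr", "name": "accounts: fetch account",
     "error": True, "message": "timeout connecting to database"},
    {"span_id": "cch", "parent_id": "gw", "name": "cache: lookup",
     "error": False, "message": ""},
]
origins = error_origins(spans)
print([(s["name"], s["message"]) for s in origins])
```

In this example the gateway and users spans both report a generic failure, but only the accounts span is returned, along with the message that explains the cause.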
Beyond individual requests, teams use tracing to understand the system as a whole. Aggregated across many traces, the data shows which services call which, how often, and how the latency is distributed, revealing the real architecture and its hot spots. This helps teams find services that are called more than expected, dependencies they did not realize existed, and patterns of slowness that only emerge in aggregate. It supports capacity planning, dependency mapping, and the ongoing work of understanding a system that is too large and too dynamic for anyone to hold in their head, which is a quieter but substantial benefit of having traces.
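Aggregation of this kind can be sketched by counting caller-to-callee edges across many traces. The `service` field and the sample traces below are illustrative assumptions; a tracing backend would typically derive the same map from its stored spans.

```python
from collections import Counter

def service_edges(traces):
    """Count how often each service calls each other service across traces."""
    edges = Counter()
    for spans in traces:
        service_of = {s["span_id"]: s["service"] for s in spans}
        for s in spans:
            caller = service_of.get(s["parent_id"])
            # Skip root spans and spans that stay inside one service.
            if caller is not None and caller != s["service"]:
                edges[(caller, s["service"])] += 1
    return edges

traces = [
    [
        {"span_id": "1", "parent_id": None, "service": "web"},
        {"span_id": "2", "parent_id": "1", "service": "orders"},
        {"span_id": "3", "parent_id": "2", "service": "pricing"},
    ],
    [
        {"span_id": "1", "parent_id": None, "service": "web"},
        {"span_id": "2", "parent_id": "1", "service": "orders"},
        {"span_id": "3", "parent_id": "2", "service": "pricing"},
        {"span_id": "4", "parent_id": "2", "service": "inventory"},
    ],
]
edges = service_edges(traces)
for (caller, callee), count in edges.most_common():
    print(f"{caller} -> {callee}: {count}")
```

The resulting counts are the raw material for a dependency map: an edge that appears in every trace marks a service on the critical path, whatever the documentation says.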
Tracing also connects to the rest of observability, and the connections are where a lot of the value comes from. A trace can link to the logs emitted during each span, so you can jump from seeing that a span was slow to reading exactly what that service logged while it was slow. Metrics that show a problem in aggregate can lead to traces that show it in detail for specific requests. The trace becomes the connective tissue that ties metrics, logs, and the request flow together, which is why mature observability setups integrate all three rather than treating tracing as a separate tool, and why tracing is most useful as part of a whole rather than alone.
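The link between logs and traces usually comes down to stamping each log line with the current trace and span IDs. The sketch below does this with Python's standard `logging` module; the `current_ids` context variable and the example IDs are assumptions, standing in for whatever the tracing library exposes as the active span.

```python
import contextvars
import io
import logging

# Holds (trace_id, span_id) for the work currently in progress.
current_ids = contextvars.ContextVar("current_ids", default=("-", "-"))

class TraceContextFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""
    def filter(self, record):
        record.trace_id, record.span_id = current_ids.get()
        return True

stream = io.StringIO()   # captured in memory here; normally a file or stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("trace=%(trace_id)s span=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# While handling a traced request, the IDs are set once and every log line carries them.
current_ids.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("charging card")
log_line = stream.getvalue().strip()
print(log_line)
```

With the IDs in every line, a log search for one trace ID returns exactly the lines written during that request, and a trace viewer can link each span to them.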
Tracing is not free, and the cost is real enough that teams have to manage it deliberately. Generating, transmitting, storing, and querying spans for every request in a high-traffic system produces an enormous volume of data, which costs money and adds some overhead to the services being traced. A busy system can generate billions of spans a day, and storing and indexing all of them is expensive. This is the central practical tension in tracing: the data is most useful when you have the trace you need, but keeping every trace from every request is costly, so teams have to decide what to keep.
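A back-of-the-envelope estimate shows how quickly the volume grows. Every number below (request rate, spans per request, bytes per span, sample rate) is an assumption chosen for illustration, not a measurement from any real system.

```python
SECONDS_PER_DAY = 86_400

def daily_span_volume(requests_per_second, spans_per_request, bytes_per_span, sample_rate=1.0):
    """Return (spans per day, bytes per day) for the given traffic and sample rate."""
    spans_per_day = requests_per_second * spans_per_request * SECONDS_PER_DAY * sample_rate
    return spans_per_day, spans_per_day * bytes_per_span

# Assumed workload: 5,000 requests/s, 10 spans per request, 500 bytes per span.
spans_all, bytes_all = daily_span_volume(5_000, 10, 500)
spans_kept, bytes_kept = daily_span_volume(5_000, 10, 500, sample_rate=0.05)

print(f"keep everything: {spans_all / 1e9:.2f} billion spans, {bytes_all / 1e12:.2f} TB per day")
print(f"keep 5 percent:  {spans_kept / 1e6:.0f} million spans, {bytes_kept / 1e9:.0f} GB per day")
```

Under these assumptions, keeping everything means about 4.3 billion spans and roughly 2 TB of raw span data per day before indexing, while a 5 percent sample brings that down to about 216 million spans and roughly 108 GB.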
Sampling is how teams manage that cost, and it is the most important practical decision in tracing. Rather than keeping every trace, the system keeps a sample, and the strategy for choosing which traces to keep matters a great deal. Head-based sampling decides at the start of a request whether to trace it, which is simple but might miss the rare slow or failed requests that are most worth keeping. Tail-based sampling decides after the request finishes, so it can keep all the slow or erroneous traces and a sample of the normal ones, which is more useful but more complex to operate. The choice trades cost against the likelihood of having the trace you need when something goes wrong.
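The difference between the two strategies can be sketched in a few lines. The hashing scheme, the one-second slowness threshold, and the span fields are assumptions for this example; production samplers differ in detail.

```python
import hashlib

def head_sample(trace_id, rate):
    """Decide up front from the trace ID alone.

    Hashing the ID makes the decision deterministic, so every service that
    sees the same trace ID reaches the same answer without coordination.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0x100000000
    return bucket < rate

def tail_sample(spans, rate, slow_ms=1000):
    """Decide after the trace is complete: always keep errors and slow traces."""
    root = next(s for s in spans if s["parent_id"] is None)
    if any(s["error"] for s in spans) or root["duration_ms"] >= slow_ms:
        return True
    return head_sample(root["trace_id"], rate)   # sample the unremarkable rest

error_trace = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "duration_ms": 80, "error": False},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "duration_ms": 30, "error": True},
]
slow_trace = [
    {"trace_id": "t2", "span_id": "a", "parent_id": None, "duration_ms": 2400, "error": False},
]
normal_trace = [
    {"trace_id": "t3", "span_id": "a", "parent_id": None, "duration_ms": 90, "error": False},
]

# With a baseline rate of zero, only the interesting traces survive tail sampling.
decisions = [tail_sample(t, rate=0.0) for t in (error_trace, slow_trace, normal_trace)]
print(decisions)
```

The operational cost of the tail-based version is hidden in its input: it needs all the spans of a trace collected in one place before it can decide, which means buffering complete traces somewhere until they finish.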
The instrumentation effort is the other practical cost, and it is ongoing rather than one-time. Services have to be instrumented to produce spans and propagate context, and while automatic instrumentation from frameworks and OpenTelemetry covers a lot, gaps remain, especially at the boundaries between services and in custom code. Context propagation is the fragile part: if even one service in the chain fails to pass the trace ID along, the trace breaks at that point and the downstream work is orphaned. Keeping instrumentation complete and correct across a changing system takes continuous attention, and broken traces are a common frustration that comes from instrumentation gaps.
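Broken traces tend to leave recognizable signs in the stored data, and a simple check can surface them. The sketch below looks for two signs: a trace whose root span belongs to a service that should never be an entry point, and a span whose parent never arrived. The field names and the `entry_services` parameter are assumptions for illustration.

```python
def find_breaks(traces, entry_services):
    """Flag spans that suggest trace context was lost somewhere upstream."""
    breaks = []
    for spans in traces:
        known = {s["span_id"] for s in spans}
        for s in spans:
            if s["parent_id"] is None and s["service"] not in entry_services:
                # An internal service started its own trace: its caller
                # probably did not pass the trace context along.
                breaks.append(("unexpected root", s["service"]))
            elif s["parent_id"] is not None and s["parent_id"] not in known:
                # The span names a parent that was never recorded or exported.
                breaks.append(("missing parent", s["service"]))
    return breaks

traces = [
    # Intact trace: starts at the edge and every parent is present.
    [{"span_id": "a", "parent_id": None, "service": "web"},
     {"span_id": "b", "parent_id": "a", "service": "orders"}],
    # The caller dropped the context, so inventory began a fresh trace.
    [{"span_id": "c", "parent_id": None, "service": "inventory"}],
    # The parent span is absent from the collected data.
    [{"span_id": "d", "parent_id": None, "service": "web"},
     {"span_id": "e", "parent_id": "zz", "service": "billing"}],
]
breaks = find_breaks(traces, entry_services={"web"})
print(breaks)
```

Running a check like this periodically is one way to notice instrumentation gaps before an incident, rather than discovering during one that the trace stops halfway.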
The value of tracing depends on having it before you need it, which shapes how teams should approach the trade-offs. You cannot trace a request that already happened if it was not being traced at the time, so the decision about what to instrument and what to sample has to be made in advance, betting on what will be useful later. This argues for instrumenting broadly and sampling intelligently rather than tracing narrowly, because the cost of missing the one trace you needed during an incident is high. The practical art of tracing is keeping enough of the right data to be useful during incidents without keeping so much that the cost becomes unjustifiable, and that balance is specific to each system.
A slow checkout shows the latency use case clearly. Customers report that checkout is occasionally slow, and the checkout request passes through a web service, an order service, a payment service, an inventory service, and several databases. Without tracing, the team would correlate logs across all of these to find the slow step, which is tedious and uncertain. With tracing, they open a slow trace and see immediately that one span, a call to an external payment provider, took several seconds while everything else was fast. The investigation that could have taken hours takes minutes, because the trace points straight at the slow span.
A confusing error shows the cross-service debugging use case. A request fails with a generic error at the top-level service, and the logs there say only that a downstream call failed, with no indication of why. The trace shows the failure originated three services deep, in a service that timed out connecting to a database, and the timeout propagated up as a vague error at each layer. The trace lets the team follow the failure to its true source rather than starting at the misleading symptom, which is exactly the kind of problem, an error that surfaces far from where it originated, that distributed systems produce constantly.
A system-understanding example shows the aggregate use case. A team inherits a large microservices system and is unsure how requests actually flow through it, because the documentation is stale and the architecture has drifted. By aggregating traces, they build an accurate map of which services call which, how often, and where the latency concentrates, and they discover that a service everyone thought was rarely used is actually called on every request and contributing significant latency. This insight, invisible in any single trace and impossible to get from logs, comes from tracing data viewed in aggregate, and it changes how they prioritize their work.
These examples share a common thread: the problem is about how services interact, and tracing makes that interaction visible. The slow checkout, the propagating error, and the misunderstood architecture are all problems you cannot see by looking at one service alone, because the information that solves them is in the connections between services. Seeing the pattern across latency, errors, and comprehension makes clear why tracing is specifically a distributed-systems tool: it exists to answer questions about cross-service behavior, and those are exactly the questions that distributed architectures make hard and that other observability tools cannot fully answer.
Distributed tracing is a way of following a single request as it travels through all the services that handle it, recording the path and the timing of each step so you can see the whole journey in one view. In a system of many services that call each other, one user action might pass through a dozen services, and tracing stitches together the work each service did on behalf of that request into a single connected trace. Instead of looking at each service in isolation, you see the complete end-to-end story of what happened to that one request, which is exactly what distributed architectures make hard to see otherwise.
Distributed systems need tracing because the flow of a single request is scattered across many services, and no one service has the whole picture. In a single-process application, a stack trace and logs from one place tell the whole story, but when a request passes through many services on different machines, each service sees only its own slice. Correlating logs from a dozen services by hand to reconstruct one request is painful and often impossible at scale. Tracing reassembles that scattered picture automatically, which is why it became essential exactly as microservices became common. It answers the question distributed systems make hard: what happened to this one request.
A trace is the complete record of one request's journey through the system, and it is built from spans. A span represents one unit of work, such as one service handling the request, one database query, or one call to another service, and it records when the work started, how long it took, and details about what it did. Spans are linked by a shared trace ID and by parent-child relationships, so they form a tree that shows which work happened inside which other work and in what order. When you view a trace, you see this tree of spans laid out over time, which shows where the time went.
Metrics, logs, and traces are the three commonly cited pillars of observability, and each answers a different question. Metrics tell you something is wrong in aggregate, that latency or error rates are climbing, but not which request or service. Logs tell you what individual services did, but not how their work connected into one request's journey. Tracing connects the dots, showing the path and timing of individual requests across services. The three work together, and mature setups integrate them, so you can move from a metric showing a problem to a trace showing it in detail to the logs explaining a specific span. Tracing is the pillar that handles the cross-service story.
OpenTelemetry is the dominant open standard for generating traces and other telemetry, and by 2026 it has become the default choice for instrumentation. It provides a vendor-neutral way to instrument code to create spans, attach attributes, and propagate trace context, and the telemetry it produces can be sent to any compatible backend. This matters because it separates the instrumentation from the storage and display, so teams are not locked into one vendor's agents and can change their tracing backend without re-instrumenting their services. Adopting OpenTelemetry is the common recommendation precisely because it preserves that flexibility while standardizing how traces are produced.
Sampling is keeping only a portion of traces rather than all of them, and it matters because tracing every request in a high-traffic system produces an enormous, expensive volume of data. Head-based sampling decides at the start of a request whether to trace it, which is simple but may miss the rare slow or failed requests that are most worth keeping. Tail-based sampling decides after the request finishes, so it can keep all the slow and erroneous traces plus a sample of normal ones, which is more useful but more complex to run. The sampling strategy trades cost against the likelihood of having the trace you actually need during an incident.
For debugging, a trace shows exactly where time was spent and where errors originated, across all the services a request touched. For a slow request, the trace lays out every span on a timeline, so the one span that dominates the duration stands out immediately, turning a cross-service performance hunt into a quick look at a timeline. For an error, the trace shows where the failure actually originated, which is often not where it surfaced, letting the team follow it back to the source rather than starting at the symptom. These cross-service problems are precisely the ones that are hard to debug any other way, which is where tracing is most valuable.
Tracing has two main costs. The first is the data cost: generating, transmitting, storing, and querying spans for a high-traffic system produces a large volume that is expensive to keep and adds some overhead to the traced services, which is why sampling is necessary. The second is the instrumentation cost: services must be instrumented to produce spans and propagate trace context, and while automatic instrumentation covers much of it, gaps remain and need ongoing attention. Context propagation is fragile, since one service that drops the trace ID breaks the trace. These costs are manageable with good practices, but they are real and have to be planned for.
A request that has already happened can be traced only if it was being traced at the time. Tracing records data as the request flows through the system, so if the request was not sampled or the services were not instrumented when it occurred, there is no trace to look at afterward. This is why instrumentation and sampling decisions have to be made in advance, betting on what will be useful later, and why teams generally instrument broadly and sample intelligently rather than tracing narrowly. The cost of missing the one trace you needed during an incident is high, so the practical approach is to keep enough of the right data ahead of time to be useful when something goes wrong.