CorrelationId: The Missing Piece in Debugging Distributed Systems
How to make life easier when dealing with distributed workflows
In distributed systems, a request from the user rarely stays within one service.
It flows through APIs, message queues, worker nodes, background jobs, external services, and sometimes multiple internal systems before finally producing a response.
When everything works, this complexity stays invisible. But when something breaks, it becomes very visible and painful to debug.
And without a Correlation ID, debugging quickly becomes guesswork.
The problem
Imagine a user request that triggers a workflow like this:
User Request
↓
API Service
↓
Message Queue
↓
...
Worker Nodes
↓
...
↓
Message Queue
↓
Final Result
↓
API Service
Each component logs data independently, in its own way. When a failure occurs, we might see scattered log lines like:
INFO Executing tool
INFO Calling external API
SEVERE Timeout occurred
But we almost never have straightforward answers to questions like:
- Which request triggered it?
- Which user session?
- What was the previous step?
Without a way to connect logs across services, debugging becomes slow and unreliable.
What is a Correlation ID?
A Correlation ID is a unique identifier attached to a request and propagated across every system involved in processing it.
CorrelationId = any unique identifier (a simple UUID, the primary key of the row that represents the execution, etc.)
Every service that takes part in handling the request includes this ID in its logs.
Example:
- [correlationId: 7c2d9f7b-18e7-4c7a-bf8c-4d83d13d9c01, service: API] API request received
- [correlationId: 7c2d9f7b-18e7-4c7a-bf8c-4d83d13d9c01, service: API] Publishing execution message
- [correlationId: 7c2d9f7b-18e7-4c7a-bf8c-4d83d13d9c01, service: Worker1] LLM step started
- [correlationId: 7c2d9f7b-18e7-4c7a-bf8c-4d83d13d9c01, service: Worker2] Tool execution completed
- [correlationId: 7c2d9f7b-18e7-4c7a-bf8c-4d83d13d9c01, service: Worker2] Agent execution finished
Now a single ID lets you trace the entire lifecycle of a request, across multiple services.
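As a minimal sketch of the log format above, the snippet below generates an ID once at the entry point and stamps it on every log line. The names `CorrelatedLogging`, `newCorrelationId`, and `logLine` are hypothetical and used only for illustration. A real system would more likely hook this into its logging framework than format strings by hand.

```scala
import java.util.UUID

object CorrelatedLogging {
  // Hypothetical helper: creates a fresh correlation ID at the system's entry point.
  def newCorrelationId(): String = UUID.randomUUID().toString

  // Hypothetical helper: renders a log line in the bracketed format shown above.
  def logLine(correlationId: String, service: String, message: String): String =
    s"[correlationId: $correlationId, service: $service] $message"

  def main(args: Array[String]): Unit = {
    // The ID is generated once and reused by every service that handles the request.
    val id = newCorrelationId()
    println(logLine(id, "API", "API request received"))
    println(logLine(id, "Worker1", "LLM step started"))
  }
}
```

The important part is that only the entry point creates the ID. Every downstream service reads it from the incoming request or message instead of generating its own.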
Why it matters in event-driven systems
Correlation IDs become even more important in event-driven architectures.
In systems that rely on message queues, a single logical request might produce multiple messages and steps.
For example, in an event-driven agent orchestration pipeline:
Execution Start
↓
Context Management
↓
LLM Execution -> Observability
↓
Context Management
↓
Tool Execution -> Observability
↓
...
↓
Execution End
Each of these steps may be executed by different consumers on different worker nodes.
Without a shared identifier, it becomes impossible to understand:
- which steps belong to the same execution
- which tool calls were triggered by which LLM response
- how long the overall execution took
This is where Correlation IDs become the backbone of observability.
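To make the questions above concrete, here is a rough sketch of how a shared ID answers two of them: which steps belong to one execution, and how long that execution took. The `LogEvent` record and its field names are assumptions made for this example, not part of any specific logging system.

```scala
object ExecutionTimeline {
  // Hypothetical log record; the field names are illustrative only.
  final case class LogEvent(correlationId: String, step: String, startMs: Long, endMs: Long)

  // All events of one execution, ordered by start time.
  def timeline(events: Seq[LogEvent], correlationId: String): Seq[LogEvent] =
    events.filter(_.correlationId == correlationId).sortBy(_.startMs)

  // Overall duration: earliest start to latest end, or None if the ID is unknown.
  def totalDurationMs(events: Seq[LogEvent], correlationId: String): Option[Long] = {
    val own = timeline(events, correlationId)
    if (own.isEmpty) None
    else Some(own.map(_.endMs).max - own.map(_.startMs).min)
  }

  def main(args: Array[String]): Unit = {
    // Events from two executions, interleaved the way they would arrive from different workers.
    val events = Seq(
      LogEvent("exec-a", "LLM Execution", 120L, 8520L),
      LogEvent("exec-b", "Tool Execution", 0L, 50L),
      LogEvent("exec-a", "Context Management", 0L, 120L)
    )
    timeline(events, "exec-a").foreach(e => println(s"${e.step}: ${e.endMs - e.startMs}ms"))
    println(totalDurationMs(events, "exec-a"))
  }
}
```

Without the `correlationId` field, the filter in `timeline` has nothing to match on, and the interleaved events cannot be separated.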
A simple example
Let's consider a system where an execution is broken down into discrete steps, each executed asynchronously through message queues.
Steps can run in series or in parallel, depending on how the workflow is planned.
We might also need to pass some shared data along to stateless services, so they can understand the current request context.
case class Context(
  correlationId: String,
  stepId: String,
  previousStepId: String,
  entityId: String,
  organizationId: String,
  userId: String
  // additional shared metadata
)
The above contract is a simple context object passed along in every message across the distributed execution of the request, from start to end.
case class StepMessage(
  stepType: StepType,
  context: Context,
  input: Input
)
The above is a simple message object which can be standardized across all components involved in the distributed execution of the request.
This allows every component in the system to know:
- which execution this step belongs to
- which step triggered the current step
- which user initiated the execution
In practice, the correlation ID often becomes the single most important identifier used during production debugging.
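One way the Context contract above could be carried from step to step is sketched below. The `nextStep` helper is hypothetical, and using an empty `previousStepId` for the first step is an assumption made for this example. The idea is that the correlation ID and user metadata are copied unchanged, while the step pointers move forward.

```scala
import java.util.UUID

object ContextPropagation {
  // Mirrors the Context contract described above.
  final case class Context(
    correlationId: String,
    stepId: String,
    previousStepId: String,
    entityId: String,
    organizationId: String,
    userId: String
  )

  // Hypothetical helper: derives the context for the step triggered by `current`.
  // Everything is copied as-is except the two step pointers.
  def nextStep(current: Context, newStepId: String = UUID.randomUUID().toString): Context =
    current.copy(stepId = newStepId, previousStepId = current.stepId)

  def main(args: Array[String]): Unit = {
    // Assumption: the first step has no predecessor, shown here as an empty string.
    val first = Context("corr-1", "step-1", "", "entity-1", "org-1", "user-1")
    val second = nextStep(first, "step-2")
    println(second)
  }
}
```

Because `nextStep` never touches `correlationId`, a step cannot accidentally start a new trace. It can only extend the existing one.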
An even better approach is to rely on a single stateful service that tracks executions in persistent storage such as a database. That makes much-needed additions like pause and resume, retry on failure, etc. possible. But that deserves a separate post of its own.
What do we get in return?
Observability becomes much easier
With a proper correlation ID strategy you can:
- Reconstruct execution timelines exactly as they happened
Step 1 Context Management 120ms
Step 2 LLM Inference 8.4s
Step 3 Context Management 970ms
Step 4 Tool execution 2.1s
Step 5 Context Management 160ms
Step 6 Result 90ms
- Quickly identify bottlenecks based on where time is spent.
- Debug failures easily. Instead of searching logs randomly, you just query:
correlationId = xyz123
One final note
Correlation ID vs Message ID
Correlation IDs are often confused with message IDs.
- Message ID identifies a specific message in a queue.
- Correlation ID links multiple messages that belong to the same logical request.
A single request might produce many messages, but all of them share the same correlation ID.
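The distinction can be sketched in a few lines. The `Envelope` type and `fanOut` helper below are hypothetical. Many brokers assign or carry their own message identifier, so generating one by hand here is only a stand-in for that.

```scala
import java.util.UUID

object MessageIds {
  // Hypothetical envelope: messageId is unique per message, correlationId is shared.
  final case class Envelope(messageId: String, correlationId: String, payload: String)

  // Wraps each payload in its own envelope: fresh messageId, same correlationId.
  def fanOut(correlationId: String, payloads: Seq[String]): Seq[Envelope] =
    payloads.map(p => Envelope(UUID.randomUUID().toString, correlationId, p))

  def main(args: Array[String]): Unit = {
    val messages = fanOut("corr-1", Seq("context", "llm", "tool"))
    messages.foreach(println)
  }
}
```

Searching by `messageId` finds one message. Searching by `correlationId` finds every message the request produced.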
Takeaway
Correlation IDs are a small addition to a distributed system, but they unlock a powerful ability:
The ability to trace a single request across an entire distributed system
Followup Topics
- Fan-Out / Fan-In Patterns for Parallelism in Distributed Workflows
- Building Observability for AI Agent Infrastructure