
Observability for Event-Driven Systems

Monitoring flow, lag, and failure across asynchronous pipelines


Series: Designing a Microservice-Friendly Datahub

Event-driven systems don’t usually fail loudly. They drift, stall, or silently fall behind. CPU stays low. Pods look healthy. Uptime dashboards glow green—while business data quietly stops moving.

This is why observability in event-driven systems is fundamentally different from observability in request/response architectures. You are not observing services. You are observing pipelines, flows, and time.

This article explains how to build real observability for event-driven systems using the same stack you’ve seen throughout this series: PHP, Redis Streams, .NET, RabbitMQ, Node.js, and relational databases.


The Core Shift: From Nodes to Flow

Traditional observability asks:

  • Is the service up?

  • Is CPU high?

  • Are requests failing?

Event-driven observability asks:

  • Is data still flowing?

  • Where is it slowing down?

  • What is stuck, duplicated, or lost?

  • How long does an event take to propagate?

A system can be up and completely broken at the same time.


The Three Pillars (Reinterpreted)

The classic pillars—logs, metrics, traces—still apply, but their meaning changes.

1. Logs → Event Narratives

Logs should tell the story of an event.

2. Metrics → Flow Health

Metrics should describe movement and pressure, not just resource usage.

3. Traces → Causality Across Time

Traces should connect asynchronous steps that never share a call stack.


Start With the One Thing You Must Have: Correlation IDs

If you can’t answer:

“What happened to this event?”

You don’t have observability—you have noise.

Every event must carry a stable correlation ID from birth to death.

Event envelope example

{
  "event_id": "9b6c8c2a-7c9d-4b7f-9a1e-1e1f9a5b3f6a",
  "correlation_id": "req-81fa3",
  "event_type": "user.updated",
  "event_version": 1,
  "occurred_at": "2025-01-02T10:15:30Z",
  "data": { ... }
}
  • event_id → idempotency

  • correlation_id → observability

They solve different problems. You need both.


Logging: Make Logs Event-Centric

PHP (producer)

$logger->info('Emitting event', [
    'event_type' => 'user.updated',
    'event_id' => $eventId,
    'correlation_id' => $correlationId,
    'user_id' => $userId
]);

.NET Processor

_logger.LogInformation(
    "Processing event {EventId} ({EventType}) Correlation {CorrelationId}",
    evt.EventId,
    evt.EventType,
    evt.CorrelationId
);

Node.js consumer

logger.info("Event received", {
  event_id: event.event_id,
  correlation_id: event.correlation_id,
  type: event.event_type
});

Logs without event IDs are anecdotes.
Logs with IDs become timelines.


Metrics That Actually Matter

1. Queue Depth (Pressure)

RabbitMQ

  • Messages ready

  • Messages unacked

Redis Streams

  • Stream length

  • Pending entries per consumer group

These tell you:

  • Whether the system is falling behind

  • Where pressure is accumulating
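A single depth reading says little; the trend says a lot. A small sketch, assuming you poll depth periodically (from the RabbitMQ management API, or `XLEN` for Redis Streams) and keep the samples:

```javascript
// Given periodic queue-depth samples (oldest first), decide whether
// the backlog is growing, i.e. producers are outpacing consumers.
// Each sample is { t: epochMs, depth: number }.
function isFallingBehind(samples, minGrowthPerSec = 0) {
  if (samples.length < 2) return false;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const growthPerSec =
    ((last.depth - first.depth) * 1000) / (last.t - first.t);
  return growthPerSec > minGrowthPerSec;
}
```

Alerting on sustained positive growth rather than absolute depth avoids paging on harmless bursts.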

2. Lag (Time as a Signal)

Track:

now() - occurred_at

This single metric answers:

“How stale is my system?”

Example (Node.js)

// "metrics" is assumed to be your metrics client (StatsD, Prometheus wrapper, etc.)
const lagMs = Date.now() - Date.parse(event.occurred_at);
metrics.histogram("event.lag.ms", lagMs);

Lag matters more than throughput.
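One practical wrinkle: `occurred_at` comes from the producer's clock, so a skewed consumer clock can make lag go negative. A small guard, sketched as a standalone helper:

```javascript
// Compute event lag, clamping negative values caused by clock skew
// between producer and consumer; skew should not hide real staleness.
function eventLagMs(occurredAt, nowMs = Date.now()) {
  const lag = nowMs - Date.parse(occurredAt);
  return Math.max(0, lag);
}
```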

3. Retry & DLQ Rates

Retries and DLQs are not errors—they are signals.

Track:

  • Retry count per event type

  • DLQ volume over time

  • DLQ age (how long messages sit unhandled)

A growing DLQ is an organizational alert, not just a technical one.
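DLQ age is the easiest of these to overlook. A minimal sketch, assuming dead-lettered messages still carry the `occurred_at` field from their envelope:

```javascript
// Report the age of the oldest dead-lettered message.
// If this number keeps climbing, nobody is working the DLQ.
function oldestDlqAgeMs(dlqMessages, nowMs = Date.now()) {
  if (dlqMessages.length === 0) return 0;
  const oldest = Math.min(
    ...dlqMessages.map((m) => Date.parse(m.occurred_at))
  );
  return nowMs - oldest;
}
```

Plotting this one gauge over time makes the "organizational alert" visible: a sawtooth means someone drains the queue; a ramp means nobody does.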


Distributed Tracing (Without Synchronous Calls)

Tracing async systems is hard—but possible.

The trick

You don’t trace calls.
You trace causality.

Each step emits a span using the same correlation_id.

.NET example (pseudo OpenTelemetry)

using var span = tracer.StartActiveSpan(
    "process.user.updated",
    SpanKind.Consumer,
    new SpanContext(correlationId)
);

Node.js continuation

tracer.startSpan("notify.user", {
  // extractContext: a helper (not shown) mapping the correlation ID to a span context
  links: [{ context: extractContext(event.correlation_id) }]
});

The trace becomes a timeline, not a stack.


Observing Redis Streams Specifically

Redis Streams give you observability hooks for free—if you look.

Pending messages

XPENDING events user-group

If this grows:

  • Consumers are slow

  • Consumers are dead

  • Messages are poison

Idle time

Idle time tells you how long messages have been stuck.

Stuck messages are more dangerous than missing ones.
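These signals can be read straight out of the extended `XPENDING` reply. A sketch, assuming the reply has already been fetched and each entry has the documented shape `[id, consumer, idleMs, deliveryCount]`; the thresholds are illustrative:

```javascript
// Classify pending stream entries from an extended XPENDING reply.
// High idle time suggests a slow or dead consumer;
// a high delivery count suggests a poison message.
function classifyPending(entries, { maxIdleMs = 60000, maxDeliveries = 5 } = {}) {
  return entries.map(([id, consumer, idleMs, deliveries]) => ({
    id,
    consumer,
    stuck: idleMs > maxIdleMs,
    poison: deliveries > maxDeliveries,
  }));
}
```

Routing `poison` entries to a DLQ and reassigning `stuck` entries (e.g. via `XCLAIM`) are the two standard remediations.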


Observing RabbitMQ Pipelines

RabbitMQ exposes exactly what event systems need:

  • Queue depth

  • Unacked messages

  • Consumer count

  • Publish rate vs ack rate

Alert when:

  • Queue grows steadily

  • Ack rate drops to zero

  • One consumer owns all unacked messages

Those are architectural smells, not just incidents.
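The three alert conditions above can be expressed as one pure check over a queue snapshot. A sketch with illustrative field names (`depthTrend`, `publishRate`, `ackRate`, `unackedByConsumer` are assumptions, not the exact management API shape):

```javascript
// Evaluate the RabbitMQ alert conditions against a queue snapshot.
function queueAlerts(q) {
  const alerts = [];
  if (q.depthTrend > 0) alerts.push("queue growing steadily");
  if (q.publishRate > 0 && q.ackRate === 0)
    alerts.push("ack rate is zero while publishes continue");
  const total = q.unackedByConsumer.reduce((a, b) => a + b, 0);
  if (
    total > 0 &&
    q.unackedByConsumer.length > 1 &&
    Math.max(...q.unackedByConsumer) === total
  )
    alerts.push("one consumer owns all unacked messages");
  return alerts;
}
```

Keeping the rules pure like this makes them trivially testable, independent of how the snapshot was fetched.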


Event Timeline Debugging (The Most Powerful Tool)

With proper logging and IDs, debugging becomes linear:

  1. Find event_id

  2. Search logs across services

  3. Reconstruct the journey

You can answer:

  • Was it emitted?

  • Was it buffered?

  • Was it processed?

  • Was it retried?

  • Was it dead-lettered?

Without this, debugging is guesswork.
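The reconstruction step is mechanical once logs are structured. A sketch, assuming log lines aggregated from all services each carry `ts`, `service`, `event_id`, and `message` fields (as in the logging examples above):

```javascript
// Reconstruct one event's journey from structured logs collected
// across services: filter by event_id, order by time, render a timeline.
function eventTimeline(logs, eventId) {
  return logs
    .filter((l) => l.event_id === eventId)
    .sort((a, b) => Date.parse(a.ts) - Date.parse(b.ts))
    .map((l) => `${l.ts} [${l.service}] ${l.message}`);
}
```

In practice this is a saved query in your log aggregator; the point is that it only works if every service logged the ID in the first place.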


The Anti-Patterns That Kill Observability

Avoid these at all costs:

  • Logging without event IDs

  • Metrics only at service level

  • No lag measurement

  • DLQs without alerts

  • “We’ll add observability later”

Later is always too late.


Observability Is a Design Feature

You cannot bolt observability onto an event-driven system after the fact. It must be designed in:

  • Event envelopes carry IDs

  • Consumers propagate context

  • Pipelines expose pressure

  • Failures surface visibly

This is architecture, not tooling.


A Simple Observability Checklist

If you can answer these in production, you’re doing well:

  • How many events are currently stuck?

  • Where are they stuck?

  • How old is the oldest event?

  • Which event types retry most?

  • Can I trace one event end-to-end?

If not, the system is operating on faith.


Closing Thought

Event-driven systems trade immediacy for resilience.
Observability is how you regain confidence.

When you can see:

  • Time

  • Flow

  • Pressure

  • Failure

Asynchrony stops being scary and starts being predictable.

Good observability doesn’t make systems faster.
It makes them trustworthy.

And trust is what lets distributed systems scale—technically and humanly.

Designing a Microservice-Friendly Datahub

Part 3 of 22

A series on microservice-friendly Datahub architecture, covering event-driven principles, decoupling, and a real-world implementation with Redis, RabbitMQ, a REST API, and a processor service, showing how distributed systems communicate at scale.

Up next

Dead Letter Queues and Retry Strategies

Designing safe failure and recovery in event-driven systems