Lessons Learned and Future Improvements for My CSL Datahub Implementation
Reflections on building, evolving, and scaling a real Datahub

Series: Designing a Microservice-Friendly Datahub
PART III — CASE STUDY: MY CSL DATAHUB IMPLEMENTATION
Previous: End-to-End Data Flow Scenarios in Datahub
Architectures don’t become “good” because they were well-designed on day one. They become good because they survive contact with reality, accumulate scar tissue, and adapt without collapsing.
This final article steps away from diagrams and patterns to reflect on what the CSL Datahub taught us in practice: what worked, what surprised us, where the cracks appeared first, and how the system could evolve if built again today.
This is where architecture grows up.
Disclaimer (Context & NDA)
The CSL Datahub implementation discussed throughout this case study was designed and built in 2021. While the architectural principles remain sound, some tooling choices could be updated today. To comply with NDA requirements, business-specific logic, schemas, and operational details are intentionally generalized.
What Worked Well (And Why It Worked)
Some design decisions paid off immediately—and kept paying dividends.

1. Clear Ownership of State
Making the CSL Web App and its MySQL database the single source of truth removed ambiguity early.
Benefits:
No split-brain state
No cross-service writes
No schema politics
Every downstream system knew its role: react, don’t mutate.
This clarity prevented entire classes of bugs.
2. Event Emission After Commit
Events were emitted after database transactions completed, never before.
// Commit the business transaction first...
$db->commit();

// ...then append the event to the Redis Stream ('*' lets Redis assign the entry ID).
$redis->xAdd(
    'csl:events',
    '*',
    [
        'type'        => 'user.updated',
        'user_id'     => $user->id,
        'occurred_at' => time()
    ]
);
This simple discipline avoided phantom events, partial updates, and confusing rollback scenarios. It also made replay and reasoning straightforward.
3. Redis Streams as a Shock Absorber
Redis Streams quietly did exactly what they were meant to do:
Absorb bursts
Decouple producers from consumers in time
Protect the legacy app
They were rarely discussed—and that’s the highest compliment.
When systems are calm under pressure, it’s usually because something unglamorous is doing its job well.
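For a sense of how unglamorous: here is a minimal sketch of the consuming side, assuming phpredis (as in the emission snippet above) and a hypothetical consumer group named csl-processors. The group lets the Processor drain the stream at its own pace, and lets several instances share the work.

// Hypothetical group/consumer names; MKSTREAM creates the stream if it does not exist yet.
// (Run the group creation once; it errors if the group already exists.)
$redis->xGroup('CREATE', 'csl:events', 'csl-processors', '0', true);

while (true) {
    // Block for up to 5 seconds waiting for new entries, reading at most 100 at a time.
    $batch = $redis->xReadGroup('csl-processors', 'processor-1', ['csl:events' => '>'], 100, 5000);
    if (empty($batch)) {
        continue; // nothing arrived within the block window
    }
    foreach ($batch['csl:events'] as $id => $fields) {
        handleEvent($fields);                                // application-specific handling (placeholder)
        $redis->xAck('csl:events', 'csl-processors', [$id]); // acknowledge only after successful processing
    }
}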
4. RabbitMQ for Decentralized Consumption
RabbitMQ scaled not just with traffic, but with organizational complexity.
Teams could:
Add consumers independently
Remove consumers without coordination
Evolve their logic safely
Publish/subscribe worked as intended—no central registry of dependencies, no choreography meetings.
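To make that concrete: a new consumer was roughly this much code, and nobody else had to review it. A minimal php-amqplib sketch; the exchange name csl.events, the queue name, and the binding key are illustrative, not the real topology.

use PhpAmqpLib\Connection\AMQPStreamConnection;

$connection = new AMQPStreamConnection('rabbitmq', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Each team declares and owns its queue, bound to the shared topic exchange.
$channel->exchange_declare('csl.events', 'topic', false, true, false);
$channel->queue_declare('billing.user-events', false, true, false, false);
$channel->queue_bind('billing.user-events', 'csl.events', 'user.*');

// Consume independently; no other team needs to know this consumer exists.
$callback = function ($msg) use ($channel) {
    // team-specific handling here (placeholder)
    $channel->basic_ack($msg->getDeliveryTag());
};
$channel->basic_consume('billing.user-events', '', false, false, false, false, $callback);

while ($channel->is_consuming()) {
    $channel->wait();
}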
What Was Harder Than Expected
Some challenges only appear once the system is alive.

1. Operational Visibility Took Real Effort
Event-driven systems fail between services, not inside them.
We learned quickly that:
CPU metrics weren’t enough
Service uptime lied
Queue lag told the real story
It took deliberate work to monitor:
Redis stream lag
RabbitMQ queue depth
Processor throughput
DLQ growth
Visibility wasn’t optional—it was a feature we had to build.
2. Idempotency Was Non-Negotiable
At-least-once delivery meant duplication was inevitable.
The teams that embraced idempotency early slept better:
INSERT INTO processed_events (event_id)
VALUES (:event_id)
ON DUPLICATE KEY UPDATE event_id = event_id;
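In consumer code that guard is one extra statement before the real work. A minimal PDO sketch reusing the query above; $pdo, $event, and applyEvent() are placeholders.

$stmt = $pdo->prepare(
    'INSERT INTO processed_events (event_id)
     VALUES (:event_id)
     ON DUPLICATE KEY UPDATE event_id = event_id'
);
$stmt->execute(['event_id' => $event['event_id']]);

// MySQL reports 0 affected rows when the row already existed unchanged,
// so a zero rowCount() means this event was already processed.
if ($stmt->rowCount() === 0) {
    return; // duplicate delivery: safely ignore
}

applyEvent($event); // the real side effect now runs at most once per event_id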
The teams that didn’t… learned the hard way.
Idempotency isn’t clever. It’s defensive driving.
3. The Processor Attracted Gravity
Despite best intentions, the Processor constantly tempted engineers with convenience.
“If we already have the event here, why not just…?”
That sentence is how God services are born.
It required discipline to:
Keep logic shallow
Push domain meaning to consumers
Split handlers early when responsibilities diverged
Architectural boundaries don’t enforce themselves.
Bottlenecks Discovered Over Time
No system escapes bottlenecks—only ignorance of them.

Processor Throughput
As event volume grew, Processor lag became the first scaling signal.
The fix wasn’t “more CPU.” It was:
Horizontal scaling
Splitting handlers by responsibility
Reducing synchronous API calls
The bottleneck revealed where coupling still existed.
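On the horizontal-scaling point: RabbitMQ already distributes a queue across all of its consumers, so scaling out was mostly a matter of running more worker instances and capping prefetch so a slow one could not hoard messages. A hedged php-amqplib sketch, with the queue name and $callback as in the earlier consumer example.

// Run several copies of the same worker; RabbitMQ round-robins queued messages across them.
// The prefetch limit caps unacknowledged messages per consumer, so one slow instance
// cannot grab work while the others sit idle.
$channel->basic_qos(null, 10, false);
$channel->basic_consume('billing.user-events', '', false, false, false, false, $callback);

while ($channel->is_consuming()) {
    $channel->wait();
}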
Message Storms
Some early event designs were too chatty.
Emitting events on every micro-change caused:
Unnecessary fan-out
Consumer overload
Hard-to-reason flows
The solution was not throttling—it was semantic restraint:
Fewer events
More meaningful events
Better aggregation
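For example (event names are illustrative): instead of emitting user.email.changed, user.phone.changed, and user.address.changed for a single profile save, one aggregated event carried the same information with a fraction of the fan-out.

// One meaningful event per business action instead of one per field change.
$redis->xAdd('csl:events', '*', [
    'type'           => 'user.profile.updated',
    'user_id'        => $user->id,
    'changed_fields' => json_encode(['email', 'phone', 'address']),
    'occurred_at'    => time()
]);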
Architecture Is a Living System
One of the most important lessons was psychological, not technical.
Architectures don’t “finish.” They age.
What worked at:
3 modules
2 teams
10k events/day
…needed adjustment at:
10 modules
6 teams
1M events/day

Treating architecture as frozen design would have killed the system. Treating it as a living system kept it adaptable.
Future Improvements (If We Were Building This Today)
With hindsight—and modern tooling—several evolutions would make sense.

1. Kafka for High-Volume Event History
If:
Event replay became core
Stream processing emerged
Retention requirements grew
Kafka would be a natural next step.
Not as a replacement for everything—but as a data backbone where history matters.
2. Splitting the Processor by Responsibility
Rather than one Processor service:
One for Redis → RabbitMQ
One for RabbitMQ → CSL API
One for external integrations
This would reduce blast radius and simplify scaling decisions.
3. Formal Event Contracts
Early contracts were implicit. Later, they deserved structure.
A shared schema repo with versioning would:
Reduce accidental breakage
Improve onboarding
Enable better validation
Contracts are how event-driven systems communicate trust.
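What "structure" could look like in practice, as a sketch only (field names and the version scheme are illustrative, not what we shipped): every event carries an explicit type and schema version, and consumers validate the envelope before touching the payload.

// Illustrative contract check: reject events that do not match the envelope
// a shared, versioned schema repository would define.
function assertValidEnvelope(array $event): void
{
    foreach (['type', 'schema_version', 'user_id', 'occurred_at'] as $field) {
        if (!isset($event[$field])) {
            throw new InvalidArgumentException("Missing required field: {$field}");
        }
    }
    if (!in_array($event['schema_version'], ['1', '2'], true)) {
        throw new InvalidArgumentException("Unsupported schema version: {$event['schema_version']}");
    }
}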
4. Deeper Observability by Default
If rebuilt today:
Distributed tracing would be first-class
Correlation IDs everywhere
Pipeline-level dashboards from day one
Debugging async systems without visibility is archaeology.
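Concretely, that can start as small as stamping every event with a correlation ID at the source and logging it at every hop. A sketch, with the field name chosen arbitrarily:

// Generated once at the edge (e.g. per incoming HTTP request) and carried through every event.
$correlationId = bin2hex(random_bytes(16));

$redis->xAdd('csl:events', '*', [
    'type'           => 'user.updated',
    'user_id'        => $user->id,
    'correlation_id' => $correlationId,
    'occurred_at'    => time()
]);

// ...and on the consuming side, every handler logs the same ID,
// so one search reconstructs the whole asynchronous flow.
error_log("[{$event['correlation_id']}] handled {$event['type']}");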
Reflection Is Part of Engineering
The biggest takeaway isn’t about Redis, RabbitMQ, or .NET.
It’s this:
Architecture is not about being right.
It’s about being able to change without fear.
The CSL Datahub worked not because it was perfect, but because it respected:
Boundaries
Ownership
Failure
Time
Those principles outlast tools.
Closing the Case Study (And the Series)
This article closes the CSL case study—but not the conversation.
If there’s one thing worth carrying forward, it’s this mental shift:
design systems as conversations, not call graphs.
When systems speak in facts, tolerate delay, and respect ownership, they scale—not just technically, but humanly.
That’s what Datahub architecture is really about.
And that’s the kind of architecture worth building.
Optional Extra Articles
If you’d like to go deeper, the following optional articles explore specific corners of the architecture—contracts, failure handling, observability, tooling trade-offs, and scaling pressure points that deserve their own focused discussion.
♻️ Dead Letter Queues and Retry Strategies
🔍 Observability for Event-Driven Systems
⚖️ Redis Streams vs Kafka: Choosing the Right Event Backbone






