Circuit Breaker With No Circuit to Break
A fintech firm had experienced three outages in three months. Their SLA had dropped to 98.4% against a contractual requirement of 98.5%. Management was facing penalty clauses and real client retention risk.
The system at the center of it was a custom-built back-end proxy and load balancer. Twenty years old. Supporting 10,000 virtual machines across two on-premises data centers, AWS, and Azure. It handled all egress calls to the internet, load-balanced intra-network traffic, and managed VPN connections to dozens of client systems.
Management wanted it replaced. Immediately. In the middle of an active migration of the entire on-premises infrastructure to Azure.
The architecture team proposed an interim measure: implement a circuit breaker pattern.
I was brought in to assess the system and build a replacement roadmap.
What I Actually Found
I read the documentation, reviewed the configuration, and consulted the development and support teams.
The system was old. But it was well-designed, stable, and in a clustered configuration it was performing well. It was not the problem.
Two of the three failures had been diagnosed. Both were vendor failures — downstream connections either took too long or dropped entirely. In both cases the system did not crash. It ran out of configured connections and stopped.
The maximum connection count had been set at the original implementation. Twenty years ago. It had never been updated.
After the first failure, the support team added another server to the cluster. A reasonable response to a system failure. Ineffective against a configuration ceiling — adding machines does not raise a connection limit set in a configuration file.
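The failure mode is easy to reproduce in miniature. The sketch below is a toy stand-in (not the firm's actual proxy, whose internals are unknown): a pool with a hard-coded connection ceiling. Once the configured slots are in use, new work fails no matter how much hardware sits behind the pool.

```python
import threading

class BoundedConnectionPool:
    """Toy stand-in for a proxy's configured connection limit.

    Once max_connections slots are in use, further acquisitions fail
    immediately. Adding servers behind the pool changes nothing,
    because the ceiling lives in configuration, not in capacity.
    """

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self._slots = threading.Semaphore(max_connections)

    def acquire(self):
        # Non-blocking: mirrors a proxy that stalls or rejects at the ceiling.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("connection limit exhausted")

    def release(self):
        self._slots.release()


# With a ceiling of 2, the third connection fails even though the
# "hardware" behind the pool is irrelevant to the limit.
pool = BoundedConnectionPool(2)
pool.acquire()
pool.acquire()
```

Hanging vendor connections make this worse: slow downstreams hold slots open, so a fixed ceiling set twenty years ago is consumed long before any machine is actually busy.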
The second failure followed anyway, when a major vendor went completely offline.
The third failure was undiagnosed.
My conclusion: the system was grossly misconfigured and improperly monitored. The connection count was set for a world that no longer existed. The monitoring metrics were not alerting the team until after connections were exhausted — by which point the damage was done.
The Actual Recommendation
I confirmed my findings with both the development and support teams. Then I recommended four things in order of priority:
First, significantly increase the configured connection limit — not merely doubled, but an order of magnitude higher. Research showed a single modern machine could comfortably handle ten times the current limit.
Second, change the monitoring metrics to trigger alerts before connections were exhausted rather than after failure.
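The alerting change is conceptually simple: page on utilization crossing a threshold, not on failure. A minimal sketch, with a hypothetical warning ratio (the right value depends on how fast the pool fills):

```python
def connection_alert(active, max_connections, warn_ratio=0.8):
    """Return True when connection usage crosses the warning threshold.

    The point is to page the team while headroom still exists,
    rather than after the pool is exhausted and clients are failing.
    warn_ratio here is illustrative, not a recommended value.
    """
    return active >= warn_ratio * max_connections
```

The same threshold works in any monitoring stack; what matters is that the alert fires on a leading indicator (utilization) instead of a trailing one (errors).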
Third, explore the circuit breaker pattern as an option — not the primary fix, but a possible additional stabilization measure.
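For readers unfamiliar with the pattern: a circuit breaker wraps calls to a flaky downstream, trips open after repeated failures so the caller fails fast instead of holding connections, and periodically lets a trial call through. A minimal sketch (thresholds are illustrative, not tuned values):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker.

    Trips open after failure_threshold consecutive failures; while open,
    calls fail fast without touching the downstream. After reset_timeout
    seconds, one trial call is allowed through (half-open); success
    closes the circuit again.
    """

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow this trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.opened_at = None
            return result
```

Note what the pattern does and does not do here: it would have limited the blast radius of a slow vendor, but it would not have fixed a connection ceiling set for a world twenty years gone. That is why it was a secondary measure, not the fix.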
Fourth, plan a proper long-term replacement on a sensible timeline. After thorough comparison of available options I recommended Envoy, with HAProxy and Nginx as alternatives should Envoy fail a proof of concept.
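One reason Envoy fits this role is that it treats both problems above as first-class configuration: connection limits and circuit breaking are per-cluster thresholds, not application code. An illustrative cluster fragment (all names and values are placeholders, not the firm's configuration):

```yaml
clusters:
  - name: vendor_upstream          # hypothetical vendor endpoint
    connect_timeout: 5s
    type: STRICT_DNS
    load_assignment:
      cluster_name: vendor_upstream
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: vendor.example.com
                    port_value: 443
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 10000      # the ceiling lives here, explicitly
          max_pending_requests: 1000
          max_retries: 3
```

The same configuration surface exists in HAProxy (`maxconn`) and NGINX, which is part of why they were named as fallbacks should Envoy fail a proof of concept.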
Critically: do not attempt full replacement in the middle of migrating 10,000 virtual machines across three environments. The risk profile of replacing a mission-critical single-point-of-failure system during an active infrastructure migration is not justified when the actual problem is a configuration file.
The system stabilized. The replacement is now planned properly for after the migration completes.
The Pattern Worth Naming
This story is not really about connection pools or circuit breakers.
It is about what happens to decision-making under pressure in large technical organizations.
When a system fails visibly and repeatedly, the instinct is to reach for solutions before the diagnosis is confirmed. The more sophisticated the team, the more sophisticated — and expensive — the wrong answer becomes. Full replacement projects get scoped. Invasive architectural patterns get proposed. Budgets get approved.
Nobody stops to verify whether the premise is actually correct.
In this case the premise was wrong. The system was not failing. It was misconfigured. Those require completely different responses and the cost difference between them is enormous.
The most valuable question in enterprise IT is one that almost never gets asked under pressure:
Are we certain we know what is actually wrong?
Not what appears to be wrong. Not what pattern the symptoms resemble. What is actually, verifiably wrong — traced to its root cause before a single solution is proposed.
Two days of diagnosis answered that question here. The answer changed everything.
Russ Profant is a solutions architect and independent consultant with 30 years of experience across HP, Morgan Stanley, CIBC, and RBC. He runs PC4IT, offering cloud cost diagnostics and architecture advisory to mid-market organizations. pc4it.com