
You Don’t Have a Custom System, You Have a Boondoggle
When I joined the Capital Research group of one of the top global banks as a contractor, the first thing I did was what I always do.
I spent a month exploring the system, examining every nook and cranny of it.
Not out of insubordination. Out of professional necessity. You cannot fix what you do not understand. And understanding a system nobody has documented, in an organization where the people who built it have left, requires time before it requires action.
What I found in that first month set the direction for the next two years.
The Environment
The bank’s Capital Research was a 5,000-person organization operationally dependent on a custom-built system that nobody fully understood. But they sure were proud of it; they said it was the only one on Wall Street. It never entered their minds that this could be a significant problem. Simply put, custom-built systems are less reliable and prone to all kinds of issues that commercial systems, with their breadth of clientele, overcame long ago.
The development manager — my direct boss — had been hired one month before me. He did not understand the system; all he was given were the IP addresses of the 12 on-prem production servers and 3 on-prem dev/test servers, a single sheet in an Excel workbook. The entire onshore development team in New Jersey had left, taking their miracle system, and their institutional knowledge, to another venerable Wall Street firm. The only people with genuine familiarity with the system were young developers in Beijing whose English was not strong enough to transfer what they knew in any meaningful depth.
I could not tap their knowledge directly. I had to investigate everything myself — through logs, through code, through the user’s experience.
What I found was a convoluted monolith that made little sense architecturally. A front-end proxy and load balancer that served no discernible purpose. A front-end SOAP API for sending documents when everything sat on the same network and all that was needed was to copy files to another directory.
But best of all, or worst, depending on your point of view, was that they used a brand-new Google technology called Angular, built for assembling dynamic documents in real time in the user’s browser, for the exact opposite purpose: assembling pre-existing static documents which, by law, can never change; otherwise they are different documents.
In short, this was infrastructure grotesquely disproportionate to the actual problem it was solving, dressed up in complexity that nobody had questioned because nobody had understood it well enough to question it.
The Network Nobody Wanted to Own
The application problems were compounded by infrastructure beneath them that nobody had examined carefully either.
A simple directory listing on an Asian server from North America took three minutes. In a system distributing files across regions, that latency was not a minor inconvenience — it was an architectural constraint shaping everything running on top of it.
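A latency like that is easy to quantify before anyone argues about whose problem it is: just time the listing itself. A minimal sketch of that measurement, using a hypothetical local directory as a stand-in for the remote Asian share so it runs anywhere:

```python
import os
import subprocess
import tempfile
import time

# Stand-in for the cross-region file share; at the bank the three-minute
# listing happened over the real North America to Asia network path.
share = tempfile.mkdtemp()
for name in ("doc1.pdf", "doc2.pdf"):
    open(os.path.join(share, name), "w").close()

# Time the same operation an operator would run by hand.
start = time.monotonic()
listing = subprocess.run(["ls", "-l", share], capture_output=True, text=True)
elapsed = time.monotonic() - start

print(f"{len(listing.stdout.splitlines())} lines in {elapsed:.3f}s")
# Anything beyond a few seconds for a plain directory listing points at the
# network or name-service layer, not at the disks Dell was nominally asked about.
```

Measured against the remote mount instead of a temp directory, a number like 180 seconds makes the case without any further interpretation.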
I raised an emergency ticket with Dell, labelled as a hard disk issue because that was the closest available category. Dell could have looked at it, noted it was a network problem outside their hardware remit, and closed the ticket. Instead they assembled a full team of experts overnight, looked past the label, identified the actual network misconfiguration, and delivered a clear diagnosis with an implied fix. On a problem that was technically not theirs to solve. In one night.
The bank’s own networking team reviewed the same findings and concluded they saw no problem.
Dell diagnosed it in one night. The bank has not fixed it in ten years. I designed around it and moved on.
Some problems exist not because nobody found them but because the organization that owns them has decided, through inaction if not intention, that finding them is more inconvenient than living with them.
The Numbers Nobody Had Compiled
The statistics I needed to understand the system’s performance did not exist in any accessible form. I compiled them myself from logs.
What they showed was striking.
The system was failing at a rate of 8 to 9 percent daily. Every day. Consistently. The support team was dismissing 99 percent of those failures because their manual told them to. The failures were understood, officially, as normal system behavior.
They were not.
When I looked at what was actually failing, 90 percent of the failures were the front-end attempting to log incoming messages — a pseudo-failure. The system was reporting failure on an operation that was not critical to its core function. Real failures were buried inside a noise floor of logged non-events that had been codified into official procedure as acceptable.
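The separation of real failures from pseudo-failures is a straightforward classification pass over the logs. A minimal sketch of the idea, with a hypothetical log format and marker string standing in for the bank's actual ones, which are not public:

```python
import re
from collections import Counter

# Hypothetical log lines for illustration; the real format differed.
LOG_LINES = [
    "2014-03-02 09:15:01 ERROR frontend.logger failed to log incoming message id=4411",
    "2014-03-02 09:15:02 ERROR dispatch delivery failed doc=CR-2291 region=APAC",
    "2014-03-02 09:15:03 INFO dispatch delivered doc=CR-2292",
    "2014-03-02 09:15:04 ERROR frontend.logger failed to log incoming message id=4412",
]

# A failure whose only symptom is the front end failing to *record* a message
# is a pseudo-failure: the core delivery path was never affected.
PSEUDO_FAILURE = re.compile(r"frontend\.logger failed to log")

def classify(lines):
    counts = Counter()
    for line in lines:
        if " ERROR " not in line:
            counts["success"] += 1
        elif PSEUDO_FAILURE.search(line):
            counts["pseudo_failure"] += 1
        else:
            counts["real_failure"] += 1
    return counts

counts = classify(LOG_LINES)
total = sum(counts.values())
print(counts)  # Counter({'pseudo_failure': 2, 'real_failure': 1, 'success': 1})
print(f"raw failure rate:  {(counts['pseudo_failure'] + counts['real_failure']) / total:.0%}")
print(f"real failure rate: {counts['real_failure'] / total:.0%}")
```

The gap between the two rates is the point: a support manual written against the raw number codifies noise as normal, while the real number is what actually needs fixing.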
Nobody had looked at this carefully enough to make that distinction. The support manual had been written by people who also had not looked carefully enough. The support team was operating on inherited confusion presented as institutional knowledge.
What I Told Management
After three months — one month of orientation, two months of deeper investigation — I asked for a meeting with management.
I told them two things.
First: you believe you have a unique, sophisticated custom system. You do not. You have a boondoggle — an over-engineered, under-documented monolith that nobody fully understands, running on infrastructure three times larger than your traffic requires, failing at 8 to 9 percent daily in ways your support team has been trained to ignore.
Second: you have two options. Continue as you are — the failure rate will never meaningfully improve because nobody understands the system well enough to fix it systematically. Or let me take it apart and reassemble it correctly — slowly, deliberately, without disrupting the 5,000 people depending on it — and I will get you close to 100 percent success rate.
They chose the second option.
That choice required them to trust a three-month contractor’s diagnosis of a system their own manager did not understand, over the institutional inertia of continuing with what they had. That is not a small act of organizational trust. I understood what it meant when they made it.
The Work
We converted the monolith to microservices over two years. Slowly and deliberately. The system remained in production throughout — serving a 5,000-person organization, processing capital research data, without disruption.
Every component examined. Every unnecessary element removed or replaced. The front-end proxy that served no purpose — removed. The regional over-provisioning rationalized against actual traffic requirements. The architecture rebuilt around what the system actually needed to do rather than what it had accumulated over years of additions nobody fully understood.
The Beijing developers — with whom I could never fully bridge the language gap — executed the direction faithfully over two years without the kind of detailed specification a fully documented system would have provided. That required a different kind of trust running in the opposite direction. I trusted their technical execution. They trusted the architectural direction. Across a language gap and a twelve-hour time difference, it worked.
At the end of two years the daily failure rate was 0.1 percent.
I thought that was still too high.
Why 0.1 Percent Was Not Enough
That reaction requires explanation because most people would consider 0.1 percent an extraordinary outcome — from 8 to 9 percent daily failures to one failure in a thousand operations, in two years, starting from a system with no documentation, no SMEs, and a network problem the infrastructure team refused to acknowledge.
But I had been trained at Morgan Stanley, where thousands of global systems ran at 100 percent daily success rate as a matter of operational culture — not aspiration, culture. That standard had become internalized. 99.9 percent felt like unfinished work because at the level of operational discipline I had been formed by, it was.
The 0.1 percent was not a celebration. It was a remaining problem to be solved.
What This Story Is Actually About
I want to be precise about the trust dimension because it is easy to read this as a competence story and miss what matters more.
The competence is documented — diagnosis, prediction, two-year delivery, measurable outcome. Those are facts.
The trust dimension is this: management at a 5,000-person organization believed a contractor they had known for three months when he told them their foundational system was not what they thought it was. They chose the harder option on the basis of that belief. They gave me two years and the latitude to work deliberately rather than reactively.
And the Beijing developers trusted an architectural direction they could not fully discuss in their second language, executing faithfully across a gap that most organizations would have used as an excuse for the work not getting done.
Trust ran in every direction in that engagement. Not just from management to me. From me to management. From me to the Beijing team. From the Beijing team back to the architecture we were building together.
That is what a functioning technical organization actually looks like. Not hierarchy. Not process. Not a manual that codifies inherited confusion as official procedure.
Trust, correctly placed, consistently validated.
I have not always worked in organizations like that. When I have, the work has always been worth doing.
Russ Profant is a solutions architect and independent consultant with 30 years of experience across financial services, investment banking, healthcare, and government. He runs PC4IT, offering cloud cost diagnostics and architecture advisory to mid-market organizations. pc4it.com

