The Monolith That Was Perfect
Let me tell you a story about the most beautiful system I ever saw. It was a single Django application backed by a single PostgreSQL database, deployed on a single server, managed by a single developer. It processed orders for a regional e-commerce company. Deployments were a git pull and a process restart. The entire architecture fit on a napkin — because it was a napkin sketch, drawn at a coffee shop where the founder first described the idea.
The response times were under 50 milliseconds. The database schema was 14 tables, each one purposeful and well-indexed. There were no network partitions to worry about because there was no network. There were no distributed transactions because there was nothing to distribute. Debugging meant tailing a single log file. Monitoring was a cron job that checked if the process was alive.
It was a masterpiece of simplicity.
Then users showed up.
Not a few users. Not the gentle trickle the founder had projected on his pitch deck. A flood. A viral tweet, a mention on a popular newsletter, and suddenly this elegant little monolith was fielding 10,000 requests per second from three continents. The single PostgreSQL instance started sweating. The server’s CPU looked like a heart monitor during a horror movie. And that beautiful, napkin-simple architecture? It was about to become something else entirely.
This is a story that plays out thousands of times a year, in startups and enterprises alike. Every system, given enough success, eventually outgrows the single-machine model. Not because the engineers were bad at their jobs. Not because they chose the wrong framework. But because distribution is gravity, and all successful systems fall toward it.
The Five Forces of Distribution
Systems do not become distributed because engineers read a blog post about Kubernetes and got excited on a Friday afternoon. (Well, sometimes they do, but that is a different essay.) Systems become distributed because real, physical forces push them apart. I think of these as the five gravitational forces of distribution, and once you learn to see them, you notice them everywhere.
1. Scale: The Traffic Avalanche
This is the obvious one. Your single server handles 1,000 requests per second beautifully. Then marketing does its job, and now you need to handle 50,000. You cannot simply buy a server that is 50 times more powerful — vertical scaling has a ceiling, and that ceiling is made of physics and your cloud provider’s pricing page.
So you add more servers. Congratulations, you now have a distributed system. You also have new questions: How do you route traffic? How do you share state? What happens when server #3 has a different version of the code than server #7?
The analogy I like is a restaurant kitchen. One great chef can cook for a 20-seat bistro. But when you need to serve 500 covers on a Saturday night, you do not find a chef who is 25 times faster. You hire more cooks. And the moment you do, you have coordination problems that the solo chef never dreamed of.
2. Geography: The Speed of Light Problem
Light travels at roughly 300,000 kilometers per second in a vacuum, and about a third slower through optical fiber. That sounds fast until you realize that a round trip from New York to Sydney takes about 107 milliseconds even at the vacuum speed-of-light limit, roughly 160 milliseconds through fiber, and real-world latency is usually above 200 milliseconds once routing and queuing get involved. For a single API call, that is manageable. For a page load that requires 30 sequential API calls (do not do this, but people do), you are looking at several seconds of latency that no amount of code optimization will fix.
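The arithmetic is worth doing once yourself. A back-of-the-envelope sketch, assuming a roughly 16,000 km New York to Sydney great-circle distance and light moving at about two-thirds of c in fiber:

# Back-of-the-envelope latency floor for a New York-Sydney round trip
distance_km = 16_000          # assumed great-circle distance
round_trip_km = 2 * distance_km

c_vacuum_km_per_s = 300_000   # speed of light in a vacuum
c_fiber_km_per_s = 200_000    # roughly two-thirds of c in optical fiber

print(round_trip_km / c_vacuum_km_per_s * 1000)  # ~107 ms: the hard physical floor
print(round_trip_km / c_fiber_km_per_s * 1000)   # ~160 ms: the floor for light in fiber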
The only solution is to put your system closer to your users. Once your data and compute live in multiple regions, you have a distributed system — and you have inherited every problem that comes with replicating state across geographic boundaries. Suddenly “consistency” is not a given; it is a negotiation.
3. Reliability: The Redundancy Tax
A single server has a certain probability of failure. Let us be generous and say it has 99.9% uptime — roughly 8.7 hours of downtime per year. For a side project, that is fine. For a payment processor, that is a career-ending amount of downtime.
The only way to beat the reliability of a single machine is to use multiple machines. Redundancy requires distribution. You need at least two of everything: two servers, two databases, two network paths. And the moment you have two of something, you need a mechanism to keep them in agreement about the state of the world.
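A rough sketch of the arithmetic, assuming failures are independent (which is generous, since real replicas share networks, deploys, and bugs):

HOURS_PER_YEAR = 24 * 365

def downtime_hours_per_year(availability):
    return (1 - availability) * HOURS_PER_YEAR

def redundant_availability(single_node, replicas):
    # The system is down only when every replica is down at the same time
    return 1 - (1 - single_node) ** replicas

print(downtime_hours_per_year(0.999))    # ~8.8 hours/year for one machine
print(downtime_hours_per_year(0.9999))   # ~0.9 hours/year
print(redundant_availability(0.999, 2))  # ~0.999999 -- if failover itself were free and perfect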
This is the redundancy tax: you pay in complexity for every nine you add to your uptime number. Going from 99.9% to 99.99% is not ten times harder. It is roughly a hundred times harder, because you are now dealing with failure modes that sound like riddles — “What happens when server A thinks server B is dead, but server B thinks server A is dead, and they both promote themselves to primary?”
4. Team Boundaries: Conway’s Revenge
Melvin Conway observed in 1967 that organizations design systems that mirror their communication structures. This is not just a cute observation; it is a force of nature. When your company has five engineers sitting in the same room, a monolith is natural. When you have 200 engineers across 15 teams in four time zones, a monolith becomes a coordination nightmare.
The deploy queue alone will make people weep. Team A needs to ship a critical bug fix, but they are blocked behind Team B’s half-finished feature branch, which depends on Team C’s database migration that has not been reviewed yet. The monolith does not just slow down deployment; it creates an organizational bottleneck that grinds everyone to a halt.
So teams start carving out their own services. Not because microservices are architecturally superior, but because organizational independence demands it. This is Conway’s Law operating as a force of distribution, and it is arguably the most powerful of the five because it has nothing to do with technology and everything to do with people.
5. Compliance: The Regulation Ratchet
Data residency laws are the force that catches most teams off guard. GDPR restricts how the personal data of people in the EU can be transferred outside it. India’s data localization rules mandate that certain financial data is stored domestically. Brazil’s LGPD, China’s PIPL, and a growing patchwork of national regulations all have their own requirements about where data can live and how it can move.
You cannot comply with these regulations from a single data center in Virginia. The data must be distributed geographically, which means your system must be distributed. And unlike the other forces, this one is non-negotiable — you cannot scale your way around it or accept the trade-off. The law is the law.
This force is relatively new and accelerating. Ten years ago, most companies could ignore data residency. Today, it is a first-class architectural concern for any company with international users.
The Uncomfortable Middle
Here is the dirty secret of our industry: most companies are not running a clean monolith or a well-designed microservices architecture. They are stuck in the uncomfortable middle — that purgatorial state where you have just enough services to lose the simplicity of a monolith but not enough to gain the benefits of true service-oriented design.
I call it the “Three Services, Two Databases, Forty-Seven Problems” phase.
In this phase, you have a main application that still does 80% of the work, a “user service” that someone split out during a hackathon, and a “notification service” that exists because the intern read a Martin Fowler article. The main application calls the user service synchronously for every request, which means the user service is now a single point of failure that is worse than the original monolith because it adds network latency and a new failure mode. The notification service uses a message queue, but nobody set up dead-letter handling, so failed messages vanish into the void.
The database situation is equally chaotic. The main application still uses the original PostgreSQL database, but the user service has its own MySQL instance because the team that built it preferred MySQL. Now there are two sources of truth for user data, kept “in sync” by a cron job that runs every five minutes and has a comment at the top that reads # TODO: replace with proper event sourcing. That comment is three years old.
Why Companies Stay Here
The uncomfortable middle is stable in the same way a house of cards is stable — it works as long as nobody breathes. Companies stay here for several reasons:
The refactoring cost is enormous. Splitting the monolith further requires understanding the entire dependency graph, which nobody fully understands because the original architect left two years ago.
The pain is tolerable. The system works most of the time. Outages happen, but the team has gotten good at firefighting. There is always a more urgent business priority than “fix the architecture.”
There is no consensus on the target state. Half the team wants to go full microservices. The other half wants to consolidate back into the monolith. The architect is pushing for “modular monolith,” which nobody has the same definition for.
If this sounds like your company, take comfort in the fact that you are not alone. The uncomfortable middle is where the majority of real-world systems live. The conference talks about clean microservices architectures are the exception, not the rule. They are the Instagram highlights reel of system design.
When NOT to Distribute
This is the section where I am going to say something unpopular: sometimes the answer is just a bigger server.
The tech industry has an unfortunate tendency to cargo-cult the practices of companies operating at scales they will never reach. You are not Google. You are probably not even approaching Google’s scale from 2004. And yet, teams of five engineers are drawing architecture diagrams with 12 services, a service mesh, three message queues, and a distributed tracing infrastructure that costs more than their actual product.
I want to be clear: premature distribution is the new premature optimization. And just like premature optimization, it is the root of considerable suffering.
The Costs Nobody Talks About
When you distribute a system, you take on costs that are easy to underestimate:
Operational complexity. A monolith needs one deployment pipeline. Ten microservices need ten deployment pipelines, or one very complicated one. Each service needs monitoring, alerting, log aggregation, and someone who understands how it works when it breaks at 3 AM.
Network unreliability. Function calls within a monolith effectively never fail. Network calls between services fail routinely. You now need retry logic, circuit breakers, timeouts, and fallback behaviors for every inter-service call.
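Even the retry piece alone has more moving parts than it looks. A minimal sketch, assuming the wrapped function enforces its own timeout and raises something retryable (the TransientError here is a stand-in for whatever your client considers transient):

import random
import time

class TransientError(Exception):
    """Stand-in for whatever 'retryable' means for your client: timeouts, 503s, and so on."""

def call_with_retries(func, retries=3, base_delay=0.1):
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff plus jitter so retrying clients do not stampede in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))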
Distributed transactions. In a monolith, you can wrap a complex operation in a database transaction and get atomicity for free. Across services, you need sagas, compensating transactions, or eventual consistency — each of which is dramatically more complex to implement and reason about.
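A saga, stripped to its core, is just a list of steps where every step knows how to undo itself. A minimal sketch, with the actual step functions left to your domain:

def run_saga(steps, context):
    """steps is a list of (action, compensation) pairs; each callable takes the shared context."""
    completed = []
    try:
        for action, compensation in steps:
            action(context)
            completed.append(compensation)
    except Exception:
        # Undo the steps that already succeeded, newest first
        for compensation in reversed(completed):
            compensation(context)
        raise

# Illustrative usage (these step names are hypothetical):
# run_saga([(reserve_inventory, release_inventory),
#           (charge_payment, refund_payment)], order)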
Testing difficulty. Testing a monolith is straightforward: spin it up, hit the endpoints, check the database. Testing a distributed system requires either running all services simultaneously (expensive and slow) or mocking service interactions (which means you are not really testing the distributed behavior).
Signs You Are Distributing Too Early
Here is a quick diagnostic. If any of the following are true, you probably should not be splitting your system apart yet:
- Your team has fewer than 10 engineers.
- Your traffic fits comfortably on a single modern server (you would be surprised how much this can handle).
- You do not have production observability (logging, metrics, tracing) in place.
- Your deployment process is not automated.
- You are distributing for “future scale” that has no concrete timeline.
A single well-optimized server running modern hardware can handle a remarkable amount of traffic. Before you reach for horizontal scaling, have you actually tried vertical scaling? A machine with 64 cores and 256 GB of RAM rents for a few hundred dollars a month from dedicated-server providers, and a few thousand from the major clouds, and it can handle workloads that would otherwise require a dozen microservices to distribute.
The right time to distribute is when you have a specific, measurable problem that distribution solves. “We might need to scale someday” is not that problem. “Our database is at 90% CPU during peak hours and we have exhausted our optimization options” — that is the problem.
Surviving the Transition
Alright, so the forces of distribution have come for you. Your monolith is genuinely groaning under the weight of real traffic, real geographic requirements, or real team coordination problems. You have tried the bigger server and it is no longer enough. It is time to distribute. Here is how to do it without losing your mind.
Start with the Strangler Fig
The strangler fig is a tree that grows around an existing tree, gradually replacing it. The strangler fig pattern in software works the same way: you do not rewrite the monolith. You grow new services around it, gradually routing traffic from the old system to the new one.
The key insight is that you never have a Big Bang migration day. The monolith and the new services coexist, with a routing layer that decides which system handles each request. Over months or years, you move functionality piece by piece until the monolith is either gone or reduced to a manageable core.
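The routing layer does not need to be clever. A minimal sketch, with illustrative path prefixes and internal hostnames:

# Which path prefixes have already been carved out of the monolith
MIGRATED_PREFIXES = {
    "/users": "http://user-service.internal",
    "/notifications": "http://notification-service.internal",
}
MONOLITH = "http://monolith.internal"

def upstream_for(path):
    # Migrated paths go to the new services; everything else still goes to the monolith
    for prefix, upstream in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return upstream
    return MONOLITH

In practice this logic usually lives in a reverse proxy or API gateway rather than in application code, but the shape is the same.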
This approach is slower than a rewrite, but it has one enormous advantage: it actually works. Rewrites fail with alarming regularity because they try to replicate years of accumulated business logic in one shot. The strangler fig approach lets you migrate incrementally, validating each piece as you go.
Invest in Observability Before You Split
This is the most common mistake I see: teams split their monolith before they have the tooling to understand the resulting distributed system. You need three things in place before you extract your first service:
Structured logging with correlation IDs. Every request that enters your system gets a unique ID that is passed to every service it touches. Without this, debugging a distributed system is like trying to follow a conversation in a room where everyone is talking at once.
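A minimal sketch of the idea; the header name and the downstream call are illustrative, and the important part is that the same ID is logged and forwarded everywhere:

import logging
import uuid

CORRELATION_HEADER = "X-Correlation-ID"
log = logging.getLogger("orders")

def handle_request(headers, body, call_downstream):
    # Reuse the caller's correlation ID if one arrived; otherwise mint one at the edge
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    log.info("request received", extra={"correlation_id": correlation_id})

    # Forward the same ID on every outgoing call so logs across services can be joined
    response = call_downstream(body, headers={CORRELATION_HEADER: correlation_id})

    log.info("request completed", extra={"correlation_id": correlation_id})
    return response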
Distributed tracing. You need to see the full journey of a request across services, including timing information. When a request takes 3 seconds, you need to know that 2.8 seconds of that was spent waiting for the inventory service to respond.
Metrics and alerting. Each service needs to emit metrics about its health, and you need dashboards that show the system as a whole. An alert that says “the payment service is returning 500 errors” is useful. An alert that says “error rate is elevated” without telling you which service is the culprit is not.
Invest in this infrastructure before you start splitting services. Retrofitting observability into an existing distributed system is like installing smoke detectors in a building that is already on fire.
Make Idempotency a Religion
In a distributed system, messages get delivered more than once. Network retries, queue redeliveries, and client retries all conspire to ensure that your services receive duplicate requests. If your system is not idempotent — meaning that processing the same request twice produces the same result as processing it once — you will have bugs that are almost impossible to reproduce and debug.
Here is a simple idempotent endpoint pattern:
def process_payment(request):
    # db, error, success, and create_payment stand in for your framework and data layer
    idempotency_key = request.headers.get("Idempotency-Key")
    if not idempotency_key:
        return error("Idempotency-Key header is required")

    existing = db.get_payment_by_idempotency_key(idempotency_key)
    if existing:
        return success(existing)  # Return the same result as before

    payment = create_payment(request.body)
    payment.idempotency_key = idempotency_key
    # A unique constraint on idempotency_key is what enforces this under concurrency;
    # the read above is an optimization, not the guarantee
    db.save(payment)
    return success(payment)
Every write operation in your distributed system should follow this pattern. Every. Single. One. The idempotency key is your insurance policy against the inherent unreliability of networks.
Implement Circuit Breakers
When service A depends on service B, and service B goes down, you do not want service A to keep hammering service B with requests that will never succeed. This is where circuit breakers come in:
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"  # CLOSED = normal, OPEN = failing, HALF_OPEN = testing
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # let one request through to probe for recovery
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed probe in HALF_OPEN re-opens immediately; otherwise open at the threshold
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        else:
            # Any success closes the circuit and clears the failure count
            self.state = "CLOSED"
            self.failure_count = 0
            return result
The circuit breaker has three states. Closed means everything is normal — requests flow through. Open means the downstream service is considered unhealthy — requests fail immediately without even trying, giving the downstream service time to recover. Half-open means the circuit breaker is cautiously testing whether the downstream service has recovered by letting a single request through.
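In use, the breaker wraps whatever client calls the flaky dependency, and the caller decides what to do when the circuit is open. The inventory client and cache fallback below are illustrative, not a prescribed API:

inventory_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def get_stock_level(item_id):
    try:
        # inventory_client.fetch_stock is a placeholder for your actual downstream call
        return inventory_breaker.call(inventory_client.fetch_stock, item_id)
    except CircuitOpenError:
        # Degrade gracefully instead of hammering a service that is already struggling
        return stale_stock_from_cache(item_id)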
This pattern prevents cascading failures, where one unhealthy service takes down every service that depends on it. In a distributed system, cascading failures are the number one cause of total outages.
Accept That Distributed Debugging Is a Full-Time Job
In a monolith, a bug manifests as a stack trace. In a distributed system, a bug manifests as a mystery novel. The payment failed, but the payment service logs say it succeeded. The order service says it never received the confirmation. The message queue says the message was delivered. Somewhere between these three truths lies the actual bug, and finding it requires correlating logs across services, examining message timestamps, and understanding the precise ordering of events that led to the failure.
This is not something you can do ad hoc. You need dedicated tooling, dedicated processes, and eventually dedicated people whose job it is to understand how the distributed system behaves as a whole. If you are not ready to invest in this, you are not ready to distribute.
The Paradox
Here is the irony that sits at the heart of distributed systems engineering: we distribute systems to make them simpler, but distribution is the primary source of complexity.
Think about it. We split a monolith into microservices so that each service is simple enough for one team to own. And each individual service is simpler. But the system — the emergent behavior of all those services interacting — is dramatically more complex than the monolith ever was.
A monolith has one failure mode: it is either up or it is down. A distributed system has a combinatorial explosion of failure modes. Service A is up but slow. Service B is up but returning stale data. The message queue between them is backed up. The load balancer is routing traffic unevenly. The DNS cache has not refreshed yet. Each of these partial failure states produces different symptoms, and diagnosing which combination of factors is causing the user-visible problem requires a mental model of the entire system that no single person fully possesses.
This is the fundamental paradox: we trade local simplicity for global complexity. Each piece becomes easier to understand in isolation, but the interactions between pieces become harder to understand — and it is always the interactions where the bugs hide.
Finding Your Place on the Spectrum
The art of systems engineering is not choosing between monolith and microservices. It is finding the right point on the spectrum for your system, your team, and your business at this moment in time. And that point will change as all three of those variables evolve.
A startup with five engineers and 1,000 users should almost certainly be running a monolith. A company with 500 engineers and millions of users probably needs some degree of distribution. But the answer is never binary, and the worst thing you can do is jump to the end state before you have earned the complexity.
The best systems I have worked with share a common trait: they are as distributed as they need to be, and no more. They have two or three well-defined service boundaries that reflect genuine team or domain boundaries. They use a shared database where sharing makes sense and separate databases where isolation is required. They have just enough infrastructure to support the current scale with a reasonable buffer, not a Kubernetes cluster that could run Netflix.
The Question to Keep Asking
Every time you are tempted to split a service, add a message queue, or introduce a new database, ask yourself this question: What specific problem does this solve, and is the complexity it introduces less than the complexity it removes?
If you cannot answer that question clearly, you are not ready to distribute. And there is no shame in that. Some of the most successful products in the world run on architectures that would make a distributed systems researcher wince. They work because the teams that built them understood that architecture is not an end in itself — it is a means to shipping software that users care about.
The gravitational pull toward distribution is real. But gravity is a force you can work with, not a fate you must surrender to. The best engineers I know treat it like a river: they respect its power, they plan their path with care, and they never go further downstream than they need to.
Build the simplest system that solves your problem. Distribute when the forces demand it. Invest in the tooling to survive the transition. And never, ever confuse architectural complexity with engineering excellence.
They are not the same thing. They never were.