From Water Mains to Cloud Storage: Planning for the Break Before It Happens
Last week, the Lake Orion / Orion Township area had one of those reminders that infrastructure is usually invisible until it is suddenly the only thing anyone is talking about.
A water main break is not just a broken pipe.
It becomes pressure, capacity, routing, communication, response time, and trust. It becomes a question every resident feels immediately:
What happens next?
That same question is where cloud resilience starts.
Not with a vendor diagram. Not with a checkbox that says "multi-cloud." Not with the comforting idea that because a platform is managed, the risk is handled.
The real question is simpler and harder:
When the main breaks, what does your system do next?
The Break Is Never the Whole Problem
I am not writing this as a public works expert, and I am not pretending a municipal water system maps perfectly to a software system.
It does not.
But the pattern is familiar to anyone who has supported production infrastructure:
- something foundational fails
- downstream assumptions get exposed
- people learn which dependencies were critical
- the quality of the fallback matters more than the quality of the diagram
In software, we talk about this as disaster recovery, high availability, failover, redundancy, graceful degradation, or multi-cloud architecture.
Those are all useful words.
But they can also hide the uncomfortable part: a fallback is only real if the team knows how to use it when things are already stressful.
Some Cloud Dependencies Are Commodity
There are parts of cloud architecture that are relatively easy to reason about across providers.
Blob or object storage is the cleanest example.
Azure Blob Storage, Amazon S3, Google Cloud Storage, and similar services are not identical, but the mental model is close enough:
- containers or buckets
- object keys
- metadata
- access policies
- lifecycle rules
- upload, download, list, delete
As a developer, I can put a small interface around that.
I can keep object naming predictable. I can store file metadata in my own database. I can build an export path. I can test a restore into another provider. I can document which container maps to which bucket.
That does not make fallback free.
But it makes it understandable.
This is where multi-cloud can be practical. If the dependency is commodity enough, the cost of abstraction can be worth it.
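Here is a minimal sketch of what that small interface could look like in Python. Everything named here (`ObjectStore`, `LocalDirStore`, `copy_all`) is hypothetical and made up for illustration. The local-directory backend is only a stand-in so the example runs without any cloud SDK. A real Azure or S3 backend would implement the same four methods with the provider's client library.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ObjectStore(ABC):
    """Hypothetical seam: the only storage operations the application may call."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def list_keys(self, prefix: str = "") -> list:
        ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class LocalDirStore(ObjectStore):
    """Stand-in backend that keeps each object as a file under a root directory.

    Assumption: keys are trusted, relative, slash-separated names.
    """

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()

    def list_keys(self, prefix=""):
        # Walk the tree and report files as slash-separated keys.
        keys = (
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
        return sorted(k for k in keys if k.startswith(prefix))

    def delete(self, key):
        (self.root / key).unlink()


def copy_all(source: ObjectStore, target: ObjectStore, prefix: str = "") -> int:
    """Export path: copy every object under `prefix` and return the count."""
    count = 0
    for key in source.list_keys(prefix):
        target.put(key, source.get(key))
        count += 1
    return count
```

Because `copy_all` only talks to the interface, the same function can serve as the export path between any two backends that implement it. That is the part worth rehearsing.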
Then the Platform Gets Sticky
The further you move away from commodity primitives, the more provider-specific gravity you create.
That gravity is not always bad. Azure, AWS, and Google Cloud all have services that are valuable precisely because they are integrated. You get speed, security hooks, monitoring, identity integration, deployment ergonomics, and managed operations.
But you also get stickiness.
The hard-to-replace pieces usually look like this:
- identity flows and token assumptions
- eventing systems and delivery semantics
- serverless bindings that shape the application
- observability pipelines and alert rules
- workflow engines and durable orchestration
- proprietary database features
- managed networking and private access patterns
- deployment automation tied to one platform's model
That is where the fallback stops being "plug in another provider" and starts becoming an application architecture problem.
The cloud provider can give you building blocks.
It cannot automatically give you portability after you have designed every code path around provider-specific behavior.
My Line: Abstract Risk, Not Everything
This is where I draw the line as a developer:
Abstract the things that create unacceptable business risk. Do not abstract everything just because a whiteboard says "multi-cloud."
There is a difference between resilience and theater.
Resilience looks like:
- an explicit storage interface because files are critical to the product
- documented restore steps because data loss is unacceptable
- a tested runbook because recovery under stress is different from recovery in a meeting
- feature flags because partial degradation is better than a full outage
- clear RPO and RTO targets because "as soon as possible" is not an engineering requirement
Theater looks like:
- a diagram with two cloud logos and no tested failover
- an abstraction layer nobody understands
- duplicate infrastructure that cannot actually receive production traffic
- "we can always migrate later" without export, import, or verification paths
The boring answer is usually the right one: know which dependencies matter, design those seams intentionally, and test the recovery path before you need it.
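To make the RPO and RTO point concrete, here is a small sketch that scores a recovery drill against written targets. The class and function names are hypothetical, and the timestamps in any real use would come from your own backup logs and incident timeline. RPO is treated as the gap between the last good backup and the failure. RTO is treated as the gap between the failure and restored service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time until service is restored


@dataclass(frozen=True)
class DrillResult:
    last_good_backup: datetime
    failure_at: datetime
    service_restored_at: datetime


def evaluate_drill(targets: RecoveryTargets, drill: DrillResult) -> dict:
    """Compare what a drill actually achieved with the stated targets."""
    data_loss_window = drill.failure_at - drill.last_good_backup
    downtime = drill.service_restored_at - drill.failure_at
    return {
        "rpo_met": data_loss_window <= targets.rpo,
        "rto_met": downtime <= targets.rto,
    }
```

A drill that meets the RPO but misses the RTO is still useful information. It tells you the backups are fine and the restore procedure is the slow part.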
Fallbacks Need Muscle Memory
One thing local infrastructure failures make obvious is that response is operational.
It is not just the pipe. It is the people, process, communication, tools, and sequence of actions after the break.
Cloud systems are the same way.
If your fallback depends on a developer remembering a six-step manual process they have never practiced, that is not a fallback. That is hope with a runbook attached.
I like fallback plans that have muscle memory:
- scheduled restore tests
- small automation scripts with clear owners
- dashboards that show the dependency health that actually matters
- alerts tied to user impact, not just resource noise
- incident notes that explain the decision tree
- a way to communicate degraded functionality clearly
You do not own the fallback until you have tested it.
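One way a scheduled restore test can check its own result is to compare checksums of the restored objects against a manifest taken from the source. This is a sketch under simple assumptions: objects fit in memory as a plain dict of key to bytes, and the function names are hypothetical. A real job would stream objects from the two stores instead.

```python
import hashlib


def build_manifest(objects: dict) -> dict:
    """Map each object key to the SHA-256 hex digest of its content."""
    return {key: hashlib.sha256(data).hexdigest() for key, data in objects.items()}


def verify_restore(manifest: dict, restored: dict) -> dict:
    """Report keys that are missing, corrupted, or unexpected after a restore."""
    missing = sorted(k for k in manifest if k not in restored)
    corrupted = sorted(
        k
        for k in manifest
        if k in restored and hashlib.sha256(restored[k]).hexdigest() != manifest[k]
    )
    unexpected = sorted(k for k in restored if k not in manifest)
    return {"missing": missing, "corrupted": corrupted, "unexpected": unexpected}


def restore_ok(report: dict) -> bool:
    """Treat the restore as passing only if nothing is missing or corrupted."""
    return not report["missing"] and not report["corrupted"]
```

Unexpected keys are reported but do not fail the check in this sketch. Whether extra objects in the target count as a failure is a decision for the runbook.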
A Practical Resilience Checklist
If you are looking at your own cloud application, start with this:
1. List your mains
What are the dependencies that would stop the product cold?
For many applications, that list includes identity, database, object storage, DNS, queues, secrets, payment processing, email, and observability.
2. Classify each dependency
Ask whether it is commodity, sticky, or custom.
Commodity dependencies can often be wrapped or replicated. Sticky dependencies need stronger design discipline. Custom dependencies need real ownership because no provider is coming to save you.
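Steps 1 and 2 can live in something as simple as a small inventory in code. The sketch below is one possible shape, with hypothetical names and example entries that are placeholders, not a recommendation for how any particular dependency should be classified.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    COMMODITY = "commodity"
    STICKY = "sticky"
    CUSTOM = "custom"


@dataclass(frozen=True)
class Dependency:
    name: str
    kind: Kind
    stops_product: bool    # is this one of the "mains"?
    fallback_tested: bool  # has the recovery path been rehearsed?


def group_by_kind(deps: list) -> dict:
    """Return dependency names grouped under each classification."""
    groups = {kind.value: [] for kind in Kind}
    for dep in deps:
        groups[dep.kind.value].append(dep.name)
    return groups


def untested_mains(deps: list) -> list:
    """Names of product-stopping dependencies with no rehearsed fallback."""
    return [d.name for d in deps if d.stops_product and not d.fallback_tested]


# Placeholder inventory for illustration only.
INVENTORY = [
    Dependency("object storage", Kind.COMMODITY, stops_product=True, fallback_tested=True),
    Dependency("identity", Kind.STICKY, stops_product=True, fallback_tested=False),
    Dependency("report generator", Kind.CUSTOM, stops_product=False, fallback_tested=False),
]
```

Calling `untested_mains(INVENTORY)` on a list like this gives a short, honest to-do list: the dependencies that can stop the product and have never had their fallback rehearsed.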
3. Decide what failure mode is acceptable
Not everything needs hot failover.
Sometimes read-only mode is fine. Sometimes delayed processing is fine. Sometimes a friendly "we are degraded" experience is better than pretending the whole app works.
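The three options above (read-only, delayed processing, an honest degraded message) can be modeled as an explicit mode switch. This is a toy sketch with hypothetical names and in-memory state. In a real system the mode would come from a feature flag or health check, and the queue would be durable.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DELAYED = "delayed"      # accept writes, process them later
    READ_ONLY = "read_only"  # reject writes, keep reads working


class NoteService:
    """Toy service whose write path degrades according to the current mode."""

    def __init__(self) -> None:
        self.mode = Mode.NORMAL
        self.notes = []
        self.pending = deque()

    def add_note(self, text: str) -> str:
        if self.mode is Mode.READ_ONLY:
            # Tell the user plainly instead of pretending the write worked.
            return "rejected: we are degraded and currently read-only"
        if self.mode is Mode.DELAYED:
            self.pending.append(text)
            return "queued"
        self.notes.append(text)
        return "saved"

    def read_notes(self) -> list:
        # Reads stay available in every mode.
        return list(self.notes)

    def recover(self) -> int:
        """Return to normal mode and apply queued writes in arrival order."""
        self.mode = Mode.NORMAL
        applied = 0
        while self.pending:
            self.notes.append(self.pending.popleft())
            applied += 1
        return applied
```

The design choice worth noticing is that each mode returns a different, truthful answer to the caller. That is what makes a degraded experience communicable.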
4. Test one fallback path
Pick one critical dependency and rehearse the recovery.
Do not start with the fanciest possible disaster scenario. Start with the one path you can actually validate end-to-end.
5. Write down where the line is
This is the part engineers skip.
Write the decision down:
We abstract object storage because user files are critical and portable enough. We do not abstract identity yet because the current risk does not justify the complexity, but we document the export path and recovery assumptions.
That kind of sentence is more useful than a vague promise that the system is "cloud agnostic."
The Lesson I Keep Coming Back To
The Lake Orion water main break is a local story, but the engineering lesson is broader.
Infrastructure does not become important when it fails.
It was important the whole time.
The failure just makes the dependency visible.
For cloud applications, the goal is not to avoid every possible break. That is not realistic. The goal is to know what breaks next, how much it hurts, and what you can do about it.
Blob storage fallback? Very possible with the right seams.
Provider-specific orchestration fallback? Much harder.
An entire application designed around one cloud's deepest integrations? Maybe still the right call, but be honest about the trade.
That honesty is the work.
Because when the main breaks, the fallback becomes the product.

