From Water Mains to Cloud Storage: Planning for the Break Before It Happens

Last week the Lake Orion / Orion Township area got one of those reminders that infrastructure stays invisible right up until it's the only thing anyone's talking about. A water main break starts as a broken pipe and quickly stops being one. It turns into pressure, capacity, routing, communication, response time, and trust, and into a question every resident feels immediately: what happens next?

Cloud resilience starts with the same question, and it's a harder question than a vendor diagram or a "multi-cloud" checkbox makes it look. A managed platform takes a slice of the risk off your plate; the rest stays yours whether you plan for it or not. When the main breaks, what does your system do next?

The Break Is Never the Whole Problem

I'm not a public works expert, and I won't pretend a municipal water system maps neatly onto a software system. But the pattern underneath is familiar to anyone who's supported production infrastructure: something foundational fails, downstream assumptions get exposed, people find out which dependencies were critical, and the quality of the fallback matters a lot more than the quality of the diagram.

In software we file all of this under disaster recovery, high availability, failover, redundancy, graceful degradation, or multi-cloud architecture. Useful words. They can also hide the uncomfortable part, which is that a fallback is only real if the team can run it while things are already going wrong.

Some Cloud Dependencies Are Commodity

Some parts of cloud architecture are easy to reason about across providers, and blob or object storage is the cleanest example. Azure Blob Storage, Amazon S3, Google Cloud Storage, and the rest aren't identical, but the mental model is close enough:

containers or buckets
object keys
metadata
access policies
lifecycle rules
upload, download, list, delete

That's a surface I can put a small interface around. On my own projects that looks like predictable object naming, file metadata kept in my own database, an export path, a note of which container maps to which bucket, and a restore into another provider that I've run end to end at least once. None of that makes a storage fallback free, but it does make one understandable.
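A minimal sketch of what that small interface might look like in Python. Every name here (ObjectStore, InMemoryStore, object_key, the key layout) is hypothetical and invented for illustration; a real adapter would wrap a provider SDK, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Protocol


class ObjectStore(Protocol):
    """The small surface the application is allowed to depend on."""

    def put(self, key: str, data: bytes, metadata: Dict[str, str]) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list_keys(self, prefix: str = "") -> Iterator[str]: ...
    def delete(self, key: str) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in for a provider adapter (hypothetical, no real SDK involved)."""

    objects: Dict[str, bytes] = field(default_factory=dict)
    metadata: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def put(self, key: str, data: bytes, metadata: Dict[str, str]) -> None:
        self.objects[key] = data
        self.metadata[key] = dict(metadata)

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def list_keys(self, prefix: str = "") -> Iterator[str]:
        return iter(sorted(k for k in self.objects if k.startswith(prefix)))

    def delete(self, key: str) -> None:
        self.objects.pop(key, None)
        self.metadata.pop(key, None)


def object_key(tenant_id: str, file_id: str, filename: str) -> str:
    """Predictable naming: the key can be rebuilt from database rows alone."""
    return f"tenants/{tenant_id}/files/{file_id}/{filename}"


# Application code only ever sees the ObjectStore surface.
store: ObjectStore = InMemoryStore()
key = object_key("t1", "f42", "report.pdf")
store.put(key, b"example bytes", {"content-type": "application/pdf"})
```

The point of the sketch is the seam, not the implementation: if application code only calls these four operations, a second provider means writing one more adapter instead of touching every call site.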

This is where multi-cloud can be practical. If the dependency is commodity enough, the abstraction can be worth what it costs.

Then the Platform Gets Sticky

The further you get from commodity primitives, the more provider-specific gravity you pick up. That gravity isn't automatically bad. Azure, AWS, and Google Cloud all have services that are valuable precisely because they're integrated: you get speed, security hooks, monitoring, identity integration, deployment ergonomics, and managed operations. The stickiness comes bundled into the same deal.

The hard-to-replace pieces usually look like this:

identity flows and token assumptions
eventing systems and delivery semantics
serverless bindings that shape the application
observability pipelines and alert rules
workflow engines and durable orchestration
proprietary database features
managed networking and private access patterns
deployment automation tied to one platform's model

Once you're deep into that list, a fallback stops being "plug in another provider" and becomes an application architecture problem. The provider hands you building blocks; it can't hand back portability after you've designed every code path around its specific behavior.

What's Worth Abstracting

My rule as a developer is to abstract the dependencies that create unacceptable business risk and leave everything else alone, no matter what the whiteboard says. There's a real difference between resilience and theater.

Resilience looks like:

an explicit storage interface because files are critical to the product
documented restore steps because data loss is unacceptable
a tested runbook because recovery under stress is different from recovery in a meeting
feature flags because partial degradation is better than a full outage
clear RPO and RTO targets (how much data you can afford to lose, and how long you can afford to be down) because "as soon as possible" isn't an engineering requirement
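One way to make the RPO and RTO item concrete is to write the targets down as numbers that a drill can be checked against. The sketch below is illustrative only: the class name, function names, and the 15-minute and 4-hour values are assumptions, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RecoveryTargets:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def rpo_met(last_good_backup: datetime, failure_time: datetime,
            targets: RecoveryTargets) -> bool:
    """Anything written after the last good backup is what would be lost."""
    return failure_time - last_good_backup <= targets.rpo


def rto_met(failure_time: datetime, restored_time: datetime,
            targets: RecoveryTargets) -> bool:
    """Time from the failure until service was usable again."""
    return restored_time - failure_time <= targets.rto


# Illustrative targets and timestamps from an imagined drill.
targets = RecoveryTargets(rpo=timedelta(minutes=15), rto=timedelta(hours=4))
failure = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
backup_ok = rpo_met(failure - timedelta(minutes=10), failure, targets)
restore_ok = rto_met(failure, failure + timedelta(hours=5), targets)
```

In this example the backup was recent enough to meet the RPO, while a five-hour restore would miss the four-hour RTO. That is a result a team can act on, which "as soon as possible" never produces.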

Theater looks like:

a diagram with two cloud logos and no tested failover
an abstraction layer nobody understands
duplicate infrastructure that cannot actually receive production traffic
"we can always migrate later" without export, import, or verification paths

Nothing on the resilience list is exciting, and none of it needs to be. Knowing which dependencies matter, designing those seams on purpose, and rehearsing the recovery path before you need it holds up better than a clever abstraction layer that has never been through a bad day.

Fallbacks Need Muscle Memory

Local infrastructure failures make one thing obvious: most of the response is operational. The repair covers the pipe, but also the people, the process, the tools, the communication, and the exact sequence of actions after the break. Cloud systems work the same way.

If your fallback depends on a developer correctly remembering a six-step manual process they've never practiced, what you have is hope with a runbook attached.

I like fallback plans with muscle memory behind them:

scheduled restore tests
small automation scripts with clear owners
dashboards that show the dependency health that actually matters
alerts tied to user impact, not just resource noise
incident notes that explain the decision tree
a way to communicate degraded functionality clearly

Run the restore on a quiet Tuesday afternoon. An incident is a bad time to find out that step four assumes access nobody has.
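A scheduled restore test can be small. The following is a sketch under stated assumptions: plain dictionaries stand in for the backup and the restore target, and the checksum manifest is assumed to come from metadata recorded at upload time. The function names are hypothetical.

```python
import hashlib
from typing import List, Mapping, MutableMapping


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def restore_drill(
    backup: Mapping[str, bytes],
    expected_checksums: Mapping[str, str],
    target: MutableMapping[str, bytes],
) -> List[str]:
    """Copy every expected object into target; return the problems found."""
    problems: List[str] = []
    for key, expected in sorted(expected_checksums.items()):
        if key not in backup:
            problems.append(f"missing from backup: {key}")
            continue
        target[key] = backup[key]
        # Verify what actually landed in the target, not what we meant to write.
        if sha256_hex(target[key]) != expected:
            problems.append(f"checksum mismatch: {key}")
    return problems


# Illustrative data: the manifest is built from known-good content.
original = {"a.txt": b"alpha", "b.txt": b"beta"}
manifest = {k: sha256_hex(v) for k, v in original.items()}
restored: dict = {}
report = restore_drill(original, manifest, restored)
```

An empty report means every expected object arrived intact. Anything else is a finding you want on that quiet Tuesday, not during an incident.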

A Practical Resilience Checklist

If you're looking at your own cloud application, start here.

1. List your mains

What are the dependencies that would stop the product cold? For many applications that list includes identity, database, object storage, DNS, queues, secrets, payment processing, email, and observability.

2. Classify each dependency

Sort each one into commodity, sticky, or custom. Commodity dependencies can often be wrapped or replicated. Sticky ones need stronger design discipline, and custom ones need real ownership, because no provider is coming to save you.
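The first two steps can live in a small inventory that is easy to review. This is a hedged sketch: the field names and the sample entries are made up for illustration, and a spreadsheet would do the same job.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Kind(Enum):
    COMMODITY = "commodity"
    STICKY = "sticky"
    CUSTOM = "custom"


@dataclass(frozen=True)
class Dependency:
    name: str
    kind: Kind
    stops_product: bool    # is this one of the "mains"?
    fallback_tested: bool  # has a recovery path been rehearsed?


def untested_mains(deps: List[Dependency]) -> List[str]:
    """Names of product-stopping dependencies with no rehearsed fallback."""
    return [d.name for d in deps if d.stops_product and not d.fallback_tested]


# Sample entries only; a real list comes from your own architecture.
inventory = [
    Dependency("object storage", Kind.COMMODITY, stops_product=True, fallback_tested=True),
    Dependency("identity", Kind.STICKY, stops_product=True, fallback_tested=False),
    Dependency("report generator", Kind.CUSTOM, stops_product=False, fallback_tested=False),
]
gaps = untested_mains(inventory)
```

With these sample entries, the only gap reported is identity: it stops the product and has no rehearsed fallback.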

3. Decide what failure mode is acceptable

Not everything needs hot failover. For plenty of products read-only mode is fine, delayed processing is fine, and an honest "we're degraded" message beats pretending the whole app works.
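Read-only mode is one of the simpler degraded states to sketch. The handler below is a toy with no web framework involved; the Mode flag, the Response type, and the message text are all assumptions for illustration. It returns HTTP 503 for refused writes, which is the standard status for a temporarily unavailable service.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"


@dataclass(frozen=True)
class Response:
    status: int
    body: str


def handle_request(method: str, mode: Mode) -> Response:
    """Reads always go through; writes are refused honestly while degraded."""
    if method in ("GET", "HEAD"):
        return Response(200, "ok")
    if mode is Mode.READ_ONLY:
        return Response(
            503,
            "We're degraded: changes are temporarily disabled, reading still works.",
        )
    return Response(200, "saved")


# Flipping one flag changes write behavior without a redeploy of the read path.
during_incident = handle_request("POST", Mode.READ_ONLY)
```

The design choice worth noticing is that the degraded path is an explicit branch with its own message, so it can be exercised in a test long before an outage forces it.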

4. Test one fallback path

Pick one critical dependency and rehearse the recovery. Don't start with the fanciest possible disaster scenario; start with the one path you can validate end-to-end.

5. Write down where the line is

This is the part engineers skip. Write the decision down:

We abstract object storage because user files are critical and portable enough. We do not abstract identity yet because the current risk does not justify the complexity, but we document the export path and recovery assumptions.

A sentence like that is worth more than a vague promise that the system is "cloud agnostic."

The Lesson I Keep Coming Back To

The Lake Orion water main break is a local story, but the engineering lesson travels. Infrastructure doesn't become important the moment it fails; it was important the whole time, and the failure just makes the dependency visible.

For cloud applications, the realistic goal is knowing what breaks next, how much it hurts, and what you've already decided to do about it. Nobody can promise that nothing breaks. A blob storage fallback is genuinely doable with the right seams. Provider-specific orchestration is much harder to replace, and an app built around one cloud's deepest integrations may still be the right call, as long as you're honest about the trade.

My own version of that checklist sentence currently says object storage is abstracted, exported, and rehearsed, and identity isn't yet. At least the gap is written down.
