Designing Failure Domains on a Single Node: Boot, Platform, Data, Backup Tiers
I used to think failure domains were a multi-node concern.
The usual mental model is to add more servers, then design isolation. But on a single node, the same question still exists: when one part of the system fails, what else should fail with it?
That question changed how I approached my homelab platform.
Instead of starting with workloads, I started with boundaries. The goal was not to make the machine look complicated. The goal was to make failure behavior predictable.
The real problem was not capacity#
The first temptation was to optimize for available space and raw throughput. That is where many homelab builds begin.
The bigger risk was operational coupling. Runtime data mixed with system files, database storage mixed with container state, backups treated as just another folder on the same tier.
That design works until you hit pressure or run one wrong command. Then recovery becomes guesswork.
The design target became simple:
One machine, but multiple failure domains with explicit responsibilities.
Boundary model: four storage tiers#
The platform was structured into four tiers, each with a distinct responsibility:
- Boot/System tier for OS and base services
- Platform tier for container runtime and platform services
- Data tier for stateful workloads (database and cache)
- Backup tier for dumps, archives, and recovery artifacts
This is still a single-node setup. But isolation at the storage boundary level reduces blast radius when one tier saturates, degrades, or is misconfigured.
The decision was less about elegance and more about control:
- if platform logs spike, they should not directly consume database I/O budget
- if database growth accelerates, it should not impact boot stability
- if backup jobs fail or bloat, they should not block runtime paths
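One way such a layout can look in `/etc/fstab` (the UUID placeholders and mount paths here are illustrative, not the node's exact layout):

```
# Boot/System tier: OS and base services
UUID=<boot-uuid>   /              ext4  defaults         0 1
# Platform tier: container runtime and platform services
UUID=<plat-uuid>   /srv/platform  ext4  defaults,nofail  0 2
# Data tier: stateful workloads (database, cache)
UUID=<data-uuid>   /srv/data      ext4  defaults,nofail  0 2
# Backup tier: dumps, archives, recovery artifacts
UUID=<bkup-uuid>   /srv/backup    ext4  defaults,nofail  0 2
```

The `nofail` option on non-boot tiers is part of the isolation: a missing data or backup disk degrades a tier instead of blocking boot.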
Request and operations flow (what this enables)#
Failure domains are easier to reason about when the system flow is explicit.
This flow makes ownership clearer: the platform tier owns execution, the data tier owns state, and the backup tier owns recoverability.
This is the part I used to underestimate. Separation is not just a storage trick; it is an ownership contract.
Workflow I followed (task-oriented, not ad hoc)#
I treated the setup as a phased platform task sequence rather than one long shell session:
- Task 0001: establish OS baseline and resilient boot path
- Task 0002: implement tiered storage model and mount strategy
- Task 0003: bootstrap k3s with runtime paths aligned to platform tier
This sequence ensured each phase had its infrastructure prerequisites in place before the next one began.
Practical command workflow#
The exact device names depend on the node, but the workflow pattern stayed consistent.
I started by verifying disk identity:
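A minimal identity check, assuming a Linux node with util-linux installed (the column choice is mine, not prescribed):

```shell
# Map kernel names (sda, sdb, ...) to physical identity before anything else.
lsblk -o NAME,MODEL,SERIAL,SIZE,TYPE,MOUNTPOINT

# by-id symlinks are stable across reboots; kernel names are not.
# (The directory may be absent on some VMs.)
ls -l /dev/disk/by-id/ 2>/dev/null || true
```

Recording this output in the runbook is what makes a later destructive command checkable against something.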
Before any destructive operation, I cleared stale signatures on the intended targets only:
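The pattern, sketched against a throwaway file so nothing real is at risk; on the real node the target would be a verified device path (the `/dev/sdX` form below is a placeholder, never a literal command):

```shell
# Real-node form: wipefs -n /dev/sdX to preview, wipefs -a /dev/sdX to erase.
# Demo: stamp a swap signature on a scratch file, then clear it.
truncate -s 10M /tmp/demo.img
mkswap /tmp/demo.img >/dev/null 2>&1   # give the file a signature to find
wipefs /tmp/demo.img                   # no flags: list signatures, read-only
wipefs -a /tmp/demo.img                # -a: erase all signatures
wipefs /tmp/demo.img                   # prints nothing: target is clean
```

Listing first and erasing second is the whole discipline; `wipefs` with no flags never writes.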
After creating filesystems and mount points, validate mount intent and persistence:
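The validation pattern I mean, assuming util-linux's `findmnt` is available (tier paths are illustrative):

```shell
# Apply fstab now; anything that errors here would also error at boot.
mount -a

# Cross-check every fstab entry against reality (devices, types, options).
findmnt --verify

# Confirm a mount resolves to the intended source; on the real node the
# target would be a tier path such as /srv/data.
findmnt --target / -o TARGET,SOURCE,FSTYPE
```

Running `mount -a` by hand is the cheap rehearsal of the next reboot: it surfaces fstab mistakes while you still have a shell.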
When using tiered mounts with bind paths, ensure base mounts are established before bind mounts in fstab, then verify live source mapping:
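A sketch of the ordering rule and the check; paths such as `/srv/platform/rancher` and `/var/lib/rancher` are illustrative, not the node's actual layout:

```shell
# In fstab, the base tier mount must come before any bind into it,
# or the bind silently captures an empty directory:
#   UUID=<plat-uuid>       /srv/platform     ext4  defaults,nofail  0 2
#   /srv/platform/rancher  /var/lib/rancher  none  bind             0 0

# After mounting, the SOURCE column exposes what a target is really
# backed by. On the real node, point this at the bind target.
findmnt -o TARGET,SOURCE,FSTYPE /
```

For a bind mount, `SOURCE` shows the backing path rather than just a device, which is exactly the mapping that needs verifying.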
This step is finicky: a typo or wrong ordering in fstab can leave drives misconfigured or silently unmounted, so double-check paths before rebooting.
For RAID boot resilience checks and sync hygiene:
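The checks I mean, assuming a Linux md mirror (the array name `/dev/md0` is illustrative):

```shell
# [UU] in the status line means both mirror members are active;
# [_U] or [U_] means the array is running degraded.
cat /proc/mdstat

# Full state: clean/degraded, sync progress, member devices.
mdadm --detail /dev/md0

# Sync hygiene: trigger a consistency check (reads both halves, compares).
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt   # nonzero after a check deserves attention
```

None of this proves boot failover by itself; pulling a member and booting is still the only real evidence.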
The commands are straightforward. The discipline is in when and why they are run.
Risks that shaped design decisions#
Several risks changed the implementation details more than any architecture diagram:
1) Device naming ambiguity#
/dev/sdX assignment is not a trust boundary. It is a kernel naming detail that can change across reboots.
The risk was formatting the wrong disk during tier setup. The mitigation was explicit runbook mapping (device, model, size) before any destructive command.
2) False confidence in dual-disk boot#
Configuring mirrored boot is not the same as proving failover behavior. A resilient design needs test evidence, not assumptions.
This led to a stricter rule: failover behavior must be validated and documented after setup, not inferred from tool output.
3) fstab drift and bind mount ordering#
Tiering designs can silently fail if mount dependency order is wrong. A clean file is not enough; runtime verification has to confirm expected source-to-mount mapping.
4) Backup tier confusion#
“Backups exist” is not equivalent to “recovery path is protected.” Keeping backups on a logically separate tier made recovery intent explicit and operationally safer.
What changed after implementing these boundaries#
The node did not become magically fault-tolerant. It became easier to reason about under actual pressure.
Three practical shifts stood out from this setup:
- platform operations became more deterministic because responsibilities were clearer
- verification became faster because each tier had a known purpose
- future phase planning (k3s, observability, workload rollout) stopped competing with unresolved storage ambiguity
This was the real gain: fewer hidden coupling points before workload complexity arrives.
Trade-offs I accepted#
Boundary-heavy setup has costs:
- more upfront time before shipping visible workloads
- more documentation burden (mapping, verification, rollback notes)
- less flexibility for quick one-off shortcuts
I accepted those trade-offs because they buy something more valuable: operational confidence when things go wrong.
Closing#
Designing failure domains on a single node is less about pretending to run a cloud and more about practicing system ownership.
The turning point for me was this:
I stopped asking, “Can this run?” and started asking, “How does this fail, and what stays intact when it does?”
Once that question drives design, boundaries stop feeling like overhead. They become the mechanism that keeps the platform understandable and recoverable as it evolves.