Data Protection and Durability
Replication Model
- Primary object content is written first to node1.
- Metadata + replication work intent is persisted durably in PostgreSQL.
- Worker asynchronously copies data to node2 and node3.
Failure Handling
Replication jobs use controlled transitions and retry:
- claim:
pending -> running - transient failure:
running -> pendingwith delayednext_run_at - terminal failure:
running -> failedatmax_attempts - success:
running -> completed
Backoff strategy is currently quadratic by attempt count.
Idempotency
At-least-once worker execution means duplicate processing can occur after process restarts or transient errors. To make this safe:
- replica writes are conflict-safe
- replica identity is modeled by object/node uniqueness
- stale transition attempts are rejected by guarded state updates
Durability Boundary (Current MVP)
What is protected:
- object metadata durability (PostgreSQL)
- replication work durability (job table)
What is not fully production-hardened yet:
- cross-node checksum scrub / anti-entropy reconciliation
- multi-region durability guarantees
- erasure coding
- formal RPO/RTO operational drills
Recovery Expectations
- If worker is down, jobs remain queued and recover once worker resumes.
- If a job repeatedly fails, it becomes
failedfor manual/automated replay support (replay tooling planned). - Primary object availability is independent of secondary replication completion.