
Data Protection And Durability

This page explains exactly what is durable today and where those guarantees currently end.

What Is Protected Today

After upload success, the system has:

a primary object copy
durable object metadata in PostgreSQL
durable replication intent in replication_jobs

That combination means a worker crash does not erase the knowledge that replicas still need to be created.
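As an illustration of why durable intent survives a crash, here is a minimal sketch. It assumes the metadata row and the job rows are written in one transaction, which this page does not state. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs anywhere. The table layout, column names, and the record_upload helper are hypothetical; only the replication_jobs table name comes from this page.

```python
import sqlite3

def record_upload(conn, object_id, primary_node, replica_nodes):
    """Write object metadata and one pending job per target node atomically.

    Hypothetical helper: either every row is committed or none is, so a
    crash cannot leave an object without its replication intent.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO objects (object_id, primary_node) VALUES (?, ?)",
            (object_id, primary_node),
        )
        conn.executemany(
            "INSERT INTO replication_jobs (object_id, target_node, status) "
            "VALUES (?, ?, 'pending')",
            [(object_id, node) for node in replica_nodes],
        )

# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    object_id    TEXT PRIMARY KEY,
    primary_node TEXT NOT NULL
);
CREATE TABLE replication_jobs (
    id          INTEGER PRIMARY KEY,
    object_id   TEXT NOT NULL REFERENCES objects(object_id),
    target_node TEXT NOT NULL,
    status      TEXT NOT NULL
);
""")

record_upload(conn, "obj-1", "node-a", ["node-b", "node-c"])

# The intent is now a durable row; no worker needs to be alive for it to exist.
pending = conn.execute(
    "SELECT COUNT(*) FROM replication_jobs WHERE status = 'pending'"
).fetchone()[0]
```

Because the pending rows exist independently of any worker process, a worker that crashes before copying anything leaves the jobs in place for a later run.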

Failure Handling

Replication jobs follow controlled transitions:

pending -> running -> completed
pending -> running -> pending
pending -> running -> failed

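A job that fails while retries remain returns to pending and is retried with backoff. A job that fails repeatedly ends in failed, and that terminal record preserves the fact that the work did not complete.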
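The transitions above can be sketched as a small allow-list plus a retry policy. This is a sketch under stated assumptions, not the system's actual code: the page only says that retries use backoff, so the exponential-with-jitter formula, the max_attempts limit, and the base and cap values are all hypothetical.

```python
import random

# The four edges listed above; anything else is rejected.
ALLOWED_TRANSITIONS = {
    ("pending", "running"),
    ("running", "completed"),
    ("running", "pending"),  # attempt failed, retry scheduled
    ("running", "failed"),   # terminal failure
}

def can_transition(current, new):
    """Return True only for a transition listed in ALLOWED_TRANSITIONS."""
    return (current, new) in ALLOWED_TRANSITIONS

def next_state_after_error(attempts, max_attempts=5):
    """Pick the state for a job whose latest attempt failed.

    max_attempts is an assumed limit, not a documented value.
    """
    return "pending" if attempts < max_attempts else "failed"

def backoff_seconds(attempt, base=2.0, cap=300.0, rng=random.random):
    """Exponential backoff with jitter (assumed policy).

    The delay is drawn uniformly below min(cap, base * 2**attempt).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Keeping the allow-list explicit means a transition such as completed -> running is refused rather than silently applied.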

Why Idempotency Matters

Workers provide at-least-once execution: the same logical work can run again after a crash or a retry. To keep replay safe:

replica records use conflict-safe persistence
replica identity is unique per object and node
guarded state updates reject stale transitions
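The three safeguards above might look like the following sketch. It again uses sqlite3 as a stand-in for PostgreSQL; the ON CONFLICT ... DO NOTHING clause is accepted by both. The replicas table, its columns, and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed schema: one replica row per (object, node) pair.
CREATE TABLE replicas (
    object_id TEXT NOT NULL,
    node_id   TEXT NOT NULL,
    PRIMARY KEY (object_id, node_id)
);
CREATE TABLE replication_jobs (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
INSERT INTO replication_jobs (id, status) VALUES (1, 'running');
""")

def record_replica(conn, object_id, node_id):
    """Conflict-safe insert: a replayed job hits the key and does nothing."""
    with conn:
        conn.execute(
            "INSERT INTO replicas (object_id, node_id) VALUES (?, ?) "
            "ON CONFLICT (object_id, node_id) DO NOTHING",
            (object_id, node_id),
        )

def transition_job(conn, job_id, expected, new):
    """Guarded update: applies only if the job is still in `expected`.

    Returns False when a stale worker tries to move a job that has
    already changed state.
    """
    with conn:
        cur = conn.execute(
            "UPDATE replication_jobs SET status = ? WHERE id = ? AND status = ?",
            (new, job_id, expected),
        )
    return cur.rowcount == 1

# Simulate the same logical work running twice.
record_replica(conn, "obj-1", "node-b")
record_replica(conn, "obj-1", "node-b")
replica_count = conn.execute("SELECT COUNT(*) FROM replicas").fetchone()[0]

first = transition_job(conn, 1, "running", "completed")   # applied
second = transition_job(conn, 1, "running", "completed")  # rejected as stale
```

The uniqueness constraint carries the duplicate protection, so correctness does not depend on workers coordinating with each other.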

What This Means For Users

upload success does not mean every replica already exists
a temporary worker outage should delay replication, not lose the job
a repeated job should not create duplicate logical replicas

Not Yet Production-Hardened

cross-node checksum scrub
anti-entropy repair
multi-region durability
erasure coding
formal RPO/RTO drills

Recovery Expectations

stopped workers can resume queued jobs later
repeated failures leave the job in the failed state, where replay or repair tooling can pick it up later
the primary copy stays available whether or not secondary replication has completed
