Monitoring And Troubleshooting

Until richer metrics exist, the most useful operating signals are still HTTP responses, PostgreSQL state, and filesystem evidence.

Start Here When Something Breaks

Check the HTTP status and body from the failed request.
Inspect the newest `replication_jobs` rows.
Compare `objects` and `replicas` records.
Verify files under `storage/node*`.
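
The comparison of objects and replicas records can be sketched as a single query. This is a minimal sketch, not the project's own tooling: it assumes `objects` has an `id` column that `replicas.object_id` refers to, which this page does not confirm, so adjust the join to the actual schema.

```sql
-- Assumption: objects.id is the key referenced by replicas.object_id.
-- Lists recent objects with the number of replica rows each one has,
-- so objects with zero or unexpectedly few replicas stand out.
SELECT o.id AS object_id,
       COUNT(r.object_id) AS replica_rows
FROM objects AS o
LEFT JOIN replicas AS r ON r.object_id = o.id
GROUP BY o.id
ORDER BY o.id DESC
LIMIT 50;
```

A low count for a recent upload is not necessarily a fault, since replication is asynchronous; compare it against the matching `replication_jobs` rows before drawing conclusions.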

Useful Queries

SELECT id, object_id, status, attempt_count, max_attempts, next_run_at, last_error
FROM replication_jobs
ORDER BY id DESC
LIMIT 50;

SELECT id, version
FROM schema_migrations
ORDER BY id;

SELECT object_id, node_name, file_path, status
FROM replicas
ORDER BY object_id DESC
LIMIT 50;

Common Situations

Symptom	Likely Meaning
many `failed` jobs	persistent filesystem or DB-side issue
`pending` with future `next_run_at`	normal retry backoff window
no jobs being claimed	worker may not be running
upload succeeds but replicas missing	expected until async replication completes
repeated `pending`/`running` loops with growing `attempt_count`	replication keeps failing and retrying
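
To tell these situations apart quickly, it may help to summarize the queue rather than read individual rows. The queries below are a sketch that assumes `status` holds the literal text values shown in the table (`pending`, `running`, `failed`) and that `next_run_at` is a timestamp comparable with `now()`; neither is confirmed here.

```sql
-- Job counts per status, with the highest attempt_count seen in each.
SELECT status,
       COUNT(*) AS jobs,
       MAX(attempt_count) AS highest_attempt_count
FROM replication_jobs
GROUP BY status
ORDER BY jobs DESC;

-- Pending jobs whose next_run_at has already passed.
-- If this set keeps growing, the worker may not be claiming jobs.
SELECT id, object_id, attempt_count, next_run_at
FROM replication_jobs
WHERE status = 'pending'
  AND next_run_at <= now()
ORDER BY next_run_at
LIMIT 50;
```

Pending rows with a future `next_run_at` are excluded from the second query on purpose, because they are inside the normal retry backoff window.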

Current Observability Gap

The project does not yet expose the richer signals an operator would want in production, such as request IDs, Prometheus metrics, queue depth dashboards, or health endpoints. Those are planned work, not hidden features.

Quick Recovery Playbook

Ensure the service process is running and the database is reachable.
Check the latest `replication_jobs` rows for `last_error` and `attempt_count`.
Verify that the source object file exists at `source_file_path`.
Verify that the secondary node directories (`storage/node2`, `storage/node3`) are writable.
Fix the root cause and allow retries; terminal `failed` jobs currently require manual replay tooling.
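
When many jobs have failed, grouping them by error text can show whether one root cause dominates before any replay is attempted. This is a read-only sketch that assumes terminal jobs carry the literal status `failed`; it does not replay or modify anything.

```sql
-- Group failed jobs by error text to see which failure is most common.
SELECT last_error,
       COUNT(*) AS jobs,
       MIN(id) AS first_job_id,
       MAX(id) AS last_job_id
FROM replication_jobs
WHERE status = 'failed'
GROUP BY last_error
ORDER BY jobs DESC
LIMIT 20;
```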
