Monitoring And Troubleshooting
Until richer metrics exist, the most useful operating signals are still HTTP responses, PostgreSQL state, and filesystem evidence.
Start Here When Something Breaks
- Check the HTTP status and body from the failed request.
- Inspect the newest `replication_jobs` rows.
- Compare `objects` and `replicas` records.
- Verify files under `storage/node*`.
Useful Queries
SELECT id, object_id, status, attempt_count, max_attempts, next_run_at, last_error
FROM replication_jobs
ORDER BY id DESC
LIMIT 50;
SELECT id, version
FROM schema_migrations
ORDER BY id;
SELECT object_id, node_name, file_path, status
FROM replicas
ORDER BY object_id DESC
LIMIT 50;
Common Situations
| Symptom | Likely Meaning |
|---|---|
| many `failed` jobs | persistent filesystem or DB-side issue |
| `pending` with future `next_run_at` | normal retry backoff window |
| no jobs being claimed | worker may not be running |
| upload succeeds but replicas missing | expected until async replication completes |
| repeated `pending`/`running` loops with growing `attempt_count` | replication keeps failing and retrying |
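The table can be turned into a small triage helper for rows returned by the `replication_jobs` query. The sketch below is illustrative only and is not part of the project: `JobRow` and `describe_job` are hypothetical names, the status strings `pending`, `running`, and `failed` are taken from this page, and `next_run_at` is assumed to arrive as a timezone-aware timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobRow:
    """Hypothetical container for a few replication_jobs columns."""
    status: str
    attempt_count: int
    max_attempts: int
    next_run_at: Optional[datetime]  # assumed timezone-aware


def describe_job(job: JobRow, now: Optional[datetime] = None) -> str:
    """Map one job row to the likely meaning listed in the table."""
    now = now or datetime.now(timezone.utc)
    if job.status == "failed":
        return "terminal failure: read last_error, then check filesystem and DB"
    if job.status == "pending":
        if job.next_run_at is not None and job.next_run_at > now:
            if job.attempt_count > 0:
                return "retry backoff: an earlier attempt failed"
            return "scheduled: not yet due"
        return "due but unclaimed: worker may not be running"
    if job.status == "running":
        if job.attempt_count > 1:
            return "running again after earlier failures"
        return "in progress"
    # Any other status value is outside what this page documents.
    return "unrecognized status"
```

A row that alternates between the backoff and running descriptions while `attempt_count` climbs toward `max_attempts` matches the last row of the table.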
Current Observability Gap
The project does not yet expose the richer signals an operator would want in production, such as request IDs, Prometheus metrics, queue depth dashboards, or health endpoints. Those are planned work, not hidden features.
Quick Recovery Playbook
- Ensure service process is running and DB is reachable.
- Check latest `replication_jobs` rows for `last_error` and `attempt_count`.
- Verify source object file exists at `source_file_path`.
- Verify secondary node directories (`storage/node2`, `storage/node3`) are writable.
- Fix root cause and allow retries; terminal `failed` jobs currently require manual replay tooling.
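The filesystem steps in this playbook can be scripted. The following is a minimal sketch, not a tool that ships with the project: `check_node_dirs` and `source_file_exists` are hypothetical helpers, and the directory names are the ones mentioned above, resolved relative to the current working directory.

```python
import os
from pathlib import Path
from typing import Dict, Iterable


def check_node_dirs(dirs: Iterable[str]) -> Dict[str, str]:
    """Report whether each node directory exists and is writable."""
    report: Dict[str, str] = {}
    for d in dirs:
        path = Path(d)
        if not path.is_dir():
            report[d] = "missing"
        # Creating files in a directory needs write and search permission.
        elif not os.access(path, os.W_OK | os.X_OK):
            report[d] = "not writable"
        else:
            report[d] = "ok"
    return report


def source_file_exists(source_file_path: str) -> bool:
    """Return True if the source object file is present as a regular file."""
    return Path(source_file_path).is_file()


if __name__ == "__main__":
    for name, state in check_node_dirs(["storage/node2", "storage/node3"]).items():
        print(f"{name}: {state}")
```

`os.access` answers for the user running the script, so run it as the same user as the service process; a check run as root will report directories as writable even when the service user cannot write to them.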