Monitoring And Troubleshooting

Until richer metrics exist, the most useful operating signals are still HTTP responses, PostgreSQL state, and filesystem evidence.

Start Here When Something Breaks

  1. Check the HTTP status and body from the failed request.
  2. Inspect the newest replication_jobs rows.
  3. Compare objects and replicas records.
  4. Verify files under storage/node*.
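
Steps 3 and 4 can be partly automated. The following is a minimal sketch, not project tooling: it takes rows shaped like (object_id, node_name, file_path) from the replicas table and reports those whose file is absent on disk. The function name missing_replica_files and the base_dir parameter are hypothetical, and the sketch assumes file_path is either absolute or relative to base_dir.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Each row mirrors (object_id, node_name, file_path) from the replicas table.
ReplicaRow = Tuple[object, str, str]


def missing_replica_files(rows: Iterable[ReplicaRow], base_dir: str = ".") -> List[ReplicaRow]:
    """Return the replica rows whose file_path is not a regular file on disk.

    Assumption: file_path is either absolute or relative to base_dir.
    """
    base = Path(base_dir)
    missing = []
    for row in rows:
        _object_id, _node_name, file_path = row
        # Joining with an absolute file_path yields that absolute path unchanged.
        if not (base / file_path).is_file():
            missing.append(row)
    return missing
```

Any row returned here is a replica that the database records but the filesystem does not hold, which is the mismatch worth investigating first.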

Useful Queries

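-- Newest replication jobs, with retry state and the last recorded error.
SELECT id, object_id, status, attempt_count, max_attempts, next_run_at, last_error
FROM replication_jobs
ORDER BY id DESC
LIMIT 50;

-- Applied schema migrations.
SELECT id, version
FROM schema_migrations
ORDER BY id;

-- Replica records, highest object_id first.
SELECT object_id, node_name, file_path, status
FROM replicas
ORDER BY object_id DESC
LIMIT 50;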
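
When the job list is long, a per-status count is often easier to read than 50 raw rows. The sketch below runs one aggregate query over replication_jobs through a generic DB-API 2.0 connection. The helper name job_status_counts is hypothetical, and how the connection is obtained is left open; for PostgreSQL it would typically come from a driver such as psycopg.

```python
from typing import Dict

# Aggregate view of the same table queried above.
STATUS_COUNTS_SQL = """
SELECT status, COUNT(*)
FROM replication_jobs
GROUP BY status
"""


def job_status_counts(conn) -> Dict[str, int]:
    """Return {status: row count} for replication_jobs.

    conn is any DB-API 2.0 connection; obtaining it is outside this sketch.
    """
    cur = conn.cursor()
    try:
        cur.execute(STATUS_COUNTS_SQL)
        return {status: count for status, count in cur.fetchall()}
    finally:
        cur.close()
```

A result dominated by failed rows points toward the persistent-failure case, while a growing pending count suggests that jobs are not being claimed.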

Common Situations

Symptom                                                    Likely Meaning
many failed jobs                                           persistent filesystem or DB-side issue
pending with future next_run_at                            normal retry backoff window
no jobs being claimed                                      worker may not be running
upload succeeds but replicas missing                       expected until async replication completes
repeated pending/running loops with growing attempt_count  replication keeps failing and retrying
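
The per-job rows of this table can be expressed as a small classifier. This is a rough sketch under stated assumptions: the function name diagnose_job is hypothetical, the literal status values "pending", "running", and "failed" are taken from the wording above, and treating attempt_count greater than 1 as "has already failed once" is a guess about how the counter is incremented.

```python
from datetime import datetime, timezone
from typing import Optional


def diagnose_job(status: str, attempt_count: int, next_run_at: datetime,
                 now: Optional[datetime] = None) -> str:
    """Map one replication_jobs row to a rough diagnosis.

    next_run_at and now must both be timezone-aware (or both naive).
    """
    now = now or datetime.now(timezone.utc)
    if status == "failed":
        return "terminal failure: check last_error for a filesystem or DB-side issue"
    if status == "pending" and next_run_at > now:
        return "retry backoff: waiting for next_run_at"
    if status == "pending":
        return "due but unclaimed: if this persists, the worker may not be running"
    # Assumption: attempt_count above 1 means at least one earlier attempt failed.
    if status == "running" and attempt_count > 1:
        return "retrying: earlier attempts failed"
    return "no known problem pattern"
```

A single row only hints at the cause. The patterns in the table are about many rows over time, so the output is best read alongside the recent job list.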

Current Observability Gap

The project does not yet expose the richer signals an operator would want in production, such as request IDs, Prometheus metrics, queue depth dashboards, or health endpoints. Those are planned work, not hidden features.

Quick Recovery Playbook

  1. Ensure the service process is running and the database is reachable.
  2. Check the latest replication_jobs rows for last_error and attempt_count.
  3. Verify that the source object file exists at source_file_path.
  4. Verify that the secondary node directories (storage/node2, storage/node3) are writable.
  5. Fix the root cause and let the retries run; jobs in the terminal failed state are not retried and currently require manual replay tooling.
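
Step 4 can be checked by actually attempting a write rather than inspecting permission bits. The sketch below creates and removes a temporary file in each directory. The helper name check_writable is hypothetical, and it should be run as the same user as the service so that the result reflects the service's permissions.

```python
import tempfile
from typing import Dict, Iterable


def check_writable(dirs: Iterable[str]) -> Dict[str, str]:
    """Return {directory: "ok" or an error description} by creating a temp file."""
    results = {}
    for d in dirs:
        try:
            # The temporary file is deleted automatically when the block exits.
            with tempfile.NamedTemporaryFile(dir=d):
                pass
            results[d] = "ok"
        except OSError as exc:
            # Covers missing directories, permission errors, and full disks.
            results[d] = f"{type(exc).__name__}: {exc}"
    return results
```

For example, check_writable(["storage/node2", "storage/node3"]) returns one entry per directory; anything other than "ok" names the operating system error to fix before the retries can succeed.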