Operations Runbook

This runbook gives practical operating procedures for FLUX, with examples based on the TypeScript SDK.

Start and Health Check

  1. start broker
  2. run TypeScript SDK smoke test

Smoke test:

import { FLUXClient } from '@flux/typescript-sdk';

// Connect to the local broker.
const client = new FLUXClient({ host: '127.0.0.1', port: 9092 });
await client.connect();

// Produce one record, then walk the group lifecycle: join, sync, heartbeat.
await client.produce('orders', 'ops', 'health-check', '1');
const join = await client.join('ops', 'orders', 'checker', 'round_robin');
const sync = await client.sync('ops', 'orders', 'checker', join.generation);
await client.heartbeat('ops', 'orders', 'checker', sync.generation);

// Leave the group and disconnect.
await client.leave('ops', 'orders', 'checker', sync.generation);
await client.close();

Incident: REPLICATION_TIMEOUT

Possible causes:

  • in-sync replica (ISR) count below configured minimum
  • follower progress stale
  • replica lag too high

Immediate checks:

  • verify FLUX_MIN_ISR and lag timeout settings
  • inspect logs for under-replicated partition warnings
  • validate producer ack mode (acks=all is the strictest mode, so it is the most exposed to this timeout)
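
The first check can be partly automated. The following is a minimal sketch, not part of the SDK: checkMinIsr and its replicaCount parameter are hypothetical names, and the rule that the minimum ISR should not exceed the number of replicas is an assumption about how FLUX evaluates FLUX_MIN_ISR. An unset variable is reported so an operator can confirm whether a broker default applies.

```typescript
// Hypothetical helper: sanity-check FLUX_MIN_ISR against the replica count.
// Returns a list of human-readable problems; an empty list means no issue found.
function checkMinIsr(
  env: Record<string, string | undefined>,
  replicaCount: number,
): string[] {
  const problems: string[] = [];
  const raw = env['FLUX_MIN_ISR'];

  if (raw === undefined || raw.trim() === '') {
    problems.push('FLUX_MIN_ISR is not set');
    return problems;
  }

  const minIsr = Number(raw);
  if (!Number.isInteger(minIsr) || minIsr < 1) {
    problems.push(`FLUX_MIN_ISR must be a positive integer, got "${raw}"`);
    return problems;
  }

  // Assumption: a minimum above the replica count can never be satisfied.
  if (minIsr > replicaCount) {
    problems.push(`FLUX_MIN_ISR (${minIsr}) exceeds replica count (${replicaCount})`);
  }
  return problems;
}
```

On a broker host this could be called as checkMinIsr(process.env, 3), where 3 stands in for the actual replica count of the deployment.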

Incident: GENERATION_MISMATCH

Meaning:

  • the consumer's generation is stale, typically because the group rebalanced and its assignment changed

Actions:

  • if using high-level runtime API, rely on automatic rejoin behavior
  • if using low-level API, run join/sync again and refresh generation
  • retry commit only after assignment is current
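
For the low-level path, the actions above can be sketched as a rejoin-then-retry loop. This is an illustration under assumptions: the join and sync signatures mirror the smoke test, the commit call is passed in as a callback because its signature is not shown in this runbook, and detecting the error through a code property equal to 'GENERATION_MISMATCH' is a guess about how the SDK surfaces it. commitWithRejoin and GroupClient are hypothetical names.

```typescript
// Shape of the low-level calls as used in the smoke test; return types are assumed.
interface GroupClient {
  join(a: string, b: string, c: string, strategy: string): Promise<{ generation: number }>;
  sync(a: string, b: string, c: string, generation: number): Promise<{ generation: number }>;
}

// Assumption: the SDK exposes the incident name as an error `code` property.
function isGenerationMismatch(err: unknown): boolean {
  return typeof err === 'object' && err !== null &&
    (err as { code?: unknown }).code === 'GENERATION_MISMATCH';
}

// Runs `commit` with the current generation. On GENERATION_MISMATCH it repeats
// join/sync to refresh the generation and retries, up to `maxRejoins` times.
// Returns the generation that the successful commit used.
async function commitWithRejoin(
  client: GroupClient,
  ids: [string, string, string], // same three identifiers passed to join/sync
  strategy: string,
  generation: number,
  commit: (generation: number) => Promise<void>,
  maxRejoins = 3,
): Promise<number> {
  let current = generation;
  for (let attempt = 0; ; attempt++) {
    try {
      await commit(current);
      return current;
    } catch (err) {
      // Any other error, or too many rejoins, is surfaced to the caller.
      if (!isGenerationMismatch(err) || attempt >= maxRejoins) throw err;
      const join = await client.join(...ids, strategy);
      const sync = await client.sync(...ids, join.generation);
      current = sync.generation;
    }
  }
}
```

Bounding the number of rejoins keeps a consumer from looping indefinitely during a prolonged rebalance; the limit of 3 is arbitrary.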

Incident: NOT_LEADER

Meaning:

  • this broker currently holds the follower role for the partition, so it does not accept local writes

Actions:

  • verify that the expected leader/follower role transitions have completed
  • in low-level admin/debug flows, restore role to leader before local writes

Data Inspection

Inspect data directory for:

  • segment logs
  • index files
  • offsets.json
  • groups.json
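
A quick inspection can be scripted. The sketch below assumes that offsets.json and groups.json sit at the top level of the data directory and contain plain JSON; neither detail is confirmed here, and inspectDataDir is a hypothetical helper rather than an SDK function. It lists the directory entries and reports whether either JSON file is missing or fails to parse.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

interface DataDirReport {
  entries: string[];     // everything found at the top level, sorted
  missing: string[];     // expected JSON files that are absent
  invalidJson: string[]; // expected JSON files that fail to parse
}

// Hypothetical helper: summarize the state of a FLUX data directory.
function inspectDataDir(dir: string): DataDirReport {
  const entries = fs.readdirSync(dir).sort();
  const missing: string[] = [];
  const invalidJson: string[] = [];

  for (const name of ['offsets.json', 'groups.json']) {
    const file = path.join(dir, name);
    if (!fs.existsSync(file)) {
      missing.push(name);
      continue;
    }
    try {
      JSON.parse(fs.readFileSync(file, 'utf8'));
    } catch {
      invalidJson.push(name);
    }
  }
  return { entries, missing, invalidJson };
}
```

Segment logs and index files appear in entries but are not validated, since their on-disk format is not described in this runbook.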

Backup Guidance (Current)

For single-node setups:

  1. stop broker cleanly
  2. snapshot entire FLUX_DATA_DIR
  3. restore to same or compatible runtime config
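
Step 2 can be sketched with Node's built-in recursive copy (fs.cpSync, available in recent Node versions). snapshotDataDir, the backup root, and the flux-data-<timestamp> naming are hypothetical choices for illustration. The sketch assumes the broker has already been stopped and that the backup root is outside the data directory.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical helper: copy the whole data directory into a timestamped folder
// under `backupRoot` and return the path of the new snapshot.
function snapshotDataDir(dataDir: string, backupRoot: string, now: Date = new Date()): string {
  if (!fs.statSync(dataDir).isDirectory()) {
    throw new Error(`not a directory: ${dataDir}`);
  }

  // ISO timestamp with ':' and '.' replaced so the name is filesystem-safe.
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  const dest = path.join(backupRoot, `flux-data-${stamp}`);
  if (fs.existsSync(dest)) {
    throw new Error(`snapshot already exists: ${dest}`);
  }

  fs.mkdirSync(backupRoot, { recursive: true });
  fs.cpSync(dataDir, dest, { recursive: true });
  return dest;
}
```

A restore is the same copy in the opposite direction, performed while the broker is stopped.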

Upgrade Guidance (Current)

  • run full test suite before rollout
  • deploy the replacement for one broker process
  • run TypeScript smoke test after restart
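
A restarted broker may need a moment before it accepts connections, so the post-restart smoke test can be wrapped in a bounded retry. This is a generic sketch: retrySmokeTest is a hypothetical helper, the smoke test is passed in as a function, and the default of 5 attempts with a 1 second pause is arbitrary.

```typescript
// Hypothetical helper: run `smokeTest` until it succeeds or attempts run out.
// Returns the number of attempts used; rethrows the last error on failure.
async function retrySmokeTest(
  smokeTest: () => Promise<void>,
  attempts = 5,
  delayMs = 1000,
): Promise<number> {
  if (!Number.isInteger(attempts) || attempts < 1) {
    throw new RangeError('attempts must be a positive integer');
  }

  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await smokeTest();
      return attempt;
    } catch (err) {
      lastError = err;
      // Pause before the next attempt, but not after the final one.
      if (attempt < attempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

The smoke test from the Start and Health Check section can be placed in an async function and passed as the first argument.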