Durable Work in PostgresPart 9

Performance and scaling

Claim-path tuning, indexes, and an interactive performance tuner for Postgres claim loops.

When lag grows, first identify the bottleneck: claim SQL, handler work, table size, or cross-service transport.

Most Postgres work queues should tune the claim path before changing platforms. Brokers help with transport and fan-out; they do not remove the need for leases, idempotency, and handler capacity.

Find the bottleneck before changing tools

Slow claims are usually a fixable SQL problem: poll frequency, claim scan width, missing bucket filters, or table bloat. If claims are fast but lag grows, add handler capacity. If the work is really cross-service fan-out or replay, then add transport.

Realistic target after tuning: hundreds to low thousands of jobs/sec across a worker pool on a modest Postgres instance, if handlers are fast (<100ms) and you adopt the basics below. Slow handlers (HTTP calls, PDF generation) scale by adding workers until handler CPU/I/O is the limit, not claim latency.

What you seeLikely bottleneckFirst move
Claim query p99 is highSQL scan width, locks, or table sizeFilter by bucket in SQL, right-size batches, archive completed rows.
Claims are fast but pending age growsHandlers are slower than enqueue rateAdd handler concurrency or workers until CPU, I/O, or partner limits appear.
Database is mostly idle but workers poll constantlyIdle-loop loadUse LISTEN/NOTIFY plus jittered backoff.
One customer blocks its own streamHot partition keyShard that key if strict global order is not required.
Many services need replay and fan-outTransport, not local coordinationUse an outbox relay into a broker or event log.

First-pass tuning

Filter ownership in SQL

Precompute partition_bucket on enqueue. Claim with WHERE partition_bucket = ANY($owned). Claiming 100 rows and discarding 95 in application code wastes locks and I/O.

Cache the ring

Rebuild the ring and owned-bucket list every 5–10s, not on every poll loop. Heartbeat TTL is 30s; the worker pool does not change that fast.

Wake, don’t hammer

LISTEN inbox_new on enqueue + jittered backoff when idle (1–2s). Poll fast (50–100ms) only when backlog exists. Cuts idle DB load by 10–50×.

Right-size claim batches

Start at 25. Increase until claim p99 stays <50ms or handlers cannot keep up. Huge batches mean long transactions. Tiny batches mean claim overhead dominates.

Keep the inbox table hot

Archive or delete completed rows on a schedule. A 50M-row inbox makes every index scan slower: even with partial indexes.

Small payloads

Store a pointer in payload (S3 key, row id), not the blob. Target <2 KB JSONB per row. Large payloads bloat pages and WAL.

Workload tuning calculator

Describe your workload, get recommended timings, batch sizes, and pool settings based on the rules of thumb below.

Your cluster
Horizontal scale: each runs the claim loop
Sustained jobs/sec into the inbox
Millions of completed rows still in table (bloat factor)
Your workload
Wall-clock time per job at p99: drives lease & batch sizing
Recommended settings
Adjust inputs to see recommendations.
Est. capacity/s
Load %
Per worker/s
SettingRecommended
      generated config

      Start with these settings

      Tune from here, then measure. Adjust one knob at a time so you can tell whether the improvement came from fewer scans, fewer claims, faster handlers, or lower idle polling.

      SettingStart hereRule of thumbWatch for
      Heartbeat interval10s≈ TTL ÷ 3 (TTL = 30s → heartbeat every 10s)Too slow → false dead. Too fast → write noise
      Worker dead TTL30s3× missed heartbeats before marking deadLong TTL = slow failover. Short = flappy ring
      Lease duration90s≥ 3× p99 handler time (short jobs). Renew for long jobsToo short → cleanup churn + duplicates
      Lease renewalEvery 30sEvery lease ÷ 3 while handler runsMiss one renewal window → row reclaimed
      Claim batch size25Drain in ~1–5s of handler wall time. Cap 50–100Big batches lock more rows. Small = overhead
      Idle poll interval1–2s + jitterBackoff when queue empty. Reset on NOTIFYFixed 500ms poll × N workers = needless load
      Active poll interval50–100msWhen pending depth > 0 or after NOTIFYOnly run hot when work exists
      Ring cache TTL5–10sRebuild on NOTIFY or TTL expiryEvery-loop rebuild wastes CPU
      Housekeeping30sOne winner via advisory lockMore frequent rarely helps
      Partition buckets360–1024≥ 10× worker count; ~128 vnodes/workerToo few → skew. Too many → fine
      Handler concurrency4–16 per workerMatch CPU cores × 2 for I/O-bound workSerial loop underuses the machine
      Payload size<2 KBStore heavy data elsewhere. Keep JSONB leanBig JSONB → slow claims + vacuum pain
      Completed row retention24h–7d in inboxArchive then DELETE. Keep pending+processing hotMillions of completed rows → index bloat
      Connection poolPgBouncer transaction modePool size ≈ workers × (1 claim + concurrency)One conn per row = pool exhaustion

      When the first pass is not enough

      Parallel handlers per worker

      Claim a batch, process rows with a concurrency limit (p-limit, semaphores). Claim stays serial. Handlers run in parallel. Often 2–4× throughput per machine.

      Index for ordering guard

      If using per-key FIFO, add partial index on (partition_key) WHERE status = 'processing' so the NOT EXISTS check stays cheap.

      Horizontal workers

      Add processes until claim latency or Postgres CPU rises. Consistent hash spreads keys. Scale is roughly linear until the DB is hot.

      Postgres hygiene

      Aggressive autovacuum on inbox, monitor bloat, consider FILLFACTOR 90 on high-churn tables. Tune after archive strategy is in place.

      Put the tuned loop together

      const BATCH_SIZE = 25;
      const HANDLER_CONCURRENCY = 8;
      const RING_TTL_MS = 8_000;
      const IDLE_POLL_MS = 1_500;
      const ACTIVE_POLL_MS = 75;
      
      let ringCache = { ring: null, ownedBuckets: [], expiresAt: 0 };
      
      async function getRing(db) {
        if (Date.now() < ringCache.expiresAt) return ringCache;
        const live = await db.query(`SELECT id FROM workers
          WHERE status = 'alive'
            AND last_seen_at > now() - interval '30 seconds'
          ORDER BY id`);
        const ring = buildRing(live, 128);
        ringCache = {
          ring,
          ownedBuckets: ownedBucketsFor(WORKER_ID, ring, 360),
          expiresAt: Date.now() + RING_TTL_MS,
        };
        return ringCache;
      }
      
      // LISTEN inbox_new → wake workers and refresh ownership soon
      db.on("notification", () => {
        wakeFlag = true;
        ringCache.expiresAt = 0;
      });
      
      while (true) {
        await maybeRunHousekeeping(db);
        const { ownedBuckets } = await getRing(db);
        const batch = await db.claimInboxBatch({
          workerId: WORKER_ID,
          buckets: ownedBuckets,
          limit: BATCH_SIZE,
        });
      
        if (batch.length === 0) {
          await sleep(wakeFlag ? ACTIVE_POLL_MS : IDLE_POLL_MS + Math.random() * 500);
          wakeFlag = false;
          continue;
        }
      
        await mapWithConcurrency(batch, HANDLER_CONCURRENCY, async (row) => {
          await processWithLease(row, WORKER_ID);
        });
      }

      Keep completed rows out of the hot table

      The inbox should mostly hold pending and processing. Move history out on a schedule (housekeeping can do this too).

      -- Run daily or when completed count > 1M
      INSERT INTO inbox_archive SELECT * FROM inbox
      WHERE status = 'completed' AND completed_at < now() - interval '7 days';
      
      DELETE FROM inbox
      WHERE status = 'completed' AND completed_at < now() - interval '7 days';
      
      -- Batch deletes (10k at a time) if table is huge: avoid one massive lock

      Add indexes only for measured pressure

      -- Speeds per-key ordering guard (NOT EXISTS on processing siblings)
      CREATE INDEX idx_inbox_partition_processing
        ON inbox (partition_key)
        WHERE status = 'processing';
      
      -- Already in schema.sql: claim path depends on these partial indexes:
      -- idx_inbox_bucket_claim (partition_bucket, available_at, created_at, id) WHERE status = 'pending'
      -- idx_inbox_claim (available_at, created_at, id) WHERE status = 'pending'

      Move transport only when transport is the bottleneck

      Separate where you coordinate claims from what carries bytes between services. The table uses product names only as familiar anchors.

      SignalKeep tuning Postgres coordinationAdd or lean on message transport
      Sustained enqueue rate< ~1–5k/s after SQL and worker tuningHigh-volume handoff between services. Many subscribers on one stream
      Claim p99 after tuning< 50ms, coordination is healthyStill pegged after tuning. Or consumers cannot use your DB at all
      Replay / fan-outOutbox relay + per-service inboxShared event log as platform infrastructure
      Cross-service boundaryOutbox in producer DB → transport → inbox in consumer DBNatural fit: transport is the pipe, not the coordinator

      Multi-DC, replicas, and cost signals

      Multi-DC and replicas

      Enqueue and claim must hit the primary. Read replicas work for dashboards and lag metrics, not for claiming work. Cross-DC inbox requires a single write leader, or you accept split-brain risk.

      Cost signals

      When claim queries peg CPU and archiving cannot keep pace, tune coordination first. Move high-volume transport to a broker only if that does not solve the bottleneck.