Durable Work in PostgresPart 9
Performance and scaling
Claim-path tuning, indexes, and an interactive performance tuner for Postgres claim loops.
When lag grows, first identify the bottleneck: claim SQL, handler work, table size, or cross-service transport.
Most Postgres work queues should tune the claim path before changing platforms. Brokers help with transport and fan-out; they do not remove the need for leases, idempotency, and handler capacity.
Find the bottleneck before changing tools
Slow claims are usually a fixable SQL problem: poll frequency, claim scan width, missing bucket filters, or table bloat. If claims are fast but lag grows, add handler capacity. If the work is really cross-service fan-out or replay, then add transport.
Realistic target after tuning: hundreds to low thousands of jobs/sec across a worker pool on a modest Postgres instance, if handlers are fast (<100ms) and you adopt the basics below. Slow handlers (HTTP calls, PDF generation) scale by adding workers until handler CPU/I/O is the limit, not claim latency.
| What you see | Likely bottleneck | First move |
|---|---|---|
| Claim query p99 is high | SQL scan width, locks, or table size | Filter by bucket in SQL, right-size batches, archive completed rows. |
| Claims are fast but pending age grows | Handlers are slower than enqueue rate | Add handler concurrency or workers until CPU, I/O, or partner limits appear. |
| Database is mostly idle but workers poll constantly | Idle-loop load | Use LISTEN/NOTIFY plus jittered backoff. |
| One customer blocks its own stream | Hot partition key | Shard that key if strict global order is not required. |
| Many services need replay and fan-out | Transport, not local coordination | Use an outbox relay into a broker or event log. |
First-pass tuning
Filter ownership in SQL
Precompute partition_bucket on enqueue. Claim with WHERE partition_bucket = ANY($owned). Claiming 100 rows and discarding 95 in application code wastes locks and I/O.
Cache the ring
Rebuild the ring and owned-bucket list every 5–10s, not on every poll loop. Heartbeat TTL is 30s; the worker pool does not change that fast.
Wake, don’t hammer
LISTEN inbox_new on enqueue + jittered backoff when idle (1–2s). Poll fast (50–100ms) only when backlog exists. Cuts idle DB load by 10–50×.
Right-size claim batches
Start at 25. Increase until claim p99 stays <50ms or handlers cannot keep up. Huge batches mean long transactions. Tiny batches mean claim overhead dominates.
Keep the inbox table hot
Archive or delete completed rows on a schedule. A 50M-row inbox makes every index scan slower: even with partial indexes.
Small payloads
Store a pointer in payload (S3 key, row id), not the blob. Target <2 KB JSONB per row. Large payloads bloat pages and WAL.
Workload tuning calculator
Describe your workload, get recommended timings, batch sizes, and pool settings based on the rules of thumb below.
Your cluster
Your workload
Recommended settings
| Setting | Recommended |
|---|
Start with these settings
Tune from here, then measure. Adjust one knob at a time so you can tell whether the improvement came from fewer scans, fewer claims, faster handlers, or lower idle polling.
| Setting | Start here | Rule of thumb | Watch for |
|---|---|---|---|
| Heartbeat interval | 10s | ≈ TTL ÷ 3 (TTL = 30s → heartbeat every 10s) | Too slow → false dead. Too fast → write noise |
| Worker dead TTL | 30s | 3× missed heartbeats before marking dead | Long TTL = slow failover. Short = flappy ring |
| Lease duration | 90s | ≥ 3× p99 handler time (short jobs). Renew for long jobs | Too short → cleanup churn + duplicates |
| Lease renewal | Every 30s | Every lease ÷ 3 while handler runs | Miss one renewal window → row reclaimed |
| Claim batch size | 25 | Drain in ~1–5s of handler wall time. Cap 50–100 | Big batches lock more rows. Small = overhead |
| Idle poll interval | 1–2s + jitter | Backoff when queue empty. Reset on NOTIFY | Fixed 500ms poll × N workers = needless load |
| Active poll interval | 50–100ms | When pending depth > 0 or after NOTIFY | Only run hot when work exists |
| Ring cache TTL | 5–10s | Rebuild on NOTIFY or TTL expiry | Every-loop rebuild wastes CPU |
| Housekeeping | 30s | One winner via advisory lock | More frequent rarely helps |
| Partition buckets | 360–1024 | ≥ 10× worker count; ~128 vnodes/worker | Too few → skew. Too many → fine |
| Handler concurrency | 4–16 per worker | Match CPU cores × 2 for I/O-bound work | Serial loop underuses the machine |
| Payload size | <2 KB | Store heavy data elsewhere. Keep JSONB lean | Big JSONB → slow claims + vacuum pain |
| Completed row retention | 24h–7d in inbox | Archive then DELETE. Keep pending+processing hot | Millions of completed rows → index bloat |
| Connection pool | PgBouncer transaction mode | Pool size ≈ workers × (1 claim + concurrency) | One conn per row = pool exhaustion |
When the first pass is not enough
Parallel handlers per worker
Claim a batch, process rows with a concurrency limit (p-limit, semaphores). Claim stays serial. Handlers run in parallel. Often 2–4× throughput per machine.
Index for ordering guard
If using per-key FIFO, add partial index on (partition_key) WHERE status = 'processing' so the NOT EXISTS check stays cheap.
Horizontal workers
Add processes until claim latency or Postgres CPU rises. Consistent hash spreads keys. Scale is roughly linear until the DB is hot.
Postgres hygiene
Aggressive autovacuum on inbox, monitor bloat, consider FILLFACTOR 90 on high-churn tables. Tune after archive strategy is in place.
Put the tuned loop together
const BATCH_SIZE = 25;
const HANDLER_CONCURRENCY = 8;
const RING_TTL_MS = 8_000;
const IDLE_POLL_MS = 1_500;
const ACTIVE_POLL_MS = 75;
let ringCache = { ring: null, ownedBuckets: [], expiresAt: 0 };
async function getRing(db) {
if (Date.now() < ringCache.expiresAt) return ringCache;
const live = await db.query(`SELECT id FROM workers
WHERE status = 'alive'
AND last_seen_at > now() - interval '30 seconds'
ORDER BY id`);
const ring = buildRing(live, 128);
ringCache = {
ring,
ownedBuckets: ownedBucketsFor(WORKER_ID, ring, 360),
expiresAt: Date.now() + RING_TTL_MS,
};
return ringCache;
}
// LISTEN inbox_new → wake workers and refresh ownership soon
db.on("notification", () => {
wakeFlag = true;
ringCache.expiresAt = 0;
});
while (true) {
await maybeRunHousekeeping(db);
const { ownedBuckets } = await getRing(db);
const batch = await db.claimInboxBatch({
workerId: WORKER_ID,
buckets: ownedBuckets,
limit: BATCH_SIZE,
});
if (batch.length === 0) {
await sleep(wakeFlag ? ACTIVE_POLL_MS : IDLE_POLL_MS + Math.random() * 500);
wakeFlag = false;
continue;
}
await mapWithConcurrency(batch, HANDLER_CONCURRENCY, async (row) => {
await processWithLease(row, WORKER_ID);
});
}
Keep completed rows out of the hot table
The inbox should mostly hold pending and processing. Move history out on a schedule (housekeeping can do this too).
-- Run daily or when completed count > 1M
INSERT INTO inbox_archive SELECT * FROM inbox
WHERE status = 'completed' AND completed_at < now() - interval '7 days';
DELETE FROM inbox
WHERE status = 'completed' AND completed_at < now() - interval '7 days';
-- Batch deletes (10k at a time) if table is huge: avoid one massive lock
Add indexes only for measured pressure
-- Speeds per-key ordering guard (NOT EXISTS on processing siblings)
CREATE INDEX idx_inbox_partition_processing
ON inbox (partition_key)
WHERE status = 'processing';
-- Already in schema.sql: claim path depends on these partial indexes:
-- idx_inbox_bucket_claim (partition_bucket, available_at, created_at, id) WHERE status = 'pending'
-- idx_inbox_claim (available_at, created_at, id) WHERE status = 'pending'
Move transport only when transport is the bottleneck
Separate where you coordinate claims from what carries bytes between services. The table uses product names only as familiar anchors.
| Signal | Keep tuning Postgres coordination | Add or lean on message transport |
|---|---|---|
| Sustained enqueue rate | < ~1–5k/s after SQL and worker tuning | High-volume handoff between services. Many subscribers on one stream |
| Claim p99 after tuning | < 50ms, coordination is healthy | Still pegged after tuning. Or consumers cannot use your DB at all |
| Replay / fan-out | Outbox relay + per-service inbox | Shared event log as platform infrastructure |
| Cross-service boundary | Outbox in producer DB → transport → inbox in consumer DB | Natural fit: transport is the pipe, not the coordinator |
Multi-DC, replicas, and cost signals
Multi-DC and replicas
Enqueue and claim must hit the primary. Read replicas work for dashboards and lag metrics, not for claiming work. Cross-DC inbox requires a single write leader, or you accept split-brain risk.
Cost signals
When claim queries peg CPU and archiving cannot keep pace, tune coordination first. Move high-volume transport to a broker only if that does not solve the bottleneck.
Source
Use the article for explanation, then use these files when you want the complete SQL and TypeScript in one place.