Audience: Backend, SRE, Support
Outcomes: No lost events; quick root cause
Retries & backoff
Providers retry on non-2xx → your handler must be idempotent
Outbound provider calls: exponential backoff + jitter, capped
Poison events
Repeated failures → send to dead-letter queue, alert human
Provide “reprocess” in admin tooling
Observability
Logs: include
x-correlation-id
, event type,orderId
, provider ref; redact PII/secretsMetrics: webhook throughput/backlog & error rate; SSE clients & drops; payment success by rail
Alerts: backlog > threshold; no Stripe events for X minutes; release failures > N/hr; unexpected drop in SSE clients
QA checklist
Synthetic heartbeat events trigger expected alerts
Dead-lettered events are visible and reprocessable
Runbook: “Webhook 2xx but no change”
Handler swallowed error: raise log level; verify DB write in txn; add unit test.