Audience: SRE, Support, Developers, Security
Outcomes: Fast root-cause analysis and clear health signals
Structured logs (JSON)
Include: ts, corrId, user, org, action, status, durationMs, providerRef.
Redact secrets/PANs.
Sample
{"ts":"2025-08-16T18:30:22Z","corrId":"d3b2e1f0","user":"usr_17","org":"org_42","action":"contract.approve","status":200,"durationMs":42}
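A minimal sketch of how such a line could be produced, using the field names from the Include list. The helper names (redact, log_line), the set of secret key names, and the 13-19 digit PAN pattern are assumptions for illustration, not part of the original design.

```python
import json
import re
from datetime import datetime, timezone

# Assumption: key names that should never be logged in clear text.
SECRET_KEYS = {"password", "token", "secret", "authorization", "apiKey"}
# Assumption: any 13-19 digit run is treated as a possible card number (PAN).
PAN_RE = re.compile(r"\b\d{13,19}\b")


def redact(value):
    """Recursively mask secret-looking keys and PAN-like digit runs."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k in SECRET_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return PAN_RE.sub("[PAN]", value)
    return value


def log_line(corr_id, user, org, action, status, duration_ms,
             provider_ref=None, **extra):
    """Build one compact JSON log line; extra fields are redacted first."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "corrId": corr_id,
        "user": user,
        "org": org,
        "action": action,
        "status": status,
        "durationMs": duration_ms,
    }
    if provider_ref is not None:
        record["providerRef"] = provider_ref
    record.update(redact(extra))
    return json.dumps(record, separators=(",", ":"))


print(log_line("d3b2e1f0", "usr_17", "org_42", "contract.approve", 200, 42))
```

Redacting at the point where the line is built keeps secrets out of every downstream sink, rather than relying on each log consumer to filter them.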
Metrics (Prometheus)
HTTP p95 latency / error rate
Webhook throughput, backlog, error rate
Payment success by rail
SSE clients & drops
Upload AV detections
GET /metrics → Prometheus exposition format.
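To show what the exposition format looks like on the wire, here is a stdlib-only sketch of a labelled counter and its text rendering. The Counter class is a stand-in for a real Prometheus client library, and the metric name payments_total and its labels are hypothetical.

```python
from collections import defaultdict


class Counter:
    """A tiny labelled counter; a stand-in for a real Prometheus client."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Sort labels so the same label set always maps to the same series.
        self.values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render HELP/TYPE lines plus one sample line per label set.

        Note: label values are not escaped here; a real client handles that.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value:g}")
        return "\n".join(lines) + "\n"


# Hypothetical metric for "Payment success by rail".
payments = Counter("payments_total", "Payment attempts by rail and outcome.")
payments.inc(rail="ach", outcome="success")
payments.inc(rail="ach", outcome="success")
payments.inc(rail="card", outcome="failure")

# This text is what a GET /metrics handler would return.
print(payments.render(), end="")
```

In practice a maintained client library would replace this class; the sketch only illustrates the shape of the response body.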
Traces
Propagate corrId/trace ID through workers and webhooks.
Sample at low rate in prod; 100% in incidents.
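A rough sketch of propagation and sampling under stated assumptions: the header name X-Correlation-Id, the 1% default rate, and all function names are hypothetical, since the source only says "low rate". The sampling decision is derived from the ID itself so that every hop reaches the same verdict for one request.

```python
import contextvars
import uuid
import zlib

# Holds the correlation ID for the current request or job.
corr_id_var = contextvars.ContextVar("corr_id", default=None)

HEADER = "X-Correlation-Id"  # assumption: the header name is not given in the source


def start_request(headers):
    """Reuse the caller's corrId if present, otherwise mint a new one."""
    corr_id = headers.get(HEADER) or uuid.uuid4().hex[:8]
    corr_id_var.set(corr_id)
    return corr_id


def outgoing_headers():
    """Headers to attach to webhook calls so the receiver logs the same ID."""
    return {HEADER: corr_id_var.get()}


def enqueue(payload):
    """Wrap a worker job so the corrId travels with it through the queue."""
    return {"corrId": corr_id_var.get(), "payload": payload}


def should_sample(corr_id, rate=0.01, incident=False):
    """Deterministic sampling: the same ID gets the same decision on every hop."""
    if incident:
        return True  # trace everything during incidents
    bucket = zlib.crc32(corr_id.encode()) % 10_000
    return bucket < rate * 10_000
```

Flipping the incident flag (for example from a feature flag or environment variable) raises sampling to 100% without a redeploy.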
SLOs (examples)
API availability 99.9% monthly
Webhook end-to-settle p95 < 60s
SSE reconnect MTTR < 15s
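The arithmetic behind these targets can be sketched as follows. The function names and the sample settle times are illustrative, and a 30-day month is assumed for the availability window.

```python
def error_budget_minutes(slo, days=30):
    """Minutes of allowed unavailability in a window of `days` days."""
    return (1 - slo) * days * 24 * 60


def p95(samples):
    """Nearest-rank 95th percentile, using integer math to avoid float drift."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# 99.9% over an assumed 30-day month leaves roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# Made-up webhook end-to-settle times in seconds, for illustration only.
settle_seconds = [12, 18, 25, 31, 40, 44, 52, 58, 59, 75]
meets_webhook_slo = p95(settle_seconds) < 60
```

With ten samples the nearest-rank p95 is the largest value, so a single slow delivery (75s here) is enough to miss the 60s target; larger windows make the percentile less sensitive to one outlier.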
QA checklist
Synthetic checks produce expected alerts.
Correlation IDs appear across app, webhook, worker logs.
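The second check could be automated along these lines. This is a sketch that assumes logs are already grouped by source and are JSON lines carrying a corrId field; the function name and the source labels are hypothetical.

```python
import json

# Sources named in the checklist item.
REQUIRED_SOURCES = {"app", "webhook", "worker"}


def missing_sources(log_lines_by_source, corr_id):
    """Return the required sources whose logs never mention `corr_id`."""
    seen = set()
    for source, lines in log_lines_by_source.items():
        for line in lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not JSON
            if isinstance(record, dict) and record.get("corrId") == corr_id:
                seen.add(source)
                break
    return REQUIRED_SOURCES - seen
```

An empty result means the ID was found in all three places; anything else names the hop where propagation broke.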
Runbook: “2xx but no state change”
Search by corrId → confirm DB write in txn → fix handler; reprocess from DLQ.
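One way the fixed handler and the DLQ replay might look, sketched with sqlite3 so it runs standalone. The table names (contracts, processed), the event shape, and the function names are assumptions, not the actual schema. The key points are that 200 is returned only after the transaction commits, and that replays are idempotent.

```python
import sqlite3


def handle_event(conn, event):
    """Apply the state change in one transaction; return 200 only after commit."""
    try:
        with conn:  # commits on success, rolls back on any exception
            cur = conn.execute(
                "INSERT OR IGNORE INTO processed(event_id) VALUES (?)",
                (event["id"],),
            )
            if cur.rowcount == 1:  # first time this event is seen
                updated = conn.execute(
                    "UPDATE contracts SET state = ? WHERE id = ?",
                    (event["state"], event["contractId"]),
                )
                if updated.rowcount == 0:
                    # No row changed: fail loudly instead of returning 2xx.
                    raise LookupError("contract not found")
        return 200
    except (sqlite3.Error, LookupError):
        return 500  # never report 2xx when the write did not land


def reprocess_dlq(conn, dlq):
    """Replay dead-lettered events; return the ones that still fail."""
    return [event for event in dlq if handle_event(conn, event) != 200]
```

Because a failed update rolls back the processed marker as well, the same event can be replayed from the DLQ once the underlying problem is fixed.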