Observability Core

What to log/measure, where to alert, and how to correlate.

Written by Catalin Fetean
Updated over 2 weeks ago

Audience: SRE, Support, Developers, Security
Outcomes: Fast root cause and clear health signals

Structured logs (JSON)

Include: ts, corrId, actor, org, action, status, durationMs, providerRef.
Redact secrets and PANs (card numbers) before the log line is written.

Sample

{"ts":"2025-08-16T18:30:22Z","corrId":"d3b2e1f0","actor":"usr_17","org":"org_42","action":"contract.approve","status":200,"durationMs":42}
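A minimal sketch of how such a line could be produced, assuming Python and only the standard library. The helper names (`redact`, `log_line`), the set of secret keys, and the PAN pattern are illustrative assumptions, not part of the product; only string values are scanned for card-like digit runs.

```python
import json
import re
from datetime import datetime, timezone

# Assumption: keys treated as secrets; adjust to your own schema.
SECRET_KEYS = {"password", "token", "authorization", "apiKey"}
# Runs of 13-19 digits, optionally separated by spaces or dashes, look like PANs.
PAN_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def redact(value):
    """Recursively mask secret keys and PAN-like digit runs in strings."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k in SECRET_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return PAN_RE.sub("[PAN]", value)
    return value


def log_line(corr_id, actor, org, action, status, duration_ms,
             provider_ref=None, **extra):
    """Build one JSON log line with the core fields plus redacted extras."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "corrId": corr_id,
        "actor": actor,
        "org": org,
        "action": action,
        "status": status,
        "durationMs": duration_ms,
    }
    if provider_ref is not None:
        record["providerRef"] = provider_ref
    record.update(redact(extra))
    return json.dumps(record, separators=(",", ":"))
```

Calling `log_line("d3b2e1f0", "usr_17", "org_42", "contract.approve", 200, 42)` yields a line shaped like the sample above, with `ts` set to the current UTC time.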

Metrics (Prometheus)

  • HTTP p95 latency / error rate

  • Webhook throughput, backlog, error rate

  • Payment success by rail

  • SSE clients & drops

  • Upload AV detections

GET /metrics → Prometheus exposition format.
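To illustrate what the endpoint returns, here is a small standard-library sketch that renders a labelled counter in the Prometheus text exposition format. The `Counter` class and the metric name `payments_total` are hypothetical; a real service would more likely use a maintained Prometheus client library than hand-rolled rendering.

```python
from collections import defaultdict


def _escape(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (str(value).replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


class Counter:
    """Tiny in-memory counter with labels; illustrative only."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Sort labels so the same label set always maps to the same series.
        self.values[tuple(sorted(labels.items()))] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{_escape(v)}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


# Hypothetical metric for "payment success by rail".
payments = Counter("payments_total", "Payment attempts by rail and outcome.")
payments.inc(rail="ach", outcome="success")
payments.inc(rail="card", outcome="failure")
print(payments.render(), end="")
```

A success rate per rail can then be derived at query time by dividing the `outcome="success"` series by the sum over all outcomes.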

Traces

  • Propagate corrId/trace ID through workers and webhooks.

  • Sample at a low rate in production; raise sampling to 100% during incidents.
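The two points above can be sketched as follows, assuming Python's `contextvars` for carrying the ID within a process. The header name `X-Correlation-Id`, the job envelope shape, and the 1% base sampling rate are assumptions made for illustration.

```python
import contextvars
import random
import uuid

# Assumption: header name is illustrative; lookups here are case-sensitive.
CORR_HEADER = "X-Correlation-Id"
_corr_id = contextvars.ContextVar("corr_id", default=None)


def current_corr_id():
    return _corr_id.get()


def start_request(headers):
    """Reuse the caller's correlation ID or mint a new one."""
    corr_id = headers.get(CORR_HEADER) or uuid.uuid4().hex[:8]
    _corr_id.set(corr_id)
    return corr_id


def outgoing_headers(headers=None):
    """Copy headers and attach the current correlation ID, if any."""
    out = dict(headers or {})
    if current_corr_id() is not None:
        out[CORR_HEADER] = current_corr_id()
    return out


def enqueue_job(queue, payload):
    """Wrap the payload so the worker can restore the correlation ID."""
    queue.append({"corrId": current_corr_id(), "payload": payload})


def run_job(job, handler):
    """Run a handler under the job's correlation ID, then restore the old one."""
    token = _corr_id.set(job["corrId"])
    try:
        return handler(job["payload"])
    finally:
        _corr_id.reset(token)


def should_sample(incident_mode, base_rate=0.01, rng=random.random):
    """Trace everything during an incident, otherwise a small fraction."""
    return incident_mode or rng() < base_rate
```

Putting the ID in the job envelope rather than in process state is what lets it survive the hop from the web process to a worker.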

SLOs (examples)

  • API availability 99.9% monthly

  • Webhook end-to-settle p95 < 60s

  • SSE reconnect MTTR < 15s
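As a worked example of what these targets imply, the sketch below converts an availability SLO into a downtime budget and checks a p95 target with the nearest-rank method. The function names and the 30-day window are assumptions; a monitoring stack would normally compute these from histograms rather than raw samples.

```python
import math


def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO."""
    return (1 - slo) * window_days * 24 * 60


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer such as 95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100)
    return ordered[rank - 1]


def meets_p95(samples_seconds, target_seconds=60):
    """True when the p95 of the samples is below the target."""
    return percentile(samples_seconds, 95) < target_seconds


# 99.9% over a 30-day window leaves roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
```

The budget makes the SLO actionable: alerts can fire on how fast the 43.2 minutes are being consumed rather than on individual errors.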

QA checklist

  • Synthetic checks produce expected alerts.

  • Correlation IDs appear across app, webhook, worker logs.
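The correlation-ID check can be automated with a short script along these lines. It assumes each source emits one JSON object per line with a `corrId` field, as in the sample log; the source names and function names are illustrative.

```python
import json


def corr_ids(lines):
    """Collect corrId values from JSON log lines, skipping unparsable ones."""
    ids = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("corrId"):
            ids.add(record["corrId"])
    return ids


def missing_sources(corr_id, sources):
    """Return the names of log sources where corr_id never appears."""
    return sorted(
        name for name, lines in sources.items()
        if corr_id not in corr_ids(lines)
    )


# Hypothetical log excerpts from three components.
logs = {
    "app": ['{"corrId":"d3b2e1f0","action":"contract.approve"}'],
    "webhook": ['{"corrId":"d3b2e1f0","action":"webhook.deliver"}'],
    "worker": ['{"corrId":"ffff0000","action":"job.run"}'],
}
print(missing_sources("d3b2e1f0", logs))
```

An empty result means the ID was seen everywhere; any listed source is where propagation broke.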

Runbook: “2xx but no state change”

  • Search logs by corrId → confirm whether the DB write committed within its transaction → fix the handler → reprocess the affected messages from the DLQ (dead-letter queue).
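One way a handler can avoid this failure mode is to derive its response status from the committed write itself. The sketch below uses an in-memory SQLite table purely for illustration; the `contracts` table, the `approve_contract` handler, and the DLQ message shape are assumptions, not the product's actual schema.

```python
import sqlite3


def approve_contract(conn, contract_id):
    """Return 200 only if a row was actually updated and committed."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE contracts SET state = 'approved' WHERE id = ?",
            (contract_id,),
        )
        changed = cur.rowcount
    return 200 if changed == 1 else 404


def reprocess(conn, dlq):
    """Replay dead-lettered messages; return those that still fail."""
    remaining = []
    for message in dlq:
        if approve_contract(conn, message["contractId"]) != 200:
            remaining.append(message)
    return remaining


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id TEXT PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO contracts VALUES ('c1', 'pending')")
conn.commit()

dlq = [{"corrId": "d3b2e1f0", "contractId": "c1"}]
print(reprocess(conn, dlq))
```

Because the update is idempotent, replaying the same message from the DLQ twice leaves the contract approved and does not create a second side effect.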

