On-call responsibilities

Overview

Respond to urgent customer issues
- Autonomous outages
- High priority skill and workflow issues
Monitor overall product performance
- See Monitoring levels
Production deployment on Sunday
- backend
- webapp
- workers
- extension

Monitoring levels

There are 3 levels of urgency and priority for the signals being monitored:

Critical alerts
Working hours alerts
Weekly monitoring dashboards
- Guided failures
- Backend errors
- Frontend errors
- LLM timeouts

Alerts

Everything on-call needs to handle is escalated through incident.io
All alerts must be acknowledged and investigated
Updates and discussion go in the alerts channel
On-call is responsible for tuning the alert (or for opening an item in Linear)

General alert guidelines

Critical alerts must have a TSG
Alerts should contain a link to the TSG (in Axiom, put it in the monitor description)
New Critical alerts must go through a testing phase before being enabled in production
- Alerts with [Test] prefix in the title will be sent to a test channel in Slack
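
The guidelines above can be sketched as a small pre-flight check. This is an illustration only, not the team's actual tooling: the Alert class, the severity labels, the channel names "alerts-test" and "alerts", and the rule that any URL in the description counts as a TSG link are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

TEST_PREFIX = "[Test]"  # prefix named in the guidelines above


@dataclass
class Alert:
    """Hypothetical alert definition; field names are assumptions."""
    title: str
    severity: str          # assumed labels: "critical" or "working_hours"
    description: str = ""  # where the TSG link is expected to live


def slack_channel(alert: Alert) -> str:
    """Route alerts still in the testing phase to a test channel.

    Channel names are placeholders, not the real Slack channels.
    """
    if alert.title.startswith(TEST_PREFIX):
        return "alerts-test"
    return "alerts"


def guideline_violations(alert: Alert) -> List[str]:
    """Return the guideline problems found for one alert definition."""
    problems = []
    # Assumption: a TSG link is any http(s) URL in the description.
    has_link = "http://" in alert.description or "https://" in alert.description
    if not has_link:
        if alert.severity == "critical":
            problems.append("critical alert must have a TSG link")
        else:
            problems.append("alert should contain a TSG link")
    return problems


if __name__ == "__main__":
    new_alert = Alert(title="[Test] LLM timeouts spike", severity="critical")
    print(slack_channel(new_alert))
    print(guideline_violations(new_alert))
```

A check like this could run whenever an alert definition changes, so a Critical alert without a TSG link is caught before the [Test] prefix is removed.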

Shift handover

Post a brief summary of the main issues encountered during the shift in the on-call channel in Slack

Useful links