On-call responsibilities
Overview
- Respond to urgent customer issues
  - Autonomous outages
  - High-priority skill and workflow issues
- Monitor overall product performance
- Production deployment on Sunday
  - backend
  - webapp
  - workers
  - extension
Monitoring levels
There are 3 levels of urgency and priority for the signals being monitored:
- Critical alerts
- Working hours alerts
- Weekly monitoring dashboards
  - Guided failures
  - Backend errors
  - Frontend errors
  - LLM timeouts
Alerts
- Everything the on-call engineer needs to handle is escalated through incident.io
- All alerts need to be acknowledged and investigated
- Post updates and discussion in the alerts channel
- On-call is responsible for tuning the alert (or for opening an item in Linear)
General alert guidelines
- Critical alerts must have a TSG (troubleshooting guide)
- Alerts should contain a link to the TSG (in Axiom: Monitor description)
- New Critical alerts must go through a testing phase before being enabled in production
- Alerts with a [Test] prefix in the title are sent to a test channel in Slack
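
The guidelines above can be sketched as a small pre-flight check for alert definitions. This is only an illustration and does not use the Axiom or incident.io APIs: the `Alert` class, the channel names `#alerts` and `#alerts-test`, and the rule that any URL in the description counts as a TSG link are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

TEST_PREFIX = "[Test]"
# Assumption: any http(s) URL in the description is treated as the TSG link.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Alert:
    """Hypothetical, minimal view of an alert definition."""
    title: str
    description: str
    critical: bool = False


def route_channel(alert, prod_channel="#alerts", test_channel="#alerts-test"):
    """Send [Test]-prefixed alerts to the test channel (channel names are hypothetical)."""
    if alert.title.startswith(TEST_PREFIX):
        return test_channel
    return prod_channel


def guideline_violations(alert):
    """Return a list of guideline problems; an empty list means the alert passes."""
    problems = []
    if not URL_PATTERN.search(alert.description):
        if alert.critical:
            problems.append("critical alert is missing a TSG link")
        else:
            problems.append("alert description has no TSG link")
    return problems
```

A check like this could run while a new critical alert is still in its [Test] phase, so a missing TSG link is caught before the alert is enabled in production.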