On-call responsibilities
Overview
- Respond to urgent customer issues
  - Autonomous outages
  - High-priority skill and workflow issues
- Monitor overall product performance
- Production deployment on Sunday
  - backend
  - webapp
  - workers
  - extension
Monitoring levels
There are 3 levels of urgency and priority for the signals being monitored:
- Critical alerts
- Working hours alerts
- Weekly monitoring dashboards
  - Guided failures
  - Backend errors
  - Frontend errors
  - LLM timeouts
Alerts
- Everything the on-call engineer needs to handle is escalated through incident.io
- All alerts need to be acknowledged and investigated
- Post updates and keep discussion in the alerts channel
- On-call is responsible for tuning the alert (or for opening an item in Linear)
General alert guidelines
- Critical alerts must have a troubleshooting guide (TSG)
- Alerts should contain a link to the TSG (in Axiom: Monitor description)
- New critical alerts must go through a testing phase before being enabled in production
- Alerts with a [Test] prefix in the title are sent to a test channel in Slack
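
The guidelines above could be checked mechanically. The following is a minimal sketch, not the actual routing logic used by incident.io, Axiom, or Slack: the `Alert` type, the severity values, the channel names, and the "description contains a URL" heuristic for detecting a TSG link are all assumptions made for illustration.

```python
from dataclasses import dataclass

TEST_PREFIX = "[Test]"


@dataclass
class Alert:
    title: str
    severity: str          # hypothetical values: "critical" or "working_hours"
    description: str = ""  # stands in for the Monitor description in Axiom


def target_channel(alert: Alert) -> str:
    """Send [Test]-prefixed alerts to the test channel, all others to the alerts channel."""
    if alert.title.strip().startswith(TEST_PREFIX):
        return "#alerts-test"  # hypothetical channel name
    return "#alerts"           # hypothetical channel name


def guideline_violations(alert: Alert) -> list[str]:
    """Return the guideline problems found for one alert (empty list if none)."""
    problems = []
    # Assumption: a TSG link is any URL in the description.
    if alert.severity == "critical" and "http" not in alert.description:
        problems.append("critical alert has no TSG link in its description")
    return problems
```

A check like `guideline_violations` could be run over all monitors during the testing phase, so that a new critical alert is only promoted out of the test channel once it links to its TSG.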
Shift handover
- Post a brief summary of the main issues encountered during the shift in the on-call channel in Slack