
On-call responsibilities

Overview

  1. Respond to urgent customer issues
    • Autonomous outages
    • High priority skill and workflow issues
  2. Monitor overall product performance
  3. Production deployment on Sunday
    • backend
    • webapp
    • workers
    • extension

Monitoring levels

Monitored signals fall into three levels of urgency and priority:

  1. Critical alerts
  2. Working hours alerts
  3. Weekly monitoring dashboards
    • Guided failures
    • Backend errors
    • Frontend errors
    • LLM timeouts

Alerts

  • Everything the on-call needs to handle is escalated through incident.io
  • All alerts must be acknowledged and investigated
  • Post updates and discussion in the alerts channel
  • The on-call is responsible for tuning the alert (or for opening an item in Linear)

General alert guidelines

  • Critical alerts must have a TSG (troubleshooting guide)
  • Alerts should contain a link to the TSG (in Axiom, put it in the Monitor description)
  • New critical alerts must go through a testing phase before being enabled in production
    • Alerts with a [Test] prefix in the title are sent to a test channel in Slack
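
The guidelines above can be sketched as a small check, shown here only as an illustration. The `Alert` shape, the severity labels, and the channel names `#alerts-test` and `#alerts` are hypothetical. This is not the Axiom or incident.io API, and the link check is deliberately crude (it only looks for a URL in the description).

```python
from dataclasses import dataclass
from typing import List

TEST_PREFIX = "[Test]"
# Hypothetical Slack channel names, used for illustration only.
TEST_CHANNEL = "#alerts-test"
PROD_CHANNEL = "#alerts"


@dataclass
class Alert:
    # Hypothetical alert shape; real monitors carry more fields.
    title: str
    severity: str          # assumed labels: "critical" or "working-hours"
    description: str = ""  # monitor description, expected to hold the TSG link


def route_channel(alert: Alert) -> str:
    """Send alerts whose title starts with [Test] to the test channel."""
    if alert.title.startswith(TEST_PREFIX):
        return TEST_CHANNEL
    return PROD_CHANNEL


def guideline_violations(alert: Alert) -> List[str]:
    """Return the guideline problems found for this alert (empty if none)."""
    problems = []
    # Crude assumption: a TSG link is any URL present in the description.
    if alert.severity == "critical" and "http" not in alert.description:
        problems.append("critical alert has no TSG link in its description")
    return problems
```

A check along these lines could run while an alert is still in its testing phase, so a missing TSG link is caught before the [Test] prefix is removed.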