📅 On-Call Schedule
Rotation calendar, shift assignments, and escalation contacts · Updated per sprint cycle
Current Shift
ALPHA
Primary On-Call
T-JOSH
Operator lead · UTC+10
Backup On-Call
T-BOT
Automated agent · 24/7
Escalations Today
0
No active incidents
📆 This Week's Rotation
🧑‍💻 Current Shift Assignments
Alpha Shift
🕐 00:00 – 08:00 UTC · Mon–Fri
🦆 T-JOSH (Josh): Primary operator · Galaxy infra lead (PRIMARY)
🤖 T-BOT (Automation): Alert router · auto-escalation (BACKUP)
Bravo Shift
🕗 08:00 – 16:00 UTC · Mon–Fri
🦆 T-JOSH (Josh): Primary operator · core hours (PRIMARY)
🤖 T-BOT (Automation): Monitoring agent · fallback (BACKUP)
Charlie Shift
🕓 16:00 – 24:00 UTC · Mon–Fri
🤖 T-BOT (Automation): Primary monitoring · low-traffic window (PRIMARY)
🦆 T-JOSH (Josh): On-call backup · escalations only (BACKUP)
Weekend / Holiday
🕐 All hours UTC · Sat–Sun + holidays
🤖 T-BOT (Automation): 24/7 automated monitoring (PRIMARY)
🦆 T-JOSH (Josh): P0 escalations only · 30 min SLA (P0 ONLY)
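The rotation above can be expressed as a small lookup. The following is a minimal sketch, not the page's actual implementation: the function name `current_shift`, the `WEEKDAY_SHIFTS` table, and the `holidays` parameter are hypothetical names introduced for illustration, and the holiday calendar itself is not specified in the schedule.

```python
from datetime import datetime, timezone

# (shift name, start hour, end hour, primary, backup), all hours in UTC.
# Values are copied from the Mon-Fri rotation listed above.
WEEKDAY_SHIFTS = [
    ("Alpha", 0, 8, "T-JOSH", "T-BOT"),
    ("Bravo", 8, 16, "T-JOSH", "T-BOT"),
    ("Charlie", 16, 24, "T-BOT", "T-JOSH"),
]


def current_shift(now, holidays=frozenset()):
    """Return (shift name, primary, backup) for a timezone-aware datetime.

    `holidays` is a hypothetical set of datetime.date values; the schedule
    does not say where the holiday calendar comes from.
    """
    now = now.astimezone(timezone.utc)
    # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
    if now.weekday() >= 5 or now.date() in holidays:
        # On weekends and holidays T-JOSH is reachable for P0 escalations only.
        return ("Weekend / Holiday", "T-BOT", "T-JOSH")
    for name, start, end, primary, backup in WEEKDAY_SHIFTS:
        if start <= now.hour < end:
            return (name, primary, backup)
    raise ValueError("no shift covers this time")
```

For example, `current_shift(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))` falls on a Monday at 09:00 UTC and resolves to the Bravo shift with T-JOSH as primary.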
📞 Escalation Contacts
Primary Contact
Operator: T-JOSH
Telegram: @spaceduck_ops
Email: ops@spaceduck.bot
Response SLA: 15 min (P0)
AWS Emergency
AWS Support: console.aws.amazon.com
Account ID: 121546003735
Region: us-east-1
Support plan: Developer+
🚨 Escalation Path
1. Automated alert fires (T+0 min): CloudWatch alarm triggers → T-BOT receives event → logs to alert history → attempts auto-remediation if playbook exists
2. Primary on-call notified (T+5–15 min): T-JOSH receives Telegram notification → acknowledges within 15 min → begins triage using mission-control.html
3. Incident declared, if unresolved (T+30 min): Incident mode enabled on Mission Control → Operator Shift Log updated → handoff bundle exported → governance log entry created
4. AWS Support escalated, if infra failure (T+60 min): Open case via AWS Support Console → attach CloudWatch logs and Lambda version info → request priority callback
5. Post-incident review (T+24 h): Complete postmortem within 24h → update GOVERNANCE-LOG.md → update runbooks and alert thresholds → share shift handoff pack
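The timeline in the escalation path can be sketched as a table of offsets plus the two conditions it names. This is an illustrative sketch only: `ESCALATION_STEPS` and `steps_due` are hypothetical names, step 2 is keyed to the start of its T+5–15 min window, and the T+24 h entry is treated as the postmortem deadline.

```python
# (minutes after the alert, step name), taken from the escalation path above.
ESCALATION_STEPS = [
    (0, "Automated alert fires"),
    (5, "Primary on-call notified"),   # window runs from T+5 to T+15 min
    (30, "Incident declared"),         # only if still unresolved
    (60, "AWS Support escalated"),     # only if it is an infra failure
    (24 * 60, "Post-incident review"), # postmortem deadline
]


def steps_due(minutes_elapsed, resolved=False, infra_failure=False):
    """List the steps whose start time has passed and whose condition holds."""
    due = []
    for offset, step in ESCALATION_STEPS:
        if minutes_elapsed < offset:
            break  # steps are sorted by offset, so nothing later is due
        if step == "Incident declared" and resolved:
            continue
        if step == "AWS Support escalated" and not infra_failure:
            continue
        due.append(step)
    return due
```

Calling `steps_due(45)` for an unresolved alert returns the first three steps; passing `infra_failure=True` at 60 minutes or later adds the AWS Support escalation.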
📋 Severity Definitions
P0 - Critical
Complete service outage or data loss risk
Response: 15 min · 24/7 · All hands
Examples: Lambda down, auth broken, DynamoDB unavailable
P1 - High
Degraded functionality, elevated error rate
Response: 1 hour · business hours primary
Examples: SES sandbox, peck failures >10%, slow agents
P2 - Medium
Non-critical issue, workaround available
Response: next business day
Examples: UI glitches, stale caches, minor config drift
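The severity table above maps naturally onto a small data structure that alert tooling could consult. The sketch below is a hedged illustration: the `Severity` class, the `SEVERITIES` mapping, and `response_target` are hypothetical names, and "business hours" is not defined anywhere in this schedule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Severity:
    label: str
    response_minutes: Optional[int]  # None stands for "next business day"
    around_the_clock: bool           # True only where the table says 24/7


# Values copied from the severity definitions above.
SEVERITIES = {
    "P0": Severity("Critical", 15, True),
    "P1": Severity("High", 60, False),
    "P2": Severity("Medium", None, False),
}


def response_target(level):
    """Return a human-readable response target for a severity level."""
    sev = SEVERITIES[level]
    if sev.response_minutes is None:
        return "next business day"
    coverage = "24/7" if sev.around_the_clock else "business hours"
    return f"{sev.response_minutes} min ({coverage})"
```

Keeping the targets in one mapping means a change to a response time is made in a single place; an unknown level such as "P3" raises `KeyError` rather than silently falling back to a default.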