Elevated errors and performance degradation

Incident Report for HyperTrack

Postmortem

Postmortem: System-Wide Outage Due to Database Degradation

Incident Date: May 23, 2025

Time to Resolution: 85 minutes

Status: Resolved

Severity: Critical (P0)

Summary

On May 23, 2025, our platform experienced a widespread outage caused by degraded performance in our database infrastructure, specifically in a set of read replicas. This degradation resulted in elevated error rates and unavailability across multiple APIs, including Orders, Workers, Places, and SDK-related services. The issue began at 17:20 UTC and was fully resolved at 18:45 UTC, 85 minutes later.

We understand how critical our services are to your operations and sincerely apologize for the disruption.

What Happened

A query pattern targeting a key Orders API table failed to use a necessary index. The resulting full table scans overloaded some of our read replicas, and several core APIs failed or experienced extreme latency.
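
To illustrate the failure mode described above, here is a minimal sketch of how a query that bypasses an index can be detected from its query plan. It uses Python's built-in sqlite3 module purely as a stand-in database; the table, columns, index name, and helper functions (orders, worker_id, status, idx_orders_worker, query_plan, uses_full_scan) are hypothetical and do not reflect the actual schema or tooling involved in this incident.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); the detail column holds the text.
    return [row[3] for row in rows]


def uses_full_scan(conn, sql, params=()):
    """True if any plan step is a SCAN (a full pass) rather than an index SEARCH."""
    return any(step.startswith("SCAN") for step in query_plan(conn, sql, params))


# Hypothetical schema: an orders table with an index on worker_id only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, worker_id TEXT, status TEXT)"
)
conn.execute("CREATE INDEX idx_orders_worker ON orders (worker_id)")

# Filters on the indexed column, so the planner can seek directly into the index.
indexed_query = "SELECT id FROM orders WHERE worker_id = ?"
# Filters on a column with no index, so every row must be read.
unindexed_query = "SELECT id FROM orders WHERE status = ?"

for sql, arg in ((indexed_query, "w1"), (unindexed_query, "open")):
    label = "FULL SCAN" if uses_full_scan(conn, sql, (arg,)) else "index used"
    print(f"{label}: {sql}")
```

A check of this kind could run in a review pipeline so that a query plan regression is caught before it reaches production, although other database engines expose their plans through different commands and output formats.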

Impact

  • Customers experienced timeouts or errors when accessing Orders, Workers, and Places APIs
  • Monitoring and dashboard functionality was temporarily unavailable

What We Did

  • Identified the problematic query
  • Deployed a hotfix to ensure proper index usage
  • Applied a secondary patch to reduce load when workers were not actively tracking
  • Restarted degraded infrastructure and monitored stabilization
  • Performed a full incident review across impacted components

Remediation and Next Steps

We are taking the following actions to ensure this does not happen again:

  • Automated slow-query detection: We’re enhancing our review pipeline with weekly audits and real-time alerting.
  • Improved infrastructure alarms: CPU and query performance alarms will provide earlier visibility into degradation.
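
As a rough sketch of what the two items above could look like in practice, the following shows one way to flag repeatedly slow queries and to raise an alarm on sustained high CPU. Everything here is an assumption made for illustration: the QuerySample type, the function names, and the thresholds (500 ms, 3 occurrences, 80% CPU over 3 consecutive samples) are hypothetical and are not the values or tooling used in production.

```python
from dataclasses import dataclass


@dataclass
class QuerySample:
    fingerprint: str    # normalized query text, e.g. with literals replaced by "?"
    duration_ms: float  # observed execution time in milliseconds


def slow_query_fingerprints(samples, threshold_ms=500.0, min_hits=3):
    """Return fingerprints seen at or above threshold_ms at least min_hits times."""
    counts = {}
    for sample in samples:
        if sample.duration_ms >= threshold_ms:
            counts[sample.fingerprint] = counts.get(sample.fingerprint, 0) + 1
    return sorted(fp for fp, hits in counts.items() if hits >= min_hits)


def cpu_alarm(cpu_percent_series, threshold=80.0, sustained_points=3):
    """True if CPU stays at or above threshold for sustained_points samples in a row."""
    run = 0
    for value in cpu_percent_series:
        # Extend the current run on a hot sample, reset it on a cool one.
        run = run + 1 if value >= threshold else 0
        if run >= sustained_points:
            return True
    return False


# Hypothetical samples: one query pattern is consistently slow, one is fast.
samples = [
    QuerySample("SELECT id FROM orders WHERE status = ?", 1200.0),
    QuerySample("SELECT id FROM orders WHERE status = ?", 950.0),
    QuerySample("SELECT id FROM orders WHERE status = ?", 1800.0),
    QuerySample("SELECT id FROM orders WHERE worker_id = ?", 4.0),
]
print(slow_query_fingerprints(samples))
print(cpu_alarm([40.0, 85.0, 90.0, 95.0]))
```

Requiring several slow occurrences, or several consecutive hot CPU samples, is a common way to reduce noise from one-off spikes while still surfacing sustained degradation early.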

Final Thoughts

We are committed to providing a stable and resilient platform. This incident has highlighted areas we must improve, and we’re taking swift action to reinforce our architecture. Thank you for your trust and patience.

Posted May 27, 2025 - 19:16 UTC

Resolved

The issues were resolved at 18:45 UTC. The team is gathering data for the postmortem and defining action steps to prevent future degradations.
Posted May 23, 2025 - 18:45 UTC

Update

We are continuing to investigate this issue. The issue emerged at 17:20 UTC.
Posted May 23, 2025 - 18:25 UTC

Investigating

We are currently investigating the issue and working to resolve it.
Posted May 23, 2025 - 18:25 UTC
This incident affected: Orders, Dashboard, Ops Dashboard, Order tracking views, and Webhooks.