On July 12, 2023, at 20:29 UTC, one of our replicated and highly available database systems experienced severe service degradation and didn't automatically recover. Our incident response team was alerted. This issue caused intermittent errors and high latency in the overall user experience, leading to a partial outage across the VTEX Platform. Sales flow, Product Indexing, and the Administrative Environment were severely impacted.
At 21:15 UTC, the initial issue was manually remediated. Unfortunately, the circuit breakers in several of our services had opened and failed to close automatically once the initial issue was resolved. This prolonged the incident for some customers. The circuit breakers are in place to avoid complete unavailability, and thanks to them a subset of our platform continued to operate.
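For readers unfamiliar with the pattern, the following is a minimal, generic sketch of a circuit breaker with automatic recovery. It is not VTEX's implementation: the class name, the thresholds, and the `clock` parameter are hypothetical and chosen only for illustration. It shows the step that did not happen during this incident, where an open breaker allows a trial call after a timeout and closes again if that call succeeds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker with closed, open, and half-open states."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto an unhealthy dependency.
                raise CircuitOpenError("dependency still considered unhealthy")
            # Timeout elapsed: let one trial call through.
            self.state = "half_open"
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, (re)opens the breaker.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.state = "closed"
        self.opened_at = None
```

In this sketch, a breaker that never leaves the open state after the dependency recovers would keep rejecting calls, which is the kind of behavior that required manual remediation here.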
By 22:00 UTC, our incident response team was still investigating the failure's cascading effects and manually implementing remediations. At 22:32 UTC, we confirmed that remediation efforts were underway to recover the platform. However, some intermittent errors and higher latency persisted for several customers.
At 22:56 UTC, the team's additional remediation actions were proving successful, with sessions and orders gradually increasing towards expected levels.
At 23:03 UTC, we validated that the remediation actions had been effective, resulting in a steady recovery.
Moving forward, the following actions will be taken to prevent similar incidents:
- Re-evaluating the size of the failure domain associated with the replicated and highly available database system.
- Working with one of VTEX's cloud providers to investigate and fix the cause of the severe service degradation in the database system.
- Investigating and fixing the bug related to the circuit breakers not properly closing once the database issue had been resolved.
Efforts are underway to enhance the infrastructure and implement additional safeguards to bolster the platform's resiliency. Improvements will also be made to the timing, frequency, and quality of communications via status.vtex.com.
We appreciate your understanding and patience during the incident. We remain committed to providing the highest level of service and will continue working diligently to ensure uninterrupted service. We apologize for any inconvenience caused and thank you for your ongoing support.
Posted Jul 13, 2023 - 00:36 UTC