Elevated 5xx Error Rate in Navigation flow
Incident Report for VTEX
Postmortem
Posted Aug 30, 2023 - 12:20 UTC

Resolved
This incident has been resolved.

On August 24, 2023, between 21:03 and 21:30 UTC, we observed elevated error rates on our platform. The cause was a change to the configuration of our cloud infrastructure, made as a follow-up action to our previous status report.

We apologize for any inconvenience this may have caused, and we appreciate your patience as we implement additional safeguards in our infrastructure to ensure uninterrupted service.

Timeline:

At 21:03 UTC, our team applied a change to our cloud infrastructure configuration. The change was intended to reduce the size of the failure domain associated with interactions between applications by isolating critical applications from non-critical ones.

At 21:09 UTC, our alarms triggered, alerting our incident response team to a rapid increase in error rates on our platform.

At 21:14 UTC, the incident response team identified that one of our clusters had degraded performance and started mitigation actions. These actions involved redirecting network traffic from the degraded cluster to healthy clusters. Unfortunately, this cannot be done instantly, because abrupt changes in traffic volume could negatively impact the healthy clusters and cause a global outage.

At 21:30 UTC, the mitigation actions were completed and errors ceased.
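
The gradual redirection described in the 21:14 UTC entry can be illustrated with a minimal sketch. This is not our actual tooling: the function name `drain_schedule`, the percentage-based weights, and the step size are all hypothetical, chosen only to show why draining a cluster takes several steps instead of a single switch.

```python
def drain_schedule(initial_percent: int, step_percent: int) -> list[int]:
    """Return the successive traffic percentages left on a degraded cluster.

    Hypothetical illustration: traffic is moved away in fixed-size steps so
    that healthy clusters never absorb more than `step_percent` of extra
    load at once.
    """
    if not 0 <= initial_percent <= 100:
        raise ValueError("initial_percent must be between 0 and 100")
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be between 1 and 100")

    remaining = initial_percent
    schedule = []
    while remaining > 0:
        # Never go below zero on the final step.
        remaining = max(0, remaining - step_percent)
        schedule.append(remaining)
    return schedule


if __name__ == "__main__":
    # Each printed value is the share of traffic still on the degraded
    # cluster after one step; in practice each step would be followed by
    # a pause to confirm the healthy clusters remain stable.
    for percent in drain_schedule(100, 25):
        print(f"degraded cluster share: {percent}%")
```

A smaller step lowers the risk to the healthy clusters but lengthens the time during which the degraded cluster still serves some traffic.
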
Posted Aug 24, 2023 - 22:38 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 24, 2023 - 21:32 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 24, 2023 - 21:21 UTC
This incident affected: Checkout, WebStore, Administrative Environment, and Internal Modules.