Elevated 5xx Error Rate in Navigation flow
Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

Wed, Aug 30, 2023, 12:18 PM

Resolved

This incident has been resolved.

On August 24, 2023, between 21:03 and 21:30 UTC, we observed elevated error rates on our platform. They were caused by changes to the configuration of our cloud infrastructure, made as follow-up actions to our <a href="https://status.vtex.com/incidents/0lx9b68z8k80">previous incident</a>.

We apologize for any inconvenience this may have caused and appreciate your patience as we implement additional safeguards in our infrastructure to ensure uninterrupted service.

Timeline:

At 21:03 UTC, our team applied a change to our cloud infrastructure configuration. The change was intended to reduce the size of the failure domain associated with interactions between applications by isolating critical applications from non-critical ones.

At 21:09 UTC, our alarms triggered, alerting our incident response team to a rapid increase in error rates on our platform.

At 21:14 UTC, the incident response team identified that one of our clusters had degraded performance and started mitigation actions. These actions involved redirecting network traffic from the degraded cluster to healthy clusters. Unfortunately, this cannot be done instantly, because an abrupt change in traffic volume could negatively impact the healthy clusters and cause a global outage.

At 21:30 UTC, the mitigation actions were completed and errors ceased.
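
The gradual redirection described at 21:14 UTC can be pictured with a minimal sketch. This is only an illustration, not our actual tooling: the function name `ramp_schedule`, the percentage-based weights, and the step size are all assumptions made for the example. It shows how traffic might be moved off a degraded cluster in small increments, so that healthy clusters absorb the load in stages rather than all at once.

```python
def ramp_schedule(start_weight: int, step: int) -> list[tuple[int, int]]:
    """Return (degraded, healthy) traffic percentages for each ramp stage.

    Hypothetical helper: `start_weight` is the share of the affected traffic
    still served by the degraded cluster, and `step` is the maximum share
    moved to healthy clusters per stage.
    """
    if not 0 <= start_weight <= 100:
        raise ValueError("start_weight must be between 0 and 100")
    if step <= 0:
        raise ValueError("step must be positive")

    schedule = []
    degraded = start_weight
    while degraded > 0:
        # Move at most `step` percentage points per stage, never below zero.
        degraded = max(0, degraded - step)
        schedule.append((degraded, 100 - degraded))
    return schedule


if __name__ == "__main__":
    # In a real system, each stage would be followed by a pause to confirm
    # that the healthy clusters are stable before shifting more traffic.
    for degraded, healthy in ramp_schedule(100, 25):
        print(f"degraded={degraded}% healthy={healthy}%")
```

A smaller step lowers the risk of overloading the healthy clusters but lengthens the mitigation, which is the trade-off behind the gap between 21:14 and 21:30 UTC.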

Thu, Aug 24, 2023, 10:38 PM

Monitoring

A fix has been implemented and we are monitoring the results.

Thu, Aug 24, 2023, 09:32 PM

Identified

The issue has been identified and a fix is being implemented.

Thu, Aug 24, 2023, 09:21 PM