Elevated Errors in the Platform
Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

Wed, Aug 2, 2023, 12:26 PM
2 weeks earlier...

Resolved

On July 12, 2023, at 20:29 UTC, one of our replicated and highly available database systems experienced severe service degradation and didn't automatically recover. Our incident response team was alerted. This issue caused intermittent errors and high latency in the overall user experience, leading to a partial outage across the VTEX Platform. Sales flow, Product Indexing, and the Administrative Environment were severely impacted.

At 21:15 UTC, the initial issue was manually remediated. Unfortunately, the circuit breakers in several of our services had opened and failed to close automatically once the initial issue was resolved. This extended the incident for some customers longer than we would have liked. The circuit breakers are in place to avoid complete unavailability, and because of them a subset of our platform continued to operate.
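For readers unfamiliar with the pattern, the following is a minimal, illustrative sketch of how a circuit breaker is generally expected to behave. It is not VTEX's implementation: the class name, thresholds, and timeout are hypothetical, chosen only to show the closed, open, and half-open states and the automatic close that a successful trial call should trigger after the dependency recovers.

```python
import time


class CircuitBreaker:
    """Hypothetical three-state breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"                      # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of sending more load to a struggling dependency.
            raise RuntimeError("circuit open: call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch, recovery depends on the half-open trial call: if that transition never happens, or its success is never recorded, the breaker stays open even though the dependency is healthy again, which is the general class of failure described above.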

By 22:00 UTC, our incident response team was still investigating the failure's cascading effects and manually implementing remediations to address the issue. At 22:32 UTC, we confirmed that remediation efforts were underway to recover the platform. However, some intermittent errors and higher latency persisted for several customers.

At 22:56 UTC, the team's additional remediation actions were proving successful, with sessions and orders gradually increasing towards expected levels.

At 23:03 UTC, we validated that the remediation actions had been effective, resulting in a steady recovery.

Moving forward, the following actions will be taken to prevent similar incidents:

  • Re-evaluating the size of the failure domain associated with the replicated and highly available database system.

  • Investigating and fixing the cause of the severe service degradation in the database system with one of VTEX's cloud providers.

  • Investigating and fixing the bug related to the circuit breakers not properly closing once the database issue had been resolved.

Efforts are underway to enhance the infrastructure and implement additional safeguards to bolster the platform's resiliency. Improvements will also be made to the timing, frequency, and quality of communications via status.vtex.com.

We appreciate your understanding and patience during the incident. We remain committed to providing the highest level of service and will continue working diligently to ensure uninterrupted service. We apologize for any inconvenience caused and are grateful for your ongoing support.

Thu, Jul 13, 2023, 12:36 AM
1h earlier...

Monitoring

Our remediation efforts have been effective, leading to a steady recovery of sessions and orders towards nominal levels. As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively monitoring the platform and will address any remaining issues to guarantee a seamless user experience.

Wed, Jul 12, 2023, 11:26 PM

Identified

Our remediation efforts have been effective, leading to a steady recovery of sessions and orders towards nominal levels. As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively monitoring the platform and will address any remaining issues to guarantee a seamless user experience.

Wed, Jul 12, 2023, 11:25 PM
28m earlier...

Identified

Our remediation actions are proving successful, and we are observing a gradual increase in sessions and orders, bringing us closer to nominal levels. However, we remain vigilant and are prepared to take additional actions to ensure the long-term stability of our platform.

Wed, Jul 12, 2023, 10:57 PM
25m earlier...

Identified

We continue to see intermittent errors and high latency in the IO Platform. The impact is being felt in Sales flow, Product Indexing, and the Administrative Environment.

Currently, we are applying remediation to recover the IO Platform gradually.

Wed, Jul 12, 2023, 10:32 PM
31m earlier...

Identified

At 17:28 BRT, one internal infrastructure service became overloaded, generating intermittent errors and high latency in the IO Platform, which caused a partial outage across the entire VTEX Platform. At 18:12 BRT, we completed the first remediation, recovering the internal service, which led to a slight recovery of the platform.

Sales flow, Product Indexing, and Administrative Environment are being severely impacted.

We are still investigating this failure's cascading effects and applying further remediations.

Wed, Jul 12, 2023, 10:00 PM
29m earlier...

Identified

We are continuing to work on a fix for this issue. The first fix did not completely solve the problem.

Wed, Jul 12, 2023, 09:31 PM
12m earlier...

Identified

The issue has been identified and a fix is being implemented.

Wed, Jul 12, 2023, 09:18 PM
20m earlier...

Investigating

We are currently investigating this issue.

Wed, Jul 12, 2023, 08:58 PM