Elevated Errors in the Platform
Incident Report for VTEX
Postmortem
Posted Aug 02, 2023 - 12:27 UTC

Resolved
On July 12, 2023, at 20:29 UTC, one of our replicated and highly available database systems experienced severe service degradation and didn't automatically recover. Our incident response team was alerted. This issue caused intermittent errors and high latency in the overall user experience, leading to a partial outage across the VTEX Platform. Sales flow, Product Indexing, and the Administrative Environment were severely impacted.

At 21:15 UTC, the initial issue was manually remediated. Unfortunately, circuit breakers in several of our services had opened and failed to close automatically once the initial issue had been resolved. This extended the incident for some customers for longer than we'd like. The circuit breakers are in place to avoid complete unavailability, and because of them, a subset of our platform continued to operate.
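For readers unfamiliar with the pattern, the following is a minimal, generic sketch of a circuit breaker that closes again on its own after a successful probe call. It is an illustration only and not VTEX's implementation; the class name, the thresholds, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a probe is allowed
        self._clock = clock                         # injectable so the timing can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # timeout elapsed: the next call acts as a probe
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of piling more load onto an unhealthy dependency.
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the breaker.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In this sketch, the half-open state is what lets the breaker close without manual intervention. If that probe path is missing or never succeeds, the breaker stays open even after the dependency is healthy again, which is the general kind of behavior described above.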

By 22:00 UTC, our incident response team was still investigating the failure's cascading effects and manually implementing remediations to address the issue. At 22:32 UTC (19:32 GMT-03:00), we confirmed that remediation efforts were underway to recover the platform. However, some intermittent errors and higher latency persisted for several customers.

At 22:56 UTC, the team's additional remediation actions were proving successful, with sessions and orders gradually increasing towards expected levels.

At 23:03 UTC, we validated that the remediation actions had been effective, resulting in a steady recovery.

Moving forward, the following actions will be taken to prevent similar incidents:
- Re-evaluating the size of the failure domain associated with the replicated and highly available database system.
- Investigating and fixing the cause of the severe service degradation in the database system with one of VTEX's cloud providers.
- Investigating and fixing the bug related to the circuit breakers not properly closing once the database issue had been resolved.

Efforts are underway to enhance the infrastructure and implement additional safeguards to bolster the platform's resiliency. Improvements will also be made to the timing, frequency, and quality of communications via status.vtex.com.

We appreciate your understanding and patience during the incident. We remain committed to providing the highest level of service and will continue working diligently to ensure uninterrupted service. We apologize for any inconvenience caused and are grateful for your ongoing support.
Posted Jul 13, 2023 - 00:36 UTC
Monitoring
Our remediation efforts have been effective, leading to a steady recovery of sessions and orders towards nominal levels. As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively monitoring and will address any remaining issues to guarantee a seamless user experience.
Posted Jul 12, 2023 - 23:26 UTC
Update
Our remediation efforts have been effective, leading to a steady recovery of sessions and orders towards nominal levels. As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively monitoring and will address any remaining issues to guarantee a seamless user experience.
Posted Jul 12, 2023 - 23:25 UTC
Update
Our remediation actions are successful, and we observe a gradual increase in sessions and orders, bringing us closer to nominal levels. However, we remain vigilant and are prepared to take additional actions to ensure the long-term stability of our platform.
Posted Jul 12, 2023 - 22:57 UTC
Update
We continue to have intermittent errors and high latency in the IO Platform. The overall impact can be experienced in Sales flow, Product Indexing, and Administrative Environment.

Currently, we are applying remediation to recover the IO Platform gradually.
Posted Jul 12, 2023 - 22:32 UTC
Update
At 17:28 BRT, one internal infrastructure service became overloaded, generating intermittent errors and high latency in the IO Platform, which caused a partial outage in the entire VTEX Platform. At 18:12 BRT, we completed the first remediation, recovering the internal service, which partially restored the platform.

Sales flow, Product Indexing, and Administrative Environment are being severely impacted.

We are still investigating this failure's cascading effects and applying further remediations.
Posted Jul 12, 2023 - 22:00 UTC
Update
We are continuing to work on a fix for this issue. The first fix did not completely solve the problem.
Posted Jul 12, 2023 - 21:31 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 12, 2023 - 21:18 UTC
Investigating
We are currently investigating this issue.
Posted Jul 12, 2023 - 20:58 UTC
This incident affected: Checkout, WebStore, Administrative Environment, and Internal Modules.