Partial global sales outage due to network failure across multiple failure domains
Incident Report for VTEX
Postmortem

On Sep 27, 2022, between 16:12 UTC and 19:56 UTC, our customers initially experienced high latency and timeout errors in our administrative environment. From 17:13 UTC onward, the incident began to significantly impact sales, affecting 50% of our global orders per minute. We prioritised mitigating the impact on the sales flow, and global orders returned to normal levels at 19:28 UTC. At 22:02 UTC, after more than one hour of monitoring the platform, we changed the incident status to resolved.

During the incident, we identified the direct contributing factors that led to this outage. We mitigated them and added protections to prevent a recurrence. We are now focusing on further understanding the indirect contributing factors and triggers that led to the incident in the first place. From that, we will define a more thorough plan that addresses the problem broadly rather than only this particular incident. We will publish more details in the coming days.

Last but not least, we understand that this incident had a considerable impact on our customers, and we apologise for it. While incidents do happen, we take them very seriously and do our best to respond to and mitigate them accordingly. Equally important, we do our best to learn from them. All in all, we understand that reliability is an important aspect of the service we provide.

You can read more details here: https://io.vtex.com.br/incident-report/Incident Sep. 27%2C 2022_ Partial global sales outage due to network failure across multiple failure domains.pdf

Posted Sep 28, 2022 - 17:38 GMT-03:00

Resolved
All systems are nominal; we have fully recovered.
Posted Sep 27, 2022 - 19:02 GMT-03:00
Update
We are still working on fixing the admin's increased latency and error rates and restoring it to normal behaviour. All other systems are operating normally.
Posted Sep 27, 2022 - 18:26 GMT-03:00
Update
We are still monitoring the increased latency in our admin environment and working to restore it to normal behavior. All other systems are back to normal.
Posted Sep 27, 2022 - 17:23 GMT-03:00
Monitoring
Our mitigation actions have restored the performance of the stores. We are monitoring the result of our actions and confirming that our administrative environment has returned to normal.
Posted Sep 27, 2022 - 16:56 GMT-03:00
Update
Our mitigation actions have restored a large part of the stores' behavior, and we expect to be back to normal soon.
Posted Sep 27, 2022 - 16:32 GMT-03:00
Update
We are continuing to work on restoring the behavior of the stores. We have seen some improvement in the webstore flow, but we are not yet back to normal levels.
Posted Sep 27, 2022 - 16:07 GMT-03:00
Update
We are continuing to work on a fix for this issue.
Posted Sep 27, 2022 - 15:24 GMT-03:00
Update
We can confirm that this partial outage is impacting sales, especially on VTEX IO stores, mostly because customers are unable to navigate the store smoothly. We are working to restore normal latency and behavior.
Posted Sep 27, 2022 - 14:46 GMT-03:00
Update
We have confirmed that a portion of sales in some stores has been affected. We are continuing to apply mitigation measures.
Posted Sep 27, 2022 - 14:31 GMT-03:00
Identified
We've identified degraded performance in VTEX IO clusters. The admin panel is the most affected. Symptoms include high latency when serving pages and an increased number of timeouts.

We are applying mitigation measures in order to reestablish normal behavior.
Posted Sep 27, 2022 - 14:04 GMT-03:00
This incident affected: Checkout, WebStore, Administrative Environment, and Internal Modules.