On Sep 27, 2022, between 16:12 UTC and 19:56 UTC, our customers initially experienced high latency and timeout errors in our administrative environment. From 17:13 UTC onward, the incident began to significantly impact sales, affecting 50% of our global orders per minute. We prioritised mitigating the incident on the sales flow, and our global orders returned to normal levels at 19:28 UTC. At 22:02 UTC, we changed the incident status to resolved after more than one hour of monitoring the platform.
During the incident, we were able to identify the direct contributing factors that led to this outage. We mitigated them and added protections to prevent it from recurring. We are now focusing on further understanding the indirect contributing factors and triggers that led to this happening in the first place. From that, we will define a more thorough plan that addresses the problem broadly rather than only this particular incident. We will be publishing more details in the coming days.
Last but not least, we understand that this had a considerable impact on our customers, and we apologise for it. While incidents do happen, we take them very seriously and do our best to respond to and mitigate them accordingly. Equally important, we do our best to learn from them. All in all, we understand that reliability is an important aspect of the service we provide.
You can read more details here: https://io.vtex.com.br/incident-report/Incident Sep. 27%2C 2022\_ Partial global sales outage due to network failure across multiple failure domains.pdf