Partial global sales outage due to network failure across multiple failure domains

Resolved

We’ve published a write-up of this incident.

Read it here

Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

On Sep 27, 2022, between 16:12 UTC and 19:56 UTC, our customers experienced high latency and timeout errors, initially in our administrative environment. From 17:13 UTC onward, the incident significantly impacted sales, affecting 50% of our global orders per minute. We prioritised mitigating the impact on the sales flow, and global orders returned to normal levels at 19:28 UTC. At 22:02 UTC, we changed the incident status to resolved after more than one hour of monitoring the platform.

During the incident, we identified the direct contributing factors that led to this outage. We mitigated them and added protections to prevent a recurrence. We are now focusing on understanding the indirect contributing factors and triggers that led to the incident in the first place. From that, we will define a more thorough plan that addresses the problem broadly rather than only this particular incident. We will publish more details in the coming days.

Last but not least, we understand that this incident had a considerable impact on our customers, and we apologise for it. While incidents do happen, we take them very seriously and do our best to respond to and mitigate them. Equally important, we do our best to learn from them. We understand that reliability is an important aspect of the service we provide.

Wed, Sep 28, 2022, 11:51 AM

Resolved

All systems are nominal, and we have fully recovered.

Tue, Sep 27, 2022, 10:02 PM(13 hours earlier)

Monitoring

We are still working on fixing the admin's increased latency and error rates and restoring it to normal behaviour. All other systems are operating normally.

Tue, Sep 27, 2022, 09:26 PM(36 minutes earlier)

Monitoring

We are still monitoring our admin environment's increased latency and restoring it to normal behavior. All other systems are back to normal.

Tue, Sep 27, 2022, 08:23 PM(1 hour earlier)

Monitoring

Our mitigation actions have restored the performance of the stores. We are monitoring the result of our actions and confirming that our administrative environment has returned to normal.

Tue, Sep 27, 2022, 07:56 PM(27 minutes earlier)

Identified

Our mitigation actions have restored a large part of the stores' behavior, and we expect to be back to normal soon.

Tue, Sep 27, 2022, 07:32 PM(23 minutes earlier)

Identified

We are continuing to work on restoring the behavior of the stores. We have seen some improvement in the webstore flow, but we are not yet back to normal levels.

Tue, Sep 27, 2022, 07:07 PM(25 minutes earlier)

Identified

We are continuing to work on a fix for this issue.

Tue, Sep 27, 2022, 06:24 PM(42 minutes earlier)

Identified

We can confirm that this partial outage is impacting sales, especially on VTEX IO stores, mostly because customers are unable to navigate the stores smoothly. We are working to restore normal latency and behavior.

Tue, Sep 27, 2022, 05:46 PM(38 minutes earlier)

Identified

We have confirmed that a portion of some stores' sales has been affected. We are continuing to apply mitigation measures.

Tue, Sep 27, 2022, 05:31 PM(14 minutes earlier)

Identified

We've identified degraded performance in VTEX IO clusters. The admin panel is the most affected. Symptoms include high latency when serving pages and an increased number of timeouts.

We are applying mitigation measures in order to reestablish normal behavior.

Tue, Sep 27, 2022, 05:04 PM(27 minutes earlier)