Global Outage in the Platform
Resolved
Lasted for 8h

On July 13, 2023, at 15:02 UTC, our alarms were triggered for a possible global outage in the VTEX Platform. The incident response team promptly gathered, identified an issue in our License Manager module, and started to remediate it.

At 15:14 UTC, we rolled back a deployment and observed a gradual recovery in the License Manager module. Unfortunately, we also identified a cascading failure in several modules.

At 15:36 UTC, we mitigated most of the cascading failure, but one part of our underlying infrastructure was still degraded, impacting almost half of our global sessions and orders.

At 16:47 UTC, after several actions that increased platform stability but did not consistently restore it to 100%, we began a disaster recovery protocol to completely rebuild the degraded part of the underlying infrastructure mentioned in the previous paragraph. Over the next two hours, the sales flow gradually recovered from almost half to almost 100%.

At 19:26 UTC, we had recovered more than 90% of the global sales flow. Shortly afterwards, recovery reached 100%, completing our disaster recovery phase. Some minor instability persisted for a few customers in the form of a higher-than-normal order cancellation rate of approximately 2-5%, which was resolved at 23:20 UTC.

Throughout this incident, our incident response team and engineering team worked to restore the platform's functionality.

Moving forward, here is a non-exhaustive list of actions that will be taken to prevent such incidents. Additional actions will be evaluated and prioritized as well:

  • Improve our system deployment metrics so that we can more promptly identify services that have gone without updates for a certain period, in order to ensure preventive change management.

  • Redesign our caching strategy across multiple services, adding flush-cache knobs that can be adjusted more easily in a centralized way. This will reduce our time to mitigate incidents by avoiding the need to provision capacity in order to invalidate certain caches.
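As an illustration of the second action only, a centralized flush-cache knob can be sketched as an epoch counter that every service-level cache consults on read. This is a minimal sketch under stated assumptions, not a description of the design VTEX will adopt: the class names `FlushControl` and `EpochCache`, the `license-manager` namespace label, and the in-memory storage are all hypothetical.

```python
class FlushControl:
    """Hypothetical central knob: one epoch counter per cache namespace."""

    def __init__(self):
        self._epochs = {}

    def epoch(self, namespace):
        # Namespaces that were never flushed start at epoch 0.
        return self._epochs.get(namespace, 0)

    def flush(self, namespace):
        # Bumping the epoch makes every entry written under an older epoch stale,
        # without touching the entries themselves.
        self._epochs[namespace] = self.epoch(namespace) + 1


class EpochCache:
    """Hypothetical per-service cache that treats old-epoch entries as misses."""

    def __init__(self, namespace, control):
        self._namespace = namespace
        self._control = control
        self._entries = {}  # key -> (epoch at write time, value)

    def get(self, key, loader):
        current = self._control.epoch(self._namespace)
        hit = self._entries.get(key)
        if hit is not None and hit[0] == current:
            return hit[1]
        # Miss or stale entry: reload and store under the current epoch.
        value = loader(key)
        self._entries[key] = (current, value)
        return value


control = FlushControl()
cache = EpochCache("license-manager", control)  # illustrative namespace name
calls = []


def load(key):
    # Stand-in for an expensive backend lookup; records each call.
    calls.append(key)
    return f"value-for-{key}-v{len(calls)}"


first = cache.get("account-1", load)
second = cache.get("account-1", load)  # served from cache
control.flush("license-manager")       # single central adjustment
third = cache.get("account-1", load)   # reloaded after the flush
```

In this sketch a flush is a single counter update, so invalidation does not require extra capacity to walk or delete entries; stale entries are simply overwritten the next time they are read.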

Efforts are underway to enhance the infrastructure and implement additional safeguards to bolster the platform's resiliency. Improvements will also be made to the timing, frequency, and quality of communications via status.vtex.com.

We appreciate your understanding and patience during the incident. We remain committed to providing the highest level of service and will continue working diligently to ensure uninterrupted service. We apologize for any inconvenience caused and are grateful for your ongoing support.

Thu, Jul 13, 2023, 11:26 PM
Affected components

No components marked as affected

Updates

Resolved

Thu, Jul 13, 2023, 11:26 PM
1h earlier...

Monitoring

We are observing global sessions and orders at nominal levels. As we continue to monitor the situation closely, our team is investigating reported scenarios of canceled orders in the Payments flow.

We remain committed to maintaining the stability of our platform and will address any remaining issues to guarantee a seamless user experience.

Thu, Jul 13, 2023, 09:48 PM
53m earlier...

Monitoring

Our additional efforts have been effective and recovered the remaining degraded infrastructure. We are observing a steady recovery of sessions and orders towards nominal levels.

As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively observing the platform and will address any remaining issues to guarantee a seamless user experience.

Thu, Jul 13, 2023, 08:55 PM
1h earlier...

Identified

We'd like to give you the latest status update regarding the ongoing efforts to restore our platform.

While part of the sales flow is still experiencing negative impacts, we have recovered 90% or more of our global sales flow, bringing sessions and orders close to nominal levels. To keep this progress stable, we are carefully distributing traffic to the healthy infrastructure, taking precautions to avoid any regression in our mitigation efforts.

We appreciate your continued patience and support as we work towards a complete resolution. We will provide further updates as we make additional progress.

Thu, Jul 13, 2023, 07:16 PM
1h earlier...

Identified

We'd like to give you the latest status update regarding the ongoing efforts to restore our platform.

While part of the sales flow is still experiencing negative impacts, our disaster recovery plan is working. We are successfully recovering sessions, and we have established a healthy infrastructure that is handling a significant portion of our traffic. To keep this progress stable, we are carefully distributing traffic to the healthy infrastructure, taking precautions to avoid any regression in our mitigation efforts. Our team is working meticulously to maintain the highest level of service during this recovery phase.

We appreciate your continued patience and support as we work towards a complete resolution. We will provide further updates as we make additional progress.

Thu, Jul 13, 2023, 06:07 PM
23m earlier...

Identified

Part of our sales flow continues to be negatively impacted, and we have been executing our disaster recovery plan to rebuild the affected infrastructure. However, the process is taking longer than initially anticipated. We understand the frustration and inconvenience this may cause. We want to assure you that our entire incident response team and associated engineering teams are working to expedite the recovery process. Our top priority is to restore the platform to its full functionality as soon as possible.

Thu, Jul 13, 2023, 05:44 PM
41m earlier...

Identified

We'd like to give you the latest status update regarding the ongoing efforts to restore our platform.

Our team is actively working around the clock to recover the system entirely. While progress has been made, we are still in the process of resolving the issue completely. We appreciate your patience and understanding during this time. Rest assured that we are dedicating all available resources to expedite the resolution and ensure a seamless experience for our users. We will continue to provide regular updates as we work toward the full recovery of the platform.

Thu, Jul 13, 2023, 05:02 PM
25m earlier...

Identified

We want to provide an update on the current status of the platform. While we have progressed in recovering a portion of our sales flow, our team is still actively working towards a full recovery. Although we have observed a gradual return of orders to normal levels, ongoing efforts and additional actions are still being taken to ensure the complete restoration of the platform. We understand the importance of a seamless user experience and are fully committed to resolving any remaining issues.

Thu, Jul 13, 2023, 04:37 PM
27m earlier...

Identified

We have made progress in recovering a portion of our sales flow, which has resulted in a gradual return of orders to normal levels. While this is a positive development, we are still actively working to ensure the complete recovery of our platform. Our team is implementing additional actions and measures to address any remaining issues.

Thu, Jul 13, 2023, 04:09 PM
10m earlier...

Identified

We are working diligently to fully resolve the ongoing issue. We have made progress in restoring functionality to the Stores navigation and Administrative environment, although they are not yet fully operational.

Our team remains dedicated to taking further actions in order to achieve a complete recovery of the sales flow. We appreciate your patience and understanding as we work toward a resolution. We will provide further updates as the situation progresses.

Thu, Jul 13, 2023, 03:59 PM
15m earlier...

Identified

We continue working toward the full resolution of this issue. We have recovered the navigation flow, but the checkout process is still degraded.

Thu, Jul 13, 2023, 03:43 PM
12m earlier...

Identified

We have identified the root cause of the increased error rates. We are working to resolve this issue.

Thu, Jul 13, 2023, 03:30 PM
20m earlier...

Investigating

We are investigating increased error rates in our platform.

Thu, Jul 13, 2023, 03:09 PM