On July 13, 2023, at 15:02 UTC, our alarms signaled a possible global outage on the VTEX Platform. The incident response team promptly gathered, identified an issue in our License Manager module, and began remediation.
At 15:14 UTC, we rolled back a deployment and observed a gradual recovery in the License Manager module. Unfortunately, we also identified a cascading failure in several modules.
At 15:36 UTC, we had mitigated most of the cascading failure, but one part of our underlying infrastructure was still degraded, impacting almost half of our global sessions and orders.
At 16:47 UTC, after several actions had increased platform stability but not consistently up to 100%, we began a disaster recovery protocol to completely rebuild the degraded part of the underlying infrastructure mentioned in the previous paragraph. Over the next two hours, the sales flow gradually recovered from almost half to almost 100%.
At 19:26 UTC, we had recovered more than 90% of the global sales flow. Shortly afterwards, recovery reached 100%, completing our disaster recovery phase. Some minor instability persisted for a few customers due to a higher-than-normal order cancellation rate of approximately 2-5%, which resolved at 23:20 UTC.
Throughout this incident, our incident response team and engineering team worked to restore the platform's functionality.
Moving forward, we will take the following actions to prevent such incidents. This list is not exhaustive; additional actions will be evaluated and prioritized as well:
Improve our system deployment metrics so that we can more promptly identify services that have gone without updates for a certain period, in order to ensure preventive change management.
Redesign our caching strategy across multiple services, adding flush-cache knobs that can be adjusted more easily in a centralized way. This will reduce our time to mitigate incidents, because we will no longer have to provision capacity in order to invalidate certain caches.
Efforts are underway to enhance the infrastructure and implement additional safeguards to bolster the platform's resiliency. Improvements will also be made to the timing, frequency, and quality of communications via status.vtex.com.
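To illustrate the idea behind a centrally adjustable flush-cache knob, here is a minimal sketch of one common approach, generation-based invalidation. It is an assumption for illustration only and does not describe the actual VTEX design: the names CentralKnobs, GenerationalCache, and the namespace strings are hypothetical. Each cache key carries a generation number read from a central store, so bumping that number makes every older entry unreachable without issuing any mass delete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class CentralKnobs:
    """Hypothetical central store: one flush generation per cache namespace."""

    generations: Dict[str, int] = field(default_factory=dict)

    def generation(self, namespace: str) -> int:
        # Namespaces that were never flushed start at generation 0.
        return self.generations.get(namespace, 0)

    def flush(self, namespace: str) -> int:
        # "Turning the knob": bump the generation so older keys stop matching.
        self.generations[namespace] = self.generation(namespace) + 1
        return self.generations[namespace]


class GenerationalCache:
    """Hypothetical per-service cache whose keys embed the central generation."""

    def __init__(self, namespace: str, knobs: CentralKnobs) -> None:
        self.namespace = namespace
        self.knobs = knobs
        self._entries: Dict[Tuple[int, str], object] = {}

    def get_or_load(self, key: str, loader: Callable[[str], object]) -> object:
        # The lookup key combines the current generation with the caller's key.
        versioned = (self.knobs.generation(self.namespace), key)
        if versioned not in self._entries:
            self._entries[versioned] = loader(key)
        return self._entries[versioned]
```

In this sketch, calling flush on a namespace causes the next get_or_load in every cache sharing that namespace to miss and reload. Entries from older generations are not deleted here; a real system would rely on expiry or eviction to reclaim them.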
We appreciate your understanding and patience during the incident. We remain committed to providing the highest level of service and will continue working diligently to ensure uninterrupted service. We apologize for any inconvenience caused and are grateful for your ongoing support.