Between 16:56 and 17:12 UTC, shoppers and merchants experienced a major platform outage caused by CPU saturation in the License Manager module.
At 16:56 UTC, CPU usage in the License Manager module began rising abnormally and exceeded normal levels. At 16:58 UTC, our alarms were triggered and the platform experienced a major outage affecting both the Sales flow and the Administrative environment.
At 17:03 UTC, we increased the minimum number of nodes for the License Manager module to relieve the load on the system and restore stability. By 17:07 UTC, License Manager CPU usage had returned to normal levels.
Platform metrics began returning to nominal levels at 17:10 UTC, and by 17:12 UTC the platform had fully recovered. We are keeping the License Manager and its dependencies under monitoring to prevent further incidents.
Posted Jul 11, 2023 - 18:18 UTC
Update
We are continuing to monitor our platform for any further issues. User experience on the platform is back to normal.
Posted Jul 11, 2023 - 17:37 UTC
Monitoring
The VTEX Sales flow and Administrative environment are back to normal.
We experienced a major degradation in our License Manager module that impacted the Sales flow and Administrative environment. Our team quickly identified and fixed the issue. We are now monitoring the environment.
Posted Jul 11, 2023 - 17:28 UTC
Identified
We are now recovering from the increased error rates. We continue to work towards full recovery.
Posted Jul 11, 2023 - 17:18 UTC
Investigating
We are investigating increased error rates in our platform.
Posted Jul 11, 2023 - 17:09 UTC
This incident affected: Checkout and Administrative Environment.