Between 16:56 and 17:12 UTC, shoppers and merchants experienced a major platform outage caused by CPU saturation in the License Manager module.
At 16:56 UTC, CPU usage in the License Manager module began rising abnormally, exceeding its normal levels. At 16:58 UTC, our alarms were triggered and the platform experienced a major outage, affecting both the Sales flow and the Administrative environment.
At 17:03 UTC, we increased the minimum number of nodes for the License Manager module to relieve the strain on the system and restore stability. By 17:07 UTC, the License Manager's CPU usage had returned to normal levels.
The platform's metrics began returning to nominal values at 17:10 UTC. By 17:12 UTC, the platform had fully recovered, and we continued monitoring the License Manager and its dependencies to prevent further incidents.
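
For readers who want the intervals rather than the raw timestamps, the following is a minimal sketch that derives them from the times reported above. The event labels, the metric names, and the choice of 16:56 UTC as the starting point are assumptions made for illustration; they are not part of our incident tooling.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
# The keys are illustrative labels, not official event names.
TIMELINE = {
    "cpu_increase": "16:56",
    "alarms_triggered": "16:58",
    "min_nodes_increased": "17:03",
    "cpu_normal": "17:07",
    "recovery_started": "17:10",
    "fully_recovered": "17:12",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (same day, HH:MM format)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(timeline: dict) -> dict:
    """Compute intervals measured from the first abnormal CPU reading."""
    start = timeline["cpu_increase"]
    return {
        "minutes_to_alarm": minutes_between(start, timeline["alarms_triggered"]),
        "minutes_to_scale_up": minutes_between(start, timeline["min_nodes_increased"]),
        "minutes_to_cpu_normal": minutes_between(start, timeline["cpu_normal"]),
        "minutes_to_full_recovery": minutes_between(start, timeline["fully_recovered"]),
    }


if __name__ == "__main__":
    for name, value in summarize(TIMELINE).items():
        print(f"{name}: {value}")
```

Under these assumptions, the alarms fired 2 minutes after the CPU increase, the node increase was applied at 7 minutes, CPU usage normalized at 11 minutes, and full recovery came at 16 minutes.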