Write-up published
Resolved
On October 25th, our RNB (Rates and Benefits) module, which is responsible for all discounts and promotions on the VTEX platform, began to show elevated response times for all requests. The root cause was an unexpected combination of promotion scenarios.
This incident caused two periods of disruption to sales: the first between 13:07 GMT and 14:56 GMT, and the second between 15:44 GMT and 16:12 GMT.
About the Rates and Benefits Module:
This module is responsible for all pricing processing at VTEX. It handles a huge number of requests per minute and, like all our other services, it is scaled automatically when the workload changes.
About the incident:
Today, a certain combination of promotions caused all the servers to reach 100% CPU usage. Our team tripled the number of servers, but even so we were unable to process all the requests.
Meanwhile, another team was analyzing the logs and identified uncommon behavior in certain promotion scenarios. We disabled those scenarios, and the system recovered full capacity at the end of the first period.
We then made some adjustments to the system and tried to turn all the scenarios back on. This did not work as expected and caused the second period of downtime. The scenarios were turned off again and the system recovered.
The system has a time limit control to prevent hanging requests and CPU draining, but in this specific case it did not work as expected.
Three actions are being executed now and will be released tomorrow.
1 - Better timeout control during discount processing: if a discount takes too long to calculate, the system will stop processing and return a timeout.
2 - Promotion limits: we will create new limits for certain promotion scenarios.
3 - New metrics: these will be added to reduce the time needed to identify root causes like this one.
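To illustrate the first two actions, here is a minimal sketch of a time budget and a combination limit applied inside a discount calculation. It is not the RNB implementation: the `Promotion` type, the `best_discount` function, the stacking rule, and all default values are hypothetical and chosen only for the example.

```python
import time
from dataclasses import dataclass
from itertools import combinations


class DiscountTimeout(Exception):
    """Raised when the discount calculation exceeds its time budget."""


class PromotionLimitExceeded(Exception):
    """Raised when a request would evaluate too many promotion combinations."""


@dataclass(frozen=True)
class Promotion:
    # Hypothetical promotion model: a named percentage discount.
    name: str
    percent_off: float  # 10.0 means 10% off


def best_discount(price, promotions, max_combined=2, budget_seconds=0.05,
                  max_combinations=10_000, clock=time.monotonic):
    """Return the largest discount found by stacking up to max_combined promotions.

    The deadline and the combination counter are checked inside the loop, so a
    pathological promotion set stops consuming CPU instead of running on.
    """
    deadline = clock() + budget_seconds
    best = 0.0
    evaluated = 0
    for size in range(1, max_combined + 1):
        for combo in combinations(promotions, size):
            evaluated += 1
            if evaluated > max_combinations:
                raise PromotionLimitExceeded(
                    f"more than {max_combinations} combinations")
            if clock() > deadline:
                raise DiscountTimeout(
                    f"exceeded {budget_seconds}s after {evaluated} combinations")
            # Assumed stacking rule: each promotion applies to the remaining price.
            remaining = price
            for promo in combo:
                remaining *= 1 - promo.percent_off / 100
            best = max(best, price - remaining)
    return best


if __name__ == "__main__":
    promos = [Promotion("A", 10.0), Promotion("B", 20.0)]
    print(best_discount(100.0, promos))
```

Checking the deadline inside the loop matters for CPU-bound work: a timeout enforced only by the caller can return an error to the client while the calculation keeps running and keeps the CPU busy.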
We know the impact this kind of disruption has on our customers' operations, and we will work even harder to learn from this incident and improve our services.
Resolved
Between 01:44 PM and 02:12 PM UTC-3 (Brasilia), we experienced delays when using some resources of our platform. The service has recovered and is operating normally.
Monitoring
We can confirm an improvement in the elevated error rates in our platform. We are monitoring the result of our actions.
Identified
We have identified the root cause of the increased error rates. We are working to resolve this issue.
Investigating
We are investigating increased error rates in our platform.