Elevated Errors

Resolved

Lasted for 9 months

We’ve published a write-up of this incident.

Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

On October 25th, our RNB (Rates and Benefits) module, which handles all discounts and promotions on the VTEX platform, began to show elevated response times for all requests. The root cause was an unexpected combination of promotion scenarios.

This incident caused two periods of sales disruption: the first between 13:07 GMT and 14:56 GMT, and the second between 15:44 GMT and 16:12 GMT.

About the Rates and Benefits Module:

This module is responsible for all pricing processing at VTEX. It handles a huge number of requests per minute and, like all our other services, it is scaled automatically when the workload changes.

About the incident:

Today a certain combination of promotions caused all the servers to reach 100% CPU usage. Our team tripled the number of servers, but even so we were unable to process all the requests.

During this time another team was analyzing the logs and identified uncommon behavior in certain promotion scenarios. We disabled those scenarios and the system recovered its full capabilities at the end of the first period.

We made some adjustments to the system and tried to turn all the scenarios back on. This didn't work as expected and caused the second period of downtime. The scenarios were turned off and the system recovered again.

The system has a time limit control to prevent hanging requests and CPU draining, but in this specific case it didn't work as expected.

Three actions are in progress now and will be released tomorrow:

1 - A better timeout control during discount processing: if a discount takes too long to calculate, the system will stop processing and return a timeout.

2 - Promotion limits: we will create new limits for certain promotion scenarios.

3 - New metrics will be added to reduce the time needed to identify root causes like this one.
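As an illustration of actions 1 and 2 only, the sketch below shows one common way to bound discount processing: a per-request time budget checked between promotions, plus a cap on how many promotions a single request may evaluate. It is not VTEX's implementation. The names `calculate_price`, `Promotion`, `DiscountTimeout`, and `MAX_PROMOTIONS_PER_REQUEST`, and the numeric values, are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical cap on promotions evaluated per request (action 2).
MAX_PROMOTIONS_PER_REQUEST = 50


class DiscountTimeout(Exception):
    """Raised when discount calculation exceeds its time budget."""


@dataclass
class Promotion:
    name: str
    percent_off: float  # e.g. 10.0 means 10% off


def calculate_price(
    base_price: float,
    promotions: Sequence[Promotion],
    budget_seconds: float = 0.2,  # assumed budget, for illustration only
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Apply promotions in order, giving up once the time budget is spent."""
    # Reject oversized promotion combinations before doing any work.
    if len(promotions) > MAX_PROMOTIONS_PER_REQUEST:
        raise ValueError("too many promotions for a single request")

    deadline = clock() + budget_seconds
    price = base_price
    for promo in promotions:
        # Cooperative check (action 1): stop instead of burning CPU.
        if clock() > deadline:
            raise DiscountTimeout(f"budget exceeded before applying {promo.name}")
        price *= 1 - promo.percent_off / 100
    return price
```

The deadline is checked inside the loop rather than only around the whole call, so a slow combination of promotions is abandoned partway through and the request fails fast with a timeout.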

We know the impact this kind of disruption has on our customers' operations, and we will work even harder to learn from this incident and improve our services.

Mon, Jul 30, 2018, 07:29 PM

9 months earlier...

Resolved

Between 01:44 PM and 02:12 PM UTC-3 (Brasilia), we experienced delays when using some resources of our platform. The service has recovered and is operating normally.

Wed, Oct 25, 2017, 04:25 PM

10m earlier...

Monitoring

We can confirm an improvement in the elevated error rates in our platform. We are monitoring the result of our actions.

Wed, Oct 25, 2017, 04:15 PM

14m earlier...

Identified

We have identified the root cause of the increased error rates. We are working to resolve this issue.

Wed, Oct 25, 2017, 04:00 PM

11m earlier...

Investigating

We are investigating increased error rates in our platform.

Wed, Oct 25, 2017, 03:49 PM