Affected components

No components marked as affected

Updates

Write-up published

Resolved

On a few days in May, our Webstore module, responsible for page rendering and store navigation, experienced momentary instability, causing increased error rates in our system.

This issue did not interrupt all sales, but it had a noticeable impact during the periods below:

  • 23rd May - 07:42 PM to 08:06 PM, and 12:23 PM to 12:42 PM

  • 21st May - 08:03 PM to 08:30 PM

  • 19th May - 08:45 AM to 09:97 AM

  • 12th May - 04:02 PM to 04:18 PM

About the incident:

In the previous month, VTEX changed the server configuration used by this module. A few weeks after this change, the module began behaving differently, causing problems during the morning ramp-up and later during the afternoon and evening, after some servers were shut down.

During these periods, the response time of requests to the system increased considerably, causing servers to be removed from the load balancer that serves these requests.
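
As an illustration, here is a minimal Python sketch of how a load balancer health check can eject a backend whose probes become slow; all names and thresholds are hypothetical, not VTEX's actual health-check logic:

    import random

    UNHEALTHY_AFTER = 3     # consecutive failed probes before removal
    PROBE_TIMEOUT_S = 2.0   # a probe slower than this counts as a failure

    class Backend:
        def __init__(self, address):
            self.address = address
            self.failures = 0
            self.in_rotation = True

    def probe(backend):
        # Simulated probe latency; in production this would be an HTTP check.
        return random.uniform(0.1, 4.0)

    def run_health_check(backends):
        for b in backends:
            if probe(b) > PROBE_TIMEOUT_S:
                b.failures += 1
            else:
                b.failures = 0
            # A backend that is merely slow (CPU idle, requests queued up)
            # is taken out of rotation exactly as if it had crashed.
            b.in_rotation = b.failures < UNHEALTHY_AFTER

    backends = [Backend("10.0.0.1"), Backend("10.0.0.2")]
    run_health_check(backends)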

VTEX has several cache layers, so during these periods many requests were still answered correctly; however, requests that were not cached returned errors, and thus some sales were affected during the periods described.
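
The effect of those cache layers can be pictured with a small cache-aside sketch (hypothetical names, assuming a failing rendering backend): cache hits keep being served normally, and only cache misses surface the error:

    cache = {"home": "<html>...</html>"}  # pages cached before the incident

    class OriginError(Exception):
        pass

    def fetch_from_origin(page_id):
        # Stand-in for the rendering backend; assume it is failing right now.
        raise OriginError(f"backend timed out rendering {page_id}")

    def get_page(page_id):
        if page_id in cache:
            return cache[page_id]            # hit: answered correctly
        body = fetch_from_origin(page_id)    # miss: error reaches the shopper
        cache[page_id] = body
        return body

    print(get_page("home"))       # served from cache despite the incident
    print(get_page("checkout"))   # uncached -> OriginError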

Analyzing the crisis periods, our team verified that response times increased excessively while CPU usage remained unaffected, which made the scenario harder to handle, since server upscaling and downscaling are currently based on CPU usage.
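
To make that blind spot concrete, here is a minimal sketch of a CPU-driven scaling rule (the thresholds are illustrative assumptions, not our real settings); because latency is not an input, a fleet that is queuing requests while idling its CPUs looks healthy and may even be scaled in:

    def desired_replicas(current, cpu_utilization):
        # Classic CPU-based policy: scale out when hot, scale in when cold.
        if cpu_utilization > 0.70:
            return current + 1
        if cpu_utilization < 0.30:
            return max(1, current - 1)
        return current

    # During the incident: response times exploded while CPU stayed low,
    # so the rule saw a "cold" fleet and moved in the wrong direction.
    print(desired_replicas(current=8, cpu_utilization=0.18))  # -> 7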

After identifying this scenario, our team adjusted the upscaling settings to mitigate the problem. This brought an improvement, but the problem recurred during periods when the number of requests oscillated more widely.

Changes were made to the system to gather more information about its behavior in this new scenario. After collecting more data, our team realized that the new configuration had exposed a pre-existing problem in the application, one that became critical under this new scenario.

Fixing this properly would take considerable time and require changes in many places in the system, so a temporary workaround was put in place to avoid further instability while our team refactored the code.

We performed several tests during low-traffic periods. Today (05/26), a new version of the system was released at 05:47 AM, which solved the issue and improved the application's performance.

About the server configuration, in more detail:

The new configuration doubled each server's processing capacity and was expected to improve performance, but the number of parallel operations performed by the operating system did not keep up with the increased capacity.

Since upscaling was based on CPU usage, the problem occurred mainly because the servers were not using their processing power: new operations were queued instead of being executed immediately, waiting their turn to be served. This increased response times dramatically and made the situation even worse, since no new servers were spun up.
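
This failure mode can be reproduced with a few lines of Python (a simplified stand-in for the real request path, assuming I/O-bound handlers): with too few worker threads, observed response time is dominated by time spent waiting in the queue while the CPU stays idle:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def handle_request(i):
        time.sleep(0.1)   # I/O-bound work: the CPU stays nearly idle
        return "200 OK"

    pool = ThreadPoolExecutor(max_workers=4)  # sized for the old hardware

    start = time.time()
    futures = [pool.submit(handle_request, i) for i in range(40)]
    for f in futures:
        f.result()
    # 40 requests through 4 workers take ~1.0 s in total; the last
    # requests waited ~0.9 s in the queue despite the idle capacity.
    print(f"elapsed: {time.time() - start:.2f}s")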

The new version of the system forces the operating system to start more threads, so requests are no longer queued but are answered the moment they arrive. This lets the servers use their processing power correctly and, consequently, lets upscaling behave properly.
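
Re-running the same workload with the pool sized to the machine's actual parallelism (an illustrative assumption standing in for the refactored system) shows requests being picked up as they arrive instead of queuing:

    import os
    import time
    from concurrent.futures import ThreadPoolExecutor

    def handle_request(i):
        time.sleep(0.1)
        return "200 OK"

    # Enough workers for the available cores times an I/O multiplier.
    pool = ThreadPoolExecutor(max_workers=(os.cpu_count() or 1) * 10)

    start = time.time()
    futures = [pool.submit(handle_request, i) for i in range(40)]
    for f in futures:
        f.result()
    print(f"elapsed: {time.time() - start:.2f}s")  # far lower: little or no queue wait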

A large refactoring was required to make sure the system can now handle all requests in parallel: hundreds of files were changed and thousands of lines of code were improved to deliver the final solution and a better product.

We know how recurring incidents degrade the experience of using the platform, and we worked hard to solve the problem as quickly and correctly as possible. We are always monitoring performance and improving the system to deliver the best e-commerce solution.

Mon, Jul 30, 2018, 07:29 PM

Resolved

Between 12:23 PM and 12:42 PM UTC-3 (Brasilia), we experienced elevated error rates in our platform. We will work to avoid these issues in the future. The service is now operating normally.

Tue, May 23, 2017, 04:00 PM

Monitoring

We can confirm an improvement in the elevated error rates in our platform. We are monitoring the result of our actions.

Tue, May 23, 2017, 03:46 PM

Identified

We are continuing to work towards full resolution of this issue and on recovering the service.

Tue, May 23, 2017, 03:39 PM

Identified

We have identified the root cause of the increased error rates. We are working to resolve this issue.

Tue, May 23, 2017, 03:34 PM

Investigating

We are investigating increased error rates in our platform.

Tue, May 23, 2017, 03:30 PM