Affected components

No components marked as affected

Updates

Write-up published

Resolved

On the afternoon of May 4th, our CDN module, which distributes all static assets on the VTEX Platform, stopped responding to HTTPS requests. This incident disrupted sales between 03:54 PM BRT and 04:42 PM BRT.

About the incident:

The team responsible for this module retired one of the HTTPS protocols used by VTEX in order to increase the security of our applications, but our cache tier for static files was still using the retired protocol to deliver those files.
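
A pre-change check for this kind of failure can be sketched as follows. This is a minimal illustration, not the tooling we use: it assumes the retired "HTTPS protocol" is a TLS protocol version, and the client names, protocol labels, and the function `clients_broken_by_retirement` are all hypothetical.

```python
def clients_broken_by_retirement(client_protocols, server_protocols, retired):
    """Return names of clients left with no protocol in common with the server.

    client_protocols: dict mapping a client name to the protocols it can speak.
    server_protocols: protocols the server accepts before the change.
    retired: the protocol about to be removed from the server.
    """
    remaining = set(server_protocols) - {retired}
    return sorted(
        name
        for name, protocols in client_protocols.items()
        if not (set(protocols) & remaining)
    )


# Hypothetical inventory: the cache tier only speaks the protocol being retired.
clients = {
    "browser": {"TLSv1.0", "TLSv1.2"},
    "cache-tier": {"TLSv1.0"},
}
broken = clients_broken_by_retirement(clients, {"TLSv1.0", "TLSv1.2"}, "TLSv1.0")
print(broken)  # ['cache-tier']
```

Running a check like this against an inventory of internal consumers before the change would flag the cache tier as a blocker.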

Our monitoring system detected the problem and the team immediately attempted a rollback, but the rollback process failed.

About the CDN Module Rollback Process:

During the disruption, dependencies in the rollback process prevented us from executing it. The rollback needs information generated by another module, but that module in turn depends on the CDN Module, so while the CDN Module was unavailable it could not generate that information. Our team generated the necessary information and executed the rollback manually, but this took much longer than expected.
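
The circular dependency described above can be detected ahead of time with a simple graph check. The sketch below is illustrative only: the module names and the `find_cycle` helper are hypothetical, and the dependency map is an assumed simplification of the real rollback process.

```python
def find_cycle(depends_on, start):
    """Return a dependency path from `start` back to itself, or None if there is none."""
    path = []
    visited = set()

    def visit(node):
        path.append(node)
        for dep in depends_on.get(node, ()):
            if dep == start:
                # The chain of dependencies leads back to where we began.
                return path + [start]
            if dep not in visited:
                visited.add(dep)
                found = visit(dep)
                if found:
                    return found
        path.pop()
        return None

    return visit(start)


# Hypothetical model: recovering the CDN module needs its rollback, the rollback
# needs data from another module, and that module needs the CDN module.
depends_on = {
    "cdn-module": ["cdn-rollback"],
    "cdn-rollback": ["info-module"],
    "info-module": ["cdn-module"],
}
cycle = find_cycle(depends_on, "cdn-module")
print(cycle)  # ['cdn-module', 'cdn-rollback', 'info-module', 'cdn-module']
```

A non-empty result means the rollback cannot run while the module it is meant to recover is down, which is the situation we hit.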

The CDN Module team will work to improve the rollback process, find other ways to generate all the information needed to execute a rollback properly, and rethink how we carry out important changes like this one in the future.

We know the impact this kind of disruption has on our customers' operations, and we will work hard to learn from this incident.

Mon, Jul 30, 2018, 07:29 PM

Resolved

Between 03:54 PM and 04:42 PM UTC-3 (Brasilia), we experienced elevated error rates on our platform. We will work to avoid these issues in the future. The service is now operating normally.

Thu, May 4, 2017, 08:11 PM

Monitoring

We can confirm an improvement in the elevated error rates in our platform. We are monitoring the result of our actions.

Thu, May 4, 2017, 07:55 PM

Identified

We are now recovering from the increased error rates. We continue to work towards full recovery.

Thu, May 4, 2017, 07:34 PM

Identified

We are continuing to work towards full recovery and resolution of this issue.

Thu, May 4, 2017, 07:21 PM

Identified

We have identified the root cause of the increased error rates. We are working to resolve this issue.

Thu, May 4, 2017, 07:16 PM

Investigating

We are continuing to investigate increased error rates in some parts of our platform.

Thu, May 4, 2017, 07:16 PM

Investigating

We are investigating increased error rates in our platform.

Thu, May 4, 2017, 07:02 PM