On the afternoon of April 4th, our CDN module, responsible for distributing all the static assets at VTEX Platform began to not respond to HTTPS requests. This incident caused the disruption on sales during the period between 03:54 PM BRT and 04:42 PM BRT.
About the incident:
The team responsible for this module changed one HTTPS protocol used by VTEX, aiming to increase the security of our applications but our cache tier for static files was using this retired protocol to deliver these files.
Our monitoring system detected the problem and the team tried to execute the rollback operation immediately but found failures in the rollback process.
About the CDN Module Rollback Process:
During the disruption, the rollback process had dependencies that disallow us to execute the rollback. Basically, it needs to generate information from another module to execute the rollback process but this module had dependencies from the CDN Module which was unable to generate the information for the rollback. Our team generated the necessary information and executed the rollback manually but we got much more time than expected.
The CDN Module team will work to improve the rollback process and will find other ways to generate all necessary information to execute the rollback properly and rethink about important changes like that in the future.
We know the impact of this kind of disruption in our customer's operations and we will work hard to learn with this incident.