Write-up published
Resolved
On the morning of April 25th, our PCI Gateway module, which is responsible for all payment transactions on the VTEX Platform, began to show elevated error rates for all calls. The root cause was the expiration of an SSL certificate used by all the internal routes.
This incident disrupted sales between 09:02 AM BRT and 11:05 AM BRT.
About the SSL Certificate:
VTEX is responsible for managing, buying and renewing all SSL certificates, not only for all the stores but also for all the internal services of our platform. We renew these certificates periodically, relying on a monitoring system that tracks the expiration date of each installed certificate. Last year we changed our certificate supplier. All the stores and all VTEX services were updated, including more than a thousand certificates, without any disruption.
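As an illustration of what tracking expiration dates involves, here is a minimal Python sketch. It is not VTEX's monitoring system: the function names, the 30-day threshold and the example dates are assumptions made for this example. It reads a certificate's `notAfter` field and flags the certificate when it is close to expiring.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' string of the certificate served by host:port.

    Note: with the default context, a certificate that has ALREADY expired
    fails the handshake with ssl.SSLCertVerificationError, so a caller
    should treat that exception as "expired".
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's notAfter date."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, threshold_days: int = 30) -> bool:
    """True when the certificate expires within `threshold_days` (assumed value)."""
    return days_until_expiry(not_after, now) <= threshold_days


if __name__ == "__main__":
    # Hypothetical dates, chosen only to exercise the functions.
    now = datetime(2030, 4, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(needs_renewal("Apr 25 12:00:00 2030 GMT", now))
```

A check like this only protects the certificates it is pointed at, so the list of hosts it scans matters as much as the check itself.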
About PCI Compliance:
The PCI Gateway module follows PCI compliance requirements, which include a specific access protocol and rules to guarantee information security.
About the incident:
On April 25th at 9:00 AM (BRT), our monitoring system detected an elevated number of errors in our PCI Gateway module. We immediately identified that the SSL certificate of this service had expired; the system that monitors SSL certificate expiration dates had not detected the expiration date for this specific module. Although we identified the problem quickly, the fix was difficult because of the way authentication into this service's environment works and because the SSL certificate had to be installed following PCI rules. The following actions were executed:
1- Renewed the SSL certificate and reinstalled it everywhere it is required;
2- Implemented a faster procedure to make this maintenance easier.
On April 25th at 10:39 AM (BRT) we finished those actions, but it took additional time to propagate the fix to all areas. On April 25th at 11:05 AM (BRT) the system was completely recovered.
We know the impact that this kind of disruption has on our customers' operations, and we will work hard to learn from this incident. We will work on a solution to renew all SSL certificates automatically and improve how we monitor the expiration dates of all certificates.
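One way to catch a certificate that the expiration monitor does not track is to compare the list of deployed endpoints against the list the monitor actually scans. The Python sketch below illustrates that idea only. The function name and the hostnames are hypothetical, and it does not describe the solution VTEX implemented.

```python
def unmonitored_endpoints(deployed, monitored):
    """Return deployed endpoints that the certificate monitor does not scan."""
    return sorted(set(deployed) - set(monitored))


if __name__ == "__main__":
    # Hypothetical inventories, for illustration only.
    deployed = ["store-a.example", "store-b.example", "pci-gateway.internal.example"]
    monitored = ["store-a.example", "store-b.example"]
    for host in unmonitored_endpoints(deployed, monitored):
        print(f"No certificate expiry monitoring for: {host}")
```

Running such an audit on a schedule, and alerting when the result is not empty, would surface a gap before the certificate expires instead of after.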
Resolved
Between 09:00 AM and 11:00 AM UTC-3 (Brasilia), we experienced elevated error rates in our platform. We will work to avoid such issues in the future. The service is now operating normally.
Monitoring
We can confirm an improvement in the elevated error rates in our platform. We are monitoring the result of our actions.
Investigating
We are continuing to investigate increased error rates in some parts of our platform.
Monitoring
We can confirm an improvement in the elevated error rates in our platform. We are monitoring the result of our actions.
Identified
We are still working on recovering from the increased error rates. We continue to work towards full recovery.
Identified
We are now recovering from the increased error rates. We continue to work towards full recovery.
Identified
We are continuing to work towards full resolution of this issue. We continue to work on recovery.
Identified
We have identified the root cause of the increased error rates. We are working to resolve this issue.
Investigating
We are investigating increased error rates in our platform.