No components marked as affected
Write-up published
Resolved
We had a problem with our instance allocation strategy. Guaranteeing resource allocation depends on diversifying instance types as much as possible. In this incident, the number of instance types in use proved too small to prevent the abrupt, simultaneous withdrawal of spot instances that had reached the end of their lifetime.
We also found that registering the new replacement instances hit a bottleneck in our cluster manager.
As an immediate action, we doubled the number of spot instance types that can be allocated in a given zone, and we are considering increasing the minimum proportion of non-spot instances.
We already have an action plan to increase the capacity of our cluster manager so that it can register more new instances simultaneously, which will speed up recovery in scenarios like this one.
Resolved
Between 9:00 AM and 9:20 AM UTC-3 (BRT), we experienced elevated error rates on our VTEX IO platform. The most affected environments were administrative and development ones. We are working to prevent this issue from recurring. The service is now operating normally.
Monitoring
A fix has been implemented and we are monitoring the results.
Identified
We have been recovering from the increased error rates since 9:14 AM UTC-3. We continue to work towards full recovery.
Investigating
We are investigating increased error rates in our VTEX IO platform.