Elevated Errors on VTEX IO platform

Write-up

The incident lasted 8 hours.

The incident was caused by a weakness in our instance allocation strategy. Guaranteed resource allocation depends on diversifying instance types as much as possible. In this case, the number of instance types in use was too small to prevent the abrupt, simultaneous withdrawal of spot instances that had reached the end of their lifetime.

We also found that the registration of new replacement instances hit a bottleneck in our cluster manager.

As an immediate action, we doubled the number of spot instance types that can be allocated in a given zone, and we are considering increasing the minimum proportion of non-spot instances.
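
To illustrate why these two changes help, here is a minimal sketch of the arithmetic. It is not our actual allocation logic. It assumes spot capacity is spread evenly across instance types and that a single type is withdrawn all at once. The function name and the figures (4 and 8 types, a 25% non-spot share) are hypothetical and chosen only for illustration.

```python
def capacity_lost(num_spot_types: int, non_spot_fraction: float = 0.0) -> float:
    """Fraction of total capacity lost when one spot instance type is withdrawn.

    Assumption: spot capacity is split evenly across `num_spot_types`, and
    `non_spot_fraction` of total capacity runs on non-spot instances that
    are never withdrawn.
    """
    if num_spot_types < 1:
        raise ValueError("num_spot_types must be at least 1")
    if not 0.0 <= non_spot_fraction <= 1.0:
        raise ValueError("non_spot_fraction must be between 0 and 1")
    spot_fraction = 1.0 - non_spot_fraction
    return spot_fraction / num_spot_types


# Hypothetical figures: doubling the number of types halves the loss,
# and a non-spot baseline reduces it further.
print(capacity_lost(4))        # 0.25
print(capacity_lost(8))        # 0.125
print(capacity_lost(8, 0.25))  # 0.09375
```

Under these assumptions, doubling the number of instance types halves the capacity that disappears in a single withdrawal, and every increase in the non-spot share shrinks it further.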

We already have an action plan to improve our cluster manager so that it can register more new instances simultaneously, which will speed up recovery in these scenarios.