Elevated Errors on VTEX IO platform

Resolved

Lasted for 8h

We’ve published a write-up of this incident.

Read it here

Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

We had a problem with our instance allocation strategy. Guaranteed resource allocation depends on diversifying instance types as much as possible. In this incident, the number of instance types in use turned out to be too low to prevent the abrupt, simultaneous withdrawal of spot instances that had reached the end of their lifetime.

We also noticed that the registration of new replacement instances hit a bottleneck in our cluster manager.

As an immediate action, we doubled the number of spot instance types that can be allocated in a given zone, and we are considering increasing the minimum proportion of non-spot instances.
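To illustrate why both measures help, here is a minimal sketch of the arithmetic. It assumes spot capacity is spread evenly across instance types and that a fixed number of types is reclaimed at once. The function name and all figures are hypothetical and are not taken from our actual fleet.

```python
def surviving_capacity(spot_types: int, types_reclaimed: int,
                       non_spot_fraction: float) -> float:
    """Fraction of total capacity left after some spot types are reclaimed at once.

    Assumption (illustrative only): spot capacity is split evenly across
    `spot_types`, and non-spot instances are never reclaimed.
    """
    if spot_types <= 0:
        raise ValueError("spot_types must be positive")
    if not 0 <= types_reclaimed <= spot_types:
        raise ValueError("types_reclaimed must be between 0 and spot_types")
    if not 0.0 <= non_spot_fraction <= 1.0:
        raise ValueError("non_spot_fraction must be between 0 and 1")

    spot_fraction = 1.0 - non_spot_fraction
    lost = spot_fraction * types_reclaimed / spot_types
    return 1.0 - lost


# Hypothetical numbers: 2 spot types reclaimed simultaneously.
few_types = surviving_capacity(spot_types=4, types_reclaimed=2, non_spot_fraction=0.1)
doubled_types = surviving_capacity(spot_types=8, types_reclaimed=2, non_spot_fraction=0.1)
more_non_spot = surviving_capacity(spot_types=8, types_reclaimed=2, non_spot_fraction=0.3)
```

Under these assumptions, doubling the number of spot types halves the capacity lost in the same reclamation event, and a larger non-spot share raises the floor further.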

We already have an action plan to improve the capacity of our cluster manager so that it can register more new instances simultaneously, which will accelerate recovery in these scenarios.
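As a rough sketch of why registration concurrency matters for recovery time, the following assumes a simplified model in which the cluster manager registers instances in fixed-size batches, each taking the same time. The model, the function name, and the numbers are assumptions for illustration and do not describe the real cluster manager.

```python
import math


def registration_time_seconds(instances: int, max_concurrent: int,
                              seconds_per_registration: float) -> float:
    """Estimate the time to register `instances` replacement instances.

    Assumption (illustrative only): registrations run in batches of at most
    `max_concurrent`, and every batch takes `seconds_per_registration`.
    """
    if instances < 0:
        raise ValueError("instances must be non-negative")
    if max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive")

    batches = math.ceil(instances / max_concurrent)
    return batches * seconds_per_registration


# Hypothetical numbers: 200 replacement instances, 30 s per registration.
current = registration_time_seconds(200, max_concurrent=10, seconds_per_registration=30)
expanded = registration_time_seconds(200, max_concurrent=40, seconds_per_registration=30)
```

In this model, quadrupling the number of simultaneous registrations cuts the estimated registration time to a quarter for the same number of replacement instances.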

Fri, Sep 18, 2020, 08:57 PM

8h earlier...

Resolved

Between 9:00 AM and 9:20 AM UTC-3 (BRT), we experienced elevated error rates in our VTEX IO platform. The affected environments were mostly administrative and development ones. We are working to prevent this issue from recurring. The service is now operating normally.

Fri, Sep 18, 2020, 12:39 PM

Monitoring

A fix has been implemented and we are monitoring the results.

Fri, Sep 18, 2020, 12:31 PM

Identified

We are recovering from the increased error rates observed since 9:14 AM UTC-3. We continue to work towards full recovery.

Fri, Sep 18, 2020, 12:26 PM

Investigating

We are investigating increased error rates in our VTEX IO platform.

Fri, Sep 18, 2020, 12:19 PM