Updates

Resolved

VTEX Cloud Commerce Platform Major Outage Event: May 27, 2019

We wanted to provide some additional details about the problem we experienced on Monday, May 27th.

From 6:30 am ET (7:30 am GMT-3) until 10:20 am ET (11:20 am GMT-3), the VTEX platform experienced a major outage during which our customers' ability to place new orders was impaired, as detailed below.

The incident was platform-wide and affected our entire client base. We apologize for it, and we feel we owe it to our clients to be fully transparent about what happened and to bring clarity to our action plans for preventing this type of incident from recurring.

Before going into the details of the incident, we would like to take a moment to explain the architecture and disaster recovery plan used by our engineering team, to give you the context needed for a better understanding of the events surrounding the incident.

Solution architecture and engineering methodology

VTEX is made up of over 70 microservices working together to offer a scalable platform. To cope with inherent demand variation and traffic spikes, servers are automatically added to or removed from the infrastructure in a process called “Auto Scaling”.

Auto Scaling:

The auto scaling process runs several times a day, adding new servers as demand increases and removing them as it decreases. To keep platform management as efficient as possible, we use various technologies and providers for this process.
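
As a rough illustration of the idea, the sketch below shows the kind of demand-driven scaling decision this process makes. The capacity figure, headroom factor, and the scale_out/scale_in helpers are hypothetical and do not reflect VTEX's actual implementation.

    # Hypothetical sketch of a demand-driven scaling decision; values and helpers
    # are illustrative only, not VTEX's actual auto scaling implementation.
    def desired_server_count(requests_per_second, capacity_per_server=500, headroom=1.3):
        # Provision enough servers for current traffic plus ~30% headroom,
        # never dropping below a minimum fleet of 2 servers.
        needed = int(requests_per_second * headroom / capacity_per_server) + 1
        return max(needed, 2)

    def scale_out(count):
        print(f"launching {count} server(s)")    # placeholder for a provider API call

    def scale_in(count):
        print(f"terminating {count} server(s)")  # placeholder for a provider API call

    def reconcile(current_servers, requests_per_second):
        desired = desired_server_count(requests_per_second)
        if desired > current_servers:
            scale_out(desired - current_servers)
        elif desired < current_servers:
            scale_in(current_servers - desired)

    reconcile(current_servers=4, requests_per_second=3000)  # would launch 4 more servers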

Disaster Recovery Plan:

The plan consists of the policies and procedures that VTEX follows in the event of a major outage, whether caused by a natural disaster, a technological failure, or human factors. Its goal is to restore the affected business processes as quickly as possible, either by bringing the disrupted services back online or by switching to a contingency system.

The Disaster Recovery Plan includes:

  • all services that constitute the solution;

  • all business processes that support it or its operation;

  • all business processes that support our customers who depend on our platform.

Like our development lifecycle, the Disaster Recovery Plan is fully supported and implemented by automated processes, triggered by automated monitoring and alarms.
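
To give a sense of what “triggered by automated monitoring and alarms” means in practice, here is a minimal, hypothetical sketch of an alarm loop that invokes a recovery action. The threshold, metric source, and failover_to_contingency routine are assumptions for illustration, not our actual tooling.

    # Hypothetical sketch: an alarm that triggers an automated recovery action.
    import time

    ERROR_RATE_THRESHOLD = 0.05  # assumed alarm threshold: 5% of requests failing

    def current_error_rate(service):
        # Placeholder for a real metrics query (failed requests / total requests).
        return 0.0

    def failover_to_contingency(service):
        # Placeholder for the automated runbook that switches the service
        # to its contingency system.
        print(f"failing {service} over to contingency infrastructure")

    def watch(service, interval_seconds=60):
        # Evaluate the alarm periodically; when it fires, run the recovery
        # procedure instead of waiting for manual intervention.
        while True:
            if current_error_rate(service) > ERROR_RATE_THRESHOLD:
                failover_to_contingency(service)
            time.sleep(interval_seconds)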

About the event

On Monday morning, May 27th, 2019, at 6:30 am ET (7:30 am GMT-3), our alarm system indicated that a majority of our services were not scaling with the increase in demand.

Until 8:50 am ET (9:50 am GMT-3), we were under the impression that our main infrastructure provider was experiencing problems, given that new servers were being started but the necessary software was not being properly installed on them. During this period, our engineers were working with the infrastructure support team to identify the root cause of the problem.
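
For context on the symptom described above: servers that start but never finish installing the required software look healthy to the provider while adding no real capacity. The snippet below is a hypothetical illustration of a check that separates “launched” from “ready”, not the tooling we actually use.

    # Hypothetical check: distinguish "server launched" from "server ready to serve".
    from dataclasses import dataclass

    @dataclass
    class Server:
        instance_id: str
        launched: bool            # the provider reports the machine as running
        software_installed: bool  # our bootstrap finished installing the service

    def incomplete_bootstraps(servers):
        # Servers that exist but never completed bootstrap make the fleet look
        # scaled while serving no traffic.
        return [s.instance_id for s in servers if s.launched and not s.software_installed]

    fleet = [Server("i-001", True, True), Server("i-002", True, False)]
    print(incomplete_bootstraps(fleet))  # prints ['i-002']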

What hampered our search for a solution was the fact that the problem occurred in only one of our accounts. In addition, in such isolated, non-generalised cases, it is not easy to maintain clear and productive communication between the affected parties.

As we started to work with other hypotheses, and given that the auto scaling service had been updated during the weekend, other teams had in the meantime begun working on different scenarios.

At 8:54 am ET (9:54 am GMT-3), another team of engineers, pursuing these other hypotheses for the root cause of the problem, managed to find an issue with the auto scaling service updates from one of our infrastructure providers.

Our auto scaling was failing, and given the sophistication of its services, the main challenge was initially to isolate the failing component.

At 8:57 am ET (9:57 am GMT-3), we initiated the failover process and started to replace the unhealthy auto scaling infrastructure, which had been updated during the weekend and was identified as the root cause of the problem.

From 8:57 am ET (9:57 am GMT-3) until 10:20 am ET (11:20 am GMT-3), we worked on removing the dependencies on the specific unhealthy auto scaling services. At 10:01 am ET (11:01 am GMT-3) we started to notice an improvement across our entire infrastructure, which then fully recovered at 10:20 am ET (11:20 am GMT-3).

The auto scaling service provider was able to identify its update failure only at 1:00 pm ET (2:00 pm GMT-3) and rolled back the update thereafter. We are still working with the service provider to identify the incident's specific root cause, and we will provide a follow-up as more information becomes available.

Next steps

Although operational failures at our providers naturally cause disruptions to our services, our engineering team's round-the-clock vigilance is there to keep such disruptions as transparent as possible to our clients, minimizing both their impact and their duration.

In this unprecedented event, we failed to identify the defective component quickly, and we need a faster contingency plan for cases where one of our mission-critical third-party providers fails. This is in addition to the obvious need for continuous, close collaboration with all our providers in order to keep increasing the resilience of our solutions.

Short-term action plan

We will review our processes to improve our capability to diagnose unprecedented scenarios.

We will start working on multiple hypotheses much sooner when an issue seems to be related to one of our infrastructure providers but there is not 100% certainty.

Long-term action plan

Although it won't have an impact in the short to medium term, this unprecedented event prompted a change in our roadmap: we will accelerate the development of our infrastructure's multi-region capability.

Finally, we want you to know that we are passionate about providing the best possible solution so that you can spend more time focusing on your business rather than having to engage in building scalable, reliable digital commerce infrastructure.

Although we're constantly improving our operational processes, we know that any major outage is unacceptable, and we will not be satisfied until our service continuity reaches a level we are comfortable with.

Tue, May 28, 2019, 01:24 PM

Resolved

Between 8:07 AM and 11:20 AM UTC-3 (BRT), we experienced a major outage in our platform. We are working to avoid this issue in the future. The service is now operating normally.

A post-mortem will be available soon.

Mon, May 27, 2019, 04:13 PM

Monitoring

We can confirm an improvement in the error rates. Users are once again able to place orders.

We are still monitoring all the services and working towards full resolution.

Mon, May 27, 2019, 03:37 PM

Identified

We are now recovering from the increased error rates. We continue to work towards full recovery.

Mon, May 27, 2019, 02:21 PM

Identified

Dear Customers,

We are suffering a major outage that is affecting all of our customers. It started at 8:07 AM UTC-03:00 and is still ongoing; this may be the largest outage event we have ever faced.

Impact may vary, but most of our customers are facing issues during navigation and checkout.

The ability to place orders is impaired.

This incident does not affect your data; all of your data remains safely stored in our secure storage.

Our engineering team is working tirelessly towards full resolution.

The issues affecting the ability to sell have already been traced to an outage of the auto scaling services at our infrastructure provider.

We assure you that we are doing everything that can be done. More updates will follow as events unfold.

Mon, May 27, 2019, 01:14 PM

Identified

We are continuing to work on a fix for this issue.

Mon, May 27, 2019, 12:58 PM

Identified

We are continuing to work towards full resolution of this issue and on recovery.

Mon, May 27, 2019, 12:58 PM

Identified

We are continuing to work towards full resolution of this issue and on recovery.

Mon, May 27, 2019, 12:14 PM

Identified

We are continuing to work towards full resolution of this issue and on recovery.

Mon, May 27, 2019, 11:47 AM

Investigating

We are investigating increased error rates in our platform.

Mon, May 27, 2019, 11:28 AM