Affected components

No components marked as affected

Updates

Write-up published


Resolved

Between 12:05 PM and 12:26 PM UTC-3 (Brasilia) today (Oct 24, 2018), we experienced a major outage in the VTEX IO platform, which serves all administrative environments and a few stores.

The incident started at 11:58 AM UTC-3, when we deployed a few bug fixes to a core infrastructure service in VTEX IO. Every request that happens in VTEX IO goes through this service, which is responsible for routing and authorization.

The new version was tested successfully in our staging environment, but it only started to show problems under high load. The root cause was an incompatibility between the old and new versions of the service in the format used to cache data crucial to its core responsibilities.

The inability to read the cached data forced the service to recompute it and save it again in the new format, which increased the number of requests and the use of resources. This, in turn, drove latency up until timeouts started to happen. By the time pages stopped loading and the alarms fired, the version update was almost complete.

Once the new format was being written to the cache, a rollback would only have caused more problems, since the previous version could not read it. The best course of action, then, was to get the system back to a healthy state by manually scaling up new deployments of our rendering and data-resolving servers and migrating store traffic to them. This relieved the high request load, and the stores on IO started behaving properly again.

To reduce the risk of issues like this happening again in the future, we plan to:

  • Fix the root cause of the cached data incompatibility. We already know what caused it and will change it so that no more bugs arise from it;

  • Use the “stale if error” caching behavior for store pages and queries so the impact on the end user is reduced (see the sketch after this list);

  • Better tune the circuit-breaker subsystem in that service, so that the whole platform can gradually and automatically heal itself when an outage happens in specific internal services;

  • Improve monitoring in this core infrastructure service to identify any problems earlier and cancel the version update in time;

  • Automate many of the steps taken to recover the system so it can be brought back much more quickly.
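
To illustrate the “stale if error” idea mentioned above, here is a minimal sketch in Go. It is not the router's actual code: the cache structure, field names, and time windows are assumptions. The point is only that, when the origin fails, a slightly stale copy is served instead of an error.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // entry is a hypothetical cached page together with the time it was stored.
    type entry struct {
        body     string
        storedAt time.Time
    }

    // staleIfErrorCache serves fresh content when possible, but falls back to a
    // stale entry when the origin fails, instead of surfacing the error.
    type staleIfErrorCache struct {
        data         map[string]entry
        ttl          time.Duration // how long an entry counts as fresh
        staleIfError time.Duration // how long a stale entry may still be served on errors
    }

    func (c *staleIfErrorCache) Get(key string, origin func() (string, error)) (string, error) {
        e, ok := c.data[key]
        if ok && time.Since(e.storedAt) < c.ttl {
            return e.body, nil // fresh hit
        }
        body, err := origin()
        if err == nil {
            c.data[key] = entry{body: body, storedAt: time.Now()}
            return body, nil
        }
        // Origin failed: serve the stale copy if it is still inside the stale-if-error window.
        if ok && time.Since(e.storedAt) < c.ttl+c.staleIfError {
            return e.body, nil
        }
        return "", err
    }

    func main() {
        c := &staleIfErrorCache{data: map[string]entry{}, ttl: 0, staleIfError: time.Hour}
        c.Get("/", func() (string, error) { return "<html>store home</html>", nil })
        // The entry is already stale and the origin now fails, so the stale body is served.
        body, err := c.Get("/", func() (string, error) { return "", errors.New("origin timeout") })
        fmt.Println(body, err)
    }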

A few more technical details

The core service that caused the issue is the internal request router of the IO infrastructure. It is responsible for handling every HTTP request, identifying its destination account, workspace, and backend service, authenticating and authorizing the call, and providing a centralized HTTP cache for IO services.
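
As an illustration of the account/workspace identification step, the sketch below (in Go) parses a host in the “{workspace}--{account}.myvtex.com” form. This is a simplified assumption for illustration; the actual router also handles custom store domains, default workspaces, and more.

    package main

    import (
        "fmt"
        "strings"
    )

    // parseHost extracts the account and workspace from a request host,
    // assuming the "{workspace}--{account}.myvtex.com" addressing convention.
    func parseHost(host string) (account, workspace string) {
        sub := strings.TrimSuffix(host, ".myvtex.com")
        if ws, acc, found := strings.Cut(sub, "--"); found {
            return acc, ws
        }
        return sub, "master" // no explicit workspace: assume the default one
    }

    func main() {
        account, workspace := parseHost("feature-x--storename.myvtex.com")
        fmt.Println(account, workspace) // storename feature-x
    }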

The step of identifying the destination service for a request also involves fetching and processing what we call the “route map” of a workspace. This route map contains information about which specific app versions are installed in a workspace and directives on how each of those app services should be deployed in our cluster.
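
To make this more concrete, a route map can be pictured roughly as the structure below. The field names and shape are illustrative only, not the actual VTEX IO schema.

    package routemap

    // RouteMap is an illustrative shape for a workspace route map: which app
    // versions are installed and how requests should reach each app's service.
    type RouteMap struct {
        Account   string            `json:"account"`
        Workspace string            `json:"workspace"`
        Apps      map[string]string `json:"apps"`   // app name -> installed version, e.g. "vtex.store": "2.x"
        Routes    []Route           `json:"routes"` // path prefixes mapped to backend services
    }

    // Route describes where a matching request should be sent in the cluster.
    type Route struct {
        PathPrefix string `json:"pathPrefix"` // e.g. "/_v/segment"
        Service    string `json:"service"`    // backend service that handles the request
        Replicas   int    `json:"replicas"`   // an example of a deployment directive
        Public     bool   `json:"public"`     // whether the route skips authorization
    }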

Apart from the HTTP cache provided to services, there are several internal layers of cache. This internal caching is done both in local memory and in a distributed Redis service, so that cached resources can be shared across different instances of the router. Writing to an external caching service requires that any data be serialized to a binary format and then deserialized on reading.
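
This layered caching can be sketched as below, in Go. The types and names are assumptions made for illustration: a local in-memory map in front of a distributed store, with values serialized on write and deserialized on read.

    package cache

    import (
        "context"
        "encoding/json"
        "sync"
        "time"
    )

    // DistributedStore abstracts the shared cache (Redis, in the router's case).
    // Values cross the network, so they are stored as serialized bytes.
    type DistributedStore interface {
        Get(ctx context.Context, key string) ([]byte, error)
        Set(ctx context.Context, key string, value []byte, ttl time.Duration) error
    }

    // TwoTierCache keeps a local copy of entries in front of the distributed store.
    type TwoTierCache[T any] struct {
        mu     sync.RWMutex
        local  map[string]T
        remote DistributedStore
        ttl    time.Duration
    }

    func New[T any](remote DistributedStore, ttl time.Duration) *TwoTierCache[T] {
        return &TwoTierCache[T]{local: map[string]T{}, remote: remote, ttl: ttl}
    }

    func (c *TwoTierCache[T]) Get(ctx context.Context, key string) (T, bool) {
        c.mu.RLock()
        v, ok := c.local[key]
        c.mu.RUnlock()
        if ok {
            return v, true
        }
        raw, err := c.remote.Get(ctx, key)
        if err != nil || raw == nil {
            return v, false
        }
        // A value that cannot be deserialized is treated as a plain cache miss:
        // the caller recomputes it, which is what amplified the load during the incident.
        if err := json.Unmarshal(raw, &v); err != nil {
            return v, false
        }
        c.mu.Lock()
        c.local[key] = v
        c.mu.Unlock()
        return v, true
    }

    func (c *TwoTierCache[T]) Set(ctx context.Context, key string, value T) error {
        c.mu.Lock()
        c.local[key] = value
        c.mu.Unlock()
        raw, err := json.Marshal(value)
        if err != nil {
            return err
        }
        return c.remote.Set(ctx, key, raw, c.ttl)
    }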

One of those cache layers holds the route map of workspaces. There was a bug in this layer involving concurrent reads and writes to an in-memory hash map. It was generating only a very small number of errors (~0.0002% of requests), but it was a bug in the code that could easily escalate into something bigger, especially as the platform grows to handle more and more concurrent requests.
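
The general class of bug looks like the Go sketch below (the names here are invented for illustration): a map shared by concurrent requests must be guarded by a lock, otherwise rare races surface mostly under load.

    package router

    import "sync"

    // workspaceRoutes caches one route map per workspace. The original bug was of
    // this general kind: a plain map read and written by concurrent requests
    // without synchronization.
    type workspaceRoutes struct {
        mu   sync.RWMutex
        data map[string]map[string]string // workspace -> app -> installed version
    }

    func (c *workspaceRoutes) Get(workspace string) (map[string]string, bool) {
        c.mu.RLock() // readers share the lock
        defer c.mu.RUnlock()
        rm, ok := c.data[workspace]
        return rm, ok
    }

    func (c *workspaceRoutes) Put(workspace string, rm map[string]string) {
        c.mu.Lock() // writers take it exclusively
        defer c.mu.Unlock()
        if c.data == nil {
            c.data = map[string]map[string]string{}
        }
        c.data[workspace] = rm
    }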

The fix for this concurrent-access bug also involved a slight clean-up of the code: a type was moved between packages so it would live closer to where it was actually used. This change turned out to be incompatible with the way the service serializes data to the cache.

The main problem in the code is that, in its serialization logic, the service includes full type information about the objects in the resulting JSON. This means the original type of an object can be identified directly from the serialized JSON, but also that any change to the name or location of the type in the code results in errors when reading previously serialized data. This is a bad default behavior: the usual assumption is that, as long as the JSON for a given type is compatible, the type's name or location in the project should not matter.
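
A rough sketch, in Go and with invented names, of what this kind of serialization does: the cached JSON carries the type's full name, so renaming or moving the type makes previously cached entries unreadable even though their payload is still compatible.

    package cache

    import (
        "encoding/json"
        "fmt"
    )

    // typedEnvelope mimics a serializer that records the full type name of the
    // payload alongside the data. The format is illustrative, not the one the
    // router actually uses.
    type typedEnvelope struct {
        Type  string          `json:"$type"` // e.g. "router/internal/routing.RouteMap"
        Value json.RawMessage `json:"value"`
    }

    type RouteMap struct {
        Apps map[string]string `json:"apps"`
    }

    // decoders resolves the recorded type name to a decoding function. If the
    // type is renamed or moved to another package, the name stored in old cache
    // entries no longer resolves and every one of them fails to load.
    var decoders = map[string]func(json.RawMessage) (any, error){
        "router/internal/routing.RouteMap": func(raw json.RawMessage) (any, error) {
            var rm RouteMap
            return rm, json.Unmarshal(raw, &rm)
        },
    }

    func decode(data []byte) (any, error) {
        var env typedEnvelope
        if err := json.Unmarshal(data, &env); err != nil {
            return nil, err
        }
        dec, ok := decoders[env.Type]
        if !ok {
            // The failure mode behind the incident: a compatible payload that can
            // no longer be matched to its renamed or relocated type.
            return nil, fmt.Errorf("unknown type %q in cached entry", env.Type)
        }
        return dec(env.Value)
    }

Serializing plain JSON without the type name, and letting the call site choose the concrete type, avoids this whole class of incompatibility.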

This is what caused the incompatibility with the new version, which simply moved a type from one place to another in the project. The incompatibility could only be detected under high load, because a failure to deserialize an entry from the cache is simply treated as a cache miss by consumer code.

The more specific steps we are taking to avoid this issue in the future, given the additional context provided here, are:

  • Regarding the root cause of the incompatibility, which is this type-handling serialization behavior: stop including full type information in the JSON. It can only lead to incompatibility bugs like this one and to other bad design choices in the code;

  • Regarding further monitoring in the router service: include at least an alarm for when too much data fails to be deserialized from the cache;

  • The “stale if error” and circuit-breaker features of the router are out of the scope of this bug, but they could have mitigated some of the problems these errors created if they had been better tuned and used. A minimal sketch of such a circuit breaker follows.
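
The sketch below is a minimal circuit breaker in Go. The thresholds, names, and structure are assumptions; the router's subsystem is more sophisticated (per-backend state, half-open probing, and so on). The idea is simply to stop calling an unhealthy backend for a while so it can recover instead of receiving even more load.

    package router

    import (
        "errors"
        "sync"
        "time"
    )

    // ErrOpen is returned while the breaker is rejecting calls to an unhealthy backend.
    var ErrOpen = errors.New("circuit breaker open")

    // breaker trips after maxFailures consecutive errors and then fails fast
    // until the cooldown has passed, when it lets a trial call through again.
    type breaker struct {
        mu          sync.Mutex
        failures    int
        openedAt    time.Time
        maxFailures int
        cooldown    time.Duration
    }

    func (b *breaker) Call(fn func() error) error {
        b.mu.Lock()
        if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
            b.mu.Unlock()
            return ErrOpen // fail fast while the backend is considered unhealthy
        }
        b.mu.Unlock()

        err := fn()

        b.mu.Lock()
        defer b.mu.Unlock()
        if err != nil {
            b.failures++
            if b.failures >= b.maxFailures {
                b.openedAt = time.Now() // (re)open the breaker
            }
            return err
        }
        b.failures = 0 // a success closes the breaker again
        return nil
    }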

Thu, Oct 25, 2018, 08:04 PM

Resolved

Between 12:05 PM and 12:26 PM UTC-3 (Brasilia), we experienced elevated error rates in some parts of our platform. We will work to avoid these issues in the future. The service is now operating normally.

Wed, Oct 24, 2018, 04:27 PM

Monitoring

We can confirm an improvement in the elevated error rates on our platform. We are monitoring the results of our actions.

Wed, Oct 24, 2018, 03:29 PM

Investigating

We are investigating increased error rates with our Render pages and administrative environment ("myvtex.com").

Wed, Oct 24, 2018, 03:27 PM