Write-up published
Resolved
All date-times are in GMT-3.
On July 19, 2019, at 6:42 PM, we deployed a new version of a core component of the VTEX IO infrastructure that contained two major changes:
An improvement to part of the platform's cache logic, which fixed an issue that was impacting a minority of our customers. This was of extreme importance and urgency, as there were indications that it could be the cause of larger problems in the short term.
A change in the format of a specific cookie that the platform uses to allocate users in A/B tests. The cookie carried only workspace- and session-related information, but we also needed it to carry the id of the test that the session belonged to, so that we could properly identify lingering sessions from previous tests. The goal was to remove a distortion in the data: results of running A/B tests were being mixed with results from previous tests.
This last change caused a malfunction in some services that, inadvisably, depended on this cookie's format. One of these malfunctioning services was critical to the checkout flow, which impacted the sales of stores running on top of the VTEX IO platform.
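As an illustration of how a consumer can avoid this kind of coupling, here is a minimal sketch of a parser that accepts both the old and the new cookie layout. The actual cookie format is not described in this write-up, so the layouts below ("workspace:session" and "workspace:session:test id"), the separator, and the names `AllocationCookie`, `parse_allocation_cookie`, and `is_lingering` are all hypothetical.

```python
from typing import NamedTuple, Optional


class AllocationCookie(NamedTuple):
    workspace: str
    session_id: str
    test_id: Optional[str]  # None for cookies written in the old layout


def parse_allocation_cookie(raw: str, sep: str = ":") -> Optional[AllocationCookie]:
    """Parse a hypothetical A/B allocation cookie, tolerating both layouts.

    Old layout (assumed): "<workspace>:<session_id>"
    New layout (assumed): "<workspace>:<session_id>:<test_id>"
    Returns None for anything else, so the caller can fall back to a
    default allocation instead of failing the request.
    """
    parts = raw.split(sep) if raw else []
    if len(parts) not in (2, 3) or not all(parts):
        return None
    test_id = parts[2] if len(parts) == 3 else None
    return AllocationCookie(parts[0], parts[1], test_id)


def is_lingering(cookie: AllocationCookie, current_test_id: str) -> bool:
    """A session is lingering if it cannot be tied to the running test."""
    return cookie.test_id != current_test_id
```

With a parser along these lines, a service that only needs the workspace keeps working when a field is appended, and an unrecognized value degrades to a fallback rather than an error in the checkout flow.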
On July 20, 2019, at 9:30 AM, a customer contacted our support team to report the problem. We notified the engineering team, who started investigating the root cause. At 10:48 AM the root cause was identified and we started the rollback process, which was completed by 10:52 AM, fixing the problem.
There are two main reasons why we took so long to identify and resolve this incident.
The deploy happened at the end of the day, when our customers and our engineering teams were out of the office.
Because the scenario at the root of the incident was so specific, it impacted less than 1% of our customer base, and our monitoring systems were unable to detect a behavior change after the release of the new version.
We are working internally to reinforce our practices to avoid deploys at times when neither the customer nor our team is available to identify and fix unwanted behavior changes.
Monitoring behavior change in such a specific scenario is a challenging endeavor. In the short term, we will add individual monitoring for such customers while we work on improving our analytical capabilities to identify such scenarios at scale.
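To make the idea of individual monitoring concrete, here is a rough sketch of a per-account check that compares checkout conversion before and after a deploy. It is not a description of our monitoring systems: the data shape, the thresholds, and the names `conversion` and `accounts_with_drop` are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple

# (orders_completed, checkouts_started) for one account in one time window
Window = Tuple[int, int]


def conversion(window: Window) -> float:
    """Share of started checkouts that ended in a completed order."""
    completed, started = window
    return completed / started if started else 0.0


def accounts_with_drop(
    baseline: Dict[str, Window],
    after_deploy: Dict[str, Window],
    max_relative_drop: float = 0.5,  # assumed threshold
    min_started: int = 20,           # assumed minimum traffic per window
) -> List[str]:
    """Return accounts whose checkout conversion fell sharply after a deploy."""
    flagged = []
    for account, before in baseline.items():
        after = after_deploy.get(account, (0, 0))
        if before[1] < min_started or after[1] < min_started:
            continue  # too little traffic to draw a conclusion
        base_rate = conversion(before)
        if base_rate == 0:
            continue  # nothing to compare against
        if conversion(after) < base_rate * (1 - max_relative_drop):
            flagged.append(account)
    return sorted(flagged)
```

Evaluating each account separately lets a sharp drop in a single store stand out even when it is too small to move platform-wide numbers. A known limitation of this sketch is that accounts with little traffic in either window are skipped, so it would need a complementary check for stores whose checkout traffic disappears entirely.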
Resolved
Between July 19 at 6:42 PM and July 20 at 10:52 AM UTC-3 (BRT), we experienced an outage of the VTEX IO platform that impacted less than 1% of our customers. The service is now operating normally.