Increase in Master Data 5xx errors
Resolved
Partial outage
Lasted for 53m

This incident has been resolved. We will provide further information in a public incident report.

Summary

On Mar 14, 2024, from 14:46 to 15:33 UTC, systems and integrations that have Master Data as a dependency (such as some Logistics/B2B apps and services) experienced a higher than usual 5xx error rate.

Our global sales flow was partially affected for the first 20 minutes of the incident; for the remaining 27 minutes, degraded performance was observed.

We apologize for any inconvenience this may have caused.

Timeline

At 14:46 UTC, the team responsible for Master Data identified that a recent configuration change was unsuccessful. They immediately started mitigation actions to revert the configuration to its original state.

At 14:48 UTC, the configuration was reverted to its original state. However, some instances lingered in an unhealthy state even after the revert.

At 14:53 UTC, our incident response team was notified of the issue.

At 15:05 UTC, the team identified that the instances remained in an unhealthy state due to a corrupted cache and started discussing potential mitigation actions.

At 15:19 UTC, the team started identifying and terminating unhealthy instances as a mitigation action.

At 15:33 UTC, the team completed all mitigation actions and normal platform behavior was reestablished.
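
The report does not name the tooling used for the 15:19 UTC mitigation step, so the Python sketch below is only a minimal illustration of the idea under assumed interfaces: an instance-management API (INSTANCES_API), a per-instance /health probe, and the field names are hypothetical placeholders, not Master Data's actual endpoints. It lists the service's instances, probes each one, and terminates any that still report unhealthy so the platform can replace them.

    import requests

    INSTANCES_API = "https://ops.example.internal/master-data/instances"  # hypothetical endpoint

    def list_instances():
        # Return the registered Master Data service instances (hypothetical API).
        resp = requests.get(INSTANCES_API, timeout=10)
        resp.raise_for_status()
        return resp.json()

    def is_healthy(instance):
        # Probe the instance's health endpoint; a corrupted local cache shows up
        # as persistent failures even though the configuration is already correct.
        try:
            probe = requests.get(f"http://{instance['address']}/health", timeout=5)
            return probe.status_code == 200
        except requests.RequestException:
            return False

    def terminate(instance):
        # Ask the platform to terminate the instance so a fresh one replaces it.
        requests.delete(f"{INSTANCES_API}/{instance['id']}", timeout=10).raise_for_status()

    if __name__ == "__main__":
        for inst in list_instances():
            if not is_healthy(inst):
                print(f"Terminating unhealthy instance {inst['id']}")
                terminate(inst)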

Thu, Mar 14, 2024, 04:12 PM
Updates

Monitoring

The fix for the issue in Master Data has been implemented.

Stores should no longer be experiencing higher than usual 5xx errors in Master Data.

Our incident response team is monitoring to ensure that normal platform behavior is fully reestablished.
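
As an illustration of this monitoring phase, the Python sketch below computes the share of 5xx responses over a recent window and reports whether it is back within the normal range. The metrics endpoint, response fields, and baseline threshold are assumptions for illustration only, not our real tooling.

    import requests

    METRICS_URL = "https://ops.example.internal/metrics/master-data/requests"  # hypothetical
    BASELINE_5XX_RATIO = 0.001  # assumed "usual" 5xx rate of 0.1%, for illustration only

    def current_5xx_ratio(window_minutes=5):
        # Fetch request counts for the last few minutes and return the 5xx ratio.
        resp = requests.get(METRICS_URL, params={"window": f"{window_minutes}m"}, timeout=10)
        resp.raise_for_status()
        counts = resp.json()  # e.g. {"total": 120000, "status_5xx": 84}
        return counts["status_5xx"] / max(counts["total"], 1)

    if __name__ == "__main__":
        ratio = current_5xx_ratio()
        if ratio > BASELINE_5XX_RATIO:
            print(f"5xx ratio {ratio:.4%} is still above baseline; keep monitoring")
        else:
            print(f"5xx ratio {ratio:.4%} is back within the normal range")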

We will send an additional update in the next 30 minutes, or as soon as we have more information to share.

Thu, Mar 14, 2024, 03:39 PM

Identified

We have identified the contributing factors of the issue in Master Data: a recent configuration change made to its core services was unsuccessful, and, due to a corrupted cache, some of the instances running these services lingered in an unhealthy state even after the configuration was reverted.

Our incident response team is currently fixing the issue by intervening directly in the instances that remain in an unhealthy state. We estimate that the fix will be completed within the next few minutes.

We will send an additional update in the next 30 minutes, or as soon as we have more information to share.

Thu, Mar 14, 2024, 03:26 PM

Investigating

We are currently investigating an increase in 5xx errors in Master Data.

Systems and integrations that have Master Data as a dependency (such as some Logistics/B2B apps and services) may be experiencing a higher than usual error rate.
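
For integrations that call Master Data, a common way to ride out a short-lived spike in 5xx responses is to retry idempotent reads with exponential backoff rather than failing on the first error. The Python sketch below is a generic illustration of that pattern; the URL is a placeholder, not a real Master Data endpoint.

    import time
    import requests

    def get_with_backoff(url, attempts=4, base_delay=0.5):
        # GET an idempotent resource, retrying on 5xx with exponential backoff.
        resp = None
        for attempt in range(attempts):
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:
                return resp  # success, or a client error that retrying will not fix
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
        resp.raise_for_status()  # retries exhausted; surface the last 5xx
        return resp

    if __name__ == "__main__":
        response = get_with_backoff("https://api.example.com/master-data/documents/123")  # placeholder URL
        print(response.status_code)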

Our incident response team is working to identify the root cause and implement a solution.

We will send an additional update in the next 30 minutes, or as soon as we have more information to share.

Thu, Mar 14, 2024, 03:18 PM