A component of SunFire's system failed and caused network delays and latency across the enrollment and quoting platform and API services.
An AWS service, called memcache, failed unexpectedly. While this service is fault-tolerant, the age of the node(s) in particular caused a failure within the platform. The memcache service is used to reduce latency in system responsiveness.
Moving forward SunFire has implemented routine reviews of the node(s) and periodic recycling will be applied.
Incident Characteristics
Detailed Information | |
---|---|
Total Outage Time (actual outage, not degradation) | 00h00m |
Total degradation time (errors or slower than normal response) | 00h38m |
Start Time | 9/5/2024 10:10 AM |
Perceived End Time (SunFire monitoring showed a fully functioning environment) | 9/5/2024 11:09 AM |
Actual End Time | 9/5/2024 1:45 PM |
Affected services | Enrollment and Quoting Platform, Enrollments, Call Center (Blaze), Field Agent (BlazeConnect) , Ember, API Services |
Root Cause | An underlying AWS service failed unexpectedly. |
Current Status | Operational |
Remediations | During the incident, SunFire load balanced to its other active region to move away from the caching nodes affecting them. |
Data Loss | No data was Lost |
Mitigations (how do we stop this from occurring again) | SunFire has implemented a caching node hygiene posture to mitigate this in the future. |
SunFire, for full transparency, is including a screenshot of our third-party monitoring service indicating the outage period for non-cached network traffic.