Reports of Network Timeouts, Blank Pages, or No Response
Incident Report for SunFire
Postmortem

Summary

A caching component of SunFire's system failed, causing network delays and increased latency across the enrollment and quoting platform and API services.

What Happened

An AWS service called memcache failed unexpectedly. Although this service is fault-tolerant, the age of the affected node(s) led to a failure within the platform. SunFire uses the memcache service to cache data and reduce response latency.
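
To illustrate how a cache can reduce latency without becoming a single point of failure, the following is a minimal sketch of a read-through lookup that falls back to the source of truth when the cache is unreachable. It is not SunFire's implementation: `InMemoryCache`, `CacheUnavailable`, and `read_through` are hypothetical names, and the in-memory class only stands in for a real memcache client.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when a node cannot be reached."""


class InMemoryCache:
    """Stand-in for a memcache client; healthy=False simulates a failed node."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.healthy:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.healthy:
            raise CacheUnavailable(key)
        self._data[key] = value


def read_through(cache: InMemoryCache, key: str, load: Callable[[str], str]) -> str:
    """Return the cached value when possible; otherwise load it from the source."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached  # Cache hit: no call to the slower source.
    except CacheUnavailable:
        # Cache node is down: bypass it and serve from the source of truth.
        return load(key)

    value = load(key)  # Cache miss: fetch from the source.
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # Failing to populate the cache should not fail the request.
    return value
```

With a healthy cache, a repeated lookup of the same key calls `load` only once. With an unhealthy cache, every lookup still succeeds, only more slowly, which corresponds to degradation rather than an outage.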

Mitigations

Moving forward, SunFire has implemented routine reviews of the node(s) and will recycle them periodically.
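
A routine review of node age could be sketched as follows. This is an illustrative assumption, not SunFire's tooling: `CacheNode`, `nodes_due_for_recycling`, and the `max_age` threshold are hypothetical, and in practice the node inventory would come from the cloud provider rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class CacheNode:
    node_id: str          # Hypothetical node identifier.
    created_at: datetime  # When the node was launched.


def nodes_due_for_recycling(
    nodes: List[CacheNode], now: datetime, max_age: timedelta
) -> List[CacheNode]:
    """Return nodes older than max_age, oldest first."""
    due = [n for n in nodes if now - n.created_at > max_age]
    return sorted(due, key=lambda n: n.created_at)


if __name__ == "__main__":
    now = datetime(2024, 9, 10)
    inventory = [
        CacheNode("cache-a", datetime(2021, 3, 1)),
        CacheNode("cache-b", datetime(2024, 8, 1)),
    ]
    # The 365-day threshold is an assumed value chosen for illustration.
    for node in nodes_due_for_recycling(inventory, now, timedelta(days=365)):
        print(f"{node.node_id} is due for recycling")
```

Returning the oldest nodes first lets a review replace the highest-risk nodes one at a time, rather than recycling every node at once and emptying the cache.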

Incident Characteristics

Detailed Information
Total Outage Time (actual outage, not degradation): 00h00m
Total Degradation Time (errors or slower than normal response): 00h38m
Start Time: 9/5/2024 10:10 AM
Perceived End Time (SunFire monitoring showed a fully functioning environment): 9/5/2024 11:09 AM
Actual End Time: 9/5/2024 1:45 PM
Affected Services: Enrollment and Quoting Platform, Enrollments, Call Center (Blaze), Field Agent (BlazeConnect), Ember, API Services
Root Cause: An underlying AWS service failed unexpectedly.
Current Status: Operational
Remediations: During the incident, SunFire shifted traffic to its other active region, away from the affected caching nodes.
Data Loss: No data was lost.
Mitigations (how do we stop this from occurring again): SunFire has implemented routine reviews and periodic recycling of its caching nodes to reduce the risk of recurrence.

SunFire, for full transparency, is including a screenshot of our third-party monitoring service indicating the outage period for non-cached network traffic.

Posted Sep 10, 2024 - 16:10 UTC

Resolved
SunFire considers this issue resolved. Root cause analysis will be made available within 15 business days unless otherwise agreed upon.
Posted Sep 05, 2024 - 16:45 UTC
Update
Network traffic has returned to nominal levels. Due to the nature of networking connections, please completely close your browser and clear your cookies and cache.
Posted Sep 05, 2024 - 16:13 UTC
Update
A second issue was identified and a fix has been implemented. Traffic is being re-routed, and we are hearing reports of the application loading as intended. We consider the service degraded while users are re-routed. Please remember to clear your cookies and cache.
Posted Sep 05, 2024 - 16:00 UTC
Update
SunFire has received new reports of network issues, but systems remain operational and within spec. Please remember to clear your cookies and cache.
Posted Sep 05, 2024 - 15:27 UTC
Monitoring
SunFire has confirmed that system levels have returned to normal and is monitoring the system after customers reported that they are no longer seeing the network issues.
Posted Sep 05, 2024 - 15:09 UTC
Update
SunFire has implemented a fix and is seeing services return to normal levels.
Posted Sep 05, 2024 - 14:45 UTC
Identified
SunFire has identified the issue as extremely high network traffic, outside of normal conditions.
Posted Sep 05, 2024 - 14:30 UTC
Investigating
We are currently investigating reports of network timeouts, blank pages when loading the app, or no response.
Posted Sep 05, 2024 - 14:14 UTC
This incident affected: Enrollment and Quoting Platform (Enrollments, HRA Forms, Call Center (Blaze), Field Agent (BlazeConnect), Direct to Consumer (Ember)).