Summary
At approximately 04:06 UTC, we began receiving reports of connection failures from users attempting to access Degreed through our CA (Canadian) datacenter via SSO. After identifying the issue, our Database team began failing over to our secondary servers at 09:50 UTC. Once the failover was complete, the system was fully restored and operational at 10:30 UTC.
Root Cause
Our Engineering team was alerted to connection failures in the CA datacenter. The failures were caused by a routine network infrastructure update by Microsoft Azure, which broke SAML-based SSO login for users in that datacenter. After we failed over to our secondary servers, which were not affected by the Microsoft update, the CA datacenter was fully restored.
Plans for Improvement & Prevention
To be more proactive about this type of incident, we are working to create regular automated tests for SAML-based SSO logins, as well as additional alerts to monitor regular Microsoft updates in this environment.
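
As an illustration only, an automated SSO login test of the kind described above might look something like the following Python sketch. It is not our actual implementation. The names `check_sso_login`, `http_fetch`, and `ProbeResult`, the URL, and the `expected_marker` string are all hypothetical, and the sketch assumes the login start page returns an HTML form containing a known marker (for example, a `SAMLRequest` field when the HTTP-POST binding is used). It only exercises the first leg of the login flow, not a full sign-in.

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import urllib.error
import urllib.request


@dataclass
class ProbeResult:
    healthy: bool
    detail: str


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL (following redirects) and return (status, body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are reported as a status, not raised.
        return exc.code, ""


def check_sso_login(
    url: str,
    expected_marker: str,
    fetch: Callable[[str], Tuple[int, str]] = http_fetch,
) -> ProbeResult:
    """Probe the start of an SSO login and classify the outcome.

    `expected_marker` is an assumption: some string that appears in the
    response only when the SSO hand-off page renders correctly.
    """
    try:
        status, body = fetch(url)
    except OSError as exc:
        # Covers URLError, timeouts, and connection resets.
        return ProbeResult(False, f"connection failure: {exc}")
    if status != 200:
        return ProbeResult(False, f"unexpected HTTP status {status}")
    if expected_marker not in body:
        return ProbeResult(False, "expected marker missing from response")
    return ProbeResult(True, "ok")
```

A scheduler could run a check like this every few minutes against each datacenter and raise an alert whenever `healthy` is false. Passing `fetch` as a parameter keeps the classification logic testable without network access.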