[osc-fr1] A database-hosting server is unresponsive
Resolved
All databases restarted, end of incident
Post-Mortem
# Incident Report: Surviving the outage of Level3, major Internet peering provider
## TL;DR
On Sunday, August 30th, Level3, a major Internet actor and part of the CenturyLink telecommunications company, experienced a massive outage which impacted large parts of Internet transit, including our direct networking provider Outscale and, as a consequence, the Scalingo platform. The incident was not the direct responsibility of Scalingo, but we did our utmost to reduce its impact on our customers.
The incident affected both ingress traffic (users connecting to applications and databases hosted on the platform) and egress traffic (connections from apps to external services). This post goes over the course of events, covers the reaction of our team and explains which actions will be taken in the future.
## Timeline of the incident
All timestamps are in Central European Summer Time (CEST).
- [12:02] Pagers are ringing: some components of our infrastructure are detected as unreachable by our external probes on both `osc-fr1` and `osc-secnum-fr1`. Our team instantly starts our incident procedure, creates a post on our status page announcing networking issues when accessing the infrastructure, and begins the diagnosis.
- [12:18] Services and applications are mostly reachable again; our team checks that the situation is back to normal.
- [12:26] The network is flapping again, a few minutes up, then down. Our network provider Outscale is contacted to gather more information. They acknowledge that an incident is ongoing, but they don't have a precise cause to share with us.
- [13:00] Network access is completely down again. Our team has gathered network analysis showing heavy packet loss when trying to reach the infrastructure. Since our team works remotely over multiple Internet carriers, we realize that people on the Vodafone/SFR network can still access part of the infrastructure, while those connected through Orange are completely in the dark (see the probe sketch after this timeline for an illustration of this kind of check). This gives us the first hint that it's not a DDoS and not Outscale-related: it's a problem with Internet transit.
- [14:00] Our team discovers the status pages of other hosting providers reporting the same incident:
- Scaleway: [https://status.scaleway.com/incident/956](https://status.scaleway.com/incident/956)
- Cloudflare: [https://www.cloudflarestatus.com/incidents/hptvkprkvp23](https://www.cloudflarestatus.com/incidents/hptvkprkvp23)
- [14:28] Outscale communicates that their attempt to propagate new routes announcing our IPs over BGP is not working. That's why the IPs targeting our infrastructure have not failed over to another Internet transit provider. (Outscale peers with Level3, Cogent and FranceIX.)
- [14:35] A mitigation method is validated to make most applications reachable again:
- By booting servers at providers which were not impacted by the outage, we found an Internet route reaching our infrastructure without transiting through Level3. The idea was to redirect all HTTP(S) traffic through this route (2 additional connection hops under our control).
- [15:25] The mitigation is set up and our tests work as expected for our region `osc-fr1`. Our DNS rules are updated: all applications reached through a `*.osc-fr1.scalingo.io` URL, or whose domains use a CNAME record pointing to their Scalingo domain, are now reachable again.
- [15:50] The same mitigation is set up for our region `osc-secnum-fr1`; applications in this region are available again.
- [17:10] End of the Level3 outage. After ensuring everything is working as expected, we disable the mitigation.
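As an illustration of the kind of reachability check we ran from different carriers during the diagnosis, here is a minimal probe sketch. It is not our actual monitoring code: the hostname, port, attempt count and timeout below are placeholders.

```python
import socket
import time

# Placeholder target: any public endpoint of the infrastructure would do.
TARGET_HOST = "example.osc-fr1.scalingo.io"  # hypothetical hostname
TARGET_PORT = 443
ATTEMPTS = 20
TIMEOUT = 3.0  # seconds

def probe(host: str, port: int, attempts: int, timeout: float) -> float:
    """Try to open a TCP connection `attempts` times and return the success ratio."""
    successes = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                successes += 1
        except OSError:
            pass  # timeout or connection refused counts as a failure
        time.sleep(0.5)
    return successes / attempts

if __name__ == "__main__":
    ratio = probe(TARGET_HOST, TARGET_PORT, ATTEMPTS, TIMEOUT)
    print(f"{TARGET_HOST}:{TARGET_PORT} reachable {ratio:.0%} of the time")
```

Run from connections on different carriers, a probe like this makes the asymmetry visible: close to 100% success from one network, close to 0% from another, which points at a transit problem rather than at the hosting infrastructure itself.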
## Analysis
This was a global incident. Companies and Internet users were impacted worldwide, especially in Europe and North America where CenturyLink/Level3 has a particularly strong presence.
Normally such an incident is easy to work around when an infrastructure is directly connected to multiple transit providers, as **is the case** for Outscale (Cogent, Level3, FranceIX). If one of them has difficulties, BGP announcements are made to update how IPs should be routed, so that traffic stops passing through the damaged links. However, in this case, Level3 did not react to the route update announcements and kept broadcasting the old rules, preventing the new ones from being applied.
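To make this mechanism concrete, here is a deliberately simplified toy model. It is not how Outscale's routers or real BGP best-path selection work (which involves local preference, MED and more); the provider names, AS labels and documentation prefix are illustrative only.

```python
# Toy model of multi-homed transit failover, keeping only the AS-path idea.
routes = {
    "203.0.113.0/24": {                       # documentation prefix, placeholder
        "Level3":   ["AS_LEVEL3", "AS_OUTSCALE"],
        "Cogent":   ["AS_COGENT", "AS_OUTSCALE"],
        "FranceIX": ["AS_FRANCEIX", "AS_OUTSCALE"],
    }
}

def best_path(prefix: str) -> str:
    """Pick the announcing provider with the shortest AS path (toy criterion)."""
    candidates = routes[prefix]
    return min(candidates, key=lambda provider: len(candidates[provider]))

def withdraw(prefix: str, provider: str) -> None:
    """What normally happens: the failing provider withdraws its announcement."""
    routes[prefix].pop(provider, None)

prefix = "203.0.113.0/24"
print(best_path(prefix))      # one of the three providers is selected
withdraw(prefix, "Level3")    # the withdrawal that should have happened
print(best_path(prefix))      # traffic now flows through Cogent or FranceIX
# During this outage the withdrawal never came: Level3 kept re-announcing stale
# routes, so routers kept selecting a path on which traffic was lost.
```

The point of the sketch is the final comment: failover relies on the failing provider's routes disappearing, and a provider that keeps advertising broken routes defeats it.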
Several actors of the Internet couldn't apply this standard procedure to change their routing rules, to the point where major peering entities gathered and decided to start blacklisting the Level3 infrastructure to solve the problem worldwide. This operation coincided with the restoration of Level3's service at 17:10, when everything started working normally again.
## Impact
Until the mitigation was in place:
- Timeout when accessing Scalingo services (Dashboard, APIs, Deployment, One-off containers, etc.)
- Auto-deployments / Review apps from SCM Integrations were failing. We might have missed operations since webhooks from the different platforms were not reaching our services either.
- Timeout/Connection Refused when reaching the applications deployed on Scalingo.
Until the Level3 global outage was resolved:
- Connections to external services (e.g. GitHub) were timing out, impacting deployments
- Interactive one-off containers were not working
- Auto-deployments/Review apps from SCM Integrations were not always triggered.
## Communication
Our status page [https://scalingostatus.com](https://scalingostatus.com) was updated regularly throughout the day.
We answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected].
Our Twitter account [@ScalingoHQ](https://twitter.com/ScalingoHQ) posted about the major parts of the incident.
Specific information was also sent personally to some customers and to people who asked for it.
## Actions Taken and Future Plan
### Mitigation of the incident
During the incident, a mitigation strategy was designed and implemented to decrease the impact of the Level3 outage on Scalingo customers.
Since we control the DNS endpoint (through CNAME records) of most applications on the platform, we were able to divert the traffic to an IP which was not impacted by the outage, and then redirect it through the best possible route to reach the Outscale infrastructure without crossing the Level3 network.
During the incident, the traffic was diverted through a specific region of the OVH infrastructure, then through another Outscale datacenter, finally arriving in the `eu-west-2` datacenter where the Scalingo `osc-fr1` region is hosted. Although this is not the standard data path for our users' requests, we ensured this would not jeopardize the security of applications while increasing their availability.
Once this piping work was done, people were able to reach hosted applications (provided their own Internet connection was not broken by the Level3 outage).
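The "piping" described above boils down to chaining plain TCP relays: each intermediate hop accepts connections and forwards the bytes to the next hop towards the Outscale infrastructure. The sketch below shows one such relay under simplified assumptions; the addresses are placeholders and it ignores concerns the real setup had to handle (preserving client IPs, HTTPS pass-through at scale, rate limiting), so it is an illustration rather than the actual tooling we deployed.

```python
import socket
import threading

# Placeholder addresses: where this relay listens and the next hop it forwards to.
LISTEN_ADDR = ("0.0.0.0", 443)
NEXT_HOP = ("next-hop.example.com", 443)  # hypothetical next relay or entry point

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until the connection closes."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()
        src.close()

def handle(client: socket.socket) -> None:
    """Connect to the next hop and relay traffic in both directions."""
    upstream = socket.create_connection(NEXT_HOP)
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

def main() -> None:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(LISTEN_ADDR)
    server.listen()
    while True:
        client, _addr = server.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()

if __name__ == "__main__":
    main()
```

Because a relay of this kind works at the TCP level and the traffic is HTTPS, the intermediate hops only see encrypted bytes, which is part of why the diversion was considered acceptable (see the discussion in the future plan below).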
### Future plan
After each major incident, our team gathers for a retrospective of the event and draws up an action plan to improve the overall service provided by Scalingo.
- From a technical point of view, we are going to discuss with Outscale how we can ensure that the multiple IPs used for ingress and egress traffic are systematically routed differently, so that we can fall back efficiently when updating BGP routes is not an option.
- The mitigation plan was only applied after 2 hours of downtime. We strongly believe we could have done better on that front. We have updated our processes to ensure that an operator is dedicated to deploying a mitigation strategy from the beginning of a major incident, without waiting for an advanced diagnosis of the incident.
- Our operator team decided that diverting the traffic through the OVH network was an acceptable mitigation, since most of the traffic is encrypted (HTTPS) and the data stayed on European territory. The choice was made to ensure continuity of service, but the operator team couldn't be sure that this solution was acceptable on the contractual side. We have therefore decided to define a framework for determining whether a mitigation is acceptable or not.
- Some updates were not sent to the subscribers of our status page; the error comes from the unclear UI/UX of our status page provider, especially under stress. A solution has been designed to ensure the right settings are applied when publishing and updating incident posts.
Resolved
The issue is now resolved. One of our nodes got overloaded during a backup. The automated process tried to move applications off this node, but this overloaded another node and caused a snowball effect.
The situation is back to normal; we are still monitoring it.
Resolved
The issue has been identified: one of our software components was not correctly handling the messages used to update its internal configuration. A fix has been applied. Our team keeps monitoring the situation.
Resolved
The build queue is now empty, and new builds are processed at normal speed.