On Sunday 30th of August, Level3, a major Internet transit provider and part of the CenturyLink telecommunications company, experienced a massive outage which impacted large parts of Internet transit, including our direct networking provider Outscale and, as a consequence, the Scalingo platform. The incident was not directly Scalingo's responsibility, but we did our utmost to reduce the impact on our customers.
The incident impacted both ingress traffic (users connecting to applications and databases hosted on the platform) and egress traffic (connections from apps to external services). This post goes over the course of events, covers our team's reaction and explains which actions will be taken in the future.
All timestamps are in Central European Summer Time (CEST).
Networking issues are detected on osc-fr1 and osc-secnum-fr1; our team instantly starts our incident procedure: creation of a post on our status page announcing networking issues to access the infrastructure, and beginning of the diagnostic.
A mitigation is deployed for osc-fr1. Our DNS rules are updated: all applications reached using a *.osc-fr1.scalingo.io URL, or whose domains use a CNAME field targeting their Scalingo domain, are reachable again.
The same mitigation is deployed for osc-secnum-fr1: applications of this region are available again.
This incident was a global, worldwide event. Companies and Internet users were impacted, especially in Europe and North America where CenturyLink/Level3 is particularly present.
Normally, such an incident is easy to work around when an infrastructure is directly connected to multiple transit providers, as is the case for Outscale (Cogent, Level3, FranceIX). If one of them is in trouble, BGP announcements are made to update how IPs should be routed, so that traffic stops passing through the damaged links. However, in this case Level3 was not reacting to the route update announcements and kept broadcasting the old rules, preventing the new ones from being applied.
Several Internet actors could not apply this standard procedure to change the routing rules, to the point where major peering entities gathered and decided to blacklist Level3's infrastructure to solve the problem worldwide. This operation coincided with the restoration of Level3's service at 17:10, when everything started working normally again.
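For readers unfamiliar with BGP operations, "blacklisting" a transit provider essentially means refusing to exchange routes with it, so that traffic converges onto the remaining healthy paths. Below is a minimal sketch of what such a filter can look like, in BIRD 2.x syntax; the local ASN and peer address are examples, only AS3356 (Level3/CenturyLink's ASN) is a real value:

```
# Hypothetical BIRD 2.x snippet: "de-peering" a transit provider that keeps
# announcing stale routes. Local ASN and neighbor IP are examples only.
protocol bgp level3_transit {
    local as 65000;                 # example ASN, not a real network
    neighbor 203.0.113.1 as 3356;   # AS3356 is Level3/CenturyLink
    ipv4 {
        import none;                # stop accepting routes learned from this peer
        export none;                # stop announcing our prefixes to this peer
    };
}
```

Once enough operators apply this kind of filter, the stale announcements stop attracting traffic, which is what large peering actors ended up doing during this outage.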
Until the mitigation was running:
Until the Level3 global outage resolution:
Our status page https://scalingostatus.com was being updated regularly during the day.
We answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected].
Our Twitter account @ScalingoHQ posted about the major parts of the incident.
Specific information has been pushed personally to some customers or to people who asked.
During the incident, a mitigation strategy was designed and implemented to reduce the impact of the Level3 outage on Scalingo customers.
As we control the DNS endpoint of most applications on the platform through CNAME fields, we were able to divert the traffic to an IP which was not impacted by the outage, and then redirect it through the best route available to reach the Outscale infrastructure without crossing the Level3 network.
This was done: during the incident, the traffic was diverted through a specific region of the OVH infrastructure, then through another Outscale datacenter, and finally arrived in the eu-west-2 datacenter where the Scalingo osc-fr1 region is hosted. Although this is not the standard data path for our users' requests, we ensured this would not jeopardize the security of applications while increasing their availability.
Once this piping work was done, people were able to reach hosted applications again (as long as their own Internet connection was not itself broken by the Level3 outage).
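To make the diversion more concrete, here is a rough sketch of the DNS records involved, using hypothetical names and documentation IP addresses rather than our real records:

```
; Illustrative zone snippet only: hostnames and IPs are examples.
www.example.com.             CNAME  my-app.osc-fr1.scalingo.io.  ; customer domain, untouched
my-app.osc-fr1.scalingo.io.  CNAME  front.osc-fr1.scalingo.io.   ; per-app endpoint, untouched

; Before the mitigation: the front endpoint resolves to an IP reached through Level3.
front.osc-fr1.scalingo.io.   A      192.0.2.10      ; path crosses the Level3 network

; During the mitigation: the same endpoint points to a relay hosted on an
; unaffected network, which forwards traffic to Outscale around Level3.
front.osc-fr1.scalingo.io.   A      198.51.100.20   ; relay avoiding Level3
```

Domains configured with a CNAME followed this change automatically, while domains configured with a bare A record kept pointing at the unreachable IP, which is why we asked those customers to contact us.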
After each major incident, our team gathers for a retrospective of the event and elaborates an action plan to improve the overall service provided by Scalingo.
All Scalingo services are reachable again. The new BGP routes have been propagated following the decision of large Internet peering providers to blacklist Level3. The mitigation is no longer needed and has been removed.
Our team keeps monitoring the situation.
» Updated
On a larger scale, the Level3 outage is now considered a global incident: https://news.ycombinator.com/item?id=24322861
Other large Internet peering actors are blacklisting CenturyLink/Level3 from their networks in order to get back to a working state. Once this operation is completed by these operators, all services should be working as expected again.
» Updated
The same mitigation has been applied to osc-secnum-fr1: our services are now up, as well as applications whose custom domains are configured with CNAME fields. Reach us through the in-app support if you're using A fields in this region.
» Updated
For osc-fr1, as a mitigation, we've changed our HTTP/HTTPS routing to access the infrastructure.
If you've configured your traffic with an A field, please send an email to [email protected] or contact us through the in-app chat.
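If you are unsure which kind of record your domain uses (www.example.com below stands for your own domain), a quick dig check tells you whether the mitigation applies to you automatically:

```sh
# If the first query returns a *.scalingo.io target, your domain follows our
# CNAME and was switched over automatically; if you only get an IP you
# configured yourself (an A field), please contact us as explained above.
dig +short CNAME www.example.com
dig +short A www.example.com
```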
What's not working for the moment:
We'll apply the same mitigation to the osc-secnum-fr1 region.
Update on the standard routing part: Level3 is not propagating new BGP routing rules, which explains why attempts to mitigate the problem at this level haven't worked so far.
» Updated
Our services are still impacted.
Various entities are reporting incidents (Cloudflare, Scaleway, a lot of individuals). The Internet is seriously impacted in Paris.
Attempts to route around the peer provider Level3, which is dropping traffic, have not been successful so far (e.g. through Cogent).
We'll keep adding information as soon as we have improvements.
» Updated
The incident is not only impacting Outscale but seems broader across the "Île-de-France" (Paris) region; several actors are impacted, including Outscale and, as a consequence, Scalingo.
The peer provider Level3 seems impacted; attempts to divert traffic have been made, but no efficient mitigation method has been found so far. We'll add information as soon as we have it.
» Updated
Network access to our 'osc-fr1' region is still heavily impacted; we've seen some improvement for 'osc-secnum-fr1'.
Our team is still handling the incident. It has been confirmed to be a network-related issue: all apps are running, but not all requests manage to reach them.
Outscale is still handling the incident on their side.
» Updated
Network access to the platform is not 100% stable; we're still observing and analyzing our monitoring data.
Our provider acknowledged the networking instability, we are waiting for details from their side.
» Updated
The situation seems to be unstable. We are in touch with our infrastructure provider to resolve the situation as soon as possible.
» Updated
The network to the platform seems to be getting better on both osc-secnum-fr1 and osc-fr1. Our team is still monitoring the situation to gather information about this incident.
» Updated
Our probes detected that our public IPs are currently unreachable on both osc-fr1 and osc-secnum-fr1. Our team has been alerted and we're on it.
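For context, the kind of external probe that raises this alert can be as simple as a periodic TCP check against the public front-end IPs; a minimal sketch (the IP below is a documentation address, not one of our real endpoints):

```sh
# Flag the public endpoint as unreachable if a TCP connection to port 443
# cannot be established within 5 seconds (illustrative IP only).
nc -z -w 5 192.0.2.10 443 || echo "public IP 192.0.2.10 unreachable on port 443"
```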