[osc-fr1][osc-secnum-fr1] Network access to the platform unavailable

» Published on Sun, 30 Aug 2020 10:06:00 +0000

  • Post-Mortem

    Incident Report: Surviving the outage of Level3, a major Internet peering provider

    TL;DR

    On Sunday, 30 August 2020, Level3, a major actor of the Internet and part of the CenturyLink telecommunication company, experienced a massive outage which impacted large parts of Internet transit, including our direct networking provider Outscale and, as a consequence, the Scalingo platform. The incident was not Scalingo's direct responsibility, but we did our utmost to reduce the impact on our customers.

    The incident impacted both ingress traffic (users connecting to applications and databases hosted on the platform) and egress traffic (connections from apps to external services). This post goes over the course of events, covers our team's reaction and explains which actions will be taken in the future.

    Timeline of the incident

    All timestamps are in Central European Summer Time (CEST).

    • [12:02] Pagers ring: some components of our infrastructure are detected as unreachable by our external probes on both osc-fr1 and osc-secnum-fr1. Our team immediately starts our incident procedure, creates a post on our status page announcing networking issues when accessing the infrastructure, and begins the diagnosis.
    • [12:18] Services and applications are mostly reachable again; our team checks that the situation is back to normal.
    • [12:26] The network is flapping again, a few minutes up, then down. We reach our network provider Outscale to gather more information; they acknowledge that an incident is ongoing but have no precise cause to share with us.
    • [13:00] Network access is completely down again. Our team has gathered network measurements showing heavy packet loss when trying to reach the infrastructure (a minimal probe of this kind is sketched after this timeline). Since our team works remotely and uses multiple Internet carriers, we realize that people on the Vodafone/SFR network can still access part of the infrastructure, while those connected through Orange are completely in the dark. This gives us the first insight that it is not a DDoS and not Outscale-related, but a problem with Internet transit.
    • [14:00] Our team discovers the status pages of other hosting providers reporting the same incident.
    • [14:28] Outscale communicates that their attempt to propagate new routes announcing our IPs over BGP is not working. That is why the IPs targeting our infrastructure have not failed over to another Internet transit provider. (Outscale peers with Level3, Cogent and FranceIX.)
    • [14:35] A mitigation method is validated to allow most applications to be reachable again:
      • By booting servers at providers which were not impacted by the outage, we found an Internet route reaching our infrastructure without transiting through Level3. The idea is to redirect all HTTP(S) traffic through this route (two additional connection hops under our control).
    • [15:25] The mitigation has been set up and our tests work as expected for our osc-fr1 region. Our DNS rules are updated: all applications reached through a *.osc-fr1.scalingo.io URL, or whose custom domains use a CNAME record targeting their Scalingo domain, are now reachable again.
    • [15:50] The same mitigation is set up for our osc-secnum-fr1 region; applications in this region are available again.
    • [17:10] End of the Level3 outage. After ensuring everything works as expected, we disable the mitigation.
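
    The packet-loss measurements mentioned at 13:00 came from simple reachability checks run by team members on their respective carriers. Below is a minimal sketch of such a probe, with placeholder endpoint names (the real probes target the platform's public IPs); it only measures the rate of successful TCP connects from the local vantage point.

      import socket
      import time

      # Placeholder endpoints: the real probes target the platform's public IPs.
      ENDPOINTS = [("front-1.osc-fr1.example", 443), ("front-2.osc-fr1.example", 443)]

      def probe(host, port, attempts=20, timeout=2.0):
          """Return the fraction of TCP connections that succeed within `timeout` seconds."""
          ok = 0
          for _ in range(attempts):
              try:
                  with socket.create_connection((host, port), timeout=timeout):
                      ok += 1
              except OSError:
                  pass
              time.sleep(0.5)
          return ok / attempts

      if __name__ == "__main__":
          for host, port in ENDPOINTS:
              rate = probe(host, port)
              print(f"{host}:{port} reachable {rate:.0%} of the time from this connection")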

    Analysis

    This was a worldwide incident. Companies and Internet users were impacted, especially in Europe and North America where CenturyLink/Level3 has a particularly strong presence.

    Normally such an incident is easy to work around when an infrastructure is directly connected to multiple transit providers, as is the case for Outscale (Cogent, Level3, FranceIX). If one of them is in difficulty, BGP announcements are sent to update how IPs should be routed so that traffic stops passing through the damaged links. In this case however, Level3 was not reacting to the route updates and kept broadcasting the old ones, preventing the new routes from being applied.
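
    For readers who want to reproduce part of this diagnosis, public BGP data services expose how a prefix is currently seen by route collectors around the world. The sketch below queries RIPEstat's public routing-status endpoint for a placeholder prefix; the response fields used are an assumption based on that API's documentation, and this is an illustration rather than the exact tooling we used.

      import json
      import urllib.request

      # Placeholder prefix (TEST-NET-1): substitute the prefix whose visibility you want to check.
      PREFIX = "192.0.2.0/24"

      # RIPEstat data API: summarises how many BGP route collectors currently see the prefix.
      URL = f"https://stat.ripe.net/data/routing-status/data.json?resource={PREFIX}"

      with urllib.request.urlopen(URL, timeout=10) as response:
          data = json.load(response)["data"]

      # A prefix whose announcements are stuck or withdrawn shows reduced visibility.
      v4 = data.get("visibility", {}).get("v4", {})
      print(f"{PREFIX}: seen by {v4.get('ris_peers_seeing')} of {v4.get('total_ris_peers')} RIS peers")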

    Several Internet actors could not apply the standard procedure to change the routing rules, to the point where major peering entities decided to start blacklisting Level3's infrastructure to solve the problem worldwide. This operation coincided with Level3 restoring their service at 17:10, when everything started working normally again.

    Impact

    Until the mitigation was in place:

    • Timeout when accessing Scalingo services (Dashboard, APIs, Deployment, One-off containers, etc.)
    • Auto-deployments / Review apps from SCM Integrations were failing. We might have missed operations since webhooks from the different platforms were not reaching our services either.
    • Timeout/Connection Refused when reaching the applications deployed on Scalingo.

    Until the global Level3 outage was resolved:

    • Connections to external services (e.g. GitHub) were timing out, impacting deployments
    • Interactive one-off containers were not working
    • Auto-deployments/Review apps from SCM Integrations were not always triggered.

    Communication

    Our status page https://scalingostatus.com was updated regularly throughout the day.

    We answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected].

    Our Twitter account @ScalingoHQ posted about the major parts of the incident.

    Specific information was sent personally to some customers and to people who asked.

    Actions Taken and Future Plan

    Mitigation of the incident

    During the incident, a mitigation strategy was designed and implemented to decrease the impact of the Level3 outage on Scalingo customers.

    As we control the DNS endpoint of most applications on the platform through CNAME records, we were able to divert the traffic to an IP which was not impacted by the outage, and then redirect it along the best possible route to reach the Outscale infrastructure without crossing the Level3 network. During the incident the traffic was diverted through a specific region of the OVH infrastructure, then through another Outscale datacenter, and finally arrived in the eu-west-2 datacenter where the Scalingo osc-fr1 region is hosted. Although this is not the standard data path for our users' requests, we ensured it would not jeopardize the security of applications while increasing their availability. Once this piping work was done, people were able to reach hosted applications (provided their own Internet connection was not broken by the Level3 outage).
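
    At its core, the diversion described above is a chain of relay hops under our control: each hop accepts the incoming connection, opens a new one toward the next hop along a route avoiding Level3, and copies bytes in both directions. The sketch below is a heavily simplified, single-hop TCP relay with placeholder hostnames and ports; the production setup was naturally more involved, but the forwarding principle is the same.

      import asyncio

      # Placeholders: the relay listens on a public IP unaffected by the outage and
      # forwards to the next hop on the alternate route toward the Outscale front-ends.
      LISTEN_HOST, LISTEN_PORT = "0.0.0.0", 8443
      UPSTREAM_HOST, UPSTREAM_PORT = "next-hop.example", 443

      async def pipe(reader, writer):
          """Copy bytes from reader to writer until EOF, then close the writer."""
          try:
              data = await reader.read(65536)
              while data:
                  writer.write(data)
                  await writer.drain()
                  data = await reader.read(65536)
          finally:
              writer.close()

      async def handle_client(client_reader, client_writer):
          # Open a connection to the next hop and splice both directions together.
          upstream_reader, upstream_writer = await asyncio.open_connection(UPSTREAM_HOST, UPSTREAM_PORT)
          await asyncio.gather(
              pipe(client_reader, upstream_writer),
              pipe(upstream_reader, client_writer))

      async def main():
          server = await asyncio.start_server(handle_client, LISTEN_HOST, LISTEN_PORT)
          async with server:
              await server.serve_forever()

      if __name__ == "__main__":
          asyncio.run(main())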

    Future plan

    After each major incident, our team gathers for a retrospective of the event and draws up an action plan to improve the overall service provided by Scalingo.

    • From a technical point of view, we are going to discuss with Outscale how to ensure that the multiple IPs used for ingress and egress are systematically routed differently, so that we can fall back efficiently when updating BGP routes is not an option.
    • The mitigation plan was only applied after 2 hours of downtime; we strongly believe we could have done better on that side. We have updated our processes to ensure that an operator is dedicated to deploying a mitigation strategy from the beginning of a major incident, without waiting for an advanced diagnosis.
    • Our operator team decided that diverting the traffic through the OVH network was an acceptable mitigation, since most of the traffic is encrypted (HTTPS) and the data stayed on European territory. The choice was made to ensure continuity of service, but the operator team could not be sure this solution was acceptable on the contractual side. We have therefore decided to define a framework for judging whether a mitigation is acceptable or not.
    • Some updates were not sent to the subscribers of our status page; the error comes from the unclear UI/UX of our status page provider, especially under stress. A solution has been designed to ensure the right settings are used when publishing and updating incident posts.
    » Updated Thu, 10 Sep 2020 12:52:00 +0000
  • Resolved

    All Scalingo services are reachable again. The new BGP routes have been propagated following the decision of large Internet peering providers to blacklist Level3. The mitigation is no longer needed and has been removed.

    Our team keeps monitoring the situation.

    » Updated Sun, 30 Aug 2020 15:14:00 +0000
  • Update

    • Applications are available through the mitigation mechanisms our team has set up
    • TCP Gateway and Internet-accessible databases are partly reachable (at least traffic routed through SFR/Vodafone/Cogent gets through)
    • Deployments start but mostly fail because downloading dependencies from GitHub and other destinations is not always possible.

    On a larger scale, the Level3 outage is now considered global: https://news.ycombinator.com/item?id=24322861

    Other large Internet peering actors are blacklisting CenturyLink/Level3 from their networks in order to get back to a working state. Once these various operators have completed this operation, all services should be working as expected again.

    » Updated Sun, 30 Aug 2020 15:01:00 +0000
  • Update

    The same mitigation has been added to osc-secnum-fr1: our services are now up, as well as applications whose custom domains are configured with CNAME records. Contact us through the in-app support if you're using A records in this region.

    » Updated Sun, 30 Aug 2020 13:34:00 +0000
  • Update

    For osc-fr1, as a mitigation, we've changed the HTTP/HTTPS routing used to access the infrastructure.

    • Our APIs and Dashboard are back up
    • All applications using a CNAME should be back online

    If you've configured your traffic with an A record, please send an email to [email protected] or reach out through the in-app chat.
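
    To check whether a custom domain points at Scalingo through a CNAME (and therefore picks up the DNS-level mitigation automatically) or through an A record, you can inspect its DNS records. Here is a quick sketch using the dnspython package, with a placeholder domain:

      import dns.resolver  # pip install dnspython

      DOMAIN = "www.example.com"  # placeholder: your application's custom domain

      try:
          for record in dns.resolver.resolve(DOMAIN, "CNAME"):
              print(f"{DOMAIN} is a CNAME to {record.target}")
      except dns.resolver.NoAnswer:
          # No CNAME at this name: the domain is most likely configured with A records.
          for record in dns.resolver.resolve(DOMAIN, "A"):
              print(f"{DOMAIN} has A record {record.address}")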

    What's not working for the moment:

    • Login with GitHub: please use the username/password credentials
    • Deployments
    • Database Tunnel
    • Internet accessibility for databases
    • TCP Gateway addon

    We'll apply the same mitigation to the osc-secnum-fr1 region.

    An update on the standard routing side: Level3 is not propagating new BGP routing rules, which explains why attempts to mitigate the problem at that level haven't worked so far.

    » Updated Sun, 30 Aug 2020 13:27:00 +0000
  • Update

    Our services are still impacted.

    Various entities are reporting incidents (Cloudflare, Scaleway, a lot of individuals). The Internet is seriously impacted in Paris.

    New routing solutions (through Cogent) to circumvent the peering provider Level3, which is dropping the traffic, have not been successful so far.

    We'll keep adding information as soon as we have improvements.

    » Updated Sun, 30 Aug 2020 12:44:00 +0000
  • Update

    The incident is not impacting only Outscale but seems more global in the Île-de-France region (Paris area); several actors are impacted, including Outscale and, as a consequence, Scalingo.

    The peering provider Level3 seems impacted; attempts to divert traffic have been made, but no effective mitigation method has been found for now. We'll add information as soon as we have it.

    » Updated Sun, 30 Aug 2020 12:04:00 +0000
  • Update

    Network access to our osc-fr1 region is still heavily impacted; we've seen some improvement for osc-secnum-fr1.

    Our team is still handling the incident. It has been confirmed that this is a network-related issue: all apps are running, but not all requests manage to reach them.

    Outscale is still handling the incident on their side.

    » Updated Sun, 30 Aug 2020 11:53:00 +0000
  • Update

    Network access to the platform is not 100% stable; we're still observing and analyzing our monitoring data.

    Our provider has acknowledged the networking instability; we are waiting for details from their side.

    » Updated Sun, 30 Aug 2020 11:08:00 +0000
  • Update

    The situation seems to be unstable. We are in touch with our infrastructure provider to resolve the situation as soon as possible.

    » Updated Sun, 30 Aug 2020 10:33:00 +0000
  • Update

    Network access to the platform seems to be improving on both osc-secnum-fr1 and osc-fr1. Our team is still monitoring the situation and gathering information about this incident.

    » Updated Sun, 30 Aug 2020 10:20:00 +0000
  • Investigating

    Our probes detected that our public IPs are currently unreachable on both osc-fr1 and osc-secnum-fr1. Our team has been alerted; we're on it.

    » Updated Sun, 30 Aug 2020 10:06:00 +0000
