[osc-fr1][osc-secnum-fr1] Network access to the hosted applications is degraded

» Published on Mon, 28 Sep 2020 08:12:00 +0000

  • Post-Mortem

    20200928-DDoS

    Incident Report: DDoS osc-fr1 / osc-secnum-fr1

    TL;DR

    On Monday, September 28th 2020, the Outscale infrastructure used by Scalingo suffered a large-scale DDoS attack. We have since set up a long-term solution to cope efficiently with this type of attack. Your applications are now protected by these measures at no additional cost. It is important to note that your apps and your data remained secure throughout the incident.

    General Information

    What happened? Yesterday, the European data centers of 3DS Outscale were hit multiple times by Distributed Denial of Service (DDoS) attacks. The attackers targeted their infrastructure and managed to make it unavailable, and as a consequence the Scalingo platform hosted on it. This is not the first attack and it won't be the last. Each time we harden our defenses, but only a failed attack can prove the walls are strong enough.

    Who's the target? The targeted IP addresses are those used by Scalingo, but the end target could be any of our customers or even 3DS Outscale themselves. In reality, we have no way to know precisely who the target is! 😞 We also don't know who or what is behind the attack, nor why. This situation is frustrating for everyone.

    What happens during a DDoS attack? The principle of such an attack is to break one of the components that allow an application or website to provide its service. These components vary: saturating the network link (sending more traffic than the wire's bandwidth), overloading the CPU/RAM of routers and/or firewalls, or exploiting weaknesses of the TCP or UDP protocols to slow down routers and/or firewalls... For each type of attack there is a possible response (buying more bandwidth from the network provider, reconfiguring routers, adding custom traffic-shaping rules...), but it is difficult to anticipate every possible problem.
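
    To give a concrete idea of what a traffic-shaping rule does, here is a purely illustrative sketch in Python of the token-bucket principle used by many rate limiters. This is not code from our infrastructure, and the names and numbers are arbitrary: requests pass while tokens remain in the bucket, and the excess is dropped.

      # Purely illustrative token-bucket rate limiter, the idea behind many
      # traffic-shaping rules. Real mitigation happens in routers and firewalls,
      # not in application code; names and numbers here are arbitrary.
      import time

      class TokenBucket:
          def __init__(self, rate_per_second: float, burst: int):
              self.rate = rate_per_second       # tokens added per second
              self.capacity = burst             # maximum burst size
              self.tokens = float(burst)
              self.last_refill = time.monotonic()

          def allow(self) -> bool:
              now = time.monotonic()
              # Refill according to elapsed time, capped at the burst size.
              self.tokens = min(self.capacity,
                                self.tokens + (now - self.last_refill) * self.rate)
              self.last_refill = now
              if self.tokens >= 1:
                  self.tokens -= 1
                  return True       # request passes
              return False          # request is dropped (shaped away)

      bucket = TokenBucket(rate_per_second=100, burst=200)
      accepted = sum(bucket.allow() for _ in range(1000))
      print(f"{accepted} of 1000 instantaneous requests accepted")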

    In recent years, most attacks have been of the type "jam the Internet pipes with as much traffic as possible". The reason is that their price has dropped sharply, thanks to poorly secured connected devices (smart TVs, CCTV, etc.).

    The attack we suffered yesterday is of this "jamming the pipes" type. The method changed several times during the day, from TCP to UDP, targeting different IPs as the day went on. It impacted both French data centers of 3DS Outscale (eu-west-2, cloudgouv-eu-west-1), and the entirety of their services was down.

    What was impacted? During the attacks, the Internet links of 3DS Outscale were saturated, but the malicious traffic didn't reach our servers directly. This means that neither the servers nor the databases and their data were impacted. There was no data leak during these attacks.

    What has been done? Scalingo does not own its Internet access. Within the scope of our hosting contract, 3DS Outscale buys Internet bandwidth from different providers. When such an incident happens, 3DS Outscale is responsible for managing it internally and with their various peering providers.

    Yesterday, we changed the structure of our networking to make Scalingo easier to protect, and together with Outscale we decided to deploy filtering and traffic-scrubbing measures to protect the Scalingo infrastructure, and more globally the Outscale infrastructure.

    Will it happen again? It is impossible to prevent attackers from targeting our infrastructure. However, the goal is that when an attack does happen, you and your users will not notice it at all. The protection solutions set up on Monday, September 28th should allow your applications to stay online, without any change on your side, during this type of attack.

    What additional solution can I set up to protect my application? In the specific case of this attack, a multi-datacenter deployment across osc-fr1 and osc-secnum-fr1 would not have protected your application, since both regions were targeted.

    One really efficient method to stay online is to put your service behind a CDN with an anti-DDoS solution. Cloudflare offers a free plan which works very well for small and medium sites. It does not protect backend components, but the frontend can keep working from Cloudflare's cache, making it possible to serve a degraded version of your service instead of errors.
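
    As an illustration, here is a minimal sketch of the standard HTTP caching directives that let a CDN keep serving a stored copy of a page when the origin becomes unreachable. It assumes a Python/Flask backend purely as an example stack, and whether a given CDN honours stale-if-error depends on its plan and configuration.

      # Minimal sketch (assuming Flask as an example stack): cache headers that
      # allow a CDN edge to serve a stale copy if the origin errors or times out.
      from flask import Flask, make_response

      app = Flask(__name__)

      @app.route("/")
      def home():
          resp = make_response("<h1>Landing page</h1>")
          # Cache for 5 minutes at the edge; allow the cached copy to be served
          # for up to a day if the origin starts failing (RFC 5861 stale-if-error).
          resp.headers["Cache-Control"] = "public, max-age=300, stale-if-error=86400"
          return resp

      if __name__ == "__main__":
          app.run()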

    Timeline of the incident

    All times given are in CEST (UTC+2)

    • 10:12 Our team is alerted: some of our services as well as customer applications are detected as unavailable. It seems to impact both the osc-fr1 and osc-secnum-fr1 regions.
    • 10:15 The incident is classified as Major by the operator handling it, and more people are pulled in to help.
    • 10:19 The problem seems to impact all access to our infrastructure. Outscale's phone support is reached; they have already started handling the incident. The symptoms look like a DDoS attack.
    • 11:10 The incident is considered over by Outscale, but some of our IPs have been blacklisted in order to mitigate the attack. Application access is partially restored; traffic from the applications to the Internet is still impacted.
    • 12:30 The situation is back to nominal: applications are accessible and can reach the rest of the Internet.
    • 15:03 Second wave of attacks: both regions are hit with the same impact, applications are not reachable.
    • 15:28 The situation is better on osc-fr1; the osc-secnum-fr1 region is still impacted.
    • 16:04 The situation is now better on osc-secnum-fr1. Our IP addressing has been changed to help mitigate the attack; all apps in both regions are reachable as expected.
    • 17:50 IP addressing is back to normal.

    Impact

    During each wave of the attack, most legitimate requests to applications hosted on the platform timed out, making the applications unavailable.

    • osc-fr1: around 2 hours of unavailability were measured
    • osc-secnum-fr1: around 3 hours of unavailability were measured

    Once the attack was mitigated or ceased, everything went back to normal; no impact other than the blocking of legitimate traffic was detected.

    Communication

    Our status page https://scalingostatus.com was updated regularly throughout the day.

    We've answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected].

    Our Twitter account @ScalingoHQ posted updates about the major phases of the incident.

    Specific information was sent personally to some customers and to people who asked for it.

    Actions Taken and Future Plan

    Two areas of improvement have been identified:

    • During the first wave of the attack (in the morning), apps remained unavailable even though the attack was considered mitigated by our provider. The reason is that our egress traffic (traffic from inside our infrastructure to the Internet) was black-holed by our provider as a way to cope with the attack. No strict process previously existed for that situation, so we had to learn on the go, which took more time than it should have. A recurring internal workshop aimed at creating and documenting more mitigation procedures will be planned, so that we are better prepared when real incidents happen.
    • The anti-DDoS solution that Outscale is installing was not yet ready, nor was it clear to us how it could be leveraged to mitigate the ongoing attack. Following this event, the topic has become the top priority for us and for them, leading to an acceleration in the deployment of new protection methods inside their infrastructure (higher network capacity, active filtering of traffic through scrubbing centers). This work started the week of the attack and is still an active work in progress.

    As part of our internal process of continuous improvement, other internal measures have also been decided regarding incident management, communication and mitigation.

    Financial compensation

    As stated in our Terms of Service, we offer a 99.9% SLA to our Business customers (99.96% for databases using a Business plan).

    We're fully aware that the downtime which occurred on September 28th heavily impacted this commitment.

    Therefore, all Business customers will automatically receive financial compensation on their invoice for the month of October (5% per hour of downtime, see the worked example below):

    • osc-secnum-fr1: 15% discount
    • osc-fr1: 10% discount
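
    As an illustration only, here is how these percentages follow from the 5%-per-hour rule and the rounded downtime figures from the Impact section above (a small Python sketch, not part of any billing code):

      # Illustrative only: discount = 5% of the monthly invoice per hour of
      # measured downtime, per region. Hours are the rounded values from the
      # "Impact" section above.
      DISCOUNT_PER_HOUR = 0.05

      downtime_hours = {"osc-fr1": 2, "osc-secnum-fr1": 3}

      for region, hours in downtime_hours.items():
          print(f"{region}: {hours * DISCOUNT_PER_HOUR:.0%} discount")
      # osc-fr1: 10% discount
      # osc-secnum-fr1: 15% discount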

    To qualify as a Business customer, you must own at least one application with a database using a Business plan and at least 2 containers serving web or TCP traffic to your app.

    » Updated Wed, 21 Oct 2020 13:18:00 +0000
  • Resolved

    Our infrastructure provider has set up a workaround. Network access is now fully working as expected. Our team keeps monitoring the situation. Please contact support if you are still experiencing any issue reaching your application.

    » Updated Mon, 28 Sep 2020 16:05:00 +0000
  • Update

    The situation has remained stable since the last update. The incident is not considered over, as we still see some packet loss, but access to the applications works as expected.

    » Updated Mon, 28 Sep 2020 14:47:00 +0000
  • Update

    Network access to both regions (osc-fr1 and osc-secnum-fr1) is getting better. Access to the applications should work again, albeit with degraded performance.

    » Updated Mon, 28 Sep 2020 14:04:00 +0000
  • Update

    Access to the osc-fr1 region is getting better, yet still degraded. Our team keeps monitoring the situation in this region.

    Network access to the osc-secnum-fr1 region is still down.

    » Updated Mon, 28 Sep 2020 13:28:00 +0000
  • Update

    Our probes notified us that the network access is degraded again. Our team is working on it.

    » Updated Mon, 28 Sep 2020 13:03:00 +0000
  • Resolved

    Network access to the osc-fr1 region has been restored. Our team is running some post-incident checks. Access to your applications hosted on Scalingo should be back to normal.

    As the IP address used to reach the Scalingo infrastructure was changed to mitigate the attack, you will need to update your DNS record if you use an A record. Please contact support for more information.
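
    If you are unsure which address your domain currently resolves to, a few lines of Python (standard library only) can check it; "www.example.com" below is a placeholder for your own application domain:

      # Minimal sketch: list the IPv4 addresses a hostname currently resolves to,
      # useful to verify an A record after the platform's front-end IPs change.
      # Replace "www.example.com" with your own application domain.
      import socket

      def resolve_ipv4(hostname):
          infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                                     type=socket.SOCK_STREAM)
          return {info[4][0] for info in infos}

      if __name__ == "__main__":
          print(resolve_ipv4("www.example.com"))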

    » Updated Mon, 28 Sep 2020 10:33:00 +0000
  • Update

    Our team is working on a solution to fix the routing of egress traffic. We will share updates as soon as we have more information.

    » Updated Mon, 28 Sep 2020 10:17:00 +0000
  • Update

    The situation is improving in the osc-secnum-fr1 region. Network access to this region seems to be back to normal.

    Network access to the osc-fr1 region is also improving. Ingress traffic to the region is working, but we still have an issue with egress traffic. Our team is working on a solution.

    » Updated Mon, 28 Sep 2020 09:53:00 +0000
  • Update

    Our infrastructure provider confirmed that the network issue we are facing is caused by a DDoS attack. They keep working on fixing it.

    » Updated Mon, 28 Sep 2020 09:29:00 +0000
  • Update

    Network access to the applications hosted on osc-fr1 is also degraded. Some requests may fail from time to time.

    » Updated Mon, 28 Sep 2020 08:47:00 +0000
  • Update

    Deployments in the osc-fr1 and osc-secnum-fr1 regions are also impacted. Our team is in contact with our infrastructure provider about this issue.

    » Updated Mon, 28 Sep 2020 08:33:00 +0000
  • Investigating

    Our team has been alerted that network access to the applications hosted in the osc-secnum-fr1 region is degraded. The TCP addon and the "public availability" feature of the databases hosted on osc-fr1 are also impacted.

    Our team is investigating.

    » Updated Mon, 28 Sep 2020 08:12:00 +0000
