Scalingo Status

[osc-fr1][osc-secnum-fr1] Network access to the hosted applications is degraded

Created Mon, 28 Sep 2020 08:12:00 +0000

Post-Mortem

# 20200928-DDoS

# Incident Report: DDoS osc-fr1 / osc-secnum-fr1

## TL;DR

On Monday, September 28th 2020, the Outscale infrastructure used by Scalingo suffered a large-scale DDoS attack. We have now set up a long-term solution to cope efficiently with this type of attack. Your applications are now protected by it at no additional cost. It is important to note that your apps and your data remained secure during the incident.

## General Information

**What happened?**

Yesterday, the European data centers of 3DS Outscale were hit multiple times by Distributed Denial of Service (DDoS) attacks. The attackers targeted their infrastructure and managed to make it unavailable and, as a consequence, the Scalingo platform hosted on it. This is not the first attack and it won't be the last. Each time we harden our defenses, but only a failed attack can prove the walls strong enough.

**Who's the target?**

The targeted IP addresses are those used by Scalingo, but the end target could be any of our customers or even 3DS Outscale. In reality, we have no way to know precisely who the target is! 😞 We also don't know who or what is the source of the attack, or why. This situation is frustrating for everyone.

**What happens during a DDoS attack?**

During such an attack, the principle is to break one of the components that allow an application or website to provide its service. The techniques vary: saturating the network link (sending more traffic than the bandwidth of the wire), overloading the CPU/RAM of routers and/or firewalls, or exploiting weaknesses of the TCP or UDP protocols to slow down routers and/or firewalls. For each type of attack there is a possible response (buying more bandwidth from the network provider, configuring routers, adding custom traffic-shaping rules...), but it is difficult to know all the possible problems in advance.
In recent years, most attacks have been of the type "jamming the Internet pipes with as much traffic as possible". The reason is that their price has dropped sharply, thanks to poorly secured connected devices (smart TVs, CCTV, etc.). The attack we suffered yesterday was of this "jamming the pipes" type. The method used changed several times during the day, from TCP to UDP, targeting different IPs as the day went on. It impacted both French datacenters of 3DS Outscale (eu-west-2, cloudgouv-eu-west-1), and all of their services were down.

**What was impacted?**

During the attacks, the Internet links of 3DS Outscale were saturated, but the traffic didn't reach our servers directly. This means that neither the servers nor the databases and their data were impacted. There was no data leak during these attacks.

**What has been done?**

Scalingo does not own its Internet access. Within the scope of our hosting contract, 3DS Outscale buys Internet bandwidth from different providers. When such an incident happens, 3DS Outscale is responsible for managing it internally and with their different peer providers. Yesterday, we changed the structure of our network to make Scalingo easier to protect and, together with Outscale, decided to deploy filtering and traffic-scrubbing measures to protect the Scalingo infrastructure and, more broadly, the Outscale infrastructure.

**Will it happen again?**

It is impossible to prevent attackers from targeting our infrastructure. However, in the target setup, when an attack happens, you and your users will not notice it at all. The protection solutions set up this Monday, September 28th, should allow your applications to stay online without any change during this type of attack.

**What additional solution can I set up to protect my application?**

In the precise case of this attack, a multi-datacenter deployment across *osc-fr1* and *osc-secnum-fr1* would not have protected your application, since both regions were targeted by the attacks. One really efficient way to stay online is to protect your service with a CDN that includes an Anti-DDoS solution. [Cloudflare offers a free plan](https://cloudflare.com/) which is very efficient for small and medium sites. It does not protect backend components, but the frontend can keep working from Cloudflare's cache, and it is possible to serve a degraded version of your service instead of errors.

## Timeline of the incident

All times given are in CEST (UTC+2)

- 10:12 Our team is alerted: some services are detected as unavailable, as well as customer applications. It seems to impact both the `osc-fr1` and `osc-secnum-fr1` regions.
- 10:15 The incident is classified as Major by the operator handling it; more people are interrupted to handle the incident.
- 10:19 The problem seems to impact all access to our infrastructure. Outscale's phone support is reached; they have already started handling the incident. The symptoms look like a DDoS attack.
- 11:10 The incident is considered over by Outscale, but some of our IPs have been blacklisted in order to mitigate the attack. Application access is partially restored; traffic from the applications to the Internet is still impacted.
- 12:30 Situation back to nominal: applications are accessible and can reach the rest of the Internet.
- 15:03 Second wave of attacks: both regions are impacted, with the same effect; applications are not reachable.
- 15:28 Situation is better on `osc-fr1`; region `osc-secnum-fr1` is still impacted.
- 16:04 Situation is now better on `osc-secnum-fr1`. Our IP addressing has changed in order to help mitigate the attack; all apps of both regions are reachable as expected.
- 17:50 IP addressing is back to normal.

## Impact

During each wave of the attack, most legitimate requests to applications hosted on the platform timed out, so applications were unavailable.

- `osc-fr1`: around 2 hours of unavailability were measured
- `osc-secnum-fr1`: around 3 hours of unavailability were measured

Once the attack was mitigated or had ceased, everything went back to normal; no impact other than the blocking of legitimate traffic was detected.

## Communication

Our status page [https://scalingostatus.com](https://scalingostatus.com/) was updated regularly during the day.

We answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected].

Our Twitter account [@ScalingoHQ](https://twitter.com/ScalingoHQ) posted about the major stages of the incident.

Specific information was sent personally to some customers and to people who asked.

## Actions Taken and Future Plan

Two areas for improvement have been identified:

- During the first wave of the attack (in the morning), apps remained unavailable while the attack was considered mitigated by our provider. The reason is that our egress traffic (traffic from inside our infrastructure to the Internet) was black-holed by our provider as a way to cope with the attack. No strict process previously existed to handle that situation. We had to learn on the go, which took more time than it should have. A recurring internal workshop aimed at creating and documenting more mitigation solutions will be planned so that we are better prepared when real incidents happen.
- The Anti-DDoS solution that Outscale is installing was not yet ready, and it was not clear to us how it could be leveraged to mitigate the ongoing attack. Following this event, the topic has become the top priority for us and for them, leading to an acceleration in the deployment of new protection methods inside their infrastructure.
  (Higher network capacity and active filtering of traffic through scrubbing centers.) This work started the week of the attack and is still in progress.

Following our internal continuous-improvement process, other internal measures have also been decided concerning incident management, communication and mitigations.

## Financial compensation

As our Terms of Service state, we offer a [99.9% SLA](https://uptime.is/99.9) to our Business customers ([99.96%](https://uptime.is/99.96) for databases using a Business plan). We're fully aware that the downtime which occurred on September 28th has heavily impacted this commitment. Therefore all Business customers will **automatically** get a **financial compensation** on their invoice for the month of October (5% per hour of downtime).

- `osc-secnum-fr1`: 15% discount
- `osc-fr1`: 10% discount

To qualify as a Business user you must own at least one application with a database using a Business plan and at least 2 containers serving web or TCP traffic to your app.
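
The discounts above follow from the stated rule of 5% per hour of downtime, applied to the approximate downtime measured per region (about 2 hours on `osc-fr1`, about 3 hours on `osc-secnum-fr1`). The arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, whole hours are taken directly from the report (how partial hours are rounded is not stated), and the 30-day month used for the SLA downtime budget is an assumption.

```python
# Illustrative sketch only: function names are hypothetical, not part of any Scalingo tool.

COMPENSATION_PERCENT_PER_HOUR = 5  # "5% per hour of downtime", as stated in the report


def compensation_percent(downtime_hours: int) -> int:
    """Discount in percent for a given number of whole hours of downtime.

    Assumption: downtime is already expressed in whole hours, as in the
    report ("around 2 hours", "around 3 hours").
    """
    return downtime_hours * COMPENSATION_PERCENT_PER_HOUR


def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Monthly downtime budget, in minutes, implied by an availability SLA.

    Assumption: a 30-day month by default; the report does not say how
    the SLA period is measured.
    """
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Approximate downtime per region, taken from the "Impact" section.
    measured_downtime_hours = {"osc-fr1": 2, "osc-secnum-fr1": 3}
    for region, hours in measured_downtime_hours.items():
        print(f"{region}: {compensation_percent(hours)}% discount")

    # Downtime budgets implied by the two SLA levels quoted in the report.
    for sla in (99.9, 99.96):
        print(f"{sla}% SLA: about {allowed_downtime_minutes(sla):.1f} minutes per month")
```

Under these assumptions, a 99.9% SLA leaves a budget of roughly 43 minutes per 30-day month, so two to three hours of unavailability exceeds it by a wide margin.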
Posted: Wed, 21 Oct 2020 13:18:00 +0000

[osc-fr1] Performance degraded for one database host

Created Wed, 23 Sep 2020 15:14:00 +0000

Resolved: The situation has been stable so far; the incident is considered resolved.
Posted: Wed, 23 Sep 2020 15:37:00 +0000

[osc-fr1][osc-secnum-fr1] Network access to the hosted applications is degraded

Created Tue, 15 Sep 2020 11:02:00 +0000

Resolved: The situation seems to be back to normal in both regions, osc-fr1 and osc-secnum-fr1. Our team keeps monitoring the situation. Please contact support if you are still having trouble reaching your application.
Posted: Tue, 15 Sep 2020 12:54:00 +0000

[osc-fr1] Deployments hanging in "pushing" step

Created Mon, 07 Sep 2020 15:28:00 +0000

Resolved: The situation is stable again. Outscale's object storage is our primary storage again.
Posted: Mon, 07 Sep 2020 16:07:00 +0000

[osc-fr1] Deployments hanging in "pushing" step

Created Thu, 03 Sep 2020 12:29:00 +0000

Resolved: All pending deployments have now finished; the incident is over.
Posted: Thu, 03 Sep 2020 14:40:00 +0000
