[osc-fr1] Apps are unreachable

» Published on Mon, 04 Apr 2022 18:42:00 +0000

Post-Mortem

Incident Report: Platform Unavailability on Region osc-fr1 the 4th of April

TL;DR

On Monday, April 4th, 2022 at 20:36 (CEST), Scalingo suffered a Denial of Service (DoS) attack originating from inside our infrastructure. We have put a first layer of protection in place to prevent such events from happening again, and we are working with our infrastructure provider to improve our long-term resilience to such attacks. This incident only impacted the network availability of your apps: the apps themselves, and especially your data, remained secure throughout the incident.

Timeline of the incident

All times given are in CEST (UTC+2)
- 20:08 A malicious user deploys a large number of applications on the platform.
- 20:36 These apps start generating a large amount of network traffic.
- 20:36 A component in charge of managing our network (VPC) is overwhelmed by this increase in traffic and starts dropping packets
- 20:42 We receive our first alert: Applications deployed on the osc-fr1 region are unavailable. The first on-call responder is on deck.
- 20:44 The first responder declares a major incident on the platform. More engineers are called in to assist with diagnostics and incident resolution.
- 20:46 There is no way to contact the region. The incident seems to impact all of our public IPs, as well as the private IPs we use for platform administration. The region is effectively cut off from the outside world.
- 20:47 The incident is escalated to our infrastructure provider. They are not aware of any global ongoing incident on their side. They start investigating our resources specifically.
- 21:42 Our infrastructure provider finds the cause: the component in charge of managing our network is overloaded. It is receiving far more traffic than it can handle. However, they are not sure what the source of the traffic is.
- 22:04 Our operators theorize that it could be an external attack and completely stop the load balancers receiving ingress traffic, in order to regain access through our administration interface. No impact on availability is detected: the hypothesis is wrong.
- 22:47 A second hypothesis is tested: the network saturation is coming from inside. After getting the list of servers from our infrastructure provider, we immediately stop them. The situation improves immediately. We regain access to our administration tools.
- 22:59 We restart our load balancers on the osc-fr1 region. ~70% of the apps are immediately back online and running. The remaining ~30% are on the instances we stopped and have to be restarted.
- 23:24 We notice some unusual behavior with our public IPs (5.104.101.30 and 109.232.236.90). We exclude them from our DNS pools when needed, until they recover their standard behavior.
- 00:01 The majority (> 98%) of the apps are up and running. The last apps are being diagnosed case by case by our operators.
- 00:40 All apps are now back up and running.
- 01:02 We find the source of the issue with our public IPs. It is due to a bug in our infrastructure provider's networking layer. A workaround is deployed and our IPs are back up and running.
- 01:09 All of our post-incident checks are green. The incident is now declared over.

Impact

On osc-secnum-fr1, no applications or databases were impacted. However, our dashboard and APIs were not usable for the duration of the incident.

On osc-fr1, applications were unreachable for at least 2h50 (and at most 4h04).
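As a rough cross-check against the timeline above, the 4h04 upper bound matches the span between 20:36 (the network component starts dropping packets) and 00:40 (all apps back up). The sketch below only recomputes that span. The helper names are illustrative, and the 2h50 lower bound is not recomputed because the timeline does not say which pair of events defines it.

```python
from datetime import datetime, timedelta


def elapsed(start_hhmm: str, end_hhmm: str) -> timedelta:
    """Time between two wall-clock times, assuming the end falls less than 24h after the start."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    if end < start:  # the interval crosses midnight
        end += timedelta(days=1)
    return end - start


def as_hours_minutes(delta: timedelta) -> str:
    """Format a duration in the 'XhYY' style used in this report."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h{total_minutes % 60:02d}"


# 20:36: packets start being dropped; 00:40: all apps are back up (both CEST).
upper_bound = elapsed("20:36", "00:40")
print(as_hours_minutes(upper_bound))  # prints 4h04
```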

There was no impact on data integrity or confidentiality.

Communication

Our status page https://scalingostatus.com was updated regularly during the day.

We answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected].

Our Twitter account @ScalingoHQ posted about the major parts of the incident.

Specific information was sent directly to some customers and to anyone who asked, whatever the channel.

Analysis and Actions Taken

The root cause was the saturation of a component managing the network of the region. This was possible because our rate limits were not strict enough. Those rate limits have been lowered to prevent such an event from happening again. In addition, monitoring improvements have been made so that we are notified much more quickly of this type of abuse.
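To illustrate what a stricter rate limit means in practice, here is a minimal token-bucket sketch. It is a generic illustration only, not a description of how Scalingo or its infrastructure provider enforces limits. The class name, parameters, and numbers are all hypothetical.

```python
import time


class TokenBucket:
    """Generic token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the behavior can be checked deterministically
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if enough are available, else False."""
        now = self.clock()
        # Refill according to the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical numbers: 2 units per second with a burst of 5.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(6)]  # five requests pass, the sixth is refused
```

In such a scheme, lowering a limit amounts to choosing a smaller `rate` or `capacity`, which bounds how much traffic a single source can generate before being throttled.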

The fact that we were blind during the incident significantly slowed our incident response. Some projects were already on the roadmap to make us more resilient to this kind of issue. The priority of those projects is being raised.

We are also in communication with our network provider to speed up the incident response. We have also opened a discussion on how to improve the resiliency of the VPC.

Last but not least, it became clear that Scalingo needs to enforce stronger identity verification techniques and stricter default quotas until the identity of a customer has been proven. This would help detect malicious users earlier and prevent them from abusing the platform.

Service Level Agreement Consideration

We offer a 99.9% SLA for applications scaled on at least 2 containers and 99.96% for databases using a Business plan.

We are fully aware that the downtime which occurred on April 4th has heavily impacted this commitment.

Therefore, all customers who meet these criteria should contact our support to get the appropriate financial compensation.
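To give a sense of scale, the sketch below converts an availability percentage into a downtime budget. It assumes a 30-day measurement period, which is an assumption made for illustration only: this report does not state the period over which the SLA is measured. Under that assumption, 99.9% allows about 43 minutes of downtime and 99.96% about 17 minutes, both well below the minimum of 2h50 reported above.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget in minutes for a given availability target.

    period_days=30 is an illustrative assumption, not a contractual value.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - sla_percent) / 100


# Minimum unavailability reported for osc-fr1: 2h50.
incident_minutes = 2 * 60 + 50

for sla in (99.9, 99.96):
    budget = allowed_downtime_minutes(sla)
    exceeded = incident_minutes > budget
    print(f"{sla}% SLA: about {budget:.0f} min allowed per 30 days, exceeded: {exceeded}")
```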
» Updated Thu, 07 Apr 2022 12:48:00 +0000
Resolved

All applications are back online on the osc-fr1 region. The incident is considered over on our side. Please contact our support if you are still experiencing trouble with your applications.
» Updated Mon, 04 Apr 2022 23:06:00 +0000
Update

All applications have been restarted. All Scalingo services are restored. Applications with a custom domain and an A-type DNS record may still have some issues. We are still working on resolving this last issue.
» Updated Mon, 04 Apr 2022 22:40:00 +0000
Update

The recent blip was due to an intermittent issue with one of our IPs. Applications hosted on the osc-fr1 region are reachable again. Our team is working hard on stabilizing the situation.
» Updated Mon, 04 Apr 2022 22:19:00 +0000
Update

Our team has been notified of new issues reaching applications hosted on the osc-fr1 region. We are on it.
» Updated Mon, 04 Apr 2022 22:13:00 +0000
Update

We are still in the process of restarting all applications that are down. All the other Scalingo services (deployments, database hosting, etc.) are back online.
» Updated Mon, 04 Apr 2022 21:54:00 +0000
Update

Most application hosting nodes have network access again. This means most applications are reachable again. Our team is still working on restoring the full service.
» Updated Mon, 04 Apr 2022 21:16:00 +0000
Update

We successfully reduced the network load and recovered part of our infrastructure. Our team is working on restoring the service for all our customers.
» Updated Mon, 04 Apr 2022 21:01:00 +0000
Update

Our team is still working hand in hand with our infrastructure provider to resolve the issue.
» Updated Mon, 04 Apr 2022 20:47:00 +0000
Update

Our team is still investigating the root cause of this incident. A part of our infrastructure is under a heavy network load. We are working on a way to improve the situation.
» Updated Mon, 04 Apr 2022 20:05:00 +0000
Update

We cannot provide you with any ETA for now. Rest assured that the whole team is working on fixing this issue.
» Updated Mon, 04 Apr 2022 19:35:00 +0000
Update

Our team is in touch with our infrastructure provider to determine the root cause of this incident.
» Updated Mon, 04 Apr 2022 19:12:00 +0000
Investigating

Our team has been notified that apps hosted on osc-fr1 are unreachable. We are working on it.
» Updated Mon, 04 Apr 2022 18:42:00 +0000

Scalingo Status
