All Systems are Online


Event History

April 2022

Deployments may be slower than usual

» View Event Details | Created Wed, 27 Apr 2022 08:48:00 +0000

Resolved The situation has been stable for the last hour. You shouldn't face any issues with your deployments. Please contact our support team if you still experience slowness during your deployments.
Posted: Wed, 27 Apr 2022 13:24:00 +0000

[osc-fr1] Apps are unreachable

» View Event Details | Created Mon, 04 Apr 2022 18:42:00 +0000

Post-Mortem

# Incident Report: Platform Unavailability on Region `osc-fr1` on the 4th of April

## TL;DR

On Monday, April 4th, 2022 at 20h36, Scalingo suffered a Denial of Service (DoS) attack originating from inside our infrastructure. We have put a first layer of protection in place to prevent such events from happening again, and we are working with our infrastructure provider on better long-term resiliency against these attacks. It is important to note that this incident only impacted the network availability of your apps; the apps themselves, and especially your data, remained secure throughout the incident.

## Timeline of the incident

All times given are in CEST (UTC+2).

- 20:08 A malicious user deploys a large number of applications on the platform.
- 20:36 All those apps start generating a lot of network traffic.
- 20:36 A component in charge of managing our network (VPC) is overwhelmed by this increase in traffic and starts dropping packets.
- 20:42 We receive our first alert: applications deployed on the osc-fr1 region are unavailable. The first on-call responder is on deck.
- 20:44 The first responder declares a major incident on the platform; more engineers are called in to assist with diagnostics and incident resolution.
- 20:46 There is no way to contact the region. The incident appears to impact all of our public IPs as well as the private IPs used for platform administration. The region is effectively cut off from the outside world.
- 20:47 The incident is escalated to our infrastructure provider. They are not aware of any ongoing global incident on their side and start investigating our resources specifically.
- 21:42 Our infrastructure provider finds the cause: the component in charge of managing our network is overloaded. It is receiving far more traffic than it can handle, but the source of that traffic is still unknown.
- 22:04 Our operators theorize that it could be an external attack and stop our load balancers to regain access through our administration interface. We completely stop the load balancers receiving ingress traffic, but no impact on availability is detected; the hypothesis is wrong.
- 22:47 A second hypothesis is tested: the network saturation is coming from inside. After getting the list of servers from our infrastructure provider, we immediately stop them. The situation improves immediately and we regain access to our administration tools.
- 22:59 We restart our load balancers on the osc-fr1 region. About 70% of the apps are immediately back online and running. The remaining ~30% are on the instances we stopped and have to be restarted.
- 23:24 We notice unusual behavior on two of our public IPs (5.104.101.30 and 109.232.236.90) and exclude them from our DNS pools when needed, giving them time to recover standard behavior.
- 00:01 The majority (> 98%) of the apps are up and running; the last apps are being diagnosed case by case by our operators.
- 00:40 All apps are now back up and running.
- 01:02 We find the root cause of the issue with our public IPs: a bug in our infrastructure provider's networking layer. A workaround is deployed and our IPs are back up and running.
- 01:09 All of our post-incident checks are green; the incident is declared over.

## Impact

On `osc-secnum-fr1`, none of the applications or databases were impacted. However, our dashboard and APIs were not usable for the duration of the incident. On `osc-fr1`, applications were unreachable for at least 2h50 (and at most 4h04). There was no impact on data integrity or confidentiality.

## Communication

Our status page [https://scalingostatus.com](https://scalingostatus.com/) was updated regularly during the incident. We answered all messages coming through Intercom, either via the in-app chat or through our support email [email protected]. Our Twitter account [@ScalingoHQ](https://twitter.com/ScalingoHQ) posted about the major parts of the incident. Specific information was sent personally to some customers and to anyone who asked, whatever the channel.

## Analysis and Actions Taken

The root cause was the saturation of a component managing the network of the region. This was possible because our rate limits were not strict enough. Those rate limits have been tightened to prevent such an event from happening again. In addition, monitoring improvements have been made so that we are notified much more quickly of this type of abuse.

The fact that we were blind during the incident significantly slowed our incident response. Projects were already on the roadmap to make us more resilient to this kind of issue; their priority has been raised to address it. We are also in discussion with our network provider to speed up incident response, and we have opened a discussion on how to improve the resiliency of the VPC.

Last but not least, it became clear that Scalingo needs to enforce stronger identity verification and stricter default quotas until the identity of a customer has been verified. This would help detect malicious users earlier and prevent them from abusing the platform.

## Service Level Agreement Consideration

We offer a 99.9% SLA for applications scaled to at least 2 containers and 99.96% for databases using a Business plan. We are fully aware that the downtime which occurred on April 4th has heavily impacted this commitment. All customers who meet these criteria should contact support to get the appropriate financial compensation.
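For context on how heavily a multi-hour outage weighs against these availability targets, here is a minimal sketch (not part of the original report; it only assumes a 30-day month and uses the 2h50-4h04 outage bounds from the Impact section) that computes the monthly downtime budget for each SLA and compares it to the observed outage window.

```python
# Minimal sketch: compare monthly availability targets with the observed outage.
# Assumption: a 30-day month; the 2h50-4h04 window comes from the Impact section.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def downtime_budget(sla: float) -> float:
    """Allowed downtime (in minutes) per 30-day month for a given SLA."""
    return MINUTES_PER_MONTH * (1 - sla)

observed_min = 2 * 60 + 50   # 170 minutes (lower bound of the outage)
observed_max = 4 * 60 + 4    # 244 minutes (upper bound of the outage)

for sla in (0.999, 0.9996):
    budget = downtime_budget(sla)
    print(f"SLA {sla:.2%}: budget {budget:.1f} min/month, "
          f"observed {observed_min}-{observed_max} min")

# SLA 99.90%: budget 43.2 min/month, observed 170-244 min
# SLA 99.96%: budget 17.3 min/month, observed 170-244 min
```

Either bound of the outage exceeds the monthly budget of both targets by a wide margin, which is why the report invites eligible customers to contact support for compensation.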
Posted: Thu, 07 Apr 2022 12:48:00 +0000

[osc-fr1] A server hosting databases is misbehaving

» View Event Details | Created Sat, 02 Apr 2022 00:07:00 +0000

Resolved All databases are back up and running.
Posted: Sat, 02 Apr 2022 00:52:00 +0000

[Maintenance] [osc-fr1] Planned maintenance on our front servers

» View Event Details | Created Wed, 20 Apr 2022 18:00:00 +0000

Resolved Maintenance is now complete. Everything looks good on our side. If you encounter any problems, please contact us via support.
Posted: Wed, 20 Apr 2022 20:13:00 +0000
