» Published on the 4th of April
osc-fr1
Monday, April 4th, 2022 at 20h36, Scalingo suffered a Denial of Service (DoS) attack coming from inside our infrastructure. We have since put a first layer of protection in place to prevent such events from happening again, and we are working with our infrastructure provider on better long-term resilience against these attacks. It is important to note that this incident only impacted the network availability of your apps; the apps themselves, and especially your data, remained secure throughout the incident.
All times given are in CEST (UTC+2).
On osc-secnum-fr1, no applications or databases were impacted. However, our dashboard and APIs were unusable for the duration of the incident.
On osc-fr1, applications were unreachable for at least 2h50 (and at most 4h04).
There was no impact on data integrity or confidentiality.
Our status page https://scalingostatus.com was updated regularly throughout the day.
We answered every message that came through Intercom, whether via the in-app chat or our support email [email protected].
Our Twitter account @ScalingoHQ posted about the major stages of the incident.
Specific information was shared directly with some customers and with anyone who asked, regardless of the channel.
The root cause was the saturation of a component managing the region's network. This was possible because our rate limits were not strict enough. Those limits have been tightened to prevent such an event from happening again. In addition, we have improved our monitoring so that we are notified of this type of abuse much more quickly.
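The post does not detail how those rate limits are enforced. As a purely illustrative sketch, here is a minimal per-tenant token-bucket limiter in Go built on golang.org/x/time/rate; the tenant key, the 100 req/s sustained rate, and the burst of 50 are hypothetical values, not Scalingo's actual configuration.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// tenantLimiter keeps one token-bucket limiter per tenant so that a single
// abusive account exhausts its own budget instead of a shared one.
type tenantLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit // sustained requests per second allowed per tenant
	burst    int        // short burst tolerated on top of the sustained rate
}

func newTenantLimiter(rps rate.Limit, burst int) *tenantLimiter {
	return &tenantLimiter{
		limiters: make(map[string]*rate.Limiter),
		rps:      rps,
		burst:    burst,
	}
}

// allow reports whether the given tenant may perform one more request now.
func (t *tenantLimiter) allow(tenant string) bool {
	t.mu.Lock()
	l, ok := t.limiters[tenant]
	if !ok {
		l = rate.NewLimiter(t.rps, t.burst)
		t.limiters[tenant] = l
	}
	t.mu.Unlock()
	return l.Allow()
}

func main() {
	// Hypothetical limits: 100 req/s sustained, bursts of up to 50.
	tl := newTenantLimiter(100, 50)
	for i := 0; i < 60; i++ {
		if !tl.allow("tenant-42") {
			fmt.Printf("request %d rejected: tenant over quota\n", i)
		}
	}
}
```

A per-tenant bucket like this caps how fast any single account can hit a shared component, so one abusive user runs out of tokens instead of saturating the region's network.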
The fact that we were blind during the incident significantly slowed our response. Some projects to make us more resilient to this kind of issue were already on the roadmap; their priority has been raised.
We are also in discussion with our network provider to speed up incident response, and we have opened a conversation on how to improve the resiliency of the VPC.
Last but not least, it became clear that Scalingo needs to enforce stronger identity verification and stricter default quotas until a customer's identity has been verified. This would help detect malicious users earlier and prevent them from abusing the platform.
We offer a 99.9% SLA for applications scaled to at least 2 containers and a 99.96% SLA for databases on a Business plan.
We are fully aware that the downtime that occurred on April 4th heavily impacted this commitment.
Therefore, all customers meeting these criteria should contact our support team to receive the appropriate financial compensation.
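To put the outage in perspective against that commitment: assuming a 30-day monthly measurement window (the post does not state the exact window), a 99.9% SLA allows

\[
(1 - 0.999) \times 30 \times 24 \times 60\,\text{min} = 43.2\,\text{min}
\]

of downtime per month, while the observed outage of at least 2h50 (170 minutes) is roughly four times that budget.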
» Updated: All applications are back online on the osc-fr1 region. The incident is considered over on our side. Please contact our support if you are still experiencing trouble with your applications.
» Updated: All applications have been restarted, and all Scalingo services are restored. Applications with a custom domain using an A-type DNS record may still have issues; we are working on resolving this last point.
» Updated: The recent blip was caused by an intermittent issue with one of our IPs. Applications hosted on the osc-fr1 region are reachable again. Our team is working hard on stabilizing the situation.
» Updated: Our team has been notified of new issues reaching applications hosted on the osc-fr1 region. We are looking into it.
» Updated: We are still in the process of restarting all the down applications. All other Scalingo services (deployments, database hosting, etc.) are back online.
» Updated: Most application hosting nodes have network access again, which means most applications are reachable. Our team is still working on restoring full service.
» Updated: We successfully reduced the network load and recovered part of our infrastructure. Our team is working on restoring service for all our customers.
» Updated: Our team is still working hand in hand with our infrastructure provider to resolve the issue.
» Updated: Our team is still investigating the root cause of this incident. Part of our infrastructure is under heavy network load. We are working on a way to improve the situation.
» Updated: We cannot provide an ETA for now. Be assured that the whole team is working on fixing this issue.
» Updated: Our team is in touch with our infrastructure provider to determine the root cause of this incident.
» Updated: Our team has been notified that apps hosted on osc-fr1 are unreachable. We are working on it.