We would like to discuss the recent increase in server outages and clarify what we know so far.
First of all, there is no single definitive root cause, and unfortunately we can’t give you guarantees about the stability of the system. What we do know is that it’s a software issue.
Over the past one to two months we have identified two, possibly three, distinct issues that are currently being investigated. One of them involves the server CPU reaching its limits at seemingly random times, though it happens more often during peak activity. More recently, we have also observed a new type of issue with a different technical signature.
We are investigating all of these. So far, much of the work has focused on adding proper telemetry and diagnostic tooling to our infrastructure. Until recently, the server's stability was satisfactory, which meant we had little immediate need for deep diagnostics and therefore limited visibility when these problems began to appear.
At this point we have not yet identified a single root cause responsible for the crashes. The outages seem to be triggered by high player counts, which cause a series of different failures to spread across the system without any one cause standing out. With better observability we are identifying new bugs and resolving them one by one.
Fingers crossed, and thank you for your patience. :fingers_crossed:
No. Game servers and lobby servers are separate. Once a game has started, it will not be interrupted by a lobby server outage.
Based on our analysis so far, this seems unlikely. That said, we cannot fully rule it out yet.
No. In fact, they are not. We use standard, professional hosting services.
We would benefit most directly from people with hands-on experience running Erlang/Elixir applications in production, debugging them, and knowing what to integrate for better visibility. If you have other related skills, head to https://beyond-all-reason.github.io/infrastructure/contributing/.
No. It is worth noting that all of the above refers to our currently used, legacy infrastructure. In the longer term, the biggest improvement will come from shipping the new client together with Tachyon, which simplifies the overall system architecture. This is not a short-term fix, but it represents the largest long-term payoff in terms of stability and maintainability. Balancing the effort we spend on the parts of the code base we intend to deprecate against investing in the new architecture is challenging.
Okay, okay, you want some technical details of what we have fixed so far, I guess:

