Linux Supercomputing Dominance: A Look Under the Hood


A few weeks ago, the Top500 Supercomputer list came out, as it does each November. As expected, Linux is still the most used OS for supercomputing, as it has been since taking the list by storm in the early 2000s.

While this is certainly a feel-good thing to write about (who doesn’t love being #1?), it’s worth taking some time to think about how this happened, and why. Linux’s rise to dominance at the forefront of modern science was no accident.

One of the main reasons it has done so well is that, quite frankly, supercomputing is somewhat weird computing. Supercomputers are dramatically different from the average PC or server, even the supers that are x86-based. This is due in large part to the way the systems are connected together and pass data around: exotic switching infrastructures, stringent requirements for efficient message passing, rapid dispersal of data among many nodes, and so on.
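To make the idea of dispersing data among many nodes a little more concrete, here is a minimal sketch of the scatter, compute, and reduce pattern that message-passing systems are built around. It is only an illustration under stated assumptions: threads stand in for compute nodes, and the names scatter, node_work, and run_job are hypothetical, not taken from any real supercomputing library.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, n_nodes):
    """Split data into n_nodes contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_nodes)
    chunks, start = [], 0
    for rank in range(n_nodes):
        # The first `extra` ranks take one additional element each.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def node_work(chunk):
    """Hypothetical per-node computation: a partial sum of squares."""
    return sum(x * x for x in chunk)


def run_job(data, n_nodes=4):
    """Scatter the data, compute on each simulated node, then reduce the partial results."""
    chunks = scatter(data, n_nodes)
    # Assumption: worker threads stand in for compute nodes on a real interconnect.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_work, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(run_job(list(range(10))))
```

On a real machine the scatter and reduce steps cross the interconnect rather than shared memory, which is why the switching fabric and the efficiency of message passing matter so much.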

The net result is that something has to run these one-of-a-kind, highly specialized environments. One of the reasons Linux has performed so well in this space is that anyone is able to adapt, optimize, and tune it according to the latest and greatest system design thinking.

We also have academia to thank for this. It certainly does not hurt that when academic research is being done into optimal system design, proofs of concept are almost always built on Linux. Why? Because it’s easier to prototype (you have all the source code) and there are no restrictions on publishing the source code in a paper (tenure matters). For the system designers, it’s easier to apply the latest academic thinking if you work in the same environment as the researchers.

Finally, many supercomputers are designed so that you can use a normal system interface when you are coding and submitting a job, and then the system farms the real work out to super-optimized compute nodes. Whereas a typical x86 server has 10-30% system utilization, a compute node must function as close to maximum capacity as possible. We’ve heard (jokingly) that even an operating system is too much overhead, which basically means a compute node needs to be optimized to the extreme.

Linux works well for these compute nodes because it is so flexible, and because it’s actually fairly common to strip the kernel down into a hyper-optimized package – the mobile and embedded space being a great example. It is the only OS with a single kernel that runs on both the tiniest and the largest systems in the world.

We also recently had a question: why has Linux done so well on the list while Unix and Windows are doing so poorly?

Linux has completely dominated Unix and Windows in the supercomputer market for three reasons: what the server vendors want to sell, what their clients want to buy, and what the system does.

The conventional server selling process is basically an assembly line – you figure out what features most of your clients need, develop an OS and server accordingly, and sell them something that’s mostly optimized for most of their workloads. You try to make your money by selling a lot of them as fast as possible, and without having to do much customization to keep your customers happy.

Supercomputers are an entirely different breed, which means they don’t fit well into this mold. More often than not they are custom designed, very expensive, sell in low volumes, and must be deeply tuned for the specific workload. It’s very hard for the proprietary OS vendors to keep up, because they are limited by a fixed number of developers who are focused on their core server market – and more importantly, nobody but their own developers can get their product out the door.

With Linux, though, anyone can see (and optimize) the source code, and scalability work done by one company can be extremely beneficial for others – after all, ~90% of the kernel is architecture agnostic. It’s just plain faster to build a price competitive machine with Linux, and when selling a supercomputer, the time to first boot is critically intertwined with profitability.

The second reason is the skills of the people who will be using the system. Most supercomputers are deployed to academia or government labs. The customers (professors, students, researchers, engineers) don’t want to spend time learning something different before getting down to science. The fresh talent all comes in knowing Linux. Unix just isn’t a marketable skill in research anymore, and the choice of OS reflects this.

Finally, Linux evolves much more rapidly than any other OS, and diversifies faster. No other operating system has jumped to new platforms, new workloads, and new architectures so quickly. As a result, Linux tends to get new features and functionality sooner than the other options, on a wider variety of platforms. In the case of supercomputers, this could be support for a new interconnect, advancements in extreme scalability, or any other feature that tends to appeal to the ultra high end of the computing market.

The point is, Linux is where it is in supercomputing for a reason. Congratulations, Tux, you’ve earned it.