The Linux kernel contains a lot of code to support Xen. This code isn’t meant just to optimize Linux to run as a virtualized guest. As a type 1 hypervisor, Xen relies heavily on the support of the operating system running as dom0. Although other operating systems can be used as dom0, Linux is the most popular choice, due to its widespread use and for historical reasons (Linux was chosen as dom0 in the first Xen implementation). Given this, a lot of the work of adding new functionality to Xen is done in the Linux kernel.
In this article, I’ll cover some highlights of the Xen-related work that has been done in the past year and what’s expected in the near future, as well as a few best practices learned along the way. This post will be helpful for anyone interested in Xen Project technology and its impact on the Linux kernel.
History of Xen support in the Linux kernel
When the Xen Project was released in 2003, it was using a heavily modified Linux kernel as dom0. Over the years, a lot of effort has gone into merging those modifications into the official Linux kernel code base. And, in 2011, this goal was achieved.
However, because some distributions, like SUSE’s SLE, had included Xen support for quite some time, they had built up another pile of patches for optimizing the Linux kernel to run as dom0 and as a Xen guest. For the past three years, it has been my job to try to merge those patches into the upstream Linux kernel. With Linux kernel 4.4, we finally made it possible to use the upstream kernel, without any Xen-specific patches, as the base for SLE.
The large number of patches needed in the Linux kernel stems from the original design goal of Xen. Xen was introduced at a time when x86 processors had no special virtualization features, and it tried to establish an interface making it possible to run completely isolated guests on x86 with bare-metal-like performance.
This was possible only by using paravirtualization. Instead of trying to emulate the privileged instructions of the x86 processor, Xen-enabled guests had to be modified to avoid those privileged instructions and call into the hypervisor whenever a privileged operation was unavoidable. This, of course, had a large impact on the low-level parts of the operating system, leading to the large number of patches. Basically, the Linux kernel had to support a new architecture.
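To illustrate the principle, here is a deliberately simplified, stand-alone C sketch. The struct and function names are hypothetical; the real kernel implements this with its pvops structures and boot-time patching of call sites. Privileged operations go through a table of function pointers that is filled with native code on bare metal and with hypercall wrappers in a Xen PV guest:

```c
/* Simplified, hypothetical sketch of the paravirtualization idea. */
#include <stdio.h>

struct pv_ops_demo {
	void (*write_cr3)(unsigned long pgd);	/* switch to a new page-table root */
	void (*halt)(void);			/* idle the CPU */
};

/* Bare metal: simply execute the privileged instruction. */
static void native_write_cr3(unsigned long pgd)
{
	printf("native: load %%cr3 with %#lx directly\n", pgd);
}

static void native_halt(void)
{
	printf("native: execute hlt\n");
}

/* PV guest: ask the hypervisor to perform the operation on our behalf. */
static void xen_write_cr3(unsigned long pgd)
{
	printf("xen: hypercall to switch to page table %#lx\n", pgd);
}

static void xen_halt(void)
{
	printf("xen: hypercall to block this vCPU\n");
}

static struct pv_ops_demo pv_ops_demo = {
	.write_cr3 = native_write_cr3,
	.halt      = native_halt,
};

int main(void)
{
	int running_on_xen = 1;		/* assume Xen was detected at early boot */

	if (running_on_xen) {
		pv_ops_demo.write_cr3 = xen_write_cr3;
		pv_ops_demo.halt      = xen_halt;
	}

	/* All later code calls through the table, unaware of how it is backed. */
	pv_ops_demo.write_cr3(0x1000UL);
	pv_ops_demo.halt();
	return 0;
}
```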
Although they still have some advantages over fully virtualized guests for some workloads, paravirtualized guests are a little problematic from the kernel’s point of view:
- The pvops framework they require limits the performance of the same kernel when running on bare metal.
- Introducing new features that touch this framework is more complicated than it should be.
With virtualization support having been available in x86 processors for many years now, there is an ongoing campaign to move away from paravirtualized domains to hardware-virtualized ones. To get rid of paravirtualized guests completely, a new guest mode is needed: PVH. Basically, PVH mode is like a fully virtualized guest but without emulation of legacy features like the BIOS. Many legacy features of fully virtualized guests are emulated via a qemu process running in dom0; dropping those legacy features avoids the need for that qemu process.
Full support of PVH will enable dom0 to run in this mode. dom0 itself can’t run fully virtualized, because that would require the legacy emulation that a qemu process in dom0 provides for an ordinary guest, which raises a chicken-and-egg problem. More on PVH support and its problems will be discussed later.
Last Year with Xen and the Linux Kernel
So, what has happened in the Linux kernel regarding Xen in the last year? Apart from the usual bug fixes, small tweaks, and adaptations to changed kernel interfaces, the main work has happened in the following areas:
- PVH: After a first test implementation of PVH, the basic design was changed to use the fully virtualized interface as a starting point and to avoid the legacy features. This has led to a clean model requiring only a very small boot prologue that sets some indicators for avoiding the legacy features later on. The old PVH implementation was removed from the kernel and the new one was introduced. This enables the Linux kernel to run as a PVH guest on top of Xen. dom0 PVH support isn’t complete yet, but we are making progress.
- Restructuring to be able to configure a kernel with Xen support but without paravirtualized guest support: This can be viewed as a first step toward finally getting rid of a major part of the pvops framework (see the sketch after this list). Today, such a kernel can run as a PVH or fully virtualized guest (with some paravirtualized interfaces, like paravirtualized devices), but not yet as dom0.
- ARM support: There has been significant work on Xen for ARM (both 32- and 64-bit platforms), for example, support for guests with a different page size than dom0.
- New paravirtualized devices: New frontend/backend drivers have been introduced or are in the process of being introduced, such as PV-9pfs and a PV socket implementation.
- Performance of guests and dom0: This has been my primary area of work over the past year. Below, I’ll highlight two examples along with some background information.
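To make the restructuring item above a bit more concrete, here is a hypothetical, stand-alone sketch, not actual kernel code; in real kernels the split is controlled by the CONFIG_XEN_PV Kconfig option together with runtime checks like xen_domain() and xen_pv_domain(). The point is that PV-only setup can be compiled out entirely, while the paravirtualized interfaces used by PVH and fully virtualized guests stay in:

```c
/*
 * Hypothetical sketch, not actual kernel code.  Build it stand-alone with
 *   cc -DCONFIG_XEN_PV demo.c    -> PV guest support included
 *   cc demo.c                    -> PV guest support compiled out
 */
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the kernel's runtime checks (xen_domain(), xen_pv_domain()). */
static bool running_on_xen = true;
static bool running_as_pv_guest = false;

/* Interfaces shared by PVH and fully virtualized guests. */
static void setup_shared_pv_interfaces(void)
{
	puts("setting up event channels, grant tables, and PV device frontends");
}

#ifdef CONFIG_XEN_PV
/* Only compiled in when PV guest support is configured. */
static void setup_pv_guest(void)
{
	puts("installing pvops replacements for privileged instructions");
}
#endif

int main(void)
{
	if (!running_on_xen)
		return 0;			/* bare metal: nothing Xen-specific to do */

#ifdef CONFIG_XEN_PV
	if (running_as_pv_guest) {
		setup_pv_guest();
		return 0;
	}
#endif

	setup_shared_pv_interfaces();		/* PVH or fully virtualized guest */
	return 0;
}
```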
Restructuring of the xenbus driver
As a type 1 hypervisor, Xen has a big advantage over a type 2 hypervisor: It is much smaller; thus, the probability of the complete system failing due to a software error is lower. This, however, only holds as long as no other component is a single point of failure, which dom0 is today.
Given this, I’m trying to add features to Xen that disaggregate it into redundant components by moving essential services into independent guests (e.g., driver domains containing the backends of paravirtualized devices).
One such service running in dom0 today is the Xenstore. Xenstore is designed to handle multiple outstanding requests. It is possible to run it in a “xenstore domain” independent of dom0, but until now this configuration wasn’t optimized for performance.
The performance bottleneck was the xenbus driver, which is responsible for communication with a Xenstore running in another domain (with Xenstore running as a dom0 daemon, this driver is used only by guest domains and by the dom0 kernel when accessing Xenstore). The xenbus driver could handle only one Xenstore access at a time. This is a major bottleneck because, during domain creation, multiple processes are often trying to access Xenstore at the same time. This was fixed by restructuring the xenbus driver to allow multiple outstanding requests to Xenstore without them blocking each other more than necessary.
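The following stand-alone sketch illustrates the general pattern behind such a change; it is simplified and hypothetical, not the actual xenbus code. Every request carries an id that is echoed back in the response, so several requests can be in flight at once and each reply is matched to its requester regardless of ordering:

```c
/*
 * Simplified, hypothetical sketch of handling multiple outstanding
 * Xenstore requests (not the actual xenbus driver code).
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_INFLIGHT 16

struct xs_request {
	uint32_t id;		/* sent in the request header, echoed in the reply */
	int      in_use;
	int      done;
	char     reply[64];	/* filled in when the matching response arrives */
};

static struct xs_request inflight[MAX_INFLIGHT];
static uint32_t next_id;

/* Allocate a slot and "send" a request; many requests can be pending at once. */
static struct xs_request *xs_send(const char *path)
{
	for (int i = 0; i < MAX_INFLIGHT; i++) {
		if (!inflight[i].in_use) {
			inflight[i] = (struct xs_request){ .id = next_id++, .in_use = 1 };
			printf("sent request %u: read %s\n", (unsigned)inflight[i].id, path);
			return &inflight[i];
		}
	}
	return NULL;	/* no free slot: only this caller waits, others are unaffected */
}

/* Called when a response arrives; match it to its pending request by id. */
static void xs_receive(uint32_t id, const char *payload)
{
	for (int i = 0; i < MAX_INFLIGHT; i++) {
		if (inflight[i].in_use && inflight[i].id == id) {
			snprintf(inflight[i].reply, sizeof(inflight[i].reply), "%s", payload);
			inflight[i].done = 1;
			return;
		}
	}
}

int main(void)
{
	struct xs_request *a = xs_send("domid");
	struct xs_request *b = xs_send("name");

	/* Responses may arrive in any order without blocking each other. */
	xs_receive(b->id, "guest-1");
	xs_receive(a->id, "1");

	printf("request %u -> %s, request %u -> %s\n",
	       (unsigned)a->id, a->reply, (unsigned)b->id, b->reply);
	return 0;
}
```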
Finding and repairing a performance regression of fully virtualized domains
This problem kept me busy for the past three weeks. Some tests comparing the performance of fully virtualized guests running a recent kernel against a rather old one (from the pre-pvops era) showed that several benchmarks performed very poorly on the new kernel. Fortunately, the tests were very easy to set up and the problem could be reproduced easily; for example, a single munmap() call for an 8 kB memory area took twice as long on the new kernel as on the old one.
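A reproducer of that kind is only a few lines of C; the program below is a sketch along those lines (not the original test), timing a single munmap() of an 8 kB anonymous mapping:

```c
/* Minimal sketch of such a reproducer (not the original benchmark). */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	const size_t len = 8 * 1024;	/* 8 kB mapping, as in the failing test */
	struct timespec t0, t1;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	((volatile char *)p)[0] = 1;	/* touch the page so it is actually populated */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	munmap(p, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("munmap(8kB) took %ld ns\n",
	       (long)((t1.tv_sec - t0.tv_sec) * 1000000000L +
		      (t1.tv_nsec - t0.tv_nsec)));
	return 0;
}
```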
So, as a kernel developer, the first thing I tried was bisecting. Knowing the old and the new kernel versions, I knew Git would help me find the commit that made the performance bad. The git bisect process is very easy: you tell Git the last known good version and the first known bad version, and it will interactively do a binary search until the offending commit has been found.
At each iteration step, you test and tell Git whether the result was good or bad. In the end, I had a rather disturbing result: the commit meant to enhance performance was to blame. And at the time the patch was written (some years ago), it had been shown to really increase performance.
The patch in question introduced some more paravirtualized features for fully virtualized domains. So, the next thing I tried was disabling all paravirtualized features (this is easily doable via a boot parameter of the guest). Performance was up again. Well, for the munmap() call, but not for the rest (e.g., I/O handling). The overall performance of a fully virtualized guest without any paravirtualization feature enabled is disgusting, due to the full emulation of all I/O devices including the platform chipset. So, the only thing I had learned was that one of the enabled paravirtualization features was making munmap() slow.
I tried modifying the kernel to be able to disable the various paravirtualized features one at a time, hoping to find the one to blame. I suspected PV time handling to be the culprit, but didn’t have any success: neither PV timers, nor the PV clocksource, nor PV spinlocks were to blame.
Next idea: use ftrace to get timestamps of all the kernel functions called during the munmap() call. Comparing the traces of a run with PV features enabled and a run without them should show the part of the kernel to blame. The result was again rather odd; the time seemed to be lost gradually over the complete trace.
With perf, I was finally able to find the problem: it showed a major increase in TLB misses with the PV features enabled. It turned out that enabling PV features requires mapping a Xen memory page into guest memory. The way this was done in the kernel required the hypervisor to split a large page mapping into many small pages. Unfortunately, that large page contained the main kernel page tables, which are accessed all the time (e.g., when executing kernel code).
Moving the mapping of the Xen page into an area already mapped via small pages solved the problem.
What’s to Come
The main topics for the near future will be:
- PVH dom0 support: Some features, like PCI passthrough, are still missing. Another very hot topic for PVH dom0 support will be performance. Some early tests using a FreeBSD kernel able to run as a PVH dom0 indicate that creating domains from a PVH kernel will be much slower than from a PV kernel. The reason is the huge number of hypercalls needed for domain creation: calling the hypervisor from PVH is an order of magnitude slower than from PV (the difference between the VMEXIT/VMENTER and INT/IRET execution times of the processor). I already have some ideas on how to address this problem, but they would require some hypervisor modifications. Another performance problem is backend operation, which again suffers from hypercalls being much slower on PVH. Again, a possible solution could be an appropriate hypervisor modification.
- There are several enhancements regarding PV devices (sound, multi-touch devices, virtual displays) in the pipeline. These will be needed for a project using Xen as the base for automotive IT.
This topic will be discussed during the Xen Project Developer and Design Summit happening in Budapest, Hungary from July 11 to 13. Register for the conference today.