I/O Architecture
In my earlier post, I discussed some of the main concepts, challenges and techniques that are essential for virtualizing a CPU or instruction set. To a large degree, these concepts and techniques are not new – they are really just variations on a theme that has been around for a long time in computer science. Virtualization is fundamentally about preserving the appearance of isolating resources, while actually sharing the resources harmoniously and efficiently. Perhaps the most obvious example was virtual memory – I can still remember being excited about virtual memory when I upgraded to System 7 on an Apple Quadra! Tricks with virtual memory would let you run applications without buying new RAM (although the loading performance was often horrible as a result…at least it ran).
One of the most difficult and complicated areas of a modern PC is the I/O architecture. The I/O architecture governs how the processor and memory talk to the devices that make a computer interesting: keyboards, mice, network cards, GPUs and hard drives. It’s essential to remember that I/O really is a first-class priority of a computer, even though, from a CPU architect’s point of view, I/O moves at a glacial speed.
Modern CPUs operate at around 3GHz, so a single cycle is only 0.33ns. In comparison, reading data from a disk takes around 5ms – or 15 million cycles! The engineering maxim of ‘make the common case fast, and the uncommon case correct’ is sometimes erroneously simplified into ‘ignore the uncommon case’ – and when I/O only happens every 15 million cycles, it’s pretty uncommon.
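The arithmetic behind those figures is easy to check. The snippet below is only a back-of-the-envelope sketch using the round numbers quoted above (3GHz, 5ms); the function names are made up for illustration.

```python
# Round figures taken from the text: a ~3GHz core and a ~5ms disk access.
CLOCK_HZ = 3_000_000_000
DISK_ACCESS_US = 5_000  # 5 ms expressed in microseconds


def cycle_time_ns(clock_hz: int) -> float:
    """Duration of one clock cycle, in nanoseconds."""
    return 1e9 / clock_hz


def cycles_during(latency_us: int, clock_hz: int) -> int:
    """Clock cycles that elapse during a latency given in microseconds."""
    return clock_hz * latency_us // 1_000_000


if __name__ == "__main__":
    print(f"cycle time: {cycle_time_ns(CLOCK_HZ):.2f} ns")
    print(f"cycles per disk access: {cycles_during(DISK_ACCESS_US, CLOCK_HZ):,}")
```

Integer arithmetic is used for the cycle count so that the result is exact rather than subject to floating-point rounding.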
While I/O may be uncommon, it is essential, not only because it’s how the user communicates with the system, but also because it is easily overlooked, and a lot of problems crop up in the I/O architecture as a result. For example, one of the chief causes of blue screens of death in Windows 95 was not the I/O device architecture itself but device drivers. While Windows 95 did a great job of isolating each process in memory and preventing them from interfering with the OS, it gave device drivers access to everything in memory. So a rogue device driver could overwrite not only data belonging to other processes, but even the operating system kernel itself. In retrospect that was a horrible idea, and it is one of the reasons for the new device driver model in later versions of Windows.
The ideal solution for dealing with I/O is actually quite simple, but unfortunately expensive. Ideally, all I/O devices would be fully cache coherent and use virtual memory, just like another CPU. In prior articles, I have explored cache coherency quite a bit (http://www.realworldtech.com/page.cfm?ArticleID=RWT121106171654&p=1 and http://www.realworldtech.com/page.cfm?ArticleID=RWT082807020032), and it’s probably worth brushing up on the topic. The key is that coherency and virtual memory create a very natural way to provide protection and sharing (the page tables typically protect memory areas and specify shared memory spaces). Unfortunately, the hardware cost of making the south bridge of the chipset coherent in the early 1990s was absolutely prohibitive (especially when you consider that Intel’s Pentium required external chips for multi-processor cache coherency).
Fast forward to 2008, and I/O virtualization is really about providing some of the benefits of cache coherency and virtual memory to I/O devices, since the cost is more reasonable now than in 1990, thanks to Moore’s Law.
Both Intel and AMD have announced I/O virtualization efforts that focus on bringing I/O devices into the PC virtual memory system and providing isolation (no sharing yet, unfortunately). To review, modern PCs with CPU virtualization have two separate address spaces: host physical and guest physical. Host physical is the new address space created for hypervisors in both Intel’s VT and AMD’s SVM, while guest physical is what traditional guest OSes (e.g. Windows XP) access. In a normal PC without virtualization, guest physical corresponds to real hardware. In a PC with virtualization, guest physical must be translated by the CPU and hypervisor to host physical, which is what corresponds to real hardware.
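To make the two address spaces concrete, here is a minimal sketch of a per-guest mapping from guest physical to host physical addresses. It assumes 4KB pages and a flat dictionary per guest; the class and method names are hypothetical, and real hardware uses multi-level page tables rather than anything this simple.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class GuestMemoryMap:
    """Hypothetical per-guest table: guest physical page -> host physical page."""

    def __init__(self):
        self.pages = {}

    def map_page(self, guest_page: int, host_page: int) -> None:
        self.pages[guest_page] = host_page

    def to_host_physical(self, guest_physical: int) -> int:
        """Translate a guest physical address into a host physical address."""
        page, offset = divmod(guest_physical, PAGE_SIZE)
        if page not in self.pages:
            raise LookupError(f"guest page {page} is not mapped")
        return self.pages[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    guest_a, guest_b = GuestMemoryMap(), GuestMemoryMap()
    guest_a.map_page(0, 10)  # guest A's page 0 lives in host page 10
    guest_b.map_page(0, 20)  # guest B's page 0 lives in host page 20
    # The same guest physical address lands in different host memory.
    print(hex(guest_a.to_host_physical(0x123)))
    print(hex(guest_b.to_host_physical(0x123)))
```

Each guest believes it owns physical address 0x123, yet the two accesses never touch the same host memory, which is the isolation property the hypervisor is responsible for.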
Unfortunately, guest OSes cannot be allowed to directly access I/O devices or their drivers. If that were to happen, a guest OS might tell an I/O device to write data to memory where another guest or the hypervisor is stored. Instead, the hypervisor communicates directly with the hardware and must have a driver for any I/O devices in the system; the guest OS talks to the hypervisor, which then communicates with the driver. This double translation process is slower than accessing hardware directly, and the driver implications are also quite problematic and expensive. The four major operating systems (Windows, Linux, Solaris, OS X) already have most of the important device drivers, and it makes a lot more sense to re-use those than to try to re-invent drivers for all the different hypervisors.
Intel’s and AMD’s I/O virtualization efforts solve these twin problems, along with several others. I/O virtualization relies on an I/O Memory Management Unit (IOMMU) in the chipset to translate the addresses used by an I/O device (which are now treated like virtual addresses for I/O devices) into host physical memory. The IOMMU also creates special protection domains, which are areas in the host physical memory that are assigned to a guest OS or a hypervisor, or are used for DMA into a guest OS. A context entry table holds the assignments of I/O devices to protection domains. Physical memory in a protection domain can only be written by devices which are assigned to that domain as well; the IOMMU enforces this when it translates the addresses. So if a network controller tried to write data into a protection domain that was only used by the graphics card and a guest OS, the IOMMU would reject the network controller’s DMA request, avoiding any problems. By now, this should sound quite similar to the way that I explained page tables for virtual memory in the previous article, and with good reason: it’s almost exactly the same problem. The IOMMU also deals with interrupts from devices and routes them to the appropriate guest OS or hypervisor.
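The protection mechanism can be sketched in a few lines. This is a toy software model, not how either vendor's hardware is organized: the class names, the string device IDs and the flat dictionaries standing in for the context entry table and per-domain page tables are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class DmaFault(Exception):
    """Raised when the model rejects a DMA request."""


class ProtectionDomain:
    """Hypothetical domain: device-visible page -> host physical page."""

    def __init__(self, name: str):
        self.name = name
        self.page_table = {}

    def map_page(self, device_page: int, host_page: int) -> None:
        self.page_table[device_page] = host_page


class Iommu:
    def __init__(self):
        # Stand-in for the context entry table: device id -> domain.
        self.context_table = {}

    def assign(self, device_id: str, domain: ProtectionDomain) -> None:
        self.context_table[device_id] = domain

    def translate(self, device_id: str, device_addr: int) -> int:
        """Translate a device address, or reject the DMA request."""
        domain = self.context_table.get(device_id)
        if domain is None:
            raise DmaFault(f"{device_id} is not assigned to any domain")
        page, offset = divmod(device_addr, PAGE_SIZE)
        host_page = domain.page_table.get(page)
        if host_page is None:
            raise DmaFault(f"{device_id} has no mapping for page {page}")
        return host_page * PAGE_SIZE + offset


if __name__ == "__main__":
    iommu = Iommu()
    gpu_domain, nic_domain = ProtectionDomain("gpu"), ProtectionDomain("nic")
    gpu_domain.map_page(0, 100)
    gpu_domain.map_page(1, 101)
    nic_domain.map_page(0, 200)
    iommu.assign("graphics", gpu_domain)
    iommu.assign("network", nic_domain)
    print(hex(iommu.translate("network", 0x10)))  # allowed
    try:
        iommu.translate("network", PAGE_SIZE)  # page 1 exists only for the GPU
    except DmaFault as fault:
        print("rejected:", fault)
```

Because a device's addresses are only ever resolved through the page table of its own domain, it has no way to name host memory that belongs to another domain; an unmapped address simply faults.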
Of course, if the IOMMU had to perform a full translation for every memory access by a device or guest OS, the performance would suffer. To avoid this, the IOMMU has its own translation cache, called the IOTLB, which functions exactly like a traditional TLB.
There are several other hardware managed caching structures associated with the IOMMU, namely the context cache, page directory entry (PDE) cache and the interrupt entry cache. The context cache holds frequently accessed context entries (which map devices to protection domains). The PDE cache is used to accelerate page table walks (which are all done in hardware), and lastly the interrupt entry cache holds frequently accessed entries in the interrupt remapping table. Just like traditional caches, these structures must be invalidated when the tables they cache are updated.
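The caching and invalidation behaviour described above can be illustrated with a toy model of a translation cache. Everything here is an assumption for the sake of the example: the class name, the unbounded dictionary used as the cache, and the `walks` counter that stands in for the cost of a hardware page table walk.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class CachedTranslator:
    """Toy IOTLB-style cache in front of a slow page table lookup."""

    def __init__(self):
        self.page_table = {}  # (device id, device page) -> host page
        self.iotlb = {}       # cached copies of page_table entries
        self.walks = 0        # number of simulated page table walks

    def translate(self, device_id: str, device_addr: int) -> int:
        page, offset = divmod(device_addr, PAGE_SIZE)
        key = (device_id, page)
        if key not in self.iotlb:
            self.walks += 1                       # miss: walk the table
            self.iotlb[key] = self.page_table[key]
        return self.iotlb[key] * PAGE_SIZE + offset

    def update_mapping(self, device_id: str, page: int, host_page: int) -> None:
        """Change a mapping and invalidate any cached copy of it."""
        self.page_table[(device_id, page)] = host_page
        self.iotlb.pop((device_id, page), None)


if __name__ == "__main__":
    mmu = CachedTranslator()
    mmu.update_mapping("network", 0, 200)
    mmu.translate("network", 0x10)
    mmu.translate("network", 0x20)  # same page: served from the cache
    print("walks after two accesses:", mmu.walks)
    mmu.update_mapping("network", 0, 300)  # remap and invalidate
    print(hex(mmu.translate("network", 0x10)))
```

If `update_mapping` skipped the `pop`, the second translation would keep returning the stale host page, which is exactly why these structures need explicit invalidation.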
Another problem that the IOMMU solves is letting older 32-bit devices (which can only see the first 4GB of memory) reach memory anywhere in the host physical address space. The IOMMU can simply translate the first 4GB of the device virtual memory to any region of the host physical memory.
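As a rough sketch of that remapping, the function below treats the device's 4GB view as a single window that is relocated to an arbitrary base in host physical memory. This is a deliberate simplification: an actual IOMMU translates page by page, so the window need not be contiguous. The function name and the base address are hypothetical.

```python
FOUR_GB = 1 << 32  # upper limit of what a 32-bit device can address


def remap_32bit_dma(device_addr: int, host_window_base: int) -> int:
    """Relocate a 32-bit device address into a window of host physical memory."""
    if not 0 <= device_addr < FOUR_GB:
        raise ValueError("a 32-bit device cannot generate this address")
    return host_window_base + device_addr


if __name__ == "__main__":
    window_base = 8 * (1 << 30)  # assume the window starts at 8GB
    host_addr = remap_32bit_dma(0x1000, window_base)
    print(hex(host_addr), host_addr >= FOUR_GB)
```

The device still issues ordinary 32-bit addresses; only the translation decides that they land above the 4GB boundary.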
The first generation of I/O virtualization techniques lets I/O devices begin to participate in the virtual memory hierarchy and provides protection mechanisms so that devices can be directly accessed by guest OSes. This is a good initial step. Judging by the evolution of similar technologies, the next step will likely be to enable a single device to be directly accessed by multiple guest OSes. However, a more interesting question is whether certain I/O devices, perhaps a future version of Larrabee, will be able not just to DMA, but actually to participate in fully cache coherent communication, the ultimate level of evolution for an I/O device.













