[MUSIC] So today we're going to look at virtualization in all its different forms. What we're going to find is that there's a range of virtualization mechanisms you can use. Some are much more efficient than others, some are much more secure than others, and which one you choose really depends on your application. What's exciting is that this is an evolving area: over time, hardware and software have come together to improve virtualization from both the software point of view and the hardware point of view. So it's a moving target. We're giving you an update, if you like; come back in a couple of years and it may well have changed. However, some of the principles will remain the same, and it's those principles I want to get across in this lecture. In the subsequent lectures I'll give you some ideas about the directions people are moving in to provide faster, more application-oriented virtualization, virtualization that matches what people want for their clients.

So let's take a look now at the range of different virtualizations that occur. We've got native, which is really full virtualization. The code is running real machine instructions, but those instructions execute inside an envelope that does the virtualization, and the hardware provides the separation from one user to another as they each execute their own machine on the same hardware. We then say it's hardware assisted, meaning that when you virtualize, the transitions, from user mode into operating-system mode, from the operating system into doing I/O, and so on, all happen inside this envelope, this encapsulation, this way of separating things, and the hardware itself supports doing that. So if we're in the client and we drop into the operating system, what happens is that there's a special interrupt saying this particular VM wants to go from user space to operating-system space, and down below somewhere there'll be a hypervisor. The hypervisor does some context switching to make it all happen seamlessly. The guest operating system executes in its own security zone, supported by the hardware: it can, say, update the page tables and handle interrupts, and the hardware is arranged so that that particular virtual machine thinks it has all of the devices to itself and sees all of those events. That's hardware assisted, and to build those sorts of systems you need to go into the hardware and add additional instructions, as you find in Intel VT and similar extensions.

We're going to distinguish that from software virtualization. Sometimes the software is helped by hardware, but most of the encapsulation, the virtualization, is performed by the software. Para-virtualization was the first of these to come along; what it does is provide hooks that replace the hardware assistance, to give that same virtualization to the operating system and to the client. Then we'll talk about more sophisticated approaches which push the virtualization concept much further into software, and we're going to be doing this at the operating-system level.
What we're going to do is build operating systems that accommodate virtualization, that do essentially everything para-virtualization does, and give you faster execution: they are thinner and lighter than para-virtualization and therefore execute faster. They may not give you as much security. We will look at several of these. Containers are very popular at the moment, actually a leading contender nowadays: when you're building a system and you have to decide what sort of virtualization to use, containers show clear benefits and attract a lot of users. Jails are an old-style Unix mechanism that prevented you, when running on Unix, from accessing parts of the file system outside your own; they would lock your root file system somewhere, and chroot was the mechanism used to implement jails. We'll talk a bit about that. We're also going to talk about zones, which are built out of these ideas and were made popular by Solaris; they're slightly different from containers, so we'll talk a bit about those too. And we'll talk about OpenVZ and Virtuozzo and how they operate. So at the end of this set of lectures, this lesson, you'll end up with an understanding, across the spectrum of possibilities, of which ones match your interests best.

So, the virtual machine simulates enough hardware to allow unmodified guest operating systems, ones designed for the same CPU, to be run in isolation. There's a range of these; examples would be VirtualBox, Virtual PC, VMware, QEMU, Win4Lin, and Xen, and I'll talk a bit more about Xen later on. Essentially, if you look at the diagram, we've got the hardware, and we put a hypervisor on top of it. We have applications and a guest OS, each application paired with its guest OS; there are two listed there. And we have some management software that takes each application and guest OS, isolates it, and provides it with its own execution environment.

In this scheme, the obvious thing is to interpret pretty much all of the instructions, and this is really what underlies something like QEMU. It's clearly possible to take each machine instruction and interpret what it does, one by one, all the way through. We can do it for the applications, we can do it for the guest OS, and we can build a simulator for everything that happens on the system. But that would be inefficient. So what a lot of these machines do is simulate only the instructions that are relevant, that are difficult, that might break out of the virtualization, and execute the other instructions, things like add, multiply, divide, directly on the machine using the machine architecture. In this way you can get orders of magnitude of speedup over full simulation, such that these virtual machines behave almost as fast as the software running on the real hardware. So in this scheme, we go through and replace the instructions that are sensitive to security, sensitive to there being multiple users on the box when you don't want people to know there are multiple users, and we simulate the effect of those instructions on the box.
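To make that concrete, here is a deliberately toy sketch in C of the dispatch idea: simulate the sensitive instructions, let everything else run directly. The "instruction records" and the pretend guest program are invented for illustration; this is not QEMU's actual machinery, just the shape of selective interpretation.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdbool.h>

/* Toy model of a decoded guest instruction; a real VMM decodes machine code. */
typedef struct {
    const char *name;
    bool sensitive;   /* would touch privileged state (page tables, I/O, ...) */
} insn_t;

/* A pretend guest program. */
static const insn_t program[] = {
    { "add",       false },
    { "mul",       false },
    { "out 0x3f8", true  },  /* I/O port write: must be simulated */
    { "load",      false },
    { "mov cr3",   true  },  /* page-table switch: must be simulated */
};

int main(void)
{
    for (size_t pc = 0; pc < sizeof(program) / sizeof(program[0]); pc++) {
        if (program[pc].sensitive)
            /* Simulate the effect on the virtual machine's state, so the
             * guest never really touches devices or page tables. */
            printf("emulate: %s\n", program[pc].name);
        else
            /* In a real VMM these run directly on the CPU at full speed. */
            printf("native : %s\n", program[pc].name);
    }
    return 0;
}
```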
All right, if we delve down into a hardware solution for virtual machines, you'll see all of the mechanism that the hardware has to provide, and that a simulator would otherwise have to build, to do what the hardware is doing; that gives you the comparison. In this system, what we're going to look at is what's actually required to run a virtual machine on the hardware itself, so you get really fast execution. We want the application to talk to a guest operating system without being interfered with by other virtual machines. The classic solutions to this in the PC and cloud space are, from Intel, the VT (IVT) instruction-set extensions that support virtual machines, and in the AMD space, the AMD-V virtualization hardware. Examples of systems that actually use this are VMware Fusion, Parallels Desktop for Mac, and Parallels Workstation.

What's happening in the diagram is that all of the difficult instructions that the applications and guest OS execute will be executed by instructions on those machines: there are additional instructions to make all of this mechanism work, plus software that smooths it so it operates properly. Let's take an example. You're in the application and you want to do some I/O, so you want the operating system to do some I/O operations for you, and you need to context switch into the operating system. In these hardware-assisted mechanisms, there's a hardware context switch that takes you to operating-system level in the kernel. But it won't be the lowest level; it won't be in charge of all the I/O. Instead, the guest kernel has enough privilege to execute effectively as your operating system, and when it wants to do I/O it executes instructions that get interpreted by another layer lower down, which is, again, the hypervisor. That layer takes the I/O requests and maps them into actual I/O operations on the machine. So at the lowest level you interpret, essentially, the difficult operating-system instructions that would break the illusion of virtualization. Moving up, when you go from an application into the guest operating system, you'll be doing an interrupt or a trap, and that trap has the particular property of taking you to the operating-system code, not the hypervisor underneath: the hardware redirects your interrupt so it actually goes to the guest operating system, not to the hypervisor. All of this is built into the hardware, and it completely separates that guest OS and its applications from all the other applications and guest OSes on the machine.

Now, in terms of efficiency, this can be very efficient, but the guest OSes are now really operating in their own virtual memory, with their own caches, with their own data. As you access all of that material you're isolated from everything else, so you can't share any of those values: if there's a piece of operating-system code or an operating-system data structure in the guest OS, it's not going to be shared with anything else, and so some inefficiencies occur. Similarly, when you transfer from the application into the operating system, and then into the hypervisor to do I/O, each of those transfers adds a little bit of overhead.
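On Linux, this hardware-assisted path is exposed through KVM, and the skeleton below sketches how a user-space hypervisor opens it. The ioctl calls shown (KVM_GET_API_VERSION, KVM_CREATE_VM, KVM_CREATE_VCPU) are the real KVM interface, but this is only a sketch: a working VMM would also map guest memory, load code, set registers, and handle every VM exit, and error handling is trimmed here.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    /* /dev/kvm is the kernel's interface to Intel VT-x / AMD-V. */
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }

    printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

    /* One VM, one virtual CPU. A real VMM would now map guest memory
     * (KVM_SET_USER_MEMORY_REGION), load guest code, and set registers. */
    int vmfd   = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);

    /* ioctl(vcpufd, KVM_RUN, 0) would then execute guest code natively;
     * it returns whenever the hardware forces a "VM exit" -- e.g. the
     * guest touched an I/O port -- and the exit reason tells the
     * hypervisor what to emulate before resuming. Those exits are
     * exactly the transfer overheads discussed above. */
    (void)vcpufd;
    return 0;
}
```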
With para-virtualization, as opposed to hardware-assisted virtualization, these virtual machines don't have a full instruction set for doing the virtualization, for handling the jump into an operating system or the operating system wanting to do I/O. Instead, we replace some of those difficult instructions with software. We may have support in the virtual machine and in the underlying hardware to help with that, and we may find support for maintaining the address spaces, but essentially a lot of the operations that were mechanized before aren't. The opportunities for sharing, for finding and sharing caches, are much higher, though, because we no longer have completely separate domains of execution for the applications and operating systems.

There are different ways to do this. One would be to simulate the hardware. Another is to go in and modify the guest operating system we were talking about: we insert into the guest operating system APIs for those difficult operations that might endanger the illusion of virtualization. As we said, when the application does I/O, it makes a request to the operating system, and the operating system then goes and does the I/O. Well, you can change the way the operating system does that I/O, making it a procedure call to a hypervisor instead. Similarly, when you go from the application into the operating system, you might replace that transfer with a call to the hypervisor that changes what's going on; for example, if you're worried about protection, you could change the protection of the application as you switch into the operating system.

So we get some additional terminology. The requests that come into the hypervisor to perform the sensitive operations of the operating system are called hypercalls: we have a hypervisor, we have hypercalls into the hypervisor, and then we adjust what's going on. If you have exceptions, interrupts, and other such things, those also need mapping, and there will be extra code and new routines to do the transfer and make sure your virtualization is a solid virtualization. Examples of this are Xen, KVM, and Win4Lin 9x. If you look at the diagram, you have an application, you have a modified guest OS, and now you have the hypervisor under it, supporting all of those. Obviously, if the operating systems are modified, you can share some parts of the guest OSes between those applications, though other parts which have been modified you may not be able to; there is more room for sharing. There is a cost, though: the cost of these hypercalls, and that adds to the overall time.
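As a sketch of what such a guest modification looks like, consider updating a page-table entry. A native kernel writes the entry directly with a privileged store; a para-virtualized kernel hands the request to the hypervisor as a hypercall so it can be validated first. The code below is loosely modeled on Xen's MMU-update hypercall, but the names, numbers, and the stand-in hypervisor function are simplified inventions, not the real Xen ABI.

```c
#include <stdio.h>

#define HC_MMU_UPDATE 1   /* invented hypercall number, for illustration */

struct mmu_update {
    unsigned long pte_addr;  /* which page-table entry to change */
    unsigned long new_val;   /* the value the guest wants to install */
};

/* Stand-in for the hypervisor side of the hypercall. In a real system
 * this runs at a higher privilege level, reached via a trapping
 * instruction, and validates the update so one guest cannot map
 * another guest's memory. */
static long hypercall(int number, void *arg)
{
    if (number == HC_MMU_UPDATE) {
        struct mmu_update *m = arg;
        printf("hypervisor: validate and apply pte@%#lx = %#lx\n",
               m->pte_addr, m->new_val);
        return 0;
    }
    return -1;
}

/* In a native kernel this would be a single privileged store. The
 * modified (para-virtualized) guest kernel calls the hypervisor
 * instead -- this substitution is what "modifying the guest OS" means. */
static void guest_set_pte(unsigned long pte_addr, unsigned long new_val)
{
    struct mmu_update req = { pte_addr, new_val };
    hypercall(HC_MMU_UPDATE, &req);
}

int main(void)
{
    guest_set_pte(0x1000, 0x2003);  /* example values */
    return 0;
}
```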
An alternative route, rather than having modified operating systems as in Xen and so on, is to change the operating system itself to support virtual private servers, providing an interface on the operating system that enables a virtualized set of servers to run. In this diagram, what you can see is how you create servers on top of the operating system, with the operating system providing libraries to those servers which actually provide the virtualization for each of them. So the operating system is now responsible for interpreting the sensitive requests of each private server, and it is programmed to try and avoid conflicts between the different virtualizations. This approach is used in Parallels Virtuozzo, Linux-VServer, OpenVZ, Solaris Containers, FreeBSD Jails, and chroot.

It is less secure than the previous examples. Clearly, if there's a problem with the operating system, or a problem in the interaction between the virtual private server and the operating system, there is nothing to stop intrusions and other problems occurring inside the system. However, if it's all coded correctly, what it does is remove all of those levels of interpretation, all of the hypercalls, and what you end up with is a much more traditional operating system, one we know how to optimize and can execute very quickly. So there are some pluses to this, and there are some difficulties. Now, what we find is that in lots of circumstances these pluses actually outweigh the security issues. For example, suppose you are building a server and it has multiple different threads, multiple different pieces of the service, as in, say, a cloud. Because it's all operating in the same domain with the same security, essentially all working for the same organization, it's not clear you need the protection between all those different pieces, all those different servers, all those different containers. In that case this might be a much, much better solution than going ahead and building a real virtual machine, either with hardware or with interpretation. So that's another reason for looking at this whole spectrum: there are different reasons for doing different things.
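The simplest concrete instance of this OS-level approach is the chroot jail just mentioned. Here's a minimal sketch in C; the jail path /srv/jail and the unprivileged uid/gid 1000 are example values, and a real jail would have to be populated with the binaries and libraries the confined program needs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A minimal chroot "jail": after this, the process (and its children)
 * see /srv/jail as the root of the file system and cannot name files
 * outside it. Requires root to call chroot(2). */
int main(void)
{
    if (chroot("/srv/jail") != 0) { perror("chroot"); exit(1); }
    if (chdir("/") != 0)          { perror("chdir");  exit(1); }  /* don't keep a cwd outside the jail */

    /* Drop root, or the process could simply chroot() its way back out. */
    if (setgid(1000) != 0 || setuid(1000) != 0) { perror("drop privs"); exit(1); }

    execl("/bin/sh", "sh", (char *)NULL);  /* must exist inside the jail */
    perror("execl");
    return 1;
}
```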
Let's map these together. On the left-hand side we list the hypervisors, with their overheads and so on; on the right side, the containers. As we compare them, the first thing you come up against is: with hypervisors you have hardware, from that hardware you build many virtual hardwares, and on top of those you run many operating systems. Compare that with the containers, this approach of building the interpretation into the operating system: there, you've got one real hardware and no virtual hardware, one kernel and many user-space instances, and what's giving you the effect of user-space separation is really the kernel.

Back on the left-hand side, the hypervisor is very versatile: it can run lots of different operating systems, so you could have Windows running at the same time as Apple OS X, at the same time as Linux, all in the same box. That's not possible on the container side. Containers can be tuned to take applications written for a specific operating system; the libraries might differ, but on the whole they're not going to be able to support all the different operating systems effectively, because there are sometimes really radical differences between operating systems. You do, however, get a difference between hypervisors and containers in density, in the effectiveness of all the bytes you're using. With the hypervisor you've got all this interpretation; you don't need extra code in the operating system to do things, but you do have the hypercalls.

On the other side, the container side, you get performance and you get scalability. You don't need so much code, and the applications don't need so much code, to do the sorts of things they're doing, so you get higher density, if you like: you can fit more apps into a set of container systems than onto a hypervisor with its apps. And you get natural page sharing, because the normal caching and other operations inside the system work for you; page sharing is not going to work for you with the hypervisor. Lower down, looking at the features for supporting multiple machines: with hypervisors you'll have hardware that accelerates some of the difficult tasks, so if you're doing context switching there's hardware to support that, and if you're manipulating virtual-memory mappings there's hardware to do that. On the container side you have less flexibility, fewer possibilities for changing things, but because you're effectively running a simple, slightly modified operating system, you get almost native performance, with no real observable overhead.

So, containers share the host OS and drivers; they have small virtualization layers that keep everything distinct; they naturally share pages. They're not completely secure, but on the other hand they're getting very, very close to being a well-protected environment. Hypervisors have separate OSes and virtualized hardware; their hardware emulation requires additional state; they have trouble sharing OS pages, so they often don't; and they are much, much better at walling off the world of one application from the other applications, because there is effectively a large wall between them.

A container is also more elastic than a hypervisor. If you want to create containers, they have such a small footprint in the way they're built, and use so little software, that they don't have much impact on performance, and you can get many more containers onto a machine, and onto a core, than you can hypervisors supporting operating systems in virtual memories. The container's slicing of the operating system is ideally suited to the sorts of things we find in the cloud. When you build, say, application support in the cloud for big data, you may implement it with lots of threads on lots of different machines, lots of different cores all in the same machine, and you can utilize all those cores much more easily because containers can slice quickly from one task to another; they don't have to go through all that mechanism we described for virtual machines. So that sort of system is much more mappable, from a cloud point of view, onto containers. As for the hypervisor, I won't say this is its only advantage, but the slide does: its one advantage in this infrastructure is that it allows you to have different operating-system families on one server. So if you really do need Windows to be talking to Linux, you can do that all on one machine. It's an interesting case, but not a really demanding case for, shall we say, not going down the container route.
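On Linux, this "slicing" is built from kernel namespaces, and the sketch below shows the core move: clone() a process into fresh PID and UTS namespaces, so it sees itself as pid 1 with its own hostname while still sharing the host kernel, page cache, and drivers. It's deliberately minimal; real container runtimes add mount, network, and user namespaces plus cgroups for resource limits, and "container0" is just an example hostname.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

/* Inside new PID and UTS namespaces this process is pid 1 and has its
 * own hostname, yet it shares the host kernel: the separation is
 * bookkeeping in one kernel, not a second operating system. */
static int child(void *arg)
{
    (void)arg;
    sethostname("container0", 10);                    /* private to this namespace */
    printf("in container: pid=%d\n", (int)getpid());  /* prints 1 */
    return 0;
}

int main(void)
{
    /* Requires root (CAP_SYS_ADMIN). The stack pointer is the top of the
     * buffer because the stack grows downward on most architectures. */
    pid_t p = clone(child, child_stack + sizeof(child_stack),
                    CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (p < 0) { perror("clone"); exit(1); }
    waitpid(p, NULL, 0);
    return 0;
}
```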
So, mapping all this out, we can compare increasing efficiency against increasing isolation: increasing isolation runs along the bottom, and efficiency is the height. If we want things that are very efficient, then plain Linux and Windows are great. If we want to isolate one application's software from another, we can go to something like a security kernel; one of those is SELinux, and you can see it has slightly lower performance. What it does is check everything that goes on, communication-wise, between the applications and the rest of the machine. SELinux uses a label system to do that; I won't go into it here, but interpreting those labels carries a little bit of overhead. Similarly, you can put BSD jails up, and what's happening with a jail is that it's checking your access to the file system to make sure you don't exceed your authorization: you're constrained to a piece of the file system, and it checks that every time you make a file-system access, not just against the access tables but against what's open, what's in your tables, what's in the cache, and so on. So BSD jails also impose a little bit of overhead. Then you get to the container systems: VServer, OpenVZ, Solaris 10. These are faster than the full virtual-machine systems, though what they're doing is, in effect, threading between multiple instances: they keep their memories separate, and as I said, this is all done by making the libraries a little more sophisticated, doing a little more checking of the communications going through the operating system, and keeping the identities separate. So VServer, OpenVZ, and Solaris 10 all have more overhead than SELinux.

Moving across, we get to the Xen and VMware type of implementations, where what you're doing is para- or full virtualization: you're interpreting the sensitive operations being executed. As I said before, when the client wants to do I/O, you check as it drops into its operating system, and you check again when the operating system actually wants to do the I/O operation, and all of that adds to your costs. It gives you more isolation, but again it taxes performance. And then, for machine-enforced isolation, you can put extra machine instructions in there and virtualize everything, and you can build systems whose isolation is really very, very difficult to break; but you're going to pay a penalty. You're going to have more hardware, extra instructions, and extra code to manage those extra instructions and keep track of all that security. So it's again not as efficient as you would like, especially from the point of view of how much hardware you're using to execute the instructions.

Looking at it from a feature viewpoint: the hypervisor, meaning partial or full virtualization, allows you to have multiple kernels; allows you to load arbitrary modules and still be protected, because they sit inside an enforced protection boundary; allows you to hand administration of the operating systems that run to somebody; and allows live migration. There is going to be a difficulty either way, though: when you want to update, when you have to change the system, modify it, fix a bug or fix a security problem, you have to worry about how you're going to do that in an environment that has lots of threads and applications running.
On the container side, you can't have multiple kernels, and loading arbitrary modules is not possible. You can have a local administrator, you can have live migration, and in some systems, like Zap, you can actually do live system update. Updating is a lot easier because there aren't so many systems to update; there's only the one centralized system. [MUSIC]