Hypervisor From Scratch – Part 5: Setting Up VMCS & Running Guest Code
A virtual machine (VM), in this context, is conceptually separate code running on the same physical processor but isolated in its own environment. The first phase of testing a VM is loading a few assembly instructions into it. Launching the VM executes these instructions in the context of the VM. After the instructions are executed, the VM exits, giving execution back to the kernel. This process, gleaned from the manual, should be:
Lastly, I needed an EPT. The EPT is the Extended Page Table, a separate page table that the host uses to translate guest physical addresses to host physical addresses. While reading the chapter on EPT, I recognized that the form of the EPT and a 4-level page table (which was already implemented for the kernel) are very similar. The only difference was the flags for each entry in the page table itself. For the time being, I copied the entire original page table code, renamed all the PageTable references to ExtendedPageTable references, and changed the entry flags to the correct EPT entry flags for READ, WRITE, and EXECUTE, and it just worked (to my surprise). Finally, the EPT is set in the VMCS so the VM knows which page table to use when translating addresses.
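To make the entry-flag difference concrete, here is a minimal sketch of how an EPT entry could be assembled. The bit positions for READ, WRITE, and EXECUTE (bits 0 to 2) follow the Intel manual; the helper name `make_ept_entry` and the fixed 52-bit address mask are assumptions made for illustration, and the memory-type bits of leaf entries are deliberately left out. It is written in Python purely as a model and is not the kernel's actual code.

```python
# EPT permission flags: bits 0-2 of every EPT entry (Intel SDM).
EPT_READ = 1 << 0
EPT_WRITE = 1 << 1
EPT_EXECUTE = 1 << 2

# Bits 51:12 hold the physical address of the next table or 4 KiB page.
# Assumption: a 52-bit physical address width; real hardware may expose fewer bits.
ADDR_MASK = 0x000F_FFFF_FFFF_F000


def make_ept_entry(host_phys_addr, flags):
    """Hypothetical helper: combine a page-aligned host address with EPT flags."""
    if host_phys_addr & 0xFFF:
        raise ValueError("address must be 4 KiB aligned")
    return (host_phys_addr & ADDR_MASK) | flags


entry = make_ept_entry(0x1234_5000, EPT_READ | EPT_WRITE | EPT_EXECUTE)
print(hex(entry))  # 0x12345007
```

An ordinary x86-64 page table entry uses its low bits differently (present, writable, user), which is why swapping the flag constants was enough to reuse the existing page table code.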
In code, this looked fairly simple as well. We match (similar to select) on the reason the VM exited, then read from the processor which guest physical address caused the EPT violation. Using that address, we retrieve the bytes at that guest physical address (more on that funny situation below) and write them to the required page.
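The flow described above can be modelled roughly as follows. This is only a sketch: `ToyVm`, its `backing` dictionary, and the string return values are hypothetical stand-ins, not the kernel's real types. The one hardware fact it relies on is that the basic exit reason for an EPT violation is 48.

```python
EXIT_REASON_EPT_VIOLATION = 48  # basic exit reason defined by the Intel SDM
PAGE_SIZE = 4096


class ToyVm:
    """Hypothetical model: pages are dict entries keyed by page-aligned guest physical address."""

    def __init__(self, backing):
        self.backing = backing   # where the original page contents come from (assumed)
        self.guest_pages = {}    # pages currently mapped into the guest

    def handle_exit(self, exit_reason, guest_phys_addr):
        if exit_reason == EXIT_REASON_EPT_VIOLATION:
            # Round the faulting address down to its page, then copy the bytes in.
            page_base = guest_phys_addr & ~(PAGE_SIZE - 1)
            self.guest_pages[page_base] = bytes(self.backing[page_base])
            return "resume"
        return "unhandled"


vm = ToyVm({0x1000: b"\x90" * PAGE_SIZE})
print(vm.handle_exit(EXIT_REASON_EPT_VIOLATION, 0x1234))  # resume
```

On real hardware the faulting address would be read from the VMCS guest-physical address field rather than passed in as an argument.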
The beauty of resetting a VM is that resetting every page in the VM itself isn't necessary (this idea came from this talk). The EPT provides a "dirty bit" that is set if the page at a particular EPT entry was modified. Using this information, a simple page table walk tells the kernel which pages need to be reset, causing us to bypass many expensive page copies.
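A rough model of that page table walk is sketched below. The dirty flag sits in bit 9 of a leaf EPT entry when accessed/dirty tracking is enabled in the EPTP; everything else here (nested dictionaries standing in for the four table levels, the function name `dirty_pages`) is an assumption made to keep the example short.

```python
EPT_DIRTY = 1 << 9  # dirty flag of a leaf EPT entry (Intel SDM)


def dirty_pages(table, level=4, base=0):
    """Return guest physical addresses of 4 KiB pages whose dirty bit is set.

    `table` maps an index (0-511) to a nested dict for levels 4..2,
    or to an integer leaf entry at level 1. This layout is hypothetical.
    """
    shift = 12 + 9 * (level - 1)  # address bits covered by one index at this level
    found = []
    for index, value in sorted(table.items()):
        addr = base | (index << shift)
        if level == 1:
            if value & EPT_DIRTY:
                found.append(addr)
        else:
            found.extend(dirty_pages(value, level - 1, addr))
    return found


# One dirty page at guest physical 0x1000 and one clean page at 0x2000.
ept = {0: {0: {0: {1: 0x5007 | EPT_DIRTY, 2: 0x6007}}}}
print([hex(a) for a in dirty_pages(ept)])  # ['0x1000']
```

Only the addresses returned by the walk need to be copied back from the original image; a real reset would also clear the dirty bits and invalidate cached EPT translations afterwards.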
On powerpc using book3s_hv mode, the vcpus are mapped onto virtual threads in one or more virtual CPU cores. (This is because the hardware requires all the hardware threads in a CPU core to be in the same partition.) The KVM_CAP_PPC_SMT capability indicates the number of vcpus per virtual core (vcore). The vcore id is obtained by dividing the vcpu id by the number of vcpus per vcore. The vcpus in a given vcore will always be in the same physical core as each other (though that might be a different physical core from time to time). Userspace can control the threading (SMT) mode of the guest by its allocation of vcpu ids. For example, if userspace wants single-threaded guest vcpus, it should make all vcpu ids be a multiple of the number of vcpus per vcore.
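The id arithmetic can be illustrated with a short sketch. The function names are hypothetical, and `threads_per_vcore` stands for the value reported by KVM_CAP_PPC_SMT; this only shows the calculation and makes no KVM calls.

```python
def vcore_id(vcpu_id, threads_per_vcore):
    """vcore id = vcpu id divided (integer division) by the vcpus per vcore."""
    return vcpu_id // threads_per_vcore


def single_threaded_vcpu_ids(count, threads_per_vcore):
    """vcpu ids for `count` single-threaded vcpus: multiples of the vcore size."""
    return [i * threads_per_vcore for i in range(count)]


ids = single_threaded_vcpu_ids(3, 4)
print(ids)                            # [0, 4, 8]
print([vcore_id(i, 4) for i in ids])  # [0, 1, 2]
```

With four vcpus per vcore, ids 0, 4, and 8 each land in their own vcore, so every guest vcpu runs single-threaded.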
If the guest performed an access to I/O memory which could not be handled by userspace, for example because of missing instruction syndrome decode information or because there is no device mapped at the accessed IPA, then userspace can ask the kernel to inject an external abort using the address from the exiting fault on the VCPU. It is a programming error to set ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV. This feature is only available if the system supports KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in how userspace reports accesses for the above cases to guests, across different userspace implementations. Nevertheless, userspace can still emulate all Arm exceptions by manipulating individual registers using the KVM_SET_ONE_REG API.
The host will set a flag in the pvclock structure that is checked from the soft lockup watchdog. The flag is part of the pvclock structure that is shared between guest and host, specifically the second bit of the flags field of the pvclock_vcpu_time_info structure. It will be set exclusively by the host and read/cleared exclusively by the guest. The guest operation of checking and clearing the flag must be an atomic operation, so load-link/store-conditional, or equivalent, must be used. There are two cases where the guest will clear the flag: when the soft lockup watchdog timer resets itself or when a soft lockup is detected. This ioctl can be called any time after pausing the vcpu, but before it is resumed.
Set up the processor specific debug registers and configure vcpu for handling guest debug events. There are two parts to the structure; the first, a control bitfield, indicates the type of debug events to handle when running. Common control bits are:
Sets the exception vector used to deliver Xen event channel upcalls. This is the HVM-wide vector injected directly by the hypervisor (not through the local APIC), typically configured by a guest via HVM_PARAM_CALLBACK_IRQ. This can be disabled again (e.g. for guest SHUTDOWN_soft_reset) by setting it to zero.
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It configures an outbound port number for interception of EVTCHNOP_send requests from the guest. A given sending port number may be directed back to a specified vCPU (by APIC ID) / port / priority on the guest, or to trigger events on an eventfd. The vCPU and priority can be changed by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call, but other fields cannot change for a given sending port. A port mapping is removed by using KVM_XEN_EVTCHN_DEASSIGN in the flags field. Passing KVM_XEN_EVTCHN_RESET in the flags field removes all interception of outbound event channels. The values of the flags field are mutually exclusive and cannot be combined as a bitmask.
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the per-vCPU local APIC upcall vector, configured by a Xen guest with the HVMOP_set_evtchn_upcall_vector hypercall. This is typically used by Windows guests, and is distinct from the HVM-wide upcall vector configured with HVM_PARAM_CALLBACK_IRQ. It is disabled by setting the vector to zero.
Used on arm64 systems. If a guest accesses memory not in a memslot, KVM will typically return to userspace and ask it to do MMIO emulation on its behalf. However, for certain classes of instructions, no instruction decode (direction, length of memory access) is provided, and fetching and decoding the instruction from the VM is overly complicated to live in the kernel.
If userspace wishes to set up a guest topology, it should be careful that the values of these three leaves differ for each CPU. In particular, the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX for 0x8000001e; the latter also encodes the core id and node id in bits 7:0 of EBX and ECX respectively.
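As an illustration of the register placement described above, the sketch below patches a per-vCPU copy of the CPUID entries. The list-of-dictionaries layout and the name `set_topology_ids` are assumptions; a real VMM would edit its array of CPUID entries before handing it to KVM. The core id and node id fields of leaf 0x8000001e are not touched here.

```python
def set_topology_ids(entries, apic_id):
    """Write `apic_id` into the topology leaves of one vCPU's CPUID entries.

    `entries` is a hypothetical list of dicts with the keys
    function, index, eax, ebx, ecx, edx.
    """
    for entry in entries:
        if entry["function"] in (0x0B, 0x1F):
            entry["edx"] = apic_id   # APIC ID in EDX for every subleaf
        elif entry["function"] == 0x8000001E:
            entry["eax"] = apic_id   # APIC ID in EAX
    return entries


template = [
    {"function": 0x0B, "index": 0, "eax": 0, "ebx": 0, "ecx": 0, "edx": 0},
    {"function": 0x0B, "index": 1, "eax": 0, "ebx": 0, "ecx": 0, "edx": 0},
    {"function": 0x8000001E, "index": 0, "eax": 0, "ebx": 0, "ecx": 0, "edx": 0},
]

# Each vCPU gets its own copy so that the values differ per CPU.
vcpu3 = set_topology_ids([dict(e) for e in template], apic_id=3)
print(vcpu3[0]["edx"], vcpu3[1]["edx"], vcpu3[2]["eax"])  # 3 3 3
```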
This guide includes some basic virtualization theory as an introduction, and gives some examples of how a hypervisor might use the features that it describes. It doesn't cover the operation of a specific hypervisor, or attempt to explain how to write your own hypervisor from scratch. Both subjects are beyond the scope of this guide.
A preferred alternative, which is usually used to improve performance, is to enlighten the Guest OS. By making the Guest OS aware that it is running in a VM, and by providing virtual devices that are designed to have good performance when being emulated in the hypervisor and accessed from a Guest OS, a Guest OS can achieve good performance, even for I/O.
A system that uses virtualization is more complex. Some interrupts might be handled by the hypervisor itself. Other interrupts might come from devices allocated to a Virtual Machine (VM), and need to be handled by software within that VM. Also, the VM that is targeted by an interrupt might not be running at the time that the interrupt is received.
In the next article, we will dive right into building the MTRR map and cover the page attribute table (PAT). I will provide prefabricated structures and explain the initialization of EPT in your hypervisor. We will cover identity mapping, setting up the PML4/PML5 entries for our EPTP, allocating our various page directories, and how to implement 4KB pages versus 2MB pages. In addition, details will be provided on EPT violations and EPT misconfigurations and on how to implement their VM-exit handlers. The easiest part will be inserting our EPTP into our VMCS. Unfortunately, the next article will only cover configuration and initialization; the following article will provide different methods of monitoring memory activity.
This provides a simple way to prevent a host from scanning and activating logical volumes that are not required directly by the host. In particular, the solution addresses logical volumes on shared storage managed by oVirt, and logical volumes created by a guest in oVirt raw volumes. This solution is needed because scanning and activating other logical volumes may cause data corruption, slow boot, or other issues.
Specific Host(s) - The virtual machine will start running on a particular host in the cluster. However, the Engine or an administrator can migrate the virtual machine to a different host in the cluster depending on the migration and high-availability settings of the virtual machine. Select the specific host or group of hosts from the list of available hosts.
Suspend workload if needed - Allows the virtual machine to migrate in most situations, including when the virtual machine is running a heavy workload. Because of this, virtual machines may experience a more significant downtime than with some other settings. The migration may still be aborted for extreme workloads. The guest agent hook mechanism is enabled.