Diplomarbeit: Lightweight Virtualization on Microkernels
Figure 3.8: VMware Workstation running two virtual machines on a Linux host.

The second component runs as an application and is used to translate IO operations of the VM to host primitives (VMApp) [SVL01].

VMware Workstation has a fixed set of virtual devices that may be used by the guest. Host drivers are used as a back end. A set of custom drivers can be installed into guest operating systems to make them aware of virtualization and use more efficient IO interfaces.

An example of VMware Workstation running on a Linux host is illustrated in Figure 3.8. In this example it drives two virtual machines. In a VMware setup, both the hosting monolithic kernel and the VMDriver component belong to the TCB.

4 Design

Innovations in operating systems often imply breaking backward compatibility. However, the availability of applications is a crucial prerequisite for the adoption of an OS. As the development of complex applications is usually a slow and expensive process, providing a stable interface is an important aspect in the development of new operating systems.

Linux is a free, open-source operating system kernel. It is used in many applications in the embedded, desktop and even server domain. Over the years many companies started to use Linux and add to its already large ap
6.1.2 Custom Benchmark

I devised a small custom benchmark to inquire about the performance of the memory virtualization. It stresses the Linux system call interface by creating a huge number of processes that do little computation. In accordance with its file name, I will call it the bench.sh benchmark. Bench.sh and its sibling test.sh can be seen in Figures 6.2 and 6.3.

This benchmark causes Linux to send a huge number of IPIs to reschedule and migrate processes, and is therefore ideal to measure the performance of the IPI implementation. Process creation and destruction also stresses the memory subsystem: address spaces need to be created and destroyed. Additional computation is needed for synchronization in the file system layer of Linux. Again, I compared native Linux, Karma Linux, L4Linux and KVM Linux.

If the benchmark is executed with one virtual CPU, it runs faster in a VM established by Karma than it does in native Linux. I think that this is a result of scheduling anomalies caused by the virtual clock, which is currently running slightly slower than the physical clock; the difference in the amount of injected timer interrupts correlates with the speedup.

    #!/bin/sh
    FILE=1
    count=0
    while [ $count -le 100000 ]
    do
        count=$(($count + 1))
        rm -r $FILE.txt
        touch $FILE.txt
        dd if=/dev/zero bs=500K count=3 of=$FILE.txt 2> /dev/null
        gzip -9 $FILE.txt
        cat $FILE.txt.gz > /dev/null
        rm -rf $FILE.txt.gz
    done

Figure 6.3: te
Figure 2.1: Illustration of a shadow page table that maps guest virtual to host physical addresses. (Guest virtual memory is translated to guest physical memory using the guest page table; host virtual memory is translated to host physical memory using the host page table; the logical address translation is performed by the software MMU.)

shortcuts were implemented to increase performance. However, such shortcuts deviate from the x86 interface and do not work in the general case.

Operating systems have to be in control of the memory management of the machine, and rely on segmentation and paging to achieve memory protection. For security reasons, a VM is not allowed to manipulate host page tables and segments. Page table and segment manipulations are privileged operations that cannot be delegated to a VM, because doing so would enable the VM to manipulate kernel memory and thus seize control. Therefore, VMMs on x86 have to emulate the memory management unit (MMU).

The guest memory is usually virtualized by multiplexing host address spaces in a way that provides the VM with the illusion of being in control of the memory management. This requires the VMM to be involved in any memory-relevant operation of the VM and to keep track of mappings of VM memory to host memory. In order to do this, the VMM has to implement an additional page table that maps guest virtual addresses to host physical addresses. Such a page table is called a shadow page table, an
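The translation a shadow page table encodes can be pictured as the composition of two mappings. The following is a minimal, illustrative sketch with hypothetical names; real shadow paging walks multi-level page tables and is driven by page faults rather than map lookups:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical page-granular mappings (4 KiB pages). In reality each
// side is a multi-level page table, not a flat map.
using PageMap = std::unordered_map<uint64_t, uint64_t>;

constexpr uint64_t kPageMask = ~0xFFFULL;

// Compose the guest page table (guest-virtual -> guest-physical) with
// the VMM's memory map (guest-physical -> host-physical) to derive the
// entry a shadow page table would hold for one guest-virtual page.
bool shadow_translate(const PageMap& guest_pt, const PageMap& vmm_map,
                      uint64_t guest_va, uint64_t* host_pa) {
    auto g = guest_pt.find(guest_va & kPageMask);
    if (g == guest_pt.end()) return false;    // guest-visible page fault
    auto h = vmm_map.find(g->second & kPageMask);
    if (h == vmm_map.end()) return false;     // VMM must back this page first
    *host_pa = h->second | (guest_va & ~kPageMask); // keep the page offset
    return true;
}
```

Whenever the guest edits its page table, the VMM has to recompute the affected shadow entries, which is why it must intercept all memory-relevant operations.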
I could then fix the problem by modifying Linux to set different cache attributes.

Letting the guest OS drive a device directly is security sensitive if that requires DMA, as described in Section 4.5.3.

Therefore, I also implemented another way to command the hard disk. In that scheme, I did not map the device memory into the VM's address space but only into the VMM's. I modified the Linux driver to call the VMM to both read values from device memory and write commands to device memory. This has the advantage that the VMM can verify the command, and the VM is removed from the system's TCB. This scheme can be extended to hand DMA commands out to an external server, which would remove the VMM from the TCB as well. However, external analysis comes at the cost of performance, as will be discussed in the evaluation chapter.

5.3.6 Summary

In my solution Linux has access to the most important peripheral devices. In my work I support the following devices:

Serial Line Interface: A virtual serial line is provided with an interface that translates to the log interface of L4Re. Either the vcon Fiasco interface or any other service implementing the log interface can be used as back end.

Network Interface Card: In this setup Linux contains a stub driver to make use of the multiplexing capabilities of the Ankh NIC multiplexer. In my measurements I found that this setup is able to saturate the physical interface. At the time of
segment descriptor tables and the interrupt descriptor table. It also calls BIOS code for system information such as the layout of the system's physical memory. This system initialization is done in real mode to prepare the system for execution in protected (32-bit) mode.

Once in protected mode, the OS sets up the initial page tables and turns on paging. Thereafter, it discovers devices and initializes them with the corresponding drivers. It then proceeds to start the first user process.

The initial boot process of PC hardware is essentially unchanged since the first PC that had an 8086 processor running 16-bit code. This means a lot of the complexity of the boot process stems from these early systems and serves only to prepare the system for protected mode.

For a VM we can remove this legacy code. When a VM is set up, the host system is already up and running. The VMM sets up the system, initializes the virtual devices, sets up the physical memory and configures the CPU for guest mode. This means the VMM can do both the tasks of the BIOS and the boot loader and start up the system in protected mode.

Doing so dispenses with the need for a BIOS emulation and thus reduces the complexity of the VMM. For Intel VT systems, booting the system directly in protected mode also obviates an instruction emulator.

The Linux kernel supports booting from machines that employ the EFI¹ for machine initialization and booting. EFI boots systems in protected mode
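The machine description the VMM hands to the guest in place of the BIOS and boot loader can be pictured with a small structure. This is an illustrative sketch only; the type and field names are invented and do not reflect the actual Linux boot-protocol layout:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the data the BIOS and boot loader would
// otherwise provide to the kernel.
struct VmBootInfo {
    uint64_t mem_base;      // start of guest-physical RAM
    uint64_t mem_size;      // size of guest-physical RAM
    uint32_t video_mode;    // video mode of the virtual graphics card
    char     cmdline[256];  // kernel boot arguments
};

// The VMM does the BIOS's and boot loader's job: describe the machine
// to the guest kernel, then enter it directly in protected mode.
VmBootInfo make_boot_info(uint64_t ram_bytes, const char* cmdline) {
    VmBootInfo bi{};                 // zero-initialize all fields
    bi.mem_base = 0;
    bi.mem_size = ram_bytes;
    bi.video_mode = 0;               // e.g. plain text mode
    std::strncpy(bi.cmdline, cmdline, sizeof(bi.cmdline) - 1);
    return bi;
}
```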
source code of the software, because doing so would violate its trade secrets. But without being able to fully analyze the software, how can company A be confident that the software will not leak the confidential information? What is needed here is confinement, a property of a system that enforces that a process cannot leak information to third parties. So if company A uses a system that implements confinement, it can be confident that the software bought from company B cannot leak the confidential information.

Such problems can only be mitigated with an adequate access control mechanism.

Early L4 microkernels implemented a simple access control scheme: clans and chiefs. Every task belongs to a clan, and may freely communicate with any task therein. One task of the group is the chief. Communication between tasks of different groups must pass through their corresponding chiefs. IPC is dispatched and redirected by the microkernel, which places a significant performance burden on inter-clan IPC. Therefore, despite being flexible, clans and chiefs is no longer actively used.

In current microkernels such as OKL4 and Fiasco a new mandatory access control mechanism is used: object capabilities. With object capabilities, access to an object is granted only if the task holds the appropriate capability. A task may grant an access permission to another by mapping the corresponding capability. Similar to memory pages, capabilities can be unmapped without the mappee's
this writing the Ankh NIC multiplexer was experimental and I was not able to do reproducible benchmarks.

Graphical User Interface: A graphical user interface (GUI) is provided with an interface to the L4Con protocol; both L4Con and DoPE can be used as back ends. The Linux stub driver implements the Linux framebuffer abstraction. As such it also supports the X11 window system for graphical output.

PCI Bus: I added a stub driver that allows Linux to make use of virtual PCI buses as provided by the IO server.

Hard Disk: With no device multiplexer for block devices in place, I chose to implement support for accessing a physical hard disk. This access must be restricted to one VM. Allowing the VM to make use of DMA transfers is sensitive with respect to security, and adds the VM to the TCB of all applications. Therefore, I implemented two methods to drive the device: either the VM accesses the device's IO memory directly, or these accesses are handed out to the VMM. By handing the device commands out to an external server, the TCB might be further reduced.

5.4 Staged Virtualization

Support for a second-stage VM that does faithful virtualization is done with KVM. Running KVM in a VM upon the Karma VMM required the following adaptions: Karma implements a back end, which provides the guest with a hypercall interface to initiate a world switch, to create and destroy a VM and to set up the VM's memory. On initialization of a second-stage VM, KVM does a hy
List of Figures

2.1 Shadow page table
2.2 Nested Paging
3.1 L4Linux
3.2 User-mode Linux
3.3 Xen
3.4 Lguest
3.5 NOVA
3.6 KVM
3.7 KVM-L4
3.8 VMware Workstation
4.1 Comparison System Calls
4.2 Setup with two VMs
4.3 Comparison of Architectures
4.4 TCB of different hard disk access schemes
4.5 Staged Virtualization
4.6 Envisioned system architecture
5.1 Control Loop
5.2 A comparison of the boot process of PCs and VMs
5.3 Split driver setup
5.4 Ankh network multiplexer with two clients
5.5 Output of diffstat
6.1 Compute Benchmark
6.2 bench.sh
6.3 test.sh
6.4 Bench.sh benchmark
6.5 Kernel Compile Script
6.6 Kernel Compile
6.7 Kernel Compile Overhead
6.8 Vmexit reasons during kernel compile
6.9 Kernel compile with idle VMs
6.10 Ker
5.3.6 Summary
5.4 Staged Virtualization
5.5 Summary

6 Evaluation
6.1 Performance Benchmarks
6.1.1 Compute Benchmark
6.1.2 Custom Benchmark
6.1.3 Kernel Compile
6.1.3.1 Vmexit Reasons
6.1.3.2 Idle VMs
6.1.3.3 Handing IO Memory Accesses to the VMM
6.2 Variation Benchmarks
6.2.1 Impact of Different Page Sizes
6.2.2 Impact of World Switches
6.2.3 Screen Refresh
6.3 Staged Virtualization

7 Real-time Tasks in Virtual Machines
7.1 Background
7.2 Case Study: Xenomai
7.3 Single Unmodified RT Guest
7.4 Multiple Slightly Modified RT Guests
7.5 Modified RT Design
7.6 Summary

8 Outlook and Conclusion
8.1 Outlook
8.2 Conclusion

Glossary
Bibliography
I did this in the hope that an object-oriented design will both simplify the implementation and help to produce clean and easily extensible code. The resulting VMM, which will be called Karma, can serve as a starting point for further research.

In this chapter, I describe details of the VMM implementation. I will begin by explaining the core system, including core virtualization. Then, the system environment is presented, which includes platform virtualization, infrastructure needed to boot Linux, as well as SMP support. In the next section, I present how device virtualization is done. The chapter will be completed with a description of staged virtualization.

5.1 Core System

The Fiasco microkernel has support for the hardware virtualization extension SVM. In this chapter I will describe what the interface for this extension looks like and how it can be used to implement a VMM in unprivileged user land. First, I will describe the coordination of CPU virtualization and how the VM is provided with memory, looking at both the hardware and the Fiasco interface. Having provided this knowledge, I will explain the workings of the control loop.

5.1.1 Hardware Virtualization Interface

In x86 hardware-assisted virtualization the whole CPU state is duplicated for a VM. A world switch from host to guest mode is initiated with the vmrun instruction and left with a vmexit, as described in Section 2.2.3.2. I will now give more details on hardware virtualization an
In both L4Linux and Karma we use a thread to periodically refresh the whole screen. This seriously degrades the overall performance. To show the impact on performance, I measured a kernel compile in L4Linux with and without the refresh thread. The results can be seen in Figure 6.13. The refresh thread causes the performance to drop by about 33 percent. These numbers apply to Karma as well, because the mechanism is the same.

                                      Kernel Compile Time
    L4Linux with Refresh Thread       18m 25s
    L4Linux without Refresh Thread    12m 11s

Figure 6.13: Kernel compile in L4Linux with and without the refresh thread.

Figure 6.14: Comparison of a kernel compile in nested and native KVM (time in seconds over the number of CPUs, for Karma and a nested VM).

A possible optimization is to keep track of the dirty state of the memory pages that contain the images within the framebuffer. A refresh would then only refresh those screen areas whose dirty state has changed.

6.3 Staged Virtualization

The performance of the KVM port as described in Section 4.6 can best be compared to native KVM. I did the kernel compile measurements with the same kernel, compiler and environment configuration as described in Section 6.1.3. Each VM was equipped with 420MB memory. I did not enable the Karma
It defines the layout of physical memory as well as the initial values of segment registers. Information about the 32-bit boot process can be found in the implementation of the Lguest virtual machine in arch/x86/lguest, as well as in the official kernel documentation.

¹ extensible firmware interface

Figure 5.2: A comparison of the boot process of PCs and VMs. (The PC boot process runs from the fixed entry vector through the BIOS, the MBR boot loader and OS initialization; the VM boot process starts directly with a modified OS initialization.)

The VMM has to do all system configuration that is normally done in real mode on behalf of the VM. This includes setting up the VM's initial segment descriptor table (GDT). Linux requires all segments to have the base address zero and the range of 4GB.

The information that is normally queried from the BIOS can be provided to the kernel through a data structure. It should include the video mode of the built-in graphics card and the layout of physical memory. Additionally, it includes the kernel's boot arguments and configuration that is normally provided by the boot loader.

An initial user land may be loaded from a file system image, which is loaded into physical memory. Such a file system image is usually called a ramdisk. For Linux the ramdisk is placed into the uppermost memory region.

A comparison of the boot process of PCs and VMs can be seen in Figure 5.2.

5.2.2 SMP Support

In this section I will i
It is up to the VMM to implement back ends to the virtio device interfaces. By using the virtio interface, I could benefit from existing drivers, but would have to implement complex mappings to the L4 device manager's interfaces, which would significantly increase the complexity of the VMM.

Figure 4.4: TCB of an L4 application where (a) the VM drives the hard disk, (b) hard disk commands are mediated by the VMM and (c) hard disk commands are handed out to an external server.

Instead, I opted to adapt existing L4Linux stub drivers. Because these already fit well to the device manager's interfaces, the translation to IPC should be much easier, which would help in keeping the VMM simple.

Unfortunately, not all devices can be shared using L4 device managers. For example, there is currently no device manager for hard disks available. For such devices one VM in the system may be allowed to directly operate on the physical hardware.

Secure device discovery is enabled by the IO server, which presents its clients with a virtual bus infrastructure. IO allows device access to be restricted to clients in a configurable fashion. It has an interface to map device memory into client address spaces and grants access to IO ports. In this thesis, I want to use IO to present the guest kernel with a virtual bus containing only the hard disk. The guest
Similarly, the VM's memory must be constructed using the microkernel's secure mechanisms, which ensure that a VM can only access memory that has been allocated for its operation. In a high-level view, VMs are protection domains similar to tasks. Therefore, the same memory construction mechanisms that are used for tasks also apply to VMs. VMs are allowed to communicate with their VMMs only.

Peter and colleagues [PSLW09] were able to implement secure VMs by adding only 500 lines of code to the Fiasco microkernel, which is an insignificant increase of the TCB.

As described in Section 2.2.5, object capabilities are a mechanism that allows building systems that adhere to the principle of least authority. With support for object-capability-based access control, a VMM can be built in user land that has the minimal set of permissions. With both object capability and secure hardware virtualization support, Fiasco is a viable choice for this thesis.

Coordination of the VM execution, platform and device virtualization are up to a VMM that is implemented in a task. The VMM runs as an ordinary task and is subject to the strong security properties of the microkernel.

For added security, different critical Linux applications can be run in separate VMs. It is therefore desirable that VMs are isolated from each other. In a scenario where the VMM is shared among several VMs, a malicious VM can attack the VMM and thereby render all its VM siblings compromised. This scenario can
actual implementation is out of scope for this thesis and part of future, more elaborate work.

Figure 7.3: Real-time VM setup. (Each VM contains an RT Xenomai domain with a dispatcher and a Linux domain, running unprivileged; the privileged microkernel scheduler takes the admission and scheduling decisions and dispatches RT and non-RT threads via notifications.)

8 Outlook and Conclusion

8.1 Outlook

Optimizations: Further performance increases can be achieved with the following two optimizations:

- In the current implementation, updates to the IC, such as masking and unmasking of interrupts, are done via hypercalls, each of which causes execution to leave the VM. Instead, the IC stub driver could update the IC state in a VM-accessible shared memory data structure, which is queried by the IC back end.

- In this work, the Fiasco world switch system call returns on every host interrupt, regardless of the VM state. The measured numbers (see Section 6.1.3.1) show that there are 10 times more preemptions than interrupt injections. As an optimization, a flag in the calling thread's UTCB could indicate the availability of virtual interrupts. On physical interrupts, the world switch system call would return only if this flag is set, which would decrease the amount of surplus VMM involvement.

In a more advanced scenario, event th
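The second proposed optimization can be sketched as follows. The flag is imagined to live in the calling thread's UTCB; all names here are hypothetical and only illustrate the decision the kernel would take on a physical interrupt:

```cpp
#include <atomic>

// Sketch: a flag, imagined to live in the VMM thread's UTCB, telling
// the kernel whether virtual interrupts are pending for the VM.
struct VirtIrqState {
    std::atomic<bool> pending{false};
};

// VMM event-thread side: flag a newly arrived virtual interrupt.
void flag_virtual_irq(VirtIrqState& s) {
    s.pending.store(true);
}

// Kernel side (sketch): on a physical interrupt, the world-switch
// system call returns to the VMM only if virtual interrupts are
// pending; otherwise the VM can simply be resumed.
bool should_return_to_vmm(VirtIrqState& s) {
    return s.pending.exchange(false);
}
```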
be avoided if a VMM drives one VM only. Consequently, I will employ one VMM per VM as illustrated in Figure 4.2, which shows a setup with two VMs.

Figure 4.3 gives an impression of the setup of Linux (a), L4Linux (b), KVM-L4 (c) and the envisioned architecture (d). Linux runs in privileged mode, with its applications running unprivileged. L4Linux and its applications run in user land on top of the microkernel. KVM-L4 uses L4Linux and Qemu to do faithful virtualization. The envisioned architecture is similar to KVM-L4 in that it employs a VM. Contrary to KVM-L4, it depends neither on L4Linux nor Qemu, which promises a much reduced resource footprint.

Figure 4.3: Architectural overview of (a) Linux, (b) L4Linux, (c) KVM-L4 and (d) the envisioned system.

4.3 Para-Virtualization

The VMM has to provide the VM with (virtual) devices (platform and device virtualization). One option is to provide the VM with virtual duplicates of physical devices. In general, this requires both an implementation of the device model and an instruction emulator, as described in Section 2.2.5. Another option is to provide a custom device interface. Such an interface may use efficient shared-memory communication and obviate an instruction em
called the VMM. Such hypercalls are done by stub drivers to send commands to device managers.

Hlt: When the guest executes a hlt instruction, the CPU is yielded to other activities that are ready to execute.

Error: An error condition occurred. Possible errors include invalid VM states and shutdown events.

5.1.4 Virtual Interrupts

The control loop can inject only one interrupt per vmrun. However, after leaving VM execution, multiple event threads can receive notifications from their device managers and flag them as pending interrupts. Whereas the control loop injects the first interrupt immediately, the next pending interrupt can only be injected as soon as the VM is left again. Thus, if the VM is left upon physical interrupts only, the interrupt latency for virtual interrupts can be rather large. Even worse, the VMM could receive interrupts at a faster rate than it can inject them, which would result in outstanding interrupts that are never injected.

Therefore, depending on the number of pending interrupts, two different injection schemes are used. If one interrupt is pending, it is injected via the SVM virtual interrupt injection mechanism. If two or more interrupts are pending, the first interrupt is injected as a virtual interrupt, and virtual interrupt intercepts are enabled. The virtual interrupt intercept causes a VM exit at the exact point in time before the virtual interrupt is taken by the guest, e.g. when the gue
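The two injection schemes can be sketched as follows. The structure and field names are illustrative stand-ins, not the real SVM VMCB layout; the point is the decision between plain injection and arming the virtual interrupt intercept:

```cpp
#include <cstdint>
#include <deque>

// Illustrative stand-in for the SVM per-VM control block fields that
// matter here (not the real VMCB layout).
struct Vmcb {
    uint8_t v_irq_vector = 0;       // virtual interrupt to inject
    bool    v_irq_pending = false;  // inject on next vmrun
    bool    virq_intercept = false; // exit when the guest takes the IRQ
};

// Inject one interrupt per vmrun. If more interrupts are pending, arm
// the virtual interrupt intercept so the VMM regains control right
// before the guest takes the injected one, and can inject the next.
void inject_pending(std::deque<uint8_t>& pending, Vmcb& vmcb) {
    if (pending.empty()) return;
    vmcb.v_irq_vector  = pending.front();
    vmcb.v_irq_pending = true;
    pending.pop_front();
    vmcb.virq_intercept = !pending.empty();
}
```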
can be achieved. A virtual machine can be moved away from an overloaded or faulty server. Fault tolerance may also be improved by checkpointing virtual machines: if the virtual machine fails, it can be restored from an earlier checkpoint.

Operating system developers may use the inspection mechanisms provided by virtualization to develop and debug systems. The same inspection mechanisms can be used to monitor the virtual machine's behaviour and alert the operator on any misbehaviour [CN01].

End users use virtualization to run applications written for different operating systems concurrently. Virtual machines may also be used to run outdated operating systems developed for architectures that are no longer physically available. This may extend the lifespan of important legacy applications. Furthermore, virtualization enables legacy applications to run on new operating systems, which may be a way to provide a migration path and thus foster the adoption of the new operating system.

In the following chapters, I will give an introduction to virtualization including definitions of terms that will be used throughout this thesis.

2.2.1 Virtualization Basics

One early attempt at properly defining the term virtualization was made by Popek and Goldberg in 1974 [PG74]. As the basis for their considerations they used a processor architecture with two different modes of execution: privileged mode, which has the full set of instructions available, and unprivilege
clients (L4Linux and a VM running Linux) the client should be able to directly access. As described in Section 5.3.1, I implemented both the stub driver and the back end to make use of this mechanism.

The AHCI interface relies heavily on DMA, both to send commands to the device and to receive data from the device. DMA requires special handling because it requires host physical addresses, and the VM does not know about host physical addresses. Therefore I implemented a hypercall that makes the VMM query the data space provider for the host physical address of guest physical address zero (base address). Linux uses this hypercall once during initialization and saves the base address. Later on, when setting up a DMA transfer, Linux uses this base address to calculate the host physical addresses without VMM involvement.

When I mapped the device memory directly into the VM to allow the Linux driver to command the device, I encountered a problem. Linux accesses this memory through the nested paging address translation. To correctly use device memory, writes must not be cached. Although I made sure that the memory was mapped correctly on the host side (with the cache-disable bits), the writes done by the Linux drivers were cached, and the device did not work. I could verify that caching is a problem by modifying the driver to explicitly flush the appropriate cache line (using the clflush instruction) after each IO memory write, whereupon the device worked as expected.
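The address calculation enabled by the one-time hypercall can be sketched as follows. It assumes, as the scheme described above does, that the guest-physical memory is contiguously backed on the host; the names are illustrative:

```cpp
#include <cstdint>

// Sketch of the DMA address calculation: a single hypercall at guest
// initialization obtains the host-physical address backing
// guest-physical address zero; afterwards the guest translates DMA
// addresses without any further VMM involvement. Assumes contiguously
// backed guest memory.
struct DmaTranslator {
    uint64_t host_base;  // result of the one-time hypercall

    // Translate a guest-physical address to a host-physical address
    // for programming a DMA transfer.
    uint64_t guest_to_host_phys(uint64_t guest_phys) const {
        return host_base + guest_phys;
    }
};
```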
3.3.3 NOVA OS Virtualization Architecture
3.3.4 KVM
3.3.5 KVM-L4
3.3.6 Hardware Assisted Virtualization for the L4 Microkernel
3.3.7 VMware Workstation

4 Design
4.1 Requirements
4.2 Architecture
4.3 Para-Virtualization
4.4 Security Considerations
4.5 Virtual Machine Monitor
4.5.1 Core Virtualization
4.5.2 Platform Virtualization
4.5.3 Peripheral Devices
4.5.4 System Environment
4.5.5 Multiprocessor Support
4.6 Staged Virtualization
4.7 Summary

5 Implementation
5.1 Core System
5.1.1 Hardware Virtualization Interface
5.1.2 Fiasco Hardware Virtualization Interface
5.1.3 Control Loop
5.1.4 Virtual Interrupts
5.2 System Environment
5.2.1 Boot Loader
5.2.2 SMP Support
5.3 Peripheral Devices
5.3.1 Design Pattern
5.3.2 Serial Line
5.3.3 Graphical Console
5.3.4 Network Interface
5.3.5 Hard Disk
5.3
for their companionship and their helpful remarks.

I also want to thank my parents for funding a large part of my studies. Stefanie deserves special thanks, because she rarely got to see me due to my work commitment.

Contents

1 Introduction
1.1 Structure

2 Background
2.1 Microkernels
2.1.1 Building Decomposed Systems with Encapsulated Components
2.1.2 Access Control
2.2 Virtualization
2.2.1 Virtualization Basics
2.2.2 Nomenclature
2.2.3 Virtualization of the x86 Architecture
2.2.3.1 Software Virtualization
2.2.3.2 Hardware Virtualization
2.2.4 Platform Virtualization
2.2.5 Peripheral Device Virtualization
2.3 Discussion: Are VMMs Microkernels Done Right?
2.4 Summary

3 Related Work
3.1 Microkernels
3.1.1 Fiasco
3.1.2 L4Ka::Pistachio
3.1.3 EROS
3.2 Rehosted Operating Systems
3.2.1 L4Linux
3.2.2 OK Linux aka Wombat
3.2.3 User-Mode Linux
3.2.4 Pre-virtualization
3.2.5 Summary
3.3 Virtual Machine Monitors
3.3.1 Xen
machine monitors microkernels done right? ACM SIGOPS Operating Systems Review, 40(1):95-99, 2006.

S. Hand, A. Warfield, K. Fraser, E. Kotsovinos, and D. Magenheimer. Are Virtual Machine Monitors Microkernels Done Right?

Intel. Intel Multiprocessor Specification. http://www.intel.com/design/pentium/datashts/242016.HTM. Online, accessed 14.10.2009.

Adam Lackorzynski. L4Linux Porting Optimizations. Diploma thesis, University of Technology Dresden, 2004.

Jochen Liedtke. Improving IPC by kernel design. In Proceedings of the fourteenth ACM symposium on Operating systems principles, pages 175-188. ACM, New York, NY, USA, 1994.

Jochen Liedtke. On microkernel construction. In 15th ACM Symposium on Operating System Principles (SOSP), 1995.

Jochen Liedtke. Toward real microkernels. Commun. ACM, 39(9):70-77, 1996.

ARM Limited. ARM Security Technology: Building a Secure System using TrustZone Technology. http://infocenter.arm.com/help/topic/com.arm.doc.prd29-genc-009492c/PRD29-GENC-009492C_trustzone_security_whitepaper.pdf. Online, accessed 21.01.2010.

J. LeVasseur, V. Uhlig, Y. Yang, M. Chapman, P. Chubb, B. Leslie, and G. Heiser. Pre-virtualization: soft layering for virtual machines. In Computer Systems Architecture Conference (ACSAC 2008), 13th Asia-Pacific, pages 1-9, 2008.

B. Leslie, C. van Schaik, and G. Heiser. Wombat: A portable user-mode Linux for embedded systems. In Proceedings o
model, to be implemented. Another source of complexity comes from the means needed to safely emulate device memory. In x86, device memory may be mapped straight into the physical memory space of the machine. It is then accessed using regular memory operations. A write to device memory is usually unbuffered and interpreted by the device as a command, which is immediately acted upon. To provide the VM with the illusion of direct device memory access, the VMM has to recreate this behaviour. However, there is no way to trap on memory accesses other than page faults. Therefore, the VMM has to make sure that accesses to device memory regions in guest memory cause a page fault and thus enable the VMM to gain control. It may then emulate the instructions accessing the memory region and update the device models accordingly before returning to the VM. This requires the VMM to provide an instruction emulator, which comes with the risk of bad performance and increases the VMM's complexity. In the literature, this technique is often called full or faithful virtualization.

Device operations often consist of many commands that update the device state incrementally until the actual operation is executed. Because in faithful virtualization each IO operation has to be handled by the VMM, such behaviour causes significant numbers of world switches, thus incurring substantial performance penalties.

A different approach is to present the VM with a custom device interface that does not r
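The trap-and-emulate path described above can be illustrated with a small simulation (a sketch in Python; all names are illustrative, and a real VMM would decode the faulting instruction from guest memory rather than receiving the address and value directly):

```python
# Minimal simulation of trap-and-emulate for device memory (MMIO).
# The MMIO region is left unmapped, so every guest access page-faults
# into the VMM, which emulates the access against a device model.

class SerialModel:
    """Device model updated incrementally by trapped writes."""
    DATA_REG = 0x0

    def __init__(self):
        self.output = []

    def write(self, offset, value):
        if offset == self.DATA_REG:
            self.output.append(chr(value))

class Vmm:
    MMIO_BASE = 0x3F8  # hypothetical device memory region

    def __init__(self):
        self.serial = SerialModel()

    def handle_page_fault(self, fault_addr, value):
        # The VMM gains control on the fault, emulates the access,
        # updates the device model, then resumes the VM.
        offset = fault_addr - self.MMIO_BASE
        self.serial.write(offset, value)
        return "resume"

vmm = Vmm()
for ch in b"ok":
    vmm.handle_page_fault(Vmm.MMIO_BASE, ch)
print("".join(vmm.serial.output))  # -> ok
```

The cost discussed in the text shows up here as well: every single register write takes a full round trip through the fault handler, which is why faithful virtualization of chatty devices causes so many world switches.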
more of the scheduling facilities will be moved from the guest VMs to the microkernel.

In the envisioned design, all scheduling decisions are taken at the host. Just like in an ordinary RT OS, the host is the only entity in the system that has a global view on all RT tasks and can take informed decisions. Support for scheduling decisions that require intimate knowledge of the synchronization facilities and resource access protocols provided by the guest requires further research, which is out of scope for this work. The steps needed to make a job actually run are done by the guest's dispatcher. I will now detail the design with a description of its workings.

Before entering them into RT execution mode, the RT guests register each of their RT tasks with the host, which performs an admission test encompassing the RT tasks of all guests. For any admitted task, the host establishes an identifier, which is also brought to the guest's notice.

In RT operation mode, the host scheduler picks the most eligible job and generates an upcall to the job's VM. The job's identifier, which accompanies the upcall, allows the guest's dispatcher to find the corresponding job and resume its execution. Job completion would be signaled to the host with a hypercall.

The envisioned design allows multiple RT VMs. Therefore, separate VMs could be used to isolate RT tasks from one another. Because the host scheduler is provided with enough information, and a defined communicatio
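The registration, admission, and upcall protocol described above can be sketched with a toy model (Python; the names, the utilization-based admission test, and the earliest-deadline-first eligibility rule are assumptions for illustration, not the thesis interface):

```python
# Toy model of host-level RT scheduling: guests register tasks,
# the host runs an admission test over all guests' RT tasks,
# then picks the most eligible job and upcalls its VM.

class Host:
    def __init__(self, capacity=1.0):
        self.capacity = capacity
        self.tasks = {}      # identifier -> (vm, utilization, deadline)
        self.next_id = 0

    def register(self, vm, utilization, deadline):
        # Admission test encompassing the RT tasks of *all* guests:
        # here simply total utilization must not exceed capacity.
        total = sum(u for _, u, _ in self.tasks.values()) + utilization
        if total > self.capacity:
            return None      # task rejected
        tid = self.next_id
        self.next_id += 1
        self.tasks[tid] = (vm, utilization, deadline)
        return tid           # identifier also brought to the guest's notice

    def schedule(self):
        # Most eligible job: earliest deadline first (an assumption).
        tid = min(self.tasks, key=lambda t: self.tasks[t][2])
        self.tasks[tid][0].upcall(tid)

class GuestVm:
    def __init__(self):
        self.ran = []

    def upcall(self, tid):
        # The guest's dispatcher finds the job by its identifier.
        self.ran.append(tid)

vm = GuestVm()
host = Host()
a = host.register(vm, 0.5, deadline=10)
b = host.register(vm, 0.4, deadline=5)
assert host.register(vm, 0.3, deadline=7) is None  # over capacity
host.schedule()
print(vm.ran)  # -> [1]
```

Job completion would then be reported back with a hypercall, closing the loop between the host scheduler and the guest dispatcher.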
next experiment, I increased both slowdowns by a factor of 10 to simulate analysis in an external server, which requires IPC and is therefore slower than analysis in the VMM.

Whereas both experiments use pessimistic numbers, the measured numbers indicate that the overall slowdown is small, and the proposed setup is feasible.

Below are the kernel compile times with the same setup as in Section 6.1.3 for comparison.

For the kernel compile benchmark, the additional world switches resulting from the hypercalls that instruct Karma to do IO memory accesses (read and write) on behalf of the VM do not impose a significant performance degradation. Even command analysis in a separate task is feasible, and the performance degradation is small.

Figure 6.10: Comparison of kernel compile times with different types of direct hard disk access. (0) is the kernel compile in a VM with direct hard disk access, whereas in (1) and (2) Linux has no access to IO memory and instructs Karma via hypercalls. In (1), Karma spends 10000 cycles before writing the command and 3000 cycles before reading values, to simulate analysis of the commands. In (2), Karma waits 100000 cycles before writing and 30000 before reading, to simulate command analysis by a separate task.
on top of microkernels will be presented. Chapter 3 introduces important examples that are in use today, and discusses their strengths and weaknesses. In Chapter 4, I will present a design that is of lower complexity and offers better performance than previous solutions. Chapter 5 gives insight into the implementation of a prototype that I developed. This prototype will be evaluated in Chapter 6. In Chapter 7, I present three solutions on how real-time (RT) tasks can be supported in the future. The thesis will be rounded off with a short outlook on future projects that are enabled by this work, as well as a conclusion.

2 Background

In this chapter I will first give an overview of the importance of security in modern computing. Thereafter, I will introduce microkernels as a key technology on the way to more secure computing. The discussion of the security aspect will then be completed by an overview of access control. I will then proceed by introducing virtualization. The chapter will conclude with a short discussion on virtual machine monitors versus microkernels.

2.1 Security

Computers are taking on new importance in our lives. People use their computers to store and manage private data such as digital pictures and diaries, and even do banking. Companies rely heavily on the digital work flow and store huge amounts of mission-critical data. The confidentiality of data, meaning that it may only be revealed to authorized persons or systems, has
protection domains within the kernel.

These flaws are deeply embedded within the OS, which makes attempts to fix them difficult, especially if backward compatibility has to be ensured. Therefore, it is unlikely that the deficiencies of existing OSes will be removed in the near future. Instead, new systems will emerge that follow new architectural paths.

Because of their architectural merits, microkernels are likely to serve as a foundation for new systems. In microkernel-based systems, the amount of code running in privileged mode is small enough to be thoroughly audited or even validated using formal methods. OS services, such as device drivers and protocol stacks, are implemented in user land, with address spaces providing hardware-enforced isolation boundaries. In such a system, faults are contained, i.e., a faulty component such as a failing device driver cannot bring down the whole machine. Accordingly, microkernel-based systems allow applications with tiny trusted computing bases.

Instead of ACLs or derivatives thereof, modern microkernels use capability-based schemes for access control, which foster adherence to the principle of least authority at the system level. That is, applications can be configured with the minimal set of privileges needed for their correct operation. An attacker who seizes control over an application is therefore also restricted to the privileges that were selectively granted. The obvious goal th
s TCB.

4.5.4 System Environment

On x86 systems, OS kernels are usually loaded by a boot loader, which typically operates in real mode. It is installed in a predefined location and started directly after BIOS initialization. Its task is to prepare the machine for the OS (bootstrap) according to a boot protocol. Upon receiving control, the OS queries the BIOS for system information, initializes itself and switches to protected mode, which is then used throughout system operation.

I decided not to support real mode code, which is no restriction for most standard OSes apart from booting. However, with no support for real mode code, no standard boot loader can be used. Instead, it is the VMM's duty to bootstrap the guest. Guests are started directly in protected mode. BIOS calls are available in real mode only, which is not available in the VM. Instead, BIOS information such as the physical memory layout has to be provided by the VMM.

4.5.5 Multiprocessor Support

One of the goals of this thesis is to make multiple CPUs available to a VM (SMP). As described in Section 4.5.1, a virtual CPU is implemented with a thread running the control loop.

To run multiple CPUs, the VMM employs multiple control loops, each with its own thread. If the control loops are mapped onto different physical CPUs, the VM can make use of real concurrent execution.

In SMP systems, CPUs communicate with one another using inter-processor interrupt
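The one-thread-per-virtual-CPU arrangement described above can be sketched as a toy model (Python; the names are illustrative, not the VMM's interface, and each loop here drains a queue of pending virtual interrupts instead of calling into the kernel):

```python
# Toy model: one control-loop thread per virtual CPU (vCPU).
# Mapping the loops onto different physical CPUs would give the
# VM real concurrent execution; IPIs arrive as queued events.

import queue
import threading

class VCpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.pending = queue.Queue()   # pending virtual interrupts
        self.handled = []

    def send_ipi(self, vector):
        # Another vCPU (or a device back end) posts an interrupt.
        self.pending.put(vector)

    def control_loop(self):
        # Each control loop runs in its own thread.
        while True:
            vector = self.pending.get()
            if vector is None:          # shutdown marker
                return
            self.handled.append(vector)

vcpus = [VCpu(i) for i in range(2)]
threads = [threading.Thread(target=v.control_loop) for v in vcpus]
for t in threads:
    t.start()

vcpus[0].send_ipi(0xF0)   # e.g. a reschedule IPI, vector assumed
vcpus[1].send_ipi(0xF0)
for v in vcpus:
    v.pending.put(None)   # stop both loops
for t in threads:
    t.join()
print([v.handled for v in vcpus])  # -> [[240], [240]]
```

This mirrors why the bench.sh workload mentioned earlier stresses the IPI path: every process migration makes one vCPU post an event that another vCPU's control loop must pick up and inject.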
solution supports VMs with multiple CPUs. In my measurements, SMP scaled well.

The complexity of the solution is low, with the VMM comprising about 3900 SLOC and the required kernel patch consisting of only about 2700 SLOC of modifications. The kernel patch could be ported to a new kernel version in about one hour, which indicates that the maintenance costs are low. The tiny VMM implementation lends itself to being reused by other researchers for their own projects.

My solution is only applicable on machines that support hardware-assisted virtualization. However, I expect that off-the-shelf desktop computers will be equipped with processors capable of hardware virtualization in the near future. Today, the situation is different in the embedded domain. Hardware virtualization is not available, for example, on ARM, and we have to resort to L4Linux for such machines.

Glossary

ABI: Application Binary Interface
ACL: Access Control List, a mechanism for access control, which is employed in standard OSes
ACPI: Advanced Configuration and Power Interface, a standard for a common interface for device and power settings
AHCI: Advanced Host Controller Interface, an interface for Serial Advanced Technology Attachment (SATA) controllers
APIC: Advanced Programmable Interrupt Controller
ASID: Address Space Identifier
BIOS: Basic Input Output System
GDT, GPL, Guest, Host, Hypervisor, IC, IDE, IPC, IPI, MMU, NIC, NTP, OS
DMA: Dire
sure that the isolation property holds, the workings of the kernel have to be verified either per extensive testing, by code reviews, or even by formal verification. Verification techniques available today scale poorly and are typically limited to hundreds or at best thousands of lines of code [Hei05].

The idea of building a system on top of a small kernel, a microkernel, came up in the 1980s. Early solutions were built with Unix in mind. The most important representative of such first generation microkernels is Mach. Mach has a number of system calls that are geared towards supporting Unix. Moreover, its task and thread structures are closely aligned with those of Unix. Inter-process communication (IPC) is asynchronous, which requires in-kernel complexity such as buffer allocation and queue management.

At that time, first generation microkernels were very much en vogue in the systems research community. Numerous different operating system personalities such as Unix and DOS were implemented on top of them, essentially proving the versatility of microkernels. While drivers remained inside the kernel, a new concept was introduced: user-level pagers. With this new concept, memory management could now be done in user land, which helped to achieve much more flexibility [ABB+86].

Unfortunately, Mach and other microkernels showed a significant overhead compared to monolithic systems. Consequently, they did not catch on in the marketplace,
to be enforced at all times, because the disclosure of confidential data may cause substantial economic damage. Similarly, the integrity of data has to be ensured. For a company to rely on data, unauthorized modifications must be prevented or at least be detectable. Another major issue is the availability of data. If a trade in the stock market is delayed, the prices may already have risen, causing losses for the trader. All in all, we can say that the demand for security of the information we manage on computers today has risen.

Most of the operating system kernels in use today, such as Microsoft Windows, Linux and BSD, are implemented as one single entity. Subsystems communicate by means of procedure calls and shared memory. There are no practical boundaries between subsystems, and we refer to such systems as being monolithic.

An analysis of the software stack of current monolithic operating systems reveals severe problems concerning robustness and security. An attack on any subsystem such as a network stack can be used to gain control over the full machine. Even the latest incarnation of Windows is prone to these attacks, as became visible to the general public when Microsoft issued a security bulletin describing a vulnerability in its file sharing protocol stack [Cor]. A study on Microsoft Windows XP discovered that the majority of crashes was caused by bad drivers [GGP06]. Such crashes cannot be avoided beforehand, because drivers are written by thir
to restrict the privileges of individual program instances. Any application started by the user inherited all of his privileges, not only those needed to fulfill its task. Although this mechanism was sufficient at that time, today the usage scenario is different. As computers are hooked up to the Internet nowadays, users often face software whose trustworthiness they cannot assess. Even worse, software can run without the explicit consent of the user. Web browsers, for example, automatically execute scripts downloaded from websites to enrich the web experience. Although the usage scenario changed, the assumptions that underlay the security mechanisms were never questioned. As such, modern systems have to cope with attacks that they were not designed to resist.

In addition to inadequate privilege mechanisms, the growing complexity of the kernel becomes problematic. Standard OSes were devised at a time when people did not foresee the complexity of today's systems. Over time, the computer evolved into the capable machine that we use today. With every innovation, the OS vendors had to implement new features, thus adding to the OS's complexity. Complex systems tend to be more prone to bugs, which can reduce the reliability of the system on their own or even be exploited by attackers. Mechanisms that are employed to mitigate the risks of complexity at user level are not readily applicable to kernel components, as address spaces cannot be used to implement
vesting the VMM with device drivers that directly access the machine and are multiplexed by the VMM, or by relying on services of the host operating system.

The third approach is to give direct device access to the VM. However, great care is needed to avoid security pitfalls. Direct memory access (DMA) allows devices to directly read and write main memory without OS intervention, bypassing its isolation mechanisms such as paging. If the VM is allowed to directly drive devices using DMA, it must be counted as part of the TCB of the system. This huge increase of the TCB is not desirable for most scenarios. Hardware innovations that mitigate this problem are already in preparation: IO-MMUs can be used to provide devices with a virtual view on main memory and thus restrict access [AMD07].

Another solution would be to use the VMM as a proxy for direct device access. The guest OS uses an adapted device driver that, instead of directly writing to device memory, issues a call to the VMM, which can then inspect and validate the command and issue it to the device memory on behalf of the VM.

Devices such as network interface cards (NICs) even support multiplexing in hardware and can be used as is, as long as they do not facilitate DMA.

2.3 Discussion: Are VMMs Microkernels Done Right?

Hand and colleagues [HWEF] initiated an interesting discussion by boldly claiming that virtual machine monitors, while striving for different targets, achieve many of the goals of
700 SB800 SATA Controller

- Fiasco
  - Version: 20 October 2009
- L4Linux
  - Version: 12 November 2009
  - Linux Version: 2.6.31
  - Graphics: VESA, Mode 0x117
- Native Linux
  - Version: 2.6.31.5
  - Graphics: VESA, Mode 0x117
- Karma
  - Version: 4 January 2010
  - Graphics: VESA, Mode 0x117
- KVM Setup
  - Nested Paging: Enabled
  - VGA: Disabled
  - Hard Disk Access: IDE
  - Version: KVM-84, Qemu 0.9.1
  - Host: Linux 2.6.31 PAE

Time Sources: Time inside a VM is not guaranteed to be in compliance with the wallclock time. Instead, it may drift, and therefore be inaccurate. In my measurements, I took such drift into account by using an external clock source.

Linux running in a VM on top of Karma issues hypercalls to get time stamps. The Karma VMM receives the hypercall and creates a time stamp from the CPU's time stamp counter. A time stamp was taken immediately before and after the benchmark, and the delta denoted the benchmark measurement.

In KVM, I used ntpdate immediately before and after the benchmark to make sure clock drift inside of KVM is evened out. Ntpdate is a tool that synchronizes the wall clock to an external server using the network time protocol (NTP).

Structure: In the next sections, I will present the measurements I did to evaluate my work. First, I will present the benchmarks I used to give an impression of the existing implementation. These benchmarks are concerned wi
Diplomarbeit

Lightweight Virtualization on Microkernel-based Systems

Steffen Liebergeld

27 January 2010

Technische Universität Dresden
Fakultät Informatik
Institut für Systemarchitektur
Professur Betriebssysteme

Supervising professor: Prof. Dr. rer. nat. Hermann Härtig
Supervising staff member: Dipl.-Inf. Adam Lackorzynski

Technische Universität Dresden, Fakultät Informatik

TASK DESCRIPTION FOR THE DIPLOMA THESIS

Student name: Steffen Liebergeld
Degree program: Informatik
Matriculation number: 3013853
Topic: Leichtgewichtige Virtualisierung in Mikrokernsystemen (Lightweight Virtualization in Microkernel Systems)

Objective:

Traditional operating systems, owing to their complexity and the security holes that consequently cannot be ruled out, are not a suitable platform for applications with high isolation requirements. Microkernel systems, with their potentially small trusted computing base, offer an alternative. Decisive for the applicability of microkernel systems is the extent to which components can be isolated from one another, and at what cost in performance this isolation can be realized.

One class of components that is highly relevant because of its utility is complete operating systems. Their isolation is a particular challenge, because operating systems make very specific assumptions about their execution environment. In principle, adaptations are necessary in three areas: the kernel system
In Proceedings of the 9th conference on USENIX Security Symposium, Volume 9, pages 10-10. USENIX Association, Berkeley, CA, USA, 2000. 11

Rusty Russell. Lguest: The Simple x86 Hypervisor. http://lguest.ozlabs.org/. Online, accessed 11.01.2010. 24

R. Russell. virtio: towards a de-facto standard for virtual I/O devices. 2008. 35

skas mode. http://user-mode-linux.sourceforge.net/old/skas.html. Online, accessed 23.09.2009. 20, 21

Udo Steinberg. NOVA Microhypervisor. http://os.inf.tu-dresden.de/~us15/nova/. Online, accessed 15.01.2010. 24

Julian Stecklina. Remote Debugging via Firewire. Diploma thesis, University of Technology Dresden, 2009. 76

J. Sugerman, G. Venkitachalam, and B.H. Lim. Virtualizing I/O devices on VMware workstation's hosted virtual machine monitor. In USENIX Annual Technical Conference, pages 1-14, 2001. 27

[sys] User Mode Linux SYSEMU Patches. http://sysemu.sourceforge.net/. Online, accessed 23.09.2009. 21

[Vog09] Dirk Vogt. L4ReAnimator: A restarting framework for L4Re. Diploma thesis, University of Technology Dresden, 2009. 76

[WB07] N.H. Walfield and M. Brinkmann. A critique of the GNU Hurd multi-server operating system. 2007. 18
[LW09].

In Figure 3.7, you can see a setup with KVM-L4 running a VM on top of the Fiasco microkernel.

3.3.6 Hardware-Assisted Virtualization for the L4 Microkernel

In his Diploma thesis, Sebastian Biemüller introduced hardware-assisted virtualization on top of an L4 microkernel [Bie06]. He proposed an extension to the L4 API to support efficient virtualization. For this purpose, the VMM was split into two components: a hypervisor that runs privileged, and a monitor that runs in user land. The prototype implementation used Intel's VT-x virtualization extension and implements a software memory management unit.

The prototype implementation of this VMM, as presented in the thesis, does not have multiprocessor support for virtual machines and supports only an L4 microkernel as guest OS. It implements an interrupt controller, a timer device and a serial console. Similar to KVM-L4, this work does not significantly increase the TCB of applications running side by side. However, it does not have a capability-based system as a substrate. Unfortunately, no evaluation information about this work has been published.

3.3.7 VMware Workstation

VMware Workstation is a hosted virtual machine architecture. It runs on commodity operating systems such as Windows and Linux and supports virtualization both with and without hardware extensions. VMware Workstation installs a driver into the host operating system that implements the privileged part of the VMM (VMDriver). A
M. One scenario is that a guest application might attack the VMM and thereby compromise its Linux kernel. Under the assumption that the world switch path is immune to attacks, the only attack vector is the hypercall interface of the VMM. By configuring the VM such that the vmmcall instruction is privileged, and thus hypercalls are allowed to the guest kernel exclusively, this class of attacks can be avoided. Another interaction between the VMM and guest applications are preemptions. However, these preemptions are transparent to the VM and cannot be exploited.

The guest kernel may attack the VMM through the hypercall interface. However, even if it succeeds, the security situation of the remainder of the system is unchanged. That is because the VMM is encapsulated and assumed to act on behalf of the VM. Because the VMM exercises control over the attacker's VM only, the only privilege that an adversary could gain is access to its own VM. As the attacker runs in privileged mode in the VM, it holds full control over the VM regardless of whether it controls the VMM or not.

Another source of attacks are external attackers. The input to the VM is mediated by device managers. It is up to the device manager to do protocol multiplexing and to fend off attacks on the protocol.

4.5 Virtual Machine Monitor

The VMM coordinates the VM execution and provides it with resources such as virtual devices and memory. It runs encapsulated in an L4 task and manag
PU states and guest-initiated shutdown events.

In the next refinement step, I implemented support for virtual interrupts. The back ends of virtual devices contain a blocking thread. Whenever a device manager has new input, it notifies this thread with an IPC message, or a User Irq. The thread then registers the event at the virtual IC as a pending interrupt, a mechanism which will be described in Section 5.2. On each VM exit, the control loop checks the IC for pending interrupts. If an interrupt is pending, it is injected into the VM. Interrupt injection will be described in more detail in Section 5.1.4.

Stub drivers issue hypercalls to send commands to driver back ends. Hypercalls are issued by providing the required information in registers and shared memory, and voluntarily leaving the VM execution with the vmmcall instruction. The control loop handles hypercalls by analyzing the predefined designation register and forwarding the command to the back end component for processing. Driver back ends will be explained in detail in Section 5.3.1.

    loop_forever
    begin
      if interrupt_pending()
        inject_interrupt()
      vmexit_reason = l4_vm_run_svm(VMCB, registers)
      switch (vmexit_reason)
      begin
        case Vmmcall:
          send_command_to_driver_back_end()
          break
        case Asynchronous_exit:
          break
        case Hlt:
          wait_until_interrupt_pending()
          break
        case Error:
          print_error_message()
          shutdown_vm()
          break
      end
    end

Figure 5.1: Illustration of
Summary

Rehosting an operating system requires substantial work, both in adapting the OS to the host OS's interface and in implementing optimizations in the host OS to aid in rehosting. With lots of optimization applied, rehosted operating systems achieve good performance. However, the achievable performance is limited because of the overhead of mapping operations to host mechanisms.

3.3 Virtual Machine Monitors

3.3.1 Xen

Xen is a virtual machine monitor for the x86 architecture. It provides an idealized virtual machine abstraction to which operating systems such as Linux, BSD and Windows XP can be ported with minimal effort [BDF+03]. Both systems with and without hardware virtualization extensions are supported.

Guest operating systems must be ported to run on top of Xen. The required modifications are reasonably small for Linux: only about 3000 lines of code had to be added, which is about 1.36% of the total x86 code base. Xen provides good performance, with a small virtualization overhead of only a few percent [BDF+03].

Xen is designed to be in full command of the machine, with virtual machines running in less privileged mode. The Xen kernel is a combination of a VMM and a hypervisor. It is usually referred to as a hypervisor.

The idealized virtual machine abstraction implemented by Xen has a number of restrictions compared to the full machine interface. A guest cannot install fully privileged segment descriptors, as segmentation is used to protec
Figure 6.11: Comparison of the performance of bench.sh and a kernel compilation with 4K and with 4M pages (superpages).

6.2 Variation Benchmarks

In this section, I will present three benchmarks. In these benchmarks, I tweaked system parameters and measured their impact on the overall performance. I did this to get an impression of where the system can be optimized.

6.2.1 Impact of Different Page Sizes

An interesting aspect is the page granularity of the host address space that is used as guest physical memory. With nested paging, two address translations need to be done: from guest virtual to guest physical, and from guest physical to host physical. Both translations are cached in the translation lookaside buffer (TLB). To reduce the TLB pressure, address spaces used for nested paging are augmented with an address space identifier (ASID). The host has ASID zero, whereas VMs can be assigned their own IDs. With ASIDs, VM page translations do not evict TLB entries of the host.

If the host address space used as guest physical memory is built with super pages (4MB), less work is required to translate from guest physical to host physical. To show this effect, I measured both bench.sh and a kernel compile in a VM with 3 CPUs and 512MB RAM, both with 4K pages and 4M pages. The results can be found in Figure 6.11.

The impact of a fine-grained host address space on the guest's pe
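The benefit of superpages can be made concrete with back-of-the-envelope arithmetic (a sketch only, not measured data; it assumes the 512 MB guest memory of the experiment and simply counts how many host mappings, and thus worst-case TLB entries, are needed to cover the guest-physical address space):

```python
# Back-of-the-envelope: coverage of the host address space that
# backs guest-physical memory, 4 KB pages vs 4 MB superpages.

GUEST_RAM = 512 * 1024 * 1024   # 512 MB of guest-physical memory

def mappings_needed(page_size):
    # Number of host mappings needed to cover all of guest-physical
    # memory; each mapping competes for a TLB entry.
    return GUEST_RAM // page_size

small = mappings_needed(4 * 1024)          # 4 KB pages
large = mappings_needed(4 * 1024 * 1024)   # 4 MB superpages
print(small, large, small // large)  # -> 131072 128 1024
```

With superpages, a factor of 1024 fewer mappings cover the same guest memory, which is why the coarser host address space reduces both TLB pressure and the work of the guest-physical to host-physical translation.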
VMM's refresh thread, and I disabled VGA output in both KVM and nested KVM. The measured numbers can be seen in Figure 6.14.

The KVM port is in an early state, and little optimization has been applied so far. In this setup, each KVM world switch requires two physical world switches: one to issue the vmrun hypercall, and another to switch into the second-stage VM.

7 Real-time Tasks in Virtual Machines

Electronic control units have changed the way complex systems are built. They find application in various fields ranging from commodity multimedia players, over industrial installations, to safety-critical airplane controls. In these scenarios, the logical correctness of a computation alone is not sufficient. The utility also depends on the timely availability of the result.

Complex systems such as cars contain dozens of controllers, which poses a problem with regard to cost, weight, power consumption and reliability. Device manufacturers therefore strive to consolidate applications that used to run on dedicated control units onto fewer compute nodes. However, running on the same node gives rise to the problem of erratic interference. A crashing controller task must not bring down other tasks. Likewise, all tasks on a node must be executed in a timely manner that allows them to meet their deadlines. The OS running the node must therefore enforce strong spatial and temporal isolation. Informally, temporal isolation is ensured if the only reason for a jo
a microkernel-based system. Their arguments are that VMMs establish a more versatile interface than microkernels do and mitigate the need for fast IPC. The microkernel community was quick to respond and refute most of Hand's arguments [HUL06].

However, I think that the discussion is worthwhile because, despite comparing apples with oranges, it sheds light on important similarities between microkernels and VMMs. While microkernels surely represent the smallest possible kernels, VMMs may also be small in size compared to monolithic kernels. Both systems multiplex memory and CPU time for their clients. The difference lies in the interfaces they present: VMMs present the full machine interface, whereas microkernels provide an abstracted interface.

Virtualizing an operating system alone does not automatically provide stronger security: VMMs can only provide isolation as strong as separate physical machines do. However, it is worth looking at lessons learned from microkernel systems research. With microkernels it is possible to build applications that have a tiny TCB.

Given the premise that commodity operating systems are inherently insecure and it is not possible to make them more secure without compromising application compatibility, how can we use the untrusted applications of a commodity operating system for tasks that demand strong security?

The solution may be a combination of a microkernel and a VMM in a split application scenar
a port of Linux onto a modern second generation microkernel incurs an overhead over native Linux of only 5 to 10% [HHL+97].

2.1.2 Access Control

Microkernels establish isolation between tasks with separate address spaces and allow those components to communicate with IPC. To enforce an access control policy, mediation of IPC is needed. In the following, I will describe some of the challenges therein.

Each software component is a possible target for attackers and, if compromised, also puts the data it can access at risk. Therefore, in a secure system, a software component should be given the minimal set of access permissions that are needed for its correct operation (principle of least authority).

Under certain circumstances, processes can be fooled into using the permissions they have in an unforeseen way. The problem was first described with a compiler that was granted the permission to write to a file A for accounting purposes. If the name of file A is handed to the compiler as the target for debugging output, the compiler will overwrite the accounting data. The problem here is not that the compiler exercised an operation it was not allowed to, but that it was fooled into using its permissions in the wrong way. This problem is known as the confused deputy problem [Har88].

Imagine a company A such as a bank that wants to buy software from another company B to process confidential data. Now company B does not allow company A to view the
al view, Karma scales well for multiple CPUs.

In my measurements KVM performed worse than expected: I measured the virtualization overhead of KVM to be about 15 percent, whereas I expected an overhead of about 8 to 10 percent, caused by the switches between the VM and Qemu. The measured overhead may be a result of the different setup; for example, KVM used IDE to access the hard disk, whereas Karma, L4Linux and native Linux used SATA. I suppose that with more thorough tuning the KVM performance can be significantly improved.

Figure 6.6: Kernel compile times.

Figure 6.7: Kernel compile overhead relative to native Linux. No data is available for L4Linux on 2 and 3 processor setups.

                      CPU 0     CPU 1    CPU 2
  Hypercalls          85275     12206    11617
  Preemptions       1583361    191852   194254
  Hlt                  8736      9591     9250
  Injected Interrupts 21549     19608    19710

Figure 6.8: Vmexit reasons during kernel compile.

6.1.3.1 Vmexit Reasons

An interesting aspect is why and how often VM execution is left during a real-world load scenario. Therefore, I measured the kernel comp
and general interest in microkernels ceased [Lie96].

At that time, Jochen Liedtke devised a new generation of microkernels. Contrary to the designers of first-generation microkernels, he started from scratch following the rationale:

"A concept is tolerated inside the microkernel only if moving it outside the kernel [...] would prevent the implementation of the system's required functionality." [Lie95]

He proposed three basic abstractions: tasks, threads and IPC. Tasks are used as isolation domains and are usually implemented using address spaces. All activity inside of tasks is carried out by threads [Lie95]. Threads communicate with each other using synchronous and therefore unbuffered IPC, which proved to be much faster than the asynchronous IPC used in first-generation microkernels [Lie94].

According to Jochen Liedtke, a microkernel avoids implementing policy wherever possible to retain flexibility. Therefore, user land can run multiple policies in parallel. Such flexibility allows the implementation of a number of different operating system personalities, each with its own policy, and they all can run side by side [Lie96].

Moreover, interrupts and exceptions are translated by the microkernel into IPC, allowing hardware drivers to be implemented in user land and therefore to run encapsulated with respect to faults and errors [Lie96].

Memory management is done in user land as well. Address spaces are constructed in a hierarchical fashion. Page faults a
art-up, the VMM allocates a memory region of appropriate size and access permission, and maps it into the virtual machine to be used as guest-physical memory.

During the whole VM lifetime, the VMM has access to VM (physical) memory, which can be used for efficient shared memory communication between the VM and the VMM.

An alternative to allocating all memory at once is demand paging: whenever the VM accesses a page that has not been used before, the VMM receives a page fault and maps a fresh page into the VM.

In this thesis I chose the former approach because it allows the mapped memory region to be contiguous in physical memory. Whereas this allocation scheme is more wasteful of system memory, it simplifies address translation if the guest is allowed to directly drive DMA-enabled devices.

4.5.2 Platform Virtualization

As described in Section 2.2.4, the VMM has to implement platform devices such as an interrupt controller (IC) and a timer device.

For the design of the IC, it is reasonable to align it with the high-level abstractions of Linux. If done so, the Linux operations correspond directly to IC ones. As for a virtual device, interrupts are flagged by helper threads in a shared data structure and delivered to the VM by the VMM's control loop.

The timer thread periodically blocks with a timeout, upon which it flags the timer interrupt at the virtual IC.

4.5.3 Peripheral Devices

For a VM to b
ated on the physical hard disk, I ran the kernel compile twice and measured the second cycle only. Thus, I can ensure that all caches used by Linux are warm and the influence of hard disk latencies is minimized. In all scenarios, the VMs were equipped with 420MB of main memory.

The performance of the Karma VM is compared to a natively running Linux and L4Linux. All measured kernels had the same kernel configuration, with the exception of Karma- and L4Linux-specific settings. All configurations used a framebuffer device for output. Additionally, I measured native KVM, whose performance should closely match that of KVM-L4. The results can be seen in Figure 6.6. For clarification, the overhead compared to native Linux is depicted in Figure 6.7.

  #!/bin/sh

  tar xjf linux-2.6.30.5.tar.bz2
  cd linux-2.6.30.5
  make i386_defconfig
  make -j 3
  make clean
  _time
  make -j 3 > /dev/null 2>&1
  _time

Figure 6.5: Script used for measuring the run time of a kernel compilation. The _time binary is a custom binary which does a hypercall to make the VMM take a time stamp. The difference of the two printed timestamps is the actual run time. In KVM, _time was replaced by ntpdate, which synchronized the VM's clock to an NTP server, and the time stamps were taken after the synchronizations.

The Linux X11 framebuffer drivers do not refresh areas of the screen, for example to keep track of mouse poi
b to miss its deadline is a contract violation (e.g., a worst-case execution time (WCET) overrun) on its part. In particular, the misbehaviour of other jobs will never result in deadline misses of conforming jobs.

A microkernel enforces isolation between components. It can therefore isolate RT tasks that are placed in different protection domains. As spatial isolation is not sufficient for RT workloads, appropriate measures have to be taken to guarantee timely execution. All components between the hardware and the actual application have to be designed with that goal in mind. Besides the microkernel's scheduler, a VMM is of particular interest in that respect.

The remainder of this chapter is structured as follows: I will first analyse what constitutes a real-time OS (RT OS) and supplement it with a case study of Xenomai, a typical representative. Afterwards, I will present three designs for RT VMs.

7.1 Background

Instead of consisting of one continuous computation, RT tasks are made up of jobs, which become ready for computation at different instants of time. The timely execution of jobs can only be guaranteed if they have well-known characteristics, such as a bounded worst-case execution time or a minimal time between releases.

With that information at hand, the OS can check a priori for a given set of RT tasks whether a feasible schedule exists (admission). A task is only accepted for processing if its admission does not preclude a feasible schedu
be referred to as the hypervisor. All code controlling the VM execution will be called the VMM. VMM and hypervisor need not be different entities, but can be combined; the combined system will also be called hypervisor.

2.2.3 Virtualization of the x86 Architecture

The x86 instruction set includes a number of sensitive instructions that are not privileged and do not trap when executed in unprivileged mode. Therefore, privileged code of a VM cannot run in unprivileged mode without loss of equivalence, and hence the x86 instruction set is not virtualizable [RI00].

The remainder of this chapter will focus on virtualization on such a platform, followed by an overview of hardware extensions that were created to aid virtualization.

2.2.3.1 Software Virtualization

There are essentially two approaches to virtualization on architectures that are not virtualizable: either privileged code has to be emulated, or the VMM has to analyze guest privileged code on the fly and replace sensitive instructions that do not trap with instructions that do. I will classify the first solution as emulation and binary translation, and the second one as patching.

Unfortunately, both techniques incur a serious performance deterioration. Emulation of code is typically slower than native execution by large factors. The causes for this slowdown are inherent: under emulation the whole CPU state has to be duplicated in software, and the emulation of an instruction m
can then use its device discovery algorithms as usual. Depending on the intended use of the system, the VM may be allowed to access device memory directly.

However, allowing the VM to directly access devices that employ direct memory access (DMA) is problematic with respect to security. A malicious guest may use DMA to manipulate physical memory of the host and thus seize control of the machine. Consequently, this particular VM belongs to the TCB of all applications in the system. In Figure 4.4 the TCB of an L4 application is illustrated for this setup (a).

To mitigate this problem, the VM must not have access to device memory or IO ports. Instead, guest drivers could be modified to issue hypercalls to the VMM. The VMM could then validate the command and make sure it accesses only memory regions that belong to the VM. After successful validation the VMM writes the command into device memory. As illustrated in Figure 4.4, this removes the VM from the TCB of an L4 application (b).

That still leaves the VMM in the TCB of the system, which might be prone to attacks from the VM. To further increase security, this checking functionality might be implemented in an even smaller external server, which further decreases the TCB (c).

As discussed in Section 2.2.5, IO-MMUs allow DMA main memory accesses to be restricted in a secure fashion. Therefore, a VM running on a system using an IO-MMU can drive DMA-enabled hardware without having to be counted to the system
consent. Thus an access control policy can be implemented. A user land can be built that implements the principle of least authority, which means that all tasks have the minimal set of permissions needed to fulfill their commission [MI03].

Object capabilities unify access permissions and designators. Thus systems using object capabilities are less prone to the confused deputy problem than systems that employ two different mechanisms, as is the case for systems that employ Access Control Lists [MI03].

Communication also requires capabilities. Therefore the confinement problem is solved, because a task can only communicate with tasks for which it has been given a capability [MI03].

2.2 Virtualization

In recent years virtualization has become popular in the low-end server and even in the consumer market for a number of reasons, most of which are due to inabilities of current operating systems.

The workload on typical server machines does not fully utilize the machine, and demand for computation time comes in bursts. Virtualization is used to consolidate several underutilized servers into one physical machine, whereby energy is saved and space in the data center is freed. Virtual machines can be easily deployed and removed, which gives operators more flexibility and allows product testers to set up a number of machines, each with a different configuration.

With live migration techniques, improved load balancing and fault tolerance
ct Memory Access

Segment Descriptor Table

GPL: Gnu General Public License

Guest: All code running inside a virtual machine

Host: Platform hosting a virtual machine

Hypervisor: The privileged part of a virtual machine monitor. If both VMM and hypervisor run in privileged mode, the combination is also called hypervisor.

IC: Interrupt Controller

IDE: Integrated Drive Electronics, a standard for a hard disk interface

IPC: Inter-Process Communication

IPI: Inter-Processor Interrupt

MMU: Memory Management Unit

NIC: Network Interface Card

NTP: Network Time Protocol

OS: Operating System

PCI: Peripheral Component Interconnect, a standard for a bus system, which is deployed in almost every PC

RT: Real time

SLOC: Source Lines Of Code

SMP: Symmetric Multi-Processing, a term often used for machines with multiple CPUs

SVM: Secure Virtual Machine, a processor extension employed in modern AMD processors to aid virtualization

TCB: Trusted Computing Base

TLB: Translation Lookaside Buffer

UTCB: User Thread Control Block, a part of the thread control block, which is accessible in user land from within the thread's address space

VESA: Video Electronics Standards Association, a standard for a graphics card interface

VGA: Video Graphics Array, a standard for a graphics card interface

VM: Virtual Machine

VMCB: Virtual Machine Contr
d by Julian Stecklina in his diploma thesis [Ste09]. Such a debugger would be a tremendous help for system developers. Together with a snapshot mechanism this could be even more useful, because a VM could be stopped at a certain point in execution, and different scenarios could be played through while debugging.

System Services: A Linux system provides a number of services that would be useful to L4 applications as well. A prominent example is the file system, which could be used by L4 applications to store and retrieve data. In L4Linux, applications can communicate directly with the L4 environment; such applications are called hybrid tasks.

Contrary to hybrid tasks, Linux applications in the VM cannot directly communicate with the outside. Instead, the VMM would have to act as a server: L4 applications would request service using IPC with the VMM. The VMM would implement a stub driver, which would react on incoming client requests. The request command would be written to a shared memory location, and the guest would be notified with an interrupt. The guest's stub driver would read the command from the shared memory region and forward it to the appropriate service application. This forwarding could be done with a special device file, for example. The application would answer the request by writing to the device file. The stub driver would receive the answer, write it into the shared memory region and inform the VMM, which would then send th
d how it is implemented in Fiasco.

In AMD SVM, an in-memory data structure, the virtual machine control block (VMCB), contains information about the virtual CPU. The virtual CPU's state is saved in the state save area. The state save area does not include the general purpose registers, which need to be saved explicitly. Information that is used to control the behaviour of the CPU in guest mode is kept in the control area.

The control area is used to control which instructions and events (e.g., host interrupts) are to be intercepted. It also contains information about the cause of the VM exit.

Memory virtualization is handled by nested paging, which works upon a regular host address space that is used as guest-physical memory. A guest address is translated from guest-virtual to guest-physical using the guest's page table. Subsequently, it is translated from guest-physical to host-physical using a host page table. Both operations are done by the MMU and do not require VMM interaction.

5.1.2 Fiasco Hardware Virtualization Interface

The Fiasco hardware virtualization interface provides system calls to initiate a world switch and to create a host address space to be used as a host page table for nested paging.

A new VM kernel object was devised, which inherits the memory management interface of L4 tasks. Therefore, the address space for nested paging is created in the same way as a task's address space. It is constructed by the VMM
d mode that only allows a subset of instructions. This setup is typical of microprocessors used in machines intended for multi-user operation.

They defined a virtual machine (VM) as an efficient, isolated duplicate of the real machine that is established by a control program, which they call a virtual machine monitor (VMM). According to their definition a VMM has three important characteristics: first, it provides an environment for programs that is essentially identical to the original machine; second, programs running in that environment shall suffer only little decrease in execution speed; finally, the VMM is in full control of the system's resources.

The first characteristic means that programs running in the virtual environment must produce the same results as on a real machine. Temporal behaviour, however, cannot be recreated, because the timing of the program flow is altered by all interventions of the VMM. The efficiency requirement demands that a statistically dominant subset of the virtual processor's instructions runs directly on the physical processor. The third characteristic, resource control, means that the program running inside the VM should be able to use only those resources that it has been explicitly granted access to.

Further definitions by Popek and Goldberg include:

Trap: A trap is an unconditional control transfer from unprivileged to privileged mode executed by the processor.

Privileged instruction: A pr
d parties that may not provide the source code for thorough analysis. A solution is to run drivers in an isolated domain, containing faults and errors and therefore preventing a breakdown of the whole system. Such a system allows a restart of failed drivers and can therefore continue its operation.

In widely used operating systems isolation between applications is insufficient. To trust an application we have to trust the whole software stack, which is unnecessarily large and makes applications depend on functionality they do not need. We need a system that allows applications with a custom-tailored trusted computing base (TCB).

Studies have shown that even well-engineered and reviewed code contains an average of two bugs or possible attack vectors per 1000 source lines of code (SLOC) [MD98]. A minimal Linux kernel already comprises 200,000 SLOC, standard configurations being much bigger [PSLW09].

Aside from the kernel, the X Server running on top of it contains 1.25 million SLOC, all of which are executed with root privileges and, if compromised, allow the attacker access to all data on the system. Applications such as web browsers are also quite complex; for example, Firefox comprises 2.5 million SLOC. Any bug, such as a buffer overflow, in this huge stack may be used not only to compromise the browser and the web banking session therein, but can also escalate to compromise a user's data [PSLW09].

Because all OS subsystems r
dberg and Popek.

Instead, the CPU state was duplicated with the introduction of a less privileged processor mode that is called guest mode, in contrast to the standard processor mode, which is now called host mode. A switch between guest and host mode is fairly complex, because the processor state that is not or only indirectly accessible (for example, segment registers) has to be stored and exchanged. A transition to guest mode is initiated with the privileged instruction vmrun.

The CPU can be configured to intercept certain instructions, faults and interrupts while executing in guest mode. An intercept results in an unconditional fallback to host mode (VM exit) that is augmented with information about its cause. Intercepts may be used to let the host receive and handle interrupts to remain in control of the machine. They can also be used to let the VMM assist in operations that are not handled by hardware virtualization (for example, device virtualization). A vmmcall instruction is available in guest mode to voluntarily initiate a VM exit.

To aid in removing the performance bottleneck imposed by emulating the MMU, both AMD and Intel created a hardware extension that implements another stage of address translation in hardware and thus mitigates the need for the VMM to intervene on guest page table manipulation. This hardware extension is called nested paging. A host address space is used as guest-physical memory, upon which the guest can build addres
ds to be available as main memory for the VM.

Limitations: At the moment IO ports are not supported; writes to an IO port are ignored, and reads return a fixed value. In Linux, applications can request the rights to directly access IO ports from the kernel. An example is the hwclock application, which can read the current time directly from the system's clock. In turn, it can set the clock to another value by writing to its IO ports.

In my solution such applications do not work. In the future, IO ports can be emulated if required.

In the current implementation, the KVM port is a proof of concept and supports only one second-stage VM at a time. Support for multiple second-stage VMs would require the VMM to distinguish between individual VMs, which can be done by augmenting the hypercalls used by KVM with a VM identifier.

6 Evaluation

The goal of this work was to run unmodified Linux applications on top of the microkernel Fiasco. I created a solution that is of low complexity (Section 5.5), has a small resource footprint (Section 5.5), and runs unmodified Linux applications with few exceptions (Section 5.5). An important aspect of my work was that the overhead in terms of execution time should be comparable with that of native Linux.

System: The measurements were done on the following system:

- Machine:
  - CPU: AMD Phenom 8450 X3, 3 cores, 2100MHz, 128KB L1, 512KB L2 and 2048KB L3 cache
  - Memory: 4GB
  - Hard Disk Controller: ATI Technologies Inc. SB
e capability-based systems can be constructed.

3.2 Rehosted Operating Systems

Implementing a rich user land on top of a microkernel is a challenge of its own, as can be observed with the Gnu Hurd project [WBO07].

A pragmatic solution is the port of a monolithic kernel to run as a user-land server that executes unmodified applications. Such a port is called rehosting. The rehosted kernel has to be modified to run as an application on the host OS. That means it has to be ported from the machine interface, where it is in full control, to the application binary interface (ABI) of the host operating system. To run unmodified applications, the rehosted OS must behave like the original with respect to user land. For example, applications make assumptions about the layout of their address space and would not run if it is altered. Similarly, they rely on the correct behaviour of synchronisation mechanisms. Therefore a rehosted OS has to be in control of its applications. In contrast to running on a real machine, the rehosted OS must employ mechanisms of the host. For example, the rehosted OS must not manipulate hardware page tables directly, but has to employ host mechanisms to create the address spaces of its applications.

In the following paragraphs I will introduce three important examples of rehosted operating systems: L4Linux, Wombat/OK:Linux and User-Mode Linux.

3.2.1 L4Linux

L4Linux is a rehosted monolithic
e reply IPC.

8.2 Conclusion

In this thesis, I devised a new solution to run Linux applications on top of the microkernel Fiasco. Linux applications are supported without requiring modification. I also succeeded in porting KVM, and can therefore also run all OSes that are supported by KVM; for example, the KVM port runs Windows, BSD and others. With these OSes, all their applications become usable, which significantly enhances the range of available software.

In this work Linux runs encapsulated and does not increase the TCB of L4 applications. The only exception to the encapsulation is hard disk access, which uses DMA. With the VM driving DMA-enabled devices, it must be counted to the TCB. I presented two solutions to handle this problem and remove the VMM from the TCB. The solutions require device commands to be handed out of the VM and to be analyzed by an external component. I measured the overhead of these solutions to be small. In the near future all shipped computers will have IO-MMUs, which allow DMA accesses to be restricted and thus resolve the problem entirely.

My solution has a small resource footprint of about 284KB for an individual VMM instance. As with any virtualization solution, additional resources are needed for device virtualization and main memory for the guest.

My solution performs better than L4Linux and KVM in all benchmarks. The performance overhead is small for all but pathological cases. My
e strings. Instead of implementing only the single-character interface and breaking strings up into their characters, I implemented a hypercall that allows the VMM to print the whole string directly from guest memory.

5.3.3 Graphical Console

A graphical console is provided by the L4 device manager L4con. Clients get a shared memory area that they can use to draw their graphical output (frame buffer). L4con composes the final display from these shared memory regions. In L4con only one client is visible at a time, and an indication of which client is active is drawn to a special screen area that cannot be manipulated by clients. Operations such as screen refreshes are initiated by clients over the IPC-based protocol L4Re::Framebuffer.

L4Linux contains a stub driver that implements the L4Re::Framebuffer interface and thus acts as a client of L4con for graphics output. I used this stub driver with modifications. As described before, the VM cannot make use of IPC directly. Therefore, I split the L4Linux stub driver so that the IPC interface of L4con is implemented in the proxy driver in the VMM while the driver logic resides in Linux.

I mapped the shared memory region for graphical output offered by L4con into the VM's physical address space and made it visible to Linux as device memory. Drawing an image to the screen is done by the stub driver by writing the image data directly into the device memory and then sending a co
e useful, it has to be provided with virtual devices. The L4 system includes device managers: servers that multiplex access to hardware. For example, Ankh implements a virtual network switch, thereby multiplexing the physical network card. Similarly, we have a secure console that securely multiplexes the system's frame buffer and maintains a secure input indicator, which cannot be forged by clients. These device managers consist of a device driver and multiplexing logic. Client communication with device managers is done with IPC and shared memory. These services are a natural choice for the back ends of virtual devices.

The VMM makes use of the device managers to present the VM with a custom interface. The guest OS needs special drivers to communicate with the virtual devices. Because these drivers implement an abstracted interface only, I will call them stub drivers. The VMM's task is limited to translating the stub drivers' commands into IPC messages, which are forwarded to the device manager.

Asynchronous events, such as the availability of packets, are signaled using IPC or asynchronously using User Irqs. For each device manager, the VMM employs a blocking helper thread to receive the notification and flag the event as an interrupt at the virtual IC.

Virtio provides a para-virtualization interface, which is used by Lguest and by KVM (optionally). It implements device drivers for a network card, a serial console and a generic block driver [Rus08].
ed only slight modification, and achieves isolation of RT tasks provided they reside in different VMs. Using their original admission checks, guests make sure that their tasks are schedulable under the given conditions (CPU slowdown). The host has to determine a VM time slice length that allows each guest to meet its deadlines. If chosen too long, a guest might be allocated CPU time at a point in time that is too late for it to meet its deadline. On the other hand, if chosen too short, every guest will meet its deadlines, but secondary effects such as switching overhead may become so large that they have to be taken into account, which will also negatively affect schedulability.

This design is limited to time-driven RT job releases. Event-driven job releases can occur any time the external event, for example an interrupt, occurs. Even if the host knew which event belongs to which VM and could inform the VM of the event, the latency until the job can actually be executed is bound to the fixed VM time slice length. This latency can be big enough to make the job miss its deadline.

7.5 Modified RT Design

I propose a different scheme whereby scheduling is done by the host scheduler. Compared to the previously sketched approaches, more intrusive modifications to the guest OSes are necessary.

As with the two previous designs, I seek to reuse as much as possible of the complex infrastructure that RT OSes provide. Compared to the previous designs
en is to structure the system into many small components, each of which carries minimal privileges.

The improvements in security and availability of microkernel-based systems are paid for by giving up backward compatibility. Accordingly, only few native applications will initially be available, which makes a migration path that allows gradual adoption of existing applications desirable.

One solution is to establish an encapsulated legacy environment that can be instantiated multiple times. A pragmatic way of creating such a legacy environment is to reuse a standard OS kernel. Although an attacker may still succeed with his initial attack on a target in a legacy domain, he finds himself encapsulated in the latter. That gives other applications the chance to go unharmed, provided they reside in other domains.

In the past, standard OS kernels were adapted to run on top of a microkernel. The resulting three-tier architecture (microkernel, OS kernel, application) had to run on hardware that supported only two privilege levels. As a result, both OS kernel and application run unprivileged, with the OS kernel relying on microkernel services to manage its processes. Security-critical operations, such as memory management, have to be mediated by the microkernel. Therefore, porting a standard OS kernel involved in-depth kernel modifications, which require intimate knowledge of the kernel. Due to the microkernel involvement, rehosted kernels face inherent per
entering the kernel from an application requires two privilege-level switches and one address space switch in L4Linux, in contrast to one privilege-level switch in native Linux. This is one example where the L4Linux setup incurs overhead. L4Linux is a highly optimized rehosting effort and, despite inherent performance disadvantages, performs well in many workloads.

L4Linux runs encapsulated and is binary compatible with Linux. An illustration of L4Linux is given in Figure 3.1: L4Linux runs Linux applications side by side with L4 applications with a small TCB.

Figure 3.1: L4Linux running as a user-land server on top of the L4 microkernel Fiasco.

3.2.2 OK:Linux aka Wombat

Wombat is a port of Linux to the L4Ka::Pistachio microkernel. The basic architecture is similar to L4Linux, with the Wombat kernel residing in its own address space. The main emphasis of this port was to efficiently support Linux on L4 on embedded platforms. The work was done at NICTA [LvSH05].

The Wombat project is currently stalled, but its intellectual property lives on in Open Kernel Labs' OK:Linux project, which runs on top of the OKL4 microkernel. The currently available version of OK:Linux is based on Linux 2.6.24.9. Unfortunately, no further information on the project is available.

3.2.3 User-Mode Linux

User-Mode Linux (UML) is a port of the Linux kernel
…erties, such as a tiny TCB and strong isolation between user-land components, a microkernel is an ideal basis for secure systems. In this thesis, I will present a solution that uses a VM to run a Linux kernel on a microkernel.

I will proceed by defining the requirements of this work. Thereafter, I will present the overall architecture. The architecture will be refined in Section 4.3 with a discussion of the device interface. Section 4.4 will introduce the different attack scenarios that I consider. The design of the VMM will be introduced in Section 4.5.

Figure 4.1: System call execution in (a) native Linux, (b) L4Linux and (c) Linux in a VM.

4.1 Requirements

My work has to fulfill the following requirements:

Functionality: The whole range of Linux software shall be supported. Therefore, Linux applications must not need to be adapted. Interaction through standard interfaces such as graphical user interfaces, mouse and keyboard, as well as networking shall be available.

Isolation: Linux applications must run encapsulated and not interfere with L4 applications. Security properties of the underlying microkernel system, as described in Section 2.1.1, must still hold.

Resource Footprint: The amount of resources that is needed to run the VMM itself should be small.

Perfo…
…es resources for the VM. Also, it includes an implementation of platform devices, such as an interrupt controller, and provides the VM with access to peripheral devices.

The following sections will introduce the design of these components.

4.5.1 Core Virtualization

CPU and memory virtualization are handled with hardware virtualization extensions, which are under control of the microkernel. The VMM uses the microkernel's interface to command its VM.

With para-virtualization, the guest can voluntarily leave VM execution with the vmmcall instruction. This mechanism is used to implement calls to the VMM that work similar to system calls. Register contents are used to select commands and to pass values to the VMM. Additional information can be provided in shared memory. As stated before, these calls are called hypercalls.

Asynchronous communication from the system to the guest is done with virtual interrupts. Once injected, the guest's execution is preempted and its interrupt handler is called according to the guest's interrupt descriptor table. From that point on, the regular interrupt handling procedures apply.

The VMM's control loop coordinates the VM's execution. It instructs the microkernel to switch to the VM with a system call. Upon return, it checks the VM state to decide on the next action, for example to react to a hypercall. The control loop also injects virtual interrupts. It must not block, as that would halt the VM execution.

Upon st…
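The resume/check/handle cycle described above can be sketched as follows. This is a minimal simulation, not the Karma VMM's actual code: the types, the `resume_vm` stand-in for the microkernel's world-switch system call, and the scripted exit reasons are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Minimal sketch of a VMM control loop (hypothetical types; the real VMM
// uses the Fiasco system-call interface instead of this simulation).
enum class ExitReason { Hypercall, Preemption, Halt };

struct VmState {
    bool irq_pending        = false;  // flagged at the virtual IC
    int  handled_hypercalls = 0;
    int  injected_irqs      = 0;
};

// Stand-in for the microkernel's "switch to VM" system call: here we just
// replay a scripted sequence of VM exits.
ExitReason resume_vm(std::deque<ExitReason>& script) {
    if (script.empty()) return ExitReason::Halt;
    ExitReason r = script.front();
    script.pop_front();
    return r;
}

void control_loop(VmState& vm, std::deque<ExitReason>& script) {
    for (;;) {
        // Inject a virtual interrupt if one was flagged at the virtual IC.
        if (vm.irq_pending) { vm.injected_irqs++; vm.irq_pending = false; }
        switch (resume_vm(script)) {
        case ExitReason::Hypercall:  vm.handled_hypercalls++; break; // dispatch on register contents
        case ExitReason::Preemption: break;                          // involuntary exit, just resume
        case ExitReason::Halt:       return;
        }
    }
}
```

Note that the loop never blocks: every iteration either handles an exit or resumes the VM, mirroring the requirement stated above.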
…esemble a physical device. Interfaces of physical devices are often inefficient and require incremental device state updates for complex device commands, which cause numerous traps to the VMM and subsequent instruction emulation. Instead, a custom interface can make use of efficient shared-memory communication, which increases the expressiveness of commands and thereby reduces the number of VMM interactions. The interface used by the guest to command the device can be implemented using calls to the VMM that resemble system calls used by applications to request services from the OS. Instead of requiring the implementation of a device model of an existing device, an abstracted device model can be created, which, because it is not restricted to the limitations of the machine interface, may be of less complexity than a model of a physical device.

In summary, such a solution is of less complexity than faithful virtualization and may provide better performance [BDF+03]. However, providing the VM with an interface that is different from the physical machine breaks the equivalence criterion of classical virtualization and is therefore called para-virtualization. Para-virtualization requires custom drivers for the guest OS to make use of virtual devices.

Both faithful and para-virtualization require the VMM to implement back ends to the virtual devices. This can be done in several ways, for example by…
…example of which is shown in Figure 2.1.

2.2.3.2 Hardware Virtualization

During the last decade, together with the increasing power of CPUs, virtualization technology gained significance. VMware was the market leader because they were the only ones who could do efficient virtualization of x86. The entry barrier for other contenders in the x86 market was exceptionally high. When the x86 was enhanced to 64 bit, legacy mechanisms like segmentation were intentionally left out. Unfortunately, that also broke VMware's virtualization technique, which used segments to protect the hypervisor [pag]. Intel and AMD, two major contenders in the x86 market, investigated hardware extensions to aid virtualization and thereby lower the entry barrier to the virtualization market.

Figure 2.2: MMU virtualization without VMM intervention (nested paging). The guest page table maps guest-virtual to guest-physical memory; the host page table maps guest-physical to host-physical memory.

They came up with similar, but differently named solutions. In this thesis, I use the notation of AMD.

Modification of the instruction set was not an option, as it would break backward compatibility. A possibility would have been to introduce a new processor flag that, if set, causes all sensitive instructions to trap. This would enable VMMs implementing the trap-and-emulate scheme as proposed by Gol…
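The two-level translation that Figure 2.2 depicts can be illustrated with a toy model: every guest-virtual access is first translated by the guest's page table to a guest-physical address, which the nested (host) page table then maps to host-physical memory, with no VMM intervention. The flat single-level tables below are a deliberate simplification; real hardware walks multi-level tables.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Toy model of nested paging. Page tables are flattened to a map from
// page number to page number; a missing entry models a page fault.
constexpr uint64_t kPageSize = 4096;

using PageTable = std::map<uint64_t, uint64_t>;  // page number -> page number

std::optional<uint64_t> translate(const PageTable& pt, uint64_t addr) {
    auto it = pt.find(addr / kPageSize);
    if (it == pt.end()) return std::nullopt;     // page fault
    return it->second * kPageSize + addr % kPageSize;
}

// Full guest-virtual -> host-physical walk, as the MMU performs it.
std::optional<uint64_t> nested_translate(const PageTable& guest_pt,
                                         const PageTable& host_pt,
                                         uint64_t guest_virt) {
    auto guest_phys = translate(guest_pt, guest_virt);
    if (!guest_phys) return std::nullopt;        // guest page fault
    return translate(host_pt, *guest_phys);      // nested (host) page fault otherwise
}
```

The VMM only populates the host page table; the guest manages its own page table without traps, which is what removes the VMM from the fast path.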
…f the 6th Linux Conf.Au, Canberra, 2005.

Bibliography

[MD98] Y. K. Malaiya and J. Denton. Estimating defect density using test coverage. Rapport Technique CS-98-104, Colorado State University, 1998.

[MI03] M. S. Miller and C. Inc. Capability myths demolished. 2003.

[MMH08] D. G. Murray, G. Milos, and S. Hand. Improving Xen security through disaggregation. In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 151-160. ACM, New York, NY, USA, 2008.

[pag] How retiring segmentation in AMD64 long mode broke VMware. http://www.pagetable.com/?p=25. Online; accessed 23.09.2009.

[PG74] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17(7):412-421, 1974.

[pis] L4Ka::Pistachio microkernel. http://l4ka.org/projects/pistachio/. Online; accessed 22.09.2009.

[PSLW09] Michael Peter, Henning Schild, Adam Lackorzynski, and Alexander Warg. Virtual machines jailed: virtualization in systems with small trusted computing bases. In VTDS '09: Proceedings of the 1st EuroSys Workshop on Virtualization Technology for Dependable Systems, pages 18-23, New York, NY, USA, 2009. ACM.

[RI00] J. S. Robin and C. E. Irvine. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor.…
…face, with the exception of IPC, which is done by the in-VMM proxy driver.

5.3.5 Hard Disk

At the time this thesis was written, there was no working device manager for hard disks. Linux has a large driver base and therefore supports many devices out of the box. So one way of supporting devices that do not have L4 device managers is to allow one VM to drive the device. The driver VM could export a device interface to other VMs, but that is out of scope for this work.

A Linux driver communicates with devices either with I/O ports or by reading and writing directly to device memory. In this thesis, I allowed the VM to access the hard disk, which is connected to the system with serial ATA (SATA). The SATA controller implements the AHCI interface, which makes use of memory-mapped I/O.

To access the hard disk, Linux needs to be provided with device information, such as the interrupt number and the location of the device memory.

In the L4 system, access to devices is managed by the L4 server IO. IO iterates over the PCI configuration space and ACPI tables to find attached peripheral devices. It allows the administrator to construct virtual busses that present clients with only the devices…

Figure 5.4: The Ankh network multiplexer with two… [legend: (a) IPC/User Irq, (b) virtual interrupts, (c) hypercalls, (d) hardware commands]
…formance disadvantages compared to the native kernel.

The recent addition of hardware virtualization support to commodity processors creates the opportunity to reuse an OS kernel largely unchanged. That not only dispenses with the need for laborious changes but also overcomes the performance disadvantages of the previous approach. In previous work, virtual machines were a synonym for a duplicate of a physical machine (faithful virtualization). Para-virtualization, i.e. virtual devices with no corresponding physical implementation, was only employed if not doing so would result in severe performance degradations. Even though faithful virtualization can host unmodified OSes (an advantage that is of little importance for systems that are available with their sources), its implementation introduces significant complexity, mainly due to device emulation.

In this thesis, I present a solution that uses virtual machines to run a slightly modified standard OS kernel, and thereby avoids the costs involved with faithful virtualization, both in terms of performance and complexity. In a second step, I will show how functionality of the para-virtualized OS kernel can be used to support faithful virtualization as well.

1.1 Structure

This thesis is structured in the following way. First, in Chapter 2, I introduce the benefits of microkernels and their access control scheme. Second, technology that enables the reuse of standard OS applications…
…he physical hard disk.

5.3.1 Design Pattern

I will now explain a pattern that is common to all device driver stubs. In the following sections, I will give more details about the specific drivers.

As described in Section 4.5.3, a stub driver is implemented in the guest operating system. It contains all driver logic and implements the interface to the guest operating system. More specifically, the stub driver glues together the logic of the corresponding device manager's IPC interface and the guest OS interface.

In the Fiasco hardware virtualization interface, the VM cannot directly execute IPC. Instead, the stub driver's commands are mediated by the VMM.

Therefore, each stub driver is accompanied by a driver back end in the VMM (hereafter called proxy driver), which executes the IPC operations on its behalf. The driver back end contains a blocking thread that waits for incoming notifications. Notifications can be issued by device managers in the event of input, or by the microkernel to forward interrupts from hardware devices. This thread will be referred to as event thread.

Upon receiving a notification, the event thread registers a virtual interrupt as pending at the IC, which is then injected into the VM by the control loop. To make sure that the latency between a device notification and the subsequent injection is low, event threads have a higher priority than the thread running the control loop.

Once the guest OS receives the interrupt, i…
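The notification path described above can be sketched in a few lines. This is an illustrative model only: `VirtualIC`, `ProxyDriver` and `on_notification` are hypothetical names, and the real event thread blocks in IPC rather than being called as a plain function.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// The virtual interrupt controller: event threads flag interrupts as
// pending; the control loop takes and injects them. An atomic bitmap is
// used because flagging and injecting happen on different threads.
struct VirtualIC {
    std::atomic<uint32_t> pending{0};
    void flag(int irq) { pending.fetch_or(1u << irq); }
    bool take(int irq) {
        uint32_t bit = 1u << irq;
        return (pending.fetch_and(~bit) & bit) != 0;
    }
};

// Proxy-driver side: in the real VMM this body is run by a high-priority
// event thread that blocks waiting for device-manager notifications.
struct ProxyDriver {
    VirtualIC& ic;
    int        irq;
    void on_notification() { ic.flag(irq); }
};

// Control-loop side: inject a pending interrupt before resuming the VM.
// Returns the number of interrupts injected (0 or 1).
int inject_pending(VirtualIC& ic, int irq) { return ic.take(irq) ? 1 : 0; }
```

Because `flag` and `take` are lock-free, the event thread never waits on the control loop, which is what keeps notification-to-injection latency low.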
…ing of processor and memory, the platform (timer, bootup, etc.) and the peripheral devices. Previous solutions were able to achieve isolation, but did so at the cost of complexity and performance penalties. The availability of hardware-accelerated virtualization promises a substantial simplification of processor and memory virtualization.

The subject of this thesis is to investigate how platform and device virtualization can be realized in microkernel systems. For some interactions that the guest operating system performs on physical hardware, the implementation effort in the system virtualizer, the so-called virtual machine monitor, is disproportionately large. With a view to simplifying the implementation, it is possible to choose an interface between the guest operating system and the VMM that deviates from the physical machine. Care must be taken that the size of the "trusted computing base" does not grow for components with high isolation requirements. The design is furthermore to be structured such that future extensions allowing the encapsulation of real-time guest systems are possible with little additional effort.

The design developed in the first step is to be validated by means of a prototypical implementation. An operating system that supports a notable number of applications is to be chosen as the guest system. Performance characteristics are to be assessed by means of measurements in com…
…hot timers to reduce the frequency of timer interrupts when there is little load, which would increase the time of hibernation.

6.1.3.3 Handing I/O Memory Accesses to the VMM

To reduce the TCB of the system, it is desirable not to let the VM directly execute any DMA transfers. To do this, I implemented an alternative: the Linux AHCI driver resorts to hypercalls to make the VMM analyze the command and execute it on behalf of the driver. To further decrease the TCB, device commands could be handed out to a separate task.

Figure 6.9: Kernel compile with idle VMs (time in seconds vs. number of CPUs).

The actual implementation of the algorithms needed to analyze the device commands is out of scope for this work. Instead, I slowed the hypercall down to get an impression of how much performance such an analysis would cost. The slowdown was done with a tight loop. The first experiment slowed all write accesses to the I/O memory down by 10000 cycles, which simulates the analysis in the VMM, which would consist of copying all commands out of the VM, analysing them and finally issuing them. Read accesses are used to inquire the state of the device and previous commands. For read operations, little copying is needed, and the analysis should be straightforward. Therefore, I slowed down read accesses by 3000 cycles only.

In the…
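The tight-loop slowdown can be sketched as follows. The thesis counts processor cycles (via the timestamp counter); this portable stand-in counts loop iterations instead, and the function names are illustrative, not the actual instrumentation code.

```cpp
#include <cassert>
#include <cstdint>

// A tight loop that burns a fixed amount of work per intercepted
// I/O-memory access, standing in for the cost of copying and analysing
// device commands in the VMM.
volatile uint64_t sink = 0;  // volatile defeats dead-code elimination

uint64_t burn(uint64_t iterations) {
    uint64_t done = 0;
    for (uint64_t i = 0; i < iterations; ++i) { sink += i; ++done; }
    return done;  // number of iterations actually executed
}

// Intercepted-access handler: writes are penalised more than reads,
// mirroring the 10000-cycle write / 3000-cycle read slowdown of the
// experiment. Returns the iteration count for verification.
uint64_t handle_io_access(bool is_write) {
    return burn(is_write ? 10000 : 3000);
}
```

An asymmetric penalty is used because command submission (writes) would require copying whole command structures out of the VM, whereas status reads need little copying.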
…ht to the attention of the RT VM. When the RT VM yields the CPU, other activities are given a chance to run.

Depending on the objective, the VM should not use all the CPU time and allow the host OS and its (best-effort) applications to run as well.

Figure 7.1: Xenomai architecture (Adeos running an RT Xenomai domain and a Linux domain, each with scheduler and dispatcher).

The mechanisms required to enforce that are rather simple. The kernel tracks what portion of time is not claimed by the VM. If that fraction drops below a negotiated value over a longer period of time, the VM is considered non-complying and completely suspended. Therefore, a misbehaving RT VM cannot bring the rest of the system to a halt.

If the RT OS signals idle periods, for example by executing the interceptable hlt instruction, no modifications are needed at all. However, a major drawback of this design is that such a system can only accommodate one RT VM per CPU. Consequently, it is up to the RT OS to isolate RT tasks from one another.

The compliance on the VM's part mainly hinges on the RT OS. If the RT OS does not guarantee to comply with the contract that governs the relationship between the RT and best-effort tasks, the operator has to weigh the importance of RT tasks against the importance of the best-effort tasks. Suspending the RT VM in the case of misbehaviour is not an option for critical RT applicati…
…ile, the VM was configured to use three CPUs, 512MB RAM and the L4con framebuffer interface. Additionally, I measured how many interrupts are injected during the benchmark. The numbers are shown in Figure 6.8. I counted all involuntary VM exits caused by host interrupts as preemptions.

Additionally, I wanted to know which device back ends received the most hypercalls:

• IC: 66572
• APIC: CPU0: 11594, CPU1: 8098, CPU2: 7712

These numbers indicate that the IC interface and the APIC interface are used very often, and may bring performance increases if they are optimized.

Another conclusion to draw from these numbers is that there are many more preemptions than injected interrupts. Therefore, as an optimization, the kernel world-switch path could be modified to only return to the VMM when there is a virtual interrupt pending.

6.1.3.2 Idle VMs

To estimate how much computation time is lost on idling VMs, I measured the kernel compile with different numbers of idle VMs running side by side with the measured VM.

The measured numbers are visualized in Figure 6.9. As expected, the overhead of idle VMs is small. Interestingly, the kernel compile times did not increase linearly with the number of idle VMs. Instead, for odd numbers of idle VMs the kernel compile took longer than the same benchmark with one more idle VM. I guess that this slowdown is a result of a scheduling anomaly.

As an optimization, the VMs could be configured to use one-s…
…in would have a higher priority than non-RT domains. Lower-priority domains can consume only the fraction of CPU time that the higher-priority domains leave idle.

To support complex scenarios in a well-defined manner, Xenomai implements semaphores and mutexes for synchronization within the RT world. The synchronization facilities are aware of scheduling-related issues such as priority inversion, and take appropriate measures to avoid them. Additionally, it provides streamlined communication channels between the Xenomai and the Linux domain. This can be used, for example, to analyse data accumulated by RT tasks in regular Linux applications [Ger].

After the Linux kernel has booted the system, Adeos is brought into position by loading it as a kernel module. With Adeos in place, additional domains can be set up. After Adeos has taken over, Linux can still run non-RT applications. RT tasks are set up with the help of the Linux domain. After they have been loaded, they enter into RT execution mode, whereafter the RT domain schedules them.

RT tasks run in privileged mode with no means of isolation among each other. Even worse, any malicious or faulty RT task may bring down the whole system.

7.3 Single Unmodified RT Guest

The simplest means to achieve RT-capable VMs is to give special treatment to one designated VM. An RT OS is used inside it to provide an appropriate environment for its tasks. Scheduling-relevant events like timer expirations are first broug…
Figure 4.5: Staged virtualization (Windows applications and Linux applications in VMs on top of the microkernel).

…established by KVM, which does faithful virtualization and runs unmodified OSes, will be called second-stage VM.

To support staged virtualization, KVM needs to be modified. Instead of building guest-physical memory using Linux mechanisms, and running the vmrun instruction itself, it needs to use hypercalls. The VMM translates these calls into the appropriate L4 system calls.

Figure 4.5 gives an impression of this setup. The second-stage VM is run in the following way: VM execution starts in Qemu, which does its initialization. As soon as possible, it switches from emulation to execution using hardware virtualization. It therefore instructs KVM (a) to do a world switch. Contrary to native KVM, which does the world switch itself, in staged virtualization KVM does a hypercall to instruct the VMM to do the world switch (b). The VMM then does the world switch on behalf of KVM using the microkernel interface (c). On intercepts, the second-stage VM is left (d), and the VMM resumes execution of the first-stage VM (e). Upon receiving control, KVM analyzes the exit reason of the second-stage VM, and switches back to Qemu (f) if needed.

4.7 Summary

The system architecture is illustrated in Figure 4.6. The VMM uses the microkernel's interface to control the VM execution (control loop) and memory. It uses L4 device managers as bac…
…inux kernel. In Proceedings of the 2000 Linux Showcase and Conference, volume 2, 2000.

J. Dike. User-mode Linux. In Proceedings of the 5th Annual Linux Showcase & Conference, Volume 5, pages 2-2. USENIX Association, Berkeley, CA, USA, 2001.

[Ger] P. Gerum. Xenomai: Implementing a RTOS emulation framework on GNU/Linux. http://www.xenomai.org/documentation/xenomai-head/pdf/xenomai.pdf. Online; accessed 15.12.2009.

A. Ganapathi, V. Ganapathi, and D. Patterson. Windows XP kernel crash analysis. In Proceedings of the 20th Conference on Large Installation System Administration, pages 12-12. USENIX Association, Berkeley, CA, USA, 2006.

[Har88] N. Hardy. The confused deputy: or why capabilities might have been invented. ACM SIGOPS Operating Systems Review, 22(4):36-38, 1988.

[Har02] H. Härtig. Security architectures revisited. In Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, page 23. ACM, 2002.

[Hei05] Gernot Heiser. Secure embedded systems need microkernels. USENIX ;login:, 30(6):9-13, Dec 2005.

[HHL+97] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of μ-kernel-based systems. In 16th ACM Symposium on Operating System Principles (SOSP), pages 66-77, 1997.

[HUL06] G. Heiser, V. Uhlig, and J. LeVasseur. Are virtual…
…io [Här02]. The VM may communicate with secure applications on top of the microkernel through a well-defined secure channel.

An example of such an application is secure online banking. The user runs a commodity browser in a virtual machine. When the user is prompted to provide his private key to authenticate himself, the VM sends a message to a small application running on top of the microkernel. This secure application must show the information from the website and allow the user to input his private key, encrypt it and send it back to the browser in the VM. The browser then sends the encrypted data to the bank and the user is authorized. Thus, the untrusted software stack in the VM has no chance to compromise the authentication process, for example by sniffing the private key.

2.4 Summary

In this chapter, I motivated why secure computing is of importance and introduced microkernels as a basis for secure systems. I then introduced virtualization and its building blocks. This was completed by a discussion about the similarities and differences of virtual machine monitors and microkernels, and why it may be a good idea to combine both.

3 Related Work

This chapter gives an overview of projects related to this thesis. Presented systems include the microkernels Fiasco and L4Ka::Pistachio. Additionally, I will give a brief introduction to the capability-based EROS operating system to show that a secure and robust operating system can be built using…
…ivileged instruction may only be executed in privileged processor mode. If its execution is attempted while the processor is in an unprivileged mode, the execution must cause a trap.

Sensitive instruction: The term sensitive instruction refers to any instruction that has influence on the authority of the control program in one of the following ways:

• it changes the processor mode without a trap,
• it attempts to alter the amount of resources available,
• it produces different results when run in different processor modes or in different locations.

With these definitions in place, they draw the following conclusion:

"For any conventional […] computer, a virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions." [PG74]

In other words, a computer architecture is virtualizable if all sensitive instructions trap.

If an architecture is virtualizable, a VMM can be built in the following way: Guest code, both privileged and unprivileged, runs in unprivileged mode on the host. When sensitive instructions trap, the VMM steps in to emulate them and thus recreate their native behaviour. Thereafter, it returns control to the guest. This scheme is called trap and emulate.

2.2.2 Nomenclature

All code running inside the VM will be called the guest, as opposed to the host, which hosts the VMs.

All host code running in privileged mode will be…
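The trap-and-emulate scheme can be illustrated with a minimal model: a sensitive instruction traps, the VMM emulates its effect on shadow state, and the guest resumes as if the instruction had executed natively. The decoded-instruction struct and the shadow CR3 register are illustrative simplifications, not a real x86 decoder.

```cpp
#include <cassert>
#include <cstdint>

// Minimal trap-and-emulate sketch: guest code runs deprivileged, sensitive
// instructions trap, and the VMM recreates their native behaviour on a
// shadow copy of the privileged state before resuming the guest.
enum class Op { WriteCr3, ReadCr3, Hlt };

struct Trapped { Op op; uint64_t operand; };  // a decoded, trapped instruction

struct ShadowCpu {
    uint64_t cr3    = 0;   // shadow copy of a privileged register
    bool     halted = false;
};

// Emulate one trapped instruction; returns the value the guest should
// observe for reads, 0 otherwise.
uint64_t emulate(ShadowCpu& cpu, const Trapped& t) {
    switch (t.op) {
    case Op::WriteCr3: cpu.cr3 = t.operand; return 0;
    case Op::ReadCr3:  return cpu.cr3;
    case Op::Hlt:      cpu.halted = true;   return 0;
    }
    return 0;
}
```

The scheme only works if every sensitive instruction actually traps; an instruction that silently behaves differently in unprivileged mode (as on classical x86) never reaches `emulate`, which is precisely the Popek/Goldberg condition quoted above.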
…k ends for virtual devices. The guest's stub drivers communicate using hypercalls, which the VMM translates into IPC messages that are sent to the corresponding device managers. Asynchronous system events are received by blocking threads, flagged at the virtual IC, and eventually injected into the VM by the control loop.

Figure 4.6: Envisioned system architecture (control-loop thread, device threads and L4 console, connected by IPC, hypercalls and interrupts).

In summary, the VMM contains the following components:

Boot Loader: The guest is bootstrapped, provided with required system information, and finally started by setting the instruction pointer to the guest's kernel entry. Instead of using a standard boot loader, these operations are done by the VMM.

Control Loop: The VMM runs one control loop per virtual CPU. The control loops coordinate the execution of the VM, handle hypercalls, and inject virtual interrupts.

Platform Devices: The global interrupt controller, the APIC and a timer device are implemented directly in the VMM.

Peripheral Devices: Peripheral devices are accessed using a custom hypercall-based protocol that uses shared memory. The VMM maintains threads that receive external events and flag them at the IC. Guest stub drivers' commands are forwarded to the corresponding device manager.

5 Implementation

The implementation of the VMM started with the decision to use the C++ programming language…
…kernel. Its basis is the Linux kernel, and it runs as a user-land server on the Fiasco microkernel.

L4Linux executes in its own address space and runs its applications in separate L4 tasks. L4Linux applications can be forced to give control to the L4Linux server with a system call that has been enhanced for this specific purpose [Lac04]. Similarly, the Fiasco microkernel was enhanced in a way that it supports special threads that are not allowed to do system calls directly, but whose faults, exceptions and system calls are reported to their corresponding pager. L4Linux uses these special threads to run its applications and to remain in control of them. To administer client address spaces, L4Linux uses L4 map and unmap operations, which increases the costs for memory management compared to native Linux.

Native Linux maps itself into the upper gigabyte of its applications' address spaces. An entry into the Linux kernel, for example a system call, requires a privilege-level switch only. In L4Linux, the upper gigabyte of the application's address space is occupied by the Fiasco microkernel. When an application wants to voluntarily enter the L4Linux kernel, it executes an INT 0x80 (like in native Linux), which is intercepted by Fiasco. The microkernel synthesizes a message on behalf of the application and switches to the L4Linux address space to deliver the message. L4Linux receives the message and can now run the operation as requested by the application. In summary,…
…le. During execution, the OS can monitor the behaviour of individual jobs, in particular whether a job stays within the limits that were specified during its admission. A job that misses its deadline can cause subsequent jobs of other tasks to miss their deadlines as well. Therefore, if a contract violation is detected, the OS may take actions such as suspending the wrongdoer and thus allowing all compliant tasks to meet their deadlines.

It is the responsibility of the scheduler to pick the next job to run among all runnable ones. Scheduling decisions are taken at events such as release events or blocking of tasks. After each scheduling decision, a dispatch operation makes the chosen job run. In classical systems, both scheduling and dispatching are tightly integrated within the scheduler.

7.2 Case Study: Xenomai

Xenomai is a system that combines a real-time executive and a standard OS. Its architecture is depicted in Figure 7.1. It is built upon Adeos (Adaptive Domain Environment for Operating Systems), a small component that has exclusive control of interrupts. Adeos runs entities which are called domains. Domains are fully preemptable and ordered according to their priority. Adeos presents them with events, for example interrupts, according to their order.

A domain is free to take appropriate measures when presented with an event of interest. In particular, it may decide to keep executing. An RT doma…
…mestamp counter and APIC timer value two times in a row and comparing the difference in both values. While that algorithm works great on real machines, it did not work in the VM. The APIC clock granularity is limited, because it can only be a multiple of the host timer granularity. During the timer calibration, a host interrupt may occur. Host interrupt handling increases the timestamp counter of the CPU, but may be short enough that the APIC timer does not change until control is given back to the VM. Because host interrupts cannot be disabled by the VMM, timer calibration in the VM gives arbitrary results even though the APIC timer has a fixed frequency. Therefore, I disabled the timer calibration in Linux and instead provided it with a fixed value.

5.3 Peripheral Devices

Peripheral devices are needed for usability. For example, it is desirable for the VM to have access to the network and the hard disk, and to use a graphical console, including keyboard and mouse for input. To support these devices, I made use of existing device managers that multiplex the device functionality. L4Linux contained stub drivers for these device managers, which I used as a starting point for my implementation.

In the next section, the main design pattern of the device driver stubs will be introduced. I will then give more details on the implementation of the serial line, L4con and Ankh driver stubs. Thereafter, I will explain how I let the VM access t…
…mmand to the proxy driver to refresh the screen area. The proxy driver forwards the command to L4con to execute the refresh operation. A similar scheme is used for console input: input events are stored in a shared memory region, and the proxy driver's event thread translates the notifying IPC into a virtual interrupt, which is then injected into the VM to be processed by the stub driver.

The L4Re framebuffer interface used by L4con is also used by DOpE, which is a console multiplexer that draws clients in floating windows. Therefore, the VMM can make use of both console multiplexers without modification.

5.3.4 Network Interface

Network interface cards are multiplexed by Ankh. Ankh provides each of its clients with a shared memory area for in- and outgoing traffic. Each client gets its own MAC address, and all routing between clients and the outgoing interface is done transparently by Ankh. IPC is used both by Ankh to notify clients of new incoming data and by clients to notify Ankh of outgoing data. Figure 5.4 depicts Ankh multiplexing a network card with both an instance of L4Linux and a VMM with Linux in a VM as clients.

When I started this thesis, no Ankh stub driver was available for L4Linux, and I started the implementation of the stub driver by using the stub driver of Ankh's predecessor as a template. Like for the graphical console, I made the shared memory region visible to Linux as device memory. The stub driver implements the Ankh client inter…
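The shared-memory traffic exchange between stub driver and device manager can be sketched with a simple descriptor ring: descriptors are placed in the shared region and a notification (IPC in the real system, a hypercall from inside the VM) tells the peer to drain it. The ring layout below is illustrative only, not Ankh's actual format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A single-producer, single-consumer descriptor ring living in the shared
// memory region. One slot is kept free to distinguish full from empty.
struct Ring {
    std::vector<uint32_t> slots;  // packet lengths; payload lives elsewhere in shm
    size_t head = 0, tail = 0;
    explicit Ring(size_t n) : slots(n) {}

    // Guest stub-driver side: enqueue a packet descriptor. In the real
    // system the driver would now notify Ankh via a hypercall.
    bool produce(uint32_t len) {
        size_t next = (tail + 1) % slots.size();
        if (next == head) return false;  // ring full
        slots[tail] = len;
        tail = next;
        return true;
    }

    // Device-manager side, run after receiving the notification: drain all
    // queued descriptors; returns how many packets were consumed.
    int drain() {
        int n = 0;
        while (head != tail) { head = (head + 1) % slots.size(); ++n; }
        return n;
    }
};
```

Batching is the point of this design: one notification can cover many queued descriptors, so the number of hypercalls and IPCs grows with the number of bursts, not with the number of packets.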
n channel with the guest's dispatcher is in place, event-driven RT jobs are supported. (Footnote 1: An upcall may be implemented as an injected interrupt.) Furthermore, the design is much more flexible than the other two designs and allows up to 100 percent CPU utilization.

This design places the following requirements on the VMM: it needs to implement a hypercall for the admission of guest RT tasks, and it also needs to be able to relay scheduling decisions to the guest's dispatcher with a bounded delay.

Conceivable Implementation: I now detail how such a scheme could be implemented on top of the current Fiasco microkernel. In the current interface, kernel scheduling contexts are bound to threads. Therefore, upon admission, the VMM creates one thread (RT thread) with the scheduling parameters given by the guest for each RT task. The microkernel schedules a job by running the associated thread. It is up to this thread to trigger further actions that eventually lead to the job running in the VM. Once an RT thread is activated, it writes its identifier into a known location in guest memory. Then it signals the scheduling event to the guest by marking an interrupt for injection and initiating a world switch. The guest interrupt handler recognizes a dispatch event. Instead of activating the scheduler, as it would do in an unmodified system, it directly sets the dispatcher in motion. That, in turn, reads the identifier, finds the c
nd the guest kernel. Because the VMM's timer interrupt is driven with timeouts of the host, the guest timer interrupt granularity is limited by that of the host. In this thesis, I configured Linux to use a 100 Hz timer granularity on a host Fiasco with 1 kHz granularity. Instead of implementing a stub driver, I adapted the existing i8253 timer driver to communicate with the back end instead of the physical device.

5.2.1 Boot Loader

In this chapter, I will first describe the boot process of a typical PC. Thereafter, I will describe how Linux boots in Karma.

After powering the system up, the processor starts execution from a fixed state. The processor starts in real mode (16 bit) at the fixed address 0xffff0. At this address, it starts to execute the basic input/output system (BIOS), which is custom-tailored to the machine. The BIOS does initial checks, and finds and initializes installed devices. It then copies a block from the specified boot location, e.g. the first hard disk, into memory. This block is called the master boot record and contains the boot loader. After loading the master boot record, the system starts the boot loader therein.

The boot loader sets up the system according to the boot protocol of the operating system. It then loads the operating system kernel and finally leaves the system to the operating system's boot code.

The operating system then further configures the system for its needs. It sets up the
nel Compile with different types of hard disk access schemes . . . 63
4K versus 4M Pagesize . . . 64
Impact of World Switch System Call . . . 65
Kernel Compile in L4Linux with and without the refresh thread . . . 66
6.14 Kernel Compile within a second-stage VM . . . 66
7.1 Xenomai architecture . . . 69
7.2 RT task control loop . . . 72
7.3 Real-time VM setup . . . 73

1 Introduction

In the past two decades, computers have evolved into a ubiquitous tool. They are used all over the world for everyday work, for managing finances and for communication. Unfortunately, computer systems are less reliable than they should be. They are vulnerable to attacks that, if successful, may result in the disclosure of confidential information, which, in turn, may lead to serious financial losses. An examination of actual attacks reveals that often weaknesses of the operating system (OS) are to blame. In many cases, attackers compromise a single application and from there widen the exploit to the whole machine. While an OS cannot protect individual applications from being compromised, it should prevent the attacker from reaching out further.

The issues that plague OSes are a consequence of their development history. Early on, users primarily ran applications they knew and trusted. Under the premise that the user trusts its applications, there was no need for the OS
nel-based systems. There are a number of attack scenarios that are of no concern in this thesis. Attacks on the Linux kernel originating from malicious Linux applications or from the network may target guests in the VM as well. Although virtualization might be used to increase security, for example with analysis of the VM's behaviour, these scenarios are out of the scope of this work.

A VM running on a microkernel-based system poses a potential threat both to the microkernel and to its applications. For this thesis, I assume that the microkernel itself is resistant to attacks. This assumption draws on the microkernel's properties as described in Section 2.1.1. Especially for VMs, I assume that the mechanism used to construct the VM's protection domain is valid and does not allow the VM to access memory without permission. The kernel ensures that it receives physical interrupts and thus remains in control at all times.

I also assume that the microkernel mechanisms work to protect tasks from each other. The secure memory construction mechanism ensures that tasks can only access memory to which they were granted permission, and the capability system is correct and does not leak access rights. Furthermore, I assume that the peers used for services are well-behaved and safeguard the interests of all their clients.

Another class of attacks involves the VMM. Because the VM can only communicate with the VMM, attacks originating from the VM are limited to the VM
nes of code, which makes it easy to modify and extend. The hypervisor functionality is implemented with a kernel module, and as such can be loaded at run time. A tiny launcher application initiates new VMs and implements the virtio interface for paravirtualized device access [Rus]. The VM is controlled by the Lguest kernel module. A setup of Lguest is depicted in Figure 3.4.

3.3.3 NOVA OS Virtualization Architecture

NOVA is an attempt to create a secure virtualization environment with a small TCB from scratch. It consists of NOVA, a novel microkernel with hypervisor functionality (a microhypervisor), which aims to be small and efficient. NOVA supports capability-based access control, and has a minimal interface for hardware virtualization. On top of NOVA runs a multi-server user environment. Aside from drivers, it also runs the VMM Vancouver, which does faithful virtualization. Vancouver runs entirely unprivileged. Each VM is coordinated by its own instance of Vancouver [Ste]. The NOVA setup is illustrated in Figure 3.5.

3.3.4 KVM

The Kernel-based Virtual Machine (KVM) is a virtualization extension to Linux that enables it to act as a hypervisor. It strives to implement faithful virtualization and uses a

Figure 3.5: NOVA OS Virtualization Architecture, running a VM wi
nnot be ensured by the current design, because the RT threads cannot influence their preemption. Thus, a solution has to involve the microkernel. The world-switch system call could be enhanced with two additional parameters: the ID, and a memory address inside the VM where the ID should be written. Before executing the world switch, the kernel would check whether the VM has enabled interrupts, which is an indication that the previous job was dispatched, temporarily disable concurrency, for example by using a lock, and write the specified ID into guest memory before executing the world switch.

(Footnote 2: This mechanism assumes that the previous VM resumption has made all the VM state available, i.e. it does not hold parts of it on its kernel stack, where it is hard to recover from. Only with the last state is it possible to resume the VM.)
(Footnote 3: Job completion is signalled with a hypercall, which can be detected by the RT thread by analyzing the VMCB.)

    while task_active:
        wait_for_release()
        while job_not_done:
            write_job_id()
            world_switch()

Figure 7.2: RT task control loop

Figure 7.3 gives an impression of the targeted system architecture.

7.6 Summary

In this chapter, I introduced RT support for virtual machines. I discussed several architectures and presented a sketch of the implementation of a solution for real-time VMs that is flexible and does not compromise on CPU utilization. Its
nter movements or video playback. Therefore, both L4Linux and Karma employ a thread which refreshes the whole screen repeatedly to make sure all drawing operations become visible. For this benchmark, I disabled the periodic redraw operation because it is not needed for console output and seriously degrades performance. I will revisit the refresh overhead in Section 6.2.3.

I expect the virtualized Linux to have only slightly reduced performance compared to native Linux. Overhead occurs because of the execution time that is needed for both the VMM and its transitions to the Fiasco microkernel to execute a world switch. Since both the VMM and the Fiasco world-switch path have low complexity, I expect this overhead to be small.

The KVM setup is included only to give an impression of the performance of a well-established virtualization platform. However, the measured setup differs in important properties. Contrary to all other measurements, which use AHCI for hard disk access, the benchmarks in KVM used IDE. Furthermore, I disabled graphical output and instead used a serial line. Because of these differences, the numbers measured in the KVM setup are not directly comparable to the other benchmarks.

The measurements met my expectations. With one CPU, Karma Linux ran the kernel compile with a negligible performance overhead. For SMP, the overhead is slightly larger, which results from the IPI implementation, which leaves room for optimization. In a more gener
ntroduce how I implemented support for multiple CPUs.

I will follow the naming conventions that Intel laid out in the MP specification [Int]. The system boots from the bootstrap processor, which is elected either by the BIOS or by the hardware. The bootstrap processor is used to boot the operating system. All other CPUs in the system are called application processors and are activated by the operating system.

An SMP system consists of a bootstrap processor and one or more application processors. Each processor in the system has its own advanced programmable interrupt controller (APIC). The APICs have their own IDs and are able to send inter-processor interrupts (IPIs) to other APICs in the system. They also deliver interrupts from the system's peripheral devices.

On x86 systems with multiple CPUs, the BIOS provides the OS with information about all available processors and their status (enabled, disabled). This data is stored either in a data structure called the MP table, or in ACPI tables.

The PC boots with the bootstrap processor. During boot time, the operating system reads an in-memory data structure, the MP table, or ACPI tables to find other processors, and initializes the APIC. It then calibrates and activates the APIC timer and sends IPIs to activate the application processors.

The Linux in the VM has the same boot sequence. The VMM starts up with one control loop, which is used as the bootstrap process
object capabilities. I will also present the rehosted operating systems L4Linux, Wombat, OK Linux and User-Mode Linux, as well as important examples of VMMs.

3.1 Microkernels

In this chapter, I will introduce two second-generation microkernels. Beforehand, I will shortly revisit the major design features that constitute second-generation microkernels.

When we refer to second-generation microkernels, we usually mean microkernels of the L4 family. The L4 specification was devised by Jochen Liedtke and includes three basic abstractions: address spaces are represented by tasks and used as isolation domains; activity inside of tasks is abstracted as threads; communication between threads is done with synchronous inter-process communication (IPC).

Address spaces can be constructed with three operations [Lie95]:

Grant: One task may grant a page to another task if the recipient agrees. The granted page is removed from the granter's address space and included into the grantee's address space.

Map: A task may map a page to another task on agreement. Thereafter, the mapped page is visible in both the mapper's and the mappee's address space.

Unmap: A task may unmap a page without consent of the mappee. The page will be recursively unmapped from all address spaces that contain a mapping of the page.

With these operations, address spaces can be constructed recursively by user-land servers. In an L4 system, a special task, sigma0, initially owns all physical memory. Up
ol Block: a data structure containing both the virtual CPU's state (state save area) and settings to control its execution (control area).

Virtual Machine Monitor

Worst-Case Execution Time

Bibliography

[ABB+86] Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis Tevanian, and Michael Young. Mach: A New Kernel Foundation for UNIX Development. Pages 93-112, 1986.

[AH] A. Au and G. Heiser. L4 User Manual. School of Computer Science.

[AMD07] AMD. AMD I/O Virtualization Technology (IOMMU) Specification. AMD Corporation, 2007.

[BDF+03] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. ACM SIGOPS Operating Systems Review, 37(5):164-177, 2003.

[Bie06] Sebastian Biemueller. Hardware-Supported Virtualization for the L4 Microkernel. Diploma thesis, System Architecture Group, University of Karlsruhe, Germany, September 29, 2006.

[CN01] P. M. Chen and B. D. Noble. When Virtual Is Better Than Real. In Proceedings of the 2001 Workshop on Hot Topics in Operating Systems (HotOS), pages 133-138, 2001.

[Cor] Microsoft Corporation. Microsoft Security Bulletin MS09-050. http://www.microsoft.com/germany/technet/sicherheit/bulletins/ms09-050.mspx. Online, accessed 03.01.2010.

[Dik00] J. Dike. A user-mode port of the L
on that a hierarchy of pagers can be built.

In L4, IPC is the basic mechanism for communication between threads. IPC is unbuffered and synchronous: only if both sender and receiver agree and are ready does the kernel perform a rendezvous of both and deliver the message. This modus operandi proved to perform well [Lie94]. Upon this mechanism, other techniques such as remote procedure calls may be implemented.

3.1.1 Fiasco

The Fiasco microkernel is a project at TU Dresden that arose from an effort to create a real-time capable microkernel. It is written in the high-level programming language C++ and runs on x86, x86-64, ARM and PowerPC platforms. A Linux user-land port is also available for development purposes. Support for symmetric multi-processing (SMP) is available, and the kernel has been shown to offer good performance with a port of Linux [HHL+97]. Its sources are provided under the terms of the GPL to encourage community participation. Originally, it implemented the L4 specification and has since been used as a research vehicle to explore and evaluate evolutions in the field of microkernel research. It includes refined support for standard monolithic operating systems and is the platform for a ported version of Linux. Fiasco is co-developed with its run-time environment.

Fiasco sports support for secure virtual machines, which are another embodiment of protection domains alongside tasks. Contrary to tasks, its interface does no
ons. However, availability is also an important part of security: secure best-effort tasks must not be brought to a halt by the VM.

7.4 Multiple Slightly Modified RT Guests

Whereas running RT tasks in a VM shields the remainder of the system, RT tasks can still interfere with each other in the common case that the RT OS does not provide adequate isolation. VMs suggest themselves as protection domains, which leaves the problem of scheduling more than one of them in a timely manner.

One solution is to run multiple slightly modified guest OSes, which are scheduled round-robin with fixed shares of CPU time. The guest OSes are given the illusion of running on a slower CPU.

The guest OS signals the timeouts that it needs to implement its internal schedule to the host, which is in command of the system timers. Release timeouts, marking the point in time when an RT task gets ready, are accepted without modification. The guest may use timeouts to provision against WCET overruns. These timeouts are calculated by the guest under the assumption that the job runs on a slower CPU. In reality, it is run at full native speed within the guest's time slice. The difference in computing time needs to be taken into account by the host, which is done by shortening the provided job length by the slowdown factor. In the same way, compute timeouts need to be shortened.

Such a design allows multiple modified RT guests that ne
or. Inside the VM, Linux does all the initialization as described before.

There are two solutions to provide Linux with the information about how many (virtual) CPUs are available: either the VMM creates an MP table, as described before, or Linux retrieves this information with hypercalls. I decided to use the former solution because the algorithm to create an MP table is simple, and I did not have to alter the early boot process of Linux, which can be tricky.

Once the bootstrap processor has done its initialization, it activates the application processors. On a bare machine, processor activation is done with IPIs, which are sent by the bootstrap processor's APIC to the target processor's APIC. On x86, the application processor is initialized in 16-bit mode, which I wanted to avoid. Therefore, I paravirtualized the activation of application processors with a hypercall. The VMM creates a new APIC instance and a VMCB that contains the CPU state of the application processor that is started. The application processor starts up in protected mode, and its segment and interrupt descriptor tables are set up in the same way as on the bootstrap processor. The VMM then creates a new thread that runs a control loop using the newly created VMCB.

The APIC implementation is similar to the interrupt controller implementation. An APIC can trigger an IPI at another APIC; both direct IPIs as well as broadcasts are supported. The control loop belonging to that APIC polls for interrupts a
orm of automated OS rehosting that attempts to achieve the performance of paravirtualization with little porting expenditure. Pre-virtualization allows an adapted operating system binary to be run on bare hardware as well as on a number of different hypervisors, such as L4 and Xen.

To achieve good performance, the guest operating system kernel is modified. Modifications are largely automatic, and include the rewriting of assembly code to pad sensitive instructions, the analysis of high-level source code to find sensitive memory accesses, as well as structural modifications such as the introduction of function call overloads and an empty region within the guest's virtual address space. A hypervisor can decide whether or not to modify the guest kernel to make it benefit from paravirtualization. Possible run-time modifications include the injection of VMM code and the replacement of sensitive instructions, as well as the replacement of kernel functions. The injected VMM code can, for example, implement mode switches of the virtual CPU, batch page table updates, and issue efficient calls to the hypervisor where appropriate.

A prototype implementation supports L4 kernels as well as Xen as back ends, and supports the Linux operating system as a guest. This implementation already showed good performance, with only 5.5% performance degradation with the Xen back end and 14.6% with the L4 back end, both compared to native Linux [LUY+08].

3.2.5
orresponding job and makes it run. The RT thread's control loop is depicted in Figure 7.2.

A return from the world_switch() system call can have two reasons: either the job is completed, or the job has been preempted. When a VM is scheduled after a preemption, control is not transferred back to it. Instead, its VMM is given control. That is different from the execution of threads, which are first-class kernel objects with respect to scheduling and are not controlled by a third party such as a VMM. For them, preemptions are transparent and thus not directly noticeable. Preemptions occur, for example, when a higher-priority job is released and is immediately scheduled. Scheduling decisions are taken at the host, and are visible to the RT thread only with a return from the world_switch(). Therefore, after returning from the world_switch(), the control loop has to check whether the job is completed or whether it has been preempted in favour of another job. In the latter case, the execution of the unfinished job has to be resumed. To that end, the job's identifier is made visible to the guest and a dispatch event is injected.

Because there is only one location for the task ID, which is used by all RT threads, a preemption between writing the ID and entering the VM gives rise to a race condition. As a result, the guest might receive the wrong ID and thus dispatch the wrong RT task. Supplying the ID and the world switch must therefore be atomic. Unfortunately, that ca
percall to make the KVM back end acquire a fresh VM capability. A world switch into the new VM is done with a hypercall as well. The hypercall is augmented with the address of the second-stage VM's VMCB and the general-purpose registers. Using this information, the back end executes the vm_run_svm system call, which makes Fiasco switch the CPU to the second-stage VM. As in ordinary VMs, Fiasco ensures that it remains in control at all times. Therefore, the second-stage VM's execution is left upon physical interrupts. Thereafter, the Karma control loop takes control and returns to the first-stage VM.

    44 files changed, 2697 insertions(+), 133 deletions(-)

Figure 5.5: Output of diffstat

Initially, the second-stage VM's address space is empty. Upon VM page faults, KVM does a hypercall to map a page into the VM.

I could do the adaptations to KVM rather quickly, because the points in the KVM code that needed to be adapted are the same as in KVM-L4.

5.5 Summary

Complexity: I created the VMM with its future use in mind. Therefore, I wanted to keep its overall complexity low, so that other people can easily understand the application logic and start extending it. An indication of the complexity of an application can be given with its total lines of code. My VMM implementation has about 3800 lines of code. The line count alone can be misleading, so here are more numbers:

- Lines of code: 3800
- Files: 24 heade
plication basis with their software. Because of this variety of available software, I chose Linux as the target interface of my work.

On top of Fiasco, Linux applications are supported by two projects: L4Linux and KVM-L4. L4Linux (see Chapter 3.2.1) is a version of Linux that has been rehosted to run directly on the L4 API. The rehosted Linux kernel has inherent performance disadvantages compared to native Linux. As illustrated in Figure 4.1, a kernel entry in native Linux (a) requires only one privilege-level switch. In L4Linux (b), the same system call needs to be mediated by the microkernel, requiring two privilege-level switches and one address-space switch.

Virtualization extensions recently added to mainstream processors simplify the implementation of virtual processors. When the guest runs in a VM, the microkernel is not involved with the guest's operations such as system calls, as shown in Figure 4.1 (c). Therefore, the performance disadvantages of rehosted kernels related to processor virtualization do not apply to OSes running in VMs.

Virtualization extensions are used on top of L4 with KVM-L4. Its VMs already yield good performance. However, KVM-L4 relies on the services of L4Linux and Qemu, and thus has a substantial resource footprint.

In this thesis, I want to make Linux applications usable on top of a microkernel. They should run with as much performance as possible. The solution shall require fewer resources than KVM-L4.

For its security prop
re exposed to pagers, special programs that implement a memory-management policy. A secure mechanism allows page mappings to be created, and an unmap operation ensures that memory can be revoked at any time without the mappee's consent. Memory-management operations except unmap require consent and are therefore piggybacked onto IPC.

The first implementation of such a second-generation microkernel, written by Jochen Liedtke, was called L4 [Lie96]. Today, the name L4 is used for a family of successors that implement this interface [AH].

Microkernels of the second generation proved to be a suitable platform upon which highly decomposed, secure systems can be built [PSLW09]. Microkernel-based systems are regarded as being highly modular, providing good fault isolation, and being flexible and tailorable [Lie96].

Microkernels impose a slight overhead compared to monolithic systems. Instead of calling subsystems with method invocations, communication is done using IPC. An IPC call from one user-land server to another requires switching from user privilege level to kernel privilege level, possibly switching address spaces, and switching back to user level. Thus, microkernels introduce transition overhead as well as an increased TLB and cache footprint. Because there are a number of servers communicating with each other to provide the operating system's functionality, IPC performance is very important. Härtig and colleagues were able to show that
reads could write the virtual interrupt into the UTCB, and the microkernel would inject them into the VM. Thus, the control loop's duties would be reduced to hypercall and error handling. However, such a solution comes at the cost of increased kernel complexity. For the KVM port, an additional flag needs to be introduced, which disables virtual interrupt injection and instead returns to the control loop. Otherwise, virtual interrupts intended for the first-stage VM would be injected into the second-stage VM.

- The KVM port could be optimized to not return from the second-stage VM into the first-stage VM on every physical interrupt. A return into the first-stage VM is only needed if virtual interrupts need to be injected. After leaving the second-stage VM, the control loop would check for pending virtual interrupts, and would resume the second-stage VM immediately if none are pending.

Future Work: This work opens up a range of possible projects. I will now introduce the most important ones that I have in mind.

Device Manager for Block Devices: Currently, the hard disk can only be accessed by a single VM; it cannot be shared directly among VMs. Sharing would be possible, for example, using a network file system that is exported by the VM with write access and used by all other VMs. A solution for direct disk sharing would be to implement a device manager. It would include the hard disk drivers and export a block-based interface to clients. The multiple
refore inaccessible to its applications, which solves the security problems. Additionally, the patch obviates signal delivery on system calls and thus improves performance considerably. The patch is called SKAS, which is short for Single Kernel Address Space [ska].

Another patch further improves performance by enhancing the ptrace mechanism with a new command that allows system calls to be cancelled. This patch is called User Mode Linux Sysemu [sys].

Both patches have not yet found their way into the main Linux source tree and are currently outdated.

UML uses virtual devices which are constructed from software resources provided by the host [Dik01]. UML may use any host file as a block device and attach to any host network interface.

UML was created to show that the Linux system call interface is sufficient to host itself. It is now used as a tool for operating system developers who want to make use of general-purpose debuggers. It can also be used for sandboxing and jailing hostile code as well as untrusted services. Another application is the consolidation of physical machines. Ports of UML to other host operating systems may be used to provide the Linux interface on such systems.

A running instance of UML is illustrated in Figure 3.2.

Figure 3.2: User-mode Linux running on top of Linux

3.2.4 Pre-virtualization

Pre-virtualization is a f
rformance is about 18 percent for the bench.sh benchmark, which is rather large. For the kernel compile benchmark, the difference is much smaller, about two percent.

Figure 6.12 (y-axis: time in seconds; x-axis: number of CPUs, 1-3): Kernel compile benchmarks with a custom slowdown of the world-switch path. The red bars denote zero slowdown, light green denotes a slowdown of 5000 CPU cycles, and the dark green bars denote a slowdown of 10000 CPU cycles.

6.2.2 Impact of World Switches

An important aspect in virtualization is how fast a switch between guest and host and vice versa can be executed. To get an impression of how the world switch influences the overall system performance, I measured the kernel compile with an artificially degraded world-switch path. The VM used 3 CPUs, 512 MB RAM and the framebuffer interface.

The measured numbers are depicted in Figure 6.12, and indicate that the world-switch path complexity does not have a significant impact on the performance of complex tasks such as a kernel compile.

6.2.3 Screen Refresh

One of my goals was to support X11 (graphical) applications. Because X11 can make use of framebuffer devices, X11 applications are supported with the L4con interface. However, the X11 framebuffer drivers do not update the screen, for example after the mouse pointer has moved. Similarly, windows containing video playback are not refreshed
rison to other approaches and to non-virtualized systems.

Responsible university lecturer: Prof. Dr. Hermann Härtig
Supervisor: Dipl.-Inf. Adam Lackorzynski
Institute: Systems Architecture
Chair: Operating Systems
Start: 01.08.2009
Submission due: 31.01.2010

Dresden, 20.08.2009 — Signature of the responsible university lecturer

Declaration

I hereby declare that I produced this work independently and used no aids other than those stated.

Dresden, 27 January 2010

Steffen Liebergeld

Abstract

Microkernels were invented as a foundation for systems that can be tailored and adhere to strict security requirements. As with any new system, application availability is crucial to its adoption. Reusing a standard OS kernel on top of the microkernel is a pragmatic way of inheriting an OS's API and, with it, its applications.

In the past, standard OS kernels were ported to run on top of microkernels. It turned out that substantial efforts were needed to implement the equivalent of a CPU, which was the model the OS originally assumed. In addition, this approach incurs substantial run-time overhead. Many OS activities such as context switches are security sensitive and have to be implemented with microkernel-provided abstractions. Due to the microkernel involvement, rehosted kernels suffer from inherent performance disadvantages. Virtualization technology promises better performance
Performance: Application performance should match native performance as closely as possible.

Complexity: The solution shall be of low complexity. This keeps the costs for maintenance low and helps future researchers to adapt it for their own projects. It can improve security because small systems are suited to a more thorough revision process.

4.2 Architecture

As described in Section 2.2, virtualization requires core, platform and device virtualization. Core virtualization is concerned with CPU and memory virtualization, both of which are done by hardware extensions.

Following Jochen Liedtke's microkernel rationale, all uncritical functionality has to be implemented in user land (see Section 2.1.1). Core virtualization is concerned with security-critical operations. The vmrun instruction to switch the CPU to guest mode is a privileged instruction and thus can only be executed by the microkernel.

Figure 4.2: Setup with two VMs.

The microkernel must retain control of the machine at all times. More specifically, it has to receive and handle all system interrupts. For example, if the guest were able to consume the system timer interrupt, the host would not be able to schedule and thus lose control over the machine. That is why the microkernel must enforce that host interrupts occurring during VM execution are intercepted and delivered to the host.
headers, 13 implementation files
• Classes: 21

An important aspect is the complexity of the modifications done to the Linux kernel. The code encompasses the stub drivers, adaptations needed to make use of DMA transfers using the physical hard disk, and the modifications I made to the start-up process of application CPUs (see Section 5.2.2).

The complexity of the patch to the Linux kernel is visualized with the output of diffstat in Figure 5.5. The stub drivers comprise about 2300 lines.

I was able to port the patch to a new Linux version in about one hour. I think that the effort to keep the virtualized Linux up to date is reasonably small.

Resource Footprint: Karma VMM needs 284KB of memory for an instance, which is an improvement compared to KVM-L4. KVM-L4 requires L4Linux and Qemu, which at a rough estimate need about 16MB of memory on their own.

Footnote: The resource footprint of 284KB has been measured for an instance with 3 virtual CPUs. Each virtual CPU requires memory of its own. Therefore, instances with more than 3 virtual CPUs would require slightly more memory.

Additional resources are needed to run the L4 device managers. There is one device manager instance per physical device. Every device manager needs resources on a per-client basis. The L4con secure console, for example, allocates a memory area with the size of the framebuffer that each client uses to draw its graphical output. Additionally, some memory
(IPIs). Operating systems make use of IPIs to start processors, trigger remote function execution and to do synchronization.

For this thesis, I decided to equip each virtual CPU with an advanced programmable interrupt controller (APIC) back end, which contains facilities for IPI delivery.

4.6 Staged Virtualization

Many operating systems do not come with source code access and can therefore not be adapted to run directly in a VM established by the VMM. Microsoft Windows is the most prominent example: it is in wide use and has gained a huge amount of available commercial software over the years. Another class of operating systems are legacy OSes that are no longer supported by their vendors but still run important legacy software.

It is desirable to make this huge body of software available. Faithful virtualization, the recreation of an existing physical machine, is a solution to this problem as it enables virtualization of unmodified OSes.

Peter and colleagues showed that KVM can be ported to run on top of the Fiasco microkernel [PSLW09], and that its performance is on par with native KVM. This KVM port requires an instance of L4Linux and Qemu.

I plan to do a similar setup that, instead of requiring L4Linux, uses a VM with a para-virtualized Linux. To describe this, I will use the following notation: the VM that runs the modified Linux instance will be called the first-stage VM, whereas the VM

Figure: First-stage VM and second-stage VM.
spaces. The address resolution is done by the MMU and involves parsing of the guest page table to translate a guest virtual address to the guest physical address, and subsequently parsing of the host page table to find the corresponding host physical address (see Figure 2.2).

2.2.4 Platform Virtualization

Virtualizing the CPU and memory management falls short of virtualizing a whole machine. An operating system relies on the services of a number of tightly coupled devices that need to be implemented in the VMM because of performance considerations.

Such devices include an interrupt controller and a timer device. Both are of lower complexity compared to peripheral devices and can therefore be implemented without increasing the VMM complexity significantly.

2.2.5 Peripheral Device Virtualization

Providing the VM with peripheral devices such as a keyboard, a mouse, network interface cards, hard disks and graphics cards is essential to the usability of the VM. In contrast to platform virtualization, which has to be implemented in software by the VMM, peripheral devices may be virtualized in several ways.

One solution would be to fully emulate an existing device. To do so, the VMM has to implement a full device model and to intervene on any device access. Such a solution allows guest operating systems to run unmodified by using their native drivers.

Such a solution increases VMM complexity, because it requires the full device
these calls to the VMM, using the vmmcall instruction (hypercalls). Hypercalls will facilitate shared memory communication to increase the expressiveness of individual commands and thus significantly reduce their number.

16-bit real mode is a relic from the early days of x86. It is used by all current OSes for their early set-up only. During operation, protected mode (32-bit) is used exclusively. Whereas AMD SVM supports hardware-assisted virtualization of real mode code, only the most recent Intel processors with Intel VT have that capability. With the prospect of future Intel VT support, I decided not to support real mode code. Like 16-bit mode, the BIOS is a remnant of former times. Its only purpose is to provide a backward-compatible environment during the boot-up stage. With VMs starting directly in 32-bit mode, the BIOS is rendered unnecessary as well.

A number of OSes cannot be modified, either because they are distributed without sources, or because their license does not allow modification. These OSes cannot be adapted to the target VM and therefore cannot run para-virtualized. The most prominent example is Windows, which runs a huge variety of commercial applications. In a second step I facilitate Linux functionality to establish virtual machines that do faithful virtualization. I will revisit fully backward-compatible VMs in Section 4.6.

4.4 Security Considerations

In this section I will shed light on the security implications of VMs on microker
using asynchronous IO rings, and interrupts are delivered using an event mechanism. Because Dom0 has full device access, it belongs to the trusted computing base of all VMs [MMH08].

A typical Xen setup is depicted in Figure 3.3. Xen runs in Ring 0, with a para-virtualized guest driving the host devices (Dom0). It also runs a VM with a para-virtualized Linux, and a VM with an unmodified Windows using hardware virtualization.

The DomU VMs do not access physical devices. A device access of the DomU running Linux is initiated by the Xen device driver (a). The Xen device driver communicates using an efficient protocol (b) with a management application in Dom0 (c), which multiplexes the device for all DomUs. The management application then uses mechanisms of the guest OS in Dom0 to access the physical device (d).

If Dom0 drives devices with DMA memory accesses, both the Xen kernel and the OS running as Dom0 belong to the TCB.

Figure 3.4: Linux running a para-virtualized Linux in a VM using the Lguest hypervisor.

3.3.2 Lguest

Lguest is an extension to Linux that allows it to act as a hypervisor by leveraging hardware virtualization extensions. Contrary to KVM it does not support faithful virtualization, but runs para-virtualized Linux kernels exclusively. For device virtualization it resorts to the virtio framework. It is meant to be of minimal complexity and comprises only 5000 li
guest enables interrupts. Upon virtual interrupt intercept, the VMM injects the first interrupt via the event injection mechanism of SVM, and the second one as a virtual interrupt. When only one interrupt is pending, it is injected as a virtual interrupt and virtual interrupt intercepts are disabled. With this scheme we can make sure that the interrupt latency is short and all interrupts are injected.

5.2 System Environment

My implementation draws on work done by Torsten Frenzel, a PhD student at the chair for operating system research at TU Dresden, who is working on a VMM that targets the ARM platform with the ARM TrustZone architecture [Lim]. At the time I started working on my thesis, he already had implementations for both a timer and an interrupt controller (IC). I used his code to jump-start my VMM implementation, and modified it where necessary.

The IC implementation consists of a stub driver residing in Linux and a back end residing in the VMM. The back end consists of a shared data structure holding the interrupt state, and access functions both for the interrupt threads to trigger interrupts and for the control loop to poll for pending interrupts.

The timer back end resides in the VMM and contains a blocking thread that uses timeout IPC to get a notion of time. It periodically flags timer interrupts at the virtual IC, which are subsequently injected by the control loop.

Timer interrupts are the basis for scheduling and timeouts both in the host a
test.sh

Figure 6.4: Bench.sh benchmark (run time in seconds for one to three CPUs, comparing Native, Karma, L4Linux and KVM).

The bench.sh benchmark is a pathological case for L4Linux, which shows bad performance. The bad performance of L4Linux is caused by the frequent interaction with the Fiasco microkernel to create tasks, to populate them with memory and subsequently to destroy them.

6.1.3 Kernel Compile

The performance of a Linux system can be measured in many ways. A benchmark that stresses a lot of operating system services is compiling the Linux kernel.

In the measured scenario, I used an install of the Linux distribution Debian in version 5 (Lenny) as basis. The compiler version is gcc 4.3.1. The compilation time is measured with an external clock, as described in the previous section, and therefore does not rely on virtual time, which may drift. The measured results can be seen in Figure 6.5.

To reduce the performance impact of L4con, which is used for input and output, I minimized the amount of text output of the kernel compile. Furthermore, I disabled the Ankh network multiplexer. All data, including the compiler binary and the kernel source code, are loc
protect Xen from guest operating systems, and segments must not overlap with the top end of the linear address space. Updates to page tables are batched and validated by the VMM. The guest OS must run at a lower privilege level than Xen. Usually the guest kernel runs in Ring 1, with guest applications

Figure 3.3: The Xen hypervisor running one VM to drive the hardware (Dom0) and two VMs with no direct hardware access (DomU). Windows is run using hardware virtualization, whereas Linux is run para-virtualized.

in Ring 3 (ring compression), using the host MMU to provide isolation between guest applications.

With the advent of hardware virtualization technology, Xen was enhanced to make use of it. Therefore, Xen can now also run unmodified OSes.

A Xen setup includes one para-virtualized VM that runs the devices of the host machine and is called Dom0. It also runs a management application, which can be used by the administrator to manage VMs.

All other VM instances, called DomU, have no direct hardware access, but communicate with Dom0 for multiplexed device access. Devices for all VMs besides Dom0 are implemented with a custom interface. Data is transferred u
not consist of system calls and virtual memory, but of a virtual CPU and virtualized memory including a virtual MMU. VMs are subject to the same address space construction rules as tasks.

Recently, support for object capabilities was added to the kernel. On top of it, a capability-based run-time environment was developed that follows the principle of least authority and aids developers in building applications. This run time is named L4 Runtime Environment (L4Re).

3.1.2 L4Ka::Pistachio

L4Ka::Pistachio is a microkernel developed by the System Architecture Group at the University of Karlsruhe in collaboration with the DiSy group at the University of New South Wales, Australia. It is written in C++ and runs on x86 and x86_64 as well as PowerPC machines. The microkernel provides SMP support and fast local IPC [pis].

3.1.3 EROS

The Extremely Reliable Operating System, or EROS for short, is an attempt to build a system that uses capability-based privilege management for any operation without exception. It has a small kernel and uses a single-level storage system for persistence, which is transparent to applications. The EROS system is a clean-room reimplementation of its commercial predecessor KeyKOS, written in the high-level language C++ for the x86 platform. Address spaces are used for isolation and object capabilities for access control. Early microbenchmarks show performance comparable with Linux.

EROS and its predecessors showed that efficient and secur
at each VM exit, and injects them into the VM. Therefore, to keep IPI latency low, the target processor has to be forced to leave VM execution, which can be done by triggering a physical interrupt on that processor.

In Fiasco, IPIs are used to signal remote CPUs to deliver cross-CPU IPC and User Irqs. IPC is synchronous, and would require the sender to wait until the receiver is ready. User Irqs instead are asynchronous and allow the sender to continue execution immediately. I employ an additional event thread per CPU to wait for an incoming User Irq. Because this thread runs directly on the target CPU, a User Irq sent to this thread forces the target processor to leave VM execution. Its control loop can then immediately inject the IPI.

The control loop of the bootstrap processor additionally polls the IC for device interrupts. Therefore, in this implementation interrupts of peripheral devices are injected into the virtual bootstrap CPU only. Linux usually controls interrupt affinity, for example to do interrupt load balancing. In this thesis, I decided to ignore the interrupt affinity to decrease the APIC implementation's complexity. This is in conformance with the physical APIC's default behaviour, and in my testing Linux was able to handle it without problems. However, support for interrupt affinity can be added in the future, if needed.

APIC Timer Calibration: Linux calibrates the APIC timer by disabling interrupts and reading the CPU's ti
executes the interrupt handler that belongs to the stub driver. Figure 5.3 shows a comparison of a stub driver in L4Linux and the driver setup in Karma.

Footnote: On Fiasco such notifications can be implemented either per IPC, or asynchronously with User Irqs.

Figure 5.3: On the left-hand side there is a standard L4Linux kernel with a stub driver. The stub driver consists of the driver logic and the L4 IPC interface for communication with the L4 device manager. On the right-hand side both Karma and a Linux VM are depicted. The stub driver in the Linux kernel has been split so that the driver logic remains inside of Linux, and the L4 IPC interface resides in the VMM. A custom hypercall-based protocol connects both parts.

5.3.2 Serial Line

A serial line is the most simplistic interface for input and output. For a serial line only two operations are needed: a read operation reads a byte from the line and a write operation writes to the line. Availability is signaled to the client. Such a simple serial line interface is implemented in Fiasco and is used for the console. I implemented a small driver back end in the VMM to make use of this interface.

The stub driver was derived from the L4Linux serial driver by Adam Lackorzynski. The Linux serial line interface has operations for IO of single characters, but also for whol
CPU virtualization, memory virtualization and overall performance. Second, I tweaked system parameters and measured how they affect the overall system performance. I did this to get an idea where further optimization might be needed. The chapter is completed with measurements that evaluate the performance of second-stage VMs.

6.1 Performance Benchmarks

In this section, I will evaluate the performance of Karma. First, I measured the performance of the CPU virtualization with a compute benchmark. Second, I devised a custom benchmark that stresses the OS interface as well as the performance of both the memory virtualization and, for SMP, the IPI implementation. Third, I measured the compilation of a Linux kernel from source to binary, to evaluate the overall system performance.

6.1.1 Compute Benchmark

To measure the performance of CPU virtualization, I used an application that uses a brute-force approach to find prime numbers. The measured run time is given in Figure 6.1. As expected, CPU virtualization imposes an insignificant overhead on compute-intensive workloads.

            Time    Percent
Native      5m 4s   100
Karma       5m 5s   99.67
Nested VM   5m 8s   98.70
KVM         5m 7s   99.02

Figure 6.1: Compute benchmark.

#!/bin/sh
sh test.sh eins &
EINSPID=$!
sh test.sh zwei &
ZWEIPID=$!
sh test.sh drei &
DREIPID=$!
wait $EINSPID
wait $ZWEIPID
wait $DREIPID

Figure 6.2: bench.sh
with Linux and one with Fiasco.

Figure 3.6: KVM running a VM side by side with normal Linux applications.

slightly modified instance of Qemu for emulation. KVM requires hardware virtualization extensions and supports nested paging.

In a KVM setup the full monolithic Linux kernel runs in privileged mode and, in terms of security, counts towards the TCB of all applications and VMs running on top of it. A KVM setup is illustrated in Figure 3.6. It runs one VM with the help of Qemu side by side with Linux applications.

3.3.5 KVM-L4

Peter and colleagues ported KVM to run on an L4 system. The KVM kernel module was ported to run in L4Linux. Its communication with the hardware was modified to make use of Fiasco's support for VMs. The microkernel is the only privileged component and enforces isolation between VMs and applications. KVM-L4 does not increase the system's TCB.

Figure 3.7: KVM-L4 running one VM and L4 applications with a small trusted computing base side by side.

Therefore this setup allows applications with a tiny TCB to run side by side with virtual machines. This setup is a huge improvement compared to native KVM in terms of security. KVM-L4 does not degrade performance compared with KVM [PS
the Control Loop

When an OS is idle, it executes the hlt instruction, which, on a physical machine, suspends the processor until an interrupt occurs. In the VM, the hlt instruction is handled by the control loop, which yields the processor for other activities that are ready to execute, until a virtual interrupt is flagged as pending by an event thread. The complete control loop is sketched in Figure 5.1.

At the moment, yielding the CPU is implemented with a timeout IPC of 1 ms and subsequent polling, which is problematic with respect to interrupt latency. In this implementation interrupt latency can be as large as 1 ms. I also implemented an alternative that uses asynchronous User Irqs. In this implementation, the control loop blocks on a User Irq that can be triggered by event threads on incoming events. Once the event thread has triggered the User Irq, the control loop immediately continues execution. This implementation does not use polling, and in theory should keep the interrupt latency low. However, in my measurements the first implementation performed better. The reasons for this counter-intuitive behavior are unknown and subject to future research.

In summary, the control loop has to handle the following exit reasons:

1. Asynchronous Exits
   Interrupt: The VM was left because of a host interrupt. After the host OS returned control to the control loop, pending interrupts are injected into the VM.

2. Synchronous Exits
   Vmmcall: The VM explicitly
to the Linux API. The UML binary is mapped into the address spaces of all its applications. UML remains in control by using the ptrace system call trace facility to keep track of system calls, as well as signals to remain in control of faults and exceptions [Dik00]. Being a full-fledged Linux port, it runs the same binary applications as the host [Dik01].

For any UML application a tracing thread is employed that intercepts system calls. This mechanism allows only interception but not canceling of system calls. Therefore the tracing thread nullifies the original system call by issuing the getpid() system call, which results in an unnecessary kernel entry. UML uses signals to force control to the UML kernel during a system call or interrupt. Signal delivery and return are slow and impose a noticeable performance deterioration [ska].

To protect the UML binary from its applications, a jail mode was introduced that remaps the UML binary in a read-only fashion while executing an UML application, remapping it with write permission on context switches. This modus operandi slows down UML context switches considerably and does not protect the UML binary from being read, which can be used by attackers to tell whether or not the attacked system is running UML.

To improve performance and security, a patch was developed that enhances Linux support for UML. With this patch UML can run in a separate address space and is the
emulator. A custom device interface requires custom drivers for the guest.

KVM-L4 does faithful virtualization, and therefore provides a guest with an exact duplicate of a physical machine, including a complement of devices. As described in Section 2.1.2, faithful virtualization requires instruction emulation and device models. Both are implemented in Qemu, which runs as a separate application. KVM-L4 achieves about 92 percent of native performance [PSLW09], which is on par with native KVM.

KVM and Qemu are widely used, and actively developed by a number of skilled people all over the world. Therefore, I assume that the emulator has received careful tuning. The remaining overhead of about 8 percent results from inherent costs for device virtualization: IO intercepts, instruction emulation and updates of device states. Many workloads cause frequent device interaction, and are therefore limited by the performance of device virtualization.

Implementing device virtualization inside the VMM is one possible optimization that spares the overhead incurred with switching to an external component, but the costly IO intercepts and performance penalties caused by emulation cannot be avoided.

Therefore, to achieve better performance than KVM-L4, I decided to go without full hardware backward compatibility. Instead, I will provide the VM with custom device interfaces, thereby avoiding emulation altogether. The device interface will u
run in privileged mode without isolation from one another, a monolithic kernel is not a suitable basis for secure systems. Furthermore, it is difficult to retrofit modern access control schemes that would allow more fine-grained access control, which might render a user's data less prone to attacks, without breaking backward compatibility. Besides, further technology, for example support for real-time applications, is hard to add as well.

Instead we should seek to create an architecture that enables us to build decomposed systems in which applications with a small TCB can be built and access permissions can be controlled at a fine granularity.

2.1.1 Building Decomposed Systems with Encapsulated Components

One proven way of dealing with software complexity is modularization and restricting interaction to well-defined interfaces. However, if all components run in privileged mode, no improvement in security is achieved because there are no means of isolation between them. While we have to trust all code running in privileged mode, the situation is different for unprivileged code. Address spaces provide a boundary that can be used to establish strict isolation between processes [Hei05]. By moving subsystems into user-land processes, we can safely decompose the system in a way that faults and errors can be confined to one subsystem only.

The base system, or kernel, that establishes the required isolation between components must be small, because to make
using map operations on the VM capability.

The VMM allocates a VMCB for the virtual CPU, and initializes it in a way that the guest can be bootstrapped. A world switch is initiated with the l4_vm_run_svm system call on the VM object capability. The VMM passes both the VMCB and the general-purpose registers of the guest, which are not handled by the VMCB, along with the invocation. As described in Section 4.2, Fiasco enforces some of the values of the VMCB for security reasons.

The VM executes in the scheduling context of the thread that executed the l4_vm_run_svm. Therefore, it behaves like the calling thread with respect to scheduling. Host interrupts occurring during VM execution cause a VM exit, which is visible to the VMM as a return from the l4_vm_run_svm system call.

5.1.3 Control Loop

I will now explain how the VMM's control loop works. In an abstract view, a control loop consists of a world switch and subsequent handling of VM exit conditions. Exit conditions can either be asynchronous or synchronous. Asynchronous exits occur for reasons that may be unrelated to the VM state, such as physical interrupts. Synchronous exits happen either to handle VM state transitions or voluntarily by the VM, for example for hypercalls.

In the most basic control loop two exit reasons need to be handled, that is asynchronous exits such as host interrupts and the basic synchronous exit reason of an error. Possible error conditions are for example invalid C
must produce the same result as when run on the physical CPU. Typically, to emulate one instruction, a number of host instructions have to be executed, a fact known as instruction inflation. Instruction fetch requires extra effort, inducing loop overhead that can account for a fair portion of the overall overhead. As such, emulation does not fulfill the efficiency criterion of Popek and Goldberg.

Binary translation is a dynamic translation of guest binary code into host code that takes place at run time. The result is host code that often is a subset of the x86 instruction set, for example user-mode instructions only. Translation is done only when code is to be executed (lazily) and cached to speed up subsequent executions. The cache for translated code has to be invalidated upon writes; thus overhead arises from cache management.

In both emulation and binary translation, host CPU registers have to be used both to keep administrative data of the emulator and register contents of the guest, forcing register pressure that requires additional memory accesses to store and load register content.

Today's hardware is able to do concurrent computations, for example in pipelined execution and simultaneous address translations. In emulation these operations are run serially.

Patching techniques achieve better performance than emulation, but require thorough analysis and replacement strategies that make the technology more complex. In practice
with fewer changes to the standard OS. A port of KVM showed that virtualization on a microkernel-based system is feasible and that standard OSes running in a VM perform better than rehosted OSes. However, this KVM port in turn depends on a rehosted kernel.

In this thesis, I present a solution that uses virtualization to host a slightly modified Linux on the Fiasco microkernel. This solution shows a significant performance improvement compared to previous solutions, supports SMP, is well integrated and has a substantially reduced resource footprint. I also show how the para-virtualized kernel can be used to do faithful virtualization.

Acknowledgements

I would like to thank Professor Hermann Härtig for allowing me to work in the operating systems group. I want to mention the tremendous help of my supervisor Adam, who always had an open ear whenever I asked him for advice. Thanks go to Torsten Frenzel as well, who provided me with ideas that helped me jump-start my implementation. I owe Michael Peter a debt of gratitude, because he provided me with numerous hints for this thesis. I am grateful to Björn Döbel, who wrote the Ankh network multiplexer during this thesis. Thanks go out to the people that proofread the text and helped me cope with the language. These people are Adam Lackorzynski, Torsten Frenzel, Michael Peter and Jean Wolter.

A good working atmosphere is crucial to my productivity; therefore I want to thank the students of the student lab
multiplexing can be done in several ways, for example by providing each VM with its own virtual hard disk that is represented by a file on the physical hard disk. Another solution would be to allow write access to a single VM, and read-only access to multiple VMs. This setup, while simple to implement because the device manager would not need to know about file systems, is rather restrictive for the VMs, but would enable setups where a rich user land is needed, which cannot be loaded using a ramdisk.

Snapshots: Snapshot functionality could be helpful for applications that need failure resilience. I could imagine important web servers running in a VM. The VMM would detect errors, for example caused by attacks, and reset the VM to a known, safe state. Snapshot functionality could be supported with Dirk Vogt's L4ReAnimator, as introduced in his diploma thesis [Vog09].

Migration: VM migration is a tool that is often used in load balancing scenarios. In a data center, where there are multiple machines running the same microkernel-based system, both Karma and its VM could be migrated from crowded nodes to nodes with little load. With the aforementioned snapshot mechanism in place, migration is as simple as copying the snapshot data to the other node and subsequently restoring from the snapshot.

Systems Debugger: The VMM could be extended with a systems debugger. The debugger could be built upon the systems debugger create
    