# an integer value between 0 (errors only) and 4 (full debug)
debug_level = 0

# TCP port the server listens on (default port is 32638)
listen_port = 32638

# absolute path to the directory in which to store the database
database_dir = /var/run/xtreemfs/dir

3.5.2 MRC Configuration File

# an integer value between 0 (errors only) and 4 (full debug)
debug_level = 0

# TCP port the server listens on (default port is 32636)
listen_port = 32636

# absolute path to the directory in which to store the database
database_dir = /var/run/xtreemfs/mrc

# absolute path to the file which is used as the operations log
append_log = /var/run/xtreemfs/mrc.dblog

# interval in seconds between OSD checks
osd_check_interval = 300

# hostname or IP address of the directory service to use
dir_service.host = localhost

# TCP port number of the directory service to use
dir_service.port = 32638

# flag indicating whether POSIX access time stamps are set
# each time the files are read or written
no_atime = true

# interval between two local clock updates (time granularity) in ms;
# should be set to 50
local_clock_renewal = 50

# interval between two remote clock updates in ms;
# should be set to 60000
remote_time_sync = 60000

# defines whether SSL handshakes between clients and the MRC are mandatory;
# if use_ssl = false, no client authentication will take place
use_ssl = false

# file containing server credentials for SSL handshakes
ssl_server_creds = /tmp/server_creds
4.3 OSS Developer's Guide

The prototype is built as a shared library which is linked to the applications and manages all operations regarding shared memory nearly transparently in the background. Applications are linked to the library by passing the flag -loss to the linker.

4.3.1 Requirements

The following requirements must be fulfilled:

- x86 processor (currently only IA-32 mode is supported)
- Linux kernel 2.6 or newer
- GNU C Compiler 4.1.2 or newer
- libreadline5-devel 5.2 or newer
- libglibc-devel 2.6.1 or newer
- libglib2-devel 2.14.1 or newer

4.3.2 Building the Library

At the top level of the OSS directory, run the build script by typing:

    > make

After a successful build, the library resides in the subdirectory build. Before using the shared library, the developer has to register it with the system. In future versions, the Makefile will allow the library to be installed and uninstalled automatically. For the prototype, we prefer to extend the library path without copying the library to the system's library directory. This is done by typing:

    > export LD_LIBRARY_PATH=<path to osslib>:$LD_LIBRARY_PATH

4.3.3 Application Development

When developing applications using OSS, the developer explicitly defines the memory regions for shared objects.
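To make this workflow concrete, here is a minimal sketch of an OSS application. It is illustrative only: the allocation function hm_alloc() and the consistency models CONSISTENCY_BC and CONSISTENCY_SC are taken from this guide's description, while the header name oss.h and the exact signature of hm_alloc() are assumptions made for this example.

    /*
     * Minimal OSS example (illustrative sketch, not the authoritative API).
     * Assumed: header <oss.h>; hm_alloc(size, model) returning a pointer.
     */
    #include <stdio.h>
    #include <string.h>
    #include <oss.h>   /* assumed header of the OSS library */

    int main(void)
    {
        /* Allocate a shared memory region bound to the strict consistency
         * model; CONSISTENCY_BC would select the basic model instead. */
        char *region = hm_alloc(4096, CONSISTENCY_SC);
        if (region == NULL) {
            fprintf(stderr, "hm_alloc failed\n");
            return 1;
        }

        /* Writes to the region become visible on other nodes according
         * to the chosen consistency model. */
        strcpy(region, "hello from this node");
        printf("%s\n", region);
        return 0;
    }

Such a program would be linked against the shared library as described above, e.g., gcc -o example example.c -L<path to osslib> -loss.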
XtreemOS: Enabling Linux for the Grid
Information Society Technologies

Project no. IST-033576

XtreemOS
Integrated Project
BUILDING AND PROMOTING A LINUX-BASED OPERATING SYSTEM TO SUPPORT VIRTUAL ORGANIZATIONS FOR NEXT GENERATION GRIDS

XtreemFS prototype month 18
D3.4.2

Due date of deliverable: 30 NOV 2007
Actual submission date: 30 NOV 2007
Start date of project: June 1, 2006

Type: Deliverable
WP number: WP3.4

Responsible institution: BSC
Editor and editor's address: Toni Cortes, Barcelona Supercomputing Center, Jordi Girona 29, 08034 Barcelona, Spain

Version 1.0 / Last edited by Toni Cortes / 30 OCT 2007

Project co-funded by the European Commission within the Sixth Framework Programme.
Dissemination level: PU (Public)

Revision history:

0.1 / 25/09/07 / Toni Cortes (BSC): Initial document structure and initial contents from work documents
0.2 / 25/10/07 / Jan Stender, Björn Kolbeck (ZIB): Architecture of main components and code
0.3 / 25/10/07 / Matthias Hess (NEC): Architecture of client layer
0.4 / 25/10/07 / Michael Schoettner (UDUS): Architecture of OSS
For each volume held by the local Volume Manager, the Storage Manager provides access to the respective data. It translates requests for metadata modification and retrieval into calls to the database backend.

OSD Status Manager: Regularly polls the Directory Service for the set of OSDs that can be assigned to a newly created file.

Disk Logger Stage: Persistently logs operations that change file system metadata. This allows the backend to work with a volatile in-memory representation of the database while recording all changes in the operations log. When the log becomes too large, a checkpoint of the database is created from the current in-memory representation, and the log is truncated. Thus, the MRC can recover from crashes by restoring the state of the database from the last checkpoint plus the current operations log.

Replication Stage: Replicates individual volumes in the MRC backend and ensures replica consistency. The Replication Stage internally relies on the following components:

Replication: Sends operations to remote MRCs which hold replicas of the corresponding volumes, through the Speedy stage.

ReplicationMechanism: A specific, exchangeable mechanism for replication. It determines how operations are propagated to remote replicas. Currently, only master-slave replication without automatic failover is implemented.

Remarks on Request Processing: Pinky can receive requests one by one or pipelined. With HTTP pipelining, the clients can send more requests while waiting for responses.
In Section 3, we present the procedures needed to download, install, and configure XtreemFS. Section 4 presents the interface applications can use to access the file system, as well as a short developer guide for the OSS component. Finally, Section 5 points out the topics on which we will work in the next few months.

2 Brief Description of the Prototype

In this section, we describe the current prototype of XtreemFS and the OSS. We first outline the functionality that is currently implemented and working. Afterwards, we present a global view of the file system architecture, to finally describe the architectural details of each component.

2.1 Current Functionality

Before starting with the description of the current functionality in XtreemFS, it is important to clarify that the current version only supports parts of the final functionality (e.g., no replica management is available) and that performance has not been a key issue yet. The objective of the current release of XtreemFS was to offer a global file system that could be used transparently by applications. For the OSS part, the first major goal was to provide basic sharing functionality.

As already mentioned, volumes are the highest-level abstraction used to organize multiple files in XtreemFS. A volume contains a set of files organized in a tree structure following POSIX semantics, which can be mounted into a Linux tree structure. Thus, the first set of functionality offered by XtreemFS is to create, delete, mount, and unmount these volumes.
In the following paragraphs, we describe the different components and how they work together.

Client (Access Layer): The XtreemFS access layer represents the interface between user processes and the file system infrastructure. It manages the access to files and directories in XtreemFS for user processes, as well as the access to Grid-specific file system features for users and administrators. It is the client-side part of the distributed file system and as such has to interact with the services of XtreemFS: MRC and OSD. The access layer provides a POSIX interface to the users by using the FUSE [1] framework for file systems in user space. It translates calls from the POSIX API into corresponding interactions with OSDs and MRCs. An XtreemFS volume is simply mounted like a normal file system, i.e., there is no need to modify or recompile application source code in order to run applications with XtreemFS. In addition to the POSIX interface, the access layer will provide tools for creating and deleting XtreemFS volumes, checking file integrity, querying and changing the file striping policies, and other Grid-specific features.

OSD: OSDs are responsible for storing file content. The content of a single file is represented by one or more objects.
File object stage: File object requests are translated into stripe object requests according to the striping policy (or RAID level, in particular).

Stripe object stage: In this stage, the actual communication with the right OSDs takes place. For each stripe, the right OSD is determined, and requests are prepared and submitted to it. The answers from the OSDs are analyzed: an OSD might answer with a failure of the operation, with a request to redirect to another OSD, or with success. If the operation was a success, the client knows that the data has been transferred to the OSD successfully (in a write operation) or that a read operation succeeded.

2.3.5 OSS

Built as an event-driven modular architecture, OSS separates the mechanisms that realize basic consistency from an extensible set of policy-driven consistency models. Furthermore, the service is independent of the underlying transport protocol which is used for inter-node communication. Thus, OSS consists of several modules for memory and page management, communication, and consistency protocols. We describe their responsibilities and dependencies in the following paragraphs; see also Figure 7.

Modules:

Communication module: The communication module manages the entire communication of the nodes. It is used by the heap, page, and consistency modules to exchange data over the network. The module transfers information as messages which are encapsulated in Protocol Data Units (PDUs).
[Figure 2: MRC design. Client requests received by Pinky pass through the Authentication Stage, the Brain Stage (Brain, FileAccessManager, VolumeManager, StorageManager, OSDStatusManager), the Disk Logger Stage, and the Replication Stage (Replication, ReplicationMechanism); responses and replica updates are sent via Speedy.]

Stages: The internal logic of the MRC is implemented in four stages.

Authentication Stage: Checks the authenticity of the user. It performs a validity check of user certificates and extracts the global user and group IDs.

Brain Stage: The core stage of the MRC, which implements all metadata-related file system logic, such as file name parsing, authorization, and operations like stat or unlink. For each volume, it stores the directory tree, the file metadata, and the access control lists. Internally, it is composed of several components:

Brain: Contains the specific logic for each operation. It decomposes the different calls into invocations of the other Brain Stage components.

File Access Manager: Checks whether the user is sufficiently authorized for the requested operation. If not, it returns an error, which results in a corresponding notification to the client.

Volume Manager: Responsible for managing all volumes which have been created on the local MRC.

Storage Manager:
Since the MRC does not guarantee that pipelined requests are processed in the same order as they are sent, different requests should only be added to the same pipeline if they are causally independent. Thus, certain sequences of requests, such as a createVolume with a subsequent createFile on the newly created volume, should not be processed in a pipelined fashion, as the second request could possibly be executed by the MRC prior to the first one, which would lead to an error.

2.3.3 OSD

The OSD has four stages. Incoming requests are first handled by the Parser stage, which parses information sent together with the request, such as the capability or the X-Locations lists. The validity of capabilities and the authenticity of users are checked by the Authentication stage. The Storage stage handles in-memory and persistent storage of objects. The processing of UDP packets related to striping is handled by the UDPCom stage; striping-related RPCs are handled by the Speedy stage. The internal architecture is depicted in Figure 3.

Stages: The main stages of the OSD are described in the following paragraphs.

Parser Stage: Responsible for parsing information included in the request headers. It is implemented as a separate stage to increase performance by using an extra thread for parsing.

Authentication Stage: Checks signed capabilities. To achieve this, the stage keeps a cache of MRC keys and fetches them, if necessary, from an authoritative service (e.g., the Directory Service).
Read requires a special treatment of border cases in which objects do not, or only partially, exist on the local OSD:

    BEGIN read(objId)
        IF object exists locally THEN
            IF object is not full ∧ objId < globalMax THEN
                send object + padding
            ELSE IF object is not full ∧ objId == globalMax THEN
                // not sure if it still is the last object
                IF read past object THEN
                    retrieve localMax from all OSDs OSD1..N
                    globalMax := max(localMax of OSD1..N)
                    IF objId == globalMax THEN
                        send partial object
                    ELSE
                        send padded object
                    END IF
                ELSE
                    send requested object part
                END IF
            ELSE
                // object is full
                send object
            END IF
        ELSE
            // check if it is a "hole" or an EOF
            IF objId > localMax THEN
                IF objId > globalMax THEN
                    // not sure if this OSD missed a broadcast;
                    // coordinated operation for update
                    retrieve localMax from all OSDs OSD1..N
                    globalMax := max(localMax of OSD1..N)
                END IF
                IF objId > globalMax THEN
                    send EOF (empty response)
                ELSE
                    send zero-padding object
                END IF
            ELSE
                // a "hole", i.e., padding object
                send zero-padding object
            END IF
        END IF
    END

2.3.4 Client Layer

The XtreemFS component called client is the mediator level between an application and the file system itself. The focus currently lies on a POSIX-compliant interface in order to support applications that are not specifically written for XtreemFS.
1.0 / 30/10/07 / Toni Cortes and Jan Stender (BSC and ZIB): Final editing
1.1 / 16/10/07 / Toni Cortes (BSC): Updating reviewers' comments
1.2 / 29/11/07 / Florian Mueller (UDUS): Updating reviewer comments related to OSS

Reviewers: Julita Corbalan (BSC) and Adolf Hohl (SAP)

Tasks related to this deliverable:

T3.4.1 / File Access Service / CNR, BSC, ZIB
T3.4.3 / Metadata Lookup Service / ZIB
T3.4.5 / Grid Object Management / UDUS
T3.4.6 / Access Layer for File Data and Grid Objects / NEC, UDUS

This task list may not be equivalent to the list of partners contributing as authors to the deliverable.

Contents

1 Introduction
    1.1 Document Structure
2 Brief Description of the Prototype
    2.1 Current Functionality
    2.2 Main Architecture
    2.3 Architecture of the Main Components
3 Installation and Configuration
    3.1 Checking out XtreemFS
    3.2 Requirements
    3.3 Building XtreemFS
    3.4 Installation
    3.5 Configuration
4 User Guide
    4.1 Mounting the File System
    4.2 Tools
    4.3 OSS Developer's Guide
5 Conclusion and Future Work
    5.1 Limitations of the Prototype
There are three tools that deal with volumes: mkvol, lsvol, and rmvol. Anyone familiar with UNIX command line interfaces can guess their purpose.

mkvol: This tool is used to create a new volume on a given MRC. The syntax of this command is:

    mkvol -a <access policy> -p <striping policy> <vol url>

<access policy>: Specifies the policy by which access to the volume is controlled:
    1 - access is allowed for everyone (no access control at all)
    2 - access is controlled in a POSIX-like fashion

<striping policy>: This parameter has the generic form <name>,<size>,<width>. For instance, the policy string RAID0,32,1 specifies a RAID0 policy across one OSD with a stripe size of 32 kB. Right now, the only supported policy is RAID0.

<vol url>: The location of the volume to be created, given as a URL.

mkvol can be executed on any client and need not be executed on the MRC itself. Permissions to create new volumes are checked for the user who executes this command.

lsvol: In order to list all the volumes that are available on a specific MRC, this tool can be used. Its calling syntax is simple:

    lsvol <mrc url>

This will currently list the volume names and their internal identification. Future versions will provide more fine-grained information, such as available replicas, available space, etc.

rmvol: If a volume is no longer used, this tool can be used to delete it from the MRC permanently. Afterwards, the volume is no longer available, and all data that might have existed on that volume before is lost. The calling syntax is similar to lsvol:

    rmvol <volume url>
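To illustrate the three tools together, a hypothetical session could look as follows (host name, port, and volume name are examples only; the URL format is the same as in Section 3.4):

    > mkvol -a 2 -p RAID0,32,1 http://localhost:32636/MyVolume
    > lsvol http://localhost:32636
    > rmvol http://localhost:32636/MyVolume

This creates a volume with POSIX-like access control, striped with RAID0 across one OSD in 32 kB stripes, then lists the volumes known to the MRC, and finally deletes the volume again.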
References

[1] FUSE Project Web Site. http://fuse.sourceforge.net/
[2] XtreemFS Consortium. D3.4.1: The XtreemOS File System - Requirements and Reference Architecture, 2006.
[3] Michael Factor, Kalman Meth, Dalit Naor, Ohad Rodeh, and Julian Satran. Object Storage: The Future Building Block for Storage Systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, 2005.
[4] The Open Group. The Single UNIX Specification, Version 3.
[5] M. Mesnier, G. Ganger, and E. Riedel. Object-based Storage. IEEE Communications Magazine, 8:84-90, 2003.
[6] Matt Welsh, David Culler, and Eric Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SIGOPS Oper. Syst. Rev., 35(5):230-243, 2001.
XtreemFS also allows striping over different OSDs. The next step therefore is to associate each file object with the corresponding OSD and to transfer the objects to and from that OSD. In future versions of XtreemFS, RAID policies will also be implemented to allow some redundancy of the data. The redundant information, like parity calculations, will also be considered an object that must be stored onto an OSD. Currently, only RAID level 0 is implemented in XtreemFS, so there is no such additional data.

[Figure 5: I/O operations are first divided into file objects and then associated with stripe objects that correspond to the RAID level of a file.]

[Figure 6: Interdependencies of the I/O stages in the client. Read/write operations pass from FUSE through the file object stage and the stripe object stage, which transfers stripes to and from the OSDs.]

I/O requests are handled by different stages, each of which takes care of a specific aspect of the operation. In Figure 6, the control flow for an I/O operation is sketched.

File read/write stage: File requests, like read or write operations, are translated into requests on object level.
    UPON X-NEW-FILESIZE(fileId, fileSize, epoch)
        IF epoch > file[fileId].epoch ∨
           (epoch == file[fileId].epoch ∧ fileSize > file[fileId].size) THEN
            // accept any file size in a later epoch
            // or any larger file size in the current epoch
            file[fileId].epoch := epoch
            file[fileId].size := fileSize
        END IF
    END

The pseudocode for the head OSD:

    UPON truncate(fileId, fileSize, capability)
        file[fileId].epoch := capability.epoch
        truncate_local(fileId, fileSize)
        FOR osd IN OSD1..N DO
            relay truncate(fileId, fileSize, capability.epoch)
        DONE
        return X-NEW-FILESIZE(fileSize, capability.epoch)
    END

The pseudocode for the other OSDs:

    UPON relayed_truncate(fileId, fileSize, capability)
        file[fileId].epoch := capability.epoch
        truncate_local(fileId, fileSize)
    END

The pseudocode for the client:

    BEGIN truncate(fileId)
        capability := MRC.truncate(fileId)
        X-NEW-FILESIZE := headOSD.truncate(fileId, fileSize, capability)
        MRC.updateFilesize(X-NEW-FILESIZE)
    END

Delete is performed in a fully synchronous fashion. The head OSD relays the request to all other OSDs. All OSDs will either delete the file (and all on-disk objects) immediately, or mark the file for an "on close" deletion.

Write has to consider cases in which the file size is changed:

    BEGIN write(objId)
        write object locally
        IF objId > globalMax THEN
            udpBroadcast(objId) as new globalMax to all OSDs
            send X-NEW-FILESIZE to client
        ELSE IF objId == globalMax THEN
            // object is extended
            send X-NEW-FILESIZE to client
        END IF
    END
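As a brief illustration of the write protocol (the numbers are invented for this example): assume a striped file whose largest object known anywhere is globalMax = 5. If an OSD now receives a write for object 7, it stores the object locally, broadcasts 7 as the new globalMax to all other OSDs, and sends X-NEW-FILESIZE to the client. A write to object 5 merely extends the last object and only triggers the X-NEW-FILESIZE response, while a write to object 3 changes neither the file size nor globalMax and needs no extra communication.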
Storage Stage: Responsible for storing and managing objects. For the sake of performance, the Storage Stage relies on caching. The Cache is used for fast access to objects which are still in memory. If an object is not available in memory, the Cache will instruct the PersistentStorage component to load the object from disk. Both Cache and PersistentStorage also take care of writing objects to disk and of managing the different versions of an object to ensure copy-on-write semantics.

[Figure 3: OSD design. Client requests received by Pinky pass through the Parser, Authentication, and Storage stages; the Storage stage comprises the Cache, PersistentStorage, and StorageLayout components, with Striping and UDPCom handling striped access.]

As part of the PersistentStorage component, the StorageLayout component is used to map objects to the underlying storage devices. For efficient on-disk data access, the current implementation of the storage layout relies on an arrangement of objects in the OSD's local file system. Rather than storing all objects in a single file, we exploit the local file system's capability to organize files in a directory hierarchy.
If you want to work manually on the mounted directory, you have to use a different console.

Manual XtreemFS Setup: As an alternative to setting up XtreemFS in one step, the different services can also be set up manually. For this purpose, use bin/xtreemfs_start. Note that you have to set up a Directory Service before setting up (at least one) MRC and (at least one) OSD. See bin/xtreemfs_start --help for usage details.

Example:

    > bin/xtreemfs_start ds -c config/dirconfig.properties
    > bin/xtreemfs_start mrc -c config/mrcconfig.properties
    > bin/xtreemfs_start osd -c config/osdconfig.properties

Once a Directory Service and at least one OSD and MRC are running, XtreemFS is operational.

XtreemFS relies on the concept of volumes. A volume can be mounted to a mount point in the local file system. In order to create a new volume, execute bin/mkvol. See Section 4.2.1 for usage details.

Example:

    > bin/mkvol http://localhost:32636/MyVolume

After having created a volume, you can mount it by executing AL/src/xtreemfs. See Section 4.1 for usage details.

Example:

    > bin/xtreemfs -o volume_url=http://localhost:32636/MyVolume,direct_io xtreemfs_mounted

3.5 Configuration

Sample configuration files are included in the distribution in the config directory. Configuration files use a simple key = value format.

3.5.1 Directory Service Configuration File
# file containing trusted certificates for SSL handshakes
ssl_trusted_certs = /tmp/trusted_certs

3.5.3 OSD Configuration File

# an integer value between 0 (errors only) and 4 (full debug)
debug_level = 0

# TCP port the server listens on (default port is 32640)
listen_port = 32640

# absolute path to the directory in which to store objects
object_dir = /var/run/xtreemfs/osd/objects

# hostname or IP address of the directory service to use
dir_service.host = localhost

# TCP port number of the directory service to use
dir_service.port = 32638

# interval between two local clock updates (time granularity) in ms;
# should be set to 50
local_clock_renewal = 50

# interval between two remote clock updates in ms;
# should be set to 60000
remote_time_sync = 60000

4 User Guide

In this section, we give a brief outline of how to use the Xtreem File System. As stated earlier, applications mainly use it through the normal POSIX file API, which is described in [4]. We therefore focus on some aspects that are not related to this API.

4.1 Mounting the File System

The file system itself is a user-space implementation based on FUSE. File systems of this kind can be mounted with one call and several standard FUSE options. This call starts the user-space part of the file system. For the sake of brevity, we focus in this section on the relevant and additional options.

XtreemFS is mounted by the call:

    xtreemfs -o <xtreemfs opts>,direct_io <fuse opts> <mount point>
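For example, mounting the volume created in Section 3.4 could look like this (the mount point is chosen arbitrarily):

    > xtreemfs -o volume_url=http://localhost:32636/MyVolume,direct_io /mnt/xtreemfs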
The different processes running on the client side of XtreemFS access files normally, via the regular Linux interface. The VFS redirects the operations related to XtreemFS volumes to the Access Layer via FUSE. In turn, the Access Layer interacts with the XtreemFS services to carry out the requested file system operations. Metadata-related requests, such as opening a file, are redirected to the MRC. File-content-related requests, such as reading a file, are redirected to the OSDs on which the corresponding objects containing the data reside. The OSS is not included in the picture because in the current prototype it is a standalone service.

[Figure 1: Relationship between XtreemFS components. User processes access files through the Linux VFS and FUSE; the Access Layer performs metadata operations (e.g., open, rename), object location lookups, and authorization against the Metadata Server (MRC), and parallel reads/writes of file contents against the OSDs.]

Regarding the communication between all these components, we have decided to use the HTTP protocol, as it is well established and fully suits the communication needs of our components.

2.3 Architecture of the Main Components

This section provides a detailed description of the architecture of the three components that currently constitute XtreemFS, as well as of the OSS component.
OpenFileTable: This table contains a list of currently opened files, the time of last access, and a flag indicating whether they will be deleted on close. Keeping track of the open state of files is necessary for POSIX compliance, since files opened by a process need to remain accessible until they are closed, even if they are concurrently deleted by a different process.

In-memory cache: The cache stage has an in-memory cache of objects to increase performance for read operations. Accesses to objects are first attempted to be served from the cache. In case of a cache miss, the PersistentStorage component retrieves the corresponding object from disk.

Protocols for Striping: We defined a set of protocols for reading, writing, truncating, and deleting striped files. The protocols are optimized in such a way that frequently occurring operations, like read and write, can normally be handled fast, whereas truncate and delete have to be coordinated among all OSDs holding stripes.

For each file, each OSD persistently keeps the largest object number of the objects stored locally (localMax). In addition, each OSD keeps the largest global object number it currently knows (globalMax). globalMax is part of the "open" state and does not need to be stored persistently.

Truncate involves the MRC, a client, the head OSD for the file (OSD0), and all other OSDs (OSD1..N).

The pseudocode for the MRC:

    UPON truncate(fileId)
        file[fileId].issuedEpoch++
        return capability TRUNCATE(fileId, file[fileId].issuedEpoch)
    END
The most important feature of XtreemFS is the global view it offers to applications. XtreemFS allows any application to access files that are physically located anywhere in the world in a transparent way. Files and directory trees in mounted volumes can be accessed by applications in a POSIX-compliant fashion, using traditional Linux system calls.

In order to improve the access time to data, XtreemFS implements striping of files across multiple OSDs. Although in the future this striping will be defined on a per-file-replica basis, our current implementation only allows striping policies to be defined at volume level, i.e., all files in a volume have the same striping policy.

Finally, security issues have been taken into account by supporting SSL-secured communication channels.

As regards key functionality of XtreemFS which has not yet been implemented, we should mention that no replication and fault tolerance is yet available in the services.

The OSS component currently supports basic sharing and memory management functionality, including basic lock and replica management. Furthermore, the system call mmap is intercepted to later support transparent memory-mapped files for XtreemFS. Next development steps include speculative transactions, scalable communication, and support for memory-mapped files.

2.2 Main Architecture

The main components involved in our architecture are the Metadata and Replica Catalog (MRC), the Object Storage Device (OSD), the client (Access Layer), the Replica Management Service (RMS), and the Object Sharing Service (OSS).
The option direct_io is necessary for proper operation when multiple clients access the same file. Otherwise, data corruption may occur.

The available mount options are listed in Table 1:

    -o volume_url=<volume url>   Specify the URL of the volume that is to be
                                 mounted. This URL is composed of the MRC URL
                                 and the volume on that MRC, i.e., the volume
                                 URL http://demo.mrc.tld/vol1 would specify
                                 the volume vol1 on the MRC demo.mrc.tld.
    -o logfile=<log file>        Write logs to the specified file.
    -o debug=<dbg lvl>           Set the debugging level.
    -o logging=<enabled>         Allow tracing of program execution.
    -o mem_trace=<enabled>       Trace memory allocation and usage for
                                 debugging purposes.
    -o ...                       Allow monitoring of the client. The option
                                 is available but has no effect right now.
    -o ...                       Use the specified certificate to identify
                                 the client host. The option is available but
                                 has no effect right now.
    -o stack_size=<size>         Set the stack size for each thread of the
                                 client.

    Table 1: Available mount options of the XtreemFS client.

Because XtreemFS is a distributed file system, the file system can be mounted on several clients. These clients can access the same volume; the volume is uniquely identified by its volume URL.

4.2 Tools

XtreemFS has the notion of volumes, which is not covered by a POSIX-like standard. The additional tools are therefore mainly for volume handling.

4.2.1 Volumes
This GTM uses speculative transactions, each bundling a set of write operations, thus reducing the synchronization frequency. Further transaction-based optimizations to hide network latency and to reduce the number of nodes involved during a transaction commit have been designed. The implementation of transactions is one of the next steps in the development roadmap for OSS. Furthermore, within the next months we will align OSS with the XtreemFS client to support memory-mapped files (see also Section 5.1).

5.1 Limitations of the Prototype

FUSE does not support mmap in connection with direct I/O. In order to get applications that rely on mmap running on XtreemFS, volumes currently have to be mounted without using the FUSE option -o direct_io. However, this might lead to inconsistencies if different clients concurrently work on the same file, as requests might be serviced from the local page cache.

5.2 XtreemFS and kDFS

Currently, within the XtreemOS project, two file systems are being developed with very different objectives. On the one hand, we have XtreemFS, presented here, which aims at giving a global Grid view of files from any node in the Grid. On the other hand, kDFS's objective is to build a cluster file system for all the nodes in a cluster running LinuxSSI, and its main objective is performance.

In the future, we plan to allow files from kDFS to be accessed via XtreemFS, and to allow full performance optimization if the files in kDFS are accessed from the nodes in the same cluster as the file system.
PDUs are sent over the network via the underlying transport protocol. Received messages are enqueued into a FIFO message queue. However, the queue allows reordering of PDUs. This is sometimes necessary if an affected memory page is locked locally by the application but the message handler must remain responsive. The current implementation uses TCP; it will be replaced by a UDP-based overlay multicast implementation.

Page module: The page module manages the exchange and invalidation of memory pages. A process is able to request memory pages from other nodes or, conversely, to update memory pages at any other node with its own page content. The process can also invalidate replicas of memory pages in the Grid. Before serving external requests, the module generates an event to the appropriate consistency module to check whether the consistency constraints are fulfilled. Compression of pages and/or the exchange of differences are planned.

[Figure 7: OSS architecture. Legacy applications (via the POSIX support library) and transaction-based applications sit on top of OSS, whose consistency, heap, and page modules build on overlay network management and message exchange over the transport layer and the network media.]

Heap module: The heap module manages shared objects on behalf of the applications. It exports several functions for the allocation and deallocation of memory, analogous to the standard C library functions (e.g., malloc or mmap).
2.3.1 Common Infrastructure for MRC and OSD

Both the MRC and the OSD rely on an event-driven architecture with stages, following the SEDA [6] approach. Stages are completely decoupled and communicate only via events which are sent to their queues. They do not share data structures and therefore do not require synchronization.

Each stage processes a specific part of a request (e.g., user authentication or disk access) and passes the request on when finished. This way, requests are split into smaller units of work handled subsequently by different stages. Depending on its type (e.g., read object or write object), a request has to pass only a subset of the stages, in a request-type-dependent order. This is modeled by a RequestHandler component, which encapsulates the workflow for all request types in a single class. It implements the order in which requests are sent to the individual stages.

For communicating with other servers and with clients, the MRC and OSD both use an HTTP server and client. Each is implemented as a separate stage.

Pinky is a single-threaded HTTP server. It uses non-blocking I/O and can handle up to several thousand concurrent client requests with a single thread. It supports pipelining to allow maximum TCP throughput and is equipped with mechanisms to throttle clients in order to gracefully degrade client throughput in case of high loads. Incoming HTTP requests are parsed by Pinky and passed on to the RequestHandler.
Our current implementation instructs the dynamic linker of the GNU/Linux system at application load time to additionally load the legacy support library for the Object Sharing Service. The library exports a subset of the C library interface used by virtually all applications under GNU/Linux. Therefore, the application's memory allocation functions are linked against the legacy support library, which implements these functions by means of the Object Sharing Service. The Linux kernel need not be modified in order to support legacy applications.

3 Installation and Configuration

3.1 Checking out XtreemFS

In order to obtain XtreemFS, execute the command:

    > svn checkout svn+ssh://<user>@scm.gforge.inria.fr/svn/xtreemos/WP3.4/branches/internal_release_2

where <user> represents your INRIA SVN user name.

XtreemFS Directory Structure: The directory tree obtained from the checkout is structured as follows:

    AL      contains the access layer (client) source code
    bin     contains shell scripts needed to work with XtreemFS (e.g., to
            start the XtreemFS services, create volumes, or mount an
            XtreemFS directory)
    config  contains default configuration files for all XtreemFS services
    docs    contains XtreemFS documentation files
    java    contains the Java source code for all XtreemFS services
    OSS     contains the source code for the Object Sharing Service (OSS)

3.2 Requirements
    5.2 XtreemFS and kDFS

Executive Summary

This document presents the development state of XtreemFS, the XtreemOS Grid file system, as well as the Object Sharing Service (OSS), as of month 18 of the project. Even though many more features are currently in development or planned, we have decided to mention only those parts of our implementation that are fully functional.

We give an overview of the functionality currently supported in our prototype, as well as the main system architecture. The XtreemFS architecture is based on three services: Object Storage Devices (OSDs) for storing file content, Metadata and Replica Catalogs (MRCs) for maintaining file system metadata, and the access layer, which allows user processes to access the file system. The OSS is an additional service that provides transaction-based sharing of volatile memory and will support memory-mapped files for XtreemFS in the near future.

The staged architecture of the different services, including their stages and components, is described in detail, together with the most important algorithms used internally. Finally, a user manual with instructions to install and use the file system, as well as a short developer guide for the OSS, are provided.

1 Introduction

XtreemFS [2] is an object-based file system that has been specifically designed for Grid environments as a part of the XtreemOS operating system.
Every allocated memory block is linked with a consistency model. A hierarchical, scalable 64-bit memory management is under development.

Consistency modules: A consistency module implements the rules of a consistency model that defines when write accesses become visible to other nodes. The current prototype offers strong consistency. This model guarantees that all processes sharing an object see write accesses immediately. Obviously, this model will not scale well, and it depends heavily on the programmer to carefully allocate data so as to avoid page thrashing. Nevertheless, there are legacy programs that require such a model and that are, fortunately, designed to minimize conflicts.

The basic consistency module implements basic mechanisms, e.g., lock management, but does not define a concrete model. The goal is to provide basic operations that can be reused to speed up the development of future consistency models. Furthermore, this module routes memory access events to the appropriate consistency modules.

Currently, speculative transactions are being developed, providing a sound basis for scalable and efficient data sharing.

POSIX support library: This is the module providing the interception facility to hook application calls to, e.g., mmap (to support memory-mapped files) and to other memory-relevant functions (e.g., malloc) for legacy POSIX applications.
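To illustrate the general interception technique (this is a minimal sketch of the mechanism on GNU/Linux, not the actual OSS legacy support library): a shared library loaded in front of the C library can override a function such as mmap and delegate to the original implementation.

    /*
     * Minimal sketch of load-time call interception on GNU/Linux.
     * Illustrative only; the real OSS legacy support library differs.
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/mman.h>

    typedef void *(*mmap_fn)(void *, size_t, int, int, int, off_t);
    static mmap_fn real_mmap;

    void *mmap(void *addr, size_t length, int prot, int flags,
               int fd, off_t offset)
    {
        /* Look up the C library's mmap the first time we are called. */
        if (real_mmap == NULL)
            real_mmap = (mmap_fn)dlsym(RTLD_NEXT, "mmap");

        /* A sharing service could redirect file-backed mappings to its
         * own shared-object management here; this sketch just delegates. */
        return real_mmap(addr, length, prot, flags, fd, offset);
    }

Compiled with gcc -shared -fPIC -o libintercept.so intercept.c -ldl, such a library can be placed in front of the C library with LD_PRELOAD=./libintercept.so <application>.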
Speedy, a single-threaded HTTP client, is Pinky's counterpart. It can be used to handle multiple connections to different servers and is able to use pipelining. Speedy automatically takes care of closing unused connections, of reconnecting after timeouts, and of periodic connection attempts if servers are unavailable. It is optimized to reach maximum performance when communicating with other services that use Pinky.

Request Pipelining: Pinky can receive requests one by one or pipelined. When sending requests one by one, the client waits for the response before sending the next request. With HTTP pipelining, the clients can send more requests while waiting for responses. The one-by-one approach has significant performance problems, especially when the latency between client and server is high.

2.3.2 MRC

Figure 2 shows the request flow in the MRC. The stages, including their components and functionality, are described in more detail in the following sections.
In accordance with its object-based design [3, 5], it is composed of two different services: OSDs (Object Storage Devices), which store file content, and MRCs (Metadata and Replica Catalogs), which store file system metadata. A client module provides the access layer to the file system. It performs file system operations on behalf of users by communicating with the file system services. The services are complemented by the Object Sharing Service (OSS), an additional component that provides transaction-based sharing of volatile memory objects and will support memory-mapped files for XtreemFS.

In a Grid, XtreemFS offers a global view of files. Files and directory trees are arranged into volumes. A volume can be mounted at any Grid node, where any sufficiently authorized job can access and modify its files. In order to improve performance and efficiency, XtreemFS also offers striping of files across different OSDs. Moreover, replication of files will be supported in later XtreemFS releases, which will provide data safety and increase access performance.

In this deliverable, we present the main characteristics of the current prototype of XtreemFS.

1.1 Document Structure

This document is structured as follows. In Section 2, we present a description of the XtreemFS prototype and the OSS. We give an overview of the currently available functionality and the main architecture of these components.
A call to the function hm_alloc() returns a new shared memory region which is bound to a certain consistency model. Afterwards, applications on other nodes can access these memory regions. As a parameter to hm_alloc(), the developer can choose between the following two models: the basic (CONSISTENCY_BC) and the strict (CONSISTENCY_SC) consistency model. On every access to a shared object, OSS checks the consistency constraints.

Nevertheless, the application can also deal with shared objects manually by using the low-level request, update, and invalidate functions.

5 Conclusion and Future Work

In this deliverable, we have presented the architecture and the functionality available at month 18 for both XtreemFS and OSS. Checking the current prototype against the list of requirements in D4.2.3, around 35% of the requirements have already been fulfilled.

Regarding future work on XtreemFS, we plan to include replica management within the next few months. After replicas are implemented, the file system will be tuned to improve its performance, and advanced functionality will start to be developed.

The prototype of the OSS component is able to share objects across multiple nodes in the Grid using strong consistency. The event-driven architecture is designed to support different consistency models. Obviously, strong consistency models will not scale well. One of the major goals of OSS is to implement a Grid Transactional Memory (GTM).
A file is represented by a physical directory, and an object is represented by a physical file. Directories representing single files, in turn, are arranged in a directory hierarchy by means of hash prefixes of their file IDs. Such an arrangement ensures that single objects can be retrieved efficiently, as the number of files contained in a single directory does not become too large.

The Striping component implements the logic needed to deal with striped files. Striping in a distributed file system has to take some special considerations into account. When a file is read or written, the system must compute the OSD on which the corresponding stripe resides. Furthermore, I/O operations like truncate are more complex: decreasing or increasing the size of a file may require communication with remote OSDs (i.e., the ones where the affected stripes reside), since the difference between the original and the new size can span several stripes. Moreover, when a file has to be deleted, the system has to contact each OSD holding a stripe of the file. A detailed description of the striping protocols can be found at the end of this section.

Important Data Structures: Data structures should not be shared among stages. Information should be attached to the requests instead; reference confinement must be guaranteed.
With the aim of increasing the read/write throughput, OSDs and clients support striping. Striping of a file can be performed by distributing the corresponding objects across several OSDs, in order to enable a client to read or write the file content in parallel. Future OSD implementations will support replication of files, for the purpose of improving availability and fault tolerance or reducing latency.

MRC: MRCs constitute our metadata service. For availability and performance reasons, there may be multiple MRCs running on different hosts. Each MRC has a local database in which it stores the file system metadata it accounts for. The MRC offers an interface for metadata access. It provides functionality such as creating a new file, retrieving information about a file, or renaming a file. (The RMS will not appear in this deliverable because its implementation will not start until month 18 of the project.)

OSS: The OSS enables applications to share objects between nodes. In the context of OSS, the notion of an object means a volatile memory region. The service resides in user space, co-located with the applications, and manages object exchange and synchronization almost transparently. Depending on the access pattern of an application, and for efficiency reasons, OSS supports different consistency models. Furthermore, OSS intercepts mmap calls to support memory-mapped files for XtreemFS in the near future.

To get a more general idea of how the different components work together, we can look at Figure 1.
For building and running XtreemFS, some third-party modules are required which are not included in the XtreemFS release:

- gmake 3.8.1
- gcc 4.1.2
- Java Development Kit 1.6
- Apache Ant 1.6.5
- FUSE 2.6
- libxml2-dev 2.6.26
- openssl-dev 0.9.8

Before building XtreemFS, make sure that JAVA_HOME and ANT_HOME are set. JAVA_HOME has to point to a JDK 1.6 installation, and ANT_HOME has to point to an Ant 1.6.5 installation.

3.3 Building XtreemFS

Go to the top-level directory and execute:

    > make

3.4 Installation

Loading the FUSE Module: Before running XtreemFS, please make sure that the FUSE module has been added to the kernel. To ensure this, execute the following statement as root:

    > modprobe fuse

Automatic XtreemFS Setup: The fastest way to completely set up XtreemFS on the local machine is to simply execute:

    > bin/basicAL_tests

It takes about 15 seconds to set up a running system consisting of a Directory Service, an OSD, and an MRC. The shell script creates a temporary directory in which all kinds of data and log output are stored. A newly created volume called x1 is automatically mounted to a subdirectory of the temporary XtreemFS directory (see the console output for further details).

As long as the prompt '>' appears, the system is ready for use. In order to test the basic functionality of XtreemFS, you can enter:

    > test

Note that any other command will shut down all XtreemFS services and unmount the XtreemFS directory.
The client layer is implemented as a file system based on FUSE [1].

Stages: The client is built up from different stages. Each stage employs one or more threads to handle specific requests. One stage can generate multiple other requests for the following stages, so-called child requests. The request-initiating thread can go on with other work or wait until the request is finished. In any case, the request itself is handled by a different thread. Once the work is finished, the work-handling thread calls a callback function that finalizes the request and eventually wakes up any thread that is waiting for the request to be finished. A simplified sequence diagram of handling a request is presented in Figure 4.

[Figure 4: Sequence diagram for the client's stages. The initiator hands a request to a stage's handler thread, sleeps or does other work, and is woken up by a callback when the request completes.]

In Figure 5, we present an overview of how an I/O operation is divided into operations on file objects and ultimately into operations on stripe objects. An application that uses XtreemFS and interacts with it via the FUSE interface does I/O operations based on bytes. Each request, read or write, specifies an offset (in bytes) and a number of bytes starting from that offset. As XtreemFS is an object-based file system, these requests must first be translated into requests that are based on file objects.
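As a toy illustration of this translation (a simplification under an assumed round-robin RAID0 layout, not the exact XtreemFS algorithm): with a striping policy such as RAID0,32,4, a byte offset can be mapped to a file object number, the OSD holding it, and the offset within that object.

    #include <stdio.h>

    /* Simplified sketch: map a byte offset to a file object and an OSD
     * under an assumed round-robin RAID0 layout ("RAID0,32,4":
     * 32 kB stripes, 4 OSDs). The real client may differ in detail. */
    int main(void)
    {
        const long stripe_size = 32 * 1024;  /* stripe size in bytes */
        const int  width       = 4;          /* number of OSDs */
        const long offset      = 100000;     /* byte offset of a request */

        long object = offset / stripe_size;  /* file object number */
        int  osd    = (int)(object % width); /* OSD holding that object */
        long local  = offset % stripe_size;  /* offset within the object */

        printf("byte offset %ld -> object %ld on OSD %d (local offset %ld)\n",
               offset, object, osd, local);
        return 0;
    }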
    