         Contents
Figure 2-1  Example PFS/SCFS Configuration ............ 2-6
Figure 2-2  HP AlphaServer SC Storage Configuration ............ 2-9
Figure 3-1  Parallel File System ............ 3-2

List of Tables

Table 0-1  Abbreviations ............ xiii
Table 0-2  Documentation Conventions ............ xviii
Table 1-1  Node and Member Numbering in an HP AlphaServer SC System ............ 1-2
Table 4-1  SCFS Mount Status Values ............ 4-4

Preface

Purpose of this Guide

This document describes best practices for administering I/O on an AlphaServer SC system from the Hewlett-Packard Company (HP).

Intended Audience

This document is for those who maintain HP AlphaServer SC systems. Some sections will be helpful to end users; other sections contain information for application engineers, system administrators, system architects, and site directors who may be concerned about I/O on an AlphaServer SC system.

Instructions in this document assume that you are an experienced UNIX administrator who can configure and maintain hardware, operating systems, and networks.

New and Changed Features

This is a new manual, so all sections are new.

Structure of This Guide

This document is organized as follows:

Chapter 1  hp AlphaServer SC System Overview
Chapter 2  Overview of File Systems
requires the use of UBC. Executable binaries are normally mmap()'d by the loader. The exclusion of executable files from the default mode of operation allows binary executables to be used in an SCFS FAST file system.

2.2.2 Getting the Most Out of SCFS

SCFS is designed to deliver high-bandwidth transfers for applications performing large serial I/O. Disk transfers are performed by a kernel subsystem on the server node using the HP AlphaServer SC Interconnect kernel-to-kernel message transport. Data is transferred directly from the client process's user-space buffer to the server thread without intervening copies.

The HP AlphaServer SC Interconnect reaches its optimum bandwidth at message sizes of 64KB and above. Because of this, optimal SCFS performance will be attained by applications performing transfers that are in excess of this figure. An application performing a single 8MB write is just as efficient as an application performing eight 1MB writes or sixty-four 128KB writes; in fact, a single 8MB write is slightly more efficient, due to the decreased number of system calls.

Because the SCFS system overlaps HP AlphaServer SC Interconnect transfers with storage transfers, optimal user performance will be seen at user transfer sizes of 128KB or greater. Double buffering occurs when a chunk of data (I/O block, default 128KB) is transferred and is then written to disk while the next 128KB is being
/usr and /var directories of the CFS domain AdvFS file system
• One disk to be used for generic boot partitions when adding new cluster members
• One disk to be used as a backup during upgrades

Note: Do not configure a quorum disk in HP AlphaServer SC Version 2.5.

The remaining storage capacity of the external storage subsystem can be configured for user data storage and may be served by any connected node.

System storage must be configured in multiple-bus failover mode.

See Chapter 3 of the HP AlphaServer SC Installation Guide for more information on how to configure the external system storage.

2.5.2.2 Data Storage

Data storage is optional and can be served by Node 0, Node 1, and any other nodes that are connected to external storage, as necessary.

See Chapter 3 of the HP AlphaServer SC Installation Guide for more information on how to configure the external data storage.

3  Managing the Parallel File System (PFS)

This chapter describes the administrative tasks associated with the Parallel File System (PFS).

The information in this chapter is structured as follows:
• PFS Overview (see Section 3.1 on page 3-2)
• Planning a PFS File System to Maximize Performance (see Section 3.2 on page 3-4)
• Using a PFS File System (see Section 3.3 on page 3-6)

3.1 PFS Overview

A parallel file system (PFS) allows a number
!= 0) {
        fprintf(stderr, "Error setting the pfs map data\n");
        exit(1);
    }
    exit(0);
}

Example 6-2 and Example 6-3 describe code samples for the getfd(3) function call.

Example 6-2 Code Samples for the getfd Function Call

      IMPLICIT NONE
      CHARACTER*256 FILEN
      INTEGER ISTAT

      FILEN = 'testfile'
      OPEN (FILE=FILEN, FORM='UNFORMATTED', IOSTAT=ISTAT,
     &      STATUS='UNKNOWN', UNIT=9)
      IF (ISTAT .NE. 0) THEN
         WRITE (*,155) FILEN
         STOP
      ENDIF
      CALL SETMYWIDTH(9, 1, ISTAT)
C     This will truncate the file and set pfs width to 1
      IF (ISTAT .NE. 0) THEN
         WRITE (*,156) FILEN
         STOP
      ENDIF
155   FORMAT (' Unable to OPEN file ', A)
156   FORMAT (' Unable to set pfs width on file ', A)
      END

Example 6-3 Code Samples for the getfd Function Call

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <inttypes.h>
#include <sys/ioctl.h>
#include <sys/fs/pfs/common.h>
#include <sys/fs/pfs/map.h>

int getfd_(int *logical_unit_number);

void setmywidth_(int *logical_unit_number, int *width, int *error)
{
    pfsmap_t map;
    int fd;
    int status;

    /* Translate the FORTRAN logical unit into a C file descriptor. */
    fd = getfd_(logical_unit_number);

    /* Read the current layout map for the file. */
    status = ioctl(fd, PFSIO_GETMAP, &map);
    if (status != 0) {
        *error = status;
        return;
    }

    /* Set the stripe count (pfs width) and write the map back. */
    map.pfsmap_slice.ps_count = *width;
    status = ioctl(fd, PFSIO_SETMAP, &map);
    if (status != 0) {
        *error = status;
        return;
    }
    *error = 0;
}
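As a usage note for the two examples above, the C routine is normally compiled separately and linked into the FORTRAN program. The file names and compiler commands below are illustrative assumptions (they assume the Compaq C and Fortran compilers on Tru64 UNIX, and that the code is saved as setmywidth.c and example.f); the trailing underscore on getfd_ and setmywidth_ matches the default FORTRAN external-name convention on Tru64, which is why the FORTRAN code can simply CALL SETMYWIDTH.

    % cc -c setmywidth.c
    % f90 example.f setmywidth.o -o example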
• PFS (see Section 2.3 on page 2-4)
• Preferred File Server Nodes and Failover (see Section 2.4 on page 2-8)
• Storage Overview (see Section 2.5 on page 2-8)

2.1 Introduction

This section provides an overview of the HP AlphaServer SC Version 2.5 storage and file system capabilities. Subsequent sections provide more detail on administering the specific components.

The HP AlphaServer SC system is comprised of multiple Cluster File System (CFS) domains. There are two types of CFS domains: File-Serving (FS) domains and Compute-Serving (CS) domains. HP AlphaServer SC Version 2.5 supports a maximum of four FS domains.

The nodes in the FS domains serve their file systems, via an HP AlphaServer SC high-speed proprietary protocol (SCFS), to the other domains. File system management utilities ensure that the served file systems are mounted at the same point in the name space on all domains.

The result is a data file system (or systems) that is globally visible and performs at high speed. PFS uses the SCFS component file systems to aggregate the performance of multiple file servers, so that users can have access to a single file system with a bandwidth and throughput capability that is greater than a single file server.

2.2 SCFS

With SCFS, a number of nodes in up to four CFS domains are designated as file servers, and these CFS domains are referred to as FS domains. The file server
mounted stale: The SCFS file system is mounted, but the FS domain that serves the file system is no longer serving it. Generally, this is because the FS domain has been rebooted; for a period of time, the CS domain sees mounted stale until the FS domain has finished mounting the AdvFS file systems underlying the SCFS file system. The mounted stale status only applies to CS domains.

mount not served: The SCFS file system was mounted, but all nodes of the FS domain that can serve the underlying AdvFS domain have left the domain.

mount failed: An attempt was made to mount the file system on the domain, but the mount command failed. When a mount fails, the reason for the failure is reported as an event of class scfs and type mount failed. See the HP AlphaServer SC Administration Guide for details on how to access this event type.

mount noresponse: The file system is mounted; however, the FS domain is not responding to client requests. Usually, this is because the FS domain is shut down.

mounted io err: The file system is mounted, but when you attempt to access it, programs get an I/O error. This can happen on a CS domain when the file system is in the mount not served state on the FS domain.

unknown: Usually, this indicates that the FS domain or CS domain is shut down. However, a failure of an FS or CS domain to respond can also cause this state.

The attributes of SCFS file systems can be viewed using the scfsmgr show command.

4.3 Tuning SCFS

The information in this section is organized as follows:
• Tuning SCF
3.1.2 Storage Capacity of a PFS File System

The storage capacity of a PFS file system is primarily dependent on the capacity of the component file systems, but also depends on how the individual files are laid out across the component file systems.

For a particular file, the maximum storage capacity available within the PFS file system can be calculated by multiplying the stripe count (that is, the number of file systems it is striped across) by the actual storage capacity of the smallest of these component file systems.

Note: The PFS file system stores directory mapping information on the first (root) component file system. The PFS file system uses this mapping information to resolve files to their component data file system block. Because of the minor overhead associated with this mapping information, the actual capacity of the PFS file system will be slightly reduced, unless the root component file system is larger than the other component file systems.

For example, a PFS file system consists of four component file systems (A, B, C, and D) with actual capacities of 3GB, 1GB, 3GB, and 4GB respectively. If a file is striped across all four file systems, then the maximum capacity of the PFS for this file is 4GB; that is, 1GB (Minimum Capacity) x 4 (File Systems). However, if a file is only striped across component file systems C and D, then
[Figure 2-2 HP AlphaServer SC Storage Configuration: a diagram showing RAID controllers attached via Fibre Channel, with Node X and Node Y and their local internal storage.]

2.5.1 Local or Internal Storage

Local or internal storage is provided by disks that are internal to the node cabinet and not RAID-based. Local storage is not highly available. Local disks are intended to store volatile data, not permanent data.

Local storage improves performance by storing copies of node-specific temporary files (for example, swap and core) and frequently used files (for example, the operating system kernel) on locally attached disks.

The SRA utility can automatically regenerate a copy of the operating system and other node-specific files, in the case of disk failure.

Each node requires at least two local disks. The first node of each CFS domain requires a third local disk to hold the base Tru64 UNIX operating system.

The first disk (primary boot disk) on each node is used to hold the following:
• The node's boot partition
• Swap space
• tmp and local partitions, mounted on /tmp and /local respectively
• cnx h partition

The second disk (alternate boot disk or backup boot disk) on each node is just a copy of the first disk. In the case of primary disk failure, the system can boot the alternate disk. For more
• Creating PFS Files (see Section 3.3.1 on page 3-6)
• Optimizing a PFS File System (see Section 3.3.2 on page 3-7)
• PFS Ioctl Calls (see Section 3.3.3 on page 3-9)

3.3.1 Creating PFS Files

When a user creates a file, it inherits the default layout characteristics for that PFS file system, as follows:

• Stride size: the default value is inherited from the mkfs_pfs command.
• Number of component file systems: the default is to use all of the component file systems.
• File system for the initial stripe: the default value for this is chosen at random.

You can override the default layout on a per-file basis using the PFSIO_SETMAP ioctl on file creation.

Note: This will truncate the file, destroying the content. See Section 3.3.3.3 on page 3-10 for more information about the PFSIO_SETMAP ioctl.

PFS file systems also have the following characteristics:

Copying a sequential file to a PFS file system will cause the file to be striped. The stride size, number of component file systems, and start file are all set to the default for that file system.

Copying a file from a PFS file system to the same PFS file system will reset the layout characteristics of the file to the default values.

3.3.2 Optimizing a PFS File System

The performance of a PFS file system is improved if accesses to the component data on the underlying CFS file systems follow the
19 for other agencies.

HEWLETT-PACKARD COMPANY
3000 Hanover Street
Palo Alto, California 94304 U.S.A.

Use of this manual and media is restricted to this product only. Additional copies of the programs may be made for security and back-up purposes only. Resale of the programs, in their present form or with alterations, is expressly prohibited.

Copyright Notices

© 2002 Hewlett-Packard Company. Compaq Computer Corporation is a wholly owned subsidiary of the Hewlett-Packard Company.

Some information in this document is based on Platform documentation, which includes the following copyright notice: Copyright © 2002 Platform Computing Corporation.

The HP MPI software that is included in this HP AlphaServer SC software release is based on the MPICH V1.2.1 implementation of MPI, which includes the following copyright notice:

© 1993 University of Chicago
© 1993 Mississippi State University

Permission is hereby granted to use, reproduce, prepare derivative works, and to redistribute to others. This software was authored by:

Argonne National Laboratory Group
W. Gropp: (630) 252-4318; FAX: (630) 252-7852; e-mail: gropp@mcs.anl.gov
E. Lusk: (630) 252-5986; FAX: (630) 252-7852; e-mail: lusk@mcs.anl.gov
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne IL 60439

Mississippi State Group
N. Doss and A. Skjellum: (601) 325-8435; FAX: (601) 325-8997; e-mail: tony@erc.msstate.edu
Mississippi State University, Computer
1.3 Cluster File System (CFS)

CFS is a file system that is layered on top of underlying per-node AdvFS file systems. CFS does not change or manage on-disk file system data; rather, it is a value-add layer that provides the following capabilities:

• Shared root file system
  CFS provides each member of the CFS domain with coherent access to all file systems, including the root (/) file system. All nodes in the file system share the same root.

• Coherent name space
  CFS provides a unifying view of all of the file systems served by the constituent nodes of the CFS domain. All nodes see the same path names. A mount operation by any node is immediately visible to all other nodes. When a node boots into a CFS domain, its file systems are mounted into the domainwide CFS.

  Note: One of the nodes physically connected to the root file system storage must be booted first (typically the first or second node of a CFS domain). If another node boots first, it will pause in the boot sequence until the root file server is established.

• High availability and transparent failover
  CFS, in combination with the device request dispatcher, provides disk and file system failover. The loss of a file-serving node does not mean the loss of its served file systems. As long as one other node in the domain has physical connectivity to the relevant storage, CFS will transparently migrate the file service to the new node.

• Scalability
  The system is highly
Attributes (see Section 4.2 on page 4-2)
• Tuning SCFS (see Section 4.3 on page 4-5)
• SCFS Failover (see Section 4.4 on page 4-8)

4.1 SCFS Overview

The HP AlphaServer SC system is comprised of multiple Cluster File System (CFS) domains. There are two types of CFS domains: File-Serving (FS) domains and Compute-Serving (CS) domains. HP AlphaServer SC Version 2.5 supports a maximum of four FS domains.

The SCFS file system exports file systems from an FS domain to the other domains. Therefore, it provides a global file system across all nodes of the HP AlphaServer SC system. The SCFS file system is a high-performance file system that is optimized for large I/O transfers. When accessed via the FAST mode, data is transferred between the client and server nodes using the HP AlphaServer SC Interconnect network for efficiency.

SCFS file systems may be configured by using the scfsmgr command. You can use the scfsmgr command or SysMan Menu (on any node, or on a management server, if present) to manage all SCFS file systems. The system automatically reflects all configuration changes on all domains. For example, when you place an SCFS file system on line, it is mounted on all domains.

The underlying storage of an SCFS file system is an AdvFS fileset on an FS domain. Within an FS domain, access to the file system from any node is managed by the CFS file system and has the usual attributes of C
Check if any users of the file system are left on the domain.

Run the fuser command on each node of the domain and kill any processes that are using the file system.

If you are using PFS on top of SCFS, run the fuser command on the PFS file system first, and then kill all processes using the PFS file system.

Unmount the PFS file system using the following command (assuming domain name atlasD2 and PFS file system /pdata):

# scrun -d atlasD2 -m all '/usr/sbin/umount_pfs /pdata'

The umount_pfs command may report errors if some components have already unmounted cleanly.

Check whether the unmount occurred using the following command:

# scrun -d atlasD2 -m all '/usr/sbin/mount | grep /pdata'

Note: If still mounted on any node, repeat the umount_pfs command on that node.

Run the fuser command on the SCFS file systems and kill all processes using the SCFS.

Unmount the SCFS using the following command, where /pd1 is an SCFS:

# scrun -d atlasD2 '/usr/sbin/umount /pd1'

Once the SCFS has been unmounted, remount the SCFS file system using the following command:

# scfsmgr sync

Note: Steps 7 and 8 may fail either because one or more processes could not be killed, or because the SCFS still cannot be unmounted. If that happens, the only remaining option is to reboot the cluster. Send the dumpsys output to the local HP AlphaServer SC Support Center for analysis.
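For reference, a typical way to perform the fuser steps described above is shown below. This is a sketch only: the -c and -k options (report on the processes using the named mount point, and send them SIGKILL) are the usual approach on Tru64 UNIX, but you should confirm the options against the fuser(8) reference page and first run fuser without -k to identify the processes before killing anything. The domain and mount point names are the examples used in this section.

    # scrun -d atlasD2 -m all '/usr/sbin/fuser -c /pdata'
    # scrun -d atlasD2 -m all '/usr/sbin/fuser -c -k /pdata'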
GETLOCAL (see Section 3.3.3.7 on page 3-12)
PFSIO_GETFSLOCAL (see Section 3.3.3.8 on page 3-13)

Note: The following ioctl calls will be supported in a future version of the HP AlphaServer SC system software:
PFSIO_HSMARCHIVE: instructs PFS to archive the given file.
PFSIO_HSMISARCHIVED: queries whether the given PFS file is archived or not.

3.3.3.1 PFSIO_GETFSID

Description: For a given PFS file, retrieves the ID for the PFS file system. This is a unique 128-bit value.

Data Type: pfsid_t

Example: 376a643c 000ce681 00000000 4553872c

3.3.3.2 PFSIO_GETMAP

Description: For a given PFS file, retrieves the mapping information that specifies how it is laid out across the component file systems. This information includes the number of component file systems, the ID of the component file system containing the first data block of a file, and the stride size.

Data Type: pfsmap_t

Example: The PFS file system consists of two components, 64KB stride:
    Slice: Base 0, Count 2
    Stride: 65536
This configures the file to be laid out with the first block on the first component file system, and a stride size of 64KB.

3.3.3.3 PFSIO_SETMAP

Description: For a given PFS file, sets the mapping information that specifies how it is laid out across the component file systems. Note that this will truncate the file, destroying the
Global

F
FAST Mode, 2-3
File System
    Recommended, 5-1
File System Overview, 2-1
FS Domain, 1-3, 4-2

I
Internal Storage, See Storage, Local
Ioctl, See PFS

L
Local Disks, 1-3

P
Parallel File System, See PFS
PFS (Parallel File System), 1-5
    Attributes, 3-2
    Ioctl Calls, 3-9
    Optimizing, 3-7
    Overview, 2-4, 3-2
    Planning, 3-4
    Storage Capacity, 3-4
    Structure, 3-4
    Using, 3-6

R
RAID, 2-12

S
SCFS, 1-5, 2-2
    Configuration, 4-2
    Failover, 4-8
    Overview, 4-2
    Tuning, 4-5
Storage
    Global, 2-10
    Local, 2-9
    Overview, 2-1, 2-8
    System, 2-12
Stride, 3-3
Stripe, 3-3

U
UBC Mode, 4-3
5  Recommended File System Layout

The information in this chapter is arranged as follows:
• Recommended File System Layout (see Section 5.1 on page 5-2)

5.1 Recommended File System Layout

Before storage and file systems are configured, the primary use of the file systems should be identified.

PFS and SCFS file systems are designed and optimized for applications that need to dump large amounts of data in a short period of time, and should be considered for the following:
• Checkpoint and restart applications
• Applications that write large amounts of data

Note: The HP AlphaServer SC Interconnect reaches its optimum bandwidth at message sizes of 64KB and above. Because of this, optimal SCFS performance will be attained by applications performing transfers that are in excess of this figure. An application performing a single 8MB write is just as efficient as an application performing eight 1MB writes or sixty-four 128KB writes; in fact, a single 8MB write is slightly more efficient, due to the decreased number of system calls.

Example 5-1 below displays sample I/O block sizes. To display sample block sizes, run the Tru64 UNIX dd command.

Example 5-1 Sample I/O Blocks

time dd if=/dev/zero of=/fs/hsv_fs0/testfile bs=4k count=102400
102400+0 records in
102400+0 records out

real 68.5
user 0.1
sys 15.4
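To illustrate the guidance in the note above from application code, the following minimal C sketch writes a large in-memory buffer to a file with a single large write() call rather than many small ones. The file path, buffer size, and data values are illustrative only; the point is simply that each write() transfers an amount of data well above the 64KB to 128KB sizes at which the interconnect and SCFS reach full bandwidth.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NELEMS (8UL * 1024 * 1024 / sizeof(double))  /* 8MB worth of doubles */

    int
    main(void)
    {
        double *array;
        size_t  i, nbytes = NELEMS * sizeof(double);
        ssize_t written;
        int     fd;

        /* Illustrative data: one large contiguous buffer. */
        array = malloc(nbytes);
        if (array == NULL) {
            perror("malloc");
            exit(1);
        }
        for (i = 0; i < NELEMS; i++)
            array[i] = (double)i;

        /* Placeholder path: any file on a PFS or SCFS FAST file system. */
        fd = open("/pfs/data/outfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            exit(1);
        }

        /*
         * A single 8MB write() is slightly more efficient than many small
         * writes: fewer system calls, and every transfer is large enough to
         * exploit the double buffering described above.
         */
        written = write(fd, array, nbytes);
        if (written != (ssize_t)nbytes) {
            perror("write");
            exit(1);
        }

        close(fd);
        free(array);
        return 0;
    }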
SCFS file system as ONLINE, the system will mount the SCFS file system on all CFS domains. When you mark the SCFS file system as OFFLINE, the system will unmount the file system on all CFS domains.

The state is persistent. For example, if an SCFS file system is marked ONLINE and the system is shut down and then rebooted, the SCFS file system will be mounted as soon as the system has completed booting.

Mount Status

This indicates whether an SCFS file system is mounted or not. This attribute is specific to a CFS domain; that is, each CFS domain has a mount status. The mount status values are listed in Table 4-1.

Table 4-1 SCFS Mount Status Values

mounted: The SCFS file system is mounted on the domain.

not mounted: The SCFS file system is not mounted on the domain.

mounted busy: The SCFS file system is mounted, but an attempt to unmount it has failed because the SCFS file system is in use.
When a PFS file system uses an SCFS file system as a component of the PFS, the SCFS file system is in use and cannot be unmounted until the PFS file system is also unmounted. In addition, if a CS domain fails to unmount the SCFS, the FS domain does not attempt to unmount the SCFS, but instead marks it as mounted busy.

Table 4-1 SCFS Mount Status Values (continued)

Mount Status: mounted stale, mount not served, mount failed, mount noresponse, mounted io err
and export the file system. The scfsmgr command performs the following tasks:
• Creates the AdvFS file domain and fileset
• Creates the mount point
• Populates the requisite configuration information in the sc_scfs table in the SC database, and in the /etc/exports file
• Nominates the preferred file server node
• Synchronizes the other domains, causing the file systems to be imported and mounted at the same mount point

To create the PFS file system, the system administrator uses the pfsmgr command to specify the operational parameters for the PFS and identify the component file systems. The pfsmgr command performs the following tasks:
• Builds the PFS by creating on-disk data structures
• Creates the mount point for the PFS
• Synchronizes the client systems
• Populates the requisite configuration information in the sc_pfs table in the SC database

The following extract shows example contents from the sc_scfs table in the SC database:

clu_domain  advfs_domain  fset_name  preferred_server  rw  speed  status  mount_point
atlasD0     scfs0_domain  scfs0      atlas0            rw  FAST   ONLINE  /scfs0
atlasD0     scfs1_domain  scfs1      atlas1            rw  FAST   ONLINE  /scfs1
atlasD0     scfs2_domain  scfs2      atlas2            rw  FAST   ONLINE  /scfs2
atlasD0     scfs3_domain  scfs3      atlas3            rw  FAST   ONLINE  /scfs3

In this example, the system administrator created the four component file systems, nominating the respective nodes as the preferred file server (se
and fread() functions is set at 8K. This buffer size can be increased by supplying a user-defined buffer and using the setbuffer() function call.

Note: There is no environment variable setting that can change this unless a special custom library is developed to provide the functionality. Buffering can only take place within the application for stdio fread() and fwrite() calls, and not read() and write() function calls.

For more information on the setbuffer() command, read the manpage.

6.4 Third Party Applications

Third-party application I/O may be improved by enabling buffering for FORTRAN (refer to Section 6.2), or by setting PFS parameters on files that you know about and that are not required to be created by the code.

Note: Care should be exercised when setting the default behaviour to buffered I/O. The nature and interaction of the I/O has to be well understood before setting this parameter. If the application is written in C, there are no environment variables that can be set to change the behaviour.

Index

A
Abbreviations, xiii

C
CFS (Cluster File System)
    Overview, 1-3
CFS Domain
    Overview, 1-2
Cluster File System, See CFS
Code Examples, xix
CS Domain, 1-3

D
Documentation
    Conventions, xviii
    Online, xix

E
Examples, Code, xix
External Storage, See Storage,
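As a concrete illustration of the stdio buffering technique described in Section 6.3 above, the following minimal C sketch supplies a 1MB user buffer to a stream with setbuffer() before writing. The file name, buffer size, and record size are illustrative assumptions; the key points are that setbuffer() is called after fopen() and before the first I/O on the stream, and that the enlarged buffer lets small fwrite() calls reach the file system as large transfers.

    #include <stdio.h>
    #include <stdlib.h>

    #define BUFSIZE (1024 * 1024)   /* 1MB stdio buffer instead of the 8K default */

    int
    main(void)
    {
        static char iobuf[BUFSIZE];
        double      record[16384];  /* 128KB of doubles per fwrite() */
        FILE       *fp;
        int         i;

        /* Illustrative path: any file on the PFS or SCFS file system. */
        fp = fopen("/pfs/data/buffered_out", "w");
        if (fp == NULL) {
            perror("fopen");
            exit(1);
        }

        /* Replace the default stdio buffer with the larger user buffer.
         * This must happen before any read or write on the stream. */
        setbuffer(fp, iobuf, sizeof(iobuf));

        for (i = 0; i < 16384; i++)
            record[i] = (double)i;

        /* Each fwrite() accumulates in the 1MB buffer, so the data reaches
         * the file system in large transfers rather than 8K chunks. */
        for (i = 0; i < 64; i++)
            fwrite(record, sizeof(record), 1, fp);

        fclose(fp);
        return 0;
    }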
multiple file server nodes are used, multiple file systems will always be exported. This solution can work for installations that wish to scale file system bandwidth by balancing I/O load over multiple file systems. However, it is more generally the case that installations require a single file system, or a small number of file systems, with scalable performance.

PFS provides this capability. A PFS file system is constructed from multiple component file systems. Files in the PFS file system are striped over the underlying component file systems.

When a file is created in a PFS file system, its mapping to component file systems is controlled by a number of parameters, as follows:

• The component file system for the initial stripe
  This is selected at random from the set of components. Using a random selection ensures that the load of multiple concurrent file accesses is distributed.

• The stride size
  This parameter is set at file system creation. It controls how much data is written per file to a component before the next component is used.

• The number of components used in striping
  This parameter is set at file system creation. It specifies the number of component file systems over which an individual file will be striped. The default is all components. In file systems with very large numbers of components, it can be more efficient to use only a subset of components per file (see discussion below).

• The block size
  This number should
performance guidelines for CFS. The following guidelines will help to achieve this goal:

1. In general, consider the stripe count of the PFS file system.

   If a PFS is formed from more than 8 component file systems, we recommend setting the default stripe count to a number that is less than the total number of components. This will reduce the overhead incurred when creating and deleting files, and improve the performance of applications that access numerous small to medium-sized files.

   For example, if a PFS file system is constructed using 32 components, we recommend selecting a default stripe count of 8 or 4. The desired stripe count for a PFS can be specified when the file system is created, or using the PFSIO_SETDFLTMAP ioctl. See Section 3.3.3.5 on page 3-11 for more information about the PFSIO_SETDFLTMAP ioctl.

2. For PFS file systems consisting of FAST-mounted SCFS components, consider the stride size.

   As SCFS FAST mode is optimized for large I/O transfers, it is important to select a stride size that takes advantage of SCFS while still taking advantage of the parallel I/O capabilities of PFS. We recommend setting the stride size to at least 512K.

   To make efficient use of both PFS and SCFS capabilities, an application should read or write data in sizes that are multiples of the stride size.

   For example, a large file is being written to a 32-component PFS; the stripe co
the limitations can also be overcome if the serial workloads are run on the FS domains on nodes which do not serve the file system. For example, if the FS domain consists of 6 nodes, and 4 of these nodes were the CFS servers for the component file systems for PFS, by running on one of the other two nodes you should be able to see a benefit for small I/O and serial general workloads.

If the workload is run on nodes that serve the file system, the interaction with remote I/O and the local jobs will be significant.

These applications should consider an alternative type of file system.

Note: Alternative file systems that can be used are either locally available file systems, or Network File Systems (NFS).

To configure PFS and SCFS file systems in an optimal way, the following should be considered:

1. Stride Size of the PFS
2. Stripe Count of the PFS
3. Mount Mode of SCFS

5.1.1 Stride Size of the PFS

The stride size of PFS should be large enough to allow the double-buffering effects of SCFS operations to take place on write operations. The minimum recommended stride size is 512K. Depending on the most common application use, the stride size can be made larger to optimize performance for the majority of use. This will depend on the application load in question.

5.1.2 Stripe Count of the PFS

The benefits of a larger stripe count are to be seen where multiple write
time dd if=/dev/zero of=/fs/hsv_fs0/testfile bs=1024k count=400
400+0 records in
400+0 records out
atlas64>

PFS and SCFS file systems are not recommended for the following:

• Applications that only access small amounts of data in a single I/O operation.

  PFS/SCFS is not recommended for applications that only access small amounts of data in a single I/O operation (for example, 1KB reads or writes are very inefficient). PFS/SCFS works best when each I/O operation has a large granularity (for example, a large multiple of 128KB).

  With PFS/SCFS, if an application is writing out a large data structure (for example, an array), it would be better to specify to write the whole array as a single operation, than to write it as one operation per row or column. If that is not possible, then it is still much better to access the array one row or column at a time than to access it one element at a time.

• Applications that require caching of data.

• Serial general workloads.

  PFS and SCFS file systems are not suited to serial general workloads due to limitations in PFS mmap support and lack of mmap support when using SCFS on CS domains. Serial general workloads can use linkers, performance analysis, and/or instrumentation tools which require use of mmap.

  Some of the limitations of PFS and SCFS can be overcome if the PFS is configured with a default stripe width of one. Some of
3-10).

3.3.3.6 PFSIO_GETFSMAP

Description: For a given PFS file system, retrieves the number of component file systems, and the default stride size.

Data Type: pfsmap_t

Example: The PFS file system consists of eight components, 128KB stride:
    Slice: Base 0, Count 8
    Stride: 131072
This configures the file to be laid out with the first block on the first component file system, and a stride size of 128KB. For PFSIO_GETFSMAP, the base is always 0; the component file system layout is always described with respect to a base of 0.

3.3.3.7 PFSIO_GETLOCAL

Description: For a given PFS file, retrieves information that specifies which parts of the file are local to the host. This information consists of a list of slices, taken from the layout of the file across the component file systems, that are local. Blocks laid out across components that are contiguous are combined into single slices, specifying the block offset of the first of the components, and the number of contiguous components.

Data Type: pfsslices_ioctl_t

Example:
(a) The PFS file system consists of three components, all local; file starts on the first component:
    Size: 3
    Count: 1
    Slice: Base 0, Count 3
(b) The PFS file system consists of three components, second is local; file starts on the first component:
    Size: 3
    Count: 1
    Slice: Base 1, Count 1
(c) The PFS file system consists o
Count of a PFS to an Input Value

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <inttypes.h>
#include <libgen.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/fs/pfs/common.h>
#include <sys/fs/pfs/map.h>

static char *cmd_name = "pfs_set_stripes";
static int def_stripes = 1;
static int max_stripes = 256;

void
usage(int status, char *msg)
{
    if (msg)
        fprintf(stderr, "%s: %s\n", cmd_name, msg);

    printf("Usage: %s <filename> [<stripes>]\nwhere\n\t<stripes> defaults to %d\n",
           cmd_name, def_stripes);
    exit(status);
}

int
main(int argc, char *argv[])
{
    int fd, status, stripes = def_stripes;
    pfsmap_t map;

    cmd_name = strdup(basename(argv[0]));

    if (argc < 2)
        usage(1, NULL);

    if ((argc == 3) &&
        (((stripes = atoi(argv[2])) <= 0) || (stripes > max_stripes)))
        usage(1, "Invalid stripe count");

    if ((fd = open(argv[1], O_CREAT | O_TRUNC, 0666)) < 0) {
        fprintf(stderr, "Error opening file %s\n", argv[1]);
        exit(1);
    }

    /*
     * Get the current map.
     */
    status = ioctl(fd, PFSIO_GETDFLTMAP, &map);
    if (status != 0) {
        fprintf(stderr, "Error getting the pfs map data\n");
        exit(1);
    }

    map.pfsmap_slice.ps_count = stripes;

    status = ioctl(fd, PFSIO_SETDFLTMAP, &map);
    if (status
FS file systems (common mount point, coherency, and so on). An FS domain serves the SCFS file system to nodes in the other domains. In effect, an FS domain exports the file system, and the other domains import the file system.

This is similar to, and in fact uses features of, the NFS system. For example, /etc/exports is used for SCFS file systems. The mount point of an SCFS file system uses the same name throughout the HP AlphaServer SC system so there is a coherent file name space. Coherency issues related to data and metadata are discussed later.

4.2 SCFS Configuration Attributes

The SC database contains SCFS configuration data. The /etc/fstab file is not used to manage the mounting of SCFS file systems. However, the /etc/exports file is used for this purpose. Use SysMan Menu or the scfsmgr command to edit this configuration data; do not update the contents of the SC database directly. Do not add entries to, or remove entries from, the /etc/exports file. Once entries have been created, you can edit the /etc/exports file in the usual way.

An SCFS file system is described by the following attributes:

AdvFS domain and fileset name

This is the name of the AdvFS domain and fileset that contains the underlying data storage of an SCFS file system. This information is only used by the FS domain that serves the SCFS file system. However, although AdvFS domain and files
FS file systems should be created so that files are spread over the appropriate component file systems or servers. If only a subset of nodes will be accessing a file, then it may be useful to limit the file layout to the subset of component file systems that are local to these nodes, by selecting the appropriate stripe count.

The amount of data associated with an operation is important, as this determines what the stride and block sizes should be for a PFS file system. A small block size will require more I/O operations to obtain a given amount of data, but the duration of the operation will be shorter. A small stride size will cycle through the set of component file systems faster, increasing the likelihood of multiple file systems being active simultaneously.

3. The layout of a file should be tailored to match the access pattern for the file. Serial access may benefit from a small stride size, delivering improved read or write bandwidth. Random access performance should improve as more than one file system may seek data at the same time. Strided data access may require careful tuning of the PFS block size and the file data stride size to match the size of the access stride.

4. The base file system for a file should be carefully selected to match application access patterns. In particular, if many files are accessed in lock step, then careful selection of the base file system for
File System
UID     User Identifier
UTP     Unshielded Twisted Pair
UUCP    UNIX-to-UNIX Copy Program
WEBES   Web-Based Enterprise Service
WUI     Web User Interface

Documentation Conventions

Table 0-2 lists the documentation conventions that are used in this document.

Table 0-2 Documentation Conventions

Convention        Description
%                 A percent sign represents the C shell system prompt.
$                 A dollar sign represents the system prompt for the Bourne and Korn shells.
#                 A number sign represents the superuser prompt.
P00>>>            A P00>>> sign represents the SRM console prompt.
Monospace type    Monospace type indicates file names, commands, system output, and user input.
Boldface type     Boldface type in interactive examples indicates typed user input. Boldface type in body text indicates the first occurrence of a new term.
Italic type       Italic (slanted) type indicates emphasis, variable values, placeholders, menu options, function argument names, and complete titles of documents.
UPPERCASE TYPE    Uppercase type indicates variable names and RAID controller commands.
Underlined type   Underlined type emphasizes important information.
[ | ], { | }      In syntax definitions, brackets indicate items that are optional and braces indicate items that are required. Vertical bars separating items inside brackets or braces indicate that you choose one item from among those listed.
cat(1)
Ctrl/x
Note
atlas
In syntax definit
4.2  SCFS Configuration Attributes ............ 4-2
4.3  Tuning SCFS ............ 4-5
4.3.1  Tuning SCFS Kernel Subsystems ............ 4-5
4.3.2  Tuning SCFS Server Operations ............ 4-6
4.3.2.1  SCFS I/O Transfers ............ 4-6
4.3.2.2  SCFS Synchronization Management ............ 4-6
4.3.3  Tuning SCFS Client Operations ............ 4-7
4.3.4  Monitoring SCFS Activity ............ 4-7
4.4  SCFS Failover ............ 4-8
4.4.1  SCFS Failover in the File Server Domain ............ 4-8
4.4.2  Failover on an SCFS Importing Node ............ 4-8
4.4.2.1  Recovering from Failure of an SCFS Importing Node ............ 4-8

5  Recommended File System Layout
5.1  Recommended File System Layout ............ 5-2
5.1.1  Stride Size of the PFS ............ 5-4
5.1.2  Stripe Count of the PFS ............ 5-4
5.1.3  Mount Mode of the SCFS ............ 5-4
5.1.4  Home File Systems and Data File Systems ............ 5-5

6  Streamlining Application I/O Performance
6.1  PFS Performance Tuning ............ 6-1
6.2  FORTRAN ............ 6-4
6.3  C ............ 6-5
6.4  Third Party Applications ............ 6-5

Index

List of Figures

Figure 1-1  CFS Makes File Systems Available to All Cluster Members
S Kernel Subsystems (see Section 4.3.1 on page 4-5)
• Tuning SCFS Server Operations (see Section 4.3.2 on page 4-6)
• Tuning SCFS Client Operations (see Section 4.3.3 on page 4-7)
• Monitoring SCFS Activity (see Section 4.3.4 on page 4-7)

4.3.1 Tuning SCFS Kernel Subsystems

To tune any of the SCFS subsystem attributes permanently, you must add an entry to the appropriate subsystem stanza, either scfs or scfs_client, in the /etc/sysconfigtab file. Do not edit the /etc/sysconfigtab file directly; use the sysconfigdb command to view and update its contents. Changes made to the /etc/sysconfigtab file will take effect when the system is next booted. Some of the attributes can also be changed dynamically using the sysconfig command, but these settings will be lost after a reboot unless the changes are also added to the /etc/sysconfigtab file.

4.3.2 Tuning SCFS Server Operations

A number of configurable attributes in the scfs kernel subsystem affect SCFS serving. Some of these attributes can be dynamically configured, while others require a reboot before they take effect. For a detailed explanation of the scfs subsystem attributes, see the sys_attrs_scfs(5) reference page.

The default settings for the scfs subsystem attributes should work well for a mixed workload. However, performance may be improved by tuning some of the parameters.

4.3.2.1 SCFS I/O Transfers

SCFS I/O achieves b
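The following hedged sketch illustrates the tuning workflow described in Section 4.3.1 above. The attribute name some_scfs_attribute and its value are placeholders only (consult the sys_attrs_scfs(5) reference page for the real attribute names and valid values); the commands show the general Tru64 UNIX pattern of merging a stanza file into /etc/sysconfigtab with sysconfigdb and changing a value at run time with sysconfig, and should be checked against the sysconfigdb(8) and sysconfig(8) reference pages before use.

    # cat scfs_tuning.stanza
    scfs:
            some_scfs_attribute = 16

    # sysconfigdb -m -f scfs_tuning.stanza scfs
    # sysconfigdb -l scfs
    # sysconfig -q scfs
    # sysconfig -r scfs some_scfs_attribute=16

The sysconfigdb changes take effect at the next boot; the sysconfig -r change applies immediately but is lost at reboot unless it is also recorded in /etc/sysconfigtab, as noted in Section 4.3.1.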
Science Department & NSF Engineering Research Center for Computational Field Simulation, P.O. Box 6176, Mississippi State MS 39762

GOVERNMENT LICENSE

Portions of this material resulted from work developed under a U.S. Government Contract and are subject to the following license: the Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable worldwide license in this computer software to reproduce, prepare derivative works, and perform publicly and display publicly.

DISCLAIMER

This computer code material was prepared, in part, as an account of work sponsored by an agency of the United States Government. Neither the United States, nor the University of Chicago, nor Mississippi State University, nor any of their employees, makes any warranty express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights.

Trademark Notices

Microsoft® and Windows® are U.S. registered trademarks of Microsoft Corporation.

UNIX® is a registered trademark of The Open Group.

Expect is public domain software, produced for research purposes by Don Libes of the National Institute of Standards and Technology, an agency of the U.S. Department of Commerce Technology Administration.

Tcl (Tool command language) is a freely distributable language, designed
a PFS file system, each FS domain must import the other domain's SCFS file systems; that is, the SCFS file systems are cross-mounted between domains. (See Chapter 4 for a description of FS and CS domains.)

3.1.1 PFS Attributes

A PFS file system has a number of attributes, which determine how the PFS striping mechanism operates for files within the PFS file system. Some of the attributes, such as the set of component file systems, can only be configured when the file system is created, so you should plan these carefully (see Section 3.2 on page 3-4). Other attributes, such as the size of the stride, can be reconfigured after file system creation; these attributes can also be configured on a per-file basis.

The PFS attributes are as follows:

NumFS (Component File System List)

A PFS file system is comprised of a number of component file systems. The component file system list is configured when a PFS file system is created.

Block (Block Size)

The block size is the maximum amount of data that will be processed as part of a single operation on a component file system. The block size is configured when a PFS file system is created.

Stride (Stride Size)

The stride size is the amount, or stride, of data that will be read from, or written to, a single component file system before advancing to the next component file system, selected in a round-robin fashion. The stride value must be
achment Unit Interface
BIND    Berkeley Internet Name Domain
CAA     Cluster Application Availability

Table 0-1 Abbreviations

Abbreviation  Description
CD-ROM        Compact Disc Read-Only Memory
CDE           Common Desktop Environment
CDFS          CD-ROM File System
CDSL          Context-Dependent Symbolic Link
CFS           Cluster File System
CLI           Command Line Interface
CMF           Console Management Facility
CPU           Central Processing Unit
CS            Compute-Serving
DHCP          Dynamic Host Configuration Protocol
DMA           Direct Memory Access
DMS           Dataless Management Services
DNS           Domain Name System
DRD           Device Request Dispatcher
DRL           Dirty Region Logging
DRM           Distributed Resource Management
EEPROM        Electrically Erasable Programmable Read-Only Memory
ELM           Elan License Manager
EVM           Event Manager
FastFD        Fast, Full Duplex
FC            Fibre Channel
FDDI          Fiber-optic Digital Data Interface
FRU           Field Replaceable Unit
FS            File-Serving
GUI           Graphical User Interface
HBA           Host Bus Adapter

Table 0-1 Abbreviations (continued)

Abbreviation  Description
HiPPI         High Performance Parallel Interface
HPSS          High Performance Storage System
HWID          Hardware (component) Identifier
ICMP          Internet Control Message Protocol
ICS           Internode Communications Service
IP            Internet Protocol
JBOD          Just a Bunch of Disks
JTAG          Joint Test Action Group
KVM           Keyboard Video Mouse
LAN           Local Area Network
LIM           Load Information Manager
LMF           License Management Fa
LSF
LSM
MAU
MB3
MFS
MIB
MPI
MTS
an integral multiple of the block size (see Block above).

The default stride value is defined when a PFS file system is created, but this default value can be changed using the appropriate ioctl (see Section 3.3.3.5 on page 3-11). The stride value can also be reconfigured on a per-file basis using the appropriate ioctl (see Section 3.3.3.3 on page 3-10).

Stripe (Stripe Count)

The stripe count specifies the number of component file systems to stripe data across, in cyclical order, before cycling back to the first file system. The stripe count must be non-zero, and less than or equal to the number of component file systems (see NumFS above).

The default stripe count is defined when a PFS file system is created, but this default value can be changed using the appropriate ioctl (see Section 3.3.3.5 on page 3-11). The stripe count can also be reconfigured on a per-file basis using the appropriate ioctl (see Section 3.3.3.3 on page 3-10).

Base (Base File System)

The base file system is the index of the file system, in the list of component file systems, that contains the first stripe of file data. The base file system must be between 0 and NumFS - 1 (see NumFS above).

The default base file system is selected when the file is created, based on the modulus of the file inode number and the number of component file systems. The base file system can also be reconfigured on a per-file basis using the appropriate ioctl (see Section 3.3.3.3 on page 3-10).
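To make the Stride, Stripe, and Base definitions above concrete, the following small C sketch computes, for a given byte offset in a file, which component file system holds that byte under the round-robin layout rule. It is a stand-alone illustration only: it assumes the simple case where the stripe count equals the number of components used and that each component stores its strides contiguously, the attribute values are examples rather than defaults, and none of the variable names are part of the PFS interfaces.

    #include <stdio.h>

    int
    main(void)
    {
        long stride  = 512 * 1024;  /* Stride: bytes per component before advancing */
        int  stripes = 4;           /* Stripe: number of components striped across  */
        int  base    = 1;           /* Base: component holding the first stride     */

        long offset = 3 * 1024 * 1024 + 100;  /* example byte offset in the file */

        long stride_index = offset / stride;                  /* which stride of the file */
        int  component    = (base + stride_index) % stripes;  /* round-robin component    */
        long local_offset = (stride_index / stripes) * stride + (offset % stride);

        printf("offset %ld falls in stride %ld, on component %d, at local offset %ld\n",
               offset, stride_index, component, local_offset);
        return 0;
    }

With these example values, byte offset 3MB+100 lies in stride 6 of the file, which the round-robin rule places on component (1 + 6) mod 4 = 3.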
and Storage
Chapter 3  Managing the Parallel File System (PFS)
Chapter 4  Managing the SC File System (SCFS)
Chapter 5  Recommended File System Layout
Chapter 6  Streamlining Application I/O Performance

Related Documentation

You should have a hard copy or soft copy of the following documents:
• HP AlphaServer SC Release Notes
• HP AlphaServer SC Installation Guide
• HP AlphaServer SC System Administration Guide
• HP AlphaServer SC Interconnect Installation and Diagnostics Manual
• HP AlphaServer SC RMS Reference Manual
• HP AlphaServer SC User Guide
• HP AlphaServer SC Platform LSF Administrator's Guide
• HP AlphaServer SC Platform LSF Reference Guide
• HP AlphaServer SC Platform LSF User's Guide
• HP AlphaServer SC Platform LSF Quick Reference
• HP AlphaServer ES45 Owner's Guide
• HP AlphaServer ES40 Owner's Guide
• HP AlphaServer DS20L User's Guide
• HP StorageWorks HSG80 Array Controller CLI Reference Guide
• HP StorageWorks HSG80 Array Controller Configuration Guide
• HP StorageWorks Fibre Channel Storage Switch User's Guide
• HP StorageWorks Enterprise Virtual Array HSV Controller User Guide
• HP StorageWorks Enterprise Virtual Array Initial Setup User Guide
• HP SANworks Release Notes: Tru64 UNIX Kit for Enterprise Virtual Array
• HP SANworks Installation and Configuration Guide: Tru64 UNIX Kit for Enterprise Virtual Array
• HP SANworks Scripting Utility for Enterprise Virtual Array Reference Guide
• Compaq TruCluster Server Cluster Re
be less than or equal to the stride size. The stride size must be an even multiple of the block size. The default block size is the same value as the stride size. This parameter specifies how much data the PFS system will issue, in a read or write command, to the underlying file system. Generally, there is not a lot of benefit in changing the default value. SCFS, which is used for the underlying PFS components, is more efficient at bigger transfers, so leaving the block size equal to the stride size maximizes SCFS efficiency.

These parameters are specified at file system creation. They can be modified by a PFS-aware application or library using a set of PFS-specific ioctls.

In a configuration with a large number of component file systems and a large client population, it can be more efficient to restrict the number of stripe components. With a large client population writing to every file server, the file servers experience a higher rate of interrupts. By restricting the number of stripe components, individual file server nodes will serve a smaller number of clients, but the aggregate throughput of all servers remains the same. Each client will still get a degree of parallel I/O activity, due to its file being striped over a number of components. This is true where each client is writing to a different file. If each client process is writing to the same file, it is obviously optimal to stripe over all components
cility
LSF     Load Sharing Facility
LSM     Logical Storage Manager
MAU     Multiple Access Unit
MB3     Mouse Button 3
MFS     Memory File System
MIB     Management Information Base
MPI     Message Passing Interface
MTS     Message Transport System
NFS     Network File System
NIFF    Network Interface Failure Finder
NIS     Network Information Service
NTP     Network Time Protocol
NVRAM   Non-Volatile Random Access Memory
OCP     Operator Control Panel

Table 0-1 Abbreviations (continued)

Abbreviation  Description
OS            Operating System
OSPF          Open Shortest Path First
PAK           Product Authorization Key
PBS           Portable Batch System
PCMCIA        Personal Computer Memory Card International Association
PE            Process Element
PFS           Parallel File System
PID           Process Identifier
PPID          Parent Process Identifier
RAID          Redundant Array of Independent Disks
RCM           Remote Console Monitor
RIP           Routing Information Protocol
RIS           Remote Installation Services
              LSF Adapter for RMS
RMC           Remote Management Console
RMS           Resource Management System
RPM           Revolutions Per Minute
SC            SuperComputer
SCFS          HP AlphaServer SC File System
SCSI          Small Computer System Interface
SMP           Symmetric Multiprocessing
SMTP          Simple Mail Transfer Protocol
SQL           Structured Query Language
SRM           System Resources Manager
SROM          Serial Read-Only Memory
SSH           Secure Shell

Table 0-1 Abbreviations (continued)

Abbreviation  Description
TCL           Tool Command Language
UBC           Universal Buffer Cache
UDP           User Datagram Protocol
UFS           UNIX
UID
UTP
UUCP
WEBES
WUI
d suggestions that you have on this document. Please send all comments and suggestions to your HP Customer Support representative.

1  hp AlphaServer SC System Overview

This guide does not attempt to cover all aspects of normal HP AlphaServer SC system administration (these are covered in detail in the HP AlphaServer SC System Administration Guide), but rather focuses on aspects that are specific to I/O performance.

This chapter is organized as follows:
• SC System Overview (see Section 1.1 on page 1-1)
• CFS Domains (see Section 1.2 on page 1-2)
• Cluster File System (CFS) (see Section 1.3 on page 1-3)
• Parallel File System (PFS) (see Section 1.4 on page 1-5)
• SC File System (SCFS) (see Section 1.5 on page 1-5)

1.1 SC System Overview

An HP AlphaServer SC system is a scalable, distributed-memory, parallel computer system that can expand to up to 4096 CPUs. An HP AlphaServer SC system can be used as a single compute platform to host parallel jobs that consume up to the total compute capacity.

The HP AlphaServer SC system is constructed through the tight coupling of up to 1024 HP AlphaServer ES45 nodes, or up to 128 HP AlphaServer ES40 or HP AlphaServer DS20L nodes. The nodes are interconnected using a high-bandwidth (340 MB/s), low-latency (3 µs) switched fabric; this fabric is called a rail.

For ease of management, the HP AlphaServer SC nodes are organized into multiple Cluster File System (CFS) domains. Each CFS do
e Section 2.4 on page 2-8). This caused each of the CS domains to import the four file systems and mount them at the same point in their respective name spaces. The PFS file system was built on the FS domain using the four component file systems; the resultant PFS file system was mounted on the FS domain. Each of the CS domains also mounted the PFS at the same mount point.

The end result is that each domain sees the same PFS file system at the same mount point. Client PFS accesses are translated into client SCFS accesses and are served by the appropriate SCFS file server node. The PFS file system can also be accessed within the FS domain. In this case, PFS accesses are translated into CFS accesses.

When building a PFS, the system administrator has the following choice:

• Use the set of complete component file systems, for example:
  /pfs_comps/fs1
  /pfs_comps/fs2
  /pfs_comps/fs3
  /pfs_comps/fs4

• Use a set of subdirectories within the component file systems, for example:
  /pfs_comps/fs1/x
  /pfs_comps/fs2/x
  /pfs_comps/fs3/x
  /pfs_comps/fs4/x

Using the second method allows the system administrator to create different PFS file systems (for instance, with different operational parameters) using the same set of underlying components. This can be useful for experimentation. For production-oriented PFS file systems, the first method is preferred.

2.4 Preferred File Server Nodes and Failover
This information includes the number of component file systems, the ID of the component file system containing the first data block of a file, and the stride size.

Data Type:    pfsmap_t

Example:      The PFS file system consists of three components (64KB stride).
              Slice: Base = 2, Count = 3, Stride = 131072
              This configures the file to be laid out with the first block on the third component file system, and a stride size of 128KB. (The stride size of the file can be an integral multiple of the PFS block size.)

3.3.3.4 PFSIO_GETDFLTMAP

Description:  For a given PFS file system, retrieves the default mapping information that specifies how newly created files will be laid out across the component file systems.
              This information includes the number of component file systems, the ID of the component file system containing the first data block of a file, and the stride size.

Data Type:    pfsmap_t

Example:      See PFSIO_GETMAP (Section 3.3.3.2 on page 3-10).

3.3.3.5 PFSIO_SETDFLTMAP

Description:  For a given PFS file system, sets the default mapping information that specifies how newly created files will be laid out across the component file systems.
              This information includes the number of component file systems, the ID of the component file system containing the first data block of a file, and the stride size.

Data Type:    pfsmap_t

Example:      See PFSIO_SETMAP (Section 3.3.3.3 on page 3-10).
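The GETMAP, SETMAP, and default-map calls all exchange the same pfsmap_t structure, so a single query sketch illustrates the calling convention. This is a sketch only: the structure's member names are not given in this section, so pfs_count, pfs_base, and pfs_stride below are hypothetical, as is the assumption that the requests follow the usual negative-return ioctl convention; see <sys/fs/pfs/map.h> on an installed system for the real definitions.

    /*
     * Sketch: print the layout of one file on a PFS file system using
     * PFSIO_GETMAP.  The pfsmap_t member names (pfs_count, pfs_base,
     * pfs_stride) are assumptions; the real definitions are in
     * <sys/fs/pfs/map.h> on an installed system.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/fs/pfs/map.h>     /* PFSIO_* requests and pfsmap_t */

    int main(int argc, char *argv[])
    {
        pfsmap_t map;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file-on-pfs>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, PFSIO_GETMAP, &map) < 0) {
            perror("PFSIO_GETMAP");
            close(fd);
            return 1;
        }
        /* For the example above this would report base 2 and stride 131072. */
        printf("%s: components used = %d, base component = %d, stride = %ld\n",
               argv[1], map.pfs_count, map.pfs_base, (long)map.pfs_stride);
        close(fd);
        return 0;
    }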
The parallel file transfer protocol (pftp) can achieve good parallel performance by accessing PFS files in a sequential (stride = 1) fashion. However, the performance may be further improved by integrating the mover with PFS, so that it understands the layout of a PFS file. This enables the mover to alter its access patterns to match the file layout.

3.3.3 PFS ioctl Calls

Valid PFS ioctl calls are defined in the map.h header file (<sys/fs/pfs/map.h>) on an installed system. A PFS ioctl call requires an open file descriptor for a file (either the specific file being queried or updated, or any file) on the PFS file system.

In PFS ioctl calls, the N different component file systems are referred to by index number (0 to N-1). The index number is that of the corresponding symbolic link in the component file system root directory.

The sample program ioctl_example.c, provided in the /Examples/pfs_example directory on the HP AlphaServer SC System Software CD-ROM, demonstrates the use of PFS ioctl calls; a build sketch follows this list.

HP AlphaServer SC Version 2.5 supports the following PFS ioctl calls:

• PFSIO_GETFSID (see Section 3.3.3.1 on page 3-10)
• PFSIO_GETMAP (see Section 3.3.3.2 on page 3-10)
• PFSIO_SETMAP (see Section 3.3.3.3 on page 3-10)
• PFSIO_GETDFLTMAP (see Section 3.3.3.4 on page 3-11)
• PFSIO_SETDFLTMAP (see Section 3.3.3.5 on page 3-11)
• PFSIO_GETFSMAP (see Section 3.3.3.6 on page 3-11)
• PFSIO
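The sample program mentioned above can simply be copied from the CD-ROM and built with the system C compiler. Only the directory and file names come from the text; the CD-ROM mount point and the argument passed to the built program below are assumptions.

    # Illustrative only: copy and build the sample PFS ioctl program from
    # the HP AlphaServer SC System Software CD-ROM.  /mnt/cdrom and the
    # final command-line argument are assumptions.
    cp /mnt/cdrom/Examples/pfs_example/ioctl_example.c .
    cc -o ioctl_example ioctl_example.c
    ./ioctl_example /pfs/scratch/somefile     # any file on the PFS file system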
each file can ensure that the load is spread evenly across the component file system servers. Similarly, when a file is accessed in a strided fashion, careful selection of the base file system may be required to spread the data stripes appropriately.

3.3 Using a PFS File System

A PFS file system supports POSIX semantics and can be used in the same way as any other Tru64 UNIX file system (for example, UFS or AdvFS), except as follows:

• PFS file systems are mounted with the nogrpid option implicitly enabled. Therefore, SVID III semantics apply. For more details, see the AdvFS and UFS options for the mount(8) command.

• The layout of the PFS file system, and of files residing on it, can be interrogated and changed using special PFS ioctl calls (see Section 3.3.3 on page 3-9).

• The PFS file system does not support file locking using the lockf(2), fcntl(2), or lockf(3) interfaces.

• PFS provides support for the mmap() system call for multicomponent file systems, sufficient to allow the execution of binaries located on a PFS file system. This support is, however, not always robust enough to support how some compilers, linkers, and profiling tools make use of the mmap() system call when creating and modifying binary executables. Most of these issues can be avoided if the PFS file system is configured to use a stripe count of 1 by default (that is, to use only a single data component per file).

The information in this section is organized as follows:
43. ed to large data transfers where bypassing the UBC provides  better performance  In addition  since accesses are made directly to the serving node   multiple writes by several client nodes are serialized  hence  data coherency is pre   served  Multiple readers of the same data will all have to obtain the data individually  from the server node since the UBC is bypassed on the client nodes     While a file is opened via the FAST mode  all subsequent file open    calls on that  cluster will inherit the FAST attribute even if not explicitly specified     Managing the SC File System  SCFS  4 3    SCFS Configuration Attributes    44        Access is through the UBC  This corresponds to the UBC mode     The UBC mode is suited to small data transfers  such as those produced by formatted  writes in Fortran  Data coherency has the same characteristics as NFS     Ifa file is currently opened via the UBC mode  and a user attempts to open the same  file via the FAST mode  an error  EINVAL  is returned to the user     Whether the SCFS file system is mounted FAST or UBC  the access for individual files  is overridden as follows         Ifthe file has an executable bit set  access is via the UBC  that is  uses the UBC path         Ifthe file is opened with the O SCFSIO option  defined in   sys sc  s  h7   access  is via the FAST path     ONLINE or OFFLINE    You do not directly mount or unmount SCFS file systems  Instead  you mark the SCFS file  system as ONLINE or OFFLINE  When you mark an
44. een dirty for longer than sync  period seconds  The default value of the  sync period attribute is 10     e The amount of dirty data associated with the file exceeds sync dirty size  The  default value ofthe sync dirty size attribute is 64MB     Managing the SC File System  SCFS     Tuning SCFS    e The number of write transactions since the last synchronization exceeds  sync handle trans  The default value of the sync handle trans attribute is 204     If an application generates a workload that causes one of these conditions to be reached very  quickly  poor performance may result because I O to a file regularly stalls waiting for the  synchronize operation to complete  For example  if an application writes data in 128KB  blocks  the default sync handle trans value would be exceeded after writing 25 5MB   Performance may be improved by increasing the sync handle trans value  You must  propagate this change to every node in the FS domain  and then reboot the FS domain     Conversely  an application may generate a workload that does not cause the   sync dirty sizeand sync handle trans limits to be exceeded     for example  an  application that writes 32MB in large blocks to a number of different files  In such cases  the  data is not synchronized to disk until the sync  period has expired  This could result in  poor performance as UBC resources are rapidly consumed  and the storage subsystems are  left idle  Tuning the dynamically reconfigurable attribute sync period to a lowe
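On Tru64 UNIX, statically configured attributes are normally set through /etc/sysconfigtab and dynamically reconfigurable ones through sysconfig -r, so the tuning described above might be applied as sketched below. The stanza name scfs and the attribute spellings follow this chapter's text, but they should be confirmed with sysconfig -q on an installed system before use; the values shown are purely illustrative.

    # Illustrative only -- confirm the stanza and attribute names first:
    #     sysconfig -q scfs
    #
    # 1. Statically, via an /etc/sysconfigtab stanza on every node in the
    #    FS domain (followed by a reboot of that domain), for example to
    #    let a 128KB-block writer go further between synchronizations:
    scfs:
        sync_handle_trans = 1024
    #
    # 2. Dynamically, for the reconfigurable sync_period attribute
    #    (seconds), for example for workloads that trickle data to many
    #    files and leave the storage idle between synchronizations:
    #     sysconfig -r scfs sync_period=5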
45. er of data file systems to be accessed and viewed  as a single file system view  The PFS file system stores the data as stripes across the  component file systems  as shown in Figure 3 1     Normal I O Operations    SEE  Parallel File    Component  File 2                 are striped over multiple host files                     Component  File 1    Component  File 3    Component  File 4    Figure 3 1 Parallel File System    Files written to a PFS file system are written as stripes of data across the set of component file  systems  For a very large file  approximately equal portions of a file will be stored on each file  system  This can improve data throughput for individual large data read and write operations   because multiple file systems can be active at once  perhaps across multiple hosts     Similarly  distributed applications can work on large shared datasets with improved performance   if each host works on the portion of the dataset that resides on locally mounted data file systems     Underlying a component file system is an SCFS file system  The component file systems of a  PFS file system can be served by several File Serving  FS  domains  Where there is only one  FS domain  programs running on the FS domain access the component file system via the  CFS file system mechanisms  Programs running on Compute Serving  CS  domains access  the component file system remotely via the SCFS file system mechanisms  If several FS  domains are involved in serving components of 
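To make the striping idea concrete, the following toy calculation shows how a byte offset in a striped file maps onto a component file system, assuming a simple round-robin layout with a given base component and stride. It is an illustration of the concept only, not PFS's actual on-disk algorithm (PFS also keeps file metadata on the first, root component, as noted in Section 3.2).

    /*
     * Toy model of round-robin striping: given "numfs" components, a file
     * starting on component "base", and a stride of "stride" bytes,
     * compute which component holds a given file offset and where within
     * that component's portion the byte falls.
     */
    #include <stdio.h>

    static void locate(long offset, int numfs, int base, long stride)
    {
        long stripe   = offset / stride;                /* full strides before this offset */
        int component = (int)((base + stripe) % numfs); /* round-robin over components     */
        long local    = (stripe / numfs) * stride + (offset % stride);

        printf("offset %ld -> component %d, local offset %ld\n",
               offset, component, local);
    }

    int main(void)
    {
        /* Example: 4 components, file starting on component 0, 64KB stride. */
        locate(0L,           4, 0, 64L * 1024);
        locate(100L * 1024,  4, 0, 64L * 1024);   /* second stripe, component 1  */
        locate(1024L * 1024, 4, 0, 64L * 1024);   /* 1MB in, back on component 0 */
        return 0;
    }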
46. erview  For information on configuring NFS  refer to the Compaq TruCluster Server  Cluster Administration Guide     For sites that have a single file system for both home and data files  it is recommended to set  the execute bit on files that are small and require caching  and use a stripe count of 1     Recommended File System Layout 5 5    6    Streamlining Application I O Performance    The file system for the HP AlphaServer SC system and individual files can be tuned for  better I O performance  The information in this chapter is arranged as follows     e PFS Performance Tuning  see Section 6 1 on page 6 1   e FORTRAN  see Section 6 2 on page 6 4    e C  see Section 6 3 on page 6 5    e Third Party Applications  see Section 6 4 on page 6 5     6 1 PFS Performance Tuning    PFS specific ioct1s can be used to set the size of a stride and the number of stripes in a file   This is normally done just after the file has been created and before any data has been written  to the file  otherwise the file will be truncated     The default stripe count and stride can be set in a similar manner     Example 6 1 below describes the code to set the default stripe count of a PFS to the value  input to the program  Similar use of 1oct1s can be incorporated into C code or in  FORTRAN via a callout to a C function     A FORTRAN unit number can be converted to a C file descriptor via the get fd  3     function call  see Example 6   2 and Example 6   3      Example 6 1 Set the Default Stripe 
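The body of Example 6-1 is truncated in this copy of the manual, so the following is a hedged reconstruction of a program with the stated purpose (set the default stripe count of a PFS to the value supplied on the command line), not the manual's own listing. As in the earlier sketch, the pfsmap_t member names are assumptions; see <sys/fs/pfs/map.h> for the real definitions.

    /*
     * Sketch in the spirit of Example 6-1: set the default stripe count
     * of a PFS file system to the value given on the command line.  The
     * pfsmap_t member names (pfs_count, pfs_base, pfs_stride) are
     * assumptions.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/fs/pfs/map.h>

    int main(int argc, char *argv[])
    {
        pfsmap_t map;
        int fd;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <file-on-pfs> <stripe-count>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Start from the current default map, then change only the count. */
        if (ioctl(fd, PFSIO_GETDFLTMAP, &map) < 0) {
            perror("PFSIO_GETDFLTMAP");
            close(fd);
            return 1;
        }
        map.pfs_count = atoi(argv[2]);
        if (ioctl(fd, PFSIO_SETDFLTMAP, &map) < 0) {
            perror("PFSIO_SETDFLTMAP");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }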
47. est performance results when processing large I O requests     If a client generates a very large I O request  such as writing 512MB of data to a file  this  request will be performed as a number of smaller operations  The size of these smaller  operations is dictated by the io size attribute of the server node for the SCFS file system   The default value ofthe io size attribute is 16MB     This subrequest is then sent to the SCFS server  which in turn performs the request as a  number of smaller operation  This time  the size of the smaller operations is specified by the  io block attribute  The default value ofthe io block attribute is 128KB  This allows the  SCFS server to implement a simple double buffering scheme which overlaps I O and  interconnect transfers     Performance for very large requests may be improved by increasing the io size attribute   though this will increase the setup time for each request on the client  You must propagate  this change to every node in the FS domain  and then reboot the FS domain     Performance for smaller transfers    256K B  may also be improved slightly by reducing the  io block size  to increase the effect of the double buffering scheme  You must propagate  this change to every node in the FS domain  and then reboot the FS domain     4 3 2 2 SCFS Synchronization Management    4 6    The SCFS server will synchronize the dirty data associated with a file to disk  if one or more  of the following criteria is true     e The file has b
48. et names  generally need only be unique within a given CFS domain  the SCFS system uses unique  names  Therefore  the AdvFS domain and fileset name must be unique across the HP  AlphaServer SC system     In addition  HP recommends the following conventions       You should use only one AdvFS fileset in an AdvFS domain         The domain and fileset names should use a common root name  For example  an  appropriate name would be data domainddata     SysMan Menu uses these conventions  The sc  smgr command allows more flexibility   Mountpoint    This is the pathname of the mountpoint for the SCFS file system  This is the same on all  CFS domains in the HP AlphaServer SC system     Preferred Server    This specifies the node that normally serves the file system  When an FS domain is  booted  the first node that has access to the storage will mount the file system  When the  preferred server boots  it takes over the serving of that storage  For best performance  the  preferred server should have direct access to the storage  The c  smgr command controls  which node serves the storage     Read Write or Read Only   This has exactly the same syntax and meaning as in an NFS file system    FAST or UBC   This attribute refers to the default behavior of clients accessing the FS domain  The client  has two possible paths to access the FS domain         Bypass the Universal Buffer Cache  UBC  and access the serving node directly  This  corresponds to the FAST mode     The FAST mode is suit
49. f three components  second is remote  file starts on first    component   Size  3  Count  2    Slices  Base  0 Count   1  Base   2 Count   1  d  The PFS file system consists of three components  second is remote  file starts on second    component   Size  3  Count  1    Slice  Base   1 Count   2    Managing the Parallel File System  PFS     Using a PFS File System    3 3 3 8 PFSIO_GETFSLOCAL    Description     Data Type     Example     For a given PFS file system  retrieves information that specifies which of the components are  local to the host    This information consists of a list of slices  taken from the set of components  that are local   Components that are contiguous are combined into single slices  specifying the ID of the first  component  and the number of contiguous components     pfsslices ioctl t    a  The PFS file system consists of three components  all local   Size  3  Count 1  Slice  Base  0 Count   3  b  The PFS file system consists of three components  second is local   Size  3  Count  1  Slice  Base   1 Count   1  c  The PFS file system consists of three components  second is remote   Size  3  Count  2  Slices  Base  0 Count   1  Base   2 Count   1    Managing the Parallel File System  PFS  3 13    4    Managing the SC File System  SCFS     The SC file system  SCFS  provides a global file system for the HP AlphaServer SC system   The information in this chapter is arranged as follows    e SCFS Overview  see Section 4 1 on page 4 2    e SCFS Configuration
                              fd, PFSIO_SETMAP, &map);
        if (status != 0) {
            error = status;
            return;
        }
        error = 0;
        return;
    }

6.2 FORTRAN

FORTRAN programs that write small records (for example, using formatted write statements) will not perform well on an SCFS FAST-mounted PFS file system. To optimize the performance of a FORTRAN program that writes in small chunks on an SCFS FAST-mounted PFS file system, it may be possible to compile the application with the option -assume buffered_io.

This enables buffering within FORTRAN, so that data is written at a later stage, once the size of the FORTRAN buffer has been exceeded. In addition, FORTRAN buffering can be controlled by the environment variable FORT_BUFFERED.

Individual files can also be opened with buffering enabled by explicitly adding the BUFFERED directive to the FORTRAN open call.

Note:
The benefit of using the option -assume buffered_io depends on the nature of the application's I/O characteristics. This modification is most appropriate to applications that use FORTRAN formatted I/O.

6.3 C

If the Tru64 UNIX system read() and write() function calls are used, then the data is passed directly to the SCFS or PFS read and write functions.

However, if the fwrite() and fread() stdio functions are used, then buffering can take place within the application. The default buffer for fwrite
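One way to obtain in C an effect similar to what -assume buffered_io provides in FORTRAN is to give the stdio stream a larger buffer with the standard setvbuf() call, so that many small fwrite()/fprintf() calls reach SCFS or PFS as a few large transfers. A minimal sketch follows; the 128KB buffer size and the output path are illustrative only.

    /*
     * Illustration: give a stdio stream a large private buffer so that
     * many small formatted writes are flushed to SCFS/PFS in big chunks.
     * setvbuf() must be called after fopen() and before any other
     * operation on the stream.
     */
    #include <stdio.h>

    #define BIG_BUF (128 * 1024)

    int main(void)
    {
        static char buf[BIG_BUF];
        FILE *fp = fopen("/pfs/data/records.out", "w");  /* illustrative path */
        int i;

        if (fp == NULL) {
            perror("fopen");
            return 1;
        }
        if (setvbuf(fp, buf, _IOFBF, sizeof(buf)) != 0) {
            fprintf(stderr, "setvbuf failed\n");
            fclose(fp);
            return 1;
        }
        for (i = 0; i < 1000000; i++)          /* many small records        */
            fprintf(fp, "record %d\n", i);
        fclose(fp);                            /* flushes the large buffer  */
        return 0;
    }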
51. gned and implemented by Dr  John Ousterhout of  Scriptics Corporation     The following product names refer to specific versions of products developed by Quadrics Supercomputers World Limited    Quadrics    These products combined with technologies from HP form an integral part of the supercomputing systems  produced by HP and Quadrics  These products have been licensed by Quadrics to HP for inclusion in HP AlphaServer SC  systems       Interconnect hardware developed by Quadrics  including switches and adapter cards    Elan  which describes the PCI host adapter for use with the interconnect technology developed by Quadrics    PFS or Parallel File System      RMS or Resource Management System    Preface coss te co eae rile at ere een ue n d Meares    1 hp AlphaServer SC System Overview    1 1 SC System Overview            0 cece cette tenes  1 2  CES DODalnS   uu tirer ox Uer IRIURE EU SERRE  1 3 Cluster File System  CFS       oooooooooooorororrrrcrmo    1 4 Parallel File System  PFS              0 000 e eee ee eee eee  1 5 SC File System  SCFS           0    0 ce cece ete eee eee eee    2 Overview of File Systems and Storage    2 1 Introd  ctionzc  is cere ERR LERRA REX A ES  2 2 SCESu os deas bos HOU oad ae Oe Niels EE  2 2 1 Selection of FAST Mode            0 0 00  eese  2 2 2 Getting the Most Out of SCFS            0 0 0 2  00000   2 3 PHS PE  2 3 1 PES and SCES erica bM beu BEA E ES  2 3 1 1 User Process Operation             0 00 esses  2 3 1 2 System Administrator Ope
52. hen the maximum capacity would be 6GB     that is  3GB  Minimum  Capacity  x 2  File Systems      For information on how to extend the storage capacity of PFS file systems  see the HP  AlphaServer SC Administration Guide     3 2 Planning a PFS File System to Maximize Performance    3 4    The primary goal  when using a PFS file system  is to achieve improved file access  performance  scaling linearly with the number of component file systems  NumFS    However  it is possible for more than one component file system to be served by the same  server  in which case the performance may only scale linearly with the number of servers     To achieve this goal  you must analyze the intended use of the PFS file system  For a given  application or set of applications  determine the following criteria     e Number of Files    An important factor when planning a PFS file system is the expected number of files     Managing the Parallel File System  PFS     Planning a PFS File System to Maximize Performance    If expecting to use a very large number of files in a large number of directories  then you  should allow extra space for PFS file metadata on the first  root  component file system   The extra space required will be similar in size to the overhead required to store the files  on an AdvFS file system     Access Patterns    How data files will be accessed  and who will be accessing the files  are two very  important criteria when determining how to plan a PFS file system     Ifa file i
53. hp AlphaServer SC  Best Practices I O Guide    January 2003    This document describes how to administer best practices for I O on an AlphaServer  SC system from the Hewlett Packard Company     Revision Update Information This is a new manual     Operating System and Version  Compaq Tru64 UNIX Version 5 1A  Patch Kit 2    Software Version  Version 2 5   Maximum Node Count  1024 nodes   Node Type  HP AlphaServer ES45  HP AlphaServer ES40    HP AlphaServer DS20L    Legal Notices    The information in this document is subject to change without notice     Hewlett Packard makes no warranty of any kind with regard to this manual  including  but not limited to  the implied  warranties of merchantability and fitness for a particular purpose  Hewlett Packard shall not be held liable for errors  contained herein or direct  indirect  special  incidental or consequential damages in connection with the furnishing   performance  or use of this material     Warranty    A copy of the specific warranty terms applicable to your Hewlett Packard product and replacement parts can be obtained  from your local Sales and Service Office     Restricted Rights Legend    Use  duplication or disclosure by the U S  Government is subject to restrictions as set forth in subparagraph  c   1   ii  of the  Rights in Technical Data and Computer Software clause at DFARS 252 227 7013 for DOD agencies  and subparagraphs  c    1  and  c   2  of the Commercial Computer Software Restricted Rights clause at FAR 52 227
54. ions  a horizontal ellipsis indicates that the preceding item can be  repeated one or more times     A vertical ellipsis indicates that a portion of an example that would normally be  present is not shown     A cross reference to a reference page includes the appropriate section number in  parentheses  For example  cat  1  indicates that you can find information on the    cat command in Section 1 of the reference pages     This symbol indicates that you hold down the first named key while pressing the key  or mouse button that follows the slash     A note contains information that is of special importance to the reader     atlas is an example system name     Multiple CFS Domains    The example system described in this document is a 1024 node system  with 32 nodes in  each of 32 Cluster File System  CFS  domains  Therefore  the first node in each CFS domain  is Node 0  Node 32  Node 64  Node 96  and so on  To set up a different configuration   substitute the appropriate node name s  for Node 32  Node 64  and so on in this manual     For information about the CFS domain types supported in HP AlphaServer SC Version 2 5     see Chapter 1   Location of Code Examples       Code examples are located in the  1  Software CD ROM     Examples directory of the HP AlphaServer SC System    Location of Online Documentation    Online documentation is located in  Software CD ROM     Comments on this Document    the  docs directory of the HP AlphaServer SC System    HP welcomes any comments an
55. lease Notes   Compaq TruCluster Server Cluster Technical Overview   Compaq TruCluster Server Cluster Administration    Compaq TruCluster Server Cluster Hardware Configuration    e Compaq TruCluster Server Cluster Highly Available Applications  e Compaq Tru64 UNIX Release Notes  e Compaq Tru64 UNIX Installation Guide    e Compaq Tru64 UNIX Network Administration  Connections    e Compaq Tru64 UNIX Network Administration  Services    e Compaq Tru64 UNIX System Administration    e Compaq Tru64 UNIX System Configuration and Tuning      Summit Hardware Installation Guide from Extreme Networks  Inc     e  ExtremeWare Software User Guide from Extreme Networks  Inc     Note        The Compaq TruCluster Server documentation set provides a wealth of information  about clusters  but there are differences between HP AlphaServer SC clusters and  TruCluster Server clusters  as described in the HP AlphaServer SC System  Administration Guide  You should use the TruCluster Server documentation set to  supplement the HP AlphaServer SC documentation set     if there is a conflict of  information  use the instructions provided in the HP AlphaServer SC document        Abbreviations    Table 0 1 lists the abbreviations that are used in this document     Table 0 1 Abbreviations    Abbreviation    Description       ACL  AdvFS  API  ARP  ATM  AUI  BIND    CAA    Access Control List   Advanced File System   Application Programming Interface  Address Resolution Protocol  Asynchronous Transfer Mode  Att
56. lient systems    Universal Buffer Cache   UBC   Bypassing the UBC avoids copying data from user space to the kernel prior to  shipping it on the network  it allows the system to operate on data sizes larger than the system  page size  8KB      Although bypassing the UBC is efficient for large sequential writes and reads  the data is  read by the client multiple times when multiple processes read the same file  While this will  still be fast  it is less efficient  therefore  it may be worth setting the mode so that UBC is  used  see Section 2 2 1      2 2 1 Selection of FAST Mode    The default mode of operation for an SCFS file system is set when the system administrator  sets up the file system using the scfsmgr command  see Chapter 4      The default mode can be set to FAST  that is  bypasses the UBC  or UBC  that is  uses the  UBC   The default mode applies to all files in the file system     You can override the default mode as follows     e Ifthe default mode for the file system is UBC  specified files can be used in FAST mode  by setting the O FASTIO option on the file open    call     e Ifthe default mode for the file system is FAST  specified files can be opened in UBC  mode by setting the execute bit on the file      Note        If the default mode is set to UBC  the file system performance and characteristics are  equivalent to that expected of an NFS mounted file system           1  Note that mmap    operations are not supported for FAST files  This is because mmap
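As a brief sketch of the per-file overrides described above, consider a file system whose default mode is UBC. O_FASTIO is the open() flag named in this section (Chapter 4 refers to an O_SCFSIO open option for the same purpose); the header that defines the flag is not identified here, so the include below is an assumption to verify on an installed system, and the file path is illustrative.

    /*
     * Illustration of the per-file overrides for an SCFS file system
     * whose default mode is UBC.  The header location for O_FASTIO is an
     * assumption; check the SCFS headers on an installed system.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/scfs/scfs.h>      /* assumed location of O_FASTIO */

    int main(void)
    {
        /* Force the FAST (UBC-bypass) path for one large data file. */
        int fd = open("/scfs/data/bigfile.dat",
                      O_WRONLY | O_CREAT | O_FASTIO, 0644);

        if (fd < 0) {
            perror("open with O_FASTIO");
            return 1;
        }
        /* ... large sequential writes benefit most from the FAST path ... */
        close(fd);
        return 0;
        /*
         * Conversely, on a FAST-mounted file system, "chmod +x file"
         * (setting the execute bit) forces the UBC path for that file.
         */
    }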
57. luster Members    See the HP AlphaServer SC Administration Guide for more information about the Cluster  File System     hp AlphaServer SC System Overview    Parallel File System  PFS     1 4 Parallel File System  PFS     PFS is a higher level file system  which allows a number of file systems to be accessed and  viewed as a single file system view  PFS can be used to provide a parallel application with   scalable file system performance  This works by striping the PFS over multiple underlying   component file systems  where the component file systems are served by different nodes     A system does not have to use PFS  where it does  PFS will co exist with CFS     See Chapter 3 for more information about PFS     1 5 SC File System  SCFS     SCFS provides a global file system for the HP AlphaServer SC system     The SCFS file system exports file systems from the FS domains to the other domains  It  replaces the role of NFS for inter domain sharing of files within the HP AlphaServer SC  system  The SCFS file system 1s a high performance system that uses the HP AlphaServer  SC Interconnect     See Chapter 4 for more information about SCFS     hp AlphaServer SC System Overview 1 5    2    Overview of File Systems and Storage    This chapter provides an overview of the file system and storage components of the HP  AlphaServer SC system     The information in this chapter is structured as follows    e Introduction  see Section 2 1 on page 2 2    e SCFS  see Section 2 2 on page 2 2 
58. main shares a common domain file system  This is  served by the system storage and provides a common image of the operating system  OS   files to all nodes within a domain  Each node has a locally attached disk  which is used to  hold the per node boot image  swap space  and other temporary files     hp AlphaServer SC System Overview 1 1    CFS Domains    1 2 CFS Domains    1 2    HP AlphaServer SC Version 2 5 supports multiple Cluster File System  CFS  domains  Each  CFS domain can contain up to 32 HP AlphaServer ES45  HP AlphaServer ES40  or HP  AlphaServer DS20Ls nodes  providing a maximum of 1024 HP AlphaServer SC nodes     Nodes are numbered from 0 to 1023 within the overall system  but members are numbered  from 1 to 32 within a CFS domain  as shown in Table 1   1  where atlas is an example  system name     Table 1 1 Node and Member Numbering in an HP AlphaServer SC System       Node Member CFS Domain  atlas0 memberl atlasDO  atlas31 member32   atlas32 member1 atlasD1  atlas63 member32   atlas64 member1 atlasD2  atlas991 member32   atlas992 member1 atlasD31  atlas1023 member32       System configuration operations must be performed on each of the CFS domains  Therefore   from a system administration point of view  a 1024 node HP AlphaServer SC system may  entail managing a single system or managing several CFS domains     this can be contrasted  with managing 1024 individual nodes  HP AlphaServer SC Version 2 5 provides several new  commands  for example  scrun  scmonmg
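The numbering in Table 1-1 follows mechanically from the 32-nodes-per-domain layout of the example system. The small sketch below reproduces the mapping; it assumes 32 nodes per CFS domain, as in the atlas example, and a differently configured system would use its own domain size.

    /*
     * Mapping implied by Table 1-1 for the example "atlas" system, which
     * has 32 nodes in every CFS domain.
     */
    #include <stdio.h>

    #define NODES_PER_DOMAIN 32

    int main(void)
    {
        int node;

        for (node = 0; node < 1024; node += 31) {
            int domain = node / NODES_PER_DOMAIN;       /* atlasD<domain>     */
            int member = node % NODES_PER_DOMAIN + 1;   /* member1..member32  */
            printf("atlas%-4d -> atlasD%-2d member%d\n", node, domain, member);
        }
        return 0;
    }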
59. ng transferred from the client system via the  HP AlphaServer SC Elan adapter card     This allows overlap of HP AlphaServer SC Interconnect transfers and I O operations  The  sysconfig parameter io block in the SCFS stanza allows you to tune the amount of data  transferred by the SCFS server  see Section 4 3 on page 4   5   The default value is 128KB  If  the typical transfer at your site is smaller than 128K B  you can decrease this value to allow  double buffering to take effect     We recommend UBC mode for applications that use short file system transfers      performance will not be optimal if FAST mode is used  This is because FAST mode trades  the overhead of mapping the user buffer into the HP AlphaServer SC Interconnect against the  efficiency of HP AlphaServer SC Interconnect transfers  Where an application does many  short transfers  less than 16K B   this trade off results in a performance drop  In such cases   UBC mode should be used     2 3 PFS    2 4    Using SCFS  a single FS node can serve a file system or multiple file systems to all of the  nodes in the other domains  When normally configured  an FS node will have multiple  storage sets  see Section 2 5 on page 2 8   in one of the following configurations     e There is a file system per storage set     multiple file systems are exported     e The storage sets are aggregated into a single logical volume using LSM     a single file  system is exported     Overview of File Systems and Storage    PFS    Where
60. omponents     2 3 1 PFS and SCFS    PFS is a layered file system  It reads and writes data by striping it over component file  systems  SCFS is used to serve the component file systems to the CS nodes  Figure 2   1  shows a system with a single FS domain comprised of four nodes  and two CS domains  identified as single clients  The FS domain serves the component file systems to the CS  domains  A single PFS is built from the component file systems        SCFS Client SCFS Server 1   SCFS Server 2                      _A    Client Node in  Compute Domain    Figure 2 1 Example PFS SCFS Configuration    FILE SERVER DOMAIN    2 3 1 1 User Process Operation    Processes running in either  or both  of the CS domains act on files in the PFS system   Depending on the offset within the file  PFS will map the transaction onto one of the  underlying SCFS components and pass the call down to SCFS  The SCFS client code passes  the I O request  this time for the SCFS file system  via the HP AlphaServer SC Interconnect  to the appropriate file server node  At this node  the SCFS thread will transfer the data  between the client   s buffer and the file system  Multiple processes can be active on the PFS  file system at the same time  and can be served by different file server nodes     2 3 1 2 System Administrator Operation    The file systems in an FS domain are created using the sc  smgr command  This command  allows the system administrator to specify all of the parameters needed to create
61. r  scevent  and scalertmgr  that simplify the  management of a large HP AlphaServer SC system     The first two nodes of each CFS domain provide a number of services to the rest of the nodes  in their respective CFS domain     the second node also acts as a root file server backup in  case the first node fails to operate correctly     The services provided by the first two nodes of each CFS domain are as follows     e Serves as the root of the Cluster File System  CFS   The first two nodes in each CFS  domain are directly connected to a different Redundant Array of Independent Disks   RAID  subsystem     e Provides a gateway to an external Local Area Network  LAN   The first two nodes of  each CFS domain should be connected to an external LAN     hp AlphaServer SC System Overview    Cluster File System  CFS     In HP AlphaServer SC Version 2 5  there are two CFS domain types   e  File Serving  FS  domain  e  Compute Serving  CS  domain    HP AlphaServer SC Version 2 5 supports a maximum of four FS domains  The SCFS file  system exports file systems from an FS domain to the other domains  Although the FS  domains can be located anywhere in the HP AlphaServer SC system  HP recommends that  you configure either the first domain s  or the last domain s  as FS domains     this provides a  contiguous range of CS nodes for MPI jobs  It is not mandatory to create an FS domain  but  you will not be able to use SCFS if you have not done so  For more information about SCFS   see Chapter
62. r Nodes and Failover    In HP AlphaServer SC Version 2 5  you can configure up to four FS domains  Although the   FS domains can be located anywhere in the HP AlphaServer SC system  we recommend that  you configure either the first domain s  or the last domain s  as FS domains     this provides  a contiguous range of CS nodes for MPI jobs     Because file server nodes are part of CFS  any member of an FS domain is capable of serving  the file system  When an SCFS file system is being configured  one of the configuration  parameters specifies the preferred server node  This should be one of the nodes with a direct  physical connection to the storage for the file system     If the node serving a particular component fails  the service will automatically migrate to  another node that has connectivity to the storage     2 5 Storage Overview    There are two types of storage in an HP AlphaServer SC system   e Local or Internal Storage  see Section 2 5 1 on page 2 9     e Global or External Storage  see Section 2 5 2 on page 2 10     2 8 Overview of File Systems and Storage    Figure 2   2 shows the HP AlphaServer SC storage configuration     Global External Storage  Mandatory    System Storage    Storage Array         RAID       controller   cA                 Fibre Channel             RAID    controller           cB                 Bx       B             Ox     Node 0     Noce 1    Storage Array      Storage Overview    Global External Storage  Optional    Data Storage       RAID
63. r nodes are normally  connected to external high speed storage subsystems  RAID arrays   These nodes serve the  associated file systems to the remainder of the system  the other FS domain and the CS  domains  via the HP AlphaServer SC Interconnect     Note        Do not run compute jobs on the FS domains  SCFS I O is performed by kernel  threads that run on the file serving nodes  The kernel threads compete with all other  threads on these nodes for I O bandwidth and CPU availability under the control of  the Tru64 UNIX operating system  For this reason  we recommend that you do not  run compute jobs on any nodes in the FS domains  Such jobs will compete with the  SCES server threads for machine resources  and so will lower the throughput that the  SCFS threads can achieve on behalf of other jobs running on the compute nodes        Overview of File Systems and Storage    SCFS    The normal default mode of operation for SCFS is to ship data transfer requests directly to   the node serving the file system  On the server node  there is a per file system SCFS server  thread in the kernel  For a write transfer  this thread will transfer the data directly from the   user   s buffer via the HP AlphaServer SC Interconnect and write it to disk     Data transfers are done in blocks  and disk transfers are scheduled once the block has arrived   This allows large transfers to achieve an overlap between the disk and the HP AlphaServer  SC Interconnect  Note that the transfers bypass the c
64. r value  may improve performance in this case     4 3 3 Tuning SCFS Client Operations    The scfs client kernel subsystem has one configurable attribute  The max buf attribute  specifies the maximum amount of data that a client will allow to be shadow copied for an  SCFS file system  before blocking new requests from being issued  The default value of the  max buf attribute is 256MB  and can be dynamically modified     The client keeps shadow copies of data written to an SCFS file system so that  in the event of  a server crash  the requests can be re issued     The SCFS server notifies clients when requests have been synchronized to disk so that they  can release the shadow copies  and allow new requests to be issued     Ifa client node is accessing many SCFS file systems  for example  via a PFS file system  see  Chapter 3   it may be better to reduce the max_buf setting  This will minimize the impact of  maintaining many shadow copies for the data written to the different file systems     For a detailed explanation of the max  bu   subsystem attribute  see the  sys attrs scfs client  5  reference page     4 3 4 Monitoring SCFS Activity       The activity of the sc  s kernel subsystem  which implements the SCFS I O serving and data  transfer capabilities  can be monitored by using the sc  s xfer stats command  You can  use this command to determine what SCFS file systems a node is using  and report the SCFS    Managing the SC File System  SCFS  4 7    SCFS Failover    usage 
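Because the text notes that max_buf can be modified dynamically, on Tru64 UNIX this would normally be done with sysconfig -r, as sketched below. The subsystem (scfs_client) and attribute (max_buf) names are from this section; the unit of the value is not stated here, so the byte value shown is an assumption to check against the sys_attrs_scfs_client(5) reference page.

    # Illustrative only: inspect and lower the client-side shadow-copy
    # limit on a node that mounts many SCFS file systems (for example,
    # through a wide PFS).
    sysconfig -q scfs_client max_buf
    sysconfig -r scfs_client max_buf=134217728    # 128MB, if the unit is bytes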
65. ration                      2 4 Preferred File Server Nodes and Failover                      2 5 Storage OVER Edel  2 5 1 Local or Internal Storage    esses   2 5 1 1 Using Local Storage for Application  O               2522  Global or External Storage          ooooooooooommmooo    2 5 2 1 System SOTA debs oil are  2 5 2 2 Data Storage    Leer S UG e ER e RR E    3 Managing the Parallel File System  PFS     3 1 PES Overview uc RR RR RR REEL RC RE nei ES  3 1 1 PES  Attributes   oso ub TUS px ANO XA ERO aS  3 1 2 Storage Capacity ofa PFS File System                    32 Planning a PFS File System to Maximize Performance   3 3 Using a PFS File System         ooooooooooororrrrrrrr so  3 3 1 Creating PES Files  i secu a ee ee  3 32 Optimizing a PFS File System         o ooooooooooooo o      Contents    Mofa sh een alt soe   xi    3 3 3 PES TocthGallsn  ted oe tM uiuos Ht at duca Heh obs de uod ds Te 3 9    3 3 3 1 PESIO  GE TESI Doy cerato Cete e IARE S RR REIP Perna 3 10  3 3 3 2 PESIO GETMAD rende Np Re V aN ee TAA 3 10  3 3 3 3 PESIO SETMJAD 4  uU ER PELA EMEN EAS 3 10  3 3 3 4 PFSIO GETDFLTMAP        sssseseese e haa 3 11  3 3 3 5 PESIO   SETDELTM AP    ote serre A Sette Ce a Gees 3 11  3 3 3 6 PESIO GETESMAP   nn e ADR gan C Pea C UE 3 11  3 3 3 7 PESIO GETEOGCAL    jy ae ek oe M AREE UE E 3 12  3 3 3 8 PFSIO GETFSLOCAL        0    cece cece cee cee nce hh 3 13  4 Managing the SC File System  SCFS   4 1 SCES OVetvie Wu ess IL UA utes tue p eee cet ita e id els 4 2  4 2 SCF
66. re  information about the alternate boot disk  see the HP AlphaServer SC Administration Guide     2 5 1 1 Using Local Storage for Application I O    PFS provides applications with scalable file bandwidth  Some applications have processes  that need to write temporary files or data that will be local to that process     for such  processes  you can write the temporary data to any local storage that is not used for boot   swap  and core files  If multiple processes in the application are writing data to their own  local file system  the available bandwidth is the aggregate of each local file system that is  being used     2 5 2 Global or External Storage    2 10    Global or external storage is provided by RAID arrays located in external storage cabinets   connected to a subset of nodes  minimum of two nodes  for availability and throughput     A HSG based storage array contains the following in system cabinets with space for disk  storage     e A pair of HSG80 RAID controllers  e Cache modules    e Redundant power supplies    Overview of File Systems and Storage    Storage Overview    An Enterprise Virtual Array storage system  HSV based  consists of the following     A pair of HSV110 RAID controllers     An array of physical disk drives that the controller pair controls  The disk drives are  located in drive enclosures that house the support systems for the disk drives     Associated physical  electrical  and environmental systems     The SANworks HSV Element Manager  which i
67. rs are all writing to  just one file  Performance improvements are also noticeable  however  where multiple  processes are all writing to multiple files  This will depend on the most common application  type used     As the stripe count of the PFS is increased  the penalty applied to operations  such as  getattr which access each component that the PFS file is striped over  will also increase   You are not advised to stripe the PFS for more than eight components  especially if there are  significant meta data operations on the specific file system     If there are operations that require mmap    support  the recommended configuration is a  stripe count of one  for more information  see the HP AlphaServer SC Administration Guide  and Release Notes      Note        Having a stripe count of one does not mean that the number of components in the  PFS is one  It means that any file in the PFS will only use one component to store  data        5 1 3 Mount Mode of the SCFS    In general  the FAST mode for SCFS is configured  This allows a fast mode operation for  reading and writing data  however  there are some caveats with this mode of operation     e UBC is not used on the client systems  so in general mmap operations will fail   To  disable SCFS FAST mode  and enable SCFS UBC mode on a SCFS FAST mounted file  system  set the execute bit on a file     5 4 Recommended File System Layout    Recommended File System Layout    Note        On a typical file system  the best performance 
68. s the graphical interface to the storage  system  The element manager software resides on the SANworks Management  Appliance and is accessed through a browser     SANworks Management Appliance  switches  and cabling   At least one host attached through the fabric     External storage is fully redundant in that each storage array is connected to two RAID  controllers  and each RAID controller is connected to at least a pair of host nodes  To provide  additional redundancy  a second Fibre Channel switch may be used  but this is not obligatory     We use the following terms to describe RAID configurations     Stripeset  RAID 0    Mirrorset  RAID 1    RAIDset  RAID 3 5    Striped Mirrorset  RAID 0 1   JBOD  Just a Bunch Of Disks     External storage can be organized as Mirrorsets  to ensure that the system continues to  function in the event of physical media failure     External storage is further subdivided as follows     System Storage  see Section 2 5 2 1   Data Storage  see Section 2 5 2 2     Overview of File Systems and Storage 2 11    Storage Overview    2 5 2 1 System Storage    System storage is mandatory and is served by the first node in each CFS domain  The second  node in each CFS domain is also connected to the system storage  for failover  Node pairs 0  and 1  32 and 33  64 and 65  and 96 and 97 each require at least three additional disks  which  they will share from the RAID subsystems  Mirrorsets   These disks are required as follows     e One disk to hold the 
69. s to be shared among a number of process elements  PEs  on different nodes on  the CFS domain  you can improve performance by ensuring that the file layout matches  the access patterns  so that all PEs are accessing the parts of a file that are local to their  nodes     If files are specific to a subset of nodes  then localizing the file to the component file  systems that are local to these nodes should improve performance     If a large file is being scanned in a sequential or random fashion  then spreading the file  over all of the component file systems should benefit performance     File Dynamics and Lifetime    Data files may exist for only a brief period while an application is active  or they may  persist across multiple runs  During this time  their size may alter significantly     These factors affect how much storage must be allocated to the component file systems   and whether backups are required     Bandwidth Requirements    Applications that run for very long periods of time frequently save internal state at  regular intervals  allowing the application to be restarted without losing too much work     Saving this state information can be a very I O intensive operation  the performance of  which can be improved by spreading the write over multiple physical file systems using  PFS  Careful planning is required to ensure that sufficient I O bandwidth is available     To maximize the performance gain  some or all of the following conditions should be met     l     P
70. statistics for the node as a whole  or for the individual file systems  in summary format  or in full detail  This information can be reported for a node as an SCFS server  as an SCFS  client  or both     For details on how to use this command  seethe sc  s xfer stats 8  reference page     4 4 SCFS Failover    The information in this section is organized as follows   e  SCFS Failover in the File Server Domain  see Section 4 4 1 on page 4 8   e Failover on an SCFS Importing Node  see Section 4 4 2 on page 4 8     4 4 1 SCFS Failover in the File Server Domain    SCFS will failover if a node fails in the FS domain because the file systems are CFS and or  AdvFS     4 4 2 Failover on an SCFS Importing Node    Failover on an SCFS importing node relies on NFS cluster failover  As NFS cluster failover  does not exist on Tru64 UNIX  and there are no plans to implement this functionality on  Tru64 UNIX  there are no plans to support SCFS failover in a compute domain     HP AlphaServer SC uses an automated mechanism to allow pfsmgr scfsmgr to unmount  PFS SCFS and remount when the importing SCFS node fails  The automated mechanism  unmounts the file systems and remounts the file systems when the importing node reboots     Note        This implementation does not imply failover        4 4 2 1 Recovering from Failure of an SCFS Importing Node  Note        If the automated mechanism fails  a cluster reboot should not be required to recover   It should be sufficient to reboot the SCFS impor
71. ting node        The automated mechanism runs the sc  smgr sync command on system reboot  There are  two possible reasons why the sc  smgr sync command did not remount the file systems     4 8 Managing the SC File System  SCFS     SCFS Failover    A problem in scfsmgr itself  Review the log files below for further information         The event log  by using the scevent command   and look in particular at SCFS   NES  and PFS classes         The log files in  var sra adm 1log scmountd  Review the log file on the domain  where the failure occurred and not on the management server         The  var sra adm log scmountd scmountd 1og file on the management  server  This log file may contain no direct evidence of the problem  However  if after  member   failed  srad failed to failover to member 2  the log file reports that the  domain did not respond     The file system was not unmounted by Tru64 UNIX  even though the original importing  member has left the cluster     Note        If this occurs  the mount or unmount commands might hang and this will not be  reflected in the log files  In the event of such a failure  send log files and support   ing data to the HP AlphaServer SC Support Centre for analysis and debugging        To facilitate analysis and debugging  follow these steps     1     To gather information on why the file system was not unmounted  run dumps ys from all  nodes in the domain  Send the data gathered to the local HP AlphaServer SC Support  Center for analysis    
72. unt for the file  is 8  and the stride size is 512K  Ifthe file is written in blocks of 4MB or more  this will make  maximum use of both the PFS and SCFS capabilities  as it will generate work for all of the  component file systems on every write  However  setting the stride size to 64K and writing  in blocks of 512K is not a good idea  as it will not make good use of SCFS capabilities     3  For PFS file systems consisting of UBC mounted SCFS components  follow these  guidelines     e Avoid False Sharing    Try to lay the file out across the component file systems such that only one node is likely  to access a particular stripe of data  This is especially important when writing data   False sharing occurs when two nodes try to get exclusive access to different parts of  the same file  This causes the nodes to repeatedly seek access to the file  as their  privileges are revoked     e Maximize Caching Benefits    A second order effect that can be useful is to ensure that regions of a file are  distributed to individual nodes  If one node handles all the operations on a particular  region  then the CFS Client cache is more likely to be useful  reducing the network  traffic associated with accessing data on remote component file systems     File system tools  such as backup and restore utilities  can act on the underlying CFS file  system without integrating with the PFS file system     External file managers and movers  such as the High Performance Storage System  HPSS   and th
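The sizing rule in the example above (stripe count 8, stride 512KB, application writes of 4MB or more) reduces to simple arithmetic: a write of at least stripe count x stride touches every component, and each component then receives one full stride per call. A minimal sketch of the calculation, using the 128KB per-component figure from Chapter 2 as the efficiency threshold:

    /*
     * Sketch of the sizing rule: an application write of at least
     * (stripe count x stride) generates work for every component, and
     * each component then sees a transfer of one full stride.  Transfers
     * below about 128KB per component make poor use of SCFS (Chapter 2).
     */
    #include <stdio.h>

    int main(void)
    {
        long stride       = 512L * 1024;  /* 512KB stride, as in the example */
        int  stripe_count = 8;            /* file striped over 8 components  */
        long full_stripe  = stripe_count * stride;

        printf("full-stripe write size: %ld bytes (%.1f MB)\n",
               full_stripe, full_stripe / (1024.0 * 1024.0));

        if (stride >= 128L * 1024)
            printf("each component sees %ldKB per write: good for SCFS\n",
                   stride / 1024);
        else
            printf("per-component transfer below 128KB: SCFS will be inefficient\n");
        return 0;
    }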
73. will be obtained by writing data in the  largest possible chunks  In all cases  if the files are created with the execute bit set   then the characteristics will be that of NFS on CS domains  and AdvFS on FS  domains  In particular  for small writers or readers that require caching it is useful to  set the execute bit on files        e Small data writes are slow due to the direct communication between the client and server  and the additional latency that this entails     e Ifa process or application requires read caching  this is not available since each read  request will be directed to the server     Note        If any of the above characteristics are an important consideration  then the SCFS  should be configured in UBC mode  SCFS in UBC mode offers exactly the same  performance characteristics as NFS  If SCFS UBC is to be considered  then one  should review why NFS was not configured originally        5 1 4 Home File Systems and Data File Systems    With home file systems  you should configure the system to use NFS due to the nature and  type of usage     Note        SCFS UBC configured file systems  which are equivalent to NFS  can also be  considered if the home file system is served by another cluster in the HP  AlphaServer SC system        File systems that are used for data storage from application output  or for checkpoint restart   will benefit from an SCFS PFS file system     For more information on NFS  refer to the Compaq TruCluster Server Cluster Technical  Ov
74. y scalable  due to the ability to add more active file server nodes     hp AlphaServer SC System Overview 1 3    Cluster File System  CFS     1 4    A key feature of CFS is that every node in the domain is simultaneously a server and a client  of the CFS file system  However  this does not mandate a particular operational mode  for  example  a specific node can have file systems that are potentially visible to other nodes  but  not actively accessed by them  In general  the fact that every node is simultaneously a server  and a client is a theoretical point     normally  a subset of nodes will be active servers of file  systems into the CFS  while other nodes will primarily act as clients     Figure 1 1 shows the relationship between file systems contained by disks on a shared SCSI  bus and the resulting cluster directory structure  Each member boots from its own boot  partition  but then mounts that file system at its mount point in the clusterwide file system   Note that this figure is only an example to show how each cluster member has the same view  of file systems in a CFS domain  Many physical configurations are possible  and a real CFS  domain would provide additional storage to mirror the critical root       usr  and  var file  systems        clusterwide    clusterwide  usr  clusterwide  var    member2  boot_partition    member1  boot_partition       External RAID       Cluster Interconnect    memberid 1 memberid 2  Figure 1 1 CFS Makes File Systems Available to All C
    