Lustre 1.8 Operations Manual

1. In this example, the following is ignored: comment with semicolon BEFORE comment pt11 192.168.0.[92,96]. Because LNET silently ignores pt11 192.168.0.[92,96], these nodes are not properly initialized.

   Best Practice 2: Do not add an excessive number of comments to these options. The Linux kernel has a limit on the length of string module options; it is usually 1KB, but may differ in vendor kernels. If you exceed this limit, errors result and the configuration specified by the user is not processed properly.

   Using Routing Parameters Across a Cluster

   To ease Lustre administration, the same routing parameters can be used across different parts of a routed cluster. For example, the bi-directional routing example above can be used on an entire cluster (TCP clients, TCP-IB routers, and IB servers):

   - TCP clients would ignore o2ib0(ib0) 192.168.10.[1-128] in ip2nets since they have no such interfaces. Similarly, IB servers would ignore tcp0 192.168.0.*. But TCP-IB routers would use both since they are multi-homed.
   - TCP clients would ignore the route "tcp 192.168.10.[1-8]@o2ib0" since the target network is a local network. For the same reason, IB servers would ignore "o2ib 10.10.0.[1-8]@tcp0".
   - TCP-IB routers would ignore both routes because they are multi-homed. Moreover, the routers would enable LNet forwarding since their NIDs are specified in the routes parameters as being ro
2. Lustre 1.8 Operations Manual, October 2009

   #define TESTDIR "/tmp"               /* Results directory */
   #define TESTFILE "lustre_dummy"      /* Name for the file we create/destroy */
   #define FILESIZE 262144              /* Size of the file in words */
   #define DUMWORD "DEADBEEF"           /* Dummy word used to fill files */
   #define MY_STRIPE_WIDTH 2            /* Set this to the number of OST required */
   #define MY_LUSTRE_DIR "/mnt/lustre/ftest"

   int close_file(int fd)
   {
           if (close(fd) < 0) {
                   fprintf(stderr, "File close failed: %d (%s)\n", errno, strerror(errno));
                   return -1;
           }
           return 0;
   }

   int write_file(int fd)
   {
           char *stng = DUMWORD;
           int cnt = 0;

           for (cnt = 0; cnt < FILESIZE; cnt++) {
                   /* Write the characters of the word; sizeof(stng) would be the pointer size */
                   write(fd, stng, strlen(stng));
           }
           return 0;
   }

   /* Open a file, set a specific stripe count, size and starting OST.
      Adjust the parameters to suit. */
   int open_stripe_file()
   {
           char *tfile = TESTFILE;
           int stripe_size = 65536;                /* System default is 4M */
           int stripe_offset = -1;                 /* Start at default */
           int stripe_count = MY_STRIPE_WIDTH;     /* Single stripe for this demo */
           int stripe_pattern = 0;                 /* only RAID 0 at this time */
           int rc, fd;

           rc = llapi_file_create(tfile, stripe_size, stripe_offset, stripe_count, stripe_pattern);
           /* result code is inverted, we may return -EINVAL or an ioctl error.
              We borrow an error message from sanity.c */
           if (rc) {
                   fprintf(stderr, "llapi_file_create failed: %d (%s)\n", rc, strerror(-rc));
                   return -1;
           }
           llap
3. <number> <number>
   <comment> :== "#" { <non-net-sep-chars> }
   <net-sep> :== ";" | "\n"
   <w> :== <whitespace-chars> { <whitespace-chars> }

   <net-spec> contains enough information to uniquely identify the network and load an appropriate LND. The LND determines the missing "address-within-network" part of the NID based on the interfaces it can use.

   <iface-list> specifies which hardware interface the network can use. If omitted, all interfaces are used. LNDs that do not support the <iface-list> syntax cannot be configured to use particular interfaces and just use what is there. Only a single instance of these LNDs can exist on a node at any time, and <iface-list> must be omitted.

   <net-match> entries are scanned in the order declared to see if one of the node's IP addresses matches one of the <ip-range> expressions. If there is a match, <net-spec> specifies the network to instantiate. Note that it is the first match for a particular network that counts. This can be used to simplify the match expression for the general case by placing it after the special cases. For example:

       ip2nets="tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*"

   4 nodes on the 134.32.1.* network have 2 interfaces (134.32.1.{4,6,8,10}) but all the rest have 1.

       ip2nets="vib 192.168.0.*; tcp(eth2) 192.168.0.[1,7,4,12]"

   This describes an IB cluster on 192.168.0.*. Four of thes
4. 30.2.5 VIB LND

   The VIB LND is connection-based, establishing reliable queue-pairs over InfiniBand with its peers. It does not use the acceptor. It is limited to a single instance, using a single HCA that can be specified via the networks module parameter. If this is omitted, it uses the first HCA in numerical order it can open. The address-within-network is determined by the IPoIB interface corresponding to the HCA used.

   Variable                          Description

   service_number (0x11b9a2)
       Fixed IB service number on which the LND listens for incoming connection requests. NOTE: All instances of the viblnd on the same network must have the same setting for this parameter.

   arp_retries (3, W)
       Number of times the LND will retry ARP while it establishes communications with a peer.

   min_reconnect_interval (1, W)
       Minimum connection retry interval (in seconds). After a failed connection attempt, this sets the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry, up to a maximum of the max_reconnect_interval option.

   max_reconnect_interval (60, W)
       Maximum connection retry interval (in seconds)

   timeout (50, W)
   ntx (32)
   ntx_nblk (256)
   concurrent_peers (1152)
   hca_basename (InfiniHost)
   ipif_basename (ipoib)
   local_ack_timeout (0x12, Wc)
   retry_cnt (7, Wc)
   rnr_cnt (6, Wc)
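The retry doubling described for min_reconnect_interval and max_reconnect_interval can be sketched as follows. This is only an illustration of the arithmetic, not viblnd code: the function name `reconnect_intervals` and the `attempts` argument are hypothetical, and the defaults simply mirror the values listed above (1 and 60 seconds).

```python
def reconnect_intervals(min_interval=1, max_interval=60, attempts=8):
    """Return the wait, in seconds, before each successive retry.

    The wait starts at min_interval and doubles after every failed
    attempt, but never exceeds max_interval.
    """
    waits = []
    wait = min_interval
    for _ in range(attempts):
        waits.append(wait)
        wait = min(wait * 2, max_interval)
    return waits


print(reconnect_intervals())  # [1, 2, 4, 8, 16, 32, 60, 60]
```

With the defaults, the interval reaches the 60-second ceiling on the seventh retry and stays there.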
5. lst update_group clients --refresh
   lst update_group clients --clean busy
   lst update_group clients --clean invalid   // invalid == busy || down || unknown
   lst update_group clients --remove 192.168.1.[10-20]@tcp

   list_group [<NAME>] [--active] [--busy] [--down] [--unknown] [--all]

   Prints information about a group or lists all groups in the current session if no group is specified.

   <NAME>       The name of the group.
   --active     Lists the active nodes.
   --busy       Lists the busy nodes.
   --down       Lists the down nodes.
   --unknown    Lists unknown nodes.
   --all        Lists all nodes.

       lst list_group
       1) clients
       2) servers
       Total 2 groups

       lst list_group clients
       ACTIVE BUSY DOWN UNKNOWN TOTAL
       3      1    2    0       6

       lst list_group clients --all
       192.168.1.10@tcp Active
       192.168.1.11@tcp Active
       192.168.1.12@tcp Busy
       192.168.1.13@tcp Active
       192.168.1.14@tcp DOWN
       192.168.1.15@tcp DOWN
       Total 6 nodes

       lst list_group clients --busy
       192.168.1.12@tcp Busy
       Total 1 node

   del_group <NAME>

   Removes a group from the session. If the group is referred to by any test, then the operation fails. If nodes in the group are referred to only by this group, then they are kicked out from the current session; otherwise, they are still in the current session.

       lst del_group clients

   Userland client (lstclient --sesid <NID> --group <NAME>)

   Use lstcli
6. 7.1 Multihomed Servers

   If you are using multiple networks with Lustre, certain configuration settings are required. Throughout this section, a worked example is used to illustrate these settings. In this example, servers megan and oscar each have three TCP NICs (eth0, eth1, and eth2) and an Elan NIC. The eth2 NIC is used for management purposes and should not be used by LNET. TCP clients have a single TCP interface and Elan clients have a single Elan interface.

   7.1.1 Modprobe.conf

   Options under modprobe.conf are used to specify the networks available to a node. You have the choice of two different options: the networks option, which explicitly lists the networks available, and the ip2nets option, which provides a list-matching lookup. Only one option can be used at any one time. The order of LNET lines in modprobe.conf is important when configuring multi-homed servers. If a server node can be reached using more than one network, the first network specified in modprobe.conf will be used.

   Networks

   On the servers:

       options lnet networks=tcp0(eth0,eth1),elan0

   Elan-only clients:

       options lnet networks=elan0

   TCP-only clients:

       options lnet networks=tcp0

   Note: In the case of TCP-only clients, the first available non-loopback IP interface is used for tcp0 since the interfaces are not specified.

   ip2nets

   The ip2nets option is typically used to provide a single, universal modprobe.conf file that can be run
7. Caution: Parameters specified with the lctl conf_param command are set permanently in the file system's configuration file on the MGS.

   Getting Parameters

   To get current Lustre parameter settings, use the lctl get_param command with this syntax:

       lctl get_param [-n] <obdtype>.<obdname>.<proc_file_name>

   For example:

       lctl get_param -n ost.*.ost_io.timeouts

   Listing Parameters

   To list Lustre parameters that are available to set, use the lctl list_param command with this syntax:

       lctl list_param [-n] <obdtype>.<obdname>

   For example:

       lctl list_param obdfilter.lustre-OST0000

   4.3.10 Running the Writeconf Command

   If the system's configuration logs are in a state where the file system cannot be started, or if you are changing a server NID, use the writeconf command to erase all of the file system's configuration logs (including all lctl conf_param settings). After the writeconf command is run, the configuration logs are re-generated as servers restart, and the current server NIDs are used.

   To run the writeconf command:

   1. Unmount all servers and clients.
   2. On the MDT, run:

          mdt> tunefs.lustre --writeconf <mount point>

   3. Remount all servers. You must mount the MDT first.

   Caution: Lustre 1.8 introduces the OST pools feature, which enables a
8. This command must be run on the MDS.

   In this example, there are two OSTs, lustre-OST0000 and lustre-OST0001, which are both active:

       [cfs21:/tmp]# cat /proc/fs/lustre/lov/lustre-mdtlov/target_obd
       0: lustre-OST0000_UUID ACTIVE
       1: lustre-OST0001_UUID ACTIVE

   4.3.6 Mounting a Server Without Lustre Service

   If you are using a combined MGS/MDT, but you only want to start the MGS and not the MDT, run this command:

       mount -t lustre <MDT partition> -o nosvc <mount point>

   The <MDT partition> variable is the combined MGS/MDT. In this example, the combined MGS/MDT is testfs-MDT0000 and the mount point is /mnt/test/mdt:

       mount -t lustre -L testfs-MDT0000 -o nosvc /mnt/test/mdt

   4.3.7 Specifying Failout/Failover Mode for OSTs

   Lustre uses two modes, failout and failover, to handle an OST that has become unreachable because it fails, is taken off the network, is unmounted, etc.

   - In failout mode, Lustre clients immediately receive errors (EIOs) after a timeout, instead of waiting for the OST to recover.
   - In failover mode, Lustre clients wait for the OST to recover.

   By default, the Lustre file system uses failover mode for OSTs. To specify failout mode instead, run this command:

       mkfs.lustre --fsname=<fsname> --ost --mgsnode=<MGS node NID> --param="failover.mode=failout" <block device name>
9. /proc/fs/lustre/ldlm/ldlm/namespaces/<OSC name|MDC name>/lru_size

   The lru_size parameter is used to control the number of client-side locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute nodes vs. backup nodes).

   The total number of locks available is a function of the server's RAM. The default limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is shrunk. The number of locks on the server is limited to {number of OST/MDT on node} * {number of clients} * {client lru_size}.

   - To enable automatic LRU sizing, set the lru_size parameter to 0. In this case, the lru_size parameter shows the current number of locks being used on the export. (In Lustre 1.6.5.1 and later, LRU sizing is enabled by default.)
   - To specify a maximum number of locks, set the lru_size parameter to a value > 0 (former numbers are okay, 100 * CPU_NR). We recommend that you only increase the LRU size on a few login nodes where users access the file system interactively.

   To clear the LRU on a single client, and as a result flush client cache, without changing the lru_size value:

       lctl set_param ldlm.namespaces.<osc_name|mdc_name>.lru_size=clear

   If you shrink the LRU size below the number of existing unused locks, then the unused locks are canceled immediately. Use echo clear to cancel all locks
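The two lock limits quoted above are simple products, and a short sketch makes the orders of magnitude easier to see. The function names and the sample node sizes are hypothetical; only the constants (50 locks per 1 MB of RAM, 100 * CPU_NR) come from the text.

```python
LOCKS_PER_MB = 50  # default limit quoted in the text: 50 locks per 1 MB of RAM


def default_lock_limit(server_ram_mb):
    """Total locks available on a server as a function of its RAM."""
    return server_ram_mb * LOCKS_PER_MB


def static_lru_size(cpu_count):
    """The former fixed client setting mentioned in the text: 100 * CPU_NR."""
    return 100 * cpu_count


def server_lock_bound(targets_on_node, num_clients, client_lru_size):
    """Upper bound: targets on the node * number of clients * client lru_size."""
    return targets_on_node * num_clients * client_lru_size


# Assumed example: a 16 GB server with 2 OSTs, 1000 clients with 4 CPUs each.
print(default_lock_limit(16 * 1024))                   # 819200
print(server_lock_bound(2, 1000, static_lru_size(4)))  # 800000
```

In this assumed configuration the client-driven bound sits just under the RAM-derived limit, which illustrates why raising lru_size on many clients at once deserves care.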
10. 3.1.4 Optional High-Availability Software

    If you plan to enable failover server functionality with Lustre (either on an OSS or MDS), you must add high-availability (HA) software to your cluster software. You can use any HA software package with Lustre. For more information, see Failover.

    Debugging Tools

    Lustre is a complex system and you may encounter problems when using it. You should have debugging tools on hand to help figure out how and why a problem occurred. The e2fsprogs package, available on the Lustre download site, includes the Lustre debugfs tool, which can be used to interactively debug an ext3/ldiskfs file system. The debugfs utility can be used either to check the status of, or to modify information in, the file system.

    There are also several third-party tools you can use, such as GDB (coupled with crash). These tools can be used to investigate live systems and kernel core dumps. There are also useful kernel patches/modules, such as netconsole and netdump, that allow core dumps to be made across the network. For more information, see these websites:

    Third-party Tool    URL
    GDB                 http://www.gnu.org/software/gdb/gdb.html
    crash               http://oss.missioncriticallinux.com/projects/crash/
    netconsole          http://lwn.net/2001/0927/a/netconsole.php3
    netdump             http://www.redhat.com/support/wpapers/redhat/netdump/

    [3] In this manual, the Linux-HA (Heartbeat) package is referenced, but you can use any HA software
11. CHAPTER 22 LustreProc

    This chapter describes Lustre /proc entries and includes the following sections:

    - /proc Entries for Lustre
    - Lustre I/O Tunables
    - Debug Support

    The proc file system acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change certain kernel parameters at runtime (sysctl).

    The Lustre file system provides several proc file system variables that control aspects of Lustre performance and provide information. The proc variables are classified based on the subsystem they affect.

    22.1 /proc Entries for Lustre

    This section describes /proc entries for Lustre.

    22.1.1 Locating Lustre File Systems and Servers

    Use the proc files on the MGS to locate the following:

    - All known file systems:

          cat /proc/fs/lustre/mgs/MGS/filesystems
          spfs
          lustre

    - The server names participating in a file system (for each file system that has at least one server running):

          cat /proc/fs/lustre/mgs/MGS/live/spfs
          fsname: spfs
          flags: 0x0 gen: 7
          spfs-MDT0000
          spfs-OST0000

    All servers are named according to this convention: <fsname>-<MDT|OST><XXXX>. This can be shown for live servers under /proc/fs/lustre/devices:

          cat /proc/fs/lustre/devices
          0 UP mgs MGS MGS 11
          1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
          2 UP mdt MDS MDS_uui
12. How do I configure recoverable/failover object servers?

    There are two object server modes: the default failover (recoverable) mode, and the fail-out mode. In fail-out mode, if a client becomes disconnected from an object server because of a server or network failure, applications which try to use that object server will receive immediate errors. In failover mode, applications attempting to use that resource pause until the connection is restored, which is what most people want. This is the default mode in Lustre 1.4.3 and later.

    To enable failover mode:

    1. If this is an existing Lustre configuration, shut down all client, MDS, and OSS nodes.
    2. Change the configuration script to add --failover to all "ost" lines. Change lines like lmc --add ost to lmc --add ost --failover and regenerate your Lustre configuration file.
    3. Start your object servers. They should report that recovery is enabled to syslog:

           Lustre: 1394:0:(filter.c:1205:filter_common_setup()) databarn-ost3: recovery enabled

    4. Update the MDS and client configuration logs. On the MDS, run:

           lconf --write_conf /path/to/lustre.xml

    5. Start the MDS as usual.
    6. Mount Lustre on the clients.

    How do I resize an MDS/OST file system?

    This is a method to back up the MDS, including the extended attributes containing the striping data. If something goes wrong, you can restore it to a newly-formatted larger file system, with
13. In general, it is wise to specify noauto and let your high-availability (HA) package manage when to mount the device. If you are not using failover, make sure that networking has been started before mounting a Lustre server. RedHat, SuSE, Debian (and perhaps others) use the _netdev flag to ensure that these disks are mounted after the network is up.

    We are mounting by disk label here; the label of a device can be read with e2label. The label of a newly-formatted Lustre server ends in FFFF, meaning that it has yet to be assigned. The assignment takes place when the server is first started, and the disk label is updated.

    Caution: Do not do this when the client and OSS are on the same node, as memory pressure between the client and OSS can lead to deadlocks.

    Caution: Mount-by-label should NOT be used in a multi-path environment.

    4.3.3 Unmounting a Server

    To stop a Lustre server, use the umount <mount point> command. For example, to stop ost0 on mount point /mnt/test, run:

        umount /mnt/test

    Gracefully stopping a server with the umount command preserves the state of the connected clients. The next time the server is started, it waits for clients to reconnect, and then goes through the recovery procedure.

    If the force flag is used, then the server evicts all clients and stops WITHOUT recovery. Upon restart, the server does not wait for recovery. Any currently connecte
14. portal thread min max

    For details, see Changing MDS and OSS Thread Counts.

    Optimizing the Number of Service Threads

    An OSS can have a minimum of 2 service threads and a maximum of 512 service threads. The number of service threads is a function of how much RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus). If the load on the OSS node is high, new service threads will be started in order to process more requests concurrently, up to 4x the initial number of threads (subject to the maximum of 512). For a 2GB 2-CPU system, the default thread count is 32 and the maximum thread count is 128.

    Increasing the size of the thread pool may help when:

    - Several OSTs are exported from a single OSS
    - Back-end storage is running synchronously
    - I/O completions take excessive time

    In such cases, a larger number of I/O threads allows the kernel and storage to aggregate many writes together for more efficient disk I/O. The OSS thread pool is shared; each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB) for internal I/O buffers.

    It is very important to consider memory consumption when increasing the thread pool size. Drives are only able to sustain a certain amount of parallel I/O activity before performance is degraded, due to the high number of seeks and the OST threads just waiting for I/O. In this situation, it may be advisable to decrease the lo
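The thread-count rule above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and reading "1 thread / 128MB * num_cpus" as an integer division of RAM by 128 MB, multiplied by the CPU count, is an interpretation that happens to reproduce the 2GB, 2-CPU figures in the text.

```python
OSS_MIN_THREADS = 2      # minimum stated in the text
OSS_MAX_THREADS = 512    # maximum stated in the text
MB_PER_THREAD_BUFFER = 1.5  # approximate per-thread I/O buffer allocation


def oss_thread_counts(ram_mb, num_cpus):
    """Return (initial, maximum) OSS service thread counts.

    initial: one thread per 128 MB of RAM, times the number of CPUs,
             clamped to the 2..512 range.
    maximum: up to 4x the initial count, capped at 512.
    """
    initial = (ram_mb // 128) * num_cpus
    initial = max(OSS_MIN_THREADS, min(initial, OSS_MAX_THREADS))
    maximum = min(initial * 4, OSS_MAX_THREADS)
    return initial, maximum


def thread_buffer_mb(threads):
    """Approximate memory consumed by the I/O buffers of `threads` threads."""
    return threads * MB_PER_THREAD_BUFFER


print(oss_thread_counts(2048, 2))  # (32, 128)
print(thread_buffer_mb(128))       # 192.0
```

For the 2GB example, growing to the 128-thread maximum would tie up roughly 192 MB in I/O buffers, which is the kind of cost the memory-consumption warning refers to.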
15. root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0

    To re-enable writethrough cache on one OST, run:

        root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1

    To check if writethrough cache is enabled, run:

        root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable

    - readcache_max_filesize controls the maximum size of a file that both the read cache and writethrough cache will try to keep in memory. Files larger than readcache_max_filesize will not be kept in cache for either reads or writes.

    This can be very useful for workloads where relatively small files are repeatedly accessed by many clients, such as job startup files, executables, log files, etc., but large files are read or written only once. By not putting the larger files into the cache, it is much more likely that more of the smaller files will remain in cache for a longer time.

    When setting readcache_max_filesize, the input value can be specified in bytes, or can have a suffix to indicate other binary units such as Kilobytes, Megabytes, Gigabytes, Terabytes, or Petabytes.

    To limit the maximum cached file size to 32MB on all OSTs of an OSS, run:

        root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M

    To disable the maximum cached file size on an OST, run:

        root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1

    To check the current maximum cached file size on al
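To make the size suffixes and the cache cutoff concrete, here is a minimal sketch. It assumes one-letter suffixes (K, M, G, T, P) that denote powers of 1024 and treats a negative limit as "no limit"; both helper names are hypothetical, and this is not how lctl itself parses values.

```python
# Exponent of 1024 for each assumed one-letter binary suffix.
UNIT_EXPONENT = {"K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def parse_size(text):
    """Convert a value such as '32M' or '4096' to a number of bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNIT_EXPONENT:
        return int(text[:-1]) * 1024 ** UNIT_EXPONENT[suffix]
    return int(text)  # plain byte count, including -1


def kept_in_cache(file_size, max_filesize):
    """Files larger than the limit are not cached; a negative limit disables it."""
    return max_filesize < 0 or file_size <= max_filesize


limit = parse_size("32M")
print(limit)                                    # 33554432
print(kept_in_cache(parse_size("4M"), limit))   # True
print(kept_in_cache(parse_size("1G"), limit))   # False
```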
16. Caution: The lctl conf_param command permanently sets parameters in the file system configuration.

    To get current Lustre parameter settings, use the lctl get_param command with this syntax:

        lctl get_param [-n] <obdtype>.<obdname>.<proc_file_name>

    For example:

        lctl get_param -n ost.*.ost_io.timeouts

    To list Lustre parameters that are available to set, use the lctl list_param command with this syntax:

        lctl list_param [-n] <obdtype>.<obdname>

    For example:

        lctl list_param obdfilter.lustre-OST0000

    Network Configuration

    Option                                      Description

    network <up/down>|<tcp/elan/myrinet>
        Starts or stops LNET. Or, select a network type for other lctl LNET commands.

    list_nids
        Prints all NIDs on the local node. LNET must be running.

    which_nid <nidlist>
        From a list of NIDs for a remote node, identifies the NID on which interface communication will occur.

    ping {nid}
        Checks LNET connectivity via an LNET ping. This uses the fabric appropriate to the specified NID.

    interface_list
        Prints the network interface information for a given network type.

    peer_list
        Prints the known peers for a given network type.

    conn_list
        Prints all the connected remote NIDs for a given network type.

    active_tx
        This command prints active transmits. It is only used for the Elan network type.
17. CERROR("Something very bad has happened, and the return code is %d.\n", rc);

    Add messages to aid in call tracing (takes no arguments). When using these macros, cover all exit conditions to avoid confusion when the debug log reports that a function was entered, but never exited.

    Used when tracing MDS and VFS operations for locking. These macros build a thin trace that shows the protocol exchanges between nodes.

    Prints information about the given ptlrpc_request structure.

    Allows insertion of failure points into the Lustre code. This is useful to generate regression tests that can hit a very specific sequence of events. This works in conjunction with "sysctl -w lustre.fail_loc={fail_loc}" to set a specific failure point for which a given OBD_FAIL_CHECK will test.

    Similar to OBD_FAIL_CHECK. Useful to simulate hung, blocked or busy processes or network devices. If the given fail_loc is hit, OBD_FAIL_TIMEOUT waits for the specified number of seconds.

    Similar to OBD_FAIL_CHECK. Useful to have multiple processes execute the same code concurrently to provoke locking races. The first process to hit OBD_RACE sleeps until a second process hits OBD_RACE, then both processes continue.

    A flag set on a lustre.fail_loc breakpoint to cause the OBD_FAIL_CHECK condition to be hit only one time. Otherwise, a fail_loc is permanent until it is cleared with "sysctl -w lustre.fail_loc=0".

    Macro    Description
18. The API that manipulates storage objects. This API is richer than that of block devices and includes the create/delete of storage objects, read/write of buffers from and to certain offsets, set attributes and other storage object metadata.

    A generic concept referring to data containers, similar/identical to file inodes.

    A contiguous, logical extent of a Lustre file written to a single OST.

    The maximum size of a stride, typically 4 MB.

    The number of OSTs holding objects for a RAID0-striped Lustre file.

    The extended attribute associated with a file that describes how its data is distributed over storage objects. See also default stripe pattern.

    An object storage protocol tied to the SCSI transport layer. Lustre does not use T10.

    W

    Wide striping: Strategy of using many OSTs to store stripes of a single file. This obtains maximum bandwidth to a single file through parallel utilization of many OSTs.

    Index

    A
    access control list (ACL)
    ACL, using
    ACLs: examples; Lustre support
    active/active configuration, failover
    adaptive timeouts: configuring; interpreting; introduction
    adding: clients; OSTs
    adding multiple LUNs on a single HBA
    allocating quotas
    backup: device-level; file system-level; file-level
    benchmark: Bonnie++; IOR; IOzone
    bonding: configuring
19. With Lustre 1.4.2 and later, you can abort recovery when starting a service by adding --abort-recovery to the lconf command line. For earlier Lustre versions, or if the service has already started, follow these steps:

    1. Find the correct device. The server console displays a message similar to:

           "RECOVERY: service mds1, 10 recoverable clients, last_transno 1664606"

    2. Obtain a list of all Lustre devices. On the MDS or OST, run:

           lctl device_list

    3. Look for the name of the recovering service, in this case "mds1":

           3 UP mds mds1 mds1_UUID 2

    4. Instruct Lustre to abort recovery; run:

           lctl --device <OST device number> abort_recovery

       The device number is on the left.

    What does "denying connection for new client" mean?

    When service nodes are performing recovery after a failure, only clients which were connected before the failure are allowed to connect. This enables the cluster to first re-establish its pre-failure state, before normal operation continues and new clients are allowed to connect.

    How do I set a default debug level for clients?

    If using zeroconf (mount -t lustre), you can add a line similar to the following to your modules.conf:

        post-install portals sysctl -w lnet.debug=0x3f0400

    This sets the debug level, whenever the portals module is loaded, to whatever value you specify. The value specified above is a good starting choice (and will become the in-code default
20. and a 2 million file working set (of which 400,000 files are cached on the clients):

        file system journal = 400 MB
        1000 * 4-core clients * 100 files/core * 2kB = 800 MB
        16 interactive clients * 10,000 files * 2kB = 320 MB
        1,600,000 file extra working set * 1.5kB/file = 2400 MB

    Thus, the minimum requirement for a system with this configuration is 4 GB RAM. However, additional memory may significantly improve performance.

    If there are directories containing 1 million or more files, you may benefit significantly from having more memory. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance.

    [5] Having more RAM is always prudent, given the relatively low cost of this component compared to the total system cost.

    3.1.7.2 OSS Memory Requirements

    When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system (i.e., journal, service threads, file system metadata, etc.). Also consider the effect of the OSS read cache feature (new in Lustre 1.8), which consumes memory as it caches data on the OSS node.

    - Journal size: By default, each Lustre ldiskfs file system has 400 MB for the journal size. This can pin up to an equal amount of RAM on the OSS node per file system.
    - Service threads: The service threads on the OSS node pre-allocate a 1 MB I/O buffer for each
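The memory estimate above is a sum of four terms, so it is easy to re-run for a different cluster. The sketch below reproduces the worked figures; the function and parameter names are hypothetical, and it follows the example's own rounding of 1 MB to 1000 kB.

```python
KB_PER_MB = 1000  # the worked example rounds 1 MB to 1000 kB


def mds_memory_mb(journal_mb, compute_clients, cores_per_client, files_per_core,
                  interactive_clients, files_per_interactive, kb_per_cached_file,
                  extra_working_set_files, kb_per_extra_file):
    """Return each term of the MDS memory estimate, in MB, plus the total."""
    terms = {
        "journal": journal_mb,
        "compute_clients": (compute_clients * cores_per_client * files_per_core
                            * kb_per_cached_file / KB_PER_MB),
        "interactive_clients": (interactive_clients * files_per_interactive
                                * kb_per_cached_file / KB_PER_MB),
        "extra_working_set": extra_working_set_files * kb_per_extra_file / KB_PER_MB,
    }
    terms["total"] = sum(terms.values())
    return terms


estimate = mds_memory_mb(
    journal_mb=400,
    compute_clients=1000, cores_per_client=4, files_per_core=100,
    interactive_clients=16, files_per_interactive=10_000, kb_per_cached_file=2,
    extra_working_set_files=1_600_000, kb_per_extra_file=1.5,
)
for name, value in estimate.items():
    print(f"{name}: {value:.0f} MB")
```

The four terms come to 400, 800, 320 and 2400 MB, a total of 3920 MB, which is where the 4 GB minimum comes from.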
21. Second OSS node in Lustre file system temp
    Second OST in Lustre file system temp
    block device   /dev/sdd     Block device for the second OSS node (oss2)
    mount point    /mnt/ost2    Mount point for the ost2 block device (/dev/sdd) on the oss2 node

    Client node
    client node    client1      Client in Lustre file system temp
    mount point    /lustre      Mount point for Lustre file system temp on the client1 node

    1. Define the module options for Lustre networking (LNET), by adding this line to the /etc/modprobe.conf file:

           options lnet networks=tcp

    2. Create a combined MGS/MDT file system on the block device. On the MDS node, run:

           root@mds# mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb

       This command generates this output:

           Permanent disk data:
           Target:     temp-MDTffff
           Index:      unassigned
           Lustre FS:  temp
           Mount type: ldiskfs
           Flags:      0x75
                       (MDT MGS needs_index first_time update)
           Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
           Parameters: mdt.group_upcall=/usr/sbin/l_getgroups

           checking for existing Lustre data: not found
           device size = 16MB
           2 6 18
           formatting backing filesystem ldiskfs on /dev/sdb
                   target name  temp-MDTffff
                   4k blocks    0
                   options      -i 4096 -I 512 -q -O dir_index,uninit_groups -F
           mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-MDTffff -i 4096 -I 512 -q -O dir_index,uninit_groups -F /dev/sdb
           Writing CONFIGS/mountdata

    3. Mount the combined MGS/MDT file system on the block device. On the
22.                        read                  write
    rpcs in flight    rpcs   % cum %   |   rpcs   % cum %
    0:                   0   0     0   |      0   0     0

                           read                  write
    offset            rpcs   % cum %   |   rpcs   % cum %
    0:                   0   0     0   |      0   0     0

    - RPCs in flight: This represents the number of RPCs that are issued by the OSC but are not complete at the time of the snapshot. It should always be less than or equal to max_rpcs_in_flight.
    - pending {read,write} pages: These fields show the number of pages that have been queued for I/O in the OSC.
    - other RPCs in flight when a new RPC is sent: When an RPC is sent, it records the number of other RPCs that were pending in this table. When the first RPC is sent, the 0: row will be incremented. If the first RPC is sent while another is pending, the 1: row will be incremented, and so on. The number of RPCs that are pending as each RPC completes is not tabulated. This table is a good way of visualizing the concurrency of the RPC stream. Ideally, you will see a large clump around the max_rpcs_in_flight value, which shows that the network is being kept busy.
    - pages in each RPC: As an RPC is sent, the number of pages it is made of is recorded in order in this table. A single-page RPC increments the 0: row, 128 pages the 7: row, and so on.

    These histograms can be cleared by writing any value into the rpc_stats file.

    Client Read-Write Offset Survey

    The offset_stats parameter maintains statistics for occurrences where a series of read o
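The two data points given for the "pages in each RPC" table (one page goes to row 0, 128 pages to row 7) are consistent with rows indexed by the base-2 logarithm of the page count. The sketch below assumes exactly that, rounding down for counts that are not powers of two; the rounding choice and the function names are assumptions, not something the text states.

```python
from collections import Counter


def pages_row(pages):
    """Row index for an RPC of `pages` pages, assuming row = floor(log2(pages))."""
    if pages < 1:
        raise ValueError("an RPC carries at least one page")
    return pages.bit_length() - 1


def pages_histogram(rpc_page_counts):
    """Tally RPCs per row, the way the rpc_stats table accumulates them."""
    return Counter(pages_row(p) for p in rpc_page_counts)


print(pages_row(1))    # 0
print(pages_row(128))  # 7
print(sorted(pages_histogram([1, 1, 128, 256, 128]).items()))
# [(0, 2), (7, 2), (8, 1)]
```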
23. --dev /dev/sdb

    Now for the LOVs:

        lmc -m test.xml --add lov --lov foo-lov --mds foo-mds --stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0
        lmc -m test.xml --add lov --lov bar-lov --mds bar-mds --stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0

    Each LOV needs at least one OST:

        lmc -m test.xml --add ost --node ost_server --lov foo-lov --ost foo-ost1 --group foo-ost1 --fstype ldiskfs --dev /dev/sdc
        lmc -m test.xml --add ost --node ost_server --lov bar-lov --ost bar-ost1 --group bar-ost1 --fstype ldiskfs --dev /dev/sdd

    Set up the client mount points:

        lmc -m test.xml --add mtpt --node foo-client --path /mnt/foo --mds foo-mds --lov foo-lov
        lmc -m test.xml --add mtpt --node bar-client --path /mnt/bar --mds bar-mds --lov bar-lov

    If the Lustre file system foo already exists and you want to add the file system bar without reformatting foo, use the group designator to reformat only the new disks:

        ost_server> lconf --group bar-ost1 --select bar-ost1 --reformat test.xml
        mds_server> lconf --group bar-mds --select bar-mds --reformat test.xml

    If you change the --dev that foo-mds uses, you also need to commit that new configuration (foo-mds must not be running):

        mds_server> lconf --group foo-mds --select foo-mds --write_conf test.xml

    Note: If you want both mount points on a client, you can use the same client node name for both mount points.
24. #include <lustre/liblustreapi.h>
    #include <lustre/lustre_user.h>

    int llapi_quotactl(char *mnt, struct if_quotactl *qctl)

    struct if_quotactl {
            __u32                   qc_cmd;
            __u32                   qc_type;
            __u32                   qc_id;
            __u32                   qc_stat;
            struct obd_dqinfo       qc_dqinfo;
            struct obd_dqblk        qc_dqblk;
            char                    obd_type[16];
            struct obd_uuid         obd_uuid;
    };

    struct obd_dqblk {
            __u64 dqb_bhardlimit;
            __u64 dqb_bsoftlimit;
            __u64 dqb_curspace;
            __u64 dqb_ihardlimit;
            __u64 dqb_isoftlimit;
            __u64 dqb_curinodes;
            __u64 dqb_btime;
            __u64 dqb_itime;
            __u32 dqb_valid;
            __u32 padding;
    };

    struct obd_dqinfo {
            __u64 dqi_bgrace;
            __u64 dqi_igrace;
            __u32 dqi_flags;
            __u32 dqi_valid;
    };

    struct obd_uuid {
            char uuid[40];
    };

    Description

    The llapi_quotactl() command manipulates disk quotas on a Lustre file system mount. qc_cmd indicates a command to be applied to UID qc_id or GID qc_id.

    Option                 Description

    LUSTRE_Q_QUOTAON
        Turns on quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). The quota files must exist. They are normally created with the llapi_quotacheck(3) call. This call is restricted to the super user privilege.

    LUSTRE_Q_QUOTAOFF
        Turns off quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). This call is restricted to the super user privilege.

    LUSTRE_Q_GETQUOTA
        Gets disk quota limits and current usage for user or grou
25. s recovery period. No action is needed.

Tips for Using VBR

VBR will be successful for clients that do not share data with other clients. Therefore, the strategy for reliable use of VBR is to store a client's data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.

Part III Lustre Tuning, Monitoring and Troubleshooting

This part includes chapters describing how to tune, debug and troubleshoot Lustre.

CHAPTER 20 Lustre Tuning

This chapter contains information to tune Lustre for better performance and includes the following sections:
- Module Options
- LNET Tunables
- Options to Format MDT and OST File Systems
- Network Tuning
- DDN Tuning
- Large-Scale Tuning for Cray XT and Equivalents

20.1 Module Options

Many options in Lustre are set by means of kernel module parameters. These parameters are contained in the modprobe.conf file. (On SuSE, this may be modprobe.conf.local.)

20.1.1 OSS Service Thread Count

The oss_num_threads parameter enables the number of OST service threads to be specified at module load time on the OSS nodes:

    options ost oss_num_threads={N}

Tip: After startup, the minimum and maximum number of OSS thread counts can be set via the {min,max}_thread_count tunable. To change the tunable at runtime, use lctl {get,set,conf}_param service
26. the file starts on the specified OST index (starting at zero, 0).

stripe-count: Stripe count is how many OSTs to use. The default stripe count is 1, and passing a stripe count of 0 causes the default stripe count to be used. A stripe count of -1 means that all available OSTs should be used.

Note: If you pass a starting-ost of 0 and a stripe count of 1, all files are written to OST 0 until space is exhausted. This is probably not what you meant to do. If you only want to adjust the stripe count and keep the other parameters at their default settings, do not specify any of the other parameters:

    lfs setstripe -c <stripe-count> <file>

Changing Striping for a Subdirectory

In a directory, the lfs setstripe command sets a default striping configuration for files created in the directory. The usage is the same as lfs setstripe for a regular file, except that the directory must exist prior to setting the default striping configuration. If a file is created in a directory with a default stripe configuration (without otherwise specifying striping), Lustre uses those striping parameters instead of the file system default for the new file.

To change the striping pattern (file layout) for a sub-directory, create a directory with the desired file layout as described above. Sub-directories inherit the file layout of the root/parent directory.

Note: Striping of ne
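The stripe-count rules in this excerpt can be sketched as a small resolver, assuming the special values are 0 (use the default stripe count) and -1 (use all available OSTs). This is an illustration only: the function name `effective_stripe_count` and its parameters are hypothetical and are not part of lfs or any Lustre API.

```python
def effective_stripe_count(requested, default_count=1, available_osts=1):
    """Resolve a requested stripe count to the number of OSTs used.

    Hypothetical helper (not a Lustre API):
      0  -> use the default stripe count
      -1 -> use all available OSTs
      n  -> use n OSTs
    """
    if requested == 0:
        return default_count
    if requested == -1:
        return available_osts
    if requested < -1:
        raise ValueError("stripe count must be -1, 0, or a positive integer")
    return requested


if __name__ == "__main__":
    # Example: a file system with 8 OSTs and a default stripe count of 1.
    for req in (0, -1, 4):
        print(req, "->", effective_stripe_count(req, default_count=1, available_osts=8))
```

The sketch deliberately does not model the starting-OST parameter; it only shows how the two special stripe-count values differ from an explicit count.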
27. CHAPTER 2 Understanding Lustre Networking

This chapter describes Lustre Networking (LNET) and supported networks, and includes the following sections:
- Introduction to LNET
- Supported Network Types
- Designing Your Lustre Network
- Configuring LNET

2.1 Introduction to LNET

In a Lustre network, servers and clients communicate with one another using LNET, a custom networking API which abstracts away all transport-specific interaction. In turn, LNET operates with a variety of network transports through Lustre Network Drivers (LNDs).

The following terms are important to understanding LNET:
- LND: Lustre Network Driver. A modular sub-component of LNET that implements one of the network types. LNDs are implemented as individual kernel modules (or a library in userspace) and, typically, must be compiled against the network driver software.
- Network: A group of nodes that communicate directly with each other. The network is how LNET represents a single cluster. Multiple networks can be used to connect clusters together. Each network has a unique type and number (for example, tcp0, tcp1, or elan0).
- NID: Lustre Network Identifier. The NID uniquely identifies a Lustre network endpoint, including the node and the network type. There is an NID for every network which a node uses.

Key features of LNET include:
- RDMA, when supported by underlying networks such as Elan
28. 12.2 Requirements

The most basic requirement for successful bonding is that both endpoints of the connection must support bonding. In a normal case, the non-server endpoint is a switch. (Two systems connected via crossover cables can also use bonding.) Any switch used must explicitly support 802.3ad Dynamic Link Aggregation.

The kernel must also support bonding. All supported Lustre kernels have bonding functionality. The network driver for the interfaces to be bonded must have ethtool support, which is needed to determine slave speed and duplex settings. All recent network drivers implement it.

To verify that your interface supports ethtool, run:

    # which ethtool
    /sbin/ethtool
    # ethtool eth0
    Settings for eth0:
            Supported ports: [ TP MII ]
            Supported link modes:   10baseT/Half 10baseT/Full
                                    100baseT/Half 100baseT/Full
            Supports auto-negotiation: Yes
            Advertised link modes:  10baseT/Half 10baseT/Full
                                    100baseT/Half 100baseT/Full
            Advertised auto-negotiation: Yes
            Speed: 100Mb/s
            Duplex: Full
            Port: MII
            PHYAD: 1
            Transceiver: internal
            Auto-negotiation: on
            Supports Wake-on: pumbg
            Wake-on: d
            Current message level: 0x00000001 (1)
            Link detected: yes
    # ethtool eth1
    Settings for eth1:
            Supported ports: [ TP MII ]
            Supported link modes:   10baseT/Half 10baseT/Full
                                    100baseT/Half 100baseT/Full
            Supports auto-negotiation: Yes
            Advertised link modes:  10baseT/Half 10bas
29. 4. ldiskfs is the Sun development version of ext4.

3.1.6 Environmental Requirements

Make sure the following environmental requirements are met before installing Lustre:

- (Recommended) Provide remote shell access to clients. Although not strictly required to run Lustre, we recommend that all cluster nodes have remote shell client access, to facilitate the use of Lustre configuration and monitoring scripts. Parallel Distributed SHell (pdsh) is preferable, although Secure SHell (SSH) is acceptable.
- Ensure client clocks are synchronized. Lustre uses client clocks for timestamps. If clocks are out of sync between clients and servers, timeouts and client evictions will occur. Drifting clocks can also cause problems by, for example, making it difficult to debug multi-node issues or correlate logs, which depend on timestamps. We recommend that you use Network Time Protocol (NTP) to keep client and server clocks in sync with each other. For more information about NTP, see http://www.ntp.org.
- Maintain uniform file access permissions on all cluster nodes. Use the same user IDs (UID) and group IDs (GID) on all clients. If use of supplemental groups is required, verify that the group_upcall requirements have been met. See User/Group Cache Upcall.
- (Recommended) Disable Security-Enhanced Linux (SELinux) on servers and clients. Lustre does not support SELinux. Therefore, disable the SELinux system extensio
31. Elan to TCP Routing

Servers megan and oscar are on the Elan network with eip addresses 132.6.1.2 and 132.6.1.4. Megan is also on the TCP network at 192.168.0.2 and routes between TCP and Elan. There is also a standalone router, router1, at Elan 132.6.1.10 and TCP 192.168.0.10. Clients are on either Elan or TCP.

Modprobe.conf

modprobe.conf is identical on all nodes. Run:

    options lnet 'ip2nets="tcp0 192.168.0.*; elan0 132.6.1.*"' 'routes="tcp [2,10]@elan0; elan 192.168.0.[2,10]@tcp0"'

Start servers

To start router1, run:

    modprobe lnet
    lctl network configure

To start megan and oscar, run:

    mkfs.lustre --fsname spfs --mdt --mgs /dev/sda
    mkdir -p /mnt/test/mdt
    mount -t lustre /dev/sda /mnt/test/mdt
    mount -t lustre mgs16@tcp0,1@elan:/testfs /mnt/testfs

Start clients

For the TCP client, run:

    mount -t lustre megan:/mdsA/client /mnt/lustre

For the Elan client, run:

    mount -t lustre 2@elan0:/mdsA/client /mnt/lustre

7.3 Load Balancing with InfiniBand

There is one OSS with two InfiniBand HCAs. Lustre clients have only one InfiniBand HCA using native Lustre drivers of o2iblnd. Load balancing is done on both HCAs on the OSS with the help of LNET.

7.3.1 Modprobe.conf

Lustre users have options available on the following networks:
- Dual HCA OSS server:
    options lnet ip2nets="o2ib0(ib0),o2ib1(ib1) 192.168.10.1.[101-102]"
- Client with the odd IP address:
    options ln
32. If your network is not using SOCKLND or InfiniBand (and uses Quadrics Elan or Myrinet, for example), configure /etc/lustre/nid2hostname (a simple script that translates a NID to a hostname) on each server node (MDT and OST). This is an example on an Elan cluster:

    #!/bin/bash
    set -x
    exec 2>/tmp/$(basename $0).debug

    # convert a NID for a LND to a hostname, for GSS for example
    # called with three arguments: lnd netid nid
    #   $lnd will be string "QSWLND", "GMLND", etc.
    #   $netid will be number in hex string format, like "0x16", etc.
    #   $nid has the same format as $netid
    # output the corresponding hostname, or error message leaded by a '@'
    # for error logging.

    lnd=$1
    netid=$2
    nid=$3

    # uppercase the hex
    nid=$(echo $nid | tr '[abcdef]' '[ABCDEF]')
    # and convert to decimal
    nid=$(echo -e "ibase=16\n${nid/#0x}" | bc)

    case $lnd in
        QSWLND) # simply stick "mtn" on the front
            echo "mtn$nid"
            ;;
        *)  echo "@unknown LND: $lnd"
            ;;
    esac

11.2.1.6 Building Lustre

If you are compiling the kernel from the source, enable GSS during configuration:

    ./configure --with-linux=path_to_linux_source --enable-gss --other-options

When you enable Lustre with GSS, the configuration script checks all dependencies, like Kerberos and libgssapi installation, and in-kernel SUNRPC-related facilities. When you install lustre-xxx.rpm on target machines, RPM again checks for dependencies
33. Myrinet and InfiniBand
- Support for many commonly-used network types such as InfiniBand and TCP/IP
- High availability and recovery features enabling transparent recovery in conjunction with failover servers
- Simultaneous availability of multiple network types with routing between them

LNET is designed for complex topologies, superior routing capabilities and simplified configuration.

2.2 Supported Network Types

LNET supports the following network types:
- TCP
- openib (Mellanox-Gold InfiniBand)
- cib (Cisco Topspin)
- iib (Infinicon InfiniBand)
- vib (Voltaire InfiniBand)
- o2ib (OFED: InfiniBand and iWARP)
- ra (RapidArray)
- Elan (Quadrics Elan)
- GM and MX (Myrinet)
- Cray Seastar

2.3 Designing Your Lustre Network

Before you configure Lustre, it is essential to have a clear understanding of the Lustre network topologies.

2.3.1 Identify All Lustre Networks

A network is a group of nodes that communicate directly with one another. As previously mentioned in this manual, Lustre supports a variety of network types and hardware, including TCP/IP, Elan, varieties of InfiniBand, Myrinet and others. The normal rules for specifying networks apply to Lustre networks. For example, two TCP networks on two different subnets (tcp0 and tcp1) would be considered two different Lustre networks.

2.3.2 Identify Nodes to Route Between Networks

Any node with appropriate interfaces ca
34. Note: If the OST later becomes available, it needs to be reactivated. Run:

    lctl --device <OST device name or number> activate

3. Determine all the files that are striped over the missing OST. Run:

    lfs find -R -o {OST_UUID} /mountpoint

This returns a simple list of filenames from the affected file system.

4. If necessary, you can read the valid parts of a striped file. Run:

    dd if=filename of=new_filename bs=4k conv=sync,noerror

5. You can delete these files with the unlink or munlink command:

    unlink|munlink filename {filename ...}

Note: There is no functional difference between the unlink and munlink commands. The unlink command is for newer Linux distributions. You can run munlink if unlink is not available. When you run the unlink or munlink command, the file on the MDS is permanently removed.

6. If you need to know specifically which parts of the file are missing data, then you first need to determine the file layout (striping pattern), which includes the index of the missing OST. Run:

    lfs getstripe -v {filename}

7. Use this computation to determine which offsets in the file are affected:

    [(C*N + X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ...}

where:
    C = stripe count
    S = stripe size
    X = index of bad OST for this file

For example, for a 2-stripe file with stripe size = 1M, where the bad OST is at index 0, you have holes in the file at:

    [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ...}

If
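The offset computation in step 7 can be checked with a short script. This is a minimal sketch, assuming the rule reads [(C*N + X)*S, (C*N + X)*S + S - 1] for N = 0, 1, 2, ... with C the stripe count, S the stripe size in bytes and X the index of the bad OST within the file's layout. The function name `missing_ranges` and the `file_size` cutoff are illustrative additions, not part of any Lustre tool.

```python
def missing_ranges(stripe_count, stripe_size, bad_index, file_size):
    """Return inclusive (start, end) byte ranges that lived on the bad OST.

    Applies [(C*N + X)*S, (C*N + X)*S + S - 1] for N = 0, 1, 2, ...
    and stops at file_size (a hypothetical cutoff added for this sketch).
    """
    if stripe_count <= 0 or stripe_size <= 0:
        raise ValueError("stripe_count and stripe_size must be positive")
    if not 0 <= bad_index < stripe_count:
        raise ValueError("bad_index must be in [0, stripe_count)")

    ranges = []
    n = 0
    while True:
        start = (stripe_count * n + bad_index) * stripe_size
        if start >= file_size:
            break
        end = min(start + stripe_size - 1, file_size - 1)
        ranges.append((start, end))
        n += 1
    return ranges


if __name__ == "__main__":
    MiB = 1 << 20
    # 2-stripe file, 1 MiB stripes, bad OST at index 0, 5 MiB file.
    for start, end in missing_ranges(2, MiB, 0, 5 * MiB):
        print(start, end)
```

For the 2-stripe example, every other 1 MiB chunk, starting at offset 0, falls on the bad OST.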
35. The interval of the statistics (in seconds).

    # lst run bulkperf
    # lst stat clients
    [LNet Rates of clients]
    [W] Avg: 1108 RPC/s Min: 1060 RPC/s Max: 1155 RPC/s
    [R] Avg: 2215 RPC/s Min: 2121 RPC/s Max: 2310 RPC/s
    [LNet Bandwidth of clients]
    [W] Avg: 16.60 MB/s Min: 16.10 MB/s Max: 17.1 MB/s
    [R] Avg: 40.49 MB/s Min: 40.30 MB/s Max: 40.68 MB/s

2. In the future, more statistics will be supported.

show_error [--session] <GROUP> | <NIDs> ...

Lists the number of failed RPCs on test nodes.

--session: Lists errors in the current test session. With this option, historical RPC errors are not listed.

    # lst show_error clients
    clients
    12345-192.168.1.15@tcp: [Session: 1 brw errors, 0 ping errors] [RPC: 20 errors, 0 dropped]
    12345-192.168.1.16@tcp: [Session: 0 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped]
    Total 2 error nodes in clients
    # lst show_error --session clients
    clients
    12345-192.168.1.15@tcp: [Session: 1 brw errors, 0 ping errors]
    Total 1 error nodes in clients

CHAPTER 19 Lustre Recovery

This chapter describes how to recover Lustre and includes the following sections:
- Recovering Lustre
- Types of Failure
- Version-based Recovery

Lustre offers substantial recovery support to deal with node or network failure, and returns the cluster to a reliable, functional state. When Lustre is in recovery
36. brw WRITE size=16K

    // add brw (WRITE, 16 KB) test to batch bulkperf, the test will run in 4 workitem, each
    // 192.168.1.[10-13] will write to 192.168.10.[100,101]
    // 192.168.1.[14-17] will write to 192.168.10.[102,103]

list_batch [NAME] [--test INDEX] [--active] [--invalid] [--server]

Lists batches in the current session, or lists client|server nodes in a batch or a test.

--test INDEX: Lists tests in a batch. If no option is used, all tests in the batch are listed. If the option is used, only specified tests in the batch are listed.

    # lst list_batch
    bulkperf
    # lst list_batch bulkperf
    Batch: bulkperf Tests: 1 State: Idle
            ACTIVE  BUSY  DOWN  UNKNOWN  TOTAL
    client  8       0     0     0        8
    server  4       0     0     0        4
    Test 1(brw) (loop: 100, concurrency: 4)
            ACTIVE  BUSY  DOWN  UNKNOWN  TOTAL
    client  8       0     0     0        8
    server  4       0     0     0        4
    # lst list_batch bulkperf --server --active
    192.168.10.100@tcp Active
    192.168.10.101@tcp Active
    192.168.10.102@tcp Active
    192.168.10.103@tcp Active

run NAME

Runs the batch.

    # lst run bulkperf

stop NAME

Stops the batch.

    # lst stop bulkperf

query NAME [--test INDEX] [--timeout #] [--loop #] [--delay #] [--all]

Queries the batch status.

--test INDEX: Only queries the specified test. The test INDEX starts from 1.
--timeout: The timeout value to wait for RPC. The default is 5 seconds.
--loop: The loop count of the query
37. --delay: The interval of each query. The default is 5 seconds.
--all: The list status of all nodes in a batch or a test.

    # lst run bulkperf
    # lst query bulkperf --loop 5 --delay 3
    Batch is running
    Batch is running
    Batch is running
    Batch is running
    Batch is running
    # lst query bulkperf --all
    192.168.1.10@tcp Running
    192.168.1.11@tcp Running
    192.168.1.12@tcp Running
    192.168.1.13@tcp Running
    192.168.1.14@tcp Running
    192.168.1.15@tcp Running
    192.168.1.16@tcp Running
    192.168.1.17@tcp Running
    # lst stop bulkperf
    # lst query bulkperf
    Batch is idle

18.4.2.4 Other Commands

This section lists other lst commands.

ping [--session] [--group NAME] [--nodes NIDs] [--batch NAME] [--server] [--timeout #]

Sends a "hello" query to the nodes.

--session: Pings all nodes in the current session.
--group NAME: Pings all nodes in a specified group.
--nodes NIDs: Pings all specified nodes.
--batch NAME: Pings all client nodes in a batch.
--server: Sends RPC to all server nodes instead of client nodes. This option is only used with --batch NAME.
--timeout: The RPC timeout value.

    # lst ping 192.168.10.[15-20]@tcp
    192.168.1.15@tcp Active [session: liang id: 192.168.1.3@tcp]
    192.168.1.16@tcp Active [session: liang id: 192.168.1.3@tcp]
    192.168.1.17@tcp Active [session: lian
38. node in the test cluster. All self-test commands are entered from the console node. From the console node, a user can control and monitor the status of the entire test cluster (session). The console node is exclusive, meaning that a user cannot control two different sessions (LNET self-test clusters) on one node.

Group

A user can only control nodes in his/her session. To allocate nodes to the session, the user needs to add nodes to a group (of the session). All nodes in a group can be referenced by the group's name. A node can be allocated to multiple groups of a session.

Note: A console user can associate kernel space test nodes with the session by running lst add_group NIDs, but a userspace test node cannot be actively added to the session. However, the console user can passively "accept" a test node to associate with the test session while the test node running lstclient connects to the console node, i.e.: lstclient --sesid CONSOLE_NID --group NAME.

Test

A test is a configuration of a test case, which defines individual point-to-point network conversations, all running in parallel. A user can specify test properties such as RDMA operation type, source group, target group, distribution of test nodes, concurrency of test, etc.

Batch

A test batch is a named collection of tests. All tests in a batch run in parallel. Each test should belong to a batc
39. /proc/fs/lustre/mds/*/max_atime_diff. Lustre considers the latest atime from all OSTs. If a setattr is set by user, then it is updated on both the MDS and OST, allowing the atime to go backward.

File status was last changed N*24 hours ago.

File status was last modified N*24 hours ago.

File has an object on a specific OST(s).

Option: Description

--size: File has a size in bytes, or kilo-, Mega-, Giga-, Tera-, Peta- or Exabytes if a suffix is given.
--type: File has a type (block, character, directory, pipe, file, symlink, socket or Door [for Solaris]).
--uid: File has a specific numeric user ID.
--user: File is owned by a specific user (numeric user ID is allowed).
--gid: File has a specific group ID.
--group: File belongs to a specific group (numeric group ID allowed).

osts: Lists all OSTs for the file system.

getstripe: Lists the striping information for a given filename or files in a directory, optionally recursive, for all files in a directory tree.
--quiet: Does not print object IDs.
--verbose: Prints striping parameters.
--recursive: Recurses into sub-directories.

setstripe: Creates a new file or sets the directory default with specific striping parameters.
--size stripe-size: Number of bytes to store on an OST before moving to the next OST. A stripe size of 0 uses the file system's default stripe size, 1MB. Can be specifi
40. required for liblustre clients to allow connections on non-privileged ports.
- none: Do not run the acceptor.

accept_port (988): Port number on which the acceptor should listen for connection requests. All nodes in a site configuration that require an acceptor must use the same port.

accept_backlog (127): Maximum length that the queue of pending connections may grow to (see listen(2)).

accept_timeout (5, W): Maximum time in seconds the acceptor is allowed to block while communicating with a peer.

accept_proto_version: Version of the acceptor protocol that should be used by outgoing connection requests. It defaults to the most recent acceptor protocol version, but it may be set to the previous version to allow the node to initiate connections with nodes that only understand that version of the acceptor protocol. The acceptor can, with some restrictions, handle either version (that is, it can accept connections from both 'old' and 'new' peers). For the current version of the acceptor protocol (version 1), the acceptor is compatible with old peers if it is only required by a single local network.

30.2.2 SOCKLND Kernel TCP/IP LND

The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers. It supports multiple instances and load balances dynamically over multiple interfaces. If no inte
41. -s stripe_size] [--offset|-o start_ost] [--count|-c stripe_count] [--pool|-p pool_name] <dir|filename>

Note: The --pool option for lfs setstripe is compatible with other modifiers. For example, you can set striping on a directory to use an explicit starting index.

To list pools in a named file system:

    lfs pool_list <fsname>[.<poolname>] | <pathname>

To list OSTs in a named pool:

    lfs pool_list <fsname>.<poolname>

24.5.2 Tips for Using OST Pools

Here are several suggestions for using OST pools:
- A directory or file can be given an extended attribute (EA) that restricts striping to a pool.
- Pools can be used to group OSTs with the same technology or performance (slower or faster), or that are preferred for certain jobs. Examples are SATA OSTs versus SAS OSTs or remote OSTs versus local OSTs.
- A file created in an OST pool tracks the pool by keeping the pool name in the file LOV EA.

24.6 Performing Direct I/O

Starting with 1.4.7, Lustre supports the O_DIRECT flag to open. Applications using the read() and write() calls must supply buffers aligned on a page boundary (usually 4 K). If the alignment is not correct, the call returns -EINVAL. Direct I/O may help performance in cases where the client is doing a large amount of I/O and is CPU-bound (CPU utilization 100%).

24.6.1 Making File System Objects Immutable

An immutable fi
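The page-alignment requirement for direct I/O buffers can be illustrated with a short sketch. It relies on the general fact that an anonymous memory map starts on a page boundary; the helper names `round_up_to_page` and `aligned_buffer` are hypothetical, and the sketch only prepares a buffer, it does not open a file with O_DIRECT.

```python
import mmap

PAGE = mmap.PAGESIZE  # page size of the running system, often 4096 bytes


def round_up_to_page(nbytes, page=PAGE):
    """Round a byte count up to the next multiple of the page size."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    return ((nbytes + page - 1) // page) * page


def aligned_buffer(nbytes):
    """Return an anonymous mmap whose length is a whole number of pages.

    Anonymous maps are allocated on a page boundary, so the buffer is
    suitable for passing to readinto()/write() on a descriptor that was
    opened with O_DIRECT (that open call is not shown in this sketch).
    """
    return mmap.mmap(-1, round_up_to_page(nbytes))


if __name__ == "__main__":
    buf = aligned_buffer(10000)
    print(len(buf), len(buf) % PAGE)
    buf.close()
```

Rounding the length up matters as much as aligning the start: a transfer length that is not a multiple of the required block size is another common cause of EINVAL with direct I/O.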
42. see Performing a Complete File System Upgrade.

Note: If the Lustre component to be upgraded is an OSS in a failover pair, follow these special upgrade steps to minimize downtime:
1. Fail over the server to its peer server, so the file system remains available.
2. Install the Lustre 1.8.x packages on the idle server.
3. Unload the old Lustre modules on the idle server by either:
   - Rebooting the node, OR
   - Removing the Lustre modules manually by running the lustre_rmmod command several times and checking the currently loaded modules with the lsmod command.
4. Fail back services to the idle (now upgraded) server.
5. Repeat Steps 1 to 4 on the peer server.
This limits the outage (per OSS) to a single server for as long as it takes to fail over.

1. Make a complete, restorable file system backup before upgrading Lustre.

2. Install the 1.8.x packages on the Lustre component (server or client). For help determining where to install a specific package, see TABLE 3-1 (Lustre packages, descriptions and installation guidance).

a. Install the kernel, modules and ldiskfs packages. For example:

    rpm -ivh kernel-lustre-smp-<ver> kernel-ib-<ver> lustre-modules-<ver> lustre-ldiskfs-<ver>

b. Upgrade the utilities/userspace packages. For example:

    rpm -Uvh lustre-<ver>

c. If a new e2fsprogs package is available, upgrade it. For example:

    rpm -Uvh e2
43. size differently on the MDT (which does small, random I/O) and on the OST (which does large, contiguous I/O). In customer testing, we have found the optimal values to be 64 KB for the MDT and 1 MB for the OST.

Note: The cache size parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.

These are CLI commands for the DDN:
- For the MDT LUN: cache size=64 (size is in KB: 64, 128, 256, 512, 1024 and 2048; default 128)
- For the OST LUN: cache size=1024

20.5.3 Setting Write-Back Cache

Performance is noticeably improved by running Lustre with write-back cache turned on. However, there is a risk that when the DDN controller crashes, you need to run e2fsck. Still, it takes less time than the performance hit from running with the write-back cache turned off.

For increased data security and in failover configurations, you may prefer to run with write-back cache off. However, you might experience performance problems with the small writes during journal flush. In this mode, it is highly beneficial to increase the number of OST service threads (options ost ost_num_threads=512 in /etc/modprobe.conf). The OST should have enough RAM (about 1.5 MB/thread is preallocated for I/O buffers). Having more I/O threads allows you to have more I/O requests in flight, waiting for the disk to complete the synchronous write.

You have to decide whether performance is mor
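As a rough worked example of the RAM guidance above, the preallocated buffer memory scales linearly with the thread count. The figure of about 1.5 MB per thread is taken from the text; the function name `preallocated_buffer_mb` is hypothetical, and the result is an estimate, not a measured value.

```python
MB_PER_THREAD = 1.5  # approximate I/O buffer preallocation per OST thread


def preallocated_buffer_mb(num_threads, mb_per_thread=MB_PER_THREAD):
    """Estimate RAM (in MB) preallocated for I/O buffers by OST threads."""
    if num_threads < 0:
        raise ValueError("num_threads must be non-negative")
    return num_threads * mb_per_thread


if __name__ == "__main__":
    # 512 service threads, as in the modprobe.conf example.
    print(preallocated_buffer_mb(512), "MB")
```

With 512 threads, the estimate comes to 768 MB of buffer memory on the OSS, before counting any other memory use.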
44. sourceforge.net/projects/powerman. For more information on PowerMan, go to https://computing.llnl.gov/linux/powerman.html.

Power Equipment

A multi-port, Ethernet-addressable RPC is relatively inexpensive. For recommended products, refer to the list of supported hardware on the PowerMan site. Linux Network Iceboxes are also very good tools. They combine the remote power control and the remote serial console into a single unit.

8.1.3 Heartbeat

The Heartbeat package is one of the core components of the Linux-HA project. Heartbeat is highly portable and runs on every known Linux platform, as well as FreeBSD and Solaris. For more information, see http://linux-ha.org/HeartbeatProgram. To download Linux-HA, go to http://linux-ha.org/download.

Lustre supports both Heartbeat V1 and Heartbeat V2. V1 has a simpler configuration and works very well. V2 adds monitoring and supports more complex cluster topologies. For additional information, we recommend that you refer to the Linux-HA website.

8.1.4 Connection Handling During Failover

A connection is alive when it is active and in operation. When a connection request is sent, a connection is not established until either a reply arrives or a connection disconnects or fails. If there is no traffic on a given connection, periodically check the connection to ensure its status. If an active connection disconnects, it leads to at least one timeout reques
45. such as running as both a client and an OST.

Caution: Lustre contains kernel modifications which interact with storage devices and may introduce security issues and data loss if not installed, configured or administered properly. Before installing Lustre, back up ALL data.

1. Verify that all Lustre installation requirements have been met. For more information on these prerequisites, see Preparing to Install Lustre.

2. Download the Lustre RPMs/tarballs.

a. Navigate to the Lustre download site and select your platform. The files required to install Lustre (kernels, modules and utilities RPMs) are listed for the selected platform.

b. Download the required files, using either the Sun Download Manager (SDM) or downloading the files individually.

Tip: For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, for best performance in a production environment, dedicated clients are always best. Performance and other issues can occur when an MDS or OSS and a client are running on the same machine. The MDS and MGS can run on the same machine.

6. Running the MDS and a client on the same machine can cause recovery and deadlock issues, and the performance of other Lustre clients to suffer. Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. The client consumes all of the memory and tries to flush pages to disk. The
46. that relates to the service itself (e.g. ost1a, ost1b). In the lmc configuration script, put each OST into a separate group (use lmc --add ost --group <name>). When starting up each OST, use lconf --group <name> (with --reformat, --cleanup, etc.) foo.xml to start up each one individually. Unless a group is specified, all of the services on that node will be affected by the command.

Beginning with Lustre 1.4.4, managing individual services has been substantially simplified. The group/select mechanics are gone, and you can operate purely on the basis of service names:

    lconf --service <service> [--reformat|--cleanup] foo.xml

For example, if you add the service ost1-home, type:

    lmc --add ost --ost ost1-home

You can start it with:

    lconf --service ost1-home foo.xml

As before, if you do not specify a service, all services configured for that node will be affected by your command.

What extra resources are required for automated failover?

To automate failover with Lustre, you need power management software, remote control power equipment, and cluster management software.

Power Management Software

PowerMan, by the Lawrence Livermore National Laboratory, is a tool that manipulates remote power control (RPC) devices from a central location. PowerMan natively supports several RPC varieties. Expect-like configurability simplifies the addition of new devices. For mo
47. -u: Rebuilds the parent database from scratch. Otherwise, the current parent database is used.

Utilities to Manage Large Clusters

The following utilities are located in /usr/bin.

lustre_config.sh

The lustre_config.sh utility helps automate the formatting and setup of disks on multiple nodes. An entire installation is described in a comma-separated file and passed to this script, which then formats the drives, updates modprobe.conf and produces high-availability (HA) configuration files.

lustre_createcsv.sh

The lustre_createcsv.sh utility generates a CSV file describing the currently-running installation.

lustre_up14.sh

The lustre_up14.sh utility grabs client configuration files from old MDTs. When upgrading Lustre from 1.4.x to 1.6.x, if the MGS is not co-located with the MDT or the client name is non-standard, this utility is used to retrieve the old client log. For more information, see Upgrading and Downgrading Lustre.

31.5.4 Application Profiling Utilities

The following utilities are located in /usr/bin.

lustre_req_history.sh

The lustre_req_history.sh utility (run from a client) assembles as much Lustre RPC request history as possible from the local node and from the servers that were contacted, providing a better picture of the coordinated network activity.

llstat.sh

The llstat.sh utility (improved in Lustre 1.6) handles a wider range of /proc file
48. [us] 2 2 2
    quota_ctl 4 samples [us] 80 3470 4293
    adjust_qunit 1 samples [us] 70 70 70

In the first line, snapshot_time indicates when the statistics were taken. The remaining lines list the quota events and their associated data.

In the second line, the async_acq_req event occurs one time. The min_time, max_time and sum_time statistics for this event are 32, 32 and 32, respectively. The unit is microseconds (us).

In the fifth line, the quota_ctl event occurs four times. The min_time, max_time and sum_time statistics for this event are 80, 3470 and 4293, respectively. The unit is microseconds (us).

Involving Lustre Support in Quotas Analysis

Quota statistics are collected in /proc/fs/lustre/lquota/.../stats. Each MDT and OST has one statistics proc file. If you have a problem with quotas, but cannot successfully diagnose the issue, send the statistics files in the folder to Lustre Support for analysis. To prepare the files:

1. Initialize the statistics data to 0 (zero). Run:

    lctl set_param lquota.${FSNAME}-MDT*.stats=0
    lctl set_param lquota.${FSNAME}-OST*.stats=0

2. Perform the quota operation that causes the problem or degraded performance.

3. Collect all statistics in /proc/fs/lustre/lquota/ and send them to Lustre Support.

Note the following:
- Proc quota entries are collected in these folders: /proc/fs/lustre/obdfilter/lustre-OSTXXXX/quota* and /proc/fs/lustre/mds/lustre-MDTXXX
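The per-event lines described above can be parsed mechanically. The following is a minimal sketch, assuming each event line has the shape "name count samples [unit] min max sum" (brackets around the unit are treated as optional). The function name `parse_quota_stat` and the derived `mean` field are illustrative additions and are not produced by Lustre.

```python
import re

# Assumed line shape: <event> <count> samples [<unit>] <min> <max> <sum>
LINE_RE = re.compile(
    r"^(\S+)\s+(\d+)\s+samples\s+\[?(\w+)\]?\s+(\d+)\s+(\d+)\s+(\d+)$"
)


def parse_quota_stat(line):
    """Parse one quota event line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # e.g. the snapshot_time line
    name, count, unit, lo, hi, total = match.groups()
    count, total = int(count), int(total)
    return {
        "event": name,
        "count": count,
        "unit": unit,
        "min": int(lo),
        "max": int(hi),
        "sum": total,
        # Derived value for convenience; not part of the stats file.
        "mean": total / count if count else 0.0,
    }


if __name__ == "__main__":
    print(parse_quota_stat("quota_ctl 4 samples [us] 80 3470 4293"))
```

For the quota_ctl line, the derived mean is 4293 / 4 microseconds per call, which is far below the maximum of 3470 and suggests one slow outlier among the four samples.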
49. us us tre osc lustre OST0001 osc ce63ca00 stats tre osc lustre OST0000 osc ce63ca00 stats tre osc lustre OST0001 osc stats tre osc lustre OST0000 osc stats tre mdt MDS mds_readpage stats tre mdt MDS mds_setattr stats tre mdt MDS mds stats tre mds lustre MDT0000 exports ab206805 0630 6647 8543 tre mds lustre MDT0000 exports 08ac6584 6c4a 3536 2c6d tre mds lustre MDT0000 stats tre ldlm services ldlm_canceld stats tre ldlm services ldlm_cbd stats tre llite lustre ce63ca00 stats
Chapter 22 LustreProc
22.3.1.1 Interpreting OST Statistics
The OST stats files can be used to track client statistics (client activity) for each OST. It is possible to get a periodic dump of values from these files (for example, every 10 seconds) that shows the RPC rates (similar to iostat) by using the llstat.pl tool.
llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats
/usr/bin/llstat: STATS on 09/14/07 /proc/fs/lustre/osc/lustre-OST0000-osc/stats on 192.168.10.34@tcp
snapshot_time 1189732762.835363
ost_create 1
ost_get_info 1
ost_connect 1
ost_set_info 1
obd_ping 242
To clear the statistics, give the -c option to llstat.pl. To specify how frequently the statistics should be cleared (in seconds), use an integer for the -i option. This is sample output with the -c and -i10 options used, providing statistics every 10s:
llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
/usr/bin/llstat: STATS on 06/06/07 /proc/f
50. 1>&2
    exit 1
}

while getopts "O:" opt; do
    case $opt in
        O) OST_PARAM="$OST_PARAM -O $OPTARG";;
        \?) usage;;
    esac
done
shift $((OPTIND - 1))

MVDIR=$1

if [ $# -ne 1 -o ! -d "$MVDIR" ]; then
    usage
fi

lfs find -type f $OST_PARAM "$MVDIR" | while read OLDNAME; do
    echo -n "$OLDNAME: "
    if [ ! -w "$OLDNAME" ]; then
        echo "No write permission, skipping"
        continue
    fi

    # Checksum the file before the copy.
    OLDCHK=$($CKSUM "$OLDNAME" | awk '{print $1}')
    if [ -z "$OLDCHK" ]; then
        echo "checksum error - exiting" 1>&2
        exit 1
    fi

    NEWNAME=$(mktemp "$OLDNAME.tmp.XXXXXX")
    if [ $? -ne 0 -o -z "$NEWNAME" ]; then
        echo "unable to create temp file - exiting"
        exit 2
    fi

    cp -a "$OLDNAME" "$NEWNAME"
    if [ $? -ne 0 ]; then
        echo "copy error - exiting" 1>&2
        rm -f "$NEWNAME"
        exit 4
    fi

    # Checksum the copy and compare before replacing the original.
    NEWCHK=$($CKSUM "$NEWNAME" | awk '{print $1}')
    if [ -z "$NEWCHK" ]; then
        echo "'$NEWNAME' checksum error - exiting"
        exit 6
    fi

    if [ "$OLDCHK" != "$NEWCHK" ]; then
        echo "'$NEWNAME' bad checksum - '$OLDNAME' not moved, exiting" 1>&2
        rm -f "$NEWNAME"
        exit 8
    else
        mv "$NEWNAME" "$OLDNAME"
        if [ $? -ne 0 ]; then
            echo "rename error - exiting" 1>&2
            rm "$NEWNAME"
            exit 12
        fi
    fi
    echo "done"
done

Chapter 26 Lustre Operating Tips
26.3 Adding Multiple SCSI LUNs on Single HBA
The configuration of the kernels packaged by th
51. 12 2. Format the LVM volumes as Lustre targets. In this example, the backup file system is called "main" and designates the current, most up-to-date backup.
cfs21:~# mkfs.lustre --mdt --fsname=main /dev/volgroup/MDT
No management node specified, adding MGS to this MDT.
   Permanent disk data:
Target:     main-MDTffff
Index:      unassigned
Lustre FS:  main
Mount type: ldiskfs
Flags:      0x75
            (MDT MGS needs_index first_time update)
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:
checking for existing Lustre data
device size = 200MB
formatting backing filesystem ldiskfs on /dev/volgroup/MDT
    target name  main-MDTffff
    4k blocks    0
    options      -i 4096 -I 512 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDTffff -i 4096 -I 512 -q -O dir_index -F /dev/volgroup/MDT
Writing CONFIGS/mountdata
cfs21:~# mkfs.lustre --ost --mgsnode=cfs21 --fsname=main /dev/volgroup/OST0
   Permanent disk data:
Target:     main-OSTffff
Index:      unassigned
Lustre FS:  main
Mount type: ldiskfs
Flags:      0x72
            (OST needs_index first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.0.21@tcp
checking for existing Lustre data
device size = 200MB
formatting backing filesystem ldiskfs on /dev/volgroup/OST0
    target name  main-OSTffff
    4k blocks    0
    options      -I 256 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-OSTffff -I 256 -q -O dir_index -F /dev/volgroup/OST0
Writing CONFIGS/mountdata
52. 2 OST devices = 800 MB
600 MB file system metadata cache * 2 OSTs = 1200 MB
This consumes about 1,700 MB just for the pre-allocated buffers, and an additional 2 GB for minimal file system and kernel usage. Therefore, for a non-failover configuration, the minimum RAM would be 4 GB for an OSS node with two OSTs. While it is not strictly required, adding memory on the OSS will improve the performance of reading smaller, frequently accessed files. For a failover configuration, the minimum RAM would be at least 6 GB. For 4 OSTs on each OSS in a failover configuration, 10 GB of RAM is reasonable. When the OSS is not handling any failed-over OSTs, the extra RAM will be used as a read cache. As a reasonable rule of thumb, about 2 GB of base memory plus 1 GB per OST can be used. In failover configurations, about 2 GB per OST is needed.
3.2 Installing Lustre from RPMs
This procedure describes how to install Lustre from the RPM packages. This is the easier installation method and is recommended for new users. Alternatively, you can install Lustre directly from the source code. For more information on this installation method, see Installing Lustre from Source Code.
Note: In all Lustre installations, the server kernel that runs on an MDS, MGS or OSS must be patched. Running a patched kernel on a Lustre client is optional. It is only required if the client will be used for multiple purposes
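The rule of thumb above (about 2 GB of base memory plus 1 GB per OST, or 2 GB per OST with failover) can be sketched as a small helper. The function name is hypothetical, and the result is a rough sizing estimate, not a supported formula.

```python
def min_oss_ram_gb(num_osts: int, failover: bool = False) -> int:
    """Estimate minimum OSS RAM in GB from the rule of thumb in the text.

    Base memory is about 2 GB; each OST adds about 1 GB, or about 2 GB
    when the OSS may have to take over OSTs in a failover pair.
    """
    if num_osts < 1:
        raise ValueError("an OSS serves at least one OST")
    base_gb = 2
    per_ost_gb = 2 if failover else 1
    return base_gb + per_ost_gb * num_osts


if __name__ == "__main__":
    print(min_oss_ram_gb(2))                 # non-failover, two OSTs
    print(min_oss_ram_gb(2, failover=True))  # failover, two OSTs
    print(min_oss_ram_gb(4, failover=True))  # failover, four OSTs
```

These three calls reproduce the figures quoted in the text: 4 GB, 6 GB and 10 GB.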
53. 21.4.12 Replacing an Existing OST or MDS
The OST file system is an ldiskfs file system, which is simply a normal ext3 file system plus some performance enhancements, making it very close, in fact, to ext4. To copy the contents of an existing OST to a new OST (or an old MDS to a new MDS), use one of these methods:
- Connect the old OST disk and new OST disk to a single machine, mount both, and use rsync to copy all data between the OST file systems. For example:
mount -t ldiskfs /dev/old /mnt/ost_old
mount -t ldiskfs /dev/new /mnt/ost_new
rsync -aSv /mnt/ost_old/ /mnt/ost_new
# note trailing slash on ost_old/
- If you are unable to connect both sets of disks to the same computer, use rsync to copy over the network using rsh (or ssh with -e ssh):
rsync -aSvz /mnt/ost_old/ new_ost_node:/mnt/ost_new
- Use the same procedure for the MDS, with one additional step:
cd /mnt/mds_old; getfattr -R -e base64 -d . > /tmp/mdsea
<copy all MDS files as above>
cd /mnt/mds_new; setfattr --restore=/tmp/mdsea
21.4.13 Handling/Debugging Error "-28"
Linux error -28 is -ENOSPC and indicates that the file system has run out of space. You need to create larger file systems for the OSTs. Normally, Lustre reports this to your application. If the application is checking the return code from its function calls, then it decodes it into a textual error message like "No space left on device". It also appears in the system log messages. D
54. 4 Configuring Kerberos
To configure Kerberos to work with Lustre:
1. Modify the files for Kerberos:
/etc/krb5.conf
[libdefaults]
default_realm = CLUSTERFS.COM
[realms]
CLUSTERFS.COM = {
kdc = mds16.clustrefs.com
admin_server = mds16.clustrefs.com
}
[domain_realm]
.clustrefs.com = CLUSTERFS.COM
clustrefs.com = CLUSTERFS.COM
[logging]
default = FILE:/var/log/kdc.log
2. Prepare the Kerberos database.
3. Create service principals so Lustre supports Kerberos authentication.
Note: You can create service principals when configuring your other services to support Kerberos authentication.
4. Configure the client nodes. For each client node:
a. Create a lustre_root principal and generate the keytab.
kadmin> addprinc -randkey lustre_root/client_host.domain@REALM
kadmin> ktadd -e aes128-cts:normal lustre_root/client_host.domain@REALM
This process populates /etc/krb5.keytab, which is not human-readable. Use the ktutil program to read and modify it.
b. Install the keytab.
Note: There is only one security context for each client-OST pair, shared by all users on the client. This covers data that is written by one user but passed to an OST by another user because of asynchronous bulk I/O. The client-OST connection only guarantees message integrity or privacy; it does not authenticate users.
5. Configure the MDS nodes. For each MDT node, create a lust
55. 4 KB block 4 35840 KB
What is the Lustre data path?
On the OST, data is read directly from the disk into pre-allocated network I/O buffers, in chunks up to 1 MB in size. This data is sent (zero-copy where possible) to the clients, where it is put (again, zero-copy where possible) into the file's data mapping. The clients maintain local writeback and readahead caches for Lustre. On the OST, the file system metadata such as inodes, bitmaps and file allocation information is cached in RAM (up to the maximum amount that the kernel allows). No user data is currently cached on the OST. In cases where only a few files are read by many clients, it makes sense to use a RAID device with a lot of local RAM cache so that the multiple read requests can skip the disk access. The networking code bundles up page requests into a maximum of 1 MB in a single RPC to minimize overhead. In each client OSC, this is controlled by the /proc/fs/lustre/osc/*/max_pages_per_rpc field. The size of the writeback cache can be tuned via /proc/fs/lustre/osc/*/max_dirty_mb. The size of the readahead can be tuned via /proc/fs/lustre/llite/*/max_read_ahead_mb. Total client-side cache usage can be limited via /proc/fs/lustre/llite/*/max_cached_mb.
Questions about using Lustre quotas
This section covers various aspects of using Lustre quotas.
When I enable quotas with lfs quotaon, will it automatically set default quotas for all user
56. 43 c1 87 00 00 00 00 a0 88 00 00 00 00 00 00 00 00 00 00 00 00 00 00 (32)
BLOCKS: (0-63): 47968-48031
TOTAL: 64
6. The FID is the file identifier.
2. Note the FID's EA and apply it to the osd_inode_id mapping. In this example, the FID's EA is:
e2001100000000002543c18700000000a0880000000000000000000000000000
struct osd_inode_id {
    __u64 oii_ino; /* inode number */
    __u32 oii_gen; /* inode generation */
    __u32 oii_pad; /* alignment padding */
};
After swapping, you get an inode number of 0x001100e2 and generation of 0.
3. On the MDT, as root, use debugfs to find the file associated with the inode.
debugfs -c -R "ncheck 0x001100e2" /dev/lustre/mdt_test
Here is the command output:
debugfs 1.41.5.sun2 (23-Apr-2009)
/dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmaps
Inode Pathname
1114338 /ROOT/brian-laptop-guest/clients/client11/dmtmp/PWRPNT/ZD16.BMP
The command lists the inode and pathname associated with the object.
Note: Debugfs' ncheck is a brute-force search that may take a long time to complete.
Note: To find the Lustre file from a disk LBA, follow the steps listed in the document at this URL: http://smartmontools.sourceforge.net/badblockhowto.html. Then follow the steps above to resolve the Lustre filename.
Chapter 21 Lustre Monitoring and Troubleshooting
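The byte swap in step 2 can be checked with a few lines of code. This is a minimal sketch that assumes the EA stores oii_ino as a little-endian 64-bit value in its first 8 bytes, which is what "after swapping" implies; the function name is hypothetical, and only the inode number is decoded.

```python
import struct

# FID EA from the example above, as printed by debugfs.
FID_EA_HEX = "e2001100000000002543c18700000000a0880000000000000000000000000000"


def inode_from_fid_ea(ea_hex: str) -> int:
    """Return oii_ino, assumed to be the first 8 bytes, little-endian."""
    raw = bytes.fromhex(ea_hex[:16])   # 16 hex digits = 8 bytes = __u64
    (ino,) = struct.unpack("<Q", raw)  # "<Q": little-endian unsigned 64-bit
    return ino


if __name__ == "__main__":
    ino = inode_from_fid_ea(FID_EA_HEX)
    print(hex(ino), ino)
```

The decimal value is the one debugfs prints in the Inode column of the ncheck output, so either form can be passed to ncheck.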
57. 64-bit limits that support large-limits handling. The old quota file format (v1), with 32-bit limits, is also supported. Lustre 1.6.6 introduced the v2 file format for operational quotas. A few notes regarding the current quota file formats:
Lustre 1.6.5 and later use mdt.quota_type to force a specific administrative quota version (v2 or v1).
- For the v2 quota file format: OBJECTS/admin_quotafile_v2.{usr,grp}
- For the v1 quota file format: OBJECTS/admin_quotafile.{usr,grp}
Lustre 1.6.6 and later use ost.quota_type to force a specific operational quota version (v2 or v1).
- For the v2 quota file format: lquota_v2.{user,group}
- For the v1 quota file format: lquota.{user,group}
The quota_type specifier can be used to set different combinations of administrative/operational quota file versions on a Lustre node:
- "1": v1 (32-bit) administrative quota file, v1 (32-bit) operational quota file (default in releases before Lustre 1.6.5)
- "2": v2 (64-bit) administrative quota file, v1 (32-bit) operational quota file (default in Lustre 1.6.5)
- "3": v2 (64-bit) administrative quota file, v2 (64-bit) operational quota file (default in releases after Lustre 1.6.5)
If quotas do not exist or look broken, then quotacheck creates quota files of a required name and format. If Lustre is using the v2 quota file format when only v1 quota files exist, then quotacheck converts old v1 quota files to new v2 quota files. This conversion
58. CPU moves data in and out of the socket for a uni-directional data flow to each peer. If the NICs are not bonded, Lustre establishes two bundles of sockets to the peer. Since ksocklnd spreads traffic between sockets, and sockets between CPUs, both CPUs move data.
12.4 Bonding Module Parameters
Bonding module parameters control various aspects of bonding. Outgoing traffic is mapped across the slave interfaces according to the transmit hash policy. For Lustre, we recommend that you set the xmit_hash_policy option to the layer3+4 option for bonding. This policy uses upper-layer protocol information, if available, to generate the hash. This allows traffic to a particular network peer to span multiple slaves, although a single connection does not span multiple slaves.
xmit_hash_policy=layer3+4
The miimon option enables users to monitor the link status. The parameter is a time interval in milliseconds. It makes an interface failure transparent, to avoid serious network degradation during link failures. A reasonable default setting is 100 milliseconds:
miimon=100
For a busy network, increase the timeout.
12.5 Setting Up Bonding
To set up bonding:
1. Create a virtual bond interface by creating a configuration file in /etc/sysconfig/network-scripts/:
vi /etc/sysconfig/network-scripts/ifcfg-bond0
2. Append the following lines to the file:
DEVICE=bond0 PAD
59. Debugging
23.1 Lustre Debug Messages
23.1.1 Format of Lustre Debug Messages
23.2 Tools for Lustre Debugging
23.2.1 Debug Daemon Option to lctl
23.2.2 Controlling the Kernel Debug Log
23.2.3 The lctl Tool
23.2.4 Finding Memory Leaks
23.2.5 Printing to /var/log/messages
23.2.6 Tracing Lock Traffic
23.2.7 Sample lctl Run
23.2.8 Adding Debugging to the Lustre Source Code
23.2.9 Debugging in UML
23.3 Troubleshooting with strace
23.4 Looking at Disk Content
23.4.1 Determine the Lustre UUID of an OST
23.4.2 Tcpdump
23.5 Ptlrpc Request History
23.6 Using LWT Tracing
Part IV Lustre for Users
24 Striping and I/O Options
24.1 File Striping
24.1.1 Advantages of Striping
24.1.2 Disadvantages of Striping
24.1.3 Stripe Size
24.2 Displaying Files and Directories with lfs getstripe
24.3 lfs setstripe - Setting File Layouts
24.3.1 Changing Striping for a Subdirectory
24.3.2 Using a Specific Striping Pattern/File Layout for a Single File
24.3.3 Creating a File on a Specific OST
24.4 Managing Free Space in Lustre
24.4.1 Querying File System Free Space
24.4.2 Using Stripe Allocations
24.4.3 Round-Robin Allocator
24.4.4 Weighted Allocator
24.4.5 Adjusting the Weighting Between Free Space and Location
60. If you are unsure which NID to use, there is an lctl command that can help.
MDS: On the MDS, run:
lctl list_nids
This displays the server's NIDs.
Client: On a client, run:
lctl which_nid <NID list>
This displays the closest NID for the client.
Client with SSH Access: From a client with SSH access to the MDS, run:
mds_nids=`ssh the_mds lctl list_nids`
lctl which_nid $mds_nids
This displays, generally, the correct NID to use for the MDS in the mount command.
Note: In the mds_nids command above, be sure to use the correct mark (`), not a straight quotation mark ('). Otherwise, the command will not work.
2.4 Configuring LNET
This section describes how to configure LNET.
Note: We recommend that you use dotted-quad IP addressing rather than host names. We have found this aids in reading debug logs, and helps greatly when debugging configurations with multiple interfaces.
2.4.1 Module Parameters
LNET network hardware and routing are configured via module parameters of the LNET and LND-specific modules. Parameters should be specified in the /etc/modprobe.conf or /etc/modules.conf file. This example specifies that the node should use a TCP interface and an Elan interface:
options lnet networks=tcp0,elan0
Depending on the LNDs used, it may be necessary to specify explicit interfaces. For example, if you want to use two TCP interfaces (tcp0 and tc
61. If you need to keep the file system running while some clients are upgraded, the following module parameter controls interoperability with pre-1.4.6 Lustre. Compatibility between versions is not possible if you are using portals routers/gateways. If you use gateways, you must update the clients, gateways and servers at the same time.
portals_compatibility="strong"|"weak"|"none"
"strong" is compatible with Lustre 1.4.5, and with 1.4.6 running in either "strong" or "weak" compatibility mode. Since this is the only mode compatible with 1.4.5, all 1.4.6 nodes in the cluster must use "strong" until the last 1.4.5 node has been upgraded.
"weak" is not compatible with 1.4.5, or with 1.4.6 running in "none" mode.
"none" is not compatible with 1.4.5, or with 1.4.6 running in "strong" mode.
For more information, see Upgrading Lustre on page 117.
Note: Lustre v1.4.2 through v1.4.5 clients are only compatible with zero-conf mounting from a 1.4.6 MDS if the MDS was originally formatted with Lustre 1.4.5 or earlier. If the file system was formatted with v1.4.6 on the MDS, or lconf --write_conf was run on the MDS, then the backward compatibility is lost. It is still possible to mount 1.4.2 through 1.4.5 clients with lconf --node client_node config.xml.
Appendix A Lustre Knowledge Base
How to fix bad LAST_ID on an OST
The file system must be stopped on all servers prior to performing this procedure. For hex
62. It can be used for interactive debugging of an ext3/ldiskfs file system. The debugfs tool can be used either to check status or to modify information in the file system. In Lustre, all objects that belong to a file are stored in an underlying ldiskfs file system on the OSTs. The file system uses the object IDs as the file names. Once the object IDs are known, the debugfs tool can be used to obtain the attributes of all objects from different OSTs. A sample run for the /mnt/lustre/frog file used in the example above is shown here:
debugfs -c /tmp/ost1
debugfs: cd O
debugfs: cd 0 /* for files in group 0 */
debugfs: cd d<objid % 32>
debugfs: stat <objid> /* for getattr on object */
debugfs: quit
Suppose the object ID is 36; then follow the steps below:
debugfs /tmp/ost1
debugfs: cd O
debugfs: cd 0
debugfs: cd d4 /* objid % 32 */
debugfs: stat 36 /* for getattr on object 36 */
debugfs: dump 36 /tmp/obj.36 /* dump contents of object 36 */
debugfs: quit
Chapter 23 Lustre Debugging
23.4.1 Determine the Lustre UUID of an OST
To determine the Lustre UUID of an obdfilter disk (for example, if you mix up the cables on your OST devices or the SCSI bus numbering suddenly changes and the SCSI devices get new names), use debugfs to get the last_rcvd file.
23.4.2 Tcpdump
Lustre provides a modified version of tcpdump which helps to decode the complete Lustre message packet. This too
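The directory walk in the debugfs session above follows a fixed pattern: the object directory O, then the group, then a subdirectory named d<objid % 32>, then the object ID itself. A minimal sketch of that mapping, with a hypothetical function name:

```python
def ost_object_path(objid: int, group: int = 0) -> str:
    """Build the path of an object inside the OST's ldiskfs file system.

    Mirrors the debugfs steps: cd O, cd <group>, cd d<objid % 32>,
    then stat or dump <objid>.
    """
    if objid < 0 or group < 0:
        raise ValueError("object ID and group must be non-negative")
    return f"O/{group}/d{objid % 32}/{objid}"


if __name__ == "__main__":
    print(ost_object_path(36))
```

For object ID 36 the subdirectory is d4, matching the "cd d4" step in the sample run.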
63. Lustre recovery by looking at the recovery_status proc entry for each device on the OSSs, for example:
cat /proc/fs/lustre/obdfilter/ost1/recovery_status
- The file system may get stuck in recovery if any servers are down or if any servers have thrown a Lustre bug (LBUG); check /proc/fs/lustre/health_check.
19.3 Version-based Recovery
Lustre 1.8 introduces the Version-based Recovery (VBR) feature, which improves Lustre reliability in cases where client requests (RPCs) fail to replay during recovery. In pre-1.8 versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, then those requests were lost, as well as the requests of clients later in the sequence. The "downstream" clients never got to replay their requests because of the wait on the earlier client's RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost. With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:
- Each inode stores a
64. MDS
Handling/Debugging Error "-28"
Triggering Watchdog for PID NNN
Handling Timeouts on Initial Lustre Setup
Handling/Debugging "LustreError: xxx went back in time"
Lustre Error: "Slow Start_Page_Write"
Drawbacks in Doing Multi-client O_APPEND Writes
Slowdown Occurs During Lustre Startup
Log Message "Out of Memory" on OST
21.4.21 Number of OSTs Needed for Sustained Throughput
21.4.22 Setting SCSI I/O Sizes
21.4.23 Identifying Which Lustre File an OST Object Belongs To
22 LustreProc
22.1 Proc Entries for Lustre
22.1.1 Locating Lustre File Systems and Servers
22.1.2 Lustre Timeouts
22.1.3 Adaptive Timeouts
22.1.4 LNET Information
22.1.5 Free Space Distribution
22.2 Lustre I/O Tunables
22.2.1 Client I/O RPC Stream Tunables
22.2.2 Watching the Client RPC Stream
22.2.3 Client Read-Write Offset Survey
22.2.4 Client Read-Write Extents Survey
22.2.5 Watching the OST Block I/O Stream
22.2.6 Using File Readahead and Directory Statahead
22.2.7 OSS Read Cache
22.2.8 mballoc History
22.2.9 mballoc3 Tunables
22.2.10 Locking
22.2.11 Changing MDS and OSS Thread Counts
22.3 Debug Support
22.3.1 RPC Information for Other OBD Devices
23 Lustre
65. Maximum total number of concurrent sends that are outstanding to a single peer at a given time. Maximum number of concurrent sends that are outstanding to a single peer at a given time. Maximum immediate message size. This MUST be the same on all nodes in a cluster. A peer that connects with a different max_msg_size value will be rejected.
Chapter 30 Configuration Files and Module Parameters (man5)
30.2.8 Portals LND (Catamount)
The Portals LND Catamount (ptllnd) can be used as an interface layer to communicate with Sandia Portals networking devices. This version is intended to work on the Cray XT3 Catamount nodes using Cray Portals as a network transport. To enable the building of the Portals LND, configure with this option:
./configure --with-portals=<path-to-portals-headers>
The following PTLLND tunables are currently available:
Variable:
PTLLND_DEBUG (boolean, dflt 0)
PTLLND_TX_HISTORY (int, dflt debug?1024:0)
PTLLND_ABORT_ON_PROTOCOL_MISMATCH (boolean, dflt 1)
PTLLND_ABORT_ON_NAK (boolean, dflt 0)
PTLLND_DUMP_ON_NAK (boolean, dflt debug?1:0)
PTLLND_WATCHDOG_INTERVAL (int, dflt 1)
PTLLND_TIMEOUT (int, dflt 50)
PTLLND_LONG_WAIT (int, dflt debug?5:PTLLND_TIMEOUT)
Description (in the same order):
Enables or disables debug features.
Sets the size of the history buffer.
Calls abort action on connecting to a peer running a different version of the ptllnd protocol.
Calls abort a
66. NIDs
In addition to the standard mount options, Lustre understands the following client-specific options:
Option: Description
flock: Enables flock support (coherent across client nodes).
localflock: Enables local flock support, using only client-local flock (faster, for applications that require flock but do not run on multiple nodes).
noflock: Disables flock support entirely. Applications calling flock get an error.
user_xattr: Enables get/set of extended attributes by regular users.
nouser_xattr: Disables use of extended attributes by regular users. Root and system processes can still use extended attributes.
acl: Enables ACL support.
noacl: Disables ACL support.
In addition to the standard mount options and backing disk type (e.g. ext3) options, Lustre understands the following server-specific options:
Option: Description
nosvc: Starts only the MGC (and MGS, if co-located) for a target service, not the actual service.
nomgs: Starts only the MDT (with a co-located MGS), without starting the MGS.
exclude=ostlist: Starts a client or MDT with a colon-separated list of known inactive OSTs.
abort_recov: Aborts client recovery and immediately starts the target service.
md_stripe_cache_size: Sets the stripe cache size for server-side disk with a striped RAID configuration.
Examples
Starts a client for the Lustre file system testfs at mount point
67. OBD_FAIL_RAND: Has OBD_FAIL_CHECK fail randomly; on average every (1 / lustre.fail_val) times.
OBD_FAIL_SKIP: Has OBD_FAIL_CHECK succeed lustre.fail_val times, and then fail permanently or once with OBD_FAIL_ONCE.
OBD_FAIL_SOME: Has OBD_FAIL_CHECK fail lustre.fail_val times, and then succeed.
23.2.9 Debugging in UML
Lustre developers use gdb in User Mode Linux (UML) to debug Lustre. The lmc and lconf tools can be used to configure a Lustre cluster, load the required modules, start the services, and set up all the devices. lconf puts the debug symbols for the newly loaded module into /tmp/gdb-localhost.localdomain on the host machine. These symbols can be loaded into gdb using the source command in gdb:
symbol-file
delete
symbol-file /usr/src/lum/linux
source /tmp/gdb-{hostname}
b panic
b stop
23.3 Troubleshooting with strace
The operating system makes the strace (program trace) utility available. Use strace to trace program execution. The strace utility intercepts the system calls made by a process and records each call, its arguments and its return value. This is a very useful tool, especially when you try to troubleshoot a failed system call. To invoke strace on a program:
strace <program> <args>
Sometimes a program forks child processes. In this situation, use the -f option of strace to trace the child processes:
strace -f <program> <args>
68. OSS needs to allocate pages to receive data from the client but cannot perform this operation due to low memory. This can result in OOM-kill and other issues.
Chapter 3 Installing Lustre
3. Install the Lustre RPMs. Lustre requires that a set of RPMs (kernel, module, utilities and e2fsprogs) be installed in a specific order. Depending on the selected platform, different packages are required. In Step 2, you downloaded the RPMs specific to your platform.
a. For each Lustre package, determine if it needs to be installed on servers and/or clients. TABLE 3-1 provides a complete list of the required Lustre packages and, for each package, where to install it. Depending on the selected platform, not all packages listed in TABLE 3-1 need to be installed.
TABLE 3-1 Lustre required packages, descriptions and installation guidance
Columns: Lustre Package | Description | Install on servers | Install on patchless clients | Install on patched clients
Lustre kernel RPMs
kernel-lustre-<ver> | Lustre patched kernel package for RHEL 5 (i686, ia64 and x86_64) platform | X x
kernel-lustre-smp-<ver> | Lustre patched kernel package for SuSE Server 10 (x86_64) platform | X x
kernel-lustre-bigsmp-<ver> | Lustre patched kernel package for SuSE Server 10 (i686) platform | x x
kernel-ib-<ver> | Lustre OFED package. Install if the network interconnect is InfiniBand. | x x x
kernel-lustre-default-<ver> | Lustre patched kernel kernel-lustre-default-base-l
69. Operational quotas for the MDT and OSTs, which contain quota information dedicated to a cluster node.
Lustre 1.6.5 introduced the v2 file format for administrative quota files, with continued support for the old file format (v1). The mdt.quota_type parameter also handles "1" and "2" options to specify the Lustre quota versions that will be used. For example:
--param mdt.quota_type=ug1
--param mdt.quota_type=u2
Lustre 1.6.6 introduced the v2 file format for operational quotas, with continued support for the old file format (v1). The ost.quota_type parameter handles "1" and "2" options to specify the Lustre quota versions that will be used. For example:
--param ost.quota_type=ug2
--param ost.quota_type=u1
For more information about the v1 and v2 formats, see Quota File Formats.
Chapter 9 Configuring Quotas
9.1.2 Creating Quota Files and Quota Administration
Once each quota-enabled file system is remounted, it is capable of working with disk quotas. However, the file system is not yet ready to support quotas. If umount has been done regularly, run the lfs command with the quotaon option. If umount has not been done, perform these steps:
1. Take Lustre "offline". That is, verify that no write operations (append, write, truncate, create or delete) are being performed, in preparation for running lfs quotacheck. Operations that do not change Lustre files (such as read or mount) are okay to run.
Caution
70. Outline necessary changes to Lustre configuration for the new networking features in v1.4.6. Further details may be found in the Lustre manual excerpts found at: https://wiki.clusterfs.com/cfs/intra/FrontPage?action=AttachFile&do=get&target=LustreManual.pdf
Backwards Compatibility
The 1.4.6 version of Lustre itself uses the same wire protocols as the previous release, but has a different network addressing scheme and a much simpler configuration for routing. In single-network configurations, LNET can be configured to work with the 1.4.5 networking (portals) so that rolling upgrades can be performed on a cluster. See the portals_compatibility parameter below. When portals_compatibility is enabled, old XML configuration files remain compatible. lconf automatically converts old-style network addresses to the new LNET style.
If a rolling upgrade is not required (that is, all clients and servers can be stopped at one time), then follow the standard procedure:
Appendix A Lustre Knowledge Base
1. Shut down all clients and servers.
2. Install new packages everywhere.
3. Edit the Lustre configuration.
4. Update the configuration on the MDS with lconf --write_conf.
5. Restart.
New Network Addressing
A NID is a Lustre network address. Every node has one NID for each network to which it is attached. The NID has the form <address>@<network>, where the <
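As a small illustration of the NID form described above, the sketch below splits a NID into its address and network parts. It assumes the two parts are joined by "@" (as in 192.168.0.21@tcp); the function name is hypothetical and no LNET library is involved.

```python
from typing import Tuple


def parse_nid(nid: str) -> Tuple[str, str]:
    """Split a NID of the form <address>@<network> into its two parts."""
    # rpartition splits on the last "@", so the network name is whatever
    # follows it and the address is everything before it.
    address, sep, network = nid.rpartition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a valid NID: {nid!r}")
    return address, network


if __name__ == "__main__":
    print(parse_nid("192.168.0.21@tcp"))
```

A node attached to several networks has one such NID per network, so a node's NID list would simply be parsed one entry at a time.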
71. PIOS I/O Modes
There are several supported PIOS I/O modes:
POSIX I/O: This is the default operational mode, where I/O is done using standard POSIX calls such as pwrite/pread. This mode is valid on both Linux and Solaris.
DIRECT I/O: This mode corresponds to the O_DIRECT flag in the open(2) system call, and it is currently applicable only to Linux. Use this mode when using PIOS on the ldiskfs file system on an OSS.
COW I/O: This mode corresponds to the copy overwrite operation, where file system blocks that are being overwritten were copied to shadow files. Only use this mode if you want to see the overhead of preserving existing data (in case of overwrite). This mode is valid on both Linux and Solaris.
18.3.3 PIOS Parameters
PIOS has five basic parameters to determine the amount of data that is being written.
ChunkSize (c): Amount of data that a thread writes in one attempt. ChunkSize should be a multiple of the file system block size.
RegionSize (s): Amount of data required to fill up a region. PIOS writes a chunksize of data continuously until it fills the regionsize. RegionSize should be a multiple of ChunkSize.
RegionCount (n): Number of regions to write in one or multiple files. The total amount of data written by PIOS is RegionSize x RegionCount.
ThreadCount (t): Number of threads working on regions.
Chapter 18 Lustre I/O Kit
Offset (o): Distance betwee
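The constraints between these parameters can be checked before a run. The sketch below validates the two "multiple of" rules and returns the total amount of data PIOS would write (RegionSize x RegionCount). The function name and the 4096-byte default block size are assumptions for illustration, not PIOS behavior.

```python
def pios_total_bytes(chunk_size: int, region_size: int, region_count: int,
                     block_size: int = 4096) -> int:
    """Validate PIOS sizing rules and return the total bytes written."""
    if chunk_size % block_size != 0:
        raise ValueError("ChunkSize should be a multiple of the block size")
    if region_size % chunk_size != 0:
        raise ValueError("RegionSize should be a multiple of ChunkSize")
    if region_count < 1:
        raise ValueError("RegionCount must be at least 1")
    # Total data written is RegionSize x RegionCount, as stated in the text.
    return region_size * region_count


if __name__ == "__main__":
    MiB = 1 << 20
    total = pios_total_bytes(chunk_size=1 * MiB, region_size=16 * MiB,
                             region_count=8)
    print(total // MiB, "MiB")
```

ThreadCount does not appear in the total: it changes how many regions are worked on at once, not how much data is written.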
72. Ready 102400 512 2 2 i Ready 102400 512 3 3 1 Ready 102400 512 4 4 2 Ready GHS 102400 4096 5 5 2 Ready GHS 102400 4096 6 6 2 Critical 102400 512 7 7 2 Critical 102400 4096 8 10 1 Cache Locked 64 512 1 LL Al Ready 64 512 2 12 1 Cache Locked 64 512 3 13 1 Cache Locked 64 512 4 14 2 Ready GHS 64 512 5 15 2 Ready GHS 64 512 6 16 2 Ready GHS 64 4096 7 17 2 Ready GHS 64 4096 8
System verify extent: 16 Mbytes
System verify delay: 30
20.6 Large-Scale Tuning for Cray XT and Equivalents
This section only applies to Cray XT3 Catamount nodes, and explains parameters used with the kptllnd module. If it does not apply to your setup, ignore it.
20.6.1 Network Tunables
With a large number of clients and servers possible on these systems, tuning various request pools becomes important. We are making changes to the ptllnd module.
Parameter: Description
max_nodes: max_nodes is the maximum number of queue pairs, and therefore the maximum number of peers with which the LND instance can communicate. Set max_nodes to a value higher than the product of the total number of nodes and maximum processes per node:
max_nodes > (total nodes) * max_procs_per_node
Setting max_nodes to a lower value than described causes Lustre to throw an error. Setting max_nodes to a higher value causes excess memory to be consumed.
max_procs_per_node:
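The max_nodes rule above is a simple inequality, sketched here so that a planned setting can be checked against cluster size. Both function names are hypothetical; the only rule encoded is that max_nodes must exceed total nodes multiplied by max_procs_per_node.

```python
def smallest_valid_max_nodes(total_nodes: int, max_procs_per_node: int) -> int:
    """Smallest max_nodes strictly greater than nodes * procs per node."""
    return total_nodes * max_procs_per_node + 1


def max_nodes_is_valid(max_nodes: int, total_nodes: int,
                       max_procs_per_node: int) -> bool:
    """True if max_nodes is higher than the product, as the text requires."""
    return max_nodes > total_nodes * max_procs_per_node


if __name__ == "__main__":
    # Illustrative numbers only: 1000 nodes, 2 processes per node.
    print(smallest_valid_max_nodes(1000, 2))
    print(max_nodes_is_valid(2000, 1000, 2))
```

Because larger values consume extra memory, a setting close to the smallest valid value is the natural starting point.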
73. Size: The second reason to stripe is when a single OST does not have enough free space to hold the entire file. There is never an exact, one-to-one mapping between clients and OSTs. Lustre uses a round-robin algorithm for OST stripe selection until free space on OSTs differs by more than 20%. However, depending on actual file sizes, some stripes may be mostly empty, while others are more full. For a more detailed description of stripe assignments, see Managing Free Space in Lustre. After every ostcount+1 objects, Lustre skips an OST. This causes Lustre's starting point to precess around, eliminating some degenerated cases where applications that create very regular file layouts (striping patterns) would have preferentially used a particular OST in the sequence. 24.1.2 Disadvantages of Striping. There are two disadvantages to striping, which should deter you from choosing a default policy that stripes over all OSTs unless you really need it: increased overhead and increased risk. 24.1.2.1 Increased Overhead. Increased overhead comes in the form of extra network operations during common operations such as stat and unlink, and more locks. Even when these operations are performed in parallel, there is a big difference between doing 1 network operation and 100 operations. Increased overhead also comes in the form of server contention. Consider a cluster with
74. The maximum stripe count is 160. This limit is hard-coded, but is near the upper limit imposed by the underlying ext3 file system. It may be increased in future releases. Under normal circumstances, the stripe count is not affected by ACLs. 32.2 Maximum Stripe Size. For a 32-bit machine, the product of stripe size and stripe count (stripe_size * stripe_count) must be less than 2^32. The ext3 limit of 2 TB for a single file applies for a 64-bit machine. (Lustre can support 160 stripes of 2 TB each on a 64-bit system.) 32.3 Minimum Stripe Size. Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB. 32.4 Maximum Number of OSTs and MDTs. You can set the maximum number of OSTs by a compile option. The limit of 1020 OSTs in Lustre release 1.4.7 is increased to a maximum of 8150 OSTs in 1.6.0. Testing is in progress to move the limit to 4000 OSTs. The maximum number of MDSs will be determined after accomplishing MDS clustering. 32.5 Maximum Number of Clients. Currently, the number of clients is limited to 131072. We have tested up to 22000 clients. 32.6 Maximum Size of a File System. For i386 systems with 2.6 kernels, the block devices are limited to 16 TB. Each OST or MDT can have a file system up to 8 TB, regardless of whether 32-bit or 64-bit kernels are on the server. For 2.6 kernels, the 8 TB limit is imposed by e
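The stripe limits quoted above (at most 160 stripes, a 64 KB minimum stripe size, and a stripe_size * stripe_count product below 2^32 on 32-bit machines) lend themselves to a quick pre-flight check. This is an illustrative sketch only; the function name and the message strings are assumptions, and Lustre itself performs its own validation.

```python
MAX_STRIPE_COUNT = 160          # hard-coded limit quoted above
MIN_STRIPE_SIZE = 64 * 1024     # 64 KB minimum stripe size
LIMIT_32BIT = 2 ** 32           # product limit on 32-bit machines


def stripe_layout_problems(stripe_size, stripe_count, is_32bit):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if stripe_count > MAX_STRIPE_COUNT:
        problems.append("stripe count exceeds 160")
    if stripe_size < MIN_STRIPE_SIZE:
        problems.append("stripe size is below 64 KB")
    if is_32bit and stripe_size * stripe_count >= LIMIT_32BIT:
        problems.append("stripe_size * stripe_count must be less than 2^32")
    return problems


if __name__ == "__main__":
    print(stripe_layout_problems(1 << 30, 4, is_32bit=True))
```

Four stripes of 1 GiB multiply out to exactly 2^32, so the sketch flags that layout on a 32-bit machine but not on a 64-bit one.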
75. You can also use ls -l /proc/<pid>/fd/ to find open files using Lustre. Run: $ lfs getstripe $(readlink /proc/$(pidof cat)/fd/1) OBDS: 0: databarn-ost1_UUID ACTIVE 1: databarn-ost2_UUID ACTIVE 2: databarn-ost3_UUID ACTIVE 3: databarn-ost4_UUID ACTIVE /barn/users/jacob/tmp/foo obdidx objid objid group 2 835487 0xcbf9f 0 This shows that the file lives on obdidx 2, which is databarn-ost3. To see which node is serving that OST, run: $ cat /proc/fs/lustre/osc/*databarn-ost3*/ost_conn_uuid NID_oss1.databarn.87k.net_UUID The above operation also works with connections to the MDS. For that, replace osc with mdc and ost with mds in the above commands. 24.3 lfs setstripe: Setting File Layouts. Use the lfs setstripe command to create new files with a specific file layout (stripe pattern) configuration. lfs setstripe [--size|-s stripe-size] [--count|-c stripe-cnt] [--index|-i start-ost] <filename|dirname> stripe-size: Stripe size is how much data to write to one OST before moving to the next OST. The default stripe-size is 1 MB, and passing a stripe-size of 0 causes the default stripe size to be used. Otherwise, the stripe-size must be a multiple of 64 KB. stripe-start: Stripe start is the first OST to which files are written. The default stripe-start is -1, and passing a stripe-start of -1 causes a random first OST to be chosen. Otherwise
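The lfs setstripe rules above (a stripe size of 0 selects the default, any other size must be a multiple of 64 KB, and a start of -1 lets Lustre pick the first OST) can be encoded when scripting file creation. The sketch below only builds an argument list and does not run anything. The long option spellings --size, --count and --index are reconstructed from the synopsis in this excerpt and should be treated as assumptions; check lfs help on the installed release before use. The function name is hypothetical.

```python
STRIPE_SIZE_UNIT = 64 * 1024  # stripe size must be a multiple of 64 KB


def setstripe_argv(path, stripe_size=0, stripe_count=None, start_ost=-1):
    """Build an argv list for lfs setstripe without executing it.

    stripe_size=0 requests the default stripe size; start_ost=-1 requests
    a randomly chosen first OST. Option names are assumptions (see above).
    """
    if stripe_size != 0 and stripe_size % STRIPE_SIZE_UNIT != 0:
        raise ValueError("stripe size must be 0 or a multiple of 64 KB")
    argv = ["lfs", "setstripe", "--size", str(stripe_size)]
    if stripe_count is not None:
        argv += ["--count", str(stripe_count)]
    argv += ["--index", str(start_ost), path]
    return argv


if __name__ == "__main__":
    print(" ".join(setstripe_argv("/mnt/lustre/newfile", 1 << 20, 2)))
```

Keeping the validation separate from execution makes it easy to log or review the command before handing it to subprocess.run.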
76. a kernel patch. The NOOP is provided for delivering a zero-copy ACK when there is no LNET message to piggyback it on. Note that socklnd may connect to its peers via a bundle of sockets: one for bidirectional ping-pong data and the other two for unidirectional bulk data. However, the message protocol on every socket is as described earlier. Information on the Lustre Networking (LNET) protocol: Lustre layers the socket LND (socklnd) protocol above TCP/IP. Every LNET message is an lnet_hdr_t sent in little-endian (LE) byte order, followed by payload_length bytes of opaque payload data. There are four types of messages. PUT: request to send data contained in the payload. ACK: response to a PUT with ack_wmd != LNET_WIRE_HANDLE_NONE. GET: request to fetch data. REPLY: response to a GET with data in the payload. Typically, ACK and GET messages have 0 bytes of payload. Explanation of "previously skipped # similar messages" in Lustre logs: Unlike syslog, which only suppresses exactly identical lines, Lustre suppresses messages if there are bursts of messages from the same line of code, even if they are not sequential. This avoids duplication of the same event from different clients, or in cases where two or more messages are repeated. All messages are kept in the Lustre kernel debug log, so lctl dk at that time would show all messages (in case they are not wrapped).
77. a mask entry and may contain any number of named user and named group entries. Lustre ACL support depends on the MDS, which needs to be configured to enable ACLs. Use --mountfsoptions to enable ACL support when creating your configuration: $ mkfs.lustre --fsname spfs --mountfsoptions=acl --mdt --mgs /dev/sda Alternately, you can enable ACLs at run time by using the acl option with mount: $ mount -t lustre -o acl /dev/sda /mnt/mdt To check ACLs on the MDS: $ lctl get_param -n mdc.home-MDT0000-mdc-*.connect_flags | grep acl acl To mount the client with no ACLs: $ mount -t lustre -o noacl ibmds2@o2ib:/home /home Lustre ACL support is a system-wide feature; either all clients enable ACLs or none do. Activating ACLs is controlled by the MDS mount options acl / noacl (enable / disable ACLs). Client-side mount options acl / noacl are ignored. You do not need to change the client configuration, and the acl string will not appear in the client /etc/mtab. The client acl mount option is no longer needed. If a client is mounted with that option, then this message appears in the MDS syslog: MDS requires ACL support but client does not. The message is harmless but indicates a configuration issue, which should be corrected. If ACLs are not enabled on the MDS, then any attempts to reference an ACL on a client return an Operation not supported error. Examples: These examples are
78. a new line. For example: mdt1.clusterfs.com,options lnet networks=tcp,/dev/sdb,/mnt/mdt,mgs|mdt AND ost1.clusterfs.com,options lnet networks=tcp,/dev/sda,/mnt/ost1,ost,,192.168.16.34@tcp0 Using CSV with lustre_config. Once you have created the CSV file, you can start to configure the file system by using the lustre_config script. 1. List the available parameters. At the command prompt, type: $ lustre_config lustre_config: Missing csv file! Usage: lustre_config [options] <csv file> This script is used to format and set up multiple lustre servers from a csv file. Options: -h: help and examples. -a: select all the nodes from the csv file to operate on. -w hostname,hostname,...: select the specified list of nodes (separated by commas) to operate on rather than all the nodes in the csv file. -x hostname,hostname,...: exclude the specified list of nodes (separated by commas). -t HAtype: produce High-Availability software configurations. The argument following -t is used to indicate the High-Availability software type. The HA software types which are currently supported are: hbv1 (Heartbeat version 1) and hbv2 (Heartbeat version 2). -n: no net; don't verify network connectivity and hostnames in the cluster. -d: configure Linux MD/LVM devices before formatting the Lustre targets. -f: force-format the Lustre targets using --reformat optio
79. a shared storage device between the two servers. Lustre File Identifier. A collection of integers which uniquely identify a file or object. The FID structure contains a sequence, identity and version number. A group of files that are defined through a directory that represents a file system's start point. FID Location Database. This database maps a sequence of FIDs to a server which is managing the objects in the sequence. Group of I/O transfer operations initiated in the OSC, which is simultaneously going between two endpoints. Tuning the flight group size correctly leads to a full pipe. An RPC made by an OST or MDT to another system, usually a client, to indicate that an extent lock it is holding should be surrendered if it is not in use. If the system is using the lock, then the system should report the object size in the reply to the glimpse callback. Glimpses are introduced to optimize the acquisition of file sizes. The state held by a client to fully recover a transaction sequence after a server failure and restart. A special locking operation introduced by Lustre into the Linux kernel. An intent lock combines a request for a lock with the full information to perform the operation(s) for which the lock was requested. This offers the server the option of granting the lock or performing the operation and informing the client of the operation result without granting a lock. The use of intent locks enables metadata operations ev
80. all OSTs, either via a shared file system or by copying it to the OSTs (pdcp is very useful here). It copies files to groups of hosts and in parallel; it gets installed with pdsh. You can download it at http://sourceforge.net/projects/pdsh. Run a similar e2fsck step on the OSTs. You can run this step simultaneously on OSTs. The mdsdb is read-only in this step; a single copy can be shared by all OSTs. e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostNdb /dev/ostNdev Example: [root@oss161 ~]# e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /dev/sda e2fsck 1.39.cfs1 (29-May-2006) Warning: skipping journal recovery because doing a read-only filesystem check. lustre-OST0000 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Free blocks count wrong (989015, counted=817968). Fix? no Free inodes count wrong (262088, counted=261767). Fix? no Pass 6: Acquiring information for lfsck OST: lustre-OST0000_UUID ost idx 0: compat 0x2 rocomp 0 incomp 0x2 OST: num files = 321 OST: last_id = 321 lustre-OST0000: WARNING: Filesystem still has errors 56 inodes used (0%) 27 non-contiguous inodes (48.2%) # of inodes with ind/dind/tind blocks: 13/0/0 59561 blocks us
81. Lustre 1.8 Operations Manual. Sun Microsystems, Inc. www.sun.com. Part No. 821-0035-10. Lustre manual version: Lustre_1.8_man_v1.2. October 2009. Copyright 2007-2009 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. U.S. Government Rights: Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements. Sun, Sun Microsystems, the Sun logo and Lustre are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. Products covered by and information contained in this service manual are controlled by U.S. Export Control laws and may be subject to the export or import laws in other countries. Nuclear, missile, chemical biological weapons or nuclear maritime end uses or end users, whether direct or indirect, are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion lists, including, but not limited to, the denied persons and specially designated nationals lists, is strictly prohibited. DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNES
82. and ZONE_NORMAL. On 32-bit systems, low system memory is, at most, 896M no matter how much RAM is installed. The size of the default router buffer puts big pressure on low memory zones, making it more likely that an out-of-memory (OOM) situation will occur. This is a known cause of router hangs. Lowering the value of the large_router_buffers parameter can circumvent this problem, but at the cost of penalizing router performance by making large messages wait longer for buffers. On 64-bit architectures, the ZONE_HIGHMEM zone is always empty. Router buffers can come from all available memory and out-of-memory hangs do not occur. Therefore, we recommend using 64-bit routers. 2.4.3 Downed Routers. There are two mechanisms to update the health status of a peer or a router. First, LNET can actively check the health status of all routers and mark them as dead or alive automatically. By default, this is off. To enable it, set auto_down and, if desired, check_routers_before_use. This initial check may cause a pause equal to router_ping_timeout at system startup, if there are dead routers in the system. Second, when there is a communication error, all LNDs notify LNET that the peer (not necessarily a router) is down. This mechanism is always on, and there is no parameter to turn it off. However, if you set the LNET module parameter auto_down to 0, LNET ignores all such peer-down notifications. Several key differe
83. and so on. The risk increases with increase in the number of clients directly accessing the storage. 21.4.11 Handling/Debugging "Bind: Address already in use" Error. During startup, Lustre may report a bind: Address already in use error and refuse to start the operation. This is caused by a portmap service (often NFS locking) which starts before Lustre and binds to the default port 988. You must have port 988 open from firewall or IP tables for incoming connections on the client, OSS, and MDS nodes. LNET will create three outgoing connections on available, reserved ports to each client-server pair, starting with 1023, 1022 and 1021. Unfortunately, you cannot set sunrpc to avoid port 988. If you receive this error, do the following. Start Lustre before starting any service that uses sunrpc. Use a port other than 988 for Lustre. This is configured in /etc/modprobe.conf as an option to the LNET module. For example: options lnet accept_port=988 Add modprobe ptlrpc to your system startup scripts before the service that uses sunrpc. This causes Lustre to bind to port 988 and sunrpc to select a different port. Note: You can also use the sysctl command to mitigate the NFS client from grabbing the Lustre service port. However, this is a partial workaround as other user-space RPC servers still have the ability to grab the port.
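Before starting Lustre, it can help to confirm whether another service already holds the LNET port. The following is a minimal sketch under stated assumptions: the function name is hypothetical, it only probes by attempting a TCP bind, and binding a reserved port such as 988 normally requires root, in which case the permission error is re-raised rather than reported as "in use".

```python
import errno
import socket

LUSTRE_ACCEPT_PORT = 988  # default port named in the text above


def port_is_free(port, host="0.0.0.0"):
    """Return False if a TCP bind fails with 'Address already in use'.

    Any other failure (for example, insufficient privilege for a
    reserved port) is re-raised so it is not mistaken for a busy port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
    return True


if __name__ == "__main__":
    try:
        print("port free:", port_is_free(LUSTRE_ACCEPT_PORT))
    except PermissionError:
        print("run as root to probe reserved ports")
```

A False result before Lustre starts suggests that a sunrpc-based service has taken the port, which is the situation the steps above address.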
84. as Elan, Myrinet and InfiniBand. Support for many commonly-used network types such as InfiniBand and IP. High availability and recovery features enabling transparent recovery in conjunction with failover servers. Simultaneous availability of multiple network types with routing between them. LNET includes LNDs to support many network types, including: InfiniBand: OpenFabrics versions 1.0 and 1.2, Mellanox Gold, Cisco, Voltaire, and Silverstorm. TCP: Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB. Quadrics: Elan3, Elan4. Myrinet: GM, MX. Cray: Seastar, RapidArray. The LNDs that support these networks are pluggable modules for the LNET software stack. LNET offers extremely high performance. It is common to see end-to-end throughput over GigE networks in excess of 110 MB/sec, InfiniBand double data rate (DDR) links reach bandwidths up to 1.5 GB/sec, and 10GigE interfaces provide end-to-end bandwidth of over 1 GB/sec. 1.7 Lustre Failover and Rolling Upgrades. Lustre offers a robust, application-transparent failover mechanism that delivers call completion. This failover mechanism, in conjunction with software that offers interoperability between versions, is used to support rolling upgrades of file system software on active clusters. The Lustre recovery feature allows servers to be upgraded without taking down the system. The server is simply taken off
85. as much space on the MDT as on the OSTs This is not a very common configuration for Lustre Chapter 20 Lustre Tuning 20 5 20 3 3 20 3 3 1 20 3 3 2 Overriding Default Formatting Options To override the default formatting options for any of the Lustre backing file systems use the mkfsoptions backing fs options argument to mkfs lustre to pass formatting options to the backing mkfs For all options to format backing ext3 and Idiskfs file systems see the mke2fs 8 man page this section only discusses several Lustre specific options Number of Inodes for the MDT The MDT consumes one inodel for each file in the Lustre file system The default MDT inode ratio is 1024 bytes per inode To override the inode ratio use the option i lt bytes per inode gt for example mkfsoptions i 4096 to create one inode per 4096 bytes of file system space Note Use this ratio to make sure that Extended Attributes EAs can fit on the inode as well Otherwise you have to make an indirect allocation to hold the EAs which impacts performance owing to the additional seeks Alternately to set an absolute number of inodes use the N lt number of inodes gt option To avoid unintentional mistakes do not specify the i option with an inode ratio below one inode per 1024 bytes Use the N option instead By default a 2 TB MDT has 512M inodes Currently the largest supported file system size is 8 TB which holds 2B inodes With a
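The relationship between file system size, bytes-per-inode ratio and inode count is simple division, which makes the quoted figures easy to sanity-check. The sketch below is illustrative arithmetic only (the function name is hypothetical). It uses the 4096 bytes-per-inode ratio from the mkfsoptions example above, and it rejects ratios below 1024 bytes per inode, following the warning in the text.

```python
TIB = 1024 ** 4


def inode_count(fs_bytes, bytes_per_inode):
    """Number of inodes created for a given size and bytes-per-inode ratio."""
    if bytes_per_inode < 1024:
        # The text advises against ratios below one inode per 1024 bytes.
        raise ValueError("use an absolute inode count (-N) instead")
    return fs_bytes // bytes_per_inode


if __name__ == "__main__":
    for size_tib in (2, 8):
        print(size_tib, "TiB at 4096 bytes/inode:", inode_count(size_tib * TIB, 4096))
```

At 4096 bytes per inode, 2 TiB works out to 512 Mi inodes and 8 TiB to 2 Gi inodes, which is consistent with the 512M and 2B figures quoted in the text.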
86. assigning a unique keytab to each client node, create a general lustre_root principal and keytab, and install the keytab on as many client nodes as needed. kadmin> addprinc -randkey lustre_root@REALM kadmin> ktadd -e aes128-cts:normal lustre_root@REALM Remember that if you use a general keytab, then one compromised client means that all client nodes are insecure. General Installation Notes. The host.domain should be the FQDN in your network. Otherwise, the server may not recognize any GSS request. To install a keytab entry on a node, use the ktutil utility. Lustre supports these encryption types for MIT Kerberos 5, v1.4 and higher: des-cbc-crc, des-cbc-md5, des3-hmac-sha1, aes128-cts, aes256-cts, arcfour-hmac-md5. For MIT Kerberos 1.3.x, only des-cbc-md5 works because of a known issue between libgssapi and the Kerberos library. Note: The encryption type (or enctype) is an identifier specifying the encryption mode and hash algorithms. Each Kerberos key has an associated enctype that identifies the cryptographic algorithm and mode used when performing cryptographic operations with the key. It is important that the enctypes requested by the client are actually supported on the system hosting the client. This is the case if the defaults that control enctypes are not overridden. Footnote 1 (ktutil): Kerberos keytab file maintenance utility. 11.2.1
87. available to the Red Hat Cluster Suite Overview. For more information on installing and configuring Cluster Manager for Lustre failover, and testing MDS failover, see Cluster Manager. SNMP Monitoring: Lustre has a native SNMP module which enables you to use various standard SNMP monitoring packages (anything using RRDTool as a backend) to track performance. For more information on installing, building and using the SNMP module, see Lustre SNMP Module. CollectL: CollectL is another tool that can be used to monitor Lustre. You can run CollectL on a Lustre system that has any combination of MDSs, OSTs and clients. The collected data can be written to a file for continuous logging and played back at a later time. It can also be converted to a format suitable for plotting. For more information about CollectL, see http://collectl.sourceforge.net. Lustre-specific documentation is also available. See http://collectl.sourceforge.net/Tutorial-Lustre.html. Other Monitoring Options: Another option is to script a simple monitoring solution which looks at various reports from ipconfig, as well as the procfs files generated by Lustre. 21.2 Troubleshooting Lustre. Several resources are available to help you troubleshoot Lustre. This section describes error numbers, error messages and logs. 21.2.1 Error Numbers. Error numbers for Lustre come from the Linux errno.h, and are l
88. being written to the OST. The new file, called degraded, located in the OST's directory under /proc/fs/lustre/obdfilter/, marks the OST as degraded if it is written with a 1 (or any non-zero value) until a 0 is written to it. Therefore, 1 should be written to the file when the array becomes degraded, and 0 should be written when the array becomes healthy. If the OST is remounted due to a reboot or other condition, the flag resets to 0. 10.2 Insights into Disk Performance Measurement. Several tips and insights for disk performance measurement are provided below. Some of this information is specific to RAID arrays and/or the Linux RAID implementation. Performance is limited by the slowest disk. Before creating a software RAID array, benchmark all disks individually. We have frequently encountered situations where drive performance was not consistent for all devices in the array. Replace any disks that are significantly slower than the rest. Disks and arrays are very sensitive to request size. To identify the optimal request size for a given disk, benchmark the disk with different record sizes ranging from 4 KB to 1 to 2 MB. Note: Try to avoid sync writes; probably a subsequent write would make the stripe full and no reads will be needed. Try to configure RAID arrays and the application so that most of the writes are full-stripe and stripe-aligned. 10.3 Lustre Software RAID Support. A nu
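A RAID monitoring hook can toggle the degraded flag described above by writing 1 or 0 to the file. The sketch below is a hypothetical helper, not a Lustre tool: the caller supplies the full path to the OST's degraded file, since the exact directory name depends on the OST and is not spelled out in this excerpt.

```python
def set_degraded(flag_path, degraded):
    """Write 1 (array degraded) or 0 (array healthy) to the given flag file.

    flag_path is the OST's 'degraded' file; its location is an input here
    rather than something this sketch tries to discover.
    """
    with open(flag_path, "w") as flag:
        flag.write("1" if degraded else "0")


if __name__ == "__main__":
    import os
    import tempfile

    # Demonstrate against a scratch file instead of a live /proc entry.
    with tempfile.TemporaryDirectory() as scratch:
        demo = os.path.join(scratch, "degraded")
        set_degraded(demo, True)
        with open(demo) as check:
            print(check.read())
```

Because the flag resets to 0 when the OST is remounted, a monitoring script would need to write it again after a reboot if the array is still degraded.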
89. block device name> <mount point> c. Mount the file system on the clients. On each client node, run: mount -t lustre <MGS node>:/<fsname> <mount point> If you have a problem downgrading Lustre, contact us via the Bugzilla bug tracker. 13.5.2 Performing a Rolling Downgrade. This procedure describes a rolling downgrade, in which one Lustre component (server or client) is downgraded and restarted at a time while the file system is running. If you want to downgrade the complete Lustre file system or multiple components at a time, requiring a file system shutdown, see Performing a Complete File System Downgrade. Note: If the Lustre component to be downgraded is an OSS in a failover pair, follow these special downgrade steps to minimize downtime. 1. Fail over the server to its peer server, so the file system remains available. 2. Install the Lustre 1.8.x packages on the idle server. 3. Unload the old Lustre modules on the idle server by either rebooting the node, OR removing the Lustre modules manually by running the lustre_rmmod command several times and checking the currently loaded modules with the lsmod command. 4. Fail back services to the idle (now upgraded) server. 5. Repeat Steps 1 to 4 on the peer server. This limits the outage per OSS to a single server for as long as it takes to fail over. Chapter 13 Upgrading and Downgrading Lus
90. block devices and zpools, similar to what can be expected from a large Lustre OSS server when handling the load from many clients. The program generates and executes the I/O load in a manner substantially similar to an OSS, that is, multiple threads take work items from a simulated request queue. It forks a CPU load generator to simulate running on a system with additional load. PIOS can read/write data to a single shared file or multiple files (default is a single file). To specify multiple files, use the --fpp option. (It is better to measure with both single and multiple files.) If the final argument is a file, block device or zpool, PIOS writes to RegionCount regions in one file. PIOS issues I/O commands of size ChunkSize. The regions are spaced apart Offset bytes (or, in the case of many files, the region starts at Offset bytes). In each region, RegionSize bytes are written or read, one ChunkSize I/O at a time. Note that ChunkSize <= RegionSize <= Offset. Multiple runs can be specified with comma-separated lists of values for ChunkSize, Offset, RegionCount, ThreadCount and RegionSize. Multiple runs can also be specified by giving a starting (low) value, increase (in percent) and high value for each of these arguments. If a low value is given, no value list or value may be supplied. Every run is given a timestamp, and the timestamp and offset are written with every chunk (to allow verification). Before every run, PIOS executes the pre-run
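The single-file layout described above (regions spaced Offset bytes apart, each filled one ChunkSize I/O at a time) can be visualised by listing the byte ranges a run would touch. This is a minimal sketch, not PIOS code. It assumes the ordering constraint is non-strict (ChunkSize <= RegionSize <= Offset) and that RegionSize is a whole number of chunks; the function name is hypothetical.

```python
def region_chunks(chunk_size, region_size, offset, region_count):
    """List (file_offset, length) pairs for every chunk in a shared file."""
    if not (chunk_size <= region_size <= offset):
        raise ValueError("expected ChunkSize <= RegionSize <= Offset")
    if region_size % chunk_size != 0:
        raise ValueError("RegionSize should be a multiple of ChunkSize")
    chunks = []
    for region in range(region_count):
        region_start = region * offset  # regions are spaced Offset bytes apart
        for index in range(region_size // chunk_size):
            chunks.append((region_start + index * chunk_size, chunk_size))
    return chunks


if __name__ == "__main__":
    for start, length in region_chunks(4, 8, 16, 2):
        print(start, length)
```

With ChunkSize 4, RegionSize 8 and Offset 16, each region leaves an 8-byte gap before the next one begins, which is why Offset must be at least RegionSize.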
91. cfs21:~# mount -t lustre /dev/volgroup/MDT /mnt/mdt cfs21:~# mount -t lustre /dev/volgroup/OST0 /mnt/ost cfs21:~# mount -t lustre cfs21:/main /mnt/main Note: For more information on working with LVM snapshots and a Lustre file system, see LVM Snapshots on Lustre Targets. 4.3 Basic Lustre Administration. Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre administration tasks: Specifying the File System Name; Mounting a Server; Unmounting a Server; Working with Inactive OSTs; Finding Nodes in the Lustre File System; Mounting a Server Without Lustre Service; Specifying Failout/Failover Mode for OSTs; Running Multiple Lustre File Systems; Setting Lustre Parameters; Running the Writeconf Command; Removing and Restoring OSTs; Changing a Server NID; Aborting Recovery; Failover; Unmounting a Server without Failover; Unmounting a Server with Failover; Changing the Address of a Failover Node. 4.3.1 Specifying the File System Name. The file system name is limited to 8 characters. We have encoded the file system and target information in the disk label, so you can mount by label. This allows system administrators to move disks around without worrying about issues such as SCSI disk reordering or getting the /dev/device
92. chmod Note: Rebuilding individual POSIX tests is not straightforward due to the reliance on tcc. You may have to substitute the edited source files into the source tree (following the installation described above) and let the existing POSIX install scripts do the work. The installation scripts (specifically, /home/tet/test_sets/run_testsets.sh) contain relevant commands to build the test suite, similar to tcc -p -b -s $HOME/scen.bld, but it does not work outside the script. CHAPTER 17 Benchmarking. The benchmarking process involves identifying the highest standard of excellence and performance, learning and understanding these standards, and finally adapting and applying them to improve the performance. Benchmarks are most often used to provide an idea of how fast any software or hardware runs. Complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is difficult to analyze. Moreover, systems have different features and optimizations, so no single benchmark is always suitable. The variety of workloads that these systems experience also adds to this difficulty. One of the most widely researched areas in storage subsystems is file system design, implementation, and performance. This chapter describes benchmark suites to test Lustre and includes the following sections: Bonnie++ Benchmark
93. d. Install the e2fsprogs package. Use the rpm -ivh command to install the e2fsprogs package. For example: $ rpm -ivh e2fsprogs-<ver> e. (Optional) If you want to add optional packages to your Lustre system, install them now. Note: If e2fsprogs is already on your system, install the Lustre-specific version by using rpm -Uvh to update the existing e2fsprogs package. Run: $ rpm -Uvh e2fsprogs-<ver> The rpm command options --force or --nodeps are not required to install or update e2fsprogs. We specifically recommend that you not use these options. 4. Verify that the boot loader (grub.conf or lilo.conf) has been updated to load the patched kernel. 5. Reboot the patched clients and the servers. a. If you applied the patched kernel to any clients, reboot them. Unpatched clients do not need to be rebooted. b. Reboot the servers. Once all machines have rebooted, go to Configuring Lustre to configure Lustre Networking (LNET) and the Lustre file system. 3.3 Installing Lustre from Source Code. Installing Lustre from source involves several procedures: patching the core kernel, configuring it to work with Lustre, and creating Lustre and kernel RPMs from source code. The easier installation method is to install Lustre from packaged binaries (RPMs). For more information on this installation method, see Installing Lustr
94. debugging tools combined by the operating system and Lustre itself. These tools are: Debug logs: A circular debug buffer holds a substantial amount of debugging information (MBs or more) during the first insertion of the kernel module. When this buffer fills up, it wraps and discards the oldest information. Lustre offers additional debug messages that can be written out to this kernel log. The debug log holds Lustre internal logging, separate from the error messages printed to syslog or console. Entries to the Lustre debug log are controlled by the mask set by /proc/sys/lnet/debug. The log defaults to 5 MB per CPU, and is a ring buffer. Newer messages overwrite older ones. The default log size can be increased, as a busy system will quickly overwrite the 5 MB default. Debug daemon: The debug daemon controls logging of debug messages. /proc/sys/lnet/debug: This log contains a mask that can be used to delimit the debugging information written out to the kernel debug logs. lctl: This tool is used to manually dump the log and post-process logs that are dumped automatically. leak_finder.pl: This is a useful program which helps find memory leaks in the code. strace: This tool allows a system call to be traced. /var/log/messages: syslogd prints fatal or serious messages at this log. Crash dumps: On crash dump-enabled kernels, sysrq c produces a crash dump. Lustre enhances this crash dump with a log dump, the last 64 KB of the log
95. followed by a disk failure before the RAID array can be re-synchronized, the disk file system needs a file system check and any data that was being written during the power loss may be corrupted. If a RAID array does not guarantee before/after semantics, the same requirement holds. We consider this to be a requirement for most arrays that are used with Lustre, including the successful and popular DDN arrays. With RAID6, this check is not required with a single disk failure, but is required with a double failure upon reboot after an abrupt interruption of the system. 10.1.4 Performance Tradeoffs. Writeback cache can dramatically increase write performance on any type of RAID array. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence writes. This causes problems for journaling. If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this. Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks. 10.1.5 Formatting Options for RAID Devices. When formatting a file system on a RAID device, it is beneficial to specify additional parameters at the ti
96. for OSTs
4.3.8 Running Multiple Lustre File Systems
4.3.9 Setting Lustre Parameters
4.3.10 Running the Writeconf Command
4.3.11 Removing and Restoring OSTs
4.3.12 Changing a Server NID
4.3.13 Aborting Recovery
4.4 More Complex Configurations
4.4.1 Failover
4.5 Operational Scenarios
4.5.1 Unmounting a Server (without Failover)
4.5.2 Unmounting a Server (with Failover)
4.5.3 Changing the Address of a Failover Node
5. Service Tags
5.1 Introduction to Service Tags
5.2 Using Service Tags
5.2.1 Installing Service Tags
5.2.2 Discovering and Registering Lustre Components
5.2.3 Information Registered with Sun
6. Configuring Lustre - Examples
6.1 Simple TCP Network
6.1.1 Lustre with Combined MGS/MDT
6.1.2 Lustre with Separate MGS and MDT
7. More Complicated Configurations
7.1 Multihomed Servers
7.1.1 Modprobe.conf
7.1.2 Start Servers
7.1.3 Start Clients
7.2 Elan to TCP Routing
7.2.1 Modprobe.conf
7.2.2 Start servers
7.2.3 Start clients
7.3 Load Balancing with InfiniBand
7.3.1 Modprobe.conf
7.3.2 Start servers
7.3.3 Start clients
7.4 Multi-Rail Configurations with LNET
8. Failover
8.1 What is Failover?
8.1.1 The Power Management Software
8.1.2 Power Equipment
8.1.3 Hea
97. given PTLRPC connection. It covers two parts of messages: the RPC message and BULK data. You can set either part in one of the following modes:
- null: No protection
- integrity: Data integrity protection (checksum or signature)
- privacy: Data privacy protection (encryption)
11.2.2.3 Customized Flavor
In most situations, you do not need a customized flavor; a basic flavor is sufficient for regular use. But to some extent, you can customize the flavor string. The flavor string format is:
base_flavor[-bulk{nip}[:hash_alg[/cipher_alg]]]
Here are some examples of customized flavors:
- plain-bulkn: Use plain on the RPC message (null protection), and no protection on the bulk transfer.
- krb5i-bulkn: Use krb5i on the RPC message, but do not protect the bulk transfer.
- krb5p-bulki: Use krb5p on the RPC message, and protect data integrity of the bulk transfer.
- krb5p-bulkp:sha512/aes256: Use krb5p on the RPC message, and protect data privacy of the bulk transfer by algorithm SHA512 and AES256.
Currently, Lustre supports these bulk data cryptographic algorithms:
- Hash: adler32, crc32, md5, sha1 / sha256 / sha384 / sha512, wp256 / wp384 / wp512
- Cipher: arc4, aes128 / aes192 / aes256, cast128 / cast256, twofish128 / twofish256
11.2.2.4 Specifying Security Flavors
If you have not specified a security flavor, the CLIENT MDT
98. group of OSTs to be named for file striping purposes. If you use OST pools, be aware that running the writeconf command erases all pools information (as well as any other parameters set via lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed via a script, so they can be reproduced easily after a writeconf is performed.
4.3.11 Removing and Restoring OSTs
OSTs can be removed from and restored to a Lustre file system. Currently in Lustre, removing an OST really means that the OST is "deactivated" in the file system, not permanently removed. A removed OST still appears in the file system; do not create a new OST with the same name.
4.3.11.1 Removing an OST from the File System
When removing an OST, remember that the MDT does not communicate directly with OSTs. Rather, each OST has a corresponding OSC which communicates with the MDT. It is necessary to determine the device number of the OSC that corresponds to the OST. Then, you use this device number to deactivate the OSC on the MDT.
To remove an OST from the file system:
1. For the OST to be removed, determine the device number of the corresponding OSC on the MDT.
a. List all OSCs on the node, along with their device numbers. Run:
lctl dl | grep osc
This is sample lctl dl | grep osc output:
11 UP osc lustre-OST-0000-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2 20318ebdb4 5
12 UP osc
99. > kernel-ib-<ver> lustre-modules-<ver> lustre-ldiskfs-<ver>
c. Install the utilities/userspace packages. Use the rpm -ivh command to install the utilities packages. For example:
rpm -ivh lustre-<ver>
d. Install the e2fsprogs package. Make sure the e2fsprogs package downloaded in Step 5 is unpacked, and use the rpm -i command to install it. For example:
rpm -i e2fsprogs-<ver>
e. (Optional) If you want to add optional packages to your Lustre system, install them now.
5. Verify that the boot loader (grub.conf or lilo.conf) has been updated to load the patched kernel.
6. Reboot the patched clients and the servers.
a. If you applied the patched kernel to any clients, reboot them. Unpatched clients do not need to be rebooted.
b. Reboot the servers.
Once all the machines have rebooted, the next steps are to configure Lustre Networking (LNET) and the Lustre file system. See Configuring Lustre.
Installing Lustre with a Third-Party Network Stack
When using third-party network hardware, you must follow a specific process to install and recompile Lustre. This section provides an installation example, describing how to install Lustre 1.6.6 while using the Myricom MX 1.2.7 driver. The same process is used for other third-party network stacks, by replacing MX-specific references in Step 2 with the stack-specific build and using the p
100. hardware requirements that must be met to configure Lustre for failover.
Hardware Preconditions
- The setup must consist of a failover pair where each node of the pair has access to shared storage. If possible, the storage paths should be identical (nodeA:/dev/sda == nodeB:/dev/sda).
Note: A failover pair is a combination of two or more separate nodes. Each node has access to the same shared disk.
- Shared storage can be arranged in an active/passive (MDS, OSS) or active/active (OSS only) configuration. Each shared resource has a primary (default) node. Heartbeat assumes that the non-primary node is secondary for that resource.
- The two nodes must have one or more communication paths for Heartbeat traffic. A communication path can be:
  - Dedicated Ethernet
  - Serial line (serial crossover cable)
Failure of all Heartbeat communication is a serious condition called "split-brain". Heartbeat software resolves this situation by powering down one node.
- The two nodes must have a method to control one another's state; RPC (remote power control) hardware is the best choice. There must be a script to start and stop a given node from the other node. STONITH provides soft power control methods (SSH, meatware), but these cannot be used in a production situation.
- Heartbeat provides a remote ping service that is used to monitor the health of the external network. If you wish to use the ipfail service, then you must have a very reliable external addr
101. hopcount, if omitted, defaults to 1 (the remote network is adjacent). It is an error to specify routes to the same destination with routers on different local networks. If the target network string contains no expansions, then the hopcount defaults to 1 and may be omitted (that is, the remote network is adjacent). In practice, this is true for most multi-network configurations. It is an error to specify an inconsistent hop count for a given target network. This is why an explicit hopcount is required if the target network string specifies more than one network.
30.2.1.4 forwarding
This is a string that can be set either to "enabled" or "disabled" for explicit control of whether this node should act as a router, forwarding communications between all local networks. A standalone router can be started by simply starting LNET (modprobe ptlrpc) with appropriate network topology options.
Variable: Description
acceptor: The acceptor is a TCP/IP service that some LNDs use to establish communications. If a local network requires it and it has not been disabled, the acceptor listens on a single port for connection requests that it redirects to the appropriate local network. The acceptor is part of the LNET module and configured by the following options:
- secure: Accept connections only from reserved TCP ports (< 1023).
- all: Accept connections from any TCP port. NOTE: this is
102. is triggered automatically and is transparent to users. If an old quota file does not exist or looks broken, then the new v2 quota file will be empty. In case of an error, details can be found in the kernel log of the corresponding MDS/OST. During conversion of a v1 quota file to a v2 quota file, the v2 quota file is marked as broken, to avoid it being used if a crash occurs. The quota module does not use broken quota files (keeping quota off). In most situations, Lustre administrators do not need to set specific versioning options. Upgrading Lustre without using quota_type to force specific quota file versions results in quota files being upgraded automatically to the latest version. The option ensures backward compatibility, preventing a quota file upgrade to a version which is not supported by earlier Lustre versions.
9.1.5 Lustre Quota Statistics
Lustre includes statistics that monitor quota activity, such as the kinds of quota RPCs sent during a specific period, the average time to complete the RPCs, etc. These statistics are useful to measure performance of a Lustre file system. Each quota statistic consists of a quota event and min_time, max_time and sum_time values for the event.
Quota Event: Description
- sync_acq_req
- sync_rel_req
- async_acq_req
- async_rel_req
- wait_for_blk_quota (lquota_chkquota)
- wait_for_ino_quota (lquota_chkquota)
- wait_for_blk_quota (lquo
103. like Kerberos and libgssapi.
11.2.1.7 Running GSS Daemons
If you turn on GSS between an MDT-OST or MDT-MDT, GSS treats the MDT as a client. You should run lgssd on the MDT. There are two types of GSS daemons: lgssd and lsvcgssd. Before starting Lustre, make sure they are running on each node:
- OST: lsvcgssd
- MDT: lsvcgssd
- CLI: none
Note: Verbose logging can help you make sure Kerberos is set up correctly. To use verbose logging and run it in the foreground, run:
lsvcgssd -vvv -f
-v increases the verbose level of a debugging message by 1. For example, to set the verbose level to 3, run lsvcgssd -v -v -v. -f runs lsvcgssd in the foreground, instead of as a daemon.
We are maintaining a patch against nfs-utils, and bringing necessary patched files into the Lustre tree. After a successful build, GSS daemons are built under lustre/utils/gss and are part of lustre-xxxx.rpm.
11.2.2 Types of Lustre-Kerberos Flavors
There are three major flavors in which you can configure Lustre with Kerberos:
- Basic Flavors
- Security Flavor
- Customized Flavor
Select a flavor depending on your priorities and preferences.
11.2.2.1 Basic Flavors
Currently, we support six basic flavors: null, plain, krb5n, krb5a, krb5i, and krb5p.
Basic Flavor | Authentication | RPC Message Protection | Bulk Data Protection | Remarks
null | N/A | N/A | N/A |
pla
104. <-> decimal translations. Use GDB:
(gdb) p /x 15028
$2 = 0x3ab4
Or bc:
echo "obase=16; 15028" | bc
1. Determine a reasonable value for LAST_ID. Check on the MDS:
mount -t ldiskfs /dev/<mdsdev> /mnt/mds
od -Ax -td8 /mnt/mds/lov_objid
There is one entry for each OST, in OST index order. This is what the MDS thinks the last in-use object is.
2. Determine the OST index for this OST:
od -Ax -td4 /mnt/ost/last_rcvd
It will have it at offset 0x8c.
3. Check on the OST. With debugfs, check LAST_ID:
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/XXX ; od -Ax -td8 /tmp/LAST_ID
4. Check objects on the OST:
mount -rt ldiskfs /dev/{ostdev} /mnt/ost
# note the ls below is a number one and not a letter L
ls -1s /mnt/ost/O/0/d* | grep -v '[a-z]' | sort -k2 -n > /tmp/objects.{diskname}
tail -30 /tmp/objects.{diskname}
This shows you the OST state. There may be some pre-created orphans; check for zero-length objects. Any zero-length objects with IDs higher than LAST_ID should be deleted. New objects will be pre-created. If the OST LAST_ID value matches that for the objects existing on the OST, then it is possible the lov_objid file on the MDS is incorrect. Delete the lov_objid file on the MDS and it will be re-created from the LAST_ID on the OSTs.
If you determine the LAST_ID file on the OST is incorrect (that is, it does not match what objects exist, does not m
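The hex and decimal translation used in this procedure can also be done with the shell's own printf, which avoids starting gdb or bc. This is a minimal sketch; the function names dec_to_hex and hex_to_dec are illustrative helpers, not Lustre tools.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Lustre) for converting object IDs.

# Print a decimal number as lowercase hex, without the 0x prefix.
dec_to_hex() {
    printf '%x\n' "$1"
}

# Print a hex number (with or without a leading 0x) as decimal.
hex_to_dec() {
    printf '%d\n' "0x${1#0x}"
}

dec_to_hex 15028      # the value used in the gdb and bc examples
hex_to_dec 0x3ab4     # and back again
```

The same helper is handy for the last_rcvd offset quoted in the procedure: hex_to_dec 8c gives the byte offset in decimal.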
105. lustre-OST-0001-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2 20318ebdb4 5
13 IN osc lustre-OST-0000-osc lustre-MDT0000-mdtlov_UUID 5
14 UP osc lustre-OST-0001-osc lustre-MDT0000-mdtlov_UUID 5
b. Determine the device number of the OSC that corresponds to the OST to be removed.
2. Temporarily deactivate the OSC on the MDT. On the MDT, run:
mdt> lctl --device <devno> deactivate
For example, based on the command output in Step 1, to deactivate device 13 (the MDT's OSC for OST-0000), the command would be:
mdt> lctl --device 13 deactivate
This marks the OST as inactive on the MDS, so no new objects are assigned to the OST. This does not prevent use of existing objects for reads or writes.
Note: Do not deactivate the OST on the clients. Doing so causes errors (EIOs) and the copy out to fail.
Caution: Do not use lctl conf_param to deactivate the OST. It permanently sets a parameter in the file system configuration.
3. Discover all files that have objects residing on the deactivated OST. Run:
lfs find --obd {OST UUID} /<mount_point>
4. Copy (not move) the files to a new directory in the file system. Copying the files forces object re-creation on the active OSTs.
5. Move (not copy) the files back to their original directory in the file system. Moving the files causes the original files to be deleted, as the copies replace them.
6. Once all files have been moved, pe
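Picking the right device number out of the lctl dl listing by eye is error-prone, because each OST appears twice: once for the client-side OSC and once for the MDT's OSC. The sketch below filters for the MDT's entry with awk. It assumes the column layout of the sample output in Step 1 (device number, state, type, name, UUID, reference count); the function name mdt_osc_devno and the shortened UUIDs in the sample data are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print the device number of the MDT's OSC for one OST.
# Reads `lctl dl` style output on stdin; $1 is the OSC name to look for.
mdt_osc_devno() {
    awk -v name="$1" \
        '$3 == "osc" && $4 == name && $5 ~ /mdtlov/ { print $1 }'
}

# Sample data modeled on the output in Step 1 (client UUIDs shortened).
sample='11 UP osc lustre-OST-0000-osc-cac94211 4ea5b30f 5
12 UP osc lustre-OST-0001-osc-cac94211 4ea5b30f 5
13 IN osc lustre-OST-0000-osc lustre-MDT0000-mdtlov_UUID 5
14 UP osc lustre-OST-0001-osc lustre-MDT0000-mdtlov_UUID 5'

printf '%s\n' "$sample" | mdt_osc_devno lustre-OST-0000-osc
```

On a live MDS the same filter would be fed from lctl dl instead of the sample text, and the resulting number passed to lctl --device <devno> deactivate.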
106. lustre/dir
Deletes a default stripe pattern on a given directory. New files use the default striping pattern.
lfs getstripe -v /mnt/lustre/file1
Lists the detailed object allocation of a given file.
lfs setstripe --pool my_pool -c 2 /mnt/lustre/file
Creates a file striped on two OSTs from the pool my_pool.
lfs poollist /mnt/lustre/
Lists the pools defined for the mounted Lustre file system /mnt/lustre.
lfs poollist my_fs.my_pool
Lists the OSTs which are members of the pool my_pool in file system my_fs.
lfs find /mnt/lustre/
Efficiently lists all files in a given directory and its subdirectories.
lfs find /mnt/lustre -mtime +30 -type f -print
Recursively lists all regular files in a given directory more than 30 days old.
lfs find --obd OST2-UUID /mnt/lustre/
Recursively lists all files in a given directory that have objects on OST2-UUID.
lfs find /mnt/lustre --pool poolA
Finds all directories/files associated with poolA.
lfs find /mnt/lustre --pool ""
Finds all directories/files not associated with a pool.
lfs find /mnt/lustre ! --pool ""
Finds all directories/files associated with pool.
lfs check servers
Checks the status of all servers (MDT, OST).
lfs osts
Lists all OSTs in the file sys
107. lustre/include/linux/obd_support.h for the definitions of individual failure locations. The default value is 0 (zero).
sysctl -w lustre.fail_loc=0x80000122 # drop a single reply
/proc/sys/lustre/dump_on_timeout
This triggers dumps of the Lustre debug log when timeouts occur. The default value is 0 (zero).
/proc/sys/lustre/dump_on_eviction
This triggers dumps of the Lustre debug log when an eviction occurs. The default value is 0 (zero). By default, debug logs are dumped to the /tmp folder; this location can be changed via /proc.
22.1.3 Adaptive Timeouts
Lustre 1.8 introduces an adaptive mechanism to set RPC timeouts. This feature causes servers to track actual RPC completion times, and to report estimated completion times for future RPCs back to clients. The clients use these estimates to set their future RPC timeout values. If server request processing slows down for any reason, the RPC completion estimates increase, and the clients allow more time for RPC completion. If RPCs queued on the server approach their timeouts, then the server sends an early reply to the client, telling the client to allow more time. In this manner, clients avoid RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout values decrease, allowing faster detection of non-responsive servers and faster attempts to reconnect to a server's failover partner.
Note: In Lustr
108. - IOR Benchmark
- IOzone Benchmark
17.1 Bonnie++ Benchmark
Bonnie++ is a benchmark suite that performs a number of simple tests of hard drive and file system performance. You can then decide which tests are important and how to compare different systems after running it. Each Bonnie++ test reports the amount of work done per second and the percentage of CPU time utilized. There are two sections to the program's operations. The first tests I/O throughput in a fashion designed to simulate some types of database applications. The second tests creating, reading and deleting many small files in a fashion similar to the usage patterns
Bonnie++ tests hard drive and file system performance through sequential I/O and random seeks, exercising file system activity that is known to cause bottlenecks in I/O-intensive applications.
To install and run the Bonnie++ benchmark:
1. Download the most recent version of the Bonnie++ software: http://www.coker.com.au/bonnie++/
2. Install and run the Bonnie++ software (per the ReadMe file accompanying the software).
Sample output:
Version 1.03 Sequential Output Sequential Input Random Per Chr Block Rewrite Per Chr Block Seeks MachineSize K sec CP K sec CP K sec CP K sec CP K sec CP sec SCP mds 2G 3811822 21245 10 51967 10 90 00 SSS Sequential Create R
109. {mdsdev} /mnt/mds
cd /mnt/mds
getfattr -R -d -m '.*' -P . > ea.bak
tar czvf {backup file}.tgz --sparse .
cd -
umount /mnt/mds
1. Make a mount point for the file system. Run:
mkdir -p /mnt/mds
2. Mount the file system. Run:
mount -t ldiskfs {mdsdev} /mnt/mds
3. Change to the mount point being backed up. Run:
cd /mnt/mds
4. Back up the EAs. Run:
getfattr -R -d -m '.*' -P . > ea.bak
Note: In most distributions, the getfattr command is part of the "attr" package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method or contact us for assistance.
5. Verify that the ea.bak file has properly backed up the EA data on the MDS. Without this EA data, the backup is not useful. Look at this file with "more" or a text editor. For each file, it should have an item similar to this:
# file: ROOT/mds_md5sum3.txt
trusted.lov O0sOAVRCWEAAABXOKUCAAAAAAAAAAAAAAAAAAAQAAEFAAADD5 QOAAAAAAAAAAAAAAAA AAAAAAAEAAAA
6. Back up all file system data. Run:
tar czvf {backup file}.tgz --sparse .
Note: In Lustre 1.6.7 and later, the --sparse option reduces the size of the backup file. Be sure to use it in the tar command.
7. Change directory out of the mounted file system. Run:
cd -
8. Unmount the file system. Run:
umount /mnt/mds
15.1.4 Backing Up an OST
Follow
110. means that OSS distribution does not count in the weighting, but the stripe assignment is still done via a weighting: if OST2 has twice as much free space as OST1, then OST2 is twice as likely to be used, but it is not guaranteed to be used.
24.5 OST Pools
Lustre 1.8 introduces the OST pool feature, which enables users to group OSTs together to make object placement more flexible. A "pool" is a name associated with an arbitrary subset of OSTs in a Lustre cluster. OST pools follow these rules:
- An OST can be a member of multiple pools.
- No ordering of OSTs in a pool is defined or implied.
- OST membership in a pool is flexible, and can change over time.
When an OST pool is defined, it can be used to allocate files. When file or directory striping is set to a pool, only OSTs in the pool are candidates for striping. If a stripe_index is specified which refers to an OST that is not a member of the pool, an error is returned. OST pools are used only at file creation. If a pool's definition changes (an OST is added or removed, or the pool is destroyed), already-created files are not affected.
Note: It is an error (EINVAL) to create a file using an empty pool.
Note: Files created in a pool are not accessible from clients or servers running Lustre 1.6.5 or earlier; an error will be reported to the client. We recommend that either Lustre 1.6.6 or later
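The weighting statement at the start of this excerpt (an OST with twice the free space is twice as likely to be chosen) amounts to a selection probability proportional to free space. The sketch below only illustrates that proportionality; it is not Lustre's allocator, which may apply further adjustments, and the function name ost_weights and the free-space figures are made up for the example.

```shell
#!/bin/sh
# Hypothetical illustration: selection probability proportional to free space.
# Arguments are the free space of each OST, all in the same unit.
ost_weights() {
    printf '%s\n' "$@" | awk '
        { free[NR] = $1; total += $1 }
        END {
            if (total == 0) exit 1      # no free space anywhere: nothing to weight
            for (i = 1; i <= NR; i++)
                printf "OST%d %.3f\n", i, free[i] / total
        }'
}

# OST2 has twice the free space of OST1, so it gets twice the probability.
ost_weights 100 200
```

With these inputs the two probabilities come out in a 1:2 ratio, which matches "twice as likely" without guaranteeing that OST2 is picked for any particular file.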
111. mode, it means that the servers (MDS/OSS) have determined that the file system was stopped in an unclean state. In other words, unsaved data may be in the client cache. To save this data, the file system restarts in recovery mode and makes the clients write the data to disk.
19.1 Recovering Lustre
In Lustre recovery mode, the servers attempt to contact all clients and request that they replay their transactions. If all clients are contacted and they are recoverable (they have not rebooted), then recovery proceeds and the file system comes back with the cached client-side data safely saved to disk. If one or more clients are not able to reconnect (due to hardware failures or client reboots), then the recovery process times out, which causes all clients to be expelled. In this case, if there is any unsaved data in the client cache, it is not saved to disk and is lost. This is an unfortunate side effect of allowing Lustre to keep data consistent on disk.
19.2 Types of Failure
Different types of failure can cause Lustre to enter recovery mode:
- Client (compute node) failure
- MDS failure (and failover)
- OST failure
- Transient network partition
- Network failure
- Disk state loss
- Down node
- Disk state of multiple out-of-sync systems
Currently, all failure and recovery operations are based on the notion of connection failure. All imports or exports associated with a given connection are considered as f
112. nodes:
echo 0 > /sys/module/ptlrpc/at_max
Note: Changing adaptive timeouts status at runtime may cause transient timeout, reconnect, recovery, etc.
(1) The specific sub-directory in ptlrpc containing the parameters is system dependent.
Interpreting Adaptive Timeout Information
Adaptive timeout information can be read from /proc/fs/lustre/*/timeouts files (for each service and client) or with the lctl command.
This is an example from the /proc/fs/lustre/*/timeouts files:
cfs21:~# cat /proc/fs/lustre/ost/OSS/ost_io/timeouts
This is an example using the lctl command:
lctl get_param -n ost.*.ost_io.timeouts
This is the sample output:
service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2
The ost_io service on this node is currently reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it happened 26 minutes ago. The output also provides a history of service times. In the example, there are 4 "bins" of adaptive_timeout_history, with the maximum RPC time in each bin reported. In 0-150 seconds, the maximum RPC time was 1, with the same result in 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was 33 seconds, and from 450-600s the worst time was 2 seconds. The current estimated service time is the maximum value of the 4 bins (33 seconds in this example). Service times (as reported by the servers) are also tracked in the clie
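The rule that the current estimate is the maximum of the four history bins can be checked mechanically. The sketch below reads a "service" line and prints the largest of its last four fields. It assumes only that the bins are the final four whitespace-separated values, as in the sample output; the function name at_bins_max is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print the largest of the four history bins, taken as
# the last four whitespace-separated fields of a "service" line on stdin.
at_bins_max() {
    awk '$1 == "service" && NF >= 4 {
        max = $(NF - 3)
        for (i = NF - 2; i <= NF; i++)
            if ($i + 0 > max + 0) max = $i
        print max
    }'
}

# The sample line from the text; the result should equal the "cur" value.
echo 'service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2' | at_bins_max
```

On a server the same filter could be fed from the timeouts file or from lctl get_param, one service at a time.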
113. number> <nonnegative decimal> <hexadecimal>
Note: For networks using numeric addresses (e.g. elan), the address range must be specified in the <numaddr_range> syntax. For networks using IP addresses, the address range must be in the <ipaddr_range>. For example, if elan is using numeric addresses, 1.2.3.4@elan is incorrect.
CHAPTER 26 Lustre Operating Tips
This chapter describes tips to improve Lustre operations and includes the following sections:
- Adding an OST to a Lustre File System
- A Simple Data Migration Script
- Adding Multiple SCSI LUNs on Single HBA
- Failures Running a Client and OST on the Same Machine
- Improving Lustre Metadata Performance While Using Large Directories
26.1 Adding an OST to a Lustre File System
To add an OST to an existing Lustre file system:
1. Add a new OST by running the following commands:
mkfs.lustre --fsname=spfs --ost --mgsnode=mds16@tcp0 /dev/sda
mkdir -p /mnt/test/ost0
mount -t lustre /dev/sda /mnt/test/ost0
2. Migrate the data (possibly). The file system is quite unbalanced when new empty OSTs are added. New file creations are automatically balanced. If this is a scratch file system or files are pruned at a regular interval, then no further work may be needed. Files existing prior to the expansion can be rebalanced with an in-place copy, which can be done with a simple scr
114. objects on OSTs. When a file is composed of more than one object, Lustre stripes the file data across them in a round-robin fashion. Users can configure the number of stripes, the size of each stripe, and the servers that are used. One of the most frequently asked Lustre questions is "How should I stripe my files, and what is a good default?" The short answer is that it depends on your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and no more.
Advantages of Striping
There are two reasons to create files of multiple stripes: bandwidth and size.
Bandwidth
There are many applications which require high-bandwidth access to a single file, more bandwidth than can be provided by a single OSS. For example, scientific applications which write to a single file from hundreds of nodes, or a binary executable which is loaded by many nodes when an application starts. In cases like these, stripe your file over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. In our experience, the requirement is "as quickly as possible," which usually means all OSSs.
Note: This assumes that your application is using enough client nodes, and can read/write data fast enough to take advantage of this much OSS bandwidth. The largest useful stripe count is bounded by the I/O rate of your clients/jobs divided by the performance per OSS
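The bound in the note above is a simple division: aggregate client I/O rate over per-OSS throughput. The sketch below rounds the quotient up, on the assumption that a partially used extra OSS still contributes; the function name stripe_count_bound and the throughput figures are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: upper bound on the useful stripe count.
# $1 = aggregate I/O rate of the clients/jobs, $2 = throughput of one OSS,
# both in the same unit (for example MB/s). Rounds up to a whole OSS.
stripe_count_bound() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Clients that can drive 2500 MB/s against OSSs that each sustain 500 MB/s.
stripe_count_bound 2500 500
```

Striping wider than this bound adds objects without adding bandwidth the clients can actually use, which is the "as few objects as will meet those needs" rule of thumb in numeric form.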
115. offer the hard consistency guarantees of top-end enterprise RAID arrays. Hardware RAID guarantees that the value of any block is exactly the before or after value and that ordering of writes is preserved. With software RAID, an interrupted write operation that spans multiple blocks can frequently leave a stripe in an inconsistent state that is not restored to either the old or the new value. Normally, such interruptions are caused by an abrupt shutdown of the system.
If the array functions without disk failures but experiences sudden power-down incidents, such as interrupted writes on journal file systems, these events can affect file data and data in the journal. Metadata itself is re-written from the journal during recovery and is correct. Because the journal uses a single block to indicate that a complete transaction has committed after other journal writes have completed, the journal remains valid. File data can be corrupted when overwriting file data; this is a known problem with incomplete writes and caches. Recovery of the disk file systems with software RAID is similar to recovery without software RAID. Using Lustre servers with disk file systems does not change these guarantees.
Problems can arise if, after an abrupt shutdown, a disk fails on restart. In this case, even single-block writes provide no guarantee that, as an example, the journal will not be corrupted.
Follow these requirements:
- If the power-down of a system using software RAID is
116. on device /dev/sdb has started. Shortly afterwards, this output appears:
Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans
6. Create the client (mount the file system on the client). On the client node, run:
[root@client1 /]# mount -t lustre 10.2.0.1@tcp0:/temp /lustre
This command generates this output:
Lustre: Client temp-client has started
Verify that the file system started and is working by running the df, dd and ls commands on the client node.
a. Run the df command:
[root@client1 /]# lfs df -h
This command generates output similar to this:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00 7.2G 2.4G 4.5G 35% /
/dev/sda1 99M 29M 65M 31% /boot
tmpfs 62M 0 62M 0% /dev/shm
10.2.0.1@tcp0:/temp 30M 8.5M 20M 30% /lustre
b. Run the dd command:
[root@client1 /]# cd /lustre
[root@client1 lustre]# dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2
This command generates output similar to this:
2+0 records in
2+0 records out
8388608 bytes (8.4 MB) copied, 0.159628 seconds, 52.6 MB/s
c. Run the ls command:
[root@client1 lustre]# ls -lsah
This command generates output similar to this:
total 8.0M
4.0K drwxr-xr-x 2 root root 4.0K Oct 16 15:27 .
8.0K drwxr-xr-x 25 root root 4.0K Oct 16 15:27 ..
8.0M -rw-r--r-- 1 root root 8.0M Oct 16 15:27 zero.dat
C
117. only for outgoing messages, but also to hold state for bulk transfers requested by incoming messages. This pool should scale with the total number of peers. To enable the building of the Portals LND (ptllnd.ko), configure with this option:
./configure --with-portals=<path-to-portals-headers>
Variable (default): Description
- ntx (256): Total number of messaging descriptors.
- concurrent_peers (1152): Maximum number of concurrent peers. Peers that attempt to connect beyond the maximum are not allowed.
- peer_hash_table_size (101): Number of hash table slots for the peers. This number should scale with concurrent_peers. The size of the peer hash table is set by the module parameter peer_hash_table_size, which defaults to a value of 101. This number should be prime to ensure the peer hash table is populated evenly. It is advisable to increase this value to 1001 for ~10000 peers.
- cksum (0): Set to non-zero to enable message (not RDMA) checksums for outgoing packets. Incoming packets are always check-summed if necessary, independent of this value.
- timeout (50): Amount of time (in seconds) that a request can linger in a peer's active queue before the peer is considered dead.
- portal (9): Portal ID to use for the ptllnd traffic.
- rxb_npages (64 * #cpus): Number of pages in an RX buffer.
Variable (default): Description
- credits (128)
- peercredits (8)
- max_msg_size (512
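The description of peer_hash_table_size asks for a prime value, yet the suggested 1001 factors as 7 x 11 x 13. When choosing a size, a quick primality check is therefore worth running. The sketch below uses plain trial division; the function names is_prime and next_prime are illustrative, and whether the nearest prime performs better in practice is not something the manual states.

```shell
#!/bin/sh
# Hypothetical helpers for picking a prime hash table size.

# Succeed (exit status 0) if $1 is prime, using trial division.
is_prime() {
    n=$1
    [ "$n" -ge 2 ] || return 1
    d=2
    while [ $((d * d)) -le "$n" ]; do
        [ $((n % d)) -ne 0 ] || return 1
        d=$((d + 1))
    done
    return 0
}

# Print the smallest prime that is greater than or equal to $1.
next_prime() {
    p=$1
    until is_prime "$p"; do
        p=$((p + 1))
    done
    echo "$p"
}

next_prime 101     # the default is already prime
next_prime 1001    # first prime at or above the suggested larger size
```

Trial division is more than fast enough here, since table sizes of this magnitude need only a few dozen divisions.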
118. operations are permitted. During the recovery window, only clients that were connected at the time of MDS failure are permitted to reconnect. ClientUpcall, a user-space policy program, manages the re-connection to a new or rebooted MDS. ClientUpcall is responsible for setting up the necessary portals, routes and connections, and indicates which connection UUID should replace the failed one.
OST Failure
When an OST fails or is severed from the client, Lustre marks the corresponding OSC as inactive, and the LOV avoids making stripes for new files on that OST. Operations that operate on the whole file, such as determining file size or unlinking, skip inactive OSCs (and OSCs that become inactive during the operation). Attempts to read from or write to an inactive stripe result in an EIO error being returned to the client. As with the MDS failover case, Lustre invokes the ClientUpcall when it detects an OST failure. If and when the upcall indicates that the OST is functioning again, Lustre reactivates the OSC in question and makes file data from stripes on the newly returned OST available for reading and writing.
To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery fini
119. ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request.
- File system metadata: A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then the disk I/O is needed less often to retrieve the metadata.
- Network transport: If you are using TCP or other network transport that uses system memory for send/receive buffers, this must also be taken into consideration.
- Failover configuration: If the OSS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.
- OSS read cache: OSS read cache provides read-only caching of data on an OSS, using the regular Linux page cache to store the data. Just like caching from a regular file system in Linux, OSS read cache uses as much physical memory as is available.
Because of these memory requirements, the following calculations should be taken as determining the absolute minimum RAM required in an OSS node.
Calculating OSS Memory Requirements
The minimum recommended RAM size for an OSS with two OSTs is computed below:
1.5 MB per OST IO thread * 512 threads = 768 MB
e1000 RX descriptors, RxDescriptors=4096 for 9000 byte MTU = 128 MB
Operating system overhead = 512 MB
400 MB journal size
120. receive buffers to post (typically everything apart from bulk data).

Number of message envelopes to reserve for the small receive buffer queue. This determines a breakpoint in the number of concurrent senders. Below this number, communication attempts are queued, but above this number, the pre-allocated envelope queue will fill, causing senders to back off and retry. This can have the unfortunate side effect of starving arbitrary senders, who continually find the envelope queue is full when they retry. This parameter should therefore be increased if envelope queue overflow is suspected.

Number of large receive buffers to post (typically for routed bulk data).

Number of message envelopes to reserve for the large receive buffer queue. For more information on message envelopes, see the ep_envelopes_small option (above).

Smallest non-routed PUT that will be RDMA'd.

Smallest non-routed GET that will be RDMA'd.

30.2.4 RapidArray LND

The RapidArray LND (ralnd) is connection-based and uses the acceptor to establish connections with its peers. It is limited to a single instance, which uses all (both) RapidArray devices present. It load balances over them using the XOR of the source and destination NIDs to determine which device to use for communication. The address-within-network is determined by the address of the single IP interface that may be specified by the networks modu
121. same as the limit on regular files in all later versions of Lustre, due to a small ext3 format change. In fact, Lustre is tested with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB RAM, random lookups in such a directory are possible at a rate of 5,000 files/second.

32.9 MDS Space Consumption

A single MDS imposes an upper limit of 4 billion inodes. The default limit is slightly less than the device size divided by 4 KB, meaning about 512 M inodes for a file system with an MDS of 2 TB. This can be increased initially, at the time of MDS file system creation, by specifying the --mkfsoptions='-i 2048' option on the --add mds config line for the MDS. For newer releases of e2fsprogs, you can specify '-i 1024' to create 1 inode for every 1 KB of disk space. You can also specify '-N {num inodes}' to set a specific number of inodes. The inode size (-I) should not be larger than half the inode ratio (-i). Otherwise, mke2fs will spin trying to write more inodes than can fit into the device. For more information, see Options to Format MDT and OST File Systems.

32.10 Maximum Length of a Filename and Pathname

This limit is 255 bytes for a single filename, the same as in an ext3 file system. The Linux VFS imposes a full pathname length of 4096 bytes.

32.11 Maximum Number of Open Files for Lustre File Systems

Lustre does not impose maximum number o
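The inode arithmetic in the MDS Space Consumption excerpt above can be checked with a short sketch. This is an illustration only: the function names are hypothetical, the calculation ignores file system overhead (which is why the manual says "slightly less than"), and sizes are taken as binary units (1 TB = 2^40 bytes), which is an assumption.

```python
# Rough inode estimates for an MDS device, following the "device size / inode
# ratio" rule quoted above. Names are illustrative, not part of any Lustre tool.

KIB = 1 << 10
TIB = 1 << 40


def estimated_inodes(device_bytes: int, bytes_per_inode: int = 4 * KIB) -> int:
    """Upper-bound inode count for a given inode ratio (the mke2fs -i value)."""
    if bytes_per_inode <= 0:
        raise ValueError("bytes_per_inode must be positive")
    return device_bytes // bytes_per_inode


def inode_size_is_safe(inode_size: int, bytes_per_inode: int) -> bool:
    """The inode size (-I) should not exceed half the inode ratio (-i)."""
    return inode_size * 2 <= bytes_per_inode


if __name__ == "__main__":
    for ratio in (4 * KIB, 2 * KIB, 1 * KIB):
        count = estimated_inodes(2 * TIB, ratio)
        print(f"-i {ratio}: about {count // (1 << 20)} Mi inodes on a 2 TB MDS")
    print("512-byte inodes with -i 1024 safe?", inode_size_is_safe(512, 1024))
    print("1024-byte inodes with -i 1024 safe?", inode_size_is_safe(1024, 1024))
```

Halving the inode ratio doubles the inode count, which is why the -i 2048 and -i 1024 options raise the limit relative to the 4 KB default.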
122. server does not know how much time the RPC will take, so it asks for a fixed value. Default value is 30. When a server finds a queued request about to time out (and needs to send an early reply out), the server adds the at_extra value up to its estimate. If the time expires, the Lustre client will enter recovery status and reconnect to restore it to normal status. If you see multiple early replies for the same RPC asking for multiple 30-second increases, change the at_extra value to a larger number to cut down on early replies sent and, therefore, network load.

Sets the minimum lock enqueue time. Default value is 100. The ldlm_enqueue time is the maximum of the measured enqueue estimate (influenced by at_min and at_max parameters), multiplied by a weighting factor, and the ldlm_enqueue_min setting. LDLM lock enqueues were based on the obd_timeout value; now they have a dedicated minimum value. Lock enqueues increase as the measured enqueue times increase (similar to adaptive timeouts).

In future releases, the default will be 600 (adaptive timeouts will be enabled).

This default was chosen as a reasonable time in which to send a reply from the point at which it was sent.

This default was chosen as a balance between sending too many early replies for the same RPC and overestimating the actual completion time.

In Lustre 1.8, adaptive timeouts are enabled by default. To disable adaptive timeouts at run time, set at_max to 0 on all
123. shell command. After every run, PIOS executes the post-run command. Typically, this is used to clear and collect statistics for the run, or to start and stop statistics gathering during the run. The timestamp is passed to both pre-run and post-run.

For convenience, PIOS understands byte specifiers and uses:

    K,k for kilobytes (2<<10)
    M,m for megabytes (2<<20)
    G,g for gigabytes (2<<30)
    T,t for terabytes (2<<40)

Download the PIOS test tool at: http://downloads.clusterfs.com/public/tools/benchmarks/pios/

18.3.1 Synopsis

    pios
    [--chunksize|-c =values, --chunksize_low|-a =value, --chunksize_high|-b =value, --chunksize_incr|-g =value]
    [--offset|-o =values, --offset_low|-m =value, --offset_high|-q =value, --offset_incr|-r =value]
    [--regioncount|-n =values, --regioncount_low|-i =value, --regioncount_high|-j =value, --regioncount_incr|-k =value]
    [--threadcount|-t =values, --threadcount_low|-l =value, --threadcount_high|-h =value, --threadcount_incr|-e =value]
    [--regionsize|-s =values, --regionsize_low|-A =value, --regionsize_high|-B =value, --regionsize_incr|-C =value]
    [--directio|-d, --posixio|-x, --cowio|-w]
    [--cleanup|-L, --threaddelay|-T =ms, --regionnoise|-I =shift, --chunknoise|-N =bytes, --fpp|-F]
    [--verify|-V =values]
    [--prerun|-P =pre-command, --postrun|-R =post-command]
    [--path|-p =output-file-path]

18.3.2
124. size, up to a maximum of 32. The maximum number of threads (MDS_MAX_THREADS) is 512.

Note: The OSS and MDS automatically start new service threads dynamically in response to server loading, within a factor of 4. The default is calculated the same way as before (as explained in OSS Service Thread Count). Setting the _num_threads module parameter disables the automatic thread creation behavior.

20.1.2.1 I/O Scheduler

Select the best I/O scheduler for your setup. Try different I/O schedulers (kernel parameter elevator= on old kernels, or echo <scheduler> > /sys/block/<dev>/queue/scheduler), because they behave differently depending on storage and load. Benchmark all I/O schedulers and select the best one for your setup. For more information on I/O schedulers, see:

    http://www.linuxjournal.com/article/6931
    http://www.redhat.com/magazine/008jun05/features/schedulers/

20.2 LNET Tunables

This section describes LNET tunables.

20.2.0.1 Transmit and receive buffer size

With Lustre release 1.4.7 and later, ksocklnd has separate parameters for the transmit and receive buffers:

    options ksocklnd tx_buffer_size=0 rx_buffer_size=0

If these parameters are left at the default value (0), the system automatically tunes the transmit and receive buffer size. In almost every case, this default produces the best performance. Do not attempt to tune these parameters unl
125. subsystems and debug types used in Lustre are as follows:

- Standard Subsystems: mdc, mds, osc, ost, obdclass, obdfilter, llite, ptlrpc, portals, lnd, ldlm, lov
- Debug Types:

    Types       Description
    trace       Entry/Exit markers
    dlmtrace    Locking-related information
    inode
    super
    ext2        Anything from the ext2_debug
    malloc      Print malloc or free information
    cache       Cache-related information
    info        General information
    ioctl       IOCTL-related information
    blocks      Ext2 block allocation information
    net         Networking
    warning
    buffs
    other
    dentry
    portals     Entry/Exit markers
    page        Bulk page handling
    error       Error messages
    emerg
    rpctrace    For distributed debugging
    ha          Failover and recovery-related information

23.1.1 Format of Lustre Debug Messages

Lustre uses the CDEBUG and CERROR macros to print the debug or error messages. To print the message, the CDEBUG macro uses portals_debug_msg (portals/linux/oslib/debug.c). The message format is described below, along with an example.

    Parameter                      Description
    subsystem                      800000
    debug mask                     000010
    smp_processor_id               0
    sec.used                       1081880847.677302
    stack size                     1204
    pid                            2973
    host pid (if uml) or zero      31070
    (file:line #:functional())     (as_dev.c:144:create_write_buffers())
    debug message                  kmalloced '*obj': 24 at a375571c (tot 17447717)

23.2 Tools for Lustre Debugging

The Lustre system offers
126. system and mount it.

    mkdir -p /mnt/data/mdt
    mount -t lustre /dev/sda /mnt/data/mdt

Start Lustre on all the four OSTs.

    mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sda
    mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sdd
    mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sda1
    mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sdb

8. Make a mount point on all the OSTs for the file system and mount it.

    mkdir -p /mnt/data/ost0
    mount -t lustre /dev/sda /mnt/data/ost0
    mkdir -p /mnt/data/ost1
    mount -t lustre /dev/sdd /mnt/data/ost1
    mkdir -p /mnt/data/ost2
    mount -t lustre /dev/sda1 /mnt/data/ost2
    mkdir -p /mnt/data/ost3
    mount -t lustre /dev/sdb /mnt/data/ost3
    mount -t lustre mdsnode@tcp0:/datafs /mnt/datafs

6.1.2.3 Configuring Lustre with a CSV File

A new utility (script), /usr/sbin/lustre_config, can be used to configure Lustre 1.6 and later. This script enables you to automate formatting and setup of disks on multiple nodes. Describe your entire installation in a Comma Separated Values (CSV) file and pass it to the script. The script contacts multiple Lustre targets simultaneously, formats the drives, updates modprobe.conf, and produces HA configuration files using definitions in the CSV file. The lustre_config -h option shows several samples of CSV files.

Note: The CSV file format is a f
127. taken directly from the POSIX paper referenced above. ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner. Below, we create a directory and allow a specific user access.

    [root@client lustre]# umask 027
    [root@client lustre]# mkdir rain
    [root@client lustre]# ls -ld rain
    drwxr-x--- 2 root root 4096 Feb 20 06:50 rain
    [root@client lustre]# getfacl rain
    # file: rain
    # owner: root
    # group: root
    user::rwx
    group::r-x
    other::---
    [root@client lustre]# setfacl -m user:chirag:rwx rain
    [root@client lustre]# ls -ld rain
    drwxrwx---+ 2 root root 4096 Feb 20 06:50 rain
    [root@client lustre]# getfacl --omit-header rain
    user::rwx
    user:chirag:rwx
    group::r-x
    mask::rwx
    other::---

Using Root Squash

Lustre 1.6 introduced root squash functionality, a security feature which controls super user access rights to a Lustre file system. Before the root squash feature was added, Lustre users could run rm -rf as root and remove data which should not be deleted. Using the root squash feature prevents this outcome.

The root squash feature works by re-mapping the user ID (UID) and group ID (GID) of the root user to a UID and GID specified by the system administrator, via the Lustre configuration management server (MGS). The root squash feature also enables the Lustre administrator to speci
128. the size of read call, if larger) after the second sequential read on a file descriptor. Random reads are done at the size of the read() call only (no readahead). Reads to non-contiguous regions of the file reset the readahead algorithm, and readahead is not triggered again until there are sequential reads again. To disable readahead, set this tunable to 0. The default value is 40 MB.

/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_whole_mb

This tunable controls the maximum size of a file that is read in its entirety, regardless of the size of the read().

22.2.6.2 Tuning Directory Statahead

When the ls -l process opens a directory, its process ID is recorded. When the first directory entry is "stated" with this recorded process ID, a statahead thread is triggered, which stats ahead all of the directory entries, in order. The ls -l process can use the stated directory entries directly, improving performance.

/proc/fs/lustre/llite/*/statahead_max

This tunable controls whether directory statahead is enabled and the maximum statahead count. By default, statahead is active.

To disable statahead, set this tunable to:

    echo 0 > /proc/fs/lustre/llite/*/statahead_max

To set the maximum statahead count (n), set this tunable to:

    echo n > /proc/fs/lustre/llite/*/statahead_max

The maximum value of n is 8192.

/proc/fs/lustre/llite/*/statahead_status

This is a read-only interface that indicat
129. the underlying ldiskfs file system has not unmounted gracefully (due to a crash, for example), re-run quotacheck to obtain accurate quota information. Lustre 1.6.5 and 1.4.12 use journaled quota, so it is not necessary to run quotacheck after an unclean shutdown. In certain failure situations (e.g., when a broken Lustre installation or build is used), re-run quotacheck after checking the server kernel logs and fixing the root problem.

The lfs command includes several command options to work with quotas:

- quotaon: enables disk quotas on the specified file system. The file system quota files must be present in the root directory of the file system.
- quotaoff: disables disk quotas on the specified file system.
- quota: displays general quota information (disk usage and limits).
- setquota: specifies quota limits and tunes the grace period. By default, the grace period is one week.

Usage:

    lfs quotaon [-ugf] <filesystem>
    lfs quotaoff [-ug] <filesystem>
    lfs quota [-v] [-o obd_uuid] [-u <username>|-g <groupname>] <filesystem>
    lfs quota -t <-u|-g> <filesystem>
    lfs setquota <-u|--user|-g|--group> <username|groupname> [-b <block-softlimit>] [-B <block-hardlimit>] [-i <inode-softlimit>] [-I <inode-hardlimit>] <filesystem>

Examples:

In all of the examples below, the file system is /mnt/lustre. To turn on user and group quotas, run: lf
130. the verification is done from timestamps read from the first location of files previously written in the test. If a sequence is given, then each run verifies the timestamp accordingly. If a single timestamp is given, then it is verified with all files written.

18.3.4 PIOS Examples

To create a 1 GB load with a different number of threads:

In one file:

    pios -t 1,2,4,8,16,32,64,128 -n 128 -c 1M -s 8M -o 8M --load=posixio -p /mnt/lustre

In multiple files:

    pios -t 1,2,4,8,16,32,64,128 -n 128 -c 1M -s 8M -o 8M --load=posixio,fpp -p /mnt/lustre

To create a 1 GB load with a different number of chunksizes on ldiskfs with direct I/O:

In one file:

    pios -t 32 -n 128 -c 128K,256K,512K,1M,2M,4M -s 8M -o 8M --load=directio -p /mnt/lustre

In multiple files:

    pios -t 32 -n 128 -c 128K,256K,512K,1M,2M,4M -s 8M -o 8M --load=directio,fpp -p /mnt/lustre

To create a 32 MB to 128 MB load with different RegionSizes on a Solaris zpool:

In one file:

    pios -t 8 -n 16 -c 1M -A 2M -B 8M -C 100 -o 8M --load=posixio -p /myzpool/

In multiple files:

    pios -t 8 -n 16 -c 1M -A 2M -B 8M -C 100 -o 8M --load=posixio,fpp -p /myzpool/

To read and verify timestamps:

Create a load with PIOS:

    pios -t 40 -n 1024 -c 256K -s 4M -o 8M --load=posixio -p /mnt/lustre

Keep the same parameters to read:

    pios -t 40 -n 1024 -c 256K -s 4M -o 8M --load=posixio -p /mnt/lustre --verify

18
131. time. In case of a failure, one node takes over for the other.

To configure this for the shared disk, the shared disk must provide multiple partitions; each OST is the primary server for one partition and the secondary server for the other partition. The active/passive configuration doubles the hardware cost without improving performance, and is seldom used for OST servers.

8.2 OST Failover

The OST has two operating modes: failover and failout. The default mode is failover. In failover mode, clients attempt to connect to each OSS node configured to serve the OST until one of them responds with it active. Data on the OST is written synchronously, and the clients replay transactions which were in progress and uncommitted to disk before the OST failure.

In the typical OST failover scenario, an OSS node fails and the other node mounts the OST (typically done by Linux-HA/Heartbeat). When this happens, no applications see any errors.

In failout mode, when the underlying hardware has failed or the connection to storage has failed (one reason to use multipath IO), Lustre returns IO errors to the application.

8.3 MDS Failover

The MDS has only one failover mode: active/passive, as only one MDS may be active at a given time. In a failover configuration, there are two MDSs, each with access to the same MDT. Either MDS can mount the MDT, but not both at the same time.

8.4 Configuring Lustre for F
132. to the console.

- debugfs: Interactive file system debugger.
- Lustre subsystem asserts: In case of asserts, a log writes at /tmp/lustre_log.<timestamp>.
- lfs: This Lustre utility helps get to the extended attributes of a Lustre file (among other things).
- Lustre diagnostic tool: This utility helps users report and create logs for Lustre bugs.
- GNU tar (gtar): This modified version of the gtar utility can back up and restore extended attributes (i.e., file striping) for Lustre. Files backed up using gtar are restored per the backed-up striping information. The backup procedure does not use default striping rules.

Note: Normal gtar does not store/restore Lustre attributes. To use this functionality, you must download the Lustre-patched tar utility (modified gtar), available here: http://downloads.lustre.org/public/tools/lustre-tar/

23.2.1 Debug Daemon Option to lctl

The debug_daemon allows users to control the Lustre kernel debug daemon to dump the debug_kernel buffer to a user-specified file. This functionality uses a kernel thread on top of debug_kernel. debug_kernel, another sub-command of lctl, continues to work in parallel with the debug_daemon command.

Debug_daemon is highly dependent on file system write speed. File system write operations may not be fast enough to flush out all the debug_buffer if Lustre file system is under heavy system load and continue to CD
133. txqueuelen:0
    RX bytes:314203 (306.8 KiB)  TX bytes:129834 (126.7 KiB)

    Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
    inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
    UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
    RX packets:1581 errors:0 dropped:0 overruns:0 frame:0
    TX packets:448 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:162084 (158.2 KiB)  TX bytes:67245 (65.6 KiB)
    Interrupt:193 Base address:0x8c00

    Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
    inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
    UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
    RX packets:1513 errors:0 dropped:0 overruns:0 frame:0
    TX packets:444 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:152299 (148.7 KiB)  TX bytes:64517 (63.0 KiB)
    Interrupt:185 Base address:0x6000

12.5.1 Examples

This is an example of modprobe.conf for bonding Ethernet interfaces eth1 and eth2 to bond0:

    # cat /etc/modprobe.conf
    alias eth0 8139too
    alias scsi_hostadapter sata_via
    alias scsi_hostadapter1 usb-storage
    alias snd-card-0 snd-via82xx
    options snd-card-0 index=0
    options snd-via82xx index=0
    alias bond0 bonding
    options bond0 mode=balance-alb miimon=100
    options lnet networks=tcp
    alias eth1 via-rhine

    # cat /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=none
    NETMASK=255.255.255.0
    IPADDR=192.168.10.79

As
134. values. The values set for the MDS must match the values set on the OSTs.

The quota_bunit_sz parameter displays bytes; however, lfs setquota uses KBs. The quota_bunit_sz parameter must be a multiple of 1024. A proper minimum KB size for lfs setquota can be calculated as:

    Size in KBs = (quota_bunit_sz * (number of OSTs + 1)) / 1024

We add one (1) to the number of OSTs as the MDS also consumes KBs. As inodes are only consumed on the MDS, the minimum inode size for lfs setquota is equal to quota_iunit_sz.

Note: Setting the quota below this limit may prevent the user from all file creation.

9.1.4 Known Issues with Quotas

Using quotas in Lustre can be complex and there are several known issues.

9.1.4.1 Granted Cache and Quota Limits

In Lustre, granted cache does not respect quota limits. In this situation, OSTs grant cache to the Lustre client to accelerate I/O. Granting cache causes writes to be successful in OSTs, even if they exceed the quota limits, and will overwrite them.

The sequence is:

1. A user writes files to Lustre.
2. If the Lustre client has enough granted cache, then it returns "success" to users and arranges the writes to the OSTs.
3. Because Lustre clients have delivered success to users, the OSTs cannot fail these writes.

Because of granted cache, writes always overwrite quota limitations. For example, if you set a 400 GB quota on user A and use IOR to wr
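The minimum block quota rule in the quota excerpt above (quota_bunit_sz times the number of OSTs plus one, divided by 1024) can be worked through as a small sketch. The function name and the 128 MB sample value for quota_bunit_sz are assumptions chosen for illustration; read the real value from your own servers.

```python
# Minimum block quota (in KB) that lfs setquota should be given, per the rule:
#   size_kb = quota_bunit_sz * (num_osts + 1) / 1024
# The "+ 1" accounts for the MDS, which also consumes block quota.

def min_block_quota_kb(quota_bunit_sz: int, num_osts: int) -> int:
    """Return the minimum block quota in KB for the given unit size and OST count."""
    if quota_bunit_sz <= 0 or quota_bunit_sz % 1024 != 0:
        raise ValueError("quota_bunit_sz must be a positive multiple of 1024 bytes")
    if num_osts < 1:
        raise ValueError("num_osts must be at least 1")
    return quota_bunit_sz * (num_osts + 1) // 1024


if __name__ == "__main__":
    sample_bunit = 128 * 1024 * 1024  # illustrative value, in bytes
    for osts in (1, 4, 16):
        kb = min_block_quota_kb(sample_bunit, osts)
        print(f"{osts} OSTs: set a block limit of at least {kb} KB")
```

Because the minimum grows linearly with the OST count, a limit that works on a small test system can be too low on a larger one.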
135. version, that is, the number of the last transaction (transno) in which the inode was changed.

- When an inode is about to be changed, a pre-operation version of the inode is saved in the client's data.
- The client keeps the pre-operation inode version and the post-operation version (transaction number) for replay, and sends them in the event of a server failure.
- If the pre-operation version matches, then the request is replayed. The post-operation version is assigned on all inodes modified in the request.

Note: An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a "rename" operation, four different inodes can be modified.

Footnotes:

1. There are two scenarios under which client RPCs are not replayed: (1) Non-functioning or isolated clients do not reconnect, and they cannot replay their RPCs, causing a gap in the replay sequence. These clients get errors and are evicted. (2) Functioning clients connect, but they cannot replay some or all of their RPCs that occurred after the gap caused by the non-functioning/isolated clients. These clients get errors (caused by the failed clients). With VBR, these requests have a better chance to replay because the "gaps" are only related to specific files that the missing client(s) changed.
2. Usually, there are two inodes, a parent and a child.

19.3.1

During normal operation, th
136. with many clusters in the 10,000-20,000 client range. Other Lustre installations provide aggregated disk storage and bandwidth of up to 1,000 OSTs running on more than 450 OSSs. Several Lustre file systems with a capacity of 1 PB or more (allowing storage of up to 2 billion files) have been in use since 2006.

- Performance: Lustre deployments in production environments currently offer performance of up to 100 GB/s. In a test environment, a performance of 130 GB/s and 13,000 creates/s has been sustained. Lustre single client node throughput has been measured at 2 GB/s (max), and OSS throughput at 2.5 GB/s (max). Lustre has been run at 240 GB/sec on the Spider file system at Oak Ridge National Laboratories.
- POSIX compliance: The full POSIX test suite passes on Lustre clients. In a cluster, POSIX compliance means that most operations are atomic and clients never see stale data or metadata.
- High availability: Lustre offers shared storage partitions for OSS targets (OSTs), and a shared storage partition for the MDS target (MDT).
- Security: In Lustre, it is an option to have TCP connections only from privileged ports. Group membership handling is server-based. POSIX access control lists (ACLs) are supported.
- Open source: Lustre is licensed under the GNU GPL.

Additionally, Lustre offers these features:

- Interoperability: Lustre runs on a variety of CPU architectures and mixed-endian clusters, and interoperability between adjacent Lustre software relea
137. with on-screen computer output.

AaBbCc123: Book titles, new words or terms, words to be emphasized. Replace command-line variables with real names or values.

Examples: Edit your .login file. Use ls -a to list all files. % You have mail. % su Password: Read Chapter 6 in the User's Guide. These are called class options. You must be superuser to do this. To delete a file, type rm filename.

Note: Characters display differently depending on browser settings. If characters do not display correctly, change the character encoding in your browser to Unicode UTF-8.

A backslash (\) continuation character is used to indicate that commands are too long to fit on one text line.

Lustre 1.8 Operations Manual, September 2009

Third-Party Web Sites

Sun is not responsible for the availability of third-party web sites mentioned in this document. Sun does not endorse and is not responsible or liable for any content, advertising, products, or other materials that are available on or through such sites or resources. Sun will not be responsible or liable for any actual or alleged damage or loss caused by or in connection with the use of or reliance on any such content, goods, or services that are available on or through such sites or resources.

Revision History

    BookTitle                       Part Number    Rev    Date        Comments
    Lustre 1.8 Operations Manual    821-0035-10    A      June 200
138. without changing the value.

Note: Currently, the lru_size parameter can only be set temporarily with lctl set_param; it cannot be set permanently.

To disable LRU sizing, run this command on the Lustre clients:

    lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))

Replace the NR_CPU value with the number of CPUs on the node.

To determine the number of locks being granted:

    lctl get_param ldlm.namespaces.*.pool.limit

22.2.11 Changing MDS and OSS Thread Counts

In Lustre 1.8 and later, MDS and OSS thread counts (minimum and maximum) can be set via the {min,max}_thread_count tunable. For each service, a new /proc/fs/lustre/{service}/*/thread_{min,max,started} entry is created. The tunable, {service}.{portal}.thread_{min,max,started}, can be used to set the minimum and maximum thread counts or get the current number of running threads for these service/portal combinations:

    Service          Portal          Description
    mdt.MDS          mds             normal metadata ops
    mdt.MDS          mds_readpage    metadata readdir
    mdt.MDS          mds_setattr     metadata setattr
    ost.OSS          ost             normal data
    ost.OSS          ost_io          bulk data IO
    ost.OSS          ost_create      MDS pre-create
    ldlm.services    ldlm_canceld    DLM lock cancel
    ldlm.services    ldlm_cbd        DLM lock grant

- To temporarily set this tunable, run:

    lctl {get,set}_param {service}.{portal}.thread_{min,max,started}

- To permanently set this tunable, run:

    lctl conf_param {service}.{portal}.threa
139. 005

    debugfs> stats           (shows superblock and group summary information)
    debugfs> ls              (shows directory listing)
    debugfs> stat <inum>     (shows inode information for inode number <inum>)
    debugfs> stat name       (shows inode information for inode "name")
    debugfs> cd dir          (change into directory "dir"; ROOT is the start of the Lustre-visible namespace)
    debugfs> quit

Once you have assessed the damage (possibly with the assistance of Lustre Support, depending on the nature of the corruption), then fixing it is the next step. Often, it is prudent to make a backup of the file system metadata (time and space permitting) in case there is a problem or if it is unclear whether e2fsck will take the correct action (in most cases it will). To make a metadata backup, run:

    [root@mds]# e2image /dev/sda /bigplace/sda.e2image

In most cases, running e2fsck -fp device will fix most types of corruption. The e2fsck program has been used for many years and has been tested with a huge number of different corruption scenarios. If you suspect serious corruption, or do not expect e2fsck to fix the problem, then consider running a manual check, e2fsck -f device. The limitation of the manual check is that it is interactive and can be quite lengthy if there are a lot of problems.

How do I clean up a device with lctl?

1. Run lconf --cleanup --force
2. If that does not work, then start lctl (if
140. 009

18.2.2 obdfilter_survey

The obdfilter_survey script processes sequential I/O with varying numbers of threads and objects (files) by using lctl to drive the echo_client connected to local or remote obdfilter instances or remote obdecho instances. It can be used to characterize the performance of the following Lustre components:

- OSTs: The script exercises one or more instances of obdfilter directly. The script may run on one or more nodes, for example, when the nodes are all attached to the same multi-ported disk subsystem. Tell the script the names of all obdfilter instances (which should be up and running already). If some instances are on different nodes, specify their hostnames too (for example, node1:ost1). Alternately, you can pass the parameter case=disk to the script. (The script automatically detects the local obdfilter instances.) All obdfilter instances are driven directly. The script automatically loads the obdecho module (if required) and creates one instance of echo_client for each obdfilter instance.
- Network: The script drives one or more instances of the obdecho server via instances of echo_client running on one or more nodes. Pass the parameters case=network and target=<hostname/ip_of_server> to the script. For each network case, the script does the required setup.
- Striped File System Over the Network: The script drives one or more instances of obdfilter via instances of echo_client running on one or more no
141. 035660 0000120

    Max Write: 173.89 MiB/sec (182.33 MB/sec)
    Max Read: 278.49 MiB/sec (292.02 MB/sec)
    Run finished: Fri Sep 29 11:43:56 2006

17.3 IOzone Benchmark

IOzone is a file system benchmark tool which generates and measures a variety of file operations. IOzone has been ported to many machines and runs under many operating systems. IOzone is useful for performing a broad file system analysis of a vendor's computer platform. The IOzone benchmark tests file I/O performance for the following operations: read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read/write, pread/pwrite variants, aio_read, aio_write, and mmap.

To install and run the IOzone benchmark:

1. Download the most recent version of the IOzone software from this location: http://www.iozone.org
2. Install the IOzone software (per the ReadMe file accompanying the IOzone software).
3. Run the IOzone software (per the ReadMe file accompanying the IOzone software).

Sample Output

    Iozone: Performance Test of File I/O
    Version $Revision: 3.263 $
    Compiled for 32 bit mode.
    Build: linux
    Contributors: William Norcott, Don Capps, Isom Craw
142. 1 lustre

Once you have these files created, you can run the conversion tool:

    /usr/lib/heartbeat/haresources2cib.py -c basic.ha.cf basic.haresources > basic.cib.xml

2. Examine the cib.xml file.

The first section in the XML file is <attributes>. The default values should be fine for most installations. The actual resources are defined in the <primitive> section. The default behavior of Heartbeat is an automatic failback of resources when a server is restored. To avoid this, you must add a parameter to the <primitive> definition. You may also like to reduce the timeouts. In addition, the current version of the script does not correctly name the parameters.

    <cib generated="true" admin_epoch="0" epoch="0" num_updates="0" have_quorum="true" ignore_dtd="false" num_peers="2" ccm_transition="1" cib-last-written="Thu Aug 9 09:50:12 2007">
      <configuration>
        <crm_config/>
        <nodes>
          <node id="00e8c292-2a28-4492-bcfs-fb2625ablc61" uname="oss162.spsoftware.com" type="normal"/>
          <node id="e370be9a-24f4-46a5-99ac-41a88c5fa344" uname="oss161.spsoftware.com" type="normal"/>
        </nodes>
        <resources/>
        <constraints/>
      </configuration>
    </cib>

a. Copy the modified resource file to /var/lib/heartbeat/crm/cib.xml
b. Start the Heartbeat software.
c. After startup, Heartbeat re-writes the cib.xml, adding a <node> section and status information. Do not alter tho
143. 1 8 o2ib0

This is an example for IB clients to access TCP servers via 8 IB-TCP routers:

    options lnet 'ip2nets="tcp0 10.10.0.*; o2ib0(ib0) 192.168.10.[1-128]"' 'routes="tcp 192.168.10.[1-8]@o2ib0; o2ib 10.10.0.[1-8]@tcp0"'

This specifies bi-directional routing: TCP clients can reach Lustre resources on the IB networks, and IB servers can access the TCP networks. For more information on ip2nets, see Modprobe.conf.

Note: Configure IB network interfaces on a different subnet than LAN interfaces.

Caution: For the options ip2nets, routes, and networks, several best practices must be followed or configuration errors occur.

Best Practice 1: If you add a comment (#) to any of the options mentioned above, position the semicolon after the comment. If you fail to do so, some nodes are not properly initialized, because LNET silently ignores everything following the # character (which begins the comment) until it reaches the next semicolon. This is subtle; no error message is generated to alert you to the problem.

This example shows the correct syntax:

    options lnet ip2nets="pt10 192.168.0.[89,93] # comment with semicolon AFTER comment; \
    pt11 192.168.0.[92,96] # comment"

In this example, the following is ignored: comment with semicolon AFTER comment

This example shows the wrong syntax:

    options lnet ip2nets="pt10 192.168.0.[89,93]; # comment with semicolon BEFORE comment \
    pt11 192.168.0.[92,96]
144. 1 8 Operations Manual October 2009 The examples below are from RedHat systems For setup use etc sysconfig networking scripts ifcfg The OSDL website referenced below includes detailed instructions for other configuration methods instructions to use DHCP with bonding and other setup details We strongly recommend you use this website http linux net osdl org index php Bonding 6 Check proc net bonding to determine status on bonding There should be a file there for each bond interface cat proc net bonding bond0 Ethernet Channel Bonding Driver v3 0 3 March 23 2006 Bonding Mode load balancing round robin MII Status up MII Polling Interval ms 0 Up Delay ms 0 Down Delay ms 0 Slave Interface eth0 MII Status up Link Failure Count 0 Permanent HW addr 4c 00 10 ac 61 e0 Slave Interface eth1 MII Status up Link Failure Count 0 Permanent HW addr 00 14 2a 7c 40 1d Chapter 12 Bonding 12 7 12 8 7 Use ethtool or ifconfig to check the interface state ifconfig lists the first bonded interface as bond0 ifconfig bondo eth0 eth1 Link encap Ethernet HWaddr 4C 00 10 AC 61 E0 inet addr 192 168 10 79 Bcast 192 168 10 255 Mask 255 255 255 0 inet6 addr fe80 4e00 10ff feac 61e0 64 Scope Link UP BROADCAST RUNNING MASTER MULTICAST MTU 1500 Metric 1 RX packets 3091 errors 0 dropped 0 overruns 0 frame 0 TX packets 880 errors 0 dropped 0 overruns 0 carrier 0 collisions 0
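Checking the bonding status file by eye works for one node; for many nodes a small parser can help. The sketch below assumes the usual "Key: value" layout of /proc/net/bonding/bond0 (the colons are an assumption, since the listing above lost its punctuation), and the function name `parse_bonding_status` is hypothetical.

```python
def parse_bonding_status(text):
    """Extract bond mode, bond MII status and per-slave state."""
    bond = {"mode": None, "mii_status": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Bonding Mode":
            bond["mode"] = value
        elif key == "Slave Interface":
            current = value
            bond["slaves"][current] = {}
        elif key == "MII Status":
            if current is None:
                bond["mii_status"] = value
            else:
                bond["slaves"][current]["mii_status"] = value
        elif key == "Link Failure Count" and current is not None:
            bond["slaves"][current]["link_failures"] = int(value)
    return bond


sample = """Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)
Bonding Mode: load balancing (round-robin)
MII Status: up
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 4c:00:10:ac:61:e0
Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:14:2a:7c:40:1d
"""
status = parse_bonding_status(sample)
print(status["mode"], sorted(status["slaves"]))
```

A slave whose MII status is not "up", or whose link failure count keeps rising, is the first thing to look at when a bond underperforms.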
145. 100 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load is distributed evenly, there is no contention and the disks on each server can manage sequential I/O. If each file has 100 objects, then the clients all compete with one another for the attention of the servers, and the disks on each node seek in 100 different directions. In this case, there is needless contention.
Increased Risk
Increased risk is evident when you consider the example of striping each file across all servers. In this case, if any one OSS catches on fire, a small part of every file is lost. By comparison, if each file has exactly one stripe, you lose fewer files, but you lose them in their entirety. Most users would rather lose some of their files entirely than all of their files partially.
Stripe Size
Choosing a stripe size is a small balancing act, but there are reasonable defaults. The stripe size must be a multiple of the page size. For safety, Lustre's tools enforce a multiple of 64 KB (the maximum page size on ia64 and PPC64 nodes), so users on platforms with smaller pages do not accidentally create files which might cause problems for ia64 clients. Although you can create files with a stripe size of 64 KB, this is a poor choice. Practically, the smallest recommended stripe size is 512 KB because Lustre sends 1 MB chunks over the network. This is a good amount of data to transfer at one time. Choosing a smaller stripe size may hin
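The two stripe-size rules stated above (a multiple of 64 KB is enforced, and 512 KB is the smallest recommended size) can be written as a quick check. This is an illustrative sketch only; the function name and the returned labels are hypothetical and do not correspond to any Lustre tool.

```python
STRIPE_GRANULARITY = 64 * 1024   # multiple enforced by the tools, per the text
RECOMMENDED_MIN = 512 * 1024     # smallest recommended stripe size, per the text


def check_stripe_size(size_bytes):
    """Classify a proposed stripe size against the two rules above."""
    if size_bytes <= 0 or size_bytes % STRIPE_GRANULARITY != 0:
        return "invalid"
    if size_bytes < RECOMMENDED_MIN:
        return "allowed but not recommended"
    return "ok"


for size in (64 * 1024, 100000, 512 * 1024, 1024 * 1024):
    print(size, check_stripe_size(size))
```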
146. 18.4 LNET Self-Test
LNET self-test helps site administrators confirm that Lustre Networking (LNET) has been properly installed and configured, and that underlying network software and hardware are performing according to expectations. LNET self-test is a kernel module that runs over LNET and LNDs. It is designed to:
- Test the connection ability of the Lustre network
- Run regression tests of the Lustre network
- Test performance of the Lustre network
18.4.1 Basic Concepts of LNET Self-Test
This section describes basic concepts of LNET self-test, utilities and a sample script.
18.4.1.1 Modules
To run LNET self-test, these modules must be loaded: libcfs, lnet, lnet_selftest and one of the klnds (i.e., ksocklnd, ko2iblnd). To load all necessary modules, run modprobe lnet_selftest, which recursively loads the modules on which LNET self-test depends.
The LNET self-test cluster has two types of nodes:
- Console node: A single node that controls and monitors the test cluster. It can be any node in the test cluster.
- Test nodes: The nodes that run tests. Test nodes are controlled by the user via the console node; the user does not need to log into them directly.
The console and test nodes require all previously listed modules to be loaded. (The userspace test node does not require these modules.)
Note: Test nodes can be in either kernel or userspace. A console user can i
147. 2 H HA software 3 4 handling timeouts 27 23 HBA adding SCSI LUNs 26 5 Heartbeat configuration with STONITH 8 13 without STONITH 8 10 Heartbeat V1 failover setup 8 9 Heartbeat V2 failover setup 8 18 l I O options end to end client checksums 24 16 I O tunables 22 12 improving Lustre metadata performance with large directories 26 6 Infinicon InfiniBand iib 2 2 installing Lustre SNMP module 14 2 POSIX 16 2 installing Lustre from RPMs 3 9 from source code 3 13 installing Lustre debugging tools 3 4 installing Lustre environmental requirements 3 5 installing Lustre HA software 3 4 installing Lustre memory requirements 3 6 installing Lustre prerequisites 3 2 installing Lustre required software 3 3 installing Lustre required tools utilities 3 3 interconnects supported 3 2 interoperability 13 2 interpreting adaptive timeouts 22 8 IOR benchmark 17 3 IOzone benchmark 17 5 K Kerberos Lustre setup 11 2 Lustre Kerberos flavors 11 11 key features 1 3 L Ictl 31 8 Ictl tool 23 7 lfs command 27 2 lfs getstripe display files and directories 24 4 setting file layouts 24 6 lfsck command 27 12 listing Lustre parameters 4 23 llapi 24 18 Ilapi command 29 1 llog_reader utility 31 21 Ilstat sh utility 31 20 LND 2 1 LNET 1 16 configuring 2 5 routers 2 11 starting 2 13 stopping 2 14 LNET self test commands 18 22 concepts 18 19 Load balancing with InfiniBand modprob
148. 2 Running I/O Kit Tests
As mentioned above, the I/O kit contains these test tools:
- sgpdd_survey
- obdfilter_survey
- ost_survey
18.2.1 sgpdd_survey
Use the sgpdd_survey tool to test bare metal performance, while bypassing as much of the kernel as possible. This script requires the sgp_dd package, although it does not require Lustre software. This survey may be used to characterize the performance of a SCSI device by simulating an OST serving multiple stripe files. The data gathered by this survey can help set expectations for the performance of a Lustre OST exporting the device.
The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths. The script spawns variable numbers of sgp_dd instances, each reading or writing a separate area of the disk, to demonstrate performance variance within a number of concurrent stripe files.
The device(s) used must meet one of the two tests described below:
- SCSI device: Must appear in the output of sg_map (make sure the kernel module sg is loaded).
- Raw device: Must appear in the output of raw -qa.
If you need to create raw devices in order to use the sgpdd_survey tool, note that raw device 0 cannot be used due to a bug in certain versions of the raw utility (including that shipped with RHEL4U4). You may not mix raw and S
149. 2 11 2 1 Configuring Kerberos for Lustre 11 2 11 2 2 Types of Lustre Kerberos Flavors 11 11 Bonding 12 1 12 1 12 2 12 3 12 4 12 5 Network Bonding 12 1 Requirements 12 2 Using Lustre with Multiple NICs versus Bonding NICs 12 4 Bonding Module Parameters 12 5 Setting Up Bonding 12 5 12 5 1 Examples 12 9 Configuring Lustre with Bonding 12 11 12 6 1 Bonding References 12 11 Upgrading and Downgrading Lustre 13 1 13 1 13 2 13 3 13 4 13 5 Supported Upgrades 13 2 Lustre Interoperability 13 2 Upgrading Lustre 1 6 x to 1 8 x 13 3 13 3 1 Performing a Complete File System Upgrade 13 4 13 3 2 Performing a Rolling Upgrade 13 6 Upgrading Lustre 1 8 x to the Next Minor Version 13 8 Downgrading from Lustre 1 8 x to 1 6 x 13 8 13 5 1 Performing a Complete File System Downgrade 13 9 13 5 2 Performing a Rolling Downgrade 13 11 Contents xi 14 Lustre SNMP Module 14 1 141 Installing the Lustre SNMP Module 14 2 142 Building the Lustre SNMP Module 14 2 143 Using the Lustre SNMP Module 14 3 15 Backup and Restore 15 1 15 1 Lustre Backups 15 1 15 1 1 File System level Backups 15 1 15 1 2 Device level Backups 15 2 15 1 3 Backing Up the MDS 15 2 15 1 4 Backing Up an OST 15 3 15 1 5 Performing File level Backups 15 4 15 2 Restoring from a File level Backup 15 4 15 3 LVM Snapshots on Lustre Targets 15 5 15 3 1 Creating LVM Snapshot Volumes 15 6 15 3 2 Deleting Old Snapshots 15 8 15 3 3 Changing Snapshot Volume Size 15 8 16
150. 2 and 8 OSTs up to 8 TB each The MDT OSTs and Lustre clients can run concurrently in any mixture on a single node However a typical configuration is an MDT on a dedicated node two or more OSTs on each OSS node and a client on each of a large number of compute nodes a Object Storage Target OST The OST stores file data chunks of user files as data objects on one or more OSSs A single Lustre file system can have multiple OSTs each serving a subset of file data There is not necessarily a 1 1 correspondence between a file and an OST To optimize performance a file may be spread over many OSTs A Logical Object Volume LOV manages file striping across many OSTs Lustre clients Lustre clients are computational visualization or desktop nodes that run Lustre software that allows them to mount the Lustre file system The Lustre client software consists of an interface between the Linux Virtual File System and the Lustre servers Each target has a client counterpart Metadata Client MDC Object Storage Client OSC and a Management Client MGC A group of OSCs are wrapped into a single LOV Working in concert the OSCs provide transparent access to the file system Clients which mount the Lustre file system see a single coherent synchronized namespace at all times Different clients can write to different parts of the same file at the same time while other clients can read from the file Lustre includes several additional
151. 20.7 Lockless I/O Tunables
The lockless I/O tunable feature allows servers to ask clients to do lockless I/O (liblustre-style, where the server does the locking) on contended files. The lockless I/O patch introduces these tunables:
OST-side: /proc/fs/lustre/ldlm/namespaces/filter-lustre-*
- contended_locks: If the number of lock conflicts in the scan of granted and waiting queues exceeds contended_locks, the resource is considered to be contended.
- contention_seconds: The resource keeps itself in a contended state for the time set in this parameter.
- max_nolock_bytes: Server-side locking is set only for requests smaller than the size set in the max_nolock_bytes parameter. If this tunable is set to zero (0), it disables server-side locking for read/write requests.
Client-side: /proc/fs/lustre/llite/lustre-*
- contention_seconds: The llite inode remembers its contended state for the time specified in this parameter.
Client-side statistics: The /proc/fs/lustre/llite/lustre-*/stats file has new rows for lockless I/O statistics, lockless_read_bytes and lockless_write_bytes, which count the total bytes read or written. The client makes its own decisions based on the request size. If the request size is smaller than min_nolock_size, the client does not communicate with the server and does not acquire locks.
20.8 Data Checksums
To avoid the risk of data corruption on t
152. 22 18 ost_survey tool 18 11 OSTs adding 4 10 P performance tips 21 7 performing direct I O 24 15 Perl 3 3 PIOS examples 18 18 PIOS I O mode COW I O 18 14 DIRECT I O 18 14 POSIX I O 18 14 PIOS I O modes 18 14 PIOS parameter ChunkSize c 18 15 Index 5 Offset o 18 16 RegionCount n 18 15 RegionSize s 18 15 ThreadCount t 18 15 PIOS tool 18 12 platforms supported 3 2 plot Ilstat sh utility 31 20 Portals LND Catamount 30 18 Linux 30 15 POSIX debugging 16 5 installing 16 2 running tests against Lustre 16 4 POSIX I O 18 14 power equipment 8 3 power management software 8 3 prerequisites 3 2 proc entries debug support 22 30 free space distribution 22 11 LNET information 22 9 locating filesystems and servers 22 2 locking 22 28 timeouts 22 3 Q QSW LND 30 10 Quadrics Elan 2 2 quota limits 9 11 quota statistics 9 13 quotas administering 9 4 allocating 9 7 creating files 9 4 enabling 9 2 file formats 9 12 granted cache 9 10 known issues 9 10 limits 9 11 statistics 9 13 working with 9 1 R ra RapidArray 2 2 RAID Index 6 Lustre 1 8 Operations Manual October 2009 creating an external journal 10 6 formatting options 10 5 handling degraded arrays 10 7 insights into disk performance measurement 10 7 performance tradeoffs 10 5 reliability best practices 10 3 selecting storage for MDS or OSTs 10 2 software RAID 10 8 understanding double failures with
153. 24 5 OST Pools 24 12 24 5 1 Working with OST Pools 24 13 24 5 2 Tips for Using OST Pools 24 15 24 6 Performing Direct I O 24 15 24 6 1 Making File System Objects Immutable 24 15 247 Other I O Options 24 16 24 7 1 LustreChecksums 24 16 24 8 Striping Using llapi 24 18 xviii Lustre 1 8 Operations Manual October 2009 25 26 Lustre Security 25 1 25 1 Using ACLs 25 1 25 1 1 How ACLs Work 25 1 25 12 Using ACLs with Lustre 25 2 25 1 3 Examples 25 3 Using Root Squash 25 4 25 2 1 Configuring Root Squash 25 4 25 2 2 Enabling and Tuning Root Squash 25 4 25 2 3 Tips on Using Root Squash 25 6 Lustre Operating Tips 26 1 26 1 26 2 26 3 26 4 26 5 Adding an OST to a Lustre File System 26 2 A Simple Data Migration Script 26 3 Adding Multiple SCSI LUNs on Single HBA 26 5 Failures Running a Client and OST on the Same Machine 26 5 Improving Lustre Metadata Performance While Using Large Directories 26 6 Contents xix Part V Reference 27 User Utilities mani 27 1 lfs 27 2 lfsck 27 12 Filefrag 27 20 27 1 27 2 27 3 27 4 27 5 Mount 27 22 Handling Timeouts 27 23 28 Lustre Programming Interfaces man2 28 1 User Group Cache Upcall 28 1 28 1 28 1 1 28 1 2 28 1 3 28 1 4 Name 28 1 Description 28 2 Parameters 28 3 Data structures 28 3 29 Setting Lustre Properties man3 29 1 Using Ilapi 29 1 29 1 29 1 1 29 1 2 29 1 3 29 1 4 29 1 5 Ilapi_file_create 29 1 Ilapi_file_get_s
154. 3. Run the obdfilter_survey script with the parameter case=netdisk. For example:
    nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey
To perform a manual run:
1. Run the obdfilter_survey script and tell the script the names of all echo_client instances, which should be up and running already.
    nobjhi=2 thrhi=2 size=1024 targets=<osc_name> sh obdfilter-survey
18.2.2.4 Output Files
When the obdfilter_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by ${rslt}.
    File                  Description
    ${rslt}.summary       Same as stdout
    ${rslt}.script_*      Per-host test script files
    ${rslt}.detail_tmp*   Per-OST result files
    ${rslt}.detail        Collected result files for post-mortem
The obdfilter_survey script iterates over the given number of threads and objects, performing the specified tests, and checks that all test processes have completed successfully.
Note: The obdfilter_survey script may not clean up properly if it is aborted or if it encounters an unrecoverable error. In this case, a manual cleanup may be required, possibly including killing any running instances of lctl (local or remote), removing echo_client instances created by the script and unloading obdecho.
18.2.2.5 Script Output
The summary file and stdout of the obdfilter_survey script contain li
155. 5. Restore the file system backup. Run:
    tar xzvpf {backup file}
(Footnote 1: In Lustre, each OST object has an EA that contains the MDT inode number and stripe index for the object. The EA's striping information includes the location of file data on the OST.)
6. Restore the file system EAs. Run:
    setfattr --restore=ea.bak
(not required for OST devices)
7. Remove the recovery logs (now invalid). Run:
    rm OBJECTS/* CATALOGS
8. Remove the lov_objids file if the MDT is being restored from backup but the OSTs have been modified since the backup was created. The lov_objids file will be recreated when the MDS next connects to the OSTs.
Note: If only one of the MDT or OST file systems is being restored from backup, but the rest of the file system has been modified since the backup was created, run the lfsck tool (part of e2fsprogs) to make sure that the file system is coherent. It is not necessary to run this tool if the backup of all device file systems occurs at the same time after the Lustre file system is stopped. The file system should be immediately usable without running lfsck. There may be a few I/O errors caused by reading from files that are present on the MDS but not on the OSTs. Files created after the MDS backup are not visible or accessible.
15.3 LVM Snapshots on Lustre Targets
Another backup option is to leverage the Linux LVM snapshot mechanism to maintain mu
156. 6 rpm libpopt gt popt 1 7 274 i586 rpm librpm gt rpm 4 1 1 222 1586 rpm glib gt glib 2 6 1 2 1586 rpm glib devel gt glib devel 2 6 1 2 i586 rpm Chapter 8 Failover 8 9 8 5 1 1 Configuring Heartbeat This section describes basic configuration of Heartbeat with and without STONITH Note LNET does not support virtual IP addresses The IP address specified in the haresources file should be a dummy address valid but unused With later releases of Heartbeat you may avoid the use of virtual IPs but it is required in earlier releases Basic Configuration Without STONITH The http linux ha org website has several guides covering basic setup and initial testing of Heartbeat We suggest that you read them 1 Configure and test the Heartbeat setup before adding STONITH Let us assume there are two nodes nodeA and nodeB nodeA owns ost1 and nodeB owns ost2 Both the nodes are with dedicated Ethernet eth0 having serial crossover link dev ttySO Consider that both nodes are pinging to a remote host 192 168 0 3 for health 2 Create etc ha d ha cf m This file must be identical on both the nodes m Follow the specific order of the directives m Sample ha cf file Suggested fields logging debugfile var log ha debug logfile var log ha log logfacility local0 Required fields Timing keepalive 2 deadtime 30 initdead 120 If using serial Heartbeat baud 19200 serial dev ttys0 For E
157. b. Create a RAID array for an external journal. On the OSS, run:
    mdadm --create <array_device> -l <raid_level> -n <active_devices> -x <spare_devices> <block_devices>
where:
    <array_device>     RAID array to create, in the form of /dev/mdX
    <raid_level>       Architecture of the RAID array. RAID 1 is recommended for external journals.
    <active_devices>   Number of active disks in the RAID array, including mirrors.
    <spare_devices>    Number of spare disks initially assigned to the RAID array. More disks may be brought in via spare pooling (see below).
    <block_devices>    List of the block devices used for the RAID array; wildcards may be used.
For the worked example, the command is:
    mdadm --create /dev/md20 -l 1 -n 2 -x 0 /dev/dsk/c0t0d20p1 /dev/dsk/c1t0d20p1
This command output displays:
    mdadm: array /dev/md20 started.
We now have two arrays: a RAID 6 array for the OST (/dev/md20) and a RAID 1 array for the external journal (/dev/md20). The arrays will now be re-synced, a process which re-synchronizes the various disks in the array so their contents match. The arrays may be used during the re-sync process (including formatting the OSTs), but performance will not be as high as usual. The re-sync progress may be monitored by reading the /proc/mdstat file.
Next, you need to create a RAID array for an MDT. In this example, a RAID 10 array is create
158. 9 First release of Lustre 1 8 manual Lustre 1 8 Operations Manual 821 0035 10 B October 2009 Second release of Lustre 1 8 manual PART I Lustre Architecture Lustre is a storage architecture for clusters The central component is the Lustre file system a shared file system for clusters The Lustre file system is currently available for Linux and provides a POSIX compliant UNIX file system interface The Lustre architecture is used for many different kinds of clusters It is best known for powering seven of the ten largest high performance computing HPC clusters in the world with tens of thousands of client systems petabytes PBs of storage and hundreds of gigabytes per second GB sec of I O throughput Many HPC sites use Lustre as a site wide global file system servicing dozens of clusters on an unprecedented scale CHAPTER 1 Introduction to Lustre This chapter describes Lustre software and components and includes the following sections Introducing the Lustre File System Lustre Components Lustre Systems Files in the Lustre File System Lustre Configurations Lustre Networking Lustre Failover and Rolling Upgrades These instructions assume you have some familiarity with Linux system administration cluster systems and network technologies 1 1 1 1 1 2 Introducing the Lustre File System Lustre is a storage architecture for clusters The central component is the Lustre file system which is available fo
159. AID configuration does not allow <chunk_size> to fit evenly into 1 MB, select <chunk_size> such that <stripe_width> is close to 1 MB, but not larger. For example, RAID6 with 6 disks has 4 data and 2 parity disks, so we get <chunksize> <= 1024kB/4; either 256kB, 128kB or 64kB. The <stripe_width> value must equal <chunksize> * (<disks> - <parity_disks>). Use it for OST file systems only (not MDT file systems).
    mkfs.lustre --mountfsoptions="stripe=<stripe_width_blocks>"
External journal: Use RAID1 with two partitions of 400 MB (or more), each from disks on different controllers. To set up the journal device (/dev/mdJ), run:
    mke2fs -O journal_dev -b 4096 /dev/mdJ
Then run --reformat on the file system device (/dev/mdX), specifying the RAID geometry to the underlying ldiskfs file system, where <chunk_blocks> = <chunksize> / 4096 and <stripe_width_blocks> = <stripe_width> / 4096:
    mkfs.lustre --reformat --mkfsoptions "-j -J device=/dev/mdJ -E stride=<chunk_blocks>" /dev/mdX
Reliability Best Practices
It is considered mandatory that you use disk monitoring software, so rebuilds happen without any delay. We recommend backups of the metadata file systems. This can be done with LVM snapshots or using raw partition backups.
10.1.3 Understanding Double Failures with Hardware and Software RAID5
Software RAID does not
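The chunk-size arithmetic above can be worked through in a few lines. The sketch below assumes power-of-two chunk sizes in kB and a 4096-byte block size (as in the formulas above); the function name `raid_geometry` and its return keys are hypothetical.

```python
def raid_geometry(total_disks, parity_disks, block_size=4096, max_width_kb=1024):
    """Pick the largest power-of-two chunk whose full stripe fits in 1 MB.

    Returns the chunk size and stripe width in kB, plus both values
    expressed in file system blocks (chunk_blocks, stripe_width_blocks).
    """
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("need at least one data disk")
    chunk_kb = 1
    # Double the chunk while the resulting stripe width stays within the limit.
    while chunk_kb * 2 * data_disks <= max_width_kb:
        chunk_kb *= 2
    stripe_width_kb = chunk_kb * data_disks
    return {
        "chunk_kb": chunk_kb,
        "stripe_width_kb": stripe_width_kb,
        "chunk_blocks": chunk_kb * 1024 // block_size,
        "stripe_width_blocks": stripe_width_kb * 1024 // block_size,
    }


# RAID6 with 6 disks: 4 data disks and 2 parity disks, as in the text.
print(raid_geometry(6, 2))
# RAID6 with 7 disks: 5 data disks, so 1 MB is not evenly divisible.
print(raid_geometry(7, 2))
```

For the 6-disk case this selects the 256 kB option from the list above; the smaller listed sizes (128 kB, 64 kB) also satisfy the limit, so the choice of the largest is a design assumption of this sketch.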
160. CSI devices in the test specification.
Caution: The sgpdd_survey script overwrites the device being tested, which results in the LOSS OF ALL DATA on that device. Exercise caution when selecting the device to be tested.
The sgpdd_survey script must be customized according to the particular device being tested and also according to the location where it should keep its working files. Customization variables are described explicitly at the start of the script.
When the sgpdd_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by the script variable ${rslt}.
    ${rslt}_<date/time>.summary    same as stdout
    ${rslt}_<date/time>_*          tmp files
    ${rslt}_<date/time>.detail     collected tmp files for post-mortem
The summary file and stdout should contain lines like this:
    total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 = 180.50 MB/s
The number immediately before the first MB/s is bandwidth, computed by measuring total data and elapsed time. The remaining numbers are a check on the bandwidths reported by the individual sgp_dd instances.
If there are so many threads that the sgp_dd script is unlikely to be able to allocate I/O buffers, then ENOMEM is printed. If one or more sgp_dd instances do not successfully report a bandwidth number, then failed is printed.
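When many summary lines need to be compared, the fields can be pulled out programmatically. The sketch below assumes the line layout of the sample above, with decimal bandwidth values followed by "MB/s"; the regular expression and the name `parse_summary_line` are illustrative and not part of the I/O kit.

```python
import re

# Assumed layout of one summary line, based on the sample in the text.
SUMMARY_RE = re.compile(
    r"total_size\s+(?P<total_kb>\d+)K\s+"
    r"rsz\s+(?P<rsz>\d+)\s+"
    r"thr\s+(?P<thr>\d+)\s+"
    r"crg\s+(?P<crg>\d+)\s+"
    r"(?P<bandwidth>\d+(?:\.\d+)?)\s+MB/s"
)


def parse_summary_line(line):
    """Return the parsed fields, or None when no bandwidth was reported
    (for example when the line carries ENOMEM or failed instead)."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {
        "total_kb": int(match.group("total_kb")),
        "rsz": int(match.group("rsz")),
        "thr": int(match.group("thr")),
        "crg": int(match.group("crg")),
        "bandwidth_mb_s": float(match.group("bandwidth")),
    }


line = ("total_size 8388608K rsz 1024 thr 1 crg 1 "
        "180.45 MB/s 1 x 180.50 = 180.50 MB/s")
print(parse_summary_line(line))
```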
161. Cs complete in finite time in the presence of failures. These timeouts should always be printed as console messages. If Lustre timeouts are not accompanied by LNET timeouts, then you need to increase the Lustre timeout on both servers and clients.
Specific Lustre timeouts are described below.
/proc/sys/lustre/timeout
This is the time period that a client waits for a server to complete an RPC (default is 100s). Servers wait half of this time for a normal client RPC to complete and a quarter of this time for a single bulk request (read or write of up to 1 MB) to complete. The client pings recoverable targets (MDS and OSTs) at one quarter of the timeout, and the server waits one and a half times the timeout before evicting a client for being "stale."
Note: Lustre sends periodic 'PING' messages to servers with which it had no communication for a specified period of time. Any network activity on the file system that triggers network traffic toward servers also works as a health check.
/proc/sys/lustre/ldlm_timeout
This is the time period for which a server will wait for a client to reply to an initial AST (lock cancellation request), where default is 20s for an OST and 6s for an MDS. If the client replies to the AST, the server will give it a normal timeout (half of the client timeout) to flush any dirty data and release the lock.
/proc/sys/lustre/fail_loc
This is the internal debugging failure hook. See
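The intervals derived from the Lustre timeout can be tabulated directly from the ratios given above. This is only an arithmetic sketch of those stated ratios; the function name and dictionary keys are hypothetical.

```python
def derived_timeouts(timeout_s=100.0):
    """Intervals derived from /proc/sys/lustre/timeout, per the text above."""
    return {
        "client_rpc_wait": timeout_s,              # client waits for an RPC
        "server_normal_rpc_wait": timeout_s / 2,   # normal client RPC
        "server_bulk_wait": timeout_s / 4,         # single bulk request
        "client_ping_interval": timeout_s / 4,     # pings to recoverable targets
        "server_eviction_wait": timeout_s * 1.5,   # before evicting a stale client
    }


for name, seconds in derived_timeouts().items():
    print(f"{name}: {seconds:g}s")
```

With the default of 100s this gives 50s and 25s for the server-side waits, a 25s ping interval and a 150s eviction window, which helps when judging how a changed timeout affects failure detection.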
162. D 8 Start Lustre Once all the machines have rebooted the next steps are to configure Lustre Networking LNET and the Lustre file system See Configuring Lustre Chapter 3 Installing Lustre 3 21 3 22 Lustre 1 8 Operations Manual October 2009 CHAPTER 4 Configuring Lustre You can use the administrative utilities provided with Lustre to set up a system with many different configurations This chapter shows how to configure a simple Lustre system comprised of a combined MGS MDT an OST and a client and includes the following sections m Configuring the Lustre File System m Additional Lustre Configuration m Basic Lustre Administration More Complex Configurations m Operational Scenarios 4 1 4 1 Configuring the Lustre File System A Lustre file system consists of four types of subsystems a Management Server MGS a Metadata Target MDT Object Storage Targets OSTs and clients We recommend running these components on different systems although technically they can co exist on a single system Together the OSSs and MDS present a Logical Object Volume LOV which is an abstraction that appears in the configuration It is possible to set up the Lustre system with many different configurations by using the administrative utilities provided with Lustre Some sample scripts are included in the directory where Lustre is installed If you have installed the Lustre source code the scripts are located in the lus
163. IPADDR=192.168.10.79 # Use the free IP Address of your network
    NETWORK=192.168.10.0
    NETMASK=255.255.255.0
    USERCTL=no
    BOOTPROTO=none
    ONBOOT=yes
3. Attach one or more slave interfaces to the bond interface. Modify the eth0 and eth1 configuration files (using a VI text editor).
a. Use the VI text editor to open the eth0 configuration file.
    vi /etc/sysconfig/network-scripts/ifcfg-eth0
b. Modify/append the eth0 file as follows:
    DEVICE=eth0
    USERCTL=no
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
c. Use the VI text editor to open the eth1 configuration file.
    vi /etc/sysconfig/network-scripts/ifcfg-eth1
d. Modify/append the eth1 file as follows:
    DEVICE=eth1
    USERCTL=no
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
4. Set up the bond interface and its options in /etc/modprobe.conf. Start the slave interfaces by your normal network method.
    vi /etc/modprobe.conf
a. Append the following lines to the file.
    alias bond0 bonding
    options bond0 mode=balance-alb miimon=100
b. Load the bonding module.
    modprobe bonding
    ifconfig bond0 up
    ifenslave bond0 eth0 eth1
5. Start/restart the slave interfaces (using your normal network method).
Note: You must modprobe the bonding module for each bonded interface. If you wish to create bond0 and bond1, two entries in modprobe.conf are required.
164. DT can be started after the MGS starts.
13.3 Upgrading Lustre 1.6.x to 1.8.x
Two upgrade paths are supported to meet the upgrade requirements of different Lustre environments.
- Complete file system: All servers and clients are shut down and upgraded at the same time. See Performing a Complete File System Upgrade.
- Rolling upgrade: Individual servers (or their failover partners) and clients are upgraded one at a time, so the file system never goes down. See Performing a Rolling Upgrade.
Note: If you upgrade some Lustre components to 1.8.x but not others (such as running 1.8 clients in a file system with 1.6 OSTs) and run a mixed environment, you may see one or more warnings similar to this:
    LustreError: 3877:0:(socklnd_cb.c:2228:ksocknal_recv_hello()) Unknown protocol version (2.x expected) from 192.168.2.43
This warning is given when the 1.6 and 1.8 components use different protocols. It can be safely ignored because the Lustre components negotiate a common protocol. In this example, the 1.8 clients fall back to use the 1.6 protocol with the 1.6 OSTs.
13.3.1 Performing a Complete File System Upgrade
This procedure describes a complete file system upgrade in which 1.8.x Lustre packages are installed on multiple 1.6.x servers and clients, requiring a file system shut down. If you want to upgrade one Lustre co
165. EBUG to the debug_buffer. Debug_daemon puts 'DEBUG MARKER: Trace buffer full' into the debug_buffer to indicate that the debug_buffer is overlapping itself before debug_daemon flushes data to a file.
Users can use lctl control to start or stop the Lustre daemon from dumping the debug_buffer to a file. Users can also temporarily hold the daemon from dumping the file. Use of the debug_daemon sub-command to lctl can provide the same function.
23.2.1.1 lctl Debug Daemon Commands
This section describes lctl daemon debug commands.
    lctl debug_daemon start file megabytes
Initiates the debug_daemon to start dumping the debug_buffer into a file. The file can be a system default file, as shown in /proc/sys/lnet/debug_path. After Lustre starts, the default path is /tmp/lustre-log-$HOSTNAME. Users can specify a new filename for debug_daemon to output the debug_buffer. The new file name shows up in /proc/sys/lnet/debug_path. Megabytes is the limit on the file size, in MB.
The daemon wraps around and dumps data to the beginning of the file when the output file size is over the limit of the user-specified file size. To decode the dumped file to ASCII and order the log entries by time, run:
    lctl debug_file file > newfile
The output is internally sorted by the lctl command using quicksort.
    debug_daemon stop
Completely shuts down the debug_daemon operation and flushes the file output. Otherwise, debug_daemon is shut down as
166. For example:
    targets="oss01:oss01-sdb oss01:oss01-sdd oss02:oss02-sdi" ./obdfilter-survey
Running obdfilter_survey Against a Network
The obdfilter_survey script can only be run automatically against a network; no manual test is supported. To run the network test, a specific Lustre setup is needed. Make sure that these configuration requirements have been met:
- Install all Lustre modules, including obdecho.
- Start lctl and check the device list, which must be empty.
- Use a password-less entry between the client and server machines, to avoid having to type the password.
To perform an automatic run:
1. Run the obdfilter_survey script with the parameters case=network and targets=<hostname/ip_of_server>. For example:
    nobjhi=2 thrhi=2 size=1024 targets=<hostname/ip_of_server> case=network sh obdfilter-survey
On the server side, you can see the statistics at /proc/fs/lustre/obdecho/<echo_srv>/stats, where <echo_srv> is the obdecho server created by the script.
18.2.2.3 Running obdfilter_survey Against a Network Disk
The obdfilter_survey script can be run automatically or manually against a network disk. To run the network disk test, create a Lustre configuration using normal methods; no special setup is needed.
To perform an automatic run:
1. Set up the Lustre file system with the required OSTs.
2. Verify that the obdecho.ko module is present
167. How do I configure recoverable failover object servers How do I resize an MDS OST file system How do I backup restore a Lustre file system How do I control multiple services on one node independently What extra resources are required for automated failover Is there a way to tell which OST is being used by a client process I need multiple SCSI LUNs per HBA what is the best way to do this A 1 A 2 Can I run Lustre in a heterogeneous environment 32 and 64 bit machines How to build and configure Infiniband support for Lustre Can the same Lustre file system be mounted at multiple mount points on the same client system How do I identify files affected by a missing OST How To New Lustre network configuration How to fix bad LAST_ID on an OST Why can t I run an OST and a client on the same machine Information on the Socket LND socklnd protocol Information on the Lustre Networking LNET protocol Explanation of previously skipped similar messages in Lustre logs What should I do if I suspect device corruption Example disk errors How do I clean up a device with Ictl What is the default block size for Lustre How do I determine which Lustre server MDS OST was connected to a particular storage device Does the mount option bind allow mounting a Lustre file system to multiple directories on the same client system What operations take place in Lustre when a new file is created Questions ab
168. ID: Client PID and NID. xid: rq_xid. length: Size of the request message. phase: New (waiting to be handled or could not be unpacked), Interpret (unpacked or being handled), Complete (handled). svc specific: Service-specific request printout. Currently, the only service that does this is the OST, which prints the opcode if the message has been unpacked successfully. 23.6 Using LWT Tracing: Lustre offers a very lightweight tracing facility called LWT. It prints fixed-size requests into a buffer and is much faster than LDEBUG. The LWT tracing facility is very useful for debugging difficult problems. LWT trace-based records that are dumped contain: the current CPU; the process counter; a pointer to the file; a pointer to the line in the file; 4 void pointers. An lctl command dumps the logs to files. Part IV Lustre for Users: This part includes chapters on Lustre striping and I/O options, security, and operating tips. CHAPTER 24 Striping and I/O Options: This chapter describes file striping and I/O options, and includes the following sections: File Striping; Displaying Files and Directories with lfs getstripe; lfs setstripe (Setting File Layouts); Managing Free Space in Lustre; OST Pools; Performing Direct I/O; Other I/O Options; Striping Using llapi. 24.1 File Striping: Lustre stores files of one or more
169. LMT was developed by Lawrence Livermore National Lab (LLNL) and continues to be maintained by LLNL. 2. Lustre client monitoring is not supported. systems. For each OST, the current read rate, write rate in MB/s, CPU and full are displayed. On a per-file-system basis, aggregate MB/s is shown. This is a sample LMT screen: FIGURE 21-1 LMT sample screen. For more information on LMT, including the setup procedure, see http://sourceforge.net/projects/lmt Red Hat Cluster Manager: The Red Hat Cluster Manager provides high-availability features that are essential for data integrity, application availability and uninterrupted service under various failure conditions. You can use the Cluster Manager to test MDS/OST failure in Lustre clusters. To use Cluster Manager to test MDS failover, specific hardware is required: a compute node, OSTs, and two machines to act as the active and failover MDSs. The MDS nodes need to be able to see the same shared storage, so you need to prepare a shared disk for the Cluster Manager and the MDSs. Several RPM packages are also required, along with certain configuration changes. The Lustre Group has made several scripts available for MDS failover testing. For more information on the Cluster Manager bundled in the Red Hat Cluster Suite, see the Red Hat Cluster Suite Supporting documentation is
170. LND name. Note: Depending on the Linux distribution, options with included commas may need to be escaped using single and/or double quotes. Worst-case quotes look like: options lnet'networks="tcp0,elan0"' 'routes="tcp [2,10]@elan0"' Additional quotes may confuse some distributions. Check for messages such as: lnet: Unknown parameter 'networks' After modprobe LNET, remove the additional single quotes (modprobe.conf in this case). Additionally, the "refusing connection - no matching NID" message generally points to an error in the LNET module configuration. Note: By default, Lustre ignores the loopback (lo0) interface. Lustre does not ignore IP addresses aliased to the loopback. In this case, specify all Lustre networks. The liblustre network parameters may be set by exporting the environment variables LNET_NETWORKS, LNET_IP2NETS and LNET_ROUTES. Each of these variables uses the same parameters as the corresponding modprobe option. Note: It is very important that a liblustre client includes ALL the routers in its setting of LNET_ROUTES. A liblustre client cannot accept connections; it can only create connections. If a server sends remote procedure call (RPC) replies via a router to which the liblustre client has not already connected, then these RPC replies are lost. Note: Liblustre is not for general use. It was created to work with specific hardware (Cray) and s
171. Lustre, 12-11; module parameters, 12-5; references, 12-11; requirements, 12-2; setting up, 12-5; bonding NICs, 12-4; Bonnie benchmark, 17-2; build tool compiler, 3-3; building Lustre SNMP module, 14-2. C: calculating MDS memory requirements, 3-6; OSS memory requirements, 3-7; capacity, system, 1-14; Cisco Topspin (cib), 2-2; client read/write extents survey, 22-16; offset survey, 22-15; clients, adding, 4-10; command filefrag, 27-20; lfsck, 27-12; llapi, 29-1; mount, 27-22; command lfs, 27-2; complicated configurations, multihomed servers, 7-1; components, Lustre, 1-5; configuration, module setup, 4-10; configuration example, Lustre, 4-5; configuration, more complex, failover, 4-28; configuring adaptive timeouts, 22-6; LNET, 2-5; root squash, 25-4; configuring Lustre, 4-2; COW I/O, 18-14; Cray Seastar, 2-2. D: DDN tuning, setting maxcmds, 20-11; setting readahead and MF, 20-8; setting segment size, 20-9; setting write-back cache, 20-10; debugging, adding debugging to source code, 23-10; controlling the kernel debug log, 23-7; daemon, 23-5; debugging in UML, 23-12; finding Lustre UUID of an OST, 23-16; finding memory leaks, 23-9; lctl tool, 23-7; looking at disk content, 23-14; messages, 23-2; printing to /var/log/messages, 23-9; Ptlrpc request history, 23-16; sample lctl run, 23-10; tcpdump, 23-16; tools, 23-4; tracing lock traffic, 23-9; Debugging failures, 16-5; debugging tools, 3-4; designing a Lustre network, 2-3; device-level backup, 15-2
172. MDS, just replace osc with mdc and ost with mds in the above command. I need multiple SCSI LUNs per HBA; what is the best way to do this? The packaged kernels are configured approximately the same as the upstream RedHat and SuSE packages. Currently, RHEL does not enable CONFIG_SCSI_MULTI_LUN because it is said to cause problems with some SCSI hardware. If you need to enable this, you must set option scsi_mod max_scsi_luns=xx (xx is typically 128) in either modprobe.conf (2.6 kernel) or modules.conf (2.4 kernel). Passing this option as a kernel boot argument (in grub.conf or lilo.conf) will not work unless the kernel is compiled with CONFIG_SCSI_MULTI_LUN=y. Can I run Lustre in a heterogeneous environment (32- and 64-bit machines)? As of Lustre v1.4.2, this is supported with different word sizes. It is also supported for clients with different endianness (for example, i386 and PPC). One limitation is that the PAGE_SIZE on the client must be at least as large as the PAGE_SIZE of the server. In particular, ia64 clients with large pages (up to 64KB pages) can run with i386 servers (4KB pages). If i386 clients are running with ia64 servers, the ia64 kernel must be compiled with 4KB PAGE_SIZE. How do I clean up a device with lctl? How do I destroy this object using lctl, based on the following information: lctl > device_list 0 UP obdfilter ost003_s1 ost003_s1_UUID 3 1 UP ost OSS OSS_UUID 2 2 UP echo_client ost003_s1_client 2b98a
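The page-size rule for heterogeneous clusters in the entry above can be written as a one-line check. This is only an illustrative sketch: the function name and the byte values passed to it are assumptions made for the example and are not part of Lustre.

```python
def page_sizes_compatible(client_page_size: int, server_page_size: int) -> bool:
    """Return True when the client PAGE_SIZE is at least the server PAGE_SIZE."""
    return client_page_size >= server_page_size


KB = 1024

# ia64 client with 64KB pages against an i386 server with 4KB pages: allowed.
print(page_sizes_compatible(64 * KB, 4 * KB))

# i386 client (4KB pages) against an ia64 server built with 64KB pages: not allowed.
print(page_sizes_compatible(4 * KB, 64 * KB))

# Same i386 client once the ia64 server kernel is built with 4KB pages: allowed.
print(page_sizes_compatible(4 * KB, 4 * KB))
```

The comparison is deliberately one-directional: a client with larger pages than the server is acceptable, the reverse is not.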
173. MDS node, run: [root@mds /]# mount -t lustre /dev/sdb /mnt/mdt This command generates this output: Lustre: temp-MDT0000: new disk, initializing Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) temp-MDT0000: group upcall set to /usr/sbin/l_getgroups Lustre: temp-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups Lustre: Server temp-MDT0000 on device /dev/sdb has started 4. Create the OSTs. In this example, the OSTs (ost1 and ost2) are being created on different OSSs (oss1 and oss2). a. Create ost1. On oss1 node, run: [root@oss1 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sdc The command generates this output: Permanent disk data: Target: temp-OSTffff Index: unassigned Lustre FS: temp Mount type: ldiskfs Flags: 0x72 (OST needs_index first_time update) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=10.2.0.1@tcp checking for existing Lustre data: not found device size = 16MB 2 6 18 formatting backing filesystem ldiskfs on /dev/sdc target name temp-OSTffff 4k blocks 0 options -I 256 -q -O dir_index,uninit_groups -F mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OSTffff -I 256 -q -O dir_index,uninit_groups -F /dev/sdc Writing CONFIGS/mountdata b. Create ost2. On oss2 node, run: [root@oss2 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sdd The command generates this o
174. MDT size needed to support the file system. When calculating the MDT size, the only important factor is the number of files to be stored in the file system. This determines the number of inodes needed, which drives the MDT sizing. For more information, see Sizing the MDT and Planning for Inodes. Make sure the MDT is properly sized before performing the next step, as a too-small MDT can cause the space on the OSTs to be unusable. b. Create the MGS/MDT file system on the block device. On the MDS node, run: mkfs.lustre --fsname=<fsname> --mgs --mdt <block device name> The default file system name (fsname) is lustre. Note: If you plan to generate multiple file systems, the MGS should be on its own dedicated block device. 4. Mount the combined MGS/MDT file system on the block device. On the MDS node, run: mount -t lustre <block device name> <mount point> 5. Create the OST. On the OSS node, run: mkfs.lustre --ost --fsname=<fsname> --mgsnode=<NID> <block device name> You can have as many OSTs per OSS as the hardware or drivers allow. You should use only 1 OST per block device. Optionally, you can create an OST which uses the raw block device and does not require partitioning. Note: If the block device has more than 8 TB of storage, it must be partitioned because of the ext3 file system limitation. Lustre can support block devices with multiple partitions, but they are
175. [Histogram output: per-PID read/write calls by extent size (0K-4K through 64K-128K) for PIDs 11424, 11426 and 11429; snapshot_time 1213828762.204440 secs.usecs] 22.2.5 Watching the OST Block I/O Stream: Similarly, there is a brw_stats histogram in the obdfilter directory which shows you the statistics for number of I/O requests sent to the disk, their size, and whether they are contiguous on the disk or not. cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats snapshot_time: 1174875636.764630 (secs.usecs) read | write pages per brw: brws % cum % | rpcs % cum % 0 0 0 | 0 0 0 read | write discont pages: rpcs % cum % | rpcs % cum % 0 0 0 | 0 0 0 read | write discont blocks: rpcs % cum % | rpcs % cum % 0 0 0 | 0 0 0 read | write dio frags: rpcs % cum % | rpcs % cum % 0 0 0 | 0 0 0 read | write disk ios in flight: rpcs % cum % | rpcs % cum % 0 0 0 | 0 0 0 read | write io time (1/1000s): rpcs % cum % | rpcs % cum % 0 0 0 | 0 0 0 read | write disk io size: rpcs % cum % | rpcs % cum % 0 0 0 | 0
176. POSIX: 16.1 Installing POSIX; 16.2 Running POSIX Tests Against Lustre; 16.3 Isolating and Debugging Failures. 17 Benchmarking: 17.1 Bonnie Benchmark; 17.2 IOR Benchmark; 17.3 IOzone Benchmark. 18 Lustre I/O Kit: 18.1 Lustre I/O Kit Description and Prerequisites; 18.1.1 Downloading an I/O Kit; 18.1.2 Prerequisites to Using an I/O Kit; 18.2 Running I/O Kit Tests; 18.2.1 sgpdd_survey; 18.2.2 obdfilter_survey; 18.2.3 ost_survey; 18.3 PIOS Test Tool; 18.3.1 Synopsis; 18.3.2 PIOS I/O Modes; 18.3.3 PIOS Parameters; 18.3.4 PIOS Examples; 18.4 LNET Self Test; 18.4.1 Basic Concepts of LNET Self Test; 18.4.2 LNET Self Test Commands. 19 Lustre Recovery: 19.1 Recovering Lustre; 19.2 Types of Failure; 19.2.1 Client Failure; 19.2.2 MDS Failure and Failover; 19.2.3 OST Failure; 19.2.4 Network Partition; 19.3 Version-based Recovery; 19.3.1 Delayed Recovery; 19.3.2 Working with VBR; 19.3.3 Tips for Using VBR. Part III Lustre Tuning, Monitoring and Troubleshooting. 20 Lustre Tuning: 20.1 Module Options; 20.1.1 OSS Service Thread Count; 20.1.2 MDS Service Thread Count; 20.2 LNET Tunables; 20.3 Options to Format MDT and OST File Systems; 20.3.1 Plan
177. Printing a large number of messages to the kernel console can dramatically slow down the system. As this happens with IRQs disabled and for a slow console, it severely impacts overall system performance when there are a large number of messages. For example: LustreError: 559:0:(genops.c:1292:obd_export_evict_by_nid()) evicting b155 37b b426 ccc2 f 0a9 bfb 00000000 at adminstrative request LustreError: 559:0:(genops.c:1292:obd_export_evict_by_nid()) previously skipped 2 similar messages In this case, the similar messages are reported for the exact line of source without matching the text. Therefore, this is expected output for evictions of more than one client. What should I do if I suspect device corruption? (Example disk errors.) Keep these points in mind when trying to recover from device-induced corruption: Stop using the device as soon as possible (if you have a choice). The longer corruption is present on a device, the greater the risk that it will cause further corruption. Normally, ext3 marks the file system read-only if any corruption is detected or if there are I/O errors when reading or writing metadata to the file system. This can only be cleared by shutting down Lustre on the device (use --force or reboot if necessary). Proceed carefully. If you take incorrect action, you can make an otherwise-recoverable situation worse. ext3 has very robust metadata formats
178. S FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA. Contents. Part I Lustre Architecture. 1 Introduction to Lustre: 1.1 Introducing the Lustre File System; 1.1.1 Lustre Key Features; 1.2 Lustre Components; 1.2.1 Lustre Networking (LNET); 1.2.2 Management Server (MGS); 1.3 Lustre Systems; 1.4 Files in the Lustre File System; 1.4.1 Lustre File System and Striping; 1.4.2 Lustre Storage; 1.4.3 Lustre System Capacity; 1.5 Lustre Configurations; 1.6 Lustre Networking; 1.7 Lustre Failover and Rolling Upgrades. 2 Understanding Lustre Networking: 2.1 Introduction to LNET; 2.2 Supported Network Types; 2.3 Designing Your Lustre Network; 2.3.1 Identify All Lustre Networks; 2.3.2 Identify Nodes to Route Between Networks; 2.3.3 Identify Network Interfaces to Include/Exclude from LNET; 2.3.4 Determine Cluster-wide Module Configuration; 2.3.5 Deter
179. System Configuration Utilities (man8). 31.5.9 llobdstat: The llobdstat utility displays OST statistics. Synopsis: llobdstat ost_name interval Description: The llobdstat utility displays a line of OST statistics for a given OST at specified intervals (in seconds). Options: ost_name: Name of the OBD for which statistics are requested. interval: Time interval (in seconds) after which statistics are refreshed. Example: llobdstat liane-OST0002 1 /usr/bin/llobdstat on /proc/fs/lustre/obdfilter/liane-OST0002/stats Processor counters run at 2800.189 MHz Read: 1.21431e+07, Write: 9.93363e+08, create/destroy: 24/1499, stat: 34, punch: 18 [NOTE: cx: create, dx: destroy, st: statfs, pu: punch] Timestamp Read-delta ReadRate Write-delta WriteRate 1217026053 0.00MB 0.00MB/s 0.00MB 0.00MB/s 1217026054 0.00MB 0.00MB/s 0.00MB 0.00MB/s 1217026055 0.00MB 0.00MB/s 0.00MB 0.00MB/s 1217026056 0.00MB 0.00MB/s 0.00MB 0.00MB/s 1217026057 0.00MB 0.00MB/s 0.00MB 0.00MB/s 1217026058 0.00MB 0.00MB/s 0.00MB 0.00MB/s 1217026059 0.00MB 0.00MB/s 0.00MB 0.00MB/s st:1 Files: The llobdstat files are located at /proc/fs/lustre/obdfilter/<ostname>/stats 31.5.10 llstat: The llstat utility displays Lustre statistics. Synopsis: llstat [-c] [-g] [-i interval] stats_file Description: The llstat utility displays statistics from any of the Lustre statis
180. Time (in seconds) that communications may be stalled before the LND completes them with failure. Number of normal message descriptors for locally-initiated communications that may block for memory (callers block when this pool is exhausted). Number of reserved message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted. Maximum number of queue pairs and, therefore, the maximum number of peers that the instance of the LND may communicate with. Used to construct HCA device names by appending the device number. Used to construct IPoIB interface names by appending the same device number as is used to generate the HCA device name. Used to construct IPoIB interface names by appending the same device number as is used to generate the HCA device name. Low-level QP parameter. Only change it from the default value if so advised. Low-level QP parameter. Only change it from the default value if so advised. Variable / Description: rnr_nak_timer 0x10 Wc fmr_remaps 1000 cksum 0 W. Low-level QP parameter. Only change it from the default value if so advised. Controls how often FMR mappings may be reused before they must be unmapped. Only change it from the default value if so advised. Boolean that determines if messages (NB not RDMAs) should be check-summed. This is a diagnostic feature that
181. To redirect the strace output to a file (to review at a later time): strace -o <filename> <program> <args> Use the -ff option, along with -o, to save the trace output in filename.pid, where pid is the process ID of the process being traced. Use the -ttt option to timestamp all lines in the strace output, so they can be correlated to operations in the lustre kernel debug log. If the debugging is done in UML, save the traces on the host machine. In this example, hostfs is mounted on /r: strace -o /r/tmp/vi.strace Looking at Disk Content: In Lustre, the inodes on the metadata server contain extended attributes (EAs) that store information about file striping. EAs contain a list of all object IDs and their locations (that is, the OST that stores them). The lfs tool can be used to obtain this information for a given file via the getstripe sub-command. Use a corresponding lfs setstripe command to specify striping attributes for a new file or directory. The lfs getstripe utility is written in C; it takes a Lustre filename as input and lists all the objects that form a part of this file. To obtain this information for the file /mnt/lustre/frog in Lustre file system, run: lfs getstripe /mnt/lustre/frog OBDs: 0: OSC_localhost_UUID 1: OSC_localhost_2_UUID 2: OSC_localhost_3_UUID obdidx objid 0 F7 1 4 The debugfs tool is provided by the e2fsprogs package
182. U 1500 Metric 1 RX packets 3651769 errors 0 dropped 0 overruns 0 frame 0 TX packets 1643480 errors 0 dropped 0 overruns 0 carrier collisions 0 txqueuelen 100 Interrupt 9 Base address 0x1400 12.6 Configuring Lustre with Bonding: Lustre uses the IP address of the bonded interfaces and requires no special configuration. It treats the bonded interface as a regular TCP/IP interface. If needed, specify bond0 using the Lustre networks parameter in /etc/modprobe: options lnet networks=tcp(bond0) 12.6.1 Bonding References: We recommend the following bonding references: In the Linux kernel source tree, see Documentation/networking/bonding.txt; http://linux-ip.net/html/ether-bonding.html; http://www.sourceforge.net/projects/bonding (this is the bonding SourceForge website); http://linux-net.osdl.org/index.php/Bonding (this is the most extensive reference and we highly recommend it; this website includes explanations of more complicated setups, including the use of DHCP with bonding). CHAPTER 13 Upgrading and Downgrading Lustre: The chapter describes how to upgrade and downgrade between different Lustre versions and includes the following sections: Supported Upgrades; Lustre Interoperability; Upgrading Lustre 1.6.x to 1.8.x; Upgrading Lustre 1.8.x to t
183. Using the previous set quota example, running this lfs quota command: lfs quota -u bob -v /mnt/lustre displays this command output: Disk quotas for user bob (uid 500): Filesystem blocks quota limit grace files quota limit grace /mnt/lustre 0 307200 309200 0 10000 11000 lustre-MDT0000_UUID 0 0 102400 0 0 5000 lustre-OST0000_UUID 0 0 102400 lustre-OST0001_UUID 0 0 102400 9.1.3 Quota Allocation: The Linux kernel sets a default quota size of 1 MB. (For a block, the default is 128 MB. For files, the default is 5120.) Lustre handles quota allocation in a different manner; quota must be properly set or users may experience unnecessary failures. The file system block quota is divided up among the OSTs within the file system. Each OST requests an allocation which is increased up to the quota limit. The quota allocation is then quantized to reduce the amount of quota-related request traffic. By default, Lustre supports both user and group quotas to limit disk usage and file counts. The quota system in Lustre is completely compatible with the quota systems used on other file systems. The Lustre quota system distributes quotas from the quota master. Generally, the MDS is the quota master for both inodes and blocks. All OSTs and the MDS are quota slaves to the OSS nodes. The minimum transfer unit is 100 MB, to avoid performance impacts for quota adjustments. The file system block quota is divided u
184. Verify that the ea.bak file has properly backed up your EA data on the MDS. Without this EA data, your backup is not useful. You can look at this file with more or a text editor, and it should have an item for each file like: # file: ROOT/mds_md5sum3.txt trusted.lov=0s0AVRCWEAAABXOKUCAAAAAAAAAAAAAAAAAAAQAAEFAAADD5 QOAAAAA AAAAAAAAAAAAAAAAAAFAAAA 6. Back up all file system data. Type: tar czvf backup file tgz 7. Change out of the mounted file system. Type: cd 8. Unmount the file system. Type: umount /mnt/mds Follow the same process on each of the OST device file systems. The backup of the EAs (described in Step 4) is not currently required for OST devices, but this may change in the future. To restore the file-level backup, you need to format the device, restore the file data, and then restore the EA data. 9. Format the new device. The easiest way to get the optimal ext3 parameters is to use lconf --reformat config.xml ONLY ON THE NODE being restored. If there are multiple services on the node, then this reformats all of the devices on that node and should NOT be used. Instead, use the step below. a. For MDS file systems, use: mke2fs -j -J size=400 -I inode_size -i 4096 dev where inode_size is at least 512, and possibly larger if you have a default stripe count > 10 inode_size power_of_2_ gt _than 384 stripe_coun
185. When lfs quotacheck is run, Lustre must NOT be performing any write operations. Failure to follow this caution may cause the statistic information of quota to be inaccurate. For example, the number of blocks used by OSTs for users or groups will be inaccurate, which can cause unexpected quota problems. 2. Run the lfs command with the quotacheck option: lfs quotacheck -ug /mnt/lustre By default, quota is turned on after quotacheck completes. Available options are: -u checks the user disk quota information; -g checks the group disk quota information. The lfs quotacheck command checks all objects on all OSTs and the MDS to sum up for every UID/GID. It reads all Lustre metadata and re-computes the number of blocks/inodes that each UID/GID has used. If there are many files in Lustre, it may take a long time to complete. Note: User and group quotas are separate. If either quota limit is reached, a process with the corresponding UID/GID cannot allocate more space on the file system. Note: When lfs quotacheck runs, it creates a quota file, a sparse file with a size proportional to the highest UID in use and UID/GID distribution. As a general rule, if the highest UID in use is large, then the sparse file will be large, which may affect functions such as creating a snapshot. Note: For Lustre 1.6 releases before version 1.6.5, and 1.4 releases before version 1.4.12, if
186. X quota Proc quota entries are copied to /proc/fs/lustre/lquota. To maintain compatibility, old quota proc entries in the following folders are not deleted in the current Lustre release, although they may be deprecated in the future: /proc/fs/lustre/obdfilter/lustre-OSTXXXX and /proc/fs/lustre/mds/lustre-MDTXXXX. Only use the quota entries in /proc/fs/lustre/lquota. CHAPTER 10 RAID: This chapter describes software and hardware RAID, and includes the following sections: Considerations for Backend Storage; Insights into Disk Performance Measurement; Lustre Software RAID Support. 10.1 Considerations for Backend Storage: Lustre's architecture allows it to use any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices. This section surveys issues and recommendations regarding backend storage. 10.1.1 Selecting Storage for the MDS or OSTs: MDS: The MDS does a large amount of small writes. For this reason, we recommend that you use RAID1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID1+0 or RAID10. LVM is not recommended at this time for performance reasons. OST: A quick calculation (shown below) makes it clear that without furthe
187. _dqacq function. Quota Event / Description: nowait_for_pending_blk_quota_req (qctxt_wait_pending_dqacq): On the MDS or OSTs, there is one thread sending a quota request for a specific UID/GID for block quota at any time. When threads enter qctxt_wait_pending_dqacq, they do not need to wait. This is done in the qctxt_wait_pending_dqacq function. nowait_for_pending_ino_quota_req (qctxt_wait_pending_dqacq): On the MDS, there is one thread sending a quota request for a specific UID/GID for inode quota at any time. When threads enter qctxt_wait_pending_dqacq, they do not need to wait. This is done in the qctxt_wait_pending_dqacq function. quota_ctl: The quota_ctl statistic is generated when lfs setquota, lfs quota and so on, are issued. adjust_qunit: Each time qunit is adjusted, it is counted. 9.1.5.1 Interpreting Quota Statistics: Quota statistics are an important measure of a Lustre file system's performance. Interpreting these statistics correctly can help you diagnose problems with quotas, and may indicate adjustments to improve system performance. For example, if you run this command on the OSTs: cat /proc/fs/lustre/lquota/lustre-OST0000/stats You will get a result similar to this: snapshot_time 1219908615.506895 secs.usecs async_acq_req 1 samples [us] 32 32 32 async_rel_req 1 samples [us] 5 5 5 nowait_for_pending_blk_quota_req(qctxt_wait_pending_dqacq) 1 samples
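To make the sample output above easier to interpret, the following Python sketch parses stats lines of the form `name count samples [unit] a b c`. It is a hedged illustration only: the reading of the three trailing numbers as minimum, maximum and sum is an assumption that the excerpt itself does not state, and `parse_quota_stats` is a hypothetical helper, not a Lustre tool.

```python
import re

# One stats record: <name> <count> samples [<unit>] <min> <max> <sum>
# (the meaning of the last three fields is assumed, see the lead-in).
STAT_LINE = re.compile(
    r"^(\S+)\s+(\d+)\s+samples\s+\[(\w+)\]\s+(\d+)\s+(\d+)\s+(\d+)$"
)


def parse_quota_stats(text):
    """Return {event_name: {count, unit, min, max, sum}} for matching lines."""
    stats = {}
    for raw in text.splitlines():
        match = STAT_LINE.match(raw.strip())
        if match is None:
            continue  # skips snapshot_time and any truncated line
        name, count, unit, low, high, total = match.groups()
        stats[name] = {
            "count": int(count),
            "unit": unit,
            "min": int(low),
            "max": int(high),
            "sum": int(total),
        }
    return stats


SAMPLE = """snapshot_time 1219908615.506895 secs.usecs
async_acq_req 1 samples [us] 32 32 32
async_rel_req 1 samples [us] 5 5 5
"""

for event, values in parse_quota_stats(SAMPLE).items():
    mean = values["sum"] / values["count"]
    print(event, values["count"], mean, values["unit"])
```

With a single sample per event, the computed mean simply equals the recorded value (32 and 5 microseconds here).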
188. _status c lprocfs_add_vars 80 The tool displays the following output to show the leaks found: Leak: 32 bytes allocated at a23a8fc (service.c:ptlrpc_init_svc:144, debug file line 241) Printing to /var/log/messages: To dump debug messages to the console, set the corresponding debug mask in the printk flag: sysctl -w lnet.printk=-1 This slows down the system dramatically. It is also possible to selectively enable or disable this for particular flags using: sysctl -w lnet.printk=+vfstrace sysctl -w lnet.printk=-vfstrace Tracing Lock Traffic: Lustre has a specific debug type category for tracing lock traffic. Use: lctl > filter all_types lctl > show dlmtrace lctl > debug_kernel filename Sample lctl Run: bash-2.04# lctl lctl > debug_kernel /tmp/lustre_logs/log_all Debug log: 324 lines, 324 kept, 0 dropped. lctl > filter trace Disabling output of type "trace" lctl > debug_kernel /tmp/lustre_logs/log_notrace Debug log: 324 lines, 282 kept, 42 dropped. lctl > show trace Enabling output of type "trace" lctl > filter portals Disabling output from subsystem "portals" lctl > debug_kernel /tmp/lustre_logs/log_noportals Debug log: 324 lines, 258 kept, 66 dropped. Adding Debugging to the Lustre Source Code: In the Lustre source code, the debug infrastructure provides a number of macros which aid in debugging or reporting serio
189. a route is disabled and possibly new, the sent packets counter is set to 0. When the route is first re-used (that is, an elapsed disable time is found), the sent packets counter is incremented to 1, and incremented for all further uses of the route. If the route has been used for 100 packets successfully, then the sent packets counter should have a value of 100. Set the timeout to 0 (zero), so future errors no longer double the timeout. Note: The router_ping_timeout is consistent with the default LND timeouts. You may have to increase it on very large clusters if the LND timeout is also increased. For larger clusters, we suggest increasing the check interval. 2.4.2.1 LNET Routers: All LNET routers that bridge two networks are equivalent. They are not configured as primary or secondary, and load is balanced across all available routers. With the router checker configured, Lustre nodes can detect router health status, avoid those that appear dead, and reuse the ones that restore service after failures. There are no hard requirements regarding the number of LNET routers, although there should be enough to handle the required file serving bandwidth (and a 25% margin for headroom). Comparing 32-bit and 64-bit LNET Routers: By default, at startup, LNET routers allocate 544M (i.e. 139264 4K pages) of memory as router buffers. The buffers can only come from low system memory (i.e. ZONE_DMA
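The two figures in the entry above (the 544M router buffer allocation and the 25% headroom margin) can be checked with a few lines of arithmetic. The sketch below is illustrative only: `routers_needed` and the per-router throughput passed to it are hypothetical, since the text gives no per-router bandwidth figure.

```python
import math

PAGE_SIZE_BYTES = 4096          # "4K pages" from the text
ROUTER_BUFFER_PAGES = 139264    # default page count from the text


def router_buffer_mib(pages=ROUTER_BUFFER_PAGES, page_size=PAGE_SIZE_BYTES):
    """Convert the router buffer page count into MiB."""
    return pages * page_size / (1024 * 1024)


def routers_needed(required_mb_s, per_router_mb_s, headroom=0.25):
    """Routers required for a target bandwidth plus a headroom margin.

    per_router_mb_s is an assumed, site-specific measurement.
    """
    return math.ceil(required_mb_s * (1 + headroom) / per_router_mb_s)


print(router_buffer_mib())          # 139264 * 4 KiB expressed in MiB
print(routers_needed(4000, 900))    # hypothetical 4000 MB/s target, 900 MB/s per router
```

The first call confirms that 139264 pages of 4K each come to 544 MiB; the second shows how the 25% margin raises the router count for a made-up workload.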
190. ad by decreasing the number of OST threads (ost_num_threads module parameter to ost.ko module). The optimum number of OST threads varies for each particular configuration. Variables include the number of OSTs on each OSS, number and speed of disks, RAID configuration, and available RAM. You may want to start with a number of OST threads equal to twice the number of actual disks on the node. If you use RAID 5 or RAID 6, subtract any disks used for parity (1 for RAID 5, 2 for RAID 6). While monitoring the aggregate OSS I/O performance, increase the number of OST service threads using lctl set_param ost.OSS.ost_io.threads_max=N until the throughput is maximized. Note that if there are too many threads, the latency for individual I/O requests can become very high and should be avoided. Set the desired maximum thread count permanently using the method described above. MDS Service Thread Count: The mds_num_threads parameter enables the number of MDS service threads to be specified at module load time on the MDS node: options mds mds_num_threads=N Tip: After startup, the minimum and maximum number of OSS thread counts can be set via the {min,max}_thread_count tunable. To change the tunable at runtime, use lctl {get,set,conf}_param {service}.{min,max}_thread_count. For details, see Changing MDS and OSS Thread Counts. At this time, we have not tested to determine the optimal number of MDS threads. The default value varies based on server
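The starting-point rule for OST threads described above (twice the number of actual disks, minus parity disks for RAID 5 or RAID 6) can be sketched as follows. The function name and the `raid_groups` parameter are assumptions added for illustration; the text only gives the per-array parity counts, so applying them once per RAID group is an interpretation.

```python
# Parity disks per array, as stated in the text.
PARITY_DISKS = {"raid5": 1, "raid6": 2}


def initial_ost_threads(total_disks, raid_level=None, raid_groups=1):
    """Suggested starting OST thread count: 2 x data-bearing disks.

    raid_groups (assumed) is the number of separate RAID arrays on the node;
    parity disks are subtracted once per array.
    """
    parity = PARITY_DISKS.get(raid_level, 0) * raid_groups
    data_disks = total_disks - parity
    if data_disks <= 0:
        raise ValueError("no data-bearing disks left after subtracting parity")
    return 2 * data_disks


print(initial_ost_threads(10, "raid6"))                 # one 10-disk RAID 6 array
print(initial_ost_threads(12, "raid5", raid_groups=2))  # two 6-disk RAID 5 arrays
print(initial_ost_threads(8))                           # 8 disks, no parity
```

The result is only a first value to try; the text goes on to tune it upward while watching aggregate OSS throughput.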
191. address> is the network address and <network> is an identifier for the network (network type + instance). Examples: First TCP network: 192.73.220.107@tcp0. Second TCP network: 10.10.1.50@tcp1. Elan: 2@elan. The nid syntax for the generic client is still valid. Modules/modprobe.conf: Network hardware and routing are now configured via module parameters, specified in the usual locations. Depending on your kernel version and Linux distribution, this may be /etc/modules.conf, /etc/modprobe.conf, or /etc/modprobe.conf.local. All old Lustre configuration lines should be removed from the module configuration file. The RPM install should do this, but check to be certain. The base module configuration requires two lines: alias lustre llite options lnet networks=tcp0 A full list of options can be found at Module Parameters on page 37. Detailed examples can be found in the section Configuring the Lustre Network. Some brief examples: Example 1: Use eth1 instead of eth0: options lnet networks=tcp0(eth1) Example 2: Servers have two tcp networks and one Elan network. Clients are either TCP or Elan. Servers: options lnet networks=tcp0(eth0,eth1),elan0 Elan clients: options lnet networks=elan0 TCP clients: options lnet networks=tcp0 Portals Compatibility: If you are upgrading Lustre on all clients and servers at the same time, then you may skip this section
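The NID layout described above, an address followed by `@` and a network identifier made of a network type plus an instance number, can be illustrated with a small parser. This is a sketch under stated assumptions: `parse_nid` and the `Nid` tuple are hypothetical names, and a missing instance number is treated as instance 0.

```python
from collections import namedtuple

Nid = namedtuple("Nid", ["address", "net_type", "instance"])


def parse_nid(nid):
    """Split '<address>@<network>' into address, network type and instance."""
    address, separator, network = nid.partition("@")
    if not separator or not address or not network:
        raise ValueError("expected <address>@<network>, got %r" % nid)
    # The network type is everything before the trailing digits (tcp, elan, o2ib).
    net_type = network.rstrip("0123456789")
    if not net_type:
        raise ValueError("missing network type in %r" % nid)
    instance_text = network[len(net_type):]
    instance = int(instance_text) if instance_text else 0  # assumed default
    return Nid(address, net_type, instance)


for example in ("192.73.220.107@tcp0", "10.10.1.50@tcp1", "2@elan"):
    print(parse_nid(example))
```

Stripping only trailing digits keeps network types that contain digits, such as o2ib, intact.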
192. ailed explanation of ACLs on Linux, refer to the SuSE Labs article, Posix Access Control Lists on Linux: http://www.suse.de/~agruen/acl/linux-acls/online/. We have implemented ACLs according to this model. Lustre supports the standard Linux ACL tools, setfacl, getfacl, and the historical chacl, normally installed with the ACL package. Note: ACL support is a system-range feature, meaning that all clients have ACL enabled or not. You cannot specify which clients should enable ACL. Using ACLs with Lustre: Lustre supports POSIX Access Control Lists (ACLs). An ACL consists of file entries representing permissions based on standard POSIX file system object permissions that define three classes of user (owner, group, and other). Each class is associated with a set of permissions: read (r), write (w), and execute (x). Owner class permissions define access privileges of the file owner. Group class permissions define access privileges of the owning group. Other class permissions define access privileges of all users not in the owner or group class. The ls -l command displays the owner, group, and other class permissions in the first column of its output (for example, -rw-r----- for a regular file with read and write access for the owner class, read access for the group class, and no access for others). Minimal ACLs have three entries. Extended ACLs have more than the three entries. Extended ACLs also contain
193. ailed if any of them do. Client Failure: Lustre's support for recovery from client failure is based on the revocation of locks and other resources, so surviving clients can continue their work uninterrupted. If a client fails to respond in a timely manner to a blocking AST from the Distributed Lock Manager, or a bulk data operation times out, the system removes the client from the cluster. This action allows other clients to acquire locks blocked by the dead client, and it also frees resources (such as file handles and export data) associated with the client. This scenario can be caused by a client node system failure or a network partition. 19.2.2 MDS Failure (and Failover): Reliable Lustre operation requires that the MDS have a peer configured for failover, including the use of a shared storage device for the MDS backing file system. When a client detects an MDS failure, it connects to the new MDS and launches the MetadataReplay function. MetadataReplay ensures that the replacement MDS re-accumulates the state resulting from transactions whose effects were visible to clients, but which were not committed to disk. Transaction numbers ensure that the operations replay occurs in the same order as the original integration. Additionally, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery
194. ailover. For OST failover, multiple OSS nodes are configured to be able to serve the same OST. Only one OSS node can serve the OST at a time. An OST can be moved back and forth between OSS nodes using the umount/mount commands, as long as the OSSs can access the same disk. Note: Defining an OST for failover does not require that more than one OSS be defined for it. You can provide failover service (i.e., no I/O errors to clients) using a single OSS. In this configuration, if the OST fails, clients are blocked until the OST becomes active again. For MDT failover, two MDSs are configured to serve the same MDT. Only one MDS node can serve the MDT at a time. To add a failover partner to a Lustre configuration, use the --failnode option. This may be done at creation time (with mkfs.lustre) or at a later time (with tunefs.lustre). For a failover example, see More Complicated Configurations. For an explanation of the mkfs.lustre and tunefs.lustre utilities, see mkfs.lustre and tunefs.lustre. Caution: Lustre's OST failover functionality does not protect against corruption caused by a disk failure. If the storage media (i.e., physical disk) used for an OST fails, Lustre cannot recover it. This is why we strongly recommend that some form of RAID be used for OSTs. Lustre assumes that the storage is reliable, and it adds no redundancy for OSTs or the MDT. Starting Stoppi
195. ainst using them for MDTs to reduce complexity. No more than 102,400 file system blocks will ever be used for a journal. For Lustre's standard 4 KB block size, this corresponds to a 400 MB journal. A larger partition can be created, but only the first 400 MB will be used. Additionally, a copy of the journal is kept in RAM on the OSS. Therefore, make sure you have enough memory available to hold copies of all the journals. To create an external journal, perform these steps for each OST on the OSS: 1. Create a 400 MB (or larger) journal partition (RAID 1 is recommended). In this example, /dev/sdb is a RAID 1 device; run: sfdisk -uC /dev/sdb << EOF > ,50,L > EOF 2. Create a journal device on the partition. Run: mke2fs -b 4096 -O journal_dev /dev/sdb1 3. Create the OST. In this example, /dev/sdc is the RAID 6 device to be used as the OST; run: mkfs.lustre --ost --mgsnode=mds@osib --mkfsoptions="-J device=/dev/sdb1" /dev/sdc 4. Mount the OST as usual. (Footnote 3: Performance is affected because, while writing large sequential data, small I/O writes are done to update metadata. This small-sized I/O can affect performance of large sequential I/O with disk seeks.) 10.1.6 Handling Degraded RAID Arrays: Lustre 1.8.2 and later versions include functionality that notifies Lustre if an external RAID array is degraded, resulting in a degraded OST, and prevents new files from
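The journal arithmetic in snippet 195 can be checked with a few lines of Python. The helper names are hypothetical, and the RAM estimate simply assumes one full-size journal copy per OST on the OSS, as the snippet suggests.

```python
# Upper bound on journal size stated in the snippet.
MAX_JOURNAL_BLOCKS = 102_400


def max_journal_bytes(block_size=4096):
    """Largest journal that will ever be used, in bytes."""
    return MAX_JOURNAL_BLOCKS * block_size


def journal_ram_mib(num_osts, block_size=4096):
    """Rough RAM (MiB) needed to hold one full journal copy per OST."""
    return num_osts * max_journal_bytes(block_size) // (1024 * 1024)


print(max_journal_bytes() // (1024 * 1024))  # 400 (MB, with 4 KB blocks)
print(journal_ram_mib(6))                    # 2400 (MB for six OSTs)
```

With 4 KB blocks, 102,400 blocks is exactly 400 MB, which is why a larger journal partition gains nothing.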
196. Device Operations. Option: lctl get_param [-n] <path_name>. Description: Gets the Lustre or LNET parameters from the specified <path_name>. Use the -n option to get only the parameter value and skip the pathname in the output. NOTE: Lustre tunables are not always accessible using the procfs interface, as it is platform-specific. As a solution, lctl {get,set}_param has been introduced as a platform-independent interface to the Lustre tunables. Avoid direct references to /proc/{fs,sys}/{lustre,lnet}. For future portability, use lctl {get,set}_param instead. Option: lctl set_param [-n] <path_name>. Description: Sets the specified value to the Lustre or LNET parameter indicated by the pathname. Use the -n option to skip the pathname in the output. The same NOTE about preferring lctl {get,set}_param over direct /proc references applies. Option: conf_param <device> <parameter>. Description: Sets a permanent configuration parameter for any device via the MGS. This command must be run on the MGS node. Option: activate. Description: Re-activates an import after the deactivate operation. Option: deactivate. Description: Running lctl deactivate on the MDS stops new objects from being allocated on the OST. Running lctl deactivate on Lustre cl
197. Linux LVM LV (Logical Volume). The CSV line format is: hostname,LV,lv name,operation mode,options,lv size,vg name. Where (Variable: Supported Type): hostname: Hostname of the node in the cluster. LV: Marker of the LV line. lv name: Name of the logical volume to be created (optional) or path of the logical volume to be removed (required by the remove mode). operation mode: Operations mode, either create or remove. Default is create. options: A catchall for other lvcreate/lvremove options; for example, -i 2 -I 128. lv size: Size [kKmMgGtT] to be allocated for the new LV. Default is megabytes (MB). vg name: Name of the VG in which the new LV is created. Lustre target. The CSV line format is: hostname,module_opts,device name,mount point,device type,fsname,mgs nids,index,format options,mkfs options,mount options,failover nids. Where (Variable: Supported Type): hostname: Hostname of the node in the cluster. It must match uname -n. module_opts: Lustre networking module options. Use the newline character (\n) to delimit multiple options. device name: Lustre target (block device or loopback file). mount point: Lustre target mount point. device type: Lustre target type (mgs, mdt, ost, mgs|mdt, mdt|mgs). fsname: Lustre file system name (lim
198. al ways. The lctl conf_param value overwrites the parameter's previous value. If the new value uses an incorrect syntax, then the system continues with the old parameters and the previously correct value is lost on remount. That is, be careful doing root squash tuning. mkfs.lustre and tunefs.lustre do not perform syntax checking. If the root squash parameters are incorrect, they are ignored on mount and the default values are used instead. Root squash parameters are parsed with rigorous syntax checking. The root_squash parameter should be specified as <decnum>':'<decnum>. The nosquash_nids parameter should follow LNET NID range list syntax. LNET NID range syntax:
  <nidlist> :== <nidrange> [ ' ' <nidrange> ]
  <nidrange> :== <addrrange> '@' <net>
  <addrrange> :== '*' | <ipaddr_range> | <numaddr_range>
  <ipaddr_range> :== <numaddr_range>.<numaddr_range>.<numaddr_range>.<numaddr_range>
  <numaddr_range> :== <number> | <expr_list>
  <expr_list> :== '[' <range_expr> [ ',' <range_expr> ] ']'
  <range_expr> :== <number> | <number> '-' <number> | <number> '-' <number> '/' <number>
  <net> :== <netname> | <netname><number>
  <netname> :== "lo" | "tcp" | "o2ib" | "cib" | "openib" | "iib" | "vib" | "ra" | "elan" | "gm" | "mx" | "ptl"
  <
199. ameters can also be changed with the lctl conf_param command. For example: lctl conf_param Lustre.mdt.root_squash="1000:100" and lctl conf_param Lustre.mdt.nosquash_nids="*@tcp". Note: When using the lctl conf_param command, keep in mind: lctl conf_param must be run on a live MGS; lctl conf_param causes the parameter to change on all MDSs; lctl conf_param is to be used once per parameter. The nosquash_nids list can be cleared with: lctl conf_param Lustre.mdt.nosquash_nids="NONE" OR lctl conf_param Lustre.mdt.nosquash_nids="clear". If the nosquash_nids value consists of several NID ranges (e.g., 0@elan, 1@elan1), the list of NID ranges must be quoted with single (') or double (") quotation marks. List elements must be separated with a space. For example: mkfs.lustre ... --param "mdt.nosquash_nids='0@elan1 1@elan2'" /dev/sda1 and lctl conf_param Lustre.mdt.nosquash_nids="24@elan 15@elan1". These are examples of incorrect syntax: mkfs.lustre ... --param "mdt.nosquash_nids=0@elan1 1@elan2" /dev/sda1 and lctl conf_param Lustre.mdt.nosquash_nids=24@elan 15@elan1. To check root squash parameters, use the lctl get_param command: lctl get_param mdt.Lustre-MDT0000.root_squash and lctl get_param mdt.Lustre-MDT000*.nosquash_nids. Note: An empty nosquash_nids list is reported as NONE. 25.2.5 Tips on Using Root Squash: Lustre configuration management limits root squash in sever
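Because an incorrectly formatted root squash value is silently replaced by defaults on mount, it can help to check the value before passing it to lctl conf_param or mkfs.lustre. The Python sketch below is a hypothetical pre-check, not a Lustre tool; it assumes the colon-separated decimal UID:GID form that the root_squash 1000 100 example implies (the separator was lost in this extract).

```python
import re

# Assumed form: <decnum>:<decnum>, e.g. "1000:100" (UID:GID).
_ROOT_SQUASH = re.compile(r"([0-9]+):([0-9]+)")


def parse_root_squash(value):
    """Return (uid, gid) for a well-formed root_squash value, else raise."""
    match = _ROOT_SQUASH.fullmatch(value)
    if match is None:
        raise ValueError("root_squash must look like <decnum>:<decnum>: %r" % value)
    return int(match.group(1)), int(match.group(2))


print(parse_root_squash("1000:100"))
```

A value such as "1000 100" or "1000:" is rejected here, which mirrors the rigorous syntax checking the manual describes for these parameters.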
200. amount of data into each I/O RPC and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre exposes several tuning variables to adjust behavior according to network conditions and cluster size. Each OSC has its own tree of these tunables. For example: $ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost /proc/fs/lustre/osc/OSC_um10_ost1_MNT_localhost /proc/fs/lustre/osc/OSC_um10_ost2_MNT_localhost /proc/fs/lustre/osc/OSC_um10_ost3_MNT_localhost $ ls /proc/fs/lustre/osc/OSC_um10_ost1_MNT_localhost blocksize filesfree max_dirty_mb ost_server_uuid stats ... and so on. RPC stream tunables are described below. /proc/fs/lustre/osc/<object name>/max_dirty_mb: This tunable controls how many MBs of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes stall until previously-cached writes are written to the server. This may be changed by writing a single ASCII integer to the file. Only values between 0 and 512 are allowable. If 0 is given, no writes are cached. Performance suffers noticeably unless you use large writes (1 MB or more). /proc/fs/lustre/osc/<object name>/cur_dirty_bytes: This tunable is a read-only value that returns the current amount of bytes written and cached on this OSC. /proc/fs/lustre/osc/<object name>/max_
201. and can often recover a large amount of data even when a significant portion of the device is bad. Keep a log of all actions and output in a safe place. If you perform multiple file system checks and/or actions to repair the file system, save all logs. They may provide valuable insight into problems encountered. Normally, the first thing to do is a read-only file system check, after the Lustre service (MDS or OST) has been stopped. If it is not possible to stop the service, you can run a read-only file system check when the device is in use. If running a file system check while the device is in use, e2fsck cannot always coordinate data gathered at the start of the run with data gathered later in the run, and will report incorrect file system errors. The number of errors is dependent upon the length of the check (approximately equal to the device size) and the load on the file system. In this situation, you should run e2fsck multiple times on the device, look for errors that are persistent across runs, and ignore transient errors. To run a read-only file system check, we recommend that you use the latest e2fsck, available at http://www.sun.com/software/products/lustre/get.jsp. On the system with the suspected bad device (in the example below, /dev/sda is used), run: [root@mds]# script /root/e2fsck-1.sda Script started, file is /root/e2fsck-1.sda [root@mds]# e2fsck -fn /dev/sda e2fsck 1.35-lfck8 (05-Feb-2005) Warning: skipping journal reco
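The advice in snippet 201 to run e2fsck several times on an in-use device and keep only the errors that persist across runs amounts to a set intersection. A minimal Python sketch follows; the function name is hypothetical and the sample messages are placeholders, not real e2fsck output.

```python
def persistent_errors(runs):
    """Return the messages that appear in every run.

    `runs` is a list of iterables, one per e2fsck run, each holding the
    error lines saved from that run's log.
    """
    if not runs:
        return set()
    common = set(runs[0])
    for run in runs[1:]:
        common &= set(run)
    return common


# Placeholder messages standing in for lines taken from saved logs.
run1 = ["error A", "error B", "error C"]
run2 = ["error B", "error C", "error D"]
run3 = ["error C", "error B"]
print(sorted(persistent_errors([run1, run2, run3])))
```

Messages that show up in only some runs (error A and error D here) are the transient ones the manual says to ignore.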
202. andom Create Create Read Delete Create Read Delete files sec CP sec CP sec CP sec CP sec CP sec CP 16 510 O 283 1 465 O 291 1 mds 2G 38118 22 21245 10 51967 10 90 0 0 16 510 0 2 83 1 465 0 291 1 Lustre 1 8 Operations Manual October 2009 Version 1 03 Sequential Output Sequential Input Random Per Chr Block Rewrite Per Chr Block Seeks MachineSize K sec CP K sec CP K sec CP K sec CP K sec SCP sec SCP mds 2G 27460 92 41450 25 21474 10 19673 60 52871 10 88 0 0 Create Read Delete Create Read Delete files sec CP sec CP sec CP_ sec CP sec SCP sec CP 16 29681 99 30412 90 29568 99 28077 82 mds 2G 27460 92 41450 25 21474 10 19673 60 52871 10 88 0 0 16 296 81 99 30412 90 29568 99 28077 82 12 IOR Benchmark Use the IOR_Survey script to test the performance of the lustre file systems It uses IOR Interleaved or Random a script used for testing performance of parallel file systems using various interfaces and access patterns IOR uses MPI for process synchronization Under the control of compile time defined constants and to a lesser extent environment variables I O is done via MPI IO The data are written and read using independent parallel transfers of equal sized blocks of contiguous bytes that cover the file with no gaps and that do not over
203. ands. The lst utility takes a number of command line arguments. The first argument is the command name and subsequent arguments are command-specific. Session: This section lists lst session commands. Process Environment (LST_SESSION): The lst utility uses the LST_SESSION environmental variable to identify the session locally on the self-test console node. This should be a numeric value that uniquely identifies all session processes on the node. It is convenient to set this to the process ID of the shell, both for interactive use and in shell scripts. Almost all lst commands require LST_SESSION to be set. new_session [--timeout SECONDS] [--force] NAME: Creates a new session. --timeout SECONDS: Console timeout value of the session. The session ends automatically if it remains idle (i.e., no commands are issued) for this period. --force: Ends conflicting sessions. This determines who wins when one session conflicts with another. For example, if there is already an active session on this node, then this attempt to create a new session fails unless the --force flag is specified. However, if the --force flag is specified, then the other session is ended. Similarly, if this session attempts to add a node that is already owned by another session, the --force flag allows this session to steal the node. name: A human-readable string to print when listing sessions or reporting sessi
204. apter 5 Service Tags. Information Registered with Sun: The service tag registration process collects the following product, registration agentry, and system information (Data Name: Description). Product Information: Lustre-specific information: Node type (client, MDS, OSS or MGS). Instance identifier: Unique identifier for that instance of the gear. Product name: Name of the gear. Product identifier: Unique identifier for the gear being registered. Product vendor: Vendor of the gear. Product version: Version of the gear. Parent name: Parent gear of the registered gear. Parent identifier: Unique identifier for the parent of the gear. Customer tag: Optional, customer-defined value. Time stamp: Day and time that the gear is registered. Source: Where the gear identifiers came from. Container: Name of the gear's container. Registration Agentry Information: Agentry Identifier: Unique value for that instance of the agentry. Agentry Version: Value of the agentry. Registry Identifier: File version containing product registration information. System Information: Host: System hostname. System: Operating System. Release: Operating system version. Architecture: Physical hardware architecture. Platform: Hardware platform. Manufacturer: Hardware manufacturer. CPU manufacturer: CPU manufacturer. HostID: System host ID. Serial number: System chassis serial number. CHAPTER 6 Co
205. as it is difficult to know which OST holds the end of the file until you check all the OSTs. As all the clients are using the same O_APPEND, there is significant locking overhead. The second client cannot get all locks until the first client has finished writing, as taking the locks serializes all writes from the clients. To avoid deadlocks, these locks are taken in a known, consistent order. As a client cannot know which OST holds the next piece of the file until the client has locks on all OSTs, these locks are needed in the case of a striped file. Slowdown Occurs During Lustre Startup: When Lustre starts, the Lustre file system needs to read in data from the disk. For the very first mdsrate run after the reboot, the MDS needs to wait on all the OSTs for object pre-creation. This causes a slowdown to occur when Lustre starts up. After the file system has been running for some time, it contains more data in cache and hence, the variability caused by reading critical metadata from disk is mostly eliminated. The file system now reads data from the cache. Log Message 'Out of Memory' on OST: When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system. If insufficient memory is available, an 'out of memory' message can be logged. During normal operation, several conditions indicate insufficient RAM on a server node: kernel "Out of memory" and/or "oom-killer" messag
206. ata read from disk during a read request is kept in memory and available for later read requests for the same data, without having to re-read it from disk. By default, read cache is enabled (read_cache_enable = 1). When the OSS receives a read request from a client, it reads data from disk into its memory and sends the data as a reply to the request. If read cache is enabled, this data stays in memory after the client's request is finished, and the OSS skips reading data from disk when subsequent read requests for the same data are received. The read cache is managed by the Linux kernel globally across all OSTs on that OSS, and the least recently used cache pages will be dropped from memory when the amount of free memory is running low. If read cache is disabled (read_cache_enable = 0), then the OSS will discard the data after the client's read requests are serviced and, for subsequent read requests, the OSS must read the data from disk. To disable read cache on all OSTs of an OSS, run: root@oss1# lctl set_param obdfilter.*.read_cache_enable=0 To re-enable read cache on one OST, run: root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1 To check if read cache is enabled on all OSTs on an OSS, run: root@oss1# lctl get_param obdfilter.*.read_cache_enable writethrough_cache_enable controls whether data sent to the OSS as a write request is kept in the read cache and availa
207. atch the MDS lov_objid value, then you have decided on a proper value for LAST_ID. Once you have decided on a proper value for LAST_ID, use this repair procedure. 1. Access: mount -t ldiskfs /dev/{ostdev} /mnt/ost 2. Check the current: od -Ax -td8 /mnt/ost/O/0/LAST_ID 3. Be very safe, only work on backups: cp /mnt/ost/O/0/LAST_ID /tmp/LAST_ID 4. Convert binary to text: xxd /tmp/LAST_ID /tmp/LAST_ID.asc 5. Fix: vi /tmp/LAST_ID.asc 6. Convert to binary: xxd -r /tmp/LAST_ID.asc /tmp/LAST_ID.new 7. Verify: od -Ax -td8 /tmp/LAST_ID.new 8. Replace: cp /tmp/LAST_ID.new /mnt/ost/O/0/LAST_ID 9. Clean up: umount /mnt/ost. Why can't I run an OST and a client on the same machine? Consider the case of a client with dirty file system pages in memory and memory pressure. A kernel thread is woken to flush dirty pages to the file system, and it writes to the local OST. The OST needs to do an allocation in order to complete the write. The allocation is blocked, waiting for the above kernel thread to complete the write and free up some memory. This is a deadlock. Also, if the node with both a client and OST crashes, then the OST waits during recovery for the client that was mounted on that node to recover. However, since the client crashed, it is considered a new client to the OST and is blocked from mounting until recovery completes. As a result, this is currently considered a double failure and r
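Steps 2 and 4 through 7 of the repair procedure in snippet 207 inspect and rewrite an 8-byte value (od -td8 prints 8-byte decimal integers). The Python sketch below does the same decode and encode on a copy of the file. It is illustrative only: the helper names are hypothetical, and it assumes the value is a single little-endian unsigned 64-bit integer, which matches od -td8 output on a little-endian host.

```python
import struct

# Assumption: LAST_ID holds one little-endian unsigned 64-bit integer.
_LAST_ID_FORMAT = "<Q"


def read_last_id(path):
    """Decode the 8-byte value, like 'od -Ax -td8' on a little-endian host."""
    with open(path, "rb") as handle:
        raw = handle.read(8)
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    return struct.unpack(_LAST_ID_FORMAT, raw)[0]


def write_last_id(path, value):
    """Write a corrected value to a NEW file (never edit the original in place)."""
    with open(path, "wb") as handle:
        handle.write(struct.pack(_LAST_ID_FORMAT, value))
```

As in step 3, this would be run against /tmp/LAST_ID (the backup copy), writing the result to /tmp/LAST_ID.new and verifying it before step 8 copies it back.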
208. ats or OST brw_stats mentioned above. Number of groups scanned (grps column) should be small. If it often reaches a few dozen, then either your disk file system is pretty fragmented or mballoc is doing something wrong in the group selection part. mballoc3 Tunables: Lustre version 1.6.1 and later includes mballoc3, which was built on top of mballoc2. By default, mballoc3 is enabled, and adds these features: Pre-allocation for single files (helps to resist fragmentation). Pre-allocation for a group of files (helps to pack small files into large, contiguous chunks). Stream allocation (helps to decrease the seek rate). The following mballoc3 tunables are available (Field: Description). stats: Enables/disables the collection of statistics. Collected statistics can be found in /proc/fs/ldiskfs2/<dev>/mb_history. max_to_scan: Maximum number of free chunks that mballoc finds before a final decision, to avoid livelock. min_to_scan: Minimum number of free chunks that mballoc finds before a final decision. This is useful for a very small request, to resist fragmentation of big free chunks. order2_req: For requests equal to 2^N (where N >= order2_req), a very fast search via buddy structures is used. stream_req: Requests smaller or equal to this value are packed together to form large write I/Os. The following tunables providing m
209. b1 checking for existing Lustre data: found Lustre data. Reading CONFIGS/mountdata. Read previous values: Target: main-MDT0000 Index: 0 Lustre FS: main Mount type: ldiskfs Flags: 0x5 (MDT MGS) Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr Parameters: Permanent disk data: Target: back-MDT0000 Index: 0 Lustre FS: back Mount type: ldiskfs Flags: 0x105 (MDT MGS writeconf) Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr Parameters: Writing CONFIGS/mountdata cfs21:~# tunefs.lustre --reformat --fsname back --writeconf /dev/volgroup/OSTb1 checking for existing Lustre data: found Lustre data. Reading CONFIGS/mountdata. Read previous values: Target: main-OST0000 Index: 0 Lustre FS: main Mount type: ldiskfs Flags: 0x2 (OST) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.0.21@tcp Permanent disk data: Target: back-OST0000 Index: 0 Lustre FS: back Mount type: ldiskfs Flags: 0x102 (OST writeconf) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.0.21@tcp Writing CONFIGS/mountdata. When renaming an FS, we must also erase the last_rcvd file from the snapshots: cfs21:~# mount -t ldiskfs /dev/volgroup/MDTb1 /mnt/mdtback cfs21:~# rm /mnt/mdtback/last_rcvd cfs21:~# umount /mnt/mdtback cfs21:~# mount -t ldiskfs /dev/volgroup/OSTb1 /mnt/ostbac
210. backup. Note: You cannot write to the holes of such files without having lfsck recreate the objects. Generally, it is easier to delete these files and restore them from backup. To fix inodes with duplicate objects, lfsck copies the duplicate object to a new object and assigns that to one of the files if the -c option is given. One of the files will be okay, and one will likely contain garbage (lfsck cannot, by itself, tell which one is correct). Filefrag: The e2fsprogs package contains the filefrag tool, which reports the extent of file fragmentation. Synopsis: filefrag [ -belsv ] [ files... ] Description: The filefrag utility reports the extent of fragmentation in a given file. Initially, filefrag attempts to obtain extent information using the FIEMAP ioctl, which is efficient and fast. If FIEMAP is not supported, then filefrag uses FIBMAP. Note: Lustre only supports the FIEMAP ioctl. The FIBMAP ioctl is not supported. In default mode (footnote 1), filefrag returns the number of physically discontiguous extents in the file. In extent or verbose mode, each extent is printed with details. For Lustre, the extents are printed in device offset order, not logical offset order. (Footnote 1: The default mode is faster than the verbose/extent mode.) Options: The options and descriptions for the filefrag utility are listed below. Option Descrip
211. be applied from the stack (quilt push) or removed from the stack (quilt pop). You can query the contents of the series file (quilt series), the contents of the stack (quilt applied, quilt previous, quilt top), and the patches that are not applied at a particular moment (quilt next, quilt unapplied). You can edit and refresh (update) patches with Quilt, as well as revert inadvertent changes, fork or clone patches, and show the diffs before and after work. A variety of Quilt packages (RPMs, SRPMs and tarballs) are available from various sources. Use the most recent version you can find. Quilt depends on several other utilities, e.g., the coreutils RPM that is only available in RedHat 9. For other RedHat kernels, you have to get the required packages to successfully install Quilt. If you cannot locate a Quilt package or fulfill its dependencies, you can build Quilt from a tarball, available here: http://savannah.nongnu.org/projects/quilt. For additional information on using Quilt, including its commands, see the introduction to Quilt and the quilt(1) man page. Get the Lustre Source and Unpatched Kernel: The Lustre Group supports several Linux unpatched kernels for use with Lustre and provides a series of patches for each one. The Lustre patches are maintained in the kernel_patch directory bundled with the Lustre source code. The unpatched kernels are also available for download. 1. Verify that all of the Lustre installation requirements have be
212. be used prior to upgrading, to have a downgrade path available, and/or that Lustre 1.8 be used without using OST pools until there is no concern about downgrading to 1.6.5 or earlier. Working with OST Pools: OST pools are defined in the configuration log on the MGS. Use the lctl command to: create/destroy a pool; add/remove OSTs in a pool; list pools and OSTs in a specific pool. The lctl command MUST be run on the MGS. Another requirement for managing OST pools is either to have the MDT and MGS on the same node, or to have a Lustre client mounted on the MGS node if it is separate from the MDS. This is needed to validate that the pool commands being run are correct. Caution: Running the writeconf command on the MDS will erase all pools information (as well as any other parameters set via lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed via a script, so they can be reproduced easily after a writeconf is performed. To create a new pool: lctl pool_new <fsname>.<poolname> Note: The pool name is an ASCII string up to 16 characters. To add the named OST to a pool: lctl pool_add <fsname>.<poolname> <ost_list> Where: <ost_list> is <fsname->OST<index_range>[_UUID] and <index_range> is <ost_index_start>-<ost_index_end>[,<index_range>] or <ost_index_start>-<ost_index_end>/<step>. I
213. bject storage. QOS (Quality of Service): QOS considers an OST's available blocks, speed, the number of existing objects, and so on. Using these criteria, the MDS selects OSTs with more free space more often than OSTs with less free space. RR (Round-Robin): RR allocates objects evenly across all OSTs. The RR stripe allocator is faster than QOS, and is used often because it distributes space usage/load best in most situations, maximizing network balancing and improving performance. Whether QOS or RR is used depends on the setting of the qos_threshold_rr proc tunable. The qos_threshold_rr variable specifies a percentage threshold where the use of QOS or RR becomes more/less likely. The qos_threshold_rr tunable can be set as an integer, from 0 to 100, and results in this stripe allocation behavior: If qos_threshold_rr is set to 0, then QOS is always used. If qos_threshold_rr is set to 100, then RR is always used. The larger the qos_threshold_rr setting, the greater the possibility that RR is used instead of QOS. 22.2 Lustre I/O Tunables: This section describes I/O tunables. /proc/fs/lustre/llite/<fsname>-<uid>/max_cached_mb: # cat /proc/fs/lustre/llite/lustre-ce63ca00/max_cached_mb 128 This tunable is the maximum amount of inactive data cached by the client (default is 3/4 of RAM). 22.2.1 Client I/O RPC Stream Tunables: The Lustre engine always attempts to pack an optimal
214. ble for later reads, or if it is discarded from cache when the write is completed. By default, writethrough cache is enabled (writethrough_cache_enable = 1). When the OSS receives write requests from a client, it receives data from the client into its memory and writes the data to disk. If writethrough cache is enabled, this data stays in memory after the write request is completed, allowing the OSS to skip reading this data from disk if a later read request, or partial-page write request, for the same data is received. If writethrough cache is disabled (writethrough_cache_enable = 0), then the OSS discards the data after the client's write request is completed and, for subsequent read requests or partial-page write requests, the OSS must re-read the data from disk. Enabling writethrough cache is advisable if clients are doing small or unaligned writes that would cause partial-page updates, or if the files written by one node are immediately being accessed by other nodes. Some examples where this might be useful include producer-consumer I/O models or shared-file writes with a different node doing I/O not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case where files are mostly written to the file system but are not re-read within a short time period, or files are only written and re-read by the same node, regardless of whether the I/O is aligned or not. To disable writethrough cache on all OSTs of an OSS, run
215. by running lst add_group NID, but the user cannot actively add a userspace test node to the test session. However, the console user can passively accept a test node to the test session while the test node runs lstclient to connect to the console. Utilities: LNET self-test includes two user utilities, lst and lstclient. lst is the user interface for the self-test console (run on the console node). It provides a list of commands to control the entire test system, such as create session, create test groups, etc. lstclient is the userspace self-test program, which is linked with userspace LNDs and LNET. A user can invoke lstclient to join a self-test session: lstclient --sesid CONSOLE_NID --group NAME Example: This is an example of an LNET self-test script which simulates the traffic pattern of a set of Lustre servers on a TCP network, accessed by Lustre clients on an IB network (connected via LNET routers), with half the clients reading and half the clients writing. #!/bin/bash export LST_SESSION=$$ lst new_session read/write lst add_group servers 192.168.10.[8,10,12-16]@tcp lst add_group readers 192.168.1.[1-253/2]@o2ib lst add_group writers 192.168.1.[2-254/2]@o2ib lst add_batch bulk_rw lst add_test --batch bulk_rw --from readers --to servers check=simple size=1M lst add_test --batch bulk_rw --from writers --to servers check=full size=4K # start running lst r
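The reader and writer groups in the script of snippet 215 rely on bracketed range expressions of the form start-end/step, so that odd host addresses read and even ones write. The Python sketch below expands such an address range to show which hosts each group covers. It is a hypothetical illustration, not how lst itself parses NIDs: it assumes the bracket syntax [a,b,c-d/step] reconstructed from the garbled script, handles decimal numbers only, and ignores the '*' wildcard and the @network suffix.

```python
import itertools
import re

# <number>, <number>-<number>, or <number>-<number>/<number>
_RANGE_EXPR = re.compile(r"([0-9]+)(?:-([0-9]+)(?:/([0-9]+))?)?")


def expand_range_expr(expr):
    """Expand one range expression, e.g. '12-16' or '1-253/2', to a list."""
    match = _RANGE_EXPR.fullmatch(expr.strip())
    if match is None:
        raise ValueError("bad range expression: %r" % expr)
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else start
    step = int(match.group(3)) if match.group(3) is not None else 1
    if end < start or step < 1:
        raise ValueError("bad range expression: %r" % expr)
    return list(range(start, end + 1, step))


def expand_field(field):
    """Expand one dotted field: a plain number or a bracketed expression list."""
    if field.startswith("[") and field.endswith("]"):
        values = []
        for part in field[1:-1].split(","):
            values.extend(expand_range_expr(part))
        return values
    if re.fullmatch(r"[0-9]+", field) is None:
        raise ValueError("bad address field: %r" % field)
    return [int(field)]


def expand_ip_range(addr_range):
    """Expand e.g. '192.168.1.[1-253/2]' into individual dotted addresses."""
    fields = addr_range.split(".")
    if len(fields) != 4:
        raise ValueError("expected four dotted fields: %r" % addr_range)
    octets = [expand_field(field) for field in fields]
    return [".".join(str(o) for o in combo) for combo in itertools.product(*octets)]


readers = expand_ip_range("192.168.1.[1-253/2]")
writers = expand_ip_range("192.168.1.[2-254/2]")
print(len(readers), len(writers))
```

Under these assumptions the two groups are disjoint and equal in size, which is what gives the script its half-reading, half-writing traffic pattern.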
216. call to use flock may be blocked if another process is holding an incompatible lock. Locks created using flock are applicable for an open file table entry. Therefore, a single process may hold only one type of lock (shared or exclusive) on a single file. Subsequent flock calls on a file that is already locked convert the existing lock to the new lock mode. Example: $ mount -t lustre -o flock mds@tcp0:/lustre /mnt/client You can check it in /etc/mtab. It should look like: mds@tcp0:/lustre /mnt/client lustre rw,flock 0 0 31.5.8 l_getgroups The l_getgroups utility handles Lustre user/group cache upcall. Synopsis: l_getgroups [-v] [-d | mdsname] uid l_getgroups [-v] -s Options: -d: Debug; prints values to stdout instead of Lustre. -s: Sleep; mlocks memory in core and sleeps forever. -v: Verbose; logs start/stop to syslog. mdsname: MDS device name. Description: The group upcall file contains the path to an executable file that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data structure and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file. The l_getgroups utility is the reference implementation of the user or group cache upcall. Files: The l_getgroups files are located at /proc/fs/lustre/mds/mds-service/group_upcall
217. ch to run the test sets. For this setting, specify /mnt/lustre/TESTROOT. Do NOT install pseudo languages. 7. When the system displays this prompt: Install scripts into TESTROOT/BIN Do not immediately respond. Using another terminal (as stopping the script does not work), replace the files /home/tet/test_sets/scen.exec and /home/tet/test_sets/scen.bld with myscen.exec and myscen.bld (downloaded earlier): cp myscen.bld /home/tet/test_sets/scen.bld cp myscen.exec /home/tet/test_sets/scen.exec This limits the tests run only to the relevant file systems and avoids additional hours of other tests on sockets, math, stdio, libc, shell, and so on. 8. Continue with the installation. a. Build the test sets. It proceeds to build and install all of the file system tests. b. Run the test sets. Even though it is running them on a local file system, this is a valuable baseline to compare with the behavior of Lustre. It should put the results into /home/tet/test_sets/results/0002e/journal. Rename or symlink this directory to /home/tet/test_sets/results/ext3/journal (or to the name of the local file system on which the test was run). Running the full test takes about five minutes. Do not re-run any failed test. Results are in a lengthy table at /home/tet/test_sets/results/report. 9. Save the test suite to run further tests on a Lustre file system. Tar up the tests so that you do
218. chapter describes public programming interfaces to control various aspects of Lustre from userspace. These interfaces are generally not guaranteed to remain unchanged over time, although we will make an effort to notify the user community well in advance of major changes. This chapter includes the following section: User/Group Cache Upcall. 28.1 User/Group Cache Upcall This section describes user and group upcall. Note: For information on a universal UID/GID, see Environmental Requirements. 28.1.1 Name Use /proc/fs/lustre/mds/mds-service/group_upcall to look up a given user's group membership. 28.1.2 Description The group upcall file contains the path to an executable that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (see Data structures) and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file. For a sample upcall program, see lustre/utils/l_getgroups.c in the Lustre source distribution. 28.1.2.1 Primary and Secondary Groups The mechanism for the primary/secondary group is as follows: The MDS issues an upcall (set per MDS) to map the numeric UID to the supplementary group(s). If there is no upcall, or if there is an upcall and it fails, supplementary groups will be added as supplied by the client (as they are now). The default upcall is /usr/sbin/l_get
219. cked before the file is written. Similarly, inode usage for specific functions can be controlled if a user over-uses the allocated space. Lustre quota enforcement differs from standard Linux quota support in several ways: Quotas are administered via the lfs command (post-mount). Quotas are distributed (as Lustre is a distributed file system), which has several ramifications. Quotas are allocated and consumed in a quantized fashion. Client does not set the usrquota or grpquota options to mount. When quota is enabled, it is enabled for all clients of the file system, started automatically using quota_type or started manually with lfs quotaon. Caution: Although quotas are available in Lustre, root quotas are NOT enforced. lfs setquota -u root (limits are not enforced). lfs quota -u root (usage includes internal Lustre data that is dynamic in size and does not accurately reflect mount point visible block and inode usage). 9.1.1 Enabling Disk Quotas Use this procedure to enable (configure) disk quotas in Lustre. 1. If you have re-compiled your Linux kernel, be sure that CONFIG_QUOTA and CONFIG_QUOTACTL are enabled. Also, verify that CONFIG_QFMT_V1 and/or CONFIG_QFMT_V2 are enabled. Quota is enabled in all Linux 2.6 kernels supplied for Lustre. 2. Start the server. 3. Mount the Lustre file system on the client and verify that the lquota module has loaded properly by using the lsmod command. lsmod roo
220. command which communicates with multiple services in a single system call, you may have to wait for multiple timeouts. How do I abort recovery? Why would I want to? If an MDS or OST is not gracefully shut down (for example, a crash or power outage occurs), the next time the service starts it is in "recovery" mode. This provides a window for any existing clients to re-connect and re-establish any state which may have been lost in the interruption. By doing so, the Lustre software can completely hide failure from user applications. The recovery window ends when either: All clients which were present before the crash have reconnected, or A recovery timeout expires. This timeout must be long enough for all clients to detect that the node failed and reconnect. If the window is too short, some critical state may be lost, and any in-progress applications receive an error. To avoid this, the recovery window of Lustre 1.x is conservatively long. If a client which was not present before the failure attempts to connect, it receives an error, and a message about recovery displays on the console of the client and the server. New clients may only connect after the recovery window ends. If the administrator knows that recovery will not succeed, because the entire cluster was rebooted or because there was an unsupported failure of multiple nodes simultaneously, then the administrator can abort recovery
221. components, LNET and the MGS, described in the following sections. FIGURE 1-1 Lustre components in a basic cluster [Figure: a Metadata Server (MDS) with its Metadata Target (MDT), Object Storage Servers (OSSs) with their Object Storage Targets (OSTs), and Lustre clients, joined by an interconnect (Ethernet, IB, etc.)] 1.2.2 Lustre Networking (LNET) Lustre Networking (LNET) is an API that handles metadata and file I/O data for file system servers and clients. LNET supports multiple, heterogeneous interfaces on clients and servers. LNET interoperates with a variety of network transports through Network Abstraction Layers (NAL). Lustre Network Drivers (LNDs) are available for a number of commodity and high-end networks, including Infiniband, TCP/IP, Quadrics Elan, Myrinet (MX and GM) and Cray. In clusters with a Lustre file system, servers and clients communicate with one another over a custom networking API known as Lustre Networking (LNET), while the disk storage behind the MDSs and OSSs is connected to these servers using traditional SAN technologies. Key features of LNET include: RDMA, when supported by underlying networks such as Elan, Myrinet and InfiniBand. Support for many commonly-used network types such as InfiniBand and IP. High availability and recovery features enabling transparent recovery in conjunction with failover servers. Sim
222. connection defaults to plain, and all other connections use null. Specifying Flavors by Mount Options: When mounting OST or MDT devices, add the mount option (shown below) to specify the security flavor: # mount -t lustre -o sec=plain /dev/sda1 /mnt/mdt/ This means all connections to this device will use the plain flavor. You can split this sec=flavor as: # mount -t lustre -o sec_mdt=flavor1,sec_cli=flavor2 /dev/sda /mnt/mdt/ This means connections from other MDTs to this device will use flavor1, and connections from all clients to this device will use flavor2. Specifying Flavors by On-Disk Parameters: You can also specify the security flavors by specifying on-disk parameters on OST and MDT devices: # tune2fs -o security.rpc.mdt=flavor1 -o security.rpc.cli=flavor2 device On-disk parameters are overridden by mount options. Mounting Clients: Root on client node mounts Lustre without any special tricks. 11.2.2.6 Rules, Syntax and Examples The general rules and syntax for using Kerberos are: <target>.srpc.flavor.<network>[.<direction>]=flavor <target>: This could be file system name or specific MDT/OST device name. For example, lustre, lustre-MDT0000, lustre-OST0001. <network>: LNET network name of the RPC initiator. For example, tcp0, elan1, o2ib0. <direction>: This could be one of cli2mdt, cli2ost, mdt2mdt, or mdt2ost. I
223. connections and export information. When the server restarts, the clients create a new connection to it. Unmounting a Server with Failover: To stop a server (MDS or OSS) with failover, run: $ umount -f <MDS/OSS mount point> This stops the server and preserves client export information. When the server restarts, the clients reconnect and resume in-progress transactions. Changing the Address of a Failover Node: To change the address of a failover node (e.g., to use node X instead of node Y), run this command on the OSS/OST partition: $ tunefs.lustre --erase-params --failnode=<NID> <device> CHAPTER 5 Service Tags This chapter describes the use of service tags with Lustre and includes the following sections: Introduction to Service Tags; Using Service Tags. 5.1 Introduction to Service Tags Service tags are part of an IT asset inventory management system provided by Sun. A service tag is a unique identifier for a piece of hardware or software (gear) that enables usage data about the tagged item to be shared over a local network in standard XML format. The service tag program is used for a number of Sun products, including hardware, software and services, and has now been implemented for Lustre. Service tags are provided for each MGS, MDS, OSS node and Lustre client. Using service tags enables automatic discovery and t
224. count of nodes in the "from" group). The second number of distribute is a subset of servers (count of nodes in the "to" group); only nodes in two correlative subsets will talk. The following examples are illustrative: Clients: (C1, C2, C3, C4, C5, C6) Server: (S1, S2, S3) --distribute 1:1 (C1->S1), (C2->S2), (C3->S3), (C4->S1), (C5->S2), (C6->S3) /* -> means test conversation */ --distribute 2:1 (C1,C2->S1), (C3,C4->S2), (C5,C6->S3) --distribute 3:1 (C1,C2,C3->S1), (C4,C5,C6->S2), (NULL->S3) --distribute 3:2 (C1,C2,C3->S1,S2), (C4,C5,C6->S3,S1) --distribute 4:1 (C1,C2,C3,C4->S1), (C5,C6->S2), (NULL->S3) --distribute 4:2 (C1,C2,C3,C4->S1,S2), (C5,C6->S3,S1) --distribute 6:3 (C1,C2,C3,C4,C5,C6->S1,S2,S3) There are only two test types: ping: There are no private parameters for the ping test. brw: The brw test can have several options: read | write: Read or write. The default is read. size=# | #K | #M: I/O size can be bytes, KB or MB (i.e., size=1024, size=4K, size=1M). The default is 4K bytes. check=full | simple: A data validation check (checksum of data). The default is no check. As an example: $ lst add_group clients 192.168.1.[10-17]@tcp $ lst add_group servers 192.168.10.[100-103]@tcp $ lst add_batch bulkperf $ lst add_test --batch bulkperf --loop 100 --concurrency 4 --distribute 4:2 --from clients
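The distribute examples in excerpt 224 can be modeled in a few lines. The following is a minimal sketch that reproduces the pairings listed there; it is not the lst implementation, and the function name `distribute` and the wrap-around rule for servers are assumptions inferred from the 3:2 and 4:2 examples.

```python
def distribute(clients, servers, from_size, to_size):
    """Pair client subsets with server subsets, as in the --distribute examples.

    Hypothetical helper: clients are cut into consecutive subsets of
    `from_size`; subset i talks to `to_size` consecutive servers starting at
    index i * to_size, wrapping around the server list (an inference from the
    3:2 and 4:2 examples, not a documented rule).
    """
    pairs = []
    for start in range(0, len(clients), from_size):
        subset_index = start // from_size
        first_server = subset_index * to_size
        targets = [servers[(first_server + k) % len(servers)]
                   for k in range(to_size)]
        pairs.append((clients[start:start + from_size], targets))
    return pairs


CLIENTS = ["C1", "C2", "C3", "C4", "C5", "C6"]
SERVERS = ["S1", "S2", "S3"]
```

Calling `distribute(CLIENTS, SERVERS, 4, 2)` yields the two conversations of the 4:2 example. Servers that no subset reaches (the NULL->S3 entries) simply do not appear in the result.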
225. ction when a peer sends a NAK. Example: When it has timed out, this node: Dumps peer debug and the history on receiving a NAK. Sets intervals to check some peers for timed out communications while the application blocks for communications to complete. The communications timeout (in seconds). The time (in seconds) after which the ptllnd prints a warning if it blocks for a longer time during connection establishment, cleanup after an error, or cleanup during shutdown. The following environment variables can be set to configure the PTLLND's behavior. Variable (default): Description. PTLLND_PORTAL (9): The portal ID (PID) to use for the ptllnd traffic. PTLLND_PID (9): The virtual PID on which to contact servers. PTLLND_PEERCREDITS (8): The maximum number of concurrent sends that are outstanding to a single peer at any given instant. PTLLND_MAX_MESSAGE_SIZE (512): The maximum message size. This MUST be the same on all nodes in a cluster. PTLLND_MAX_MSGS_PER_BUFFER (64): The number of messages in a receive buffer. Receive buffer will be allocated of size PTLLND_MAX_MSGS_PER_BUFFER times PTLLND_MAX_MESSAGE_SIZE. PTLLND_MSG_SPARE (256): Additional receive buffers posted to portals. PTLLND_PEER_HASH_SIZE (101): Number of hash table slots for the peers. PTLLND_EQ_SIZE (1024): Size of the Portals event queue (that is, maximum number of events in the queue).
226. d is the network dropping packets? Consider what was happening on the cluster at the time. Does this relate to a specific user workload or a system load condition? Is the condition reproducible? Does it happen at a specific time (day, week or month)? To recover from this problem, you must restart Lustre services using these file systems. There is no other way to know that the I/O made it to disk, and the state of the cache may be inconsistent with what is on disk. Identifying a Missing OST: If an OST is missing for any reason, you may need to know what files are affected. Although an OST is missing, the file system should be operational. From any mounted client node, generate a list of files that reside on the affected OST. It is advisable to mark the missing OST as unavailable so clients and the MDS do not time out trying to contact it. 1. Generate a list of devices and determine the OST's device number. Run: $ lctl dl The lctl dl command output lists the device name and number, along with the device UUID and the number of references on the device. 2. Deactivate the OST (on the OSS at the MDS). Run: $ lctl --device <OST device name or number> deactivate The OST device number or device name is generated by the lctl dl command. The deactivate command prevents clients from creating new objects on the specified OST, although you can still access the OST for reading.
227. d 3 lov lustre mdtlov lustre mdtlov_UUID 4 mds lustre MDT0000 lustre MDT0000_UUID 7 osc lustre OST0000 osc lustre mdtlov_UUID 5 osc lustre OST0001 osc lustre mdtlov_UUID 5 UP lov lustre clilov ce63ca00 08ac6584 6c4a 3536 2c6d UP mdc lustre MDT0000 mdc ce63ca00 08ac6584 6c4a 3536 2c6d UP osc lustre OST0000 osc ce63ca00 08ac6584 6c4a 3536 2c6d b36c f9chbdaa05 10 UP osc lustre OST0001 osc ce63ca00 08ac6584 6cl4a 3536 2c6d b36c f9chbdaa05 2. Or from the device label at any time: # e2label /dev/sda lustre-MDT0000 21.2 Lustre Timeouts Lustre uses two types of timeouts. LND timeouts that ensure point-to-point communications complete in finite time in the presence of failures. These timeouts are logged with the S_LND flag set. They may not be printed as console messages, so you should check the Lustre log for D_NETERROR messages, or enable printing of D_NETERROR messages to the console (echo neterror > /proc/sys/lnet/printk). Congested routers can be a source of spurious LND timeouts. To avoid this, increase the number of LNET router buffers to reduce back-pressure and/or increase LND timeouts on all nodes on all connected networks. You should also consider increasing the total number of LNET router nodes in the system so that the aggregate router bandwidth matches the aggregate server bandwidth. Lustre timeouts that ensure Lustre RP
228. d clients receive I/O errors until they reconnect. Note: If you are using loopback devices, use the -d flag. This flag cleans up loop devices and can always be safely specified. 4.3.4 Working with Inactive OSTs To mount a client or an MDT with one or more inactive OSTs, run commands similar to this: client> mount -o exclude=testfs-OST0000 -t lustre uml1:/testfs /mnt/testfs client> cat /proc/fs/lustre/lov/testfs-clilov-*/target_obd To activate an inactive OST on a live client or MDT, use the lctl activate command on the OSC device. For example: lctl --device 7 activate Note: A colon-separated list can also be specified. For example, exclude=testfs-OST0000:testfs-OST0001. 4.3.5 Finding Nodes in the Lustre File System There may be situations in which you need to find all nodes in your Lustre file system or get the names of all OSTs. To get a list of all Lustre nodes, run this command on the MGS: # cat /proc/fs/lustre/mgs/MGS/live/* Note: This command must be run on the MGS. In this example, file system lustre has three nodes, lustre-MDT0000, lustre-OST0000, and lustre-OST0001. cfs21:/tmp# cat /proc/fs/lustre/mgs/MGS/live/* fsname: lustre flags: 0x0 gen: 26 lustre-MDT0000 lustre-OST0000 lustre-OST0001 To get the names of all OSTs, run this command on the MDS: # cat /proc/fs/lustre/lov/<fsname>-mdtlov/target_obd Note
229. d in addition to old parameters; they do not replace them. To erase all old tunefs.lustre parameters and just use newly-specified parameters, run: $ tunefs.lustre --erase-params --param=<new parameters> The tunefs.lustre command can be used to set any parameter settable in a /proc/fs/lustre file and that has its own OBD device, so it can be specified as <obd|fsname>.<obdtype>.<proc_file_name>=<value>. For example: $ tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1 Options: The tunefs.lustre options are listed and explained below. --comment=comment: Sets a user comment about this disk, ignored by Lustre. --dryrun: Only prints what would be done; does not affect the disk. --erase-params: Removes all previous parameter information. --failnode=nid,...: Sets the NID(s) of a failover partner. This option can be repeated as needed. --fsname=filesystem_name: The Lustre file system of which this service will be a part. The default file system name is "lustre". --index=index: Forces a particular OST or MDT index. --mountfsoptions=opts: Sets permanent mount options; equivalent to the setting in /etc/fstab. --mgs: Adds a configuration management service to this target. --mgsnode=nid,...: Sets the NID(s) of the MGS node; required for all targets other than the MGS. --nomgs: Removes a configuration management service
230. d recovery 19 6 VBR introduction 19 5 VBR tips 19 7 VBR working with 19 7 Lustre I O kit downloading 18 2 obdfilter_survey tool 18 5 ost_survey tool 18 11 PIOS I O modes 18 14 PIOS tool 18 12 prerequisites to using 18 2 running tests 18 2 sgpdd_survey tool 18 3 Lustre Network Driver LND 2 1 Lustre Networking LNET 1 16 Lustre SNMP module Index 4 Lustre 1 8 Operations Manual October 2009 building 14 2 installing 14 2 using 14 3 lustre_config sh utility 31 19 lustre_createcsv sh utility 31 19 lustre_req_history sh utility 31 20 lustre_up14 sh utility 31 19 LVM snapshots 15 5 M mani filefrag 27 20 lfs 27 2 lfsck 27 12 mount 27 22 man2 user group cache upcall 28 1 man3 llapi 29 1 man5 LNET options 30 3 module options 30 2 MX LND 30 20 OpenIB LND 30 14 Portals LND Catamount 30 18 Portals LND Linux 30 15 QSW LND 30 10 RapidArray LND 30 11 VIB LND 30 12 mang extents_stats utility extents_stats utility 31 20 Ictl 31 8 llog_reader utility 31 21 llstat sh 31 20 loadgen utility 31 21 lr_reader utility 31 21 lustre_config sh 31 19 lustre_createcsv sh utility 31 19 lustre_req_history sh 31 20 lustre_up14 sh utility 31 19 mkfs lustre 31 2 mount lustre 31 15 offset_stats utility 31 21 plot Ilstat sh 31 20 tunefs lustre 31 5 vfs_ops_stats utility vfs_ops_stats utility 31 20 mballoc history 22 24 mballoc3 tunables 22 26 MDS failover 8 6 failover con
231. d run e2fsck against that storage before restarting Lustre. As per the Lustre requirement, the shared storage used for failover is completely cache-coherent. This ensures that if one server takes over for another, it sees the most up-to-date and accurate copy of the data. In case of the failover of the server, if the shared storage does not provide cache coherency between all of its ports, then Lustre can produce an error. If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor. If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded, then lose the data (device corruption or disk errors). Lustre Error: Slow Start_Page_Write The slow start_page_write message appears when the operation takes an extremely long time to allocate a batch of memory pages. Use these pages to receive network traffic first, and then write to disk. Drawbacks in Doing Multi-client O_APPEND Writes It is possible to do multi-client O_APPEND writes to a single file, but there are a few drawbacks that may make this a sub-optimal solution. These drawbacks are: Each client needs to take an EOF lock on all the OSTs
232. d with 4 disks: /dev/dsk/c0t0d1, c0t0d3, c1t0d1, and c1t0d3. For smaller arrays, RAID 1 could be used. c. Create a RAID array for an MDT. On the MDT, run: $ mdadm --create <array_device> -l <raid_level> -n <active_devices> -x <spare_devices> <block_devices> where: <array_device>: RAID array to create, in the form of /dev/mdX. <raid_level>: Architecture of the RAID array. RAID 1 or RAID 10 is recommended for MDTs. <active_devices>: Number of active disks in the RAID array, including mirrors. <spare_devices>: Number of spare disks initially assigned to the RAID array. More disks may be brought in via spare pooling (see below). <block_devices>: List of the block devices used for the RAID array; wildcards may be used. For the worked example, the command is: $ mdadm --create -l 10 -n 4 -x 0 /dev/md10 /dev/dsk/c[01]t0d[13] This command output displays: mdadm: array /dev/md10 started. If you are creating many arrays across many servers, we recommend scripting this process. Note: Do not use the --assume-clean option when creating arrays. This could lead to data corruption on RAID 5 and will cause array checks to show errors with all RAID types. 3. Set up the mdadm tool. The mdadm tool enables you to monitor disks for failures (you will receive a notification). It also enables you to manage spare
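Since excerpt 232 recommends scripting array creation across many servers, here is a minimal sketch of one way to assemble the mdadm command line. The helper name `mdadm_create_argv` is hypothetical, and the device-count check reflects the assumption that mdadm expects as many block devices as active plus spare disks. The function only builds the argument list; it does not run mdadm.

```python
def mdadm_create_argv(array_device, raid_level, active_devices,
                      spare_devices, block_devices):
    """Build the argument list for the mdadm --create template in the text.

    Hypothetical helper for scripting: it returns the argv list and leaves
    execution (for example with subprocess.run) to the caller.
    """
    block_devices = list(block_devices)
    # Assumption: the device list must match active + spare disk counts.
    if len(block_devices) != active_devices + spare_devices:
        raise ValueError(
            "expected %d block devices, got %d"
            % (active_devices + spare_devices, len(block_devices)))
    return ["mdadm", "--create", array_device,
            "-l", str(raid_level),
            "-n", str(active_devices),
            "-x", str(spare_devices)] + block_devices


# The four disks named in the worked example.
MDT_DISKS = ["/dev/dsk/c0t0d1", "/dev/dsk/c0t0d3",
             "/dev/dsk/c1t0d1", "/dev/dsk/c1t0d3"]
```

For the worked example, `mdadm_create_argv("/dev/md10", 10, 4, 0, MDT_DISKS)` gives the RAID 10 command with the disks listed explicitly instead of as a shell wildcard.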
233. d95 28a6 ebb2 10e4 46a3ceef9007 1. Try lconf --cleanup --force. 2. If that does not work, start lctl (if it is not running already). Then, starting with the highest-numbered device and working backward, clean up each device: root# lctl lctl > cfg_device ost003_s1_client lctl > cleanup force lctl > detach lctl > cfg_device OSS lctl > cleanup force lctl > detach lctl > cfg_device ost003_s1 lctl > cleanup force lctl > detach At this point it should also be possible to unload the Lustre modules. How to build and configure Infiniband support for Lustre: The distributed kernels do not yet include 3rd-party Infiniband modules. As a result, our Lustre packages cannot include IB network drivers for Lustre either; however, we do distribute the source code. You will need to build your Infiniband software stack against the supplied kernel, and then build new Lustre packages. If this is outside your realm of expertise and you are a Lustre enterprise support customer, we can help. Voltaire: To build Lustre with Voltaire Infiniband sources, add --with-vib=<path-to-voltaire-sources> as an argument to the configure script. To configure Lustre, use: --nettype vib --nid <IPoIB address> OpenIB generation 1 / Mellanox Gold: To build Lustre with OpenIB Infiniband sources, add --with-openib=<path_to_openib sources> as an argument to the configure script. T
234. d_ min max started Note: Currently, the maximum thread count values are only advisory because Lustre does not reduce the number of service threads in use, even if that number exceeds the thread_max value. Lustre does not stop service threads once they are started. 22.3 Debug Support /proc/sys/lnet/debug By default, Lustre generates a detailed log of all operations to aid in debugging. The level of debugging can affect the performance or speed you achieve with Lustre. Therefore, it is useful to reduce this overhead by turning down the debug level to improve performance. Raise the debug level when you need to collect the logs for debugging problems. The debugging mask can be set with "symbolic names" instead of the numerical values that were used in prior releases. The new symbolic format is shown in the examples below. Note: All of the commands below must be run as root; note the # nomenclature. To verify the debug level used by examining the sysctl that controls debugging, run: # sysctl lnet.debug lnet.debug = ioctl neterror warning error emerg ha config console To disable debugging (except for network error debugging), run this command on all concerned nodes: # sysctl -w lnet.debug="neterror" lnet.debug = neterror To completely disable debugging, run this command on all concerned nodes: # sysctl -w lnet.debug=0 lnet.debug = 0 To set an appropriate debug level for a producti
235. de is not a router. up/down indicates this node is a router. auto_fail must be enabled. max: Maximum number of concurrent sends from this peer. rtr: Routing buffer credits. min: Minimum routing buffer credits seen. tx: Send credits. min: Minimum send credits seen. queue: Total bytes in active/queued sends. Credits work like a semaphore. At start they are initialized to allow a certain number of operations (8 in this example). LNET keeps track of the minimum value so that you can see how congested a resource was. If rtr/tx is less than max, there are operations in progress. The number of operations is equal to rtr or tx subtracted from max. If rtr/tx is greater than max, there are operations blocking. LNET also limits concurrent sends and router buffers allocated to a single peer so that no peer can occupy all these resources. /proc/sys/lnet/nis # cat /proc/sys/lnet/nis nid refs peer max tx min 0@lo 3 0 0 0 0 192.168.10.34@tcp 4 8 256 256 252 Shows the current queue health on this node. The fields are explained below. Field: Description. nid: Network interface. refs: Internal reference counter. peer: Number of peer-to-peer send credits on this NID. Credits are used to size buffer pools. max: Total number of send credits on this NID. tx: Current number of send credits available on this NID. min: Lowest number of send credits available on this NID. queue: Total bytes in active/queued s
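The credit arithmetic in excerpt 235 can be checked against the sample nis output. The following is a minimal sketch, assuming the six-column layout shown in that sample; the function names are hypothetical. The "peak" figure (max minus min) is an interpretation of the statement that LNET tracks the minimum so you can see how congested a resource was.

```python
def parse_nis(text):
    """Parse /proc/sys/lnet/nis-style text into one dict per NID.

    Assumes the whitespace-separated layout of the sample in the text:
    a header row (nid refs peer max tx min) followed by one row per NID.
    """
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    entries = []
    for row in body:
        # First column is the NID string; the rest are integer counters.
        values = [row[0]] + [int(v) for v in row[1:]]
        entries.append(dict(zip(header, values)))
    return entries


def sends_in_progress(entry):
    """Operations in progress now: tx subtracted from max."""
    return entry["max"] - entry["tx"]


def peak_sends(entry):
    """Most operations seen in progress at once (max minus lowest tx seen)."""
    return entry["max"] - entry["min"]


SAMPLE = """\
nid refs peer max tx min
0@lo 3 0 0 0 0
192.168.10.34@tcp 4 8 256 256 252
"""
```

For the tcp NID in the sample, no sends are in progress at the moment (tx equals max), but at some point four send credits were in use (max 256, min 252).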
236. der the batching. Generally, a good stripe size for sequential I/O using high-speed networks is between 1 MB and 4 MB. Stripe sizes larger than 4 MB do not parallelize as effectively because Lustre tries to keep the amount of dirty cached data below 32 MB per server (with the default configuration). Writes which cross an object boundary are slightly less efficient than writes which go entirely to one server. Depending on your application's write patterns, you can assist it by choosing a stripe size with that in mind. If the file is written in a very consistent and aligned way, make the stripe size a multiple of the write size. The choice of stripe size has no effect on a single-stripe file. 24.2 Displaying Files and Directories with lfs getstripe Use lfs to print the index and UUID for each OST in the file system, along with the OST index and object ID for each stripe in the file. For directories, the default settings for files created in that directory are printed. lfs getstripe <filename> Use lfs find to inspect an entire tree of files. lfs find [--recursive | -r] <file or directory> If a process creates a file, use the lfs getstripe command to determine which OST(s) the file resides on. Using cat as an example, run: $ cat > foo In another terminal, run: $ lfs getstripe /barn/users/jacob/tmp/foo OBDS
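The stripe-size guidance in excerpt 236 (stay between 1 MB and 4 MB, and use a multiple of the write size for consistent, aligned writes) can be turned into a small calculation. This is only a sketch of that rule of thumb: the function name is hypothetical, MB is taken as 2^20 bytes, and any additional constraints that lfs setstripe may impose on stripe sizes are not modeled.

```python
MIB = 1 << 20  # assumption: "MB" in the text is read as 2**20 bytes


def suggest_stripe_size(write_size, low=1 * MIB, high=4 * MIB):
    """Return the smallest multiple of write_size within [low, high].

    Returns None when no multiple of the write size fits in the range,
    in which case the rule of thumb gives no direct answer.
    """
    if write_size <= 0:
        raise ValueError("write_size must be positive")
    multiples = -(-low // write_size)  # ceiling division
    candidate = multiples * write_size
    return candidate if candidate <= high else None
```

For an application that always writes 256 KB records, `suggest_stripe_size(256 * 1024)` returns 1 MB, so four records fill one stripe exactly and no write crosses an object boundary.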
237. des. Tell the script the names of the OSCs, which should be up and running. Alternately, you can pass the parameter case=netdisk to the script. The script will use all of the local OSCs. Note: The obdfilter_survey script is NOT scalable to 100s of nodes since it is only intended to measure individual servers, not the scalability of the entire system. Note: The obdfilter_survey script must be customized, depending on the components under test and where the script's working files should be kept. Customization variables are clearly described in the script (Customization Variables section). In particular, refer to the maximum supported value ranges for customization variables. 18.2.2.1 Running obdfilter_survey Against a Local Disk The obdfilter_survey script can be run automatically or manually against a local disk. obdfilter_survey profiles the overall throughput of storage hardware by sending ranges of workloads to the OSTs (that vary in thread counts and I/O sizes). When the obdfilter_survey script is complete, it provides information on the performance abilities of the storage hardware and shows the saturation points. If you use plot scripts on the data, this information is shown graphically. To run the obdfilter_survey script, create a Lustre configuration using normal methods; no special setup is needed. To perform an automatic run: 1. Set up the Lustre file system with the req
238. devices=2 spares=1 UUID=46ecd502:b39cd6d9:dd7e163b:dd9b2620 spare-group=journals ARRAY /dev/md24 level=raid1 num-devices=2 spares=1 UUID=5c099970:2a9919e6:28c9b741:3134be7e spare-group=journals ARRAY /dev/md25 level=raid1 num-devices=2 spares=1 UUID=b44a56c0:b1893164:4416e0b8:75beabc4 spare-group=journals ARRAY /dev/md26 level=raid1 num-devices=2 spares=1 UUID=2adf9d0f:2b7372c5:4e5f483f:3d9a0a25 spare-group=journals # Email address to notify of events (e.g. disk failures) MAILADDR admin@example.com 4. Set up periodic checks of the RAID array. We recommend checking the software RAID arrays monthly for consistency. This can be done using cron and should be scheduled for an idle period so performance is not affected. To start a check, write "check" into /sys/block/[ARRAY]/md/sync_action. For example, to check /dev/md10, run this command on the Lustre server: $ echo check > /sys/block/md10/md/sync_action 5. Format the OSTs and MDT, and continue with normal Lustre setup and configuration. For configuration information, see Configuring Lustre. Note: Per Bugzilla 18475, we recommend that stripe_cache_size be set to 16KB (instead of 2KB). These additional resources may be helpful when enabling software RAID on Lustre: md(4), mdadm(8), mdadm.conf(5) manual pages. Linux software RAID wiki: http://linux-raid.osdl.org Kernel documentation: Docum
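The periodic check in step 4 of excerpt 238 is a single write to a sysfs file, so a monthly cron job could call something like the sketch below. The function name and the `sysfs_root` parameter are hypothetical; the parameter exists only so the helper can be tried against a scratch directory. On a real server it needs root privileges.

```python
from pathlib import Path


def start_raid_check(array, sysfs_root="/sys"):
    """Write "check" into <sysfs_root>/block/<array>/md/sync_action.

    Equivalent in intent to: echo check > /sys/block/md10/md/sync_action
    `array` is the bare device name, for example "md10".
    Returns the path that was written.
    """
    sync_action = Path(sysfs_root) / "block" / array / "md" / "sync_action"
    sync_action.write_text("check\n")
    return sync_action
```

A cron-driven script might loop over the array names from mdadm.conf and call `start_raid_check(name)` for each one during an idle period.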
239. disks. When a disk fails, you can use mdadm to make a spare disk active, until such time as the failed disk is replaced. Here is an example mdadm.conf from an OSS with 7 OSTs including external journals. Note how spare groups are configured, so that OSTs without spares still benefit from the spare disks assigned to other OSTs. ARRAY /dev/md10 level=raid6 num-devices=10 UUID=e8926d28:0724ee29:65147008:b8df0bd1 spare-group=raids ARRAY /dev/md11 level=raid6 num-devices=10 spares=1 UUID 7b045948 ac4edfc4 9d7a279 17b468cd spare-group=raids ARRAY /dev/md12 level=raid6 num-devices=10 spares=1 UUID 29d8c0 0 d9408537 39c8053e bd476268 spare-group=raids ARRAY /dev/md13 level=raid6 num-devices=10 UUID 1753 a96 d83a518 d49fc558 9ae3488c spare-group=raids ARRAY /dev/md14 level=raid6 num-devices=10 spares=1 UUID 7 0ad256 0b3459a4 d7366660 cf6c7249 spare-group=raids ARRAY /dev/md15 level=raid6 num-devices=10 UUID=09830fd2:1cac8625:182d9290:2b1ccf2a spare-group=raids ARRAY /dev/md16 level=raid6 num-devices=10 UUID=32bf1b12:4787d254:29e76bd7:684d7217 spare-group=raids ARRAY /dev/md20 level=raid1 num-devices=2 spares=1 UUID bc b5 40 7a2ebd50 b3111587 8b393b86 spare-group=journals ARRAY /dev/md21 level=raid1 num-devices=2 spares=1 UUID 6c82d034 3 5465ad 11663a04 58fbc2d1 spare-group=journals ARRAY /dev/md22 level=raid1 num-devices=2 spares=1 UUID=7c7274c5:86970569:03c22c87:e7a40e11 spare-group=journals ARRAY /dev/md23 level=raid1 num
240. e run: echo 1 > /proc/fs/lustre/llite/<fsname>/checksum_pages To disable both types of checksums (in-memory and wire), run: echo 0 > /proc/fs/lustre/llite/<fsname>/checksum_pages To check the status of a wire checksum, run: lctl get_param osc.*.checksums 24.7.1.1 Changing Checksum Algorithms By default, Lustre uses the adler32 checksum algorithm, because it is robust and has a lower impact on performance than crc32. The Lustre administrator can change the checksum algorithm via /proc, depending on what is supported in the kernel. To check which checksum algorithm is being used by Lustre, run: $ cat /proc/fs/lustre/osc/<fsname>-OST<index>-osc-*/checksum_type To change the wire checksum algorithm used by Lustre, run: $ echo <algorithm name> > /proc/fs/lustre/osc/<fsname>-OST<index>-osc-*/checksum_type Note: The in-memory checksum always uses the adler32 algorithm, if available, and only falls back to crc32 if adler32 cannot be used. In the following example, the cat command is used to determine that Lustre is using the adler32 checksum algorithm. Then the echo command is used to change the checksum algorithm to crc32. A second cat command confirms that the crc32 checksum algorithm is now in use. $ cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff81012b2c48e0/checksum_type crc32 [adler] $ echo crc32 > /proc/fs/lustre/osc/l
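Reading the checksum_type file from a script could look like the sketch below. It assumes the file holds one line listing the supported algorithms with the active one in square brackets, for example `crc32 [adler]`; that bracket convention and the function name `parse_checksum_type` are assumptions made for illustration.

```python
import re


def parse_checksum_type(text: str):
    """Return (active, supported) for a line such as 'crc32 [adler]'.

    Assumes the active algorithm is the one wrapped in square brackets.
    """
    supported = re.findall(r"\w+", text)
    match = re.search(r"\[(\w+)\]", text)
    active = match.group(1) if match else None
    return active, supported
```

After writing crc32 to the file, the same helper would be expected to report `crc32` as the active algorithm.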
241. e. These features/packages can be enabled at build time by issuing appropriate arguments to the configure command. For a list of supported features and packages, run ./configure --help in the Lustre source tree. The configs/ directory of the kernel source contains the config files matching each kernel version. Copy one to .config at the root of the kernel tree. 3. Create the kernel package. Navigate to the kernel source directory and run: $ make rpm Example: kernel-2.6.95.0.3.EL_lustre.1.6.5.1custom-1.i686.rpm Note: Step 3 is only valid for RedHat and SuSE kernels. If you are using a stock Linux kernel, you need to get a script to create the kernel RPM. 4. Install the Lustre packages. Lustre requires a set of packages (kernel, module, utilities and e2fsprogs) to be installed in a specific order. Depending on the platform, different packages are required. a. For each Lustre package, determine if it needs to be installed on servers and/or clients. TABLE 3-1 provides a complete list of the required Lustre packages and, for each package, where to install it. Depending on the selected platform, not all of the packages listed in TABLE 3-1 need to be installed. b. Install the kernel, modules and ldiskfs packages. Navigate to the directory where the RPMs are stored, and use the rpm -ivh command to install the kernel, module and ldiskfs packages. $ rpm -ivh kernel-lustre-smp-<ver
242. e mounted client on that node to recover. However, since the client is now in a crashed state, the OST considers it to be a new client and blocks it from mounting until the recovery completes. As a result, running an OST and a client on the same machine can cause a double failure and prevent a complete recovery. 26.5 Improving Lustre Metadata Performance While Using Large Directories To improve metadata performance while using large directories, follow these tips: • Increase RAM on the MDS. On the MDS, more memory translates into bigger caches, thereby increasing the metadata performance. • Patch the core kernel on the MDS with the 3G/1G patch (if not running a 64-bit kernel), which increases the available kernel address space. This translates into support for bigger caches on the MDS. Part V Reference This part includes reference information on Lustre user utilities, configuration files and module parameters, programming interfaces, system configuration utilities, and system limits. CHAPTER 27 User Utilities (man1) This chapter describes user utilities and includes the following sections: • lfs • lfsck • Filefrag • Handling Timeouts 27.1 lfs The lfs utility can be used to create a new file with a specific striping pattern, determine the striping pattern of existing files, and gather the extended attributes (object nu
243. e 1.8, adaptive timeouts are enabled by default. In earlier Lustre versions supporting adaptive timeouts (1.6.5 through 1.6.7.x), this feature was disabled by default. In previous Lustre versions, the static obd_timeout (/proc/sys/lustre/timeout) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client's timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period. 22.1.3.1 Configuring Adaptive Timeouts One of the goals of adaptive timeouts is to relieve users from having to tune the obd_timeout value. In general, obd_timeout should no longer need to be changed. However, there are several parameters related to adaptive timeouts that users can set. In most situations, the default values should be used. The following parameters can be set as module parameters in modprobe.conf or at runtime in /proc/sys/lustre/. Parameter / Description (at_min, at_max, at_history): at_min: Sets the minimum adaptive timeout (in seconds). Default value is 0. The at_min parameter is the minimum processing time that a server will report. Clients bas
244. e 32.8 Maximum Number of Files or Subdirectories in a Single Directory 32.9 MDS Space Consumption 32.10 Maximum Length of a Filename and Pathname 32.11 Maximum Number of Open Files for Lustre File Systems 32.12 OSS RAM Size A Lustre Knowledge Base Glossary Index Preface The Lustre 1.8 Operations Manual provides detailed information and procedures to install, configure and tune Lustre. The manual covers topics such as failover, quotas, striping and bonding. The Lustre manual also contains troubleshooting information and tips to improve Lustre operation and performance. Using UNIX Commands This document might not contain information about basic UNIX commands and procedures such as shutting down the system, booting the system, and configuring devices. Refer to the following for this information: • Software documentation that you received with your system • Solaris Operating System documentation, which is at http://docs.sun.com Shell Prompts Shell / Prompt: C shell: machine-name%; C shell superuser: machine-name#; Bourne shell and Korn shell: $; Bourne shell and Korn shell superuser: # Typographic Conventions Typeface / Meaning / Examples: AaBbCc123: The names of commands, files, and directories; on-screen computer output. AaBbCc123: What you type, when contrasted
245. e Lustre group is similar to that of the upstream RedHat and SuSE packages. Currently, RHEL does not enable CONFIG_SCSI_MULTI_LUN because it can cause problems with SCSI hardware. To enable this, set the scsi_mod max_scsi_luns=xx option (typically, xx is 128) in either modprobe.conf (2.6 kernel) or modules.conf (2.4 kernel). To pass this option as a kernel boot argument (in grub.conf or lilo.conf), compile the kernel with CONFIG_SCSI_MULTI_LUN=y. 26.4 Failures Running a Client and OST on the Same Machine There are inherent problems if a client and OST share the same machine (and the same memory pool). An effort to relieve memory pressure (by the client) requires memory to be available to the OST. If the client is experiencing memory pressure, then the OST is as well. The OST may not get the memory it needs to help the client get the memory it needs because it is all one memory pool; this results in deadlock. Running a client and an OST on the same machine can cause these failures: • If the client holds dirty file system data in memory and is under memory pressure, a kernel thread flushes dirty pages to the file system, and it writes to a local OST. To complete the write, the OST needs to do an allocation. The allocation then blocks while waiting for the above kernel thread to complete the write and free up some memory. This is a deadlock condition. • If the node with both a client and OST crashes, then the OST waits for th
246. e conf; start clients; start servers. loadgen utility. locking proc entries. logs. lr_reader utility. LUNs, adding. Lustre: administration, aborting recovery; administration, changing a server NID; administration, failout/failover mode for OSTs; administration, file system name; administration, finding nodes in the file system; administration, mounting a server; administration, mounting a server without Lustre service; administration, removing and restoring OSTs; administration, running multiple Lustre file systems; administration, setting Lustre parameters; administration, working with inactive OSTs; administration, running writeconf; administration, unmounting a server; backups; components; configuration example; configuring; downgrading 1.8.x to 1.6.x; installing, debugging tools; installing, environmental requirements; installing, HA software; installing, memory requirements; installing, prerequisites; installing, required software; installing, required tools/utilities; interoperability; key features; operational scenarios; parameters, getting; parameters, listing; parameters, setting; recovering; scaling; system capacity; upgrading 1.6.x to 1.8.x; upgrading 1.8.x to next minor version; VBR, delaye
247. e from RPMs. Caution: Lustre contains kernel modifications which interact with storage devices and may introduce security issues and data loss if not installed, configured and administered correctly. Before installing Lustre, back up ALL data. Note: When using third-party network hardware with Lustre, the third-party modules (typically, the drivers) must be linked against the Linux kernel. The LNET modules in Lustre also need these references. To meet these requirements, a specific process must be followed to install and recompile Lustre. See Installing Lustre with a Third-Party Network Stack, which provides an example to install Lustre 1.6.6 using the Myricom MX 1.2.7 driver. The same process can be used for other third-party network stacks. Patching the Kernel If you are using non-standard hardware, plan to apply a Lustre patch, or have another reason not to use packaged Lustre binaries, you have to apply several Lustre patches to the core kernel and run the Lustre configure script against the kernel. 3.3.1.1 Introducing the Quilt Utility To simplify the process of applying Lustre patches to the kernel, we recommend that you use the Quilt utility. Quilt manages a stack of patches on a single source tree. A series file lists the patch files and the order in which they are applied. Patches are applied incrementally on the base tree and all preceding patches. Patches can
248. e> For example: $ tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1 4.3.9.3 Setting Parameters with lctl In Lustre, you can use the lctl command to set parameters temporarily or permanently, get current parameter settings, and list available parameters. Setting Parameters When the file system is running, temporary parameters can be set using the lctl set_param command. These parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax: lctl set_param [-n] <obdtype>.<obdname>.<proc_file_name>=<value> For example: $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100)) Many permanent parameters can be set with the lctl conf_param command. In general, the lctl conf_param command can be used to specify any parameter settable in a /proc/fs/lustre file, with its own OBD device. The lctl conf_param command uses this syntax: <obd|fsname>.<obdtype>.<proc_file_name>=<value> Here are some lctl conf_param examples: mgs> lctl conf_param testfs-MDT0000.sys.timeout=40 $ lctl conf_param testfs-MDT0000.mdt.group_upcall=NONE $ lctl conf_param testfs.llite.max_read_ahead_mb=16 $ lctl conf_param testfs-MDT0000.lov.stripesize=2M $ lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15 $ lctl conf_param testfs-OST0000.ost.client_cache_seconds=15 $ lctl conf_param testfs.sys.timeout=40
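The shape of an lctl conf_param argument can be illustrated with a small parser. This is a sketch under the assumption that the argument has the dotted form `<obd|fsname>.<obdtype>.<proc_file_name>=<value>`, with an optional `-<device>` suffix on the file system name; `parse_conf_param` is a hypothetical helper, not an lctl interface.

```python
def parse_conf_param(arg: str) -> dict:
    """Split '<obd|fsname>.<obdtype>.<proc_file_name>=<value>' into its parts."""
    name, sep, value = arg.partition("=")
    if not sep:
        raise ValueError("expected '=<value>' in %r" % arg)
    parts = name.split(".", 2)
    if len(parts) != 3:
        raise ValueError("expected three dot-separated fields in %r" % name)
    target, obdtype, proc_file = parts
    # 'testfs-MDT0000' names one device; plain 'testfs' names the whole file system.
    fsname, _, device = target.partition("-")
    return {
        "fsname": fsname,
        "device": device or None,
        "obdtype": obdtype,
        "param": proc_file,
        "value": value,
    }
```

Splitting on the first `=` before splitting on dots keeps values such as 29.15 intact.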
249. e important than the slight risk of data loss and downtime in case of a hardware/software problem on the DDN. Note: There is no risk from an OSS/MDS node crashing, only if the DDN itself fails. 20.5.4 Setting maxcmds For S2A DDN 8500 array, changing maxcmds to 4 (from the default 2) improved write performance by as much as 30% in a particular case. This only works with SATA-based disks and when only one controller of the pair is actually accessing the shared LUNs. However, this setting comes with a warning: DDN support does not recommend changing this setting from the default. By increasing the value to 5, the same setup experienced some serious problems. The CLI command for the DDN client is provided below (default value is 2): diskmaxcmds=3 For S2A DDN 9500/9550 hardware, you can safely change the default from 6 to 16. Although the maximum value is 32, values higher than 16 are not currently recommended by DDN support. 20.5.5 Further Tuning Tips Here are some tips we have drawn from testing at a large installation: • Use the full device instead of a partition (sda vs. sda1). When using the full device, Lustre writes nicely-aligned 1 MB chunks to disk. Partitioning the disk can destroy this alignment and will noticeably impact performance. • Separate the ext3 OST into two LUNs, a small LUN for the ext3 journal and a big one for the data. • Since Lustre 1.0.4, we s
250. e nodes also have IP interfaces, these four could be used as routers. Note that match-all expressions (for instance, *.*.*.*) effectively mask all other <net-match> entries specified after them. They should be used with caution. Here is a more complicated situation; the route parameter is explained below. We have: • Two TCP subnets • One Elan subnet • One machine set up as a router, with both TCP and Elan interfaces • IP over Elan configured, but only IP will be used to label the nodes. options lnet ip2nets="tcp 198.129.135.* 192.128.88.98; elan 198.128.88.98 198.129.135.3;" routes="tcp 1022@elan # Elan NID of router; elan 198.128.88.98@tcp # TCP NID of router" 30.2.1.2 networks ("tcp") This is an alternative to "ip2nets" which can be used to specify the networks to be instantiated explicitly. The syntax is a simple comma-separated list of <net-spec>s (see above). The default is only used if neither "ip2nets" nor "networks" is specified. 30.2.1.3 routes ("") This is a string that lists networks and the NIDs of routers that forward to them. It has the following syntax (<w> is one or more whitespace characters): <routes> :== <route>{ ; <route> } <route> :== <net>[<w><hopcount>]<w><nid>{<w><nid>} So a node on the network tcp1 that needs to go through a router to get to the Ela
251. e of processing output files to a .csv format and plotting a graph using gnuplot. 18.2.3 ost_survey The ost_survey tool is a shell script that uses lfs setstripe to perform I/O against a single OST. The script writes a file (currently using dd) to each OST in the Lustre file system, and compares read and write speeds. The ost_survey tool is used to detect misbehaving disk subsystems. Note: We have frequently discovered wide performance variations across all LUNs in a cluster. To run the ost_survey script, supply a file size (in KB) and the Lustre mount point. For example, run: $ ./ost-survey.sh 10 /mnt/lustre [Sample output: average read and write speeds, the worst and best OST index for reads and writes, "3 OST devices found", and the read/write speed and time for each of OST indexes 0 to 2. The numeric values are not recoverable from the source.] 18.3 PIOS Test Tool The PIOS test tool is a parallel I/O simulator for Linux and Solaris. PIOS generates I/O on file systems
252. e output with per-OBD statistics. quota -t {-u|-g} <filesystem>: Displays block and inode grace times for user (-u) or group (-g) quotas. quotacheck [-ugf] <filesystem>: Scans the specified file system for disk usage, and creates or updates quota files. Options specify quota for users (-u), groups (-g), and force (-f). quotachown [-i] <filesystem>: Changes the file's owner and group on OSTs of the specified file system. quotaon [-ugf] <filesystem>: Turns on file system quotas. Options specify quota for users (-u), groups (-g), and force (-f). quotaoff [-ugf] <filesystem>: Turns off file system quotas. Options specify quota for users (-u), groups (-g), and force (-f). quotainv [-ug] [-f] <filesystem>: Clears quota files (administrative quota files if used without -f, operational quota files otherwise), all of their quota entries for users (-u) or groups (-g). After running quotainv, you must run quotacheck before using quotas. CAUTION: Use extreme caution when using this command; its results cannot be undone. Do not use this command unless you really know what it does. It is used mainly for internal purposes. setquota {-u|-g} <name> [--block-softlimit <block-softlimit>] [--block-hardlimit <block-hardlimit>] [--inode-softlimit <inode-softlimit>] [--inode-hardlimit <inode-hardlimit>
253. e prints to the log stating how_many expected clients have reconnected. If the recovery is aborted, this log shows how many clients managed to reconnect. When all clients have completed recovery, or if the recovery timeout is reached, the recovery period ends and the OST resumes normal request processing. If some clients fail to replay their requests during the recovery period, this will not stop the recovery from completing. You may have a situation where the OST recovers, but some clients are not able to participate in recovery (e.g. network problems or client failure), so they are evicted and their requests are not replayed. This would result in any operations on the evicted clients failing, including in-progress writes, which would cause cached writes to be lost. This is a normal outcome; the recovery cannot wait indefinitely, or the file system would be hung any time a client failed. The lost transactions are an unfortunate result of the recovery process. 4. The timeout length is determined by the obd_timeout parameter. 5. Until a client receives a confirmation that a given transaction has been written to stable storage, the client holds on to the transaction, in case it needs to be replayed. Note: Lustre 1.8 introduces the version-based recovery (VBR) feature, which enables a failed client to be "skipped", so remaining clients can replay their requests, resul
254. e rxb_npages module parameter (default value is 1). The default conservatively avoids allocation problems due to kernel memory fragmentation. However, increasing this value to 2 is probably not risky. The ptllnd also keeps an additional rxb_nspare buffers (default value is 8) posted to account for full buffers being handled. Assuming a 4K page size with 10000 peers, 1258 buffers can be expected to be posted at startup, increasing to a maximum of 10008 as peers are actually connected. By doubling rxb_npages and halving max_msg_size, this number can be reduced by a factor of 4. ME/MD Queue Length The ptllnd uses a single portal set by the portal module parameter (default value of 9) for both message and bulk buffers. Message buffers are always attached with PTL_INS_AFTER and match anything sent with "message" matchbits. Bulk buffers are always attached with PTL_INS_BEFORE and match only specific matchbits for that particular bulk transfer. This scheme assumes that the majority of ME/MDs posted are for "message" buffers, and that the overhead of searching through the preceding "bulk" buffers is acceptable. Since the number of "bulk" buffers posted at any time is also dependent on the bulk transfer breakpoint set by max_msg_size, this seems like an issue worth measuring at scale. TX Descriptors The ptllnd has a pool of so-called "tx descriptors", which it uses not
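The buffer counts quoted above can be reproduced with a short calculation. The sketch below is a reading of the text, not ptllnd code: it assumes a max_msg_size of 512 bytes, one message slot per peer at startup and 8 message slots per connected peer. Those three values do not appear in this excerpt and were chosen only because they reproduce the figures 1258 and 10008.

```python
def rx_buffers(peers: int, msgs_per_peer: int, page_size: int = 4096,
               rxb_npages: int = 1, max_msg_size: int = 512,
               rxb_nspare: int = 8) -> int:
    """Estimate posted receive buffers, including the spare buffers."""
    # Each buffer is rxb_npages pages and holds whole messages only.
    msgs_per_buffer = (rxb_npages * page_size) // max_msg_size
    needed_msgs = peers * msgs_per_peer
    full_buffers = -(-needed_msgs // msgs_per_buffer)  # ceiling division
    return full_buffers + rxb_nspare


startup = rx_buffers(10000, msgs_per_peer=1)    # 1250 buffers + 8 spares
connected = rx_buffers(10000, msgs_per_peer=8)  # 10000 buffers + 8 spares
tuned = rx_buffers(10000, msgs_per_peer=8, rxb_npages=2, max_msg_size=256)
```

Doubling rxb_npages and halving max_msg_size packs four times as many messages into each buffer, which is where the factor of 4 comes from under these assumptions.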
255. e server: • Updates the versions of all inodes involved in a given operation • Returns the old and new inode versions to the client with the reply. When the recovery mechanism is underway, VBR follows these steps: 1. VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is a gap in transactions due to a missed client. 2. The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure. 3. When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted. VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery. Delayed Recovery With VBR, it is possible to recover clients even after the server's recovery window closes. This is known as "delayed recovery". This feature is useful if clients have become temporarily unavailable during recovery (e.g. because of a network partition). Note: In Lustre 1.8, the delayed recovery feature is available as a preview and is turned off by default. It is designed for use with future versions of Lustre to help with disconnected operations.
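The three replay steps can be modelled in a few lines. This is a toy model for illustration only: the dictionary of inode versions, the transaction tuples and the function name `replay` are all hypothetical and say nothing about how the Lustre server stores versions.

```python
def replay(server_versions: dict, transactions: list) -> str:
    """Replay (inode, old_version, new_version) tuples against the server state.

    Returns "reintegrated" if every transaction matched, otherwise "evicted".
    """
    mismatches = 0
    for inode, old_version, new_version in transactions:
        # Step 1: replay is allowed only if the inode still has the version
        # it had when the transaction was first executed.
        if server_versions.get(inode) == old_version:
            server_versions[inode] = new_version
        else:
            # Step 2: note the failure but keep trying the remaining transactions.
            mismatches += 1
    # Step 3: any version mismatch means the client is evicted.
    return "evicted" if mismatches else "reintegrated"
```

In this model a client whose transactions touch only unchanged inodes is reintegrated even if another client never reconnected.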
256. e their timeouts on this value, but they do not use this value directly. If you experience cases in which, for unknown reasons, the adaptive timeout value is too short and clients time out their RPCs, then you can increase the at_min value to compensate for this. Ideally, users should leave at_min set to its default. at_max: Sets the maximum adaptive timeout (in seconds). In Lustre 1.6.5, the default value is 0. This setting causes adaptive timeouts to be disabled and the old fixed-timeout method (obd_timeout) to be used. The at_max parameter is an upper limit on the service time estimate, and is used as a failsafe in case of rogue/bad/buggy code that would lead to never-ending estimate increases. If at_max is reached, an RPC request is considered "broken" and it should time out. NOTE: It is possible that slow hardware might validly cause the service estimate to increase beyond the default value of at_max. In this case, you should increase at_max to the maximum time you are willing to wait for an RPC completion. at_history: Sets a time period (in seconds) within which adaptive timeouts remember the slowest event that occurred. Default value is 600. Parameter / Description (at_early_margin, at_extra, ldlm_enqueue_min): at_early_margin: Sets how far before the deadline Lustre sends an early reply. Default value is 5†. at_extra: Sets the incremental amount of time that a server asks for with each early reply. The
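One way to read the at_min and at_max descriptions is as a clamp on the service time estimate. The sketch below is an interpretation of that text, not the Lustre implementation; the function name `reported_estimate` is hypothetical, and treating at_max = 0 as "adaptive timeouts disabled" follows the 1.6.5 behaviour described above.

```python
from typing import Optional


def reported_estimate(measured: float, at_min: float, at_max: float) -> Optional[float]:
    """Clamp a measured service time (seconds) between at_min and at_max.

    Returns None when at_max is 0, meaning the fixed obd_timeout is used instead.
    """
    if at_max == 0:
        return None
    return max(at_min, min(measured, at_max))
```

Raising at_min lifts estimates that are too short, while at_max caps a runaway estimate.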
257. e this issue, increase the RX ring buffer size (default is 256). Use either: • /sbin/ethtool -G ethX rx 4096 • e1000 module option RxDescriptors=4096 20.5 DDN Tuning This section provides guidelines to configure DDN storage arrays for use with Lustre. For more complete information on DDN tuning, refer to the performance management section of the DDN manual of your product, available at http://www.ddnsupport.com/manuals.html. This section covers the following DDN arrays: • S2A 8500 • S2A 9500 • S2A 9550 20.5.1 Setting Readahead and MF For the S2A DDN 8500 storage array, we recommend that you disable the readahead. In a 1000-client system, if each client has up to 8 read RPCs in flight, then this is 8 * 1000 * 1 MB = 8 GB of reads in flight. With a DDN cache in the range of 2 to 5 GB (depending on the model), it is unlikely that the LUN-based readahead would have ANY cache hits even if the file data were contiguous on disk (generally, file data is not contiguous). The Multiplication Factor (MF) also influences the readahead; you should disable it. CLI commands for the DDN are: cache prefetch=0 cache MF=off For the S2A 9500 and S2A 9550 DDN storage arrays, we recommend that you use the above commands to disable readahead. 20.5.2 Setting Segment Size The cache segment size noticeably affects I/O performance. Set the cache segment
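The readahead argument above is simple arithmetic, shown here as a worked example. The function names are hypothetical, 1 GB is treated as 1000 MB to match the rounding in the text, and the "fraction" is only a rough indicator of how much of the outstanding read data could fit in the array cache at once.

```python
def reads_in_flight_mb(clients: int, rpcs_per_client: int, rpc_size_mb: int = 1) -> int:
    """Total read data in flight across all clients, in MB."""
    return clients * rpcs_per_client * rpc_size_mb


def cache_fraction(cache_mb: float, in_flight_mb: float) -> float:
    """Share of the in-flight reads that could be held in the array cache."""
    return min(1.0, cache_mb / in_flight_mb)


in_flight = reads_in_flight_mb(clients=1000, rpcs_per_client=8)  # 8000 MB, about 8 GB
small_cache = cache_fraction(2000, in_flight)  # 2 GB cache
large_cache = cache_fraction(5000, in_flight)  # 5 GB cache
```

Even the larger cache covers well under the full set of outstanding reads, which is consistent with the recommendation to disable readahead.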
258. eT/Full 100baseT/Half 100baseT/Full Advertised auto-negotiation: Yes Speed: 100Mb/s Duplex: Full Port: MII PHYAD: 32 Transceiver: internal Auto-negotiation: on Supports Wake-on: pumbg Wake-on: d Current message level: 0x00000007 (7) Link detected: yes To quickly check whether your kernel supports bonding, run: # grep ifenslave /sbin/ifup # which ifenslave /sbin/ifenslave Note: Bonding and ethtool have been available since 2000. All Lustre-supported kernels include this functionality. 12.3 Using Lustre with Multiple NICs versus Bonding NICs Lustre can use multiple NICs without bonding. There is a difference in performance when Lustre uses multiple NICs versus when it uses bonding NICs. Whether an aggregated link actually yields a performance improvement proportional to the number of links provided depends on network traffic patterns and the algorithm used by the devices to distribute frames among aggregated links. Performance with bonding depends on: • Out-of-order packet delivery. This can trigger TCP congestion control. To avoid this, some bonding drivers restrict a single TCP conversation to a single adapter within the bonded group. • Load balancing between devices in the bonded group. Consider a scenario with a two-CPU node with two NICs. If the NICs are bonded, Lustre establishes a single bundle of sockets to each peer. Since ksocklnd binds sockets to CPUs, only one
259. ecovery cannot complete successfully. Information on the Socket LND (socklnd) protocol Lustre layers the socket LND (socklnd) protocol above TCP/IP. The first message sent on the TCP/IP bytestream is HELLO, which is used to negotiate connection attributes. The protocol version is determined by looking at the first 4+4 bytes of the hello message, which contain a magic number and the protocol version. In KSOCK_PROTO_V1, the hello message is an lnet_hdr_t of type LNET_MSG_HELLO, with the dest_nid (Destination Server Machine) replaced by lnet_magicversion_t. This is followed by 'payload_length' bytes of IP addresses (each 4 bytes) which list the interfaces that the sending socklnd owns. The whole message is sent in little-endian (LE) byte order. There is no socklnd-level V1 protocol after the initial HELLO, meaning everything that follows is unencapsulated LNET messages. In KSOCK_PROTO_V2, the hello message is a ksock_hello_msg_t. The whole message is sent in the byte order of the sender, and the bytesex of 'kshm_magic' is used on arrival to determine if the receiver needs to flip. From then on, every message is a ksock_msg_t, also sent in the byte order of the sender. This either encapsulates an LNET message (ksm_type == KSOCK_MSG_LNET) or is a NOOP. Every message includes zero-copy request and ACK cookies, so that a zero-copy sender can determine when the source buffer can be released without resorting to
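The V2 idea of sending in the sender's byte order and letting the receiver decide whether to flip can be sketched as follows. This is a simplified illustration, not the socklnd wire format: the 8-byte layout (a 4-byte magic followed by a 4-byte version), the constant `HELLO_MAGIC` and both function names are assumptions, and the magic value is deliberately made up.

```python
import struct

HELLO_MAGIC = 0x0BADCAFE  # hypothetical value, NOT the real socklnd magic


def pack_hello(version: int, byteorder: str) -> bytes:
    """Build the first 4+4 bytes in the sender's byte order ('<' or '>')."""
    return struct.pack(byteorder + "II", HELLO_MAGIC, version)


def read_hello(data: bytes, local: str = "<"):
    """Return (version, needs_flip) by testing the magic in both byte orders."""
    other = ">" if local == "<" else "<"
    magic, version = struct.unpack(local + "II", data[:8])
    if magic == HELLO_MAGIC:
        return version, False  # sender uses the same byte order as we do
    magic, version = struct.unpack(other + "II", data[:8])
    if magic == HELLO_MAGIC:
        return version, True   # sender uses the opposite order: flip every field
    raise ValueError("not a hello message")
```

The scheme works only if the magic number does not read the same when its bytes are reversed, which holds for the made-up constant used here.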
260. ed 5 0 bad blocks 1 large file 329 regular files 39 directories 0 character device files 0 block device files 0 fifos 0 links 0 symbolic links (0 fast symbolic links) 0 sockets 368 files Make the mdsdb and all of the ostdb files available on a mounted client so lfsck can be run to examine the file system and, optionally, correct defects that it finds: lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db,ost2db,...} /lustre/mount/point Example: lfsck -n -v --mdsdb /home/mdsdb --ostdb /home/ostdb /mnt/lustre/client/ MDSDB: /home/mdsdb OSTDB[0]: /home/ostdb MOUNTPOINT: /mnt/lustre/client/ MDS: max_id 288 OST: max_id 321 lfsck: ost_idx 0: pass1: check for duplicate objects lfsck: ost_idx 0: pass1 OK (287 files total) lfsck: ost_idx 0: pass2: check for missing inode objects lfsck: ost_idx 0: pass2 OK (287 objects) lfsck: ost_idx 0: pass3: check for orphan objects [0] uuid lustre-OST0000_UUID [0] last_id 288 [0] zero-length orphan objid 1 lfsck: ost_idx 0: pass3 OK (321 files total) lfsck: pass4: check for duplicate object references lfsck: pass4 OK (no duplicates) lfsck: fixed 0 errors By default, lfsck does not repair any inconsistencies it finds; it only reports errors. It checks for three kinds of inconsistencies: • Inode exists but has missing objects (dangling inode). Normally, this happens if there was a problem with an OST. • Inode is mis
261. ed to be available. Availability is accomplished by providing replicated hardware and/or software, so failure of the system will be covered by a paired system. The concept of "failover" is the method of switching an application and its resources to a standby server when the primary system fails or is unavailable. Failover should be automatic and, in most cases, completely application-transparent. In Lustre, failover means that a client that tries to do I/O to a failed OST continues to try (forever) until it gets an answer. A userspace application sees nothing strange, other than that I/O takes (potentially) a very long time to complete. Lustre failover requires two nodes (a failover pair), which must be connected to a shared storage device. Lustre supports failover for both metadata and object storage servers. Failover is achieved most simply by powering off the node in failure (to be absolutely sure of no multi-mounts of the MDT) and mounting the MDT on the partner. When the primary comes back, it MUST NOT mount the MDT while the secondary has it mounted. The secondary can then unmount the MDT and the primary can mount it. The Lustre file system only supports failover at the server level. Lustre does not provide the tool set for system-level components that is needed for a complete failover solution (node failure detection, power control, and so on). Lustre failover is dependent on either a primary or backup OST to recover the file system. You need to se
262. ed with k, m or g (in KB, MB or GB, respectively). --count stripe-cnt: Number of OSTs over which to stripe a file. A stripe count of 0 uses the file system-wide default stripe count (default is 1). A stripe count of -1 stripes over all available OSTs, and normally results in a file with 80 stripes. --offset start-ost: The OST index (base 10, starting at 0) on which to start striping for this file. A start-ost value of -1 allows the MDS to choose the starting index. We strongly recommend setting -1 as this value, as this allows space and load balancing to be done by the MDS as needed. --pool <pool>: Name of the pre-defined pool of OSTs (see lctl) that will be used for striping. The stripe-cnt, stripe-size and start-ost values are used as well. The start-ost value must be part of the pool or an error is returned. setstripe -d <dirname>: Deletes default striping on the specified directory. poollist <filesystem>[.<pool>] | <pathname>: Lists pools in the file system or pathname, or OSTs in the file system's pool. quota [-v] [-o obd_uuid] [-u|-g <username|groupname>] <filesystem>: Displays disk usage and limits, either for the full file system or for objects on a specific OBD. A user or group name can be specified. If both user and group are omitted, quotas for the current UID/GID are shown. The -v option provides more verbos
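The special stripe count values can be summarised in a small helper. This is a sketch of the rules as described, assuming 0 selects the file system default and -1 selects all available OSTs; the function name is hypothetical, and capping the result at the number of available OSTs is an added assumption rather than documented behaviour.

```python
def effective_stripe_count(requested: int, fs_default: int, available_osts: int) -> int:
    """Resolve a requested stripe count into the number of OSTs actually used."""
    if requested == 0:
        count = fs_default       # 0: use the file system-wide default
    elif requested == -1:
        count = available_osts   # -1: stripe over every available OST
    elif requested > 0:
        count = requested
    else:
        raise ValueError("stripe count must be -1, 0 or a positive integer")
    # Assumption: a file cannot use more OSTs than are available.
    return min(count, available_osts)
```

The same -1 convention applies to the starting OST index, where it leaves the choice to the MDS.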
263. em. However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system. This includes subdirectories of the entire file system, filesets for a single user, files incremented by date, and so on. 15.1.2 Device-level Backups Perform a full device-level backup of the MDS or OSTs before replacing hardware, performing maintenance, etc. A device-level backup of the MDS is especially important because, if it fails permanently, the entire file system would need to be restored. In case of hardware replacement, if the spare storage device is available, then it is possible to take a raw copy of the MDS or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run: dd if=/dev/{original} of=/dev/{new} bs=1M If there are problems while reading the data on the original device due to hardware errors, then run the following command to read the data and skip sections with errors: dd if=/dev/{original} of=/dev/{new} bs=4k conv=sync,noerror In spite of hardware errors, the ext3 file system is very robust and it may be possible to recover the file system data after running e2fsck on the new device. 15.1.3 Backing Up the MDS This procedure provides another way to back up the MDS. To summarize the steps involved in the backup, run: mkdir -p /mnt/mds mount -t ldiskfs
264. ements of different Lustre environments. Complete file system: The file system is shut down and all servers and clients are downgraded at once. See Performing a Complete File System Downgrade. Individual servers/clients: Individual servers and clients are downgraded one at a time and restarted (a rolling downgrade), so the file system never goes down. See Performing a Rolling Downgrade. 13.5.1 Performing a Complete File System Downgrade. This procedure describes a complete file system downgrade, in which 1.6.x Lustre packages are installed on multiple 1.8.x servers and clients, requiring a file system shutdown. If you want to downgrade one Lustre component at a time and avoid the shutdown, see Performing a Rolling Downgrade. Tip: In a Lustre downgrade, the package install and file system unmount steps are reversible; you can do either step first. To minimize downtime, this procedure first performs the 1.6.x package installation and then unmounts the file system. 1. Make a complete, restorable file system backup before downgrading Lustre. 2. Downgrade the utilities/userspace packages using the --oldpackage option. For example: rpm -Uvh --oldpackage lustre-<ver>. 3. Make sure that 1.6.x packages are installed on the Lustre servers and/or clients. The 1.6.x packages should be on all of the nodes because of the earlier upgrade to 1.8.x, unless they were removed aft
265. en complicated ones, to be implemented with a single RPC from the client to the server. (Terms on this page: IOV, Kerberos, LBUG, LDLM, lfs, lfsck, liblustre, Llite, Llog, Llog Catalog, LMV.) IOV: I/O vector. A buffer destined for transport across the network which contains a collection (a.k.a. a vector) of blocks with data. Kerberos: An authentication mechanism, optionally available in an upcoming Lustre version as a GSS backend. LBUG: A bug that Lustre writes into a log, indicating a serious system failure. LDLM: Lustre Distributed Lock Manager. lfs: The Lustre File System configuration tool for end users to set/check file striping, etc. See lfs. lfsck: Lustre File System Check. A distributed version of a disk file system checker. Normally, lfsck does not need to be run, except when file systems are damaged through multiple disk failures and other means that cannot be recovered using file system journal recovery. liblustre: Lustre library. A user-mode Lustre client linked into a user program for Lustre fs access. liblustre clients cache no data, do not need to give back locks on time, and can recover safely from an eviction. They should not participate in recovery. Llite: Lustre lite. This term is in use inside the code and module names to indicate that code elements are related to the Lustre file system. Llog: Lustre log. A log of entries used internally by Lustre. An llog is suitable for rapid transactional appends of records and cheap cancellation of records
266. en met. For more information on these prerequisites, see Preparing to Install Lustre. 2. Get the Lustre source code. Navigate to the Lustre download site, select the Lustre version you want, and select Source as the platform. The files required to install Lustre from source code (unpatched kernels, Lustre source and e2fsprogs) are listed. 3. Download the Lustre source code (lustre-<ver>.tar.gz). 4. Download the unpatched kernel you want to use. If you do not know the kernel's filename, check the which_patch file. a. In the Lustre source file, navigate to the which_patch file (lustre/kernel_patches/which_patch) and get the filename of the kernel you want to use. The which_patch file lists the kernels supported in this release. b. Download the selected kernel from the same location where you downloaded the Lustre source (in Step 2). 5. To save time later, download the e2fsprogs tarball (e2fsprogs-<ver>.tar.gz). Patch the Kernel. This procedure describes how to use Quilt to apply the Lustre patches to the kernel. To illustrate the steps in this procedure, a RHEL 5 kernel is patched for Lustre 1.6.5.1. 1. Unpack the Lustre source and kernel to separate source trees. Lustre source and the unpatched kernel were previously downloaded in Get the Lustre Source and Unpatched Kernel. a. Unpack the Lustre source. For this procedure, we assume that the resulting source tree is in /tmp/lust
267. enabled: dumpe2fs -h <device> | grep features. Example: dumpe2fs -h /dev/sdc | grep features, which prints: Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent mmp sparse_super large_file uninit_bg. To manually disable MMP: tune2fs -O ^mmp <device>. To manually enable MMP: tune2fs -O mmp <device>. If ldiskfs detects that a file system is being mounted multiple times, it reports the time when the MMP block was last updated, the node name and the device name. 8.7 Setting Up Failover with Heartbeat V2. This section describes how to set up failover with Heartbeat V2. 8.7.1 Installing the Software. 1. Install Lustre (see Installing Lustre from RPMs). 2. Install the RPMs required for configuring Heartbeat. The following packages are needed for Heartbeat V2. We used the 2.0.4 version of Heartbeat. Heartbeat packages, in order: heartbeat-stonith -> heartbeat-stonith-2.0.4-1.i586.rpm; heartbeat-pils -> heartbeat-pils-2.0.4-1.i586.rpm; heartbeat itself -> heartbeat-2.0.4-1.i586.rpm. You can find all the RPMs at the following location: http://linux-ha.org/download/index.html#2.0.4. 3. Satisfy the installation prerequisites. To install Heartbeat 2.0.4-1, you require: Python; openssl; libnet -> libnet-1.1.2.1-19.i586.rpm; libpopt -> popt-1.7-274.i586.rpm; librpm -> rpm-4.1.1-222.i586.rpm; libltdl -> libtool-ltdl-1.5.16.multilib2-3.i386.r
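The MMP check above amounts to looking for the mmp token in the features line that dumpe2fs prints. A minimal sketch, assuming the features line has already been captured as text; the function name mmp_state is hypothetical and is not a Lustre or e2fsprogs command.

```shell
# Hypothetical helper: reports whether "mmp" appears as a whole word
# in a "Filesystem features:" line taken from dumpe2fs -h output.
mmp_state() {
    case " $1 " in
        *" mmp "*) echo "enabled" ;;
        *)         echo "disabled" ;;
    esac
}
```

On a real device the line could be captured with dumpe2fs -h <device> | grep features and passed to mmp_state as a single quoted argument.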
268. ends. Subtracting tx from max yields the number of sends currently active. A large or increasing number of active sends may indicate a problem. # cat /proc/sys/lnet/nis prints, for example: nid refs peer max tx min; 0@lo 2 0 0 0 0; 10.67.73.173@tcp 4 8 256 256 253. 22.1.5 Free Space Distribution. Free-space stripe weighting, as set, gives a priority of 0 to free space (versus trying to place the stripes widely, nicely distributed across OSSs and OSTs to maximize network balancing). To adjust this priority (as a percentage), use the qos_prio_free proc tunable: cat /proc/fs/lustre/lov/<fsname>-mdtlov/qos_prio_free. Currently, the default is 90%. You can permanently set this value by running this command on the MGS: lctl conf_param <fsname>-MDT0000.lov.qos_prio_free=90. Setting the priority to 100% means that OSS distribution does not count in the weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as much free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to be used. Also note that free-space stripe weighting does not activate until two OSTs are imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is used. (The new round-robin order also maximizes network balancing.) 22.1.5.1 Managing Stripe Allocation. The MDS uses two methods to manage stripe allocation and determine which OSTs to use for file o
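The active-sends calculation described above (max minus tx, per NID) can be sketched with awk. This is an illustration only: it assumes the six-column layout shown in the example (nid refs peer max tx min) with a header line starting with "nid", and the function name active_sends is hypothetical.

```shell
# Hypothetical helper: reads /proc/sys/lnet/nis-style text on stdin and
# prints "<nid> <active sends>" for each NID, where active = max - tx.
active_sends() {
    # Skip the header row; column 4 is max, column 5 is tx.
    awk '$1 != "nid" && NF >= 6 { print $1, $4 - $5 }'
}
```

On a live node this could be run as active_sends < /proc/sys/lnet/nis. A value that stays large or keeps growing is the condition the text warns about.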
269. ent to run the userland self-test client. lstclient should be executed after creating a session on the console. There are only two mandatory options for lstclient: --sesid NID (the first console's NID) and --group NAME (the test group to join). Console: lst new_session testsession. Client1: lstclient --sesid 192.168.1.52@tcp --group clients. lstclient also has an optional parameter, --server_mode, that forces LNET to behave as a server (start the acceptor if the underlying NID needs it, use privileged ports, etc.). For example: Client1: lstclient --sesid 192.168.1.52@tcp --group clients --server_mode. Note: Only the super user is allowed to use the --server_mode option. 18.4.2.3 Batch and Test. This section lists lst batch and test commands. add_batch NAME: The default batch, named batch, is created when the session is started. However, the user can specify a batch name by using add_batch: lst add_batch bulkperf. add_test --batch BATCH [--loop #] [--concurrency #] [--distribute #:#] --from GROUP --to GROUP TEST: Adds a test to a batch. For now, TEST can be brw and ping. --loop: Loop count of the test. --concurrency: Concurrency of the test. --from GROUP: The source group (test client). --to GROUP: The target group (test server). --distribute: The distribution of nodes in clients and servers. The first number of distribute is a subset of client
270. entation/md.txt. CHAPTER 11 Kerberos. This chapter describes how to use Kerberos with Lustre and includes the following sections: What is Kerberos?; Lustre Setup with Kerberos. 11.1 What is Kerberos? Kerberos is a mechanism for authenticating all entities (such as users and services) on an unsafe network. Users and services, known as principals, share a secret password (or key) with the Kerberos server. This key enables principals to verify that messages from the Kerberos server are authentic. By trusting the Kerberos server, users and services can authenticate one another. Caution: Kerberos is a future Lustre feature that is not available in current versions. If you want to test Kerberos with a pre-release version of Lustre, check out the Lustre source from the CVS repository and build it. For more information on checking out Lustre source code, see CVS. 11.2 Lustre Setup with Kerberos. Setting up Lustre with Kerberos can provide advanced security protections for the Lustre network. Broadly, Kerberos offers three types of benefit: it allows Lustre connection peers (MDS, OSS and clients) to authenticate one another; it protects the integrity of the PTLRPC message from being modified during network transfer; it protects the privacy of the PTLRPC message from being eavesdropped during network transfe
271. ented as integers Examples of portals are the portals on which certain groups of object metadata configuration and locking requests and replies are received An RPC protocol layered on LNET This protocol deals with stateful servers and has exactly once semantics and built in support for recovery The concept of re executing a server request after the server lost information in its memory caches and shut down The replay requests are retained by clients until the server s have confirmed that the data is persistent on disk Only requests for which a client has received a reply are replayed A request that has seen no reply can be re sent after a server reboot An RPC made by an OST or MDT to another system usually a client to revoke a granted lock The concept that server state is in a crash lost because it was cached in memory and not yet persistent on disk A mechanism whereby the identity of a root user on a client system is mapped to a different identity on the server to avoid root users on clients gaining broad permissions on servers Typically for management purposes at least one client system should not be subject to root squash LNET routing between different networks and LNDs Remote Procedure Call A network encoding of a request Glossary 8 Lustre 1 8 Operations Manual October 2009 S Storage Object API Storage Objects Stride Stride size Stripe count Striping metadata T T10 object protocol
272. er into a single usable file. -h: Prints a brief help message. --mdsdb mds_database_file: MDS database file created by running e2fsck --mdsdb mds_database_file device on the MDS backing device. --ostdb ost1_database_file [ost2_database_file ...]: OST database files created by running e2fsck --ostdb ost_database_file device on each OST backing device. Description: If an MDS or an OST becomes corrupt, you can run a distributed check on the file system to determine what sort of problems exist. 1. Run e2fsck -f on the individual MDS/OST that had problems to fix any local file system damage. It is a very good idea to run this e2fsck under script, so you have a log of whatever changes it made to the file system (in case this is needed later). After this is complete, you can bring the file system up, if necessary, to reduce the outage window. 2. Run a full e2fsck of the MDS to create a database for lfsck. The -n option is critical for a mounted file system; otherwise, you might corrupt your file system. The mdsdb file can grow fairly large, depending on the number of files in the file system (10 GB or more for millions of files, though the actual file size is larger because the file is sparse). It is fastest if this is written to a local file system because of the seeking and small writes. Depending on the number of files, this step can take several hours to complete. In the following exa
273. er the upgrade. If it is necessary to install 1.6.x kernel modules or ldiskfs packages, use the rpm -ivh command. For example: rpm -ivh kernel-lustre-smp-<ver> kernel-ib-<ver> lustre-modules-<ver> lustre-ldiskfs-<ver>. For help determining where to install a specific package, see TABLE 3-1 (Lustre packages, descriptions and installation guidance). Note: You do not need to downgrade or take any action with e2fsprogs. 4. Shut down the file system. Shut down the components in this order: clients, then the MDT, then OSTs. Unmounting a block device causes Lustre to be shut down on that node. a. Unmount the clients. On each client node, run: umount <mount point>. b. Unmount the MDT. On the MDS node, run: umount <mount point>. c. Unmount the OSTs (be sure to unmount all OSTs). On each OSS node, run: umount <mount point>. 5. Unload the old Lustre modules by either rebooting the node, or removing the Lustre modules manually: run lustre_rmmod several times and use lsmod to check the currently loaded modules. 6. Start the downgraded file system. Start the components in this order: OSTs, then the MDT, then clients. a. Mount the OSTs (be sure to mount all OSTs). On each OSS node, run: mount -t lustre <block device name> <mount point>. b. Mount the MDT. On the MDS node, run: mount -t lustre <
274. erations Manual October 2009 19 3 2 19 3 3 Working with VBR In Lustre 1 8 the VBR feature is built into the Lustre recovery functionality It cannot be disabled Delayed recovery can be enabled with the enable delayed recovery option configure enable delayed recovery During reboot a list of new messages is displayed CWARN RECOVERY service s d recoverable clients last_transno LPU64 n was updated with number delayed clients CWARN RECOVERY service s d recoverable clients d delayed clients last_transno LPU64 n Note There should be no delayed clients until delayed recovery is enabled These are some VBR messages that may be displayed DEBUG_REQ D_WARNING req Version mismatch during replay n This message indicates why the client was evicted No action is needed CWARN s version recovery fails reconnecting n This message indicates why the recovery failed No action is needed These are some VBR messages that may be displayed if delayed recovery is enabled CWARN RECOVERY service s d recoverable clients d delayed clients last_transno LPU64 n This controls the number of delayed clients There should be 0 delayed clients without delayed recovery enabled CWARN s NID s s export was already marked as delayed and will wait for end of recovery n The old client is trying to reconnect but it will wait for end of the server
275. erformance of the device and bypasses the kernel block device layers, buffer cache and file system. The subsequent tests survey progressively higher layers of the Lustre stack. Typically, with these tests, Lustre should deliver 85-90% of the raw device performance. It is very important to establish performance from the bottom-up perspective. First, the performance of a single raw device should be verified. Once this is complete, verify that performance is stable within a larger number of devices. Frequently, while troubleshooting such performance issues, we find that array performance with all LUNs loaded does not always match the performance of a single LUN when tested in isolation. After the raw performance has been established, other software layers can be added and tested in an incremental manner. 18.1.1 Downloading an I/O Kit. You can download the I/O kit from http://downloads.clusterfs.com/public/tools/lustre-iokit/. In this directory, you will find two packages: lustre-iokit consists of a set of tools developed and supported by the Lustre group; scali-lustre-iokit is a Python tool maintained by the Scali team, and is not discussed in this manual. 18.1.2 Prerequisites to Using an I/O Kit. The following prerequisites must be met to use the Lustre I/O kit: password-free remote access to nodes in the system (normally obtained via ssh or rsh); Lustre file system software; sg3_utils for the sgp_dd utility. 18
276. erneath that directory Extended Attribute small amount of data which can be retrieved through a name associated with a particular inode Lustre uses EAa to store striping information location of file data on OSTs Examples of extended attributes are ACLs striping information and crypto keys The process of eliminating server state for a client that is not returning to the cluster after a timeout or if server failures have occurred The state held by a server for a client that is sufficient to transparently recover all in flight operations when a single failure occurs A lock used by the OSC to protect an extent in a storage object for concurrent control of read write file size acquisition and truncation operations The failover process in which the default active server regains control over the service An OST which is not expected to recover if it fails to answer client requests A failout OST can be administratively failed thereby enabling clients to return errors when accessing data on the failed OST without making additional network requests Glossary 2 Lustre 1 8 Operations Manual October 2009 Failover FID Fileset FLDB Flight Group G Glimpse callback Group Lock Group upcall I Import Intent Lock The process by which a standby computer server system takes over for an active computer server after a failure of the active node Typically the standby computer server gains exclusive access to
277. ers with mkfs.lustre. Options: --backfstype=fstype: Forces a particular format for the backing file system (such as ext3, ldiskfs). --comment=comment: Sets a user comment about this disk, ignored by Lustre. --device-size=KB: Sets the device size for loop and non-loop devices. --dryrun: Only prints what would be done; it does not affect the disk. --failnode=nid,...: Sets the NID(s) of a failover partner. This option can be repeated as needed. --fsname=filesystem_name: The Lustre file system of which this service/node will be a part. The default file system name is lustre. NOTE: The file system name is limited to 8 characters. --index=index: Forces a particular OST or MDT index. --mkfsoptions=opts: Formats options for the backing file system. For example, ext3 options could be set here. --mountfsoptions=opts: Sets permanent mount options. This is equivalent to the setting in /etc/fstab. --mgsnode=nid,...: Sets the NIDs of the MGS node, required for all targets other than the MGS. --param key=value: Sets the permanent parameter key to value. This option can be repeated as desired. Typical options might include: --param sys.timeout=40 (system obd timeout); --param lov.stripesize=2M (default stripe size); --param lov.stripecount=2 (default stripe count); --param failover.mode=failout (returns errors instead of waiting for recove
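Because the file system name is limited to 8 characters, it can be worth checking a candidate name before passing it to --fsname. A minimal sketch: the function name check_fsname is hypothetical, and rejecting an empty name is an added assumption rather than something the option table states.

```shell
# Hypothetical pre-flight check for a --fsname value.
# Prints "ok" if the name is 1 to 8 characters long, otherwise "invalid".
check_fsname() {
    name=$1
    if [ -z "$name" ] || [ "${#name}" -gt 8 ]; then
        echo "invalid"
    else
        echo "ok"
    fi
}
```

For example, check_fsname lustre prints ok, while a nine-character name prints invalid. Combining the check with --dryrun lets the full command line be reviewed without touching the disk.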
278. ersion. [Screenshot: Product Data screen of the Registration client, listing system name, product name and version for the discovered products (Red Hat Enterprise Linux, SuSE Linux Enterprise Server, Lustre Client, Lustre MDS, Lustre MGS and Lustre OSS 1.6.6, and SUN FIRE X4440). On-screen text: To save this information and register these products with Sun Connection later, click the Save As button.] Note: The Registration client requires an X display to run. If the node from which you want to do the registration has no native X display, you can use SSH's X forwarding to display the Registration client interface on your local machine. The registration process includes up to five steps. The first step is to discover the service tags created when you started Lustre. The Registration client looks for Sun products on your local subnet, by default. Alternately, you can specify another subnet, specific hosts, or IP addresses. 5. Select an option to locate service tags and click Next. The Product Data screen displays Sun products (that support service tags) as they are located. For each product, the system name, product name and version (if ap
279. es; Lustre "kmalloc of 'mmm' (NNNN bytes) failed" messages; Lustre or kernel stack traces showing processes stuck in try_to_free_pages. For information on determining the MDS memory and OSS memory requirements, see Memory Requirements. 21.4.21 Number of OSTs Needed for Sustained Throughput. The number of OSTs required for sustained throughput depends on your hardware configuration. If you are adding an OST that is identical to an existing OST, you can use the speed of the existing OST to determine how many more OSTs to add. Keep in mind that adding OSTs affects resource limitations, such as bus bandwidth in the OSS and network bandwidth of the OSS interconnect. You need to understand the performance capability of all system components to develop an overall design that meets your performance goals and scales to future system requirements. Note: For best performance, put the MGS and MDT on separate devices. 21.4.22 Setting SCSI I/O Sizes. Some SCSI drivers default to a maximum I/O size that is too small for good Lustre performance. We have fixed quite a few drivers, but you may still find that some drivers give unsatisfactory performance with Lustre. As the default value is hard-coded, you need to recompile the drivers to change their default. On the other hand, some drivers may have a wrong default set. If you suspect bad I/O performance and an analysis of Lustre statis
280. es the current statahead status. 22.2.7 OSS Read Cache. Lustre 1.8 introduces the OSS read cache feature, which provides read-only caching of data on an OSS. This functionality uses the regular Linux page cache to store the data. Just like caching from a regular file system in Linux, OSS read cache uses as much physical memory as is allocated. OSS read cache improves Lustre performance in these situations: many clients are accessing the same data set (as in HPC applications and when diskless clients boot from Lustre); one client is storing data while another client is reading it (essentially exchanging data via the OST); a client has very limited caching of its own. OSS read cache offers these benefits: allows OSTs to cache read data more frequently; improves repeated reads to match network speeds instead of disk speeds; provides the building blocks for OST write cache (small-write aggregation). 22.2.7.1 Using OSS Read Cache. OSS read cache is implemented on the OSS and does not require any special support on the client side. Since OSS read cache uses the memory available in the Linux page cache, you should use I/O patterns to determine the appropriate amount of memory for the cache; if the data is mostly reads, then more cache is required than for writes. OSS read cache is enabled by default and managed by the following tunables: read_cache_enable controls whether d
281. ess 20 15 tunefs lustre 31 5 Tuning directory statahead 22 20 file readahead 22 19 tuning formatting the MDT and OST 20 5 large scale 20 13 Index 7 LNET tunables 20 4 lockless tunables 20 15 MDS threads 20 3 module options 20 2 network 20 7 root squash 25 4 U upgrade 1 6 x to 1 8 x 13 3 1 8 x to next minor version 13 8 complete file system 13 4 rolling 13 6 using Lustre SNMP module 14 3 usockind using 2 7 utilities third party build tool compiler 3 3 e2fsprogs 3 3 Perl 3 3 V VBR delayed recovery 19 6 VBR introduction 19 5 VBR tips 19 7 VBR working with 19 7 Version based recovery VBR 19 5 VIB LND 30 12 Voltaire InfiniBand vib 2 2 Ww weighted allocator 24 11 weighting adjusting between free space and location 24 12 writeconf 4 24 Index 8 Lustre 1 8 Operations Manual October 2009
282. ess to use as the ping target. Typically, this is a firewall route or another very reliable network endpoint external to the cluster. In Lustre, a disk failure is an unrecoverable error. For this reason, you must have reliable back-end storage with RAID. Note: If a disk fails, requiring you to change the disk or resync the RAID, you can deactivate the affected OST using lctl on the clients and MDT. This allows access functions to complete without errors; files on the affected OST will be of 0 length, but you can save the rest of your files. Setting Up Failover with Heartbeat V1. This section describes how to set up failover with Heartbeat V1. Installing the Software. 1. Install Lustre (see Installing Lustre from RPMs). 2. Install the RPMs that are required to configure Heartbeat. The following packages are needed for Heartbeat V1. We used the 1.2.3-1 version. Red Hat supplies v1.2.3-2. Heartbeat is available as an RPM or source. These are the Heartbeat packages, in order: heartbeat-stonith -> heartbeat-stonith-1.2.3-1.i586.rpm; heartbeat-pils -> heartbeat-pils-1.2.3-1.i586.rpm; heartbeat itself -> heartbeat-1.2.3-1.i586.rpm. You can find the above RPMs at http://linux-ha.org/download/index.html#1.2.3. 3. Satisfy the installation prerequisites. Heartbeat 1.2.3 installation requires the following: python; openssl; libnet -> libnet-1.1.2.1-19.i58
283. ess you are a network expert. enable_irq_affinity: By default, this parameter is OFF. In the normal case on an SMP system, we would like network traffic to remain local to a single CPU. This helps to keep the processor cache warm and minimizes the impact of context switches. This is especially helpful when an SMP system has more than one network interface, and ideal when the number of interfaces equals the number of CPUs. If you have an SMP platform with a single fast interface, such as 10 GB Ethernet, and more than 2 CPUs, you may see improved performance by turning this parameter to OFF. You should, as always, test to compare the performance impact. 20.3 Options to Format MDT and OST File Systems. The backing file systems on an MDT and OSTs are independent of one another, so the formatting parameters for them should not be the same. Sizing the MDT depends solely on how many inodes you want in the entire Lustre file system. This is not related to the size of the aggregate OST space. 20.3.1 Planning for Inodes. Each time you create a file on a Lustre file system, it consumes one inode on the MDT and one inode for each OST object that the file is striped over. Normally, it is based on the stripe count used for files, either from the file system-wide default set with --param lov.stripecount=N at mkfs.lustre time, or from the per-file or per-directory stripe count set with lfs cou
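The inode accounting above (one MDT inode per file, plus one OST object per stripe) can be worked through with simple arithmetic. A minimal sketch, assuming every file uses the same stripe count; the function name inode_estimate is hypothetical.

```shell
# Hypothetical planning helper.
# $1 = expected number of files, $2 = stripe count used for each file.
# Prints the MDT inodes and the total OST objects those files consume.
inode_estimate() {
    files=$1
    stripes=$2
    echo "mdt_inodes=$files ost_objects=$((files * stripes))"
}
```

For example, one million files striped over 2 OSTs consume one million MDT inodes and two million OST objects in total, spread across the OSTs.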
284. et ip2nets="o2ib0(ib0) 192.168.10.[103-253/2]". Client with the even IP address: options lnet ip2nets="o2ib1(ib0) 192.168.10.[102-254/2]". 7.3.2 Start servers. To start the MGS and MDT server, run: modprobe lnet. To start MGS and MDT, run: mkfs.lustre --fsname lustre --mdt --mgs /dev/sda; mkdir -p /mnt/test/mdt; mount -t lustre /dev/sda /mnt/test/mdt; mount -t lustre mgs@o2ib0:/lustre /mnt/mdt. To start the OSS, run: mkfs.lustre --fsname lustre --ost --mgsnode=mds@o2ib0 /dev/sda; mkdir -p /mnt/test/ost; mount -t lustre /dev/sda /mnt/test/ost; mount -t lustre mgs@o2ib0:/lustre /mnt/ost. 7.3.3 Start clients. For the IB client, run: mount -t lustre 192.168.10.101@o2ib0,192.168.10.102@o2ib1:/mds/client /mnt/lustre. 7.4 Multi-Rail Configurations with LNET. To aggregate bandwidth across both rails of a dual-rail IB cluster (o2iblnd) using LNET, consider these points: LNET can work with multiple rails; however, it does not load balance across them. The actual rail used for any communication is determined by the peer NID. Multi-rail LNET configurations do not provide an additional level of network fault tolerance. The configurations described below are for bandwidth aggregation only. Network interface failover is planned as an upcoming Lustre feature. A Lustre node always uses the same local NID to communicate with a given peer NID. The criter
285. evice causes Lustre to be shut down on that node. a. Unmount the clients. On each client node, run: umount <mount point>. b. Unmount the MDT. On the MDS node, run: umount <mount point>. c. Unmount the OSTs (be sure to unmount all OSTs). On each OSS node, run: umount <mount point>. 4. Unload the old Lustre modules by either rebooting the node, or removing the Lustre modules manually: run lustre_rmmod several times and use lsmod to check the currently loaded modules. 5. Start the upgraded file system. Start the components in this order: OSTs, then the MDT, then clients. a. Mount the OSTs (be sure to mount all OSTs). On each OSS node, run: mount -t lustre <block device name> <mount point>. b. Mount the MDT. On the MDS node, run: mount -t lustre <block device name> <mount point>. c. Mount the file system on the clients. On each client node, run: mount -t lustre <MGS node>:/<fsname> <mount point>. If you have a problem upgrading Lustre, contact us via the Bugzilla bug tracker. 13.3.2 Performing a Rolling Upgrade. This procedure describes a rolling upgrade, in which one Lustre component (server or client) is upgraded and restarted at a time while the file system is running. If you want to upgrade the complete Lustre file system or multiple components at a time, requiring a file system shutdown
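The shutdown and startup orders in this procedure mirror each other, which is easy to get wrong under time pressure. A minimal sketch that only prints the order for a given phase; the function name lustre_sequence is hypothetical and it runs no mount or umount commands itself.

```shell
# Hypothetical reminder helper: prints the component order for a phase.
# "stop"  -> clients first, then the MDT, then the OSTs.
# "start" -> OSTs first, then the MDT, then the clients.
lustre_sequence() {
    case "$1" in
        stop)  echo "clients MDT OSTs" ;;
        start) echo "OSTs MDT clients" ;;
        *)     echo "usage: lustre_sequence stop|start" >&2; return 1 ;;
    esac
}
```

A wrapper script could loop over the printed order and run the matching umount or mount -t lustre command on each group of nodes.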
286. f machines that it will need to communicate with. This affects how many receives it will pre-post, and each receive will use one page of memory. Ideally, on clients, this value will be equal to the total number of Lustre servers (MDS and OSS). On servers, it needs to equal the total number of machines in the storage system. cksum=0: turns on small message checksums. It can be used to aid in troubleshooting. MX also provides an optional checksumming feature which can check all messages (large and small). For details, see the MX README. ntx=256: the number of total sends in flight from this machine. In actuality, MXLND reserves half of them for connect messages, so make this value twice as large as you want for the total number of sends in flight. credits=8: the number of in-flight messages for a specific peer. This is part of the flow-control system in Lustre. Increasing this value may improve performance, but it requires more memory because each message requires at least one page. board=0: the index of the Myricom NIC. Hosts can have multiple Myricom NICs, and this identifies which one MXLND should use. This value must match the board value in your MXLND hosts file for this host. ep_id=3: the MX endpoint ID. Each process that uses MX is required to have at least one MX endpoint to access the MX library and NIC. The ID is a simple index starting at zero (0). This value must match the endpoint ID value in your MXLND hosts file for this h
287. f open files, but practically it depends on the amount of RAM on the MDS. There are no tables for open files on the MDS, as they are only linked in a list to a given client's export. Each client process probably has a limit of several thousand open files, which depends on the ulimit. 32.12 OSS RAM Size. For a single OST, there is no strict rule to size the OSS RAM. However, as a guideline for Lustre 1.8 installations, 2 GB per OST is a reasonable RAM size. For details on determining the memory needed for an OSS node, see OSS Memory Requirements. APPENDIX A Lustre Knowledge Base. The Knowledge Base is a collection of tips and general information regarding Lustre. How can I check if a file system is active (the MGS, MDT and OSTs are all online)? How to reclaim the 5 percent of disk space reserved for root? Why are applications hanging? How do I abort recovery? Why would I want to? What does "denying connection for new client" mean? How do I set a default debug level for clients? How can I improve Lustre metadata performance when using large directories (> 0.5 million files)? File system refuses to mount because of UUID mismatch. How do I set up multiple Lustre file systems on the same node? Is it possible to change the IP address of an OST? MDS? Change the UUID? How do I replace an OST or MDS
288. f such events since the system started.
Unit: Unit of measurement for that statistic (microseconds, requests, buffers).
last: Average rate of these events (in units/event) for the last interval during which they arrived. For instance, in the above-mentioned case of ost_destroy, it took an average of 736 microseconds per destroy for the 400 object destroys in the previous 10 seconds.
min: Minimum rate (in units/events) since the service started.
avg: Average rate.
max: Maximum rate.
stddev: Standard deviation (not measured in all cases).
The events common to all services are:
req_waittime: Amount of time a request waited in the queue before being handled by an available server thread.
req_qdepth: Number of requests waiting to be handled in the queue for this service.
req_active: Number of requests currently being handled.
reqbuf_avail: Number of unsolicited LNET request buffers for this service.
Some service-specific events of interest are:
ldlm_enqueue: Time it takes to enqueue a lock (this includes file open on the MDS).
mds_reint: Time it takes to process an MDS modification record (includes create, mkdir, unlink, rename and setattr).
llobdstat
The llobdstat utility parses obdfilter statistics files located at /proc/fs/lustre/<ost_name>/stats. Use llobdstat to monitor changes in statistics ove
289. f the leading <fsname> and/or ending _UUID are missing, they are automatically added. For example, to add the even-numbered OSTs to pool1 on file system lustre, run a single command (add) to add many OSTs to the pool at one time: lctl pool_add lustre.pool1 OST[0-10/2]
Note: Each time an OST is added to a pool, a new llog configuration record is created. For convenience, you can run a single command.
To remove a named OST from a pool: lctl pool_remove <fsname>.<poolname> <ost_list>
To destroy a pool: lctl pool_destroy <fsname>.<poolname>
Note: All OSTs must be removed from a pool before the pool can be destroyed.
To list pools in the named file system: lctl pool_list <fsname> | <pathname>
To list OSTs in a named pool: lctl pool_list <fsname>.<poolname>
Using the lfs Command with OST Pools
Several lfs commands can be run with OST pools. Use the lfs setstripe command to associate a directory with an OST pool. This causes all new regular files and directories in the directory to be created in the pool. The lfs command can be used to list pools in a file system and OSTs in a named pool.
To associate a directory with a pool, so all new files and directories will be created in the pool: lfs setstripe <filename|dirname> --pool|-p pool_name
To set striping patterns: lfs setstripe size
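The pool_add example in this entry uses an OST range written as first-last/step. The Python sketch below shows how such an expression expands to OST indices. It assumes the bracket syntax reads "first-last/step", as the "even-numbered OSTs" wording suggests. The helper name is hypothetical and this is not how lctl itself parses the argument.

```python
import re


def expand_ost_range(expr):
    """Expand an expression such as 'OST[0-10/2]' into a list of OST
    indices. Assumed syntax: OST[first-last] or OST[first-last/step]."""
    match = re.fullmatch(r"OST\[(\d+)-(\d+)(?:/(\d+))?\]", expr)
    if match is None:
        raise ValueError("unsupported OST range: %r" % expr)
    first = int(match.group(1))
    last = int(match.group(2))
    step = int(match.group(3)) if match.group(3) else 1
    if step < 1 or last < first:
        raise ValueError("invalid bounds or step in %r" % expr)
    # The upper bound is inclusive, hence last + 1.
    return list(range(first, last + 1, step))


if __name__ == "__main__":
    print(expand_ost_range("OST[0-10/2]"))
```

Under that reading, the range in the example covers the six even-numbered OSTs from 0 through 10.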
290. figuration 8 6 memory determining 3 6 MDT OST formatting overriding default formatting options 20 6 planning for inodes 20 5 sizing the MDT 20 5 Mellanox Gold InfiniBand openib 2 2 memory requirements 3 6 mkfs lustre 31 2 MMP using 8 16 mod5 SOCKLND kernel TCP IP LND 30 8 modprobe conf 7 1 7 5 7 6 module parameters 2 5 module parameters routing 2 8 module setup 4 10 mount command 27 22 mount lustre 31 15 multihomed server Lustre complicated configurations 7 1 modprobe conf 7 1 start clients 7 4 start server 7 3 multiple mount protection see MMP 8 16 multiple NICs 12 4 MX LND 30 20 Myrinet 2 2 N network bonding 12 1 Network tuning 20 7 networks supported cib Cisco Topspin 2 2 Cray Seastar 2 2 Elan Quadrics Elan 2 2 GM and MX Myrinet 2 2 iib Infinicon InfiniBand 2 2 o2ib OFED 2 2 openib Mellanox Gold InfiniBand 2 2 ra RapidArray 2 2 TCP 2 2 vib Voltaire InfiniBand 2 2 NIC bonding 12 4 multiple 12 4 NID server changing 4 27 node active active 8 5 active passive 8 5 O o2ib OFED 2 2 obdfilter_survey tool 18 5 OFED 2 2 offset_stats utility 31 21 OpenIB LND 30 14 operating systems supported 3 2 operating tips data migration script simple 26 3 Operational scenarios 4 29 OSS memory determining 3 7 OSS read cache 22 21 OST failover 8 6 failover configuration 8 6 removing and restoring 4 25 OST block I O stream watching
291. ford Kirby Collins Al Slater Scott Rhine Mike Wisner Ken Goss Steve Landherr Brad Smith Mark Kelly Dr Alain CYR Randy Dunlap Mark Montague Dan Million Jean Marc Zucconi Jeff Blomberg Erik Habbinga Kris Strecker Walter Wong Run began Fri Sep 29 15 37 07 2006 Network distribution mode enabled Command line used iozone m test txt Output is in Kbytes sec Time Resolution 0 000001 seconds Processor cache size set to 1024 Kbytes Processor cache line size set to 32 bytes File stride size set to 17 record size random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 512 4 194309 406651 728276 792701 715002 498592 638351 700365 587235 190554 378448 686267 765201 iozone test complete Lustre 1 8 Operations Manual October 2009 CHAPTER 18 Lustre I O Kit This chapter describes the Lustre I O kit and PIOS performance tool and includes the following sections a Lustre I O Kit Description and Prerequisites Running I O Kit Tests m PIOS Test Tool a LNET Self Test 18 1 Lustre I O Kit Description and Prerequisites The Lustre I O kit is a collection of benchmark tools for a Lustre cluster The I O kit can be used to validate the performance of the various hardware and software layers in the cluster and also as a way to find and troubleshoot I O issues The I O kit contains three tests The first surveys basic p
292. found for this request.
found: Number of free chunks mballoc found and measured before the final decision.
grps: Number of groups mballoc scanned to satisfy the request.
cr: Stage at which mballoc found the result:
0 - best in terms of resource allocation. The request was 1MB or larger and was satisfied directly via the kernel buddy allocator.
1 - regular stage (good at resource consumption).
2 - fs is quite fragmented (not that bad at resource consumption).
3 - fs is very fragmented (worst at resource consumption).
queue: Total bytes in active/queued sends.
merge: Whether the request hit the goal. This is good as extents code can now merge new blocks to existing extent, eliminating the need for extents tree growth.
tail: Number of blocks left free after the allocation breaks large free chunks.
broken: How large the broken chunk was.
Most customers are probably interested in found/cr. If cr is 0 or 1 and found is less than 100, then mballoc is doing quite well. Also, number-of-blocks-in-request (third number in the goal triple) can tell the number of blocks requested by the obdfilter. If the obdfilter is doing a lot of small requests (just a few blocks), then either the client is processing input/output to a lot of small files, or something may be wrong with the client (because it is better if the client sends large input/output requests). This can be investigated with the OSC rpc_st
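The rule of thumb in this entry (cr of 0 or 1 and found below 100) can be written as a small check. This is a minimal Python sketch; the function name is hypothetical and the thresholds are taken directly from the text.

```python
def mballoc_looks_healthy(cr, found):
    """Apply the rule of thumb from the entry above.

    cr    -- stage (0 to 3) at which mballoc found its result
    found -- number of free chunks examined before the final decision
    """
    if cr not in (0, 1, 2, 3):
        raise ValueError("cr must be between 0 and 3")
    return cr in (0, 1) and found < 100


if __name__ == "__main__":
    print(mballoc_looks_healthy(cr=1, found=12))
    print(mballoc_looks_healthy(cr=3, found=12))
```

A request resolved at stage 2 or 3, or one that examined 100 or more chunks, would be flagged for a closer look at fragmentation.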
293. from www myri com scs tar xvf mx_1 2 7 tar cd mx 1 2 7 in s common include configure with kernel lib make make install Compile and install the Lustre source code a Install the Lustre source this can be done via RPM or tarball The source file is available at the Lustre download page This example shows installation via the tarball cd usr src gunzip lustre 1 6 6 tar gz tar xvf lustre 1 6 6 tar b Configure and build the Lustre source code The configure help command shows a list of all of the with options All third party network stacks are built in this manner cd lustre 1 6 6 configure with linux usr src linux with mx usr src mx 1 2 7 make make rpms The make rpms command output shows the location of the generated RPMs Use the rpm ivh command to install the RPMS rpm ivh lustre 1 6 6 2 6 18_92 1 10 e15_lustre 1 6 6smp x86_64 rpm rpm ivh lustre modules 1 6 6 2 6 18_92 1 10 e15_lustre 1 6 6smp x86_64 rpm rpm ivh lustre ldiskfs 3 0 6 2 6 18_92 1 10 e15_lustre 1 6 6smp x86_64 rpm Add the following lines to the etc modprobe conf file options kmxlnd hosts etc hosts mxlnd options lnet networks mx0 myri0 tcp0 eth0 Populate the myri0 configuration with the proper IP addresses vim etc sysconfig network scripts myri0 Lustre 1 8 Operations Manual October 2009 7 Add the following line to the etc hosts mxind file IP HOST BOARD EP_I
294. fsprogs-<ver> There may or may not be a new e2fsprogs package with a Lustre upgrade. The e2fsprogs release schedule is independent of Lustre releases.
d. (Optional) If you want to add optional packages to your Lustre system, install them now.
3. Unload the old Lustre modules by either:
- Rebooting the node, OR
- Removing the Lustre modules manually. Run lustre_rmmod several times and use lsmod to check the currently loaded modules.
4. If the upgraded component is a server, fail back services to it.
If you have a problem upgrading Lustre, contact us via the Bugzilla bug tracker.
13.4 Upgrading Lustre 1.8.x to the Next Minor Version
To upgrade Lustre 1.8.x to the next minor version (for example, Lustre 1.8.0.1 > 1.8.x), follow these procedures:
- To upgrade the complete file system or multiple file system components at the same time (requiring a file system shutdown), see Performing a Complete File System Upgrade.
- To upgrade one Lustre component (server or client) at a time, while the file system is running, see Performing a Rolling Upgrade.
13.5 Downgrading from Lustre 1.8.x to 1.6.x
This section describes how to downgrade from Lustre 1.8.x to 1.6.x. Only file systems that were upgraded from 1.6.x can be downgraded to 1.6.x. A file system that was created or reformatted under Lustre 1.8.x cannot be downgraded. Two paths are available to meet the downgrade requir
295. fy a set of clients for which UID/GID re-mapping does not apply.
Configuring Root Squash
Root squash functionality is managed by two configuration parameters, root_squash and nosquash_nids.
- The root_squash parameter specifies the UID and GID with which the root user accesses the Lustre file system.
- The nosquash_nids parameter specifies the set of clients to which root squash does not apply. LNET NID range syntax is used for this parameter (see the NID range syntax rules described in Enabling and Tuning Root Squash). For example: nosquash_nids=172.16.245.[0-255/2]@tcp
In this example, root squash does not apply to TCP clients on subnet 172.16.245.0 that have an even number as the last component of their IP address.
Enabling and Tuning Root Squash
The default value for nosquash_nids is NULL, which means that root squashing applies to all clients. Setting the root squash UID and GID to 0 turns root squash off.
Root squash parameters can be set when the MDT is created (mkfs.lustre --mdt). For example: mkfs.lustre --reformat --fsname=Lustre --mdt --mgs --param "mdt.root_squash=500:501" --param "mdt.nosquash_nids='0@elan1 192.168.1.[10,11]'" /dev/sda1
Root squash parameters can also be changed on an unmounted device with tunefs.lustre. For example: tunefs.lustre --param "mdt.root_squash=65534:65534" --param "mdt.nosquash_nids=192.168.0.13@tcp0" /dev/sda1
Root squash par
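To make the nosquash_nids example concrete, the Python sketch below tests whether a client NID falls inside a range such as 172.16.245.[0-255/2]@tcp. It is a simplified illustration rather than the LNET parser. It handles only IPv4 addresses whose octets are plain numbers or bracketed lists of values and first-last/step ranges. It also compares network names literally, so it does not treat tcp and tcp0 as equivalent. All function names are hypothetical.

```python
def _octet_matches(spec, value):
    """Return True if one address component matches its pattern part."""
    if spec.startswith("[") and spec.endswith("]"):
        for item in spec[1:-1].split(","):
            bounds, _, step_text = item.partition("/")
            low_text, _, high_text = bounds.partition("-")
            low = int(low_text)
            high = int(high_text) if high_text else low
            step = int(step_text) if step_text else 1
            if low <= value <= high and (value - low) % step == 0:
                return True
        return False
    return int(spec) == value


def nid_matches(pattern, nid):
    """Check a NID such as '172.16.245.4@tcp' against a range pattern."""
    pattern_addr, _, pattern_net = pattern.partition("@")
    nid_addr, _, nid_net = nid.partition("@")
    if pattern_net != nid_net:
        return False
    pattern_parts = pattern_addr.split(".")
    nid_parts = nid_addr.split(".")
    if len(pattern_parts) != 4 or len(nid_parts) != 4:
        return False
    return all(_octet_matches(spec, int(part))
               for spec, part in zip(pattern_parts, nid_parts))


if __name__ == "__main__":
    pattern = "172.16.245.[0-255/2]@tcp"
    print(nid_matches(pattern, "172.16.245.4@tcp"))  # even last octet
    print(nid_matches(pattern, "172.16.245.5@tcp"))  # odd last octet
```

Under this reading, a client at 172.16.245.4@tcp is exempt from root squash and its neighbour at 172.16.245.5@tcp is not.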
296. g id: 192.168.1.3@tcp
192.168.1.18@tcp Busy session: Isaac id: 192.168.10.10@tcp
192.168.1.19@tcp Down session: <NULL> id: LNET_NID_ANY
192.168.1.20@tcp Down session: <NULL> id: LNET_NID_ANY
stat [--bw] [--rate] [--read] [--write] [--max] [--min] [--avg] [--timeout] [--delay] GROUP|NIDs [GROUP|NIDs]
Collects performance and RPC statistics of one or more nodes. Specifying a group name (GROUP) causes statistics to be gathered for all nodes in a test group. For example: lst stat servers
where servers is the name of a test group created by lst add_group. Specifying a NID range (NIDs) causes statistics to be gathered for selected nodes. For example: lst stat 192.168.0.[1-100/2]@tcp
Currently, only LNET performance statistics are available. By default, all statistics information is displayed. Users can specify additional information with these options:
--bw Displays the bandwidth of the specified group/nodes.
--rate Displays the rate of RPCs of the specified group/nodes.
--read Displays the read statistics of the specified group/nodes.
--write Displays the write statistics of the specified group/nodes.
--max Displays the maximum value of the statistics.
--min Displays the minimum value of the statistics.
--avg Displays the average of the statistics.
--timeout The timeout of the statistics RPC. The default is 5 seconds.
--delay
297. ges to parameters marked with Wc only have effect when connections are established existing connections are not affected by these changes 30 2 30 2 Module Options m With routed or other multi network configurations use ip2nets rather than networks so all nodes can use the same configuration m For a routed network use the same routes configuration everywhere Nodes specified as routers automatically enable forwarding and any routes that are not relevant to a particular node are ignored Keep a common configuration to guarantee that all nodes have consistent routing tables m A separate modprobe conf Inet included from modprobe conf makes distributing the configuration much easier m If you set config_on_load 1 LNET starts at modprobe time rather than waiting for Lustre to start This ensures routers start working at module load time Ictl Ictl gt net down m Remember the lctl ping nid command itis a handy way to check your LNET configuration Lustre 1 8 Operations Manual October 2009 30 2 1 LNET Options This section describes LNET options 30 2 1 1 Network Topology Network topology module parameters determine which networks a node should join whether it should route between these networks and how it communicates with non local networks Here is a list of various networks and the supported software stacks Network Software Stack openib OpenIB gen1 Mellanox Gold iib Silverstorm Inf
298. groups, which uses the Lustre group-supplied upcall. It looks up the UID in /etc/passwd, and if it finds the UID, it looks for supplementary groups in /etc/group for that username. You are free to enhance l_getgroups to look at an external database for supplementary groups information. The default group upcall is set by mkfs.lustre. To set the upcall, use echo {path} > /proc/fs/lustre/mds/{mdsname}/group_upcall or tunefs.lustre --param.
To avoid repeated upcalls, the supplementary group information is cached by the MDS. The default cache time is 300 seconds, but can be changed via /proc/fs/lustre/mds/{mdsname}/group_expire. The kernel waits at most 5 seconds (by default; /proc/fs/lustre/mds/{mdsname}/group_acquire_expire changes this) for the upcall to complete, and will take the failure behavior as described above. It is possible to flush cached entries by writing to the /proc/fs/lustre/mds/{mdsname}/group_flush file.
28.1.3 Parameters
- Name of the MDS service
- Numeric UID
28.1.4 Data structures
#include <lustre/lustre_user.h>
#define MDS_GRP_DOWNCALL_MAGIC 0x6d6dd620
struct mds_grp_downcall_data {
    __u32 mgd_magic;
    __u32 mgd_err;
    __u32 mgd_uid;
    __u32 mgd_gid;
    __u32 mgd_ngroups;
    __u32 mgd_groups[0];
};
CHAPTER 29 Setting Lus
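The mds_grp_downcall_data structure in this entry is five 32-bit header fields followed by a variable-length array of 32-bit group IDs. The Python sketch below illustrates only that byte layout, using the standard struct module with native byte order. It does not show how an upcall program delivers the data to the MDS, which this entry does not describe. The helper name pack_downcall is hypothetical.

```python
import struct

# Magic value taken from the structure definition in the entry above.
MDS_GRP_DOWNCALL_MAGIC = 0x6d6dd620


def pack_downcall(uid, gid, groups, err=0):
    """Pack fields in the order of struct mds_grp_downcall_data:
    mgd_magic, mgd_err, mgd_uid, mgd_gid, mgd_ngroups, mgd_groups[].

    '=' selects native byte order with standard sizes and no padding;
    'I' is an unsigned 32-bit integer, matching __u32.
    """
    header = struct.pack("=5I", MDS_GRP_DOWNCALL_MAGIC, err, uid, gid,
                         len(groups))
    body = struct.pack("=%dI" % len(groups), *groups)
    return header + body


if __name__ == "__main__":
    blob = pack_downcall(uid=500, gid=500, groups=[10, 20])
    print(len(blob))  # 20 header bytes plus 4 bytes per group
```

The trailing zero-length array in the C definition means the record grows by four bytes for each supplementary group, which is what the length calculation reflects.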
299. h tests should not exist individually. Users can control a test batch (run, stop); they cannot control individual tests.
Sample Script
These are the steps to run a sample LNET self-test script simulating the traffic pattern of a set of Lustre servers on a TCP network, accessed by Lustre clients on an InfiniBand network (connected via LNET routers). In this example, half the clients are reading and half the clients are writing.
1. Load libcfs.ko, lnet.ko, ksocklnd.ko and lnet_selftest.ko on all test nodes and the console node.
2. Run this script on the console node:
    #!/bin/bash
    export LST_SESSION=$$
    lst new_session read/write
    lst add_group servers 192.168.10.[8,10,12-16]@tcp
    lst add_group readers 192.168.1.[1-253/2]@o2ib
    lst add_group writers 192.168.1.[2-254/2]@o2ib
    lst add_batch bulk_rw
    lst add_test --batch bulk_rw --from readers --to servers brw read check=simple size=1M
    lst add_test --batch bulk_rw --from writers --to servers brw write check=full size=4K
    # start running
    lst run bulk_rw
    # display server stats for 30 seconds
    lst stat servers & sleep 30; kill $!
    # tear down
    lst end_session
Note: This script can be easily adapted to pass the group NIDs by shell variables or command-line arguments (making it good for general-purpose use).
18.4.2 LNET Self-Test Commands
The LNET self-test (lst) utility is used to issue LNET self-test comm
300. 4.1.0.2 Module Setup
Make sure the modules (like LNET) are installed in the appropriate /lib/modules directory. The mkfs.lustre utility tries to automatically load LNET (via the Lustre module) with the default network settings (using all available network interfaces). To change this default setting, use the network option to specify the network(s) that LNET should use: modprobe -v lustre "networks=XXX"
For example, to load Lustre with multiple-interface support (meaning LNET will use more than one physical circuit for communication between nodes), load the Lustre module with the following network option: modprobe -v lustre "networks=tcp0(eth0),o2ib0(ib0)"
where:
tcp0 is the network itself (TCP/IP)
eth0 is the physical device (card) that is used (Ethernet)
o2ib0 is the interconnect (InfiniBand)
4.1.1 Scaling the Lustre File System
A Lustre file system can be scaled by adding OSTs or clients. For instructions on creating additional OSTs, see Step 4 and Step 5 above; for clients, see Step 7.
4.2 Additional Lustre Configuration
Once the Lustre file system is configured, it is ready for use. If additional configuration is necessary, several configuration utilities are available. For man pages and reference information, see:
- mkfs.lustre
- tunefs.lustre
- lctl
- mount.lustre
System Configuration Utilities (man8)
301. hardware and software RAID 10 4 RapidArray 2 2 RapidArray LND 30 11 readahead tuning 22 19 recovering Lustre 19 1 recovery mode failure types client failure 19 2 MDS failure failover 19 3 network partition 19 4 OST failure 19 3 recovery aborting 4 27 required software 3 3 required tools utilities 3 3 restore file level 15 4 root squash configuring 25 4 tips 25 6 tuning 25 4 root squash using 25 4 round robin allocator 24 11 routers downed 2 12 routers LNET 2 11 routing 2 8 routing elan to TCP 7 5 RPC stream tunables 22 12 RPC stream watching 22 14 RPMs installing Lustre 3 9 running a client and OST on the same machine 26 5 S scaling Lustre 4 10 server mounting 4 14 4 15 server NID changing 4 27 Service tags introduction 5 1 using 5 2 setting maxcmds 20 11 readahead and MF 20 8 SCSI I O sizes 21 21 segment size 20 9 write back cache 20 10 setting Lustre parameters 4 21 sgpdd_survey tool 18 3 simple configuration CSV file configuring Lustre 6 4 network combined MGS MDT 6 1 network separate MGS MDT 6 3 TCP network Lustre simple configurations 6 1 SOCKLND kernel TCP IP LND 30 8 software RAID support 10 8 source code installing Lustre 3 13 starting LNET 2 13 statahead tuning 22 20 stopping LNET 2 14 striping advantages 24 2 disadvantages 24 3 lfs getstripe display files and directories 24 4 lfs getstripe set file layout 24 6 size 24 3 stri
302. have discovered a programming error that allowed the servers to get out of sync. Please report this condition to the Lustre group, and we will investigate. If the reported error is anything else (such as -5, "I/O error"), it likely indicates a storage failure. The low-level file system returns this error if it is unable to read from the storage device.
Suggested Action
If the reported error is -2, you can consider checking in lost+found on your raw OST device, to see if the missing object is there. However, it is likely that this object is lost forever, and that the file that references the object is now partially or completely lost. Restore this file from backup, or salvage what you can and delete it. If the reported error is anything else, then you should immediately inspect this server for storage problems.
21.4.4 OSTs Become Read-Only
If the SCSI devices are inaccessible to Lustre at the block device level, then ext3 remounts the device read-only to prevent file system corruption. This is a normal behavior. The status in /proc/fs/lustre/healthcheck also shows "not healthy" on the affected nodes.
To determine what caused the "not healthy" condition:
- Examine the consoles of all servers for any error indications
- Examine the syslogs of all servers for any LustreErrors or LBUG
- Check the health of your system hardware and network. (Are the disks working as expecte
303. he best one for communication Colon separation for example um11 um12 means that the two NIDs refer to two different hosts and should be treated as failover locations Lustre tries the first one and if that fails it tries the second one Lustre 1 8 Operations Manual October 2009 Note If you have an MGS or MDT configured for failover perform these steps 1 On the OST list the NIDs of all MGS nodes at mkfs time OST mkfs lustre fsname sunfs ost mgsnode 10 0 0 1 mgsnode 10 0 0 2 dev device 2 On the client mount the file system client mount t lustre 10 0 0 1 10 0 0 2 sunfs cfs client 4 5 Operational Scenarios In the operational scenarios below the management node is the MDS The management service is started as the initial part of the startup of the primary MDT Tip All targets that are configured for failover must have some kind of shared storage among two server nodes IP Network Single MDS Single OST No Failover On the MDS run mkfs lustre mdt mgs fsname lt fsname gt lt partition gt mount t lustre lt partition gt lt mountpoint gt On the OSS run mkfs lustre ost mgs fsname lt fsname gt lt partition gt mount t lustre lt partition gt lt mountpoint gt On the client run mount t lustre lt MGS NID gt lt fsname gt lt mountpoint gt Chapter 4 Configuring Lustre 4 29 4 30 IP Network Failover MDS For failover storage ho
304. he Next Minor Version Downgrading from Lustre 1 8 x to 1 6 x 13 1 13 1 Supported Upgrades For Lustre 1 8 x the following upgrades are supported Lustre 1 6 x latest version to Lustre 1 8 x latest version a Lustre 1 8 x any minor version to Lustre 1 8 x latest version 13 2 13 2 Lustre Interoperability Lustre interoperability enables 1 8 x servers new servers to work with 1 6 x clients old clients 1 6 x servers old servers to work with 1 8 x clients new clients and mixed environments with 1 6 x and 1 8 x servers For example half of each OSS failover pair could be upgraded to enable a quick reversion to 1 6 by powering down the 1 8 servers This table describes interoperability between Lustre clients OSTs and MDTs with different versions of Lustre installed Lustre Component Interoperability with Other Lustre Components Clients e Old live clients can communicate with old new mixed servers e Old clients can start up using old new mixed servers e New clients can start up using old new mixed servers Note Old clients cannot mount a file system that was created by a new MDT OSTs e Old OSTs can communicate with new clients MDT e New OSTs can only be started after the MGS has been started typically this means after the MDT has been upgraded MDTs e Old MDT can communicate with new clients e New co located MGS MDT can be started at any point e New non co located M
305. he default striping settings:
lov.stripesize=<bytes>
lov.stripecount=<count>
lov.stripeoffset=<offset>
To change the default striping information:
- On the MGS: lctl conf_param testfs-MDT0000.lov.stripesize=4M
- On the MDT and clients: mdt/cli> cat /proc/fs/lustre/lov/testfs-{mdt|cli}lov/stripe*
21.4.8 Erasing a File System
If you want to erase a file system, run this command on your targets: mkfs.lustre --reformat
If you are using a separate MGS and want to keep other file systems defined on that MGS, then set the writeconf flag on the MDT for that file system. The writeconf flag causes the configuration logs to be erased; they are regenerated the next time the servers start.
To set the writeconf flag on the MDT:
1. Unmount all clients/servers using this file system, run: umount /mnt/lustre
2. Erase the file system and, presumably, replace it with another file system, run: mkfs.lustre --reformat --fsname spfs --mdt --mgs /dev/sda
3. If you have a separate MGS (that you do not want to reformat), then add the writeconf flag to mkfs.lustre on the MDT, run: mkfs.lustre --reformat --writeconf --fsname spfs --mdt --mgs /dev/sda
Note: If you have a combined MGS/MDT, reformatting the MDT reformats the MGS as well, causing all configuration information to be lost; you can start building your new file system. Nothing needs to be d
306. he network, a Lustre client can perform end-to-end data checksums. Be aware that at high data rates, checksumming can impact Lustre performance. (Footnote 2: This feature computes a 32-bit checksum of data read or written on both the client and server, and ensures that the data has not been corrupted in transit over the network.)
CHAPTER 21 Lustre Monitoring and Troubleshooting
This chapter provides information on troubleshooting Lustre, submitting a Lustre bug, and Lustre performance tips. It includes the following sections:
- Monitoring Lustre
- Troubleshooting Lustre
- Submitting a Lustre Bug
- Common Lustre Problems and Performance Tips
21.1 Monitoring Lustre
Several tools are available to monitor a Lustre cluster.
Lustre Monitoring Tool
The Lustre Monitoring Tool (LMT) is a Python-based, distributed system that provides a "top"-like display of activity on server-side nodes (MDS, OSS and portals routers) on one or more Lustre file systems. LMT provides a Java-based GUI that reports data for each file system. A tab is presented for each Lustre file system that is being monitored. Within each tab, there are panes presenting the server-side node information (MDS, OSS or portals routers). There is also a tab that presents a multi-level outline view of the sub-components of each file system component. Data is displayed for OSTs and file 1
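The end-to-end checksum described at the start of this entry follows a general pattern: both ends compute a 32-bit checksum over the same data and compare the results. The Python sketch below shows that pattern. The entry does not name the checksum algorithm Lustre uses, so CRC-32 from Python's zlib module is used here purely as a stand-in. The function names are hypothetical.

```python
import zlib


def checksum32(data):
    """Return a 32-bit checksum of a bytes object.

    CRC-32 is a stand-in chosen for illustration; it is an assumption,
    not a statement about the algorithm Lustre applies."""
    return zlib.crc32(data) & 0xFFFFFFFF


def transfer_is_intact(sent, received):
    """Compare sender-side and receiver-side checksums of a payload."""
    return checksum32(sent) == checksum32(received)


if __name__ == "__main__":
    payload = b"lustre bulk data"
    corrupted = b"lustre bulk dat\x00"
    print(transfer_is_intact(payload, payload))
    print(transfer_is_intact(payload, corrupted))
```

The performance caveat in the entry follows from this pattern: every byte transferred is read once more on each side to compute the checksum.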
307. hould never be used with other hardware Using Usocklnd Lustre now offers usocklnd a socket based LND that uses TCP in userspace By default liblustre is compiled with usocklnd as the transport so there is no need to specially enable it Variable USOCK_SOCKNAGLE N USOCK_SOCKBUFSIZ N USOCK_TXCREDITS N Use the following environmental variables to tune usocklnd s behavior Description Turns the TCP Nagle algorithm on or off Setting N to 0 the default value turns the algorithm off Setting N to 1 turns the algorithm on Changes the socket buffer size Setting N to 0 the default value specifies the default socket buffer size Setting N to another value must be a positive integer causes usocklnd to try to set the socket buffer size to the specified value Specifies the maximum number of concurrent sends The default value is 256 N should be set to a positive value USOCK_PEERTXCREDITSEN Specifies the maximum number of concurrent sends per USOCK_NPOLLTHREADS N USOCK_FAIR_LIMIT N peer The default value is 8 N should be set to a positive value and should not be greater than the value of the USOCK_TXCREDITS parameter Defines the degree of parallelism of usocklnd by equaling the number of threads devoted to processing network events The default value is the number of CPUs in the system N should be set to a positive value The maximum number of times that usocklnd loops processing events before
308. how to install and use the Lustre SNMP module and includes the following sections:
- Installing the Lustre SNMP Module
- Building the Lustre SNMP Module
- Using the Lustre SNMP Module
14.1 Installing the Lustre SNMP Module
To install the Lustre SNMP module:
1. Locate the SNMP plug-in (lustresnmp.so) in the base Lustre RPM and install it: /usr/lib/lustre/snmp/lustresnmp.so
2. Locate the MIB (Lustre-MIB.txt) in /usr/share/lustre/snmp/mibs/Lustre-MIB.txt and append the following line to snmpd.conf: dlmod lustresnmp /usr/lib/lustre/snmp/lustresnmp.so
3. You may need to copy Lustre-MIB.txt to a different location to use a few tools. For this, use either of these locations: ~/.snmp/mibs or /usr/local/share/snmp/mibs
14.2 Building the Lustre SNMP Module
To build the Lustre SNMP module, you need the net-snmp-devel package. The default net-snmp install includes a snmpd.conf file.
1. Complete the net-snmp setup by checking and editing the snmpd.conf file, located in /etc/snmp: /etc/snmp/snmpd.conf
2. Build the Lustre SNMP module from the Lustre src.rpm:
- Install the Lustre source
- Run configure
- Add the --enable-snmp option
14.3 Using the Lustre SNMP Module
Once the Lustre SNMP module is installed and built, it can be used as follows. For all Lustre components, the SNMP module reports a number and total and free capacity (usually in bytes). Depending on the com
309. i_ file create closes the file descriptor we must re open fd open tfile O_CREAT O_RDWR O_LOV_DELAY_CREATE 0644 if fd lt 0 fprintf stderr Can t open s file d s n tfile errno strerror errno return 1 return fd output a list of uuids for this file Chapter 24 Striping and I O Options 24 19 int get_my_uuids int fd struct obd_uuid uuids 1024 1024 int obdcount int rc i uuidp Output var re llapi_lov_get_uuids fd uuids amp obdcount if re 0 fprintf stderr get uuids failed d s n errno strerror errno printf This file system has d obds n obdcount for i 0 uuidp uuids i lt obdcount i uuidp printf UUID d is s n i uuidp gt uuid return 0 Print out some LOV attribu int get_file_info char path tes List our objects struct lov_user_md lump int ro int i lump malloc LOV_EA_MAX lump if lump NULL return 1 rc llapi_file_get_stripe path lump if re 0 fprintf stderr get_stripe failed d s n errno strerror errno return 1 printf Lov magic u n lump gt lmm magic printf Lov pattern u n lump gt lmm pattern printf Lov object id llu n lump gt lmm_object_id printf Lov object group llu n lump gt lmm_object_gr printf Lov stripe size u n lump gt lmm stripe size printf Lov stripe count hu n lump gt lmm_str
310. ia used to determine the local NID are a Fewest hops to minimize routing and a Appears first in the networks or ip2nets LNET configuration strings As an example consider a two rail IB cluster running the OFA stack OFED with these IPoIB address assignments i1b0 ibl Servers 192 168 0 192 168 1 Clients 192 168 2 127 192 168 128 253 1 Multi rail configurations are only supported by o2ibind other IB LNDs do not support multiple interfaces Chapter 7 More Complicated Configurations 7 7 7 8 You could create these configurations a cluster with more clients than servers The fact that an individual client cannot get two rails of bandwidth is unimportant because the servers are the actual bottleneck ip2nets o2ib0 ib0 o2ib1 ib1 192 168 0 1 all servers o21b0 ib0 192 168 2 253 0 252 2 even clients o2ibl ibl 192 168 2 253 1 253 2 odd clients This configuration gives every server two NIDs one on each network and statically load balances clients between the rails m A single client that must get two rails of bandwidth and it does not matter if the maximum aggregate bandwidth is only servers 1 rail ip2nets 02ib0 ib0 192 168 0 1 0 252 2 even servers o2ib1 ib1 192 168 0 1 1 253 2 odd servers o2ib0 ib0 o2ib1 ib1 192 168 2 253 clients This configuration gives every server a single NID on one rail or the other Clients have a NID on both rail
311. ice Lustre server component that interfaces with the underlying Object Storage Device to manage the Lustre file system namespace directories file ownership attributes MetaData Server Server node that is hosting the Metadata Target MDT Metadata Target A metadata device made available through the Lustre meta data network protocol A cache of metadata updates mkdir create setattr other operations which an application has performed but have not yet been flushed to a storage device or server Management Service A software module that manages the startup configuration and changes to the configuration Also the server node on which this system runs The Lustre configuration protocol introduced in version 1 6 which formats disk file systems on servers with the mkfs lustre program and prepares them for automatic incorporation into a Lustre cluster An older obsolete term for LND Network Identifier Encodes the type network number and network address of a network interface on a node for use by Lustre A subset of the LNET RPC module that implements a library for sending large network requests moving buffers with RDMA Object Device The base class of layering software constructs that provides Lustre functionality See Storage Object API Module that can implement the Lustre object or metadata APIs Examples of OBD types include the LOV OSC and OSD Lustre 1 8 Operations Manual October 2009 Obdfilter Ob
312. ice to be used for the operation (specified by name or number). See device_list.
--ignore_errors | ignore_errors: Ignores errors during script processing.

Examples

lctl

    $ lctl
    lctl > dl
      0 UP mgc MGC192.168.0.20@tcp bfbb24e3-7deb-2ffa-eab0-44dffe00f692 5
      1 UP ost OSS OSS_uuid 3
      2 UP obdfilter testfs-OST0000 testfs-OST0000_UUID 3
    lctl > dk /tmp/log
    Debug log: 87 lines, 87 kept, 0 dropped.
    lctl > quit

    $ lctl conf_param testfs-MDT0000 sys.timeout=40
    $ lctl conf_param testfs-MDT0000.lov.stripesize=2M
    $ lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
    $ lctl conf_param testfs-OST0000.ost.client_cache_seconds=15

Chapter 31: System Configuration Utilities (man8)

get_param

    $ lctl
    lctl > get_param obdfilter.lustre-OST0000.kbytesavail
    obdfilter.lustre-OST0000.kbytesavail=249364
    lctl > get_param -n obdfilter.lustre-OST0000.kbytesavail
    249364
    lctl > get_param timeout
    timeout=20
    lctl > get_param -n timeout
    20
    lctl > get_param obdfilter.*.kbytesavail
    obdfilter.lustre-OST0000.kbytesavail=249364
    obdfilter.lustre-OST0001.kbytesavail=249364

set_param

    $ lctl
    lctl > set_param obdfilter.*.kbytesavail=0
    obdfilter.lustre-OST0000.kbytesavail=0
    obdfilter.lustre-OST0001.kbytesavail=0
    lctl > set_param -n obdfilter.*.kbytesavail=0
    lctl > set_param fail_loc=0
    fail_loc=0

31.4 moun
313. ients causes them to return EIO when accessing objects on the OST instead of waiting for recovery.
abort_recovery: Aborts the recovery process on a re-starting MDT or OST device.

Virtual Block Device Operations

Lustre can emulate a virtual block device upon a regular file. This emulation is needed when you are trying to set up a swap space via the file.

Option: Description
blockdev_attach <file name> <device node>: Attaches a regular Lustre file to a block device. If the device node is non-existent, lctl creates it. We recommend that you create the device node by lctl since the emulator uses a dynamic major number.
blockdev_detach <device node>: Detaches the virtual block device.
blockdev_info <device node>: Provides information on which Lustre file is attached to the device node.

Debug

Option: Description
debug_daemon: Starts and stops the debug daemon, and controls the output filename and size.
debug_kernel [file] [raw]: Dumps the kernel debug buffer to stdout or a file.
debug_file <input> [output]: Converts the kernel-dumped debug log from binary to plain text format.
clear: Clears the kernel debug buffer.
mark <text>: Inserts marker text in the kernel debug buffer.

Options

Use the following options to invoke lctl.

Option: Description
--device: Dev
314. ig beeeeb3dfe132a8a0633a017c99ce0 x86 cache: Disk quota exceeded

The requested quota of 300 MB is divided across the OSTs. Each OST has an initial allocation of 100 MB blocks, with iunit limiting to 5000.

Note: It is very important to note that the block quota is consumed per OST and the MDS per block and inode (there is only one MDS for inodes). Therefore, when the quota is consumed on one OST, the client may not be able to create files regardless of the quota available on other OSTs.

Additional information:

Grace period: The period of time (in seconds) within which users are allowed to exceed their soft limit. There are four types of grace periods:
- user block soft limit
- user inode soft limit
- group block soft limit
- group inode soft limit
The grace periods are applied to all users. The user block soft limit is for all users who are using a blocks quota.

Soft limit: Once you are beyond the soft limit, the quota module starts the grace timer, but you can still write blocks and inodes. If you stay beyond the soft limit until the grace time is used up, you get the same result as with the hard limit. This applies to inodes and blocks alike. The soft limit must be less than the hard limit; if it is not, the quota module never starts the timer. If the soft limit is not needed, leave it as zero (0).

Hard limit: When you are beyond the hard limit, you get EDQUOT and canno
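The interaction between soft limit, grace period and hard limit described above can be summarized as a small decision function. This is a simplified model for illustration only, not the Lustre quota module: the function name and arguments are hypothetical, and treating a hard limit of 0 as "no hard limit" is an assumption (the text only says a soft limit of 0 means "not needed").

```python
def quota_state(usage, soft, hard, over_soft_since, now, grace):
    """Classify a write attempt as 'ok', 'grace' or 'denied'.

    usage            current usage (blocks or inodes)
    soft, hard       limits; 0 means the limit is not set (assumption for hard)
    over_soft_since  time at which usage first exceeded soft, or None
    now, grace       current time and grace period, in seconds
    """
    if hard and usage > hard:
        return "denied"                      # beyond the hard limit
    if soft and usage > soft:
        # The grace timer is running; once it expires the soft limit
        # behaves like the hard limit.
        if over_soft_since is not None and now - over_soft_since >= grace:
            return "denied"
        return "grace"
    return "ok"


if __name__ == "__main__":
    week = 7 * 24 * 3600
    print(quota_state(250, 200, 300, over_soft_since=0, now=3600, grace=week))
    print(quota_state(250, 200, 300, over_soft_since=0, now=week, grace=week))
```

The model also shows why a soft limit that is not below the hard limit never matters: usage hits the hard limit first, so the grace branch is never reached.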
315. ile can be cleared by writing to it:

    # echo > /proc/fs/lustre/llite/lustre-57dee00/rw_offset_stats

Chapter 22: LustreProc

22.2.4 Client Read-Write Extents Survey

Client-Based I/O Extent Size Survey

The extents_stats histogram in the llite directory shows you the statistics for the sizes of the read-write I/O extents. This file does not maintain the per-process statistics.

Example:

    $ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
    snapshot_time: 1213828728.348516 (secs.usecs)
                            read         |        write
    extents          calls   %   cum%    |  calls   %   cum%
    0K - 4K :            0   0      0    |      2   2      2
    4K - 8K :            0   0      0    |      0   0      2
    8K - 16K :           0   0      0    |      0   0      2
    16K - 32K :          0   0      0    |     20  23     26
    32K - 64K :          0   0      0    |      0   0     26
    64K - 128K :         0   0      0    |     51  60     86
    128K - 256K :        0   0      0    |      0   0     86
    256K - 512K :        0   0      0    |      0   0     86
    512K - 1024K :       0   0      0    |      0   0     86
    1M - 2M :            0   0      0    |     11  13    100

The file can be cleared by issuing the following command:

    $ echo > /proc/fs/lustre/llite/lustre-ee5af200/extents_stats

Per-Process Client I/O Statistics

The extents_stats_per_process file maintains the I/O extent size statistics on a per-process basis, so you can track the per-process statistics for the last MAX_PER_PROCESS_HIST processes.

Example:

    $ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats_per_process
    snapshot_time extents PID 11488 0K - 4K 4K - 8K 8K - 16K 16K - 32K 32K - 64K 64K - 128K 128K - 256K 256K - 512K 512K - 1024K 1M - 2M PID 11491
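The percentage and cumulative-percentage columns of such a histogram follow directly from the call counts. The sketch below recomputes them for the write column of the sample; the function name is hypothetical, and truncating (rather than rounding) each percentage is an assumption chosen because it reproduces the sample values.

```python
def summarize(calls):
    """Return (calls, percent, cumulative percent) for each extent bucket.

    Percentages are truncated to whole numbers (an assumption that
    matches the sample output).
    """
    total = sum(calls)
    rows, running = [], 0
    for count in calls:
        running += count
        pct = 100 * count // total if total else 0
        cum = 100 * running // total if total else 0
        rows.append((count, pct, cum))
    return rows


# Write-side call counts for the buckets 0K-4K ... 1M-2M in the sample.
WRITE_CALLS = [2, 0, 0, 20, 0, 51, 0, 0, 0, 11]
BUCKETS = ["0K-4K", "4K-8K", "8K-16K", "16K-32K", "32K-64K",
           "64K-128K", "128K-256K", "256K-512K", "512K-1024K", "1M-2M"]

if __name__ == "__main__":
    for label, (count, pct, cum) in zip(BUCKETS, summarize(WRITE_CALLS)):
        print(f"{label:>12} : {count:5d} {pct:3d} {cum:4d}")
```

Read this way, the sample says that 60 percent of the write calls were 64K-128K extents and that no write exceeded 2M.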
316. ile type that stores tabular data. Many popular spreadsheet programs, such as Microsoft Excel, can read from/write to CSV files.

How lustre_config Works

The lustre_config script parses each line in the CSV file and executes remote commands, like mkfs.lustre, to format each Lustre target in the Lustre cluster. Optionally, the lustre_config script can also:
- Verify network connectivity and hostnames in the cluster
- Configure Linux MD/LVM devices
- Modify /etc/modprobe.conf to add Lustre networking information
- Add the Lustre server information to /etc/fstab
- Produce configurations for Heartbeat or CluManager

How to Create a CSV File

Five different types of line formats are available to create a CSV file. Each line format represents a target. The list of targets, with the respective line formats, is described below.

Linux MD device

The CSV line format is:

    hostname,MD,md name,operation mode,options,raid level,component devices

Where:

Variable: Supported Type
hostname: Hostname of the node in the cluster.
MD: Marker of the MD device line.
md name: MD device name, for example: /dev/md0
operation mode: Operations mode, either create or remove. Default is create.
options: A catchall for other mdadm options, for example: -c 128
raid level: RAID level: 0, 1, 4, 5, 6, 10, linear and multipath.
component devices: Block device
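A line in the Linux MD device format can be checked before it is handed to lustre_config. The following is a minimal sketch using Python's standard csv module; the field keys, the function name and the sample line (host `node1`, devices `/dev/sdb` and so on) are hypothetical, and the component devices field is kept as a raw string because its exact syntax is not shown in this excerpt.

```python
import csv

# Field order taken from the MD line format above; the key names are ours.
MD_FIELDS = ("hostname", "marker", "md_name", "operation_mode",
             "options", "raid_level", "component_devices")


def parse_md_line(line: str) -> dict:
    """Parse one Linux MD device line into a dict and apply the documented default."""
    row = next(csv.reader([line]))
    if len(row) != len(MD_FIELDS):
        raise ValueError(f"expected {len(MD_FIELDS)} fields, got {len(row)}")
    record = dict(zip(MD_FIELDS, (field.strip() for field in row)))
    if record["marker"] != "MD":
        raise ValueError("not an MD device line")
    if not record["operation_mode"]:
        record["operation_mode"] = "create"   # documented default
    if record["operation_mode"] not in ("create", "remove"):
        raise ValueError("operation mode must be create or remove")
    return record


if __name__ == "__main__":
    sample = "node1,MD,/dev/md0,,-c 128,5,/dev/sdb /dev/sdc /dev/sdd"
    print(parse_md_line(sample))
```

Using the csv module rather than `str.split(",")` matters once a field contains a quoted comma.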
317. iles, then the I/O writes at offset 0 in each file. The arguments are byte specifiers. They generate runs with a range of offsets starting at OL, increasing P until the region size exceeds OH. Each of these arguments is exclusive with the offset argument.

Before each run, executes the pre-command as a shell command through the system(3) call. The timestamp of the run is appended as the last argument to the pre-command string. Typically, this is used to clear statistics or start a data collection script when the run starts.

After each run, executes the post-command as a shell command through the system(3) call. The timestamp of the run is appended as the last argument to the post-command string. Typically, this is used to append statistics for the run or close an open data collection script when the run completes.

Parameter: Description
regioncount N, N2, N3; regioncount_low RL; regioncount_high RH; regioncount_inc P
regionnoise k
regionsize S, S2, S3; regionsize_low RL; regionsize_high RH; regionsize_inc P
threadcount T, T2, T3; threadcount_low TL; threadcount_high TH; threadcount_inc TP
threaddelay ms
fpp
verify V timestamp, timestamp2, timestamp3; verify | V

PIOS writes to N regions in a single file or block device, or to N files. Generate runs with a range of region counts starting at TL, increasing P until the
318. ilures in the sub-second range generally require special hardware. As a result, they are quite expensive.

Tip: Failover of the Lustre client is dependent on the obd_timeout parameter. The Lustre client does not attempt failover until the request times out. Then the client tries to resend the request to the original server. If an obd_timeout occurs again, the Lustre client refers to the import list for that target and tries to connect, in a round-robin manner, until one of the nodes replies. The timeouts for the connection are much lower (obd_timeout / 20, 5).

Footnote 2: This is true for every HA monitor, not just the Lustre health_check.

CHAPTER 9 Configuring Quotas

This chapter describes how to configure quotas and includes the following sections:
- Working with Quotas
- Enabling Disk Quotas
- Creating Quota Files and Quota Administration
- Quota Allocation
- Known Issues with Quotas
- Lustre Quota Statistics

9.1 Working with Quotas

Quotas allow a system administrator to limit the amount of disk space a user or group can use in a directory. Quotas are set by root, and can be specified for individual users and/or groups. Before a file is written to a partition where quotas are set, the quota of the creator's group is checked. If a quota exists, then the file size counts towards the group's quota. If no quota exists, then the owner's user quota is che
319. in Lustre 1.0.2, as it provides useful information for diagnosing problems without materially impairing the performance of Lustre.

How can I improve Lustre metadata performance when using large directories (> 0.5 million files)?

On the MDS, more memory translates into bigger caches and, therefore, higher performance. One of the requirements for higher metadata performance is to have lots of RAM on the MDS. The other requirement (if not running a 64-bit kernel) is to patch the core kernel on the MDS with the 3G/1G patch to increase the available kernel address space. This, again, translates into having support for bigger caches on the MDS. Usually, the address space is split in a 3:1 ratio (3G for userspace and 1G for kernel). The 3G/1G patch changes this ratio to 3G for kernel/1G for user (3:1) or 2G for kernel and 2G for user (2:2).

Appendix A: Lustre Knowledge Base

File system refuses to mount because of UUID mismatch

When Lustre exports a device for the first time on a target (MDS or OST), it writes a randomly-generated unique identifier (UUID) to the disk from the .xml configuration file. On subsequent exports of that device, the Lustre code verifies that the UUID on disk matches the UUID in the .xml configuration file. This is a safety feature which avoids many potential configuration errors, such as devices being renamed after the addition of new disks or controller cards to the system, cabling errors, etc. This resul
320. in: authentication N/A; RPC message protection null; bulk data protection checksum (adler32).
krb5n: authentication GSS/Kerberos5; RPC message protection null; bulk data protection checksum (adler32).

Remarks (for the rows above, in order):
- Almost no performance overhead. The on-wire RPC data is compatible with old versions of Lustre (1.4.x, 1.6.x).
- Carries checksum (which only protects data mutating during transfer; it cannot guarantee the genuine author because there is no actual authentication).
- No protection of the RPC message, adler32 checksum protection of bulk data, light performance overhead.

Chapter 11: Kerberos

Basic Flavor | Authentication | RPC Message Protection | Bulk Data Protection | Remarks
krb5a | GSS/Kerberos5 | partial integrity | checksum (adler32) | Only the header of the RPC message is integrity protected, adler32 checksum protection of bulk data, more performance overhead compared to krb5n.
krb5i | GSS/Kerberos5 | integrity | integrity [sha1] | RPC message integrity protection algorithm is determined by actual Kerberos algorithms in use; heavy performance overhead.
krb5p | GSS/Kerberos5 | privacy | privacy [sha1/aes128] | RPC message privacy protection algorithm is determined by actual Kerberos algorithms in use; heaviest performance overhead.

In Lustre 1.6.5, bulk data checksumming is enabled (by default) to provide integrity checking using the adler32 mechanism if the OSTs support it. Adler32 checksums offer lower CPU overhead than CRC32.

11.2.2.2 Security Flavor

A security flavor is a string that describes what kind of security transform is performed on a
321. in the format imposed by these file systems.

OSS Storage

Each OSS can manage multiple object storage targets (OSTs), one for each volume; I/O traffic is load-balanced against servers and targets. An OSS should also balance network bandwidth between the system network and attached storage to prevent network bottlenecks. Depending on the server's hardware, an OSS typically serves between 2 and 25 targets, with each target up to 8 terabytes (TB) in size.

MDS Storage

For the MDS nodes, storage must be attached for Lustre metadata, for which 1-2 percent of the file system capacity is needed. The data access pattern for MDS storage is different from the OSS storage: the former is a metadata access pattern with many seeks and reads and writes of small amounts of data, while the latter is an I/O access pattern, which typically involves large data transfers. High throughput to MDS storage is not important. Therefore, we recommend that a different storage type be used for the MDS (for example FC or SAS drives, which provide much lower seek times). Moreover, for low levels of I/O, RAID 5/6 patterns are not optimal; a RAID 0+1 pattern yields much better results.

Lustre uses journaling file system technology on the targets, and for an MDS, an approximately 20 percent performance gain can sometimes be obtained by placing the journal on a separate device. Typically, the MDS requires CPU power; we recommend at least four processor cores.

Chapter 1 Introducti
322. inicon), vib (Voltaire), o2ib (OpenIB gen2), cib (Cisco), mx (Myrinet MX), gm (Myrinet GM-2), elan (Quadrics QSNet)

Note: Lustre ignores the loopback interface (lo0), but Lustre uses any IP addresses aliased to the loopback (by default). When in doubt, explicitly specify networks.

Chapter 30: Configuration Files and Module Parameters (man5)

ip2nets ("") is a string that lists globally-available networks, each with a set of IP address ranges. LNET determines the locally-available networks from this list by matching the IP address ranges with the local IPs of a node. The purpose of this option is to be able to use the same modules.conf file across a variety of nodes on different networks. The string has the following syntax:

    <ip2nets> :== <net-match> [ <comment> ] { <net-sep> <net-match> }
    <net-match> :== [ <w> ] <net-spec> <w> <ip-range> { <w> <ip-range> } [ <w> ]
    <net-spec> :== <network> [ "(" <interface-list> ")" ]
    <network> :== <nettype> [ <number> ]
    <nettype> :== "tcp" | "elan" | "openib" | ...
    <iface-list> :== <interface> [ "," <iface-list> ]
    <ip-range> :== <r-expr> "." <r-expr> "." <r-expr> "." <r-expr>
    <r-expr> :== <number> | "*" | "[" <r-list> "]"
    <r-list> :== <range> [ "," <r-list> ]
    <range> :== <number>
323. inux (UML), it might be useful to save the logs on the host machine so that they can be used at a later time.

4. If you already have a debug log saved to disk (likely from a crash), to filter a log on disk:

    lctl > debug_file <input filename> [output filename]

During the debug session, you can add markers or breaks to the log for any reason:

    lctl > mark [marker text]

The marker text defaults to the current date and time in the debug log (similar to the example shown below):

    DEBUG MARKER: Tue Mar 5 16:06:44 EST 2002

5. To completely flush the kernel debug buffer:

    lctl > clear

Note: Debug messages displayed with lctl are also subject to the kernel debug masks; the filters are additive.

23.2.4 Finding Memory Leaks

Memory leaks can occur in code where you allocate memory but forget to free it when it is no longer needed. You can use the leak_finder.pl tool to find memory leaks. Before running this program, you must turn on debugging to collect all malloc and free entries. Run:

    sysctl -w lnet.debug=+malloc

Dump the log into a user-specified log file using lctl (as shown in The lctl Tool). Run the leak finder on the newly-created log dump:

    perl leak_finder.pl <logname>

The output is:

    malloced 8bytes at a3116744 (called pathcopy) (lprocfs_status.c:lprocfs_add_vars:80)
    freed 8bytes at a3116744 (called pathcopy) (lprocfs_status
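The idea behind leak detection on such output is to pair each "malloced" record with a later "freed" record for the same address; whatever remains unpaired is a candidate leak. The sketch below is not leak_finder.pl. It is a minimal illustration that assumes lines shaped like the sample output above (`malloced <N>bytes at <address> ...`), and the function name is hypothetical.

```python
import re

# Matches e.g. "malloced 8bytes at a3116744 (called pathcopy) ..."
# (line shape assumed from the sample output).
ALLOC_RE = re.compile(r"^(malloced|freed)\s+(\d+)\s*bytes\s+at\s+([0-9a-fA-F]+)")


def outstanding_allocations(lines):
    """Return {address: size} for allocations that were never freed."""
    live = {}
    for line in lines:
        match = ALLOC_RE.match(line.strip())
        if match is None:
            continue                      # not an allocation record
        action, size, addr = match.group(1), int(match.group(2)), match.group(3)
        if action == "malloced":
            live[addr] = size
        else:
            live.pop(addr, None)          # a free with no matching malloc is ignored
    return live


if __name__ == "__main__":
    log = [
        "malloced 8bytes at a3116744 (called pathcopy)",
        "freed 8bytes at a3116744 (called pathcopy)",
        "malloced 16bytes at a3116800 (called example)",
    ]
    for addr, size in outstanding_allocations(log).items():
        print(f"possible leak: {size} bytes at {addr}")
```

Keying on the address alone is a simplification: an address can be reused after a free, which this sketch handles only because records are processed in order.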
324. ion where debugging symbols should be stored for gdb. The default is set to /r/tmp/lustre-log-localhost.localdomain.

These values can also be set via sysctl -w lnet.debug={value}

Note: The above entries only exist when Lustre has already been loaded.

/proc/sys/lnet/panic_on_lbug

This causes Lustre to call "panic" when it detects an internal problem (an LBUG); panic crashes the node. This is particularly useful when a kernel crash dump utility is configured. The crash dump is triggered when the internal inconsistency is detected by Lustre.

/proc/sys/lnet/upcall

This allows you to specify the path to the binary which will be invoked when an LBUG is encountered. This binary is called with four parameters. The first one is the string "LBUG". The second one is the file where the LBUG occurred. The third one is the function name. The fourth one is the line number in the file.

RPC Information for Other OBD Devices

Some OBD devices maintain a count of the number of RPC events that they process. Sometimes these events are more specific to operations of the device, like llite, than actual raw RPC counts.

    $ find /proc/fs/lustre/ -name stats

[Sample output: a list of stats paths under /proc/fs/lustre/; the individual paths are not recoverable from this excerpt.]
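An upcall program only needs to read the four parameters described above. The following is a minimal sketch of such a handler written as a Python script; the function name and the message format are hypothetical, and what a real upcall should do with the information (log it, page an operator, trigger a dump) is left open.

```python
import sys


def describe_lbug(argv):
    """Format the four upcall parameters: tag, source file, function, line."""
    if len(argv) != 5:
        raise ValueError("expected: <program> LBUG <file> <function> <line>")
    _, tag, src_file, func, line = argv
    if tag != "LBUG":
        raise ValueError(f"unexpected first parameter: {tag!r}")
    return f"LBUG in {func}() at {src_file}:{int(line)}"


if __name__ == "__main__":
    try:
        print(describe_lbug(sys.argv))
    except ValueError as err:
        print(f"upcall: {err}", file=sys.stderr)
        sys.exit(1)
```

Validating the argument count first keeps the handler from failing silently if it is invoked by hand with the wrong arguments.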
325. ipe count);
        printf("Lov stripe offset %u\n", lump->lmm_stripe_offset);
        for (i = 0; i < lump->lmm_stripe_count; i++) {
            printf("Object index %d Objid %llu\n",
                   lump->lmm_objects[i].l_ost_idx,
                   lump->lmm_objects[i].l_object_id);
        }
        free(lump);
        return rc;
    }

    /* Ping all OSTs that belong to this filesystem */
    int ping_osts()
    {
        DIR *dir;
        struct dirent *d;
        char osc_dir[100];
        int rc;

        sprintf(osc_dir, "/proc/fs/lustre/osc");
        dir = opendir(osc_dir);
        if (dir == NULL) {
            printf("Can't open dir\n");
            return -1;
        }
        while ((d = readdir(dir)) != NULL) {
            if (d->d_type == DT_DIR) {
                if (!strncmp(d->d_name, "OSC", 3)) {
                    printf("Pinging OSC %s ", d->d_name);
                    rc = llapi_ping("osc", d->d_name);
                    if (rc) {
                        printf(" bad\n");
                    } else {
                        printf(" good\n");
                    }
                }
            }
        }
        return 0;
    }

Chapter 24: Striping and I/O Options

    int main()
    {
        int file;
        int rc;
        char filename[100];
        char sys_cmd[100];

        sprintf(filename, "%s/%s", MY_LUSTRE_DIR, TESTFILE);
        printf("Open a file with striping\n");
        file = open_stripe_file();
        if (file < 0) {
            printf("Exiting\n");
            exit(1);
        }
        printf("Getting uuid list\n");
        rc = get_my_uuids(file);
        printf("Write to the file\n");
        rc = write_file(file);
        rc = close_file(file);
        printf("Listing LOV data\n");
        rc = get_file_info(filename);
        printf("Ping our OSTs\n");
        rc = ping_osts();

the res
326. ipt. The basic method is to copy existing files to a temporary file, then move the temp file over the old one. This should not be attempted with files which are currently being written to by users or applications. This operation redistributes the stripes over the entire set of OSTs. For a sample data migration script, see A Simple Data Migration Script.

A very clever migration script would do the following:
- Examine the current distribution of data.
- Calculate how much data should move from each full OST to the empty ones.
- Search for files on a given full OST (using lfs getstripe).
- Force the new destination OST (using lfs setstripe).
- Copy only enough files to address the imbalance.

If a Lustre administrator wants to explore this approach further, per-OST disk-usage statistics can be found under /proc/fs/lustre/osc/*/rpc_stats.

26.2 A Simple Data Migration Script

    #!/bin/bash
    # set -x
    # A script to copy and check files.
    # To avoid allocating objects on one or more OSTs, they should be
    # deactivated on the MDS via "lctl --device {device_number} deactivate",
    # where {device_number} is from the output of "lctl dl" on the MDS.
    # To guard against corruption, the file is chksum'd
    # before and after the operation.
    #
    CKSUM=${CKSUM:-md5sum}
    usage() {
        echo "usage: $0 [-O <OST_UUID-to-empty>] <dir>" 1>&2
        echo " -O can be specified multiple times" 1>&2
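The basic method (copy to a temporary file, verify the checksum, move the copy over the original) can be sketched for a single file as follows. This is an illustration of the copy-verify-rename pattern only, not the manual's script: it does not deactivate OSTs, call lfs, or check whether the file is in use, and the function names are hypothetical. MD5 is used because the script above defaults to md5sum.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(path):
    """Rewrite path via a temporary copy in the same directory.

    The copy is checksummed against the original before it replaces it;
    on any failure the temporary file is removed and the original is
    left untouched. Returns the checksum.
    """
    before = md5_of(path)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".migrate.")
    os.close(fd)
    try:
        shutil.copy2(path, tmp)               # new objects are allocated here
        if md5_of(tmp) != before:
            raise IOError(f"checksum mismatch while copying {path}")
        os.replace(tmp, path)                 # move the copy over the original
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return before
```

Creating the temporary file in the same directory keeps the final rename within one file system, so the replacement is a rename rather than a second copy.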
327. is the maximum number of cores (CPUs) on a single Catamount node. Portals must know this value to properly clean up various queues. LNET is not notified directly when a Catamount process aborts. The first information LNET receives is when a new Catamount process with the same Cray portals NID starts and sends a connection request. If the number of processes with that Cray portals NID exceeds the max_procs_per_node value, LNET removes the oldest one to make space for the new one.

Chapter 20: Lustre Tuning

Parameter: Description

These two tunables combine to set the size of the ptllnd request buffer pool. The buffer pool must never drop an incoming message, so proper sizing is very important.

Ntx: Ntx helps to size the transmit (tx) descriptor pool. A tx descriptor is used for each send and each passive RDMA. The maximum number of concurrent sends equals credits. Passive RDMA is a response to a PUT or GET of a payload that is too big to fit in a small message buffer. For servers, this only happens on large RPCs (for instance, where a long file name is included), so the MDS could be under pressure in a large cluster. For routers, this is bounded by the number of servers. If the tx pool is exhausted, a console error message appears.

Credits: Credits determine how many sends are in-flight at once on ptllnd. Optimally, there are 8 requests in-flight per server. The default value is 128, which should be adequate for most applications.
328. it is 8 characters.
NID(s) of the remote mgs node, required for MDT and OST targets; if this item is not given for an MDT, it is assumed that the MDT is also an MGS (according to mkfs.lustre).
Lustre target index.
A catchall; contains options to be passed to mkfs.lustre, for example device-size, --param, and so on.
Format options to be wrapped with --mkfsoptions="" and passed to mkfs.lustre.
If this script is invoked with the -m option, then the value of this item is wrapped with --mountfsoptions="" and passed to mkfs.lustre; otherwise, the value is added into /etc/fstab.
NID(s) of the failover partner node.

Note: In one node, all NIDs are delimited by commas (','). To use comma-separated NIDs in a CSV file, they must be enclosed in quotation marks, for example: "lustre-mgs2,2@elan". When multiple nodes are specified, they are delimited by a colon (':'). If you leave a blank, it is set to default.

The lustre_config.csv file looks like:

    {mdtname}.{domainname},options lnet networks=tcp,/dev/sdb,/mnt/mdt,mgs|mdt
    {ost2name}.{domainname},options lnet networks=tcp,/dev/sda,/mnt/ost1,ost,,192.168.16.34@tcp0
    {ost1name}.{domainname},options lnet networks=tcp,/dev/sda,/mnt/ost0,ost,,192.168.16.34@tcp0

Note: Provide a Fully Qualified Domain Name (FQDN) for all nodes that are a part of the file system in the first parameter of all the rows starting in
329. it is not already started.

3. Then, starting with the highest numbered device and working backward, clean up each device:

    root# lctl
    lctl > cfg_device ost003_s1_client
    lctl > cleanup force
    lctl > detach
    lctl > cfg_device OSS
    lctl > cleanup force
    lctl > cfg_device ost003_s1
    lctl > cleanup force
    lctl > detach

At this point, it should be possible to unload the Lustre modules.

What is the default block size for Lustre?

The on-disk block size for Lustre is 4 KB (same as ext3). Nevertheless, Lustre goes to great lengths to do 1 MB reads and writes to the disk, as large requests are a key to getting very high performance.

How do I determine which Lustre server (MDS/OST) was connected to a particular storage device?

In instances when the hardware configuration has changed (e.g., moving equipment and re-connecting it), it is important to connect the right storage devices to the associated Lustre servers. Lustre writes a UUID to every OST and MDS. To view this information:

Mount the storage device as ldiskfs:

    mount -t ldiskfs /dev/foo /mnt/tmp

Inspect the contents of the last_rcvd file in the root directory:

    strings /mnt/tmp/last_rcvd

The MDS/OST UUID is the first element in the last_rcvd file and is in a human-readable form (e.g. mds1_UUID).

Unmount the storage device and connect it to the appropriate Lustre server:

    umount /mnt/tmp

Note: It is not p
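What `strings` does for the last_rcvd inspection step can be reproduced in a few lines when the tool is not at hand. This is a rough sketch under the assumption stated above, that the UUID is the first human-readable element in the file; the function name and the minimum length of four characters (the usual `strings` default) are choices made here, not part of the manual.

```python
import re


def first_printable_string(data: bytes, min_len: int = 4):
    """Return the first run of at least min_len printable ASCII bytes, or None."""
    pattern = re.compile(("[\\x20-\\x7e]{%d,}" % min_len).encode("ascii"))
    match = pattern.search(data)
    return match.group(0).decode("ascii") if match else None


def read_target_uuid(path: str, limit: int = 4096):
    """Read the start of a last_rcvd-style file and return its first string."""
    with open(path, "rb") as handle:
        return first_printable_string(handle.read(limit))


if __name__ == "__main__":
    sample = b"\x00\x00mds1_UUID\x00\x00\x01\x02trailer"
    print(first_printable_string(sample))
```

Only the first few kilobytes are read, since the value of interest sits at the start of the file.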
330. ite for user A from a bundle of clients, you will write much more data than 400 GB, and cause an out-of-quota error (-EDQUOT).

Note: The effect of granted cache on quota limits can be mitigated, but not eradicated. Reduce the max_dirty_buffer in the clients (can be set from 0 to 512). To set max_dirty_buffer to 0:

In releases after Lustre 1.6.5:

    lctl set_param osc.*.max_dirty_mb=0

In releases before Lustre 1.6.5:

    for O in /proc/fs/lustre/osc/*/max_dirty_mb; do echo 512 > $O; done

9.1.4.2 Quota Limits

Available quota limits depend on the Lustre version you are using.
- Lustre version 1.4.11 and earlier (for 1.4.x releases) and Lustre version 1.6.4 and earlier (for 1.6.x releases) support quota limits less than 4 TB.
- Lustre versions 1.4.12, 1.6.5 and later support quota limits of 4 TB and greater in Lustre configurations with OST storage limits of 4 TB and less.
- Future Lustre versions are expected to support quota limits of 4 TB and greater with no OST storage limits.

Lustre Version            Quota Limit Per User/Per Group    OST Storage Limit
1.4.11 and earlier        < 4 TB                            n/a
1.4.12                    > 4 TB                            < 4 TB of storage
1.6.4 and earlier         < 4 TB                            n/a
1.6.5                     > 4 TB                            < 4 TB of storage
Future Lustre versions    > 4 TB                            No storage limit

9.1.4.3 Quota File Formats

Lustre 1.6.5 introduced the v2 file format for administrative quotas, with
331. ith the operating system.

6. Make a mount point on all the OSTs for the file system and mount it:

    $ mkdir -p /mnt/data/ost0
    $ mount -t lustre /dev/sda /mnt/data/ost0
    $ mkdir -p /mnt/data/ost1
    $ mount -t lustre /dev/sdd /mnt/data/ost1
    $ mkdir -p /mnt/data/ost2
    $ mount -t lustre /dev/sda1 /mnt/data/ost2
    $ mkdir -p /mnt/data/ost3
    $ mount -t lustre /dev/sdb /mnt/data/ost3
    $ mount -t lustre mdt16@tcp0:/datafs /mnt/datafs

6.1.2 Lustre with Separate MGS and MDT

The following example describes a Lustre file system "datafs" having an MGS and an MDT on separate nodes, four OSTs, and a number of Lustre clients.

6.1.2.1 Installation Summary
- One MGS
- One MDT
- Four OSTs
- Any number of Lustre clients

6.1.2.2 Configuration Generation and Application

1. Install the Lustre RPMs (per Installing Lustre) on all the nodes that are going to be a part of the Lustre file system. Boot the nodes in the Lustre kernel, including the clients.

2. Change the modprobe.conf by adding the following line to it:

    options lnet networks=tcp

3. Start Lustre on the MGS node:

    $ mkfs.lustre --mgs /dev/sda

4. Make a mount point on MGS for the file system and mount it:

    $ mkdir -p /mnt/mgs
    $ mount -t lustre /dev/sda1 /mnt/mgs

5. Start Lustre on the MDT node:

    $ mkfs.lustre --fsname=datafs --mdt --mgsnode=mgsnode@tcp0 /dev/sda2

6. Make a mount point on MDT/MGS for the file
332. An older name for the OSD device driver.
ject device: An instance of an object that exports the OBD API.
Object storage: Refers to a storage-device API or protocol involving storage objects. The two most well-known instances of object storage are the T10 iSCSI storage object protocol and the Lustre object storage protocol (a network implementation of the Lustre object API). The principal difference between the Lustre and T10 protocols is that Lustre includes locking and recovery control in the protocol and is not tied to a SCSI transport layer.
opencache: A cache of open file handles. This is a performance enhancement for NFS.
Orphan objects: Storage objects for which there is no Lustre file pointing at them. Orphan objects can arise from crashes and are automatically removed by an llog recovery. When a client deletes a file, the MDT gives back a cookie for each stripe. The client then sends the cookie and directs the OST to delete the stripe. Finally, the OST sends the cookie back to the MDT to cancel it.
Orphan handling: A component of the metadata service which allows for recovery of open, unlinked files after a server crash. The implementation of this feature retains open, unlinked files as orphan objects until it is determined that no clients are using them.
OSC: Object Storage Client. The client unit talking to an OST (via an OSS).
OSD: Object Storage Device. A generic, industry term for storage device
OSS
OST
Pdirops
333. k
    cfs21:~# rm /mnt/ostback/last_rcvd
    cfs21:~# umount /mnt/ostback

2. Mount the snapshot file system. For example:

    cfs21:~# mount -t lustre /dev/volgroup/MDTb1 /mnt/mdtback
    cfs21:~# mount -t lustre /dev/volgroup/OSTb1 /mnt/ostback
    cfs21:~# mount -t lustre cfs21:/back /mnt/back

3. Note the old directory contents, as of the snapshot time. For example:

    cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back
    fstab passwds

15.3.2 Deleting Old Snapshots

You can delete old snapshots to reclaim disk space. Run:

    lvremove /dev/volgroup/MDTb1

15.3.3 Changing Snapshot Volume Size

If you find that the daily deltas of changes are smaller or larger than you expected, you can shrink or extend the snapshot volumes. Run:

    lvextend -L10G /dev/volgroup/MDTb1

CHAPTER 16 POSIX

This chapter describes POSIX and includes the following sections:
- Installing POSIX
- Running POSIX Tests Against Lustre
- Isolating and Debugging Failures

Portable Operating System Interface (POSIX) is a set of standard operating system interfaces based on the Unix OS. POSIX defines file system behavior on a single Unix node. It is not a standard for clusters. POSIX specifies the user and software interfaces to the OS. Required program-level services include basic I/O (file, terminal, and network) services. POSIX also defines a standard threading library API which is supported by most modern operating systems. POSIX i
334. kB pages. If you are running i386 clients with ia64 servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size).

Required Lustre Software

To install Lustre, the following are required:
- Linux kernel patched with Lustre-specific patches (the patched Linux kernel is required only on Lustre MDSs and OSSs)
- Lustre modules compiled for the Linux kernel
- Lustre utilities required for Lustre configuration
- (Optional) Network-specific kernel modules and libraries (for example, kernel modules and libraries required for an InfiniBand interconnect)

These packages can be downloaded from the Lustre download site.

Required Tools and Utilities

Several third-party utilities are required:
- e2fsprogs: Lustre requires a recent version of e2fsprogs that understands extents. Use e2fsprogs-1.41.6 or later, available with the Lustre file downloads.
  Note: The Lustre-patched e2fsprogs utility only needs to be installed on machines that mount backend (ldiskfs) file systems, such as the OSS, MDS and MGS nodes. It does not need to be loaded on clients.
- Perl: Various userspace utilities are written in Perl. Any recent version of Perl will work with Lustre.
- Build tool/Compiler: If you plan to build Lustre from source code, then you need a GCC compiler (use GCC 3.0 or later). If you are installing Lustre from RPMs, you do not need a compiler.

Chapter 3: Installing Lustre
335. l OSTs of an OSS, run:

    root@oss1# lctl get_param obdfilter.*.readcache_max_filesize

22.2.8 mballoc History

/proc/fs/ldiskfs/sda/mb_history

Multi-Block-Allocate (mballoc) enables Lustre to ask ext3 to allocate multiple blocks with a single request to the block allocator. Typically, an ext3 file system allocates only one block per time. Each mballoc-enabled partition has this file. This is sample output:

[Sample mb_history output for pid 2838; columns: pid, inode, goal, result, found, grps, cr, merge, tail, broken. The individual rows are not recoverable from this excerpt.]

The parameters are described below.

Parameter: Description
pid: Process that made the allocation.
inode: inode number allocated blocks.
goal: Initial request that came to mballoc (group/block-in-group/number-of-blocks).
result: What mballoc actually
336. l has more support to read packets from clients to OSTs than to decode packets between clients and MDSs. The tcpdump module is available from Lustre CVS at www.sourceforge.net. It can be checked out as: cvs co -d :ext:<username>@cvs.lustre.org:/cvsroot/lustre tcpdump. Ptlrpc Request History. Each service always maintains request history, which is useful for first-occurrence troubleshooting. Ptlrpc history works as follows: 1. Request_in_callback() adds the new request to the service's request history. 2. When a request buffer becomes idle, it is added to the service's request buffer history list. 3. Buffers are culled from the service's request buffer history if it has grown above req_buffer_history_max, and their requests are removed from the service's request history. Request history is accessed and controlled via the following /proc files under the service directory: req_buffer_history_len (number of request buffers currently in the history); req_buffer_history_max (maximum number of request buffers to keep); req_history (the request history). Requests in the history include "live" requests that are actually being handled. Each line in req_history looks like: <seq>:<target NID>:<client ID>:<xid>:<length>:<phase> <svc specific>. Parameter / Description. seq: Request sequence number. target NID: Destination NID of the incoming request. client
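The three history steps listed above can be modelled in a few lines. This is only an illustrative sketch, not the ptlrpc implementation: the class ServiceHistory and its method names are hypothetical, and only req_buffer_history_max and req_buffer_history_len are names taken from the text.

```python
from collections import deque


class ServiceHistory:
    """Toy model of a service's request history (hypothetical names)."""

    def __init__(self, req_buffer_history_max):
        self.req_buffer_history_max = req_buffer_history_max
        self.req_history = {}          # seq -> id of the buffer holding the request
        self.buffer_history = deque()  # idle buffers, oldest first

    def request_in(self, seq, buffer_id):
        # Step 1: a newly arrived request is added to the request history.
        self.req_history[seq] = buffer_id

    def buffer_idle(self, buffer_id):
        # Step 2: an idle buffer joins the request buffer history list.
        self.buffer_history.append(buffer_id)
        # Step 3: cull the oldest buffers while the list exceeds the maximum,
        # dropping the requests that lived in each culled buffer.
        while len(self.buffer_history) > self.req_buffer_history_max:
            culled = self.buffer_history.popleft()
            stale = [s for s, b in self.req_history.items() if b == culled]
            for seq in stale:
                del self.req_history[seq]

    @property
    def req_buffer_history_len(self):
        return len(self.buffer_history)


history = ServiceHistory(req_buffer_history_max=2)
for seq, buf in [(1, "A"), (2, "A"), (3, "B"), (4, "C")]:
    history.request_in(seq, buf)
for buf in "ABC":
    history.buffer_idle(buf)
print(sorted(history.req_history), history.req_buffer_history_len)
```

In this model, raising req_buffer_history_max keeps older requests visible for longer at the cost of holding more buffers.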
337. lap each other The test consists of creating a new file writing it with data then reading the data back The IOR benchmark developed by LLNL tests system performance by focusing on parallel sequential read write operations that are typical of scientific applications To install and run the IOR benchmark 1 Satisfy the prerequisites to run IOR a Download lam 7 0 6 local area multi computer http www lam mpi org 7 0 download php b Obtain a Fortran compiler for the Fedora Core 4 operating system c Download the most recent version of the IOR software http sourceforge net projects ior sio Chapter 17 Benchmarking 17 3 17 4 Install the IOR software per the ReadMe file and User Guide accompanying the software Run the IOR software In user mode use the lamboot command to start the lam service and use appropriate Lustre specific commands to run IOR described in the IOR User Guide Sample Output TOR 2 9 0 MPI Coordinated Test of Parallel I O Run began Fri Sep 29 11 43 56 2006 Command line used IOR w r k O lustrestripecount 10 o test Machine Linux mds Summary api POSIX test filename test access single shared file clients 1 1 per node repetitions SN xfersize 262144 bytes blocksize 1 MiB aggregate filesize 1 MiB access bw MiB s block KiB xfer KiB open s wr rd s close s iter write 173 89 1024 00 256 00 0 0000300 0057010 0000160 read 278 49 1024 00 256 00 0 0000090 0
338. lding target data must be available as shared storage to failover server nodes Failover nodes are statically configured as mount options On the MDS run mkfs lustre mdt mgs fsname lt fsname gt failover lt failover MGS NID gt lt partition gt mount t lustre lt partition gt lt mount point gt On the OSS run mkfs lustre ost mgs fsname lt fsname gt mgsnode lt MGS NID gt lt failover MGS NID gt lt partition gt mount t lustre lt partition gt lt mount point gt On the client run mount t lustre lt MGS NID gt lt failover MGS NID gt lt fsname gt lt mount point gt IP Network Failover MDS and OSS On the MDS run mkfs lustre mdt mgs fsname lt fsname gt failover lt failover MGS NID gt lt partition gt mount t lustre lt partition gt lt mount point gt On the OSS run mkfs lustre ost mgs fsname lt fsname gt mgsnode lt MGS NID gt lt failover mds hostdesc gt failover lt failover OSS NID gt lt partition gt mount t lustre lt partition gt lt mount point gt On the client run mount t lustre lt MGS NID gt lt failover MGS NID gt lt fsname gt lt mount point gt Lustre 1 8 Operations Manual October 2009 4 5 1 4 5 2 4 5 3 Unmounting a Server without Failover To stop a server MDS or OSS without failover run umount lt mds oss mountpoint gt This stops the server unconditionally and cleans up client
339. le or directory is one that cannot be modified, renamed or removed. To do this: chattr +i <file>. To remove this flag, use chattr -i. Other I/O Options. This section describes other I/O options, including checksums. Lustre Checksums. To guard against network data corruption, a Lustre client can perform two types of data checksums: in-memory (for data in client memory) and wire (for data sent over the network). For each checksum type, a 32-bit checksum of the data read or written on both the client and server is computed, to ensure that the data has not been corrupted in transit over the network. The ldiskfs backing file system does NOT do any persistent checksumming, so it does not detect corruption of data in the OST file system. In Lustre 1.6.5 and later, the checksumming feature is enabled, by default, on individual client nodes. If the client or OST detects a checksum mismatch, then an error is logged in the syslog of the form: LustreError: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [102400-106495]. If this happens, the client will re-read or re-write the affected data up to five times to get a good copy of the data over the network. If it is still not possible, then an I/O error is returned to the application. To enable both types of checksums (in-memory and wir
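The checksum-and-retry behaviour described above can be sketched as follows. This is a simplified model under stated assumptions, not Lustre code: zlib.crc32 stands in for the 32-bit checksum, the names checked_transfer and flaky_channel are hypothetical, and "up to five times" is read here as one initial attempt plus at most five repeats.

```python
import zlib

MAX_RESENDS = 5  # "up to five times" in the text; interpretation is an assumption


def checked_transfer(data, send):
    """Send data through `send` and verify a 32-bit checksum on arrival.

    `send(data)` returns the bytes as they arrived at the peer.
    Returns (received_bytes, number_of_resends_needed).
    """
    expected = zlib.crc32(data)
    for attempt in range(1 + MAX_RESENDS):
        received = send(data)
        if zlib.crc32(received) == expected:
            return received, attempt
    # Every attempt arrived corrupted: surface an I/O error to the caller.
    raise IOError("checksum mismatch persisted after %d resends" % MAX_RESENDS)


def flaky_channel(failures):
    """Hypothetical channel that corrupts the first `failures` transfers."""
    state = {"left": failures}

    def send(data):
        if state["left"] > 0:
            state["left"] -= 1
            return bytes([data[0] ^ 0xFF]) + data[1:]  # flip the first byte
        return data

    return send


print(checked_transfer(b"lustre", flaky_channel(2)))
```

A channel that never delivers a clean copy makes checked_transfer raise, which corresponds to the I/O error returned to the application.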
340. le parameter. If this is omitted, then the first non-loopback IP interface that is up is used instead.
Variable (default) / Description
n_connd (4): Sets the number of connection daemons.
min_reconnect_interval (1, W): Minimum connection retry interval (in seconds). After a failed connection attempt, this sets the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry, up to a maximum of the max_reconnect_interval option.
max_reconnect_interval (60, W): Maximum connection retry interval (in seconds).
timeout (30, W): Time (in seconds) that communications may be stalled before the LND completes them with failure.
ntx (64): Number of "normal" message descriptors for locally-initiated communications that may block for memory (callers block when this pool is exhausted).
ntx_nblk (256): Number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted.
fma_cq_size (8192): Number of entries in the RapidArray FMA completion queue to allocate. It should be increased if the ralnd starts to issue warnings that the FMA CQ has overflowed. This is only a performance issue.
max_immediate (2048, W): Size (in bytes) of the smallest message that will be RDMA'd, rather than being included as immediate data in an FMA. All messages greater than 6912 bytes must be RDMA'd (FMA limit
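The doubling retry interval described for min_reconnect_interval and max_reconnect_interval can be illustrated with a short calculation. This is a sketch of the arithmetic only; the function name reconnect_schedule is hypothetical, and the defaults are the values listed above.

```python
def reconnect_schedule(min_interval=1, max_interval=60, attempts=8):
    """Return the wait (in seconds) before each successive retry."""
    waits = []
    wait = min_interval
    for _ in range(attempts):
        waits.append(wait)
        # Double after every failed attempt, capped at the maximum interval.
        wait = min(wait * 2, max_interval)
    return waits


print(reconnect_schedule())  # [1, 2, 4, 8, 16, 32, 60, 60]
```

With the defaults, the interval reaches the 60-second cap on the seventh retry and stays there.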
341. line, upgraded and restarted, or failed over to a standby server with the new software. All active jobs continue to run without failures; they merely experience a delay. Lustre MDSs are configured as an active/passive pair, while OSSs are typically deployed in an active/active configuration that provides redundancy without extra overhead, as shown in FIGURE 1-8. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster. FIGURE 1-8: Lustre failover configurations for OSSs and MDSs. [Figure: shared storage partitions for the OSS targets (OSTs) and for the MDS target (MDT). OSS1 is active for target 1 and standby for target 2; OSS2 is active for target 2 and standby for target 1. MDS1 is active for the MDT; MDS2 is standby for the MDT.] Although a file system checking tool (lfsck) is provided for disaster recovery, journaling and sophisticated protocols re-synchronize the cluster within seconds, without the need for a lengthy fsck. Lustre version interoperability between successive minor versions is guaranteed. As a result, the Lustre failover capability is used regularly to upgrade the software without cluster downtime. Note: Lustre does not provide redundancy for data; it depends exclusively on redundancy of backing storage devices. The backing OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 0+1
342. lity is a test program you can use to generate large loads on local or remote OSTs or echo servers. For more information on loadgen and its usage, refer to: https://mail.clusterfs.com/wikis/lustre/LoadGen. llog_reader: The llog_reader utility translates a Lustre configuration log into human-readable form. lr_reader: The lr_reader utility translates a last received (last_rcvd) file into human-readable form. Flock Feature. Lustre now includes the flock feature, which provides file locking support. Flock describes classes of file locks known as "flocks". Flock can apply or remove a lock on an open file, as specified by the user. However, a single file may not simultaneously have both shared and exclusive locks. By default, the flock utility is disabled on Lustre. Two modes are available. Local mode: In this mode, locks are coherent on one node (a single-node flock), but not across all clients. To enable it, use -o localflock. This is a client-mount option. NOTE: This mode does not impact performance and is appropriate for single-node databases. Consistent mode: In this mode, locks are coherent across all clients. To enable it, use -o flock. This is a client-mount option. CAUTION: This mode has a noticeable performance impact and may affect stability, depending on the Lustre version used. Consider using a newer Lustre version which is more stable. A
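The rule that a file cannot hold shared and exclusive locks at the same time can be observed with the standard flock interface. The sketch below assumes a Linux host and runs against a temporary local file, so it demonstrates generic flock semantics rather than Lustre's localflock or flock mount modes; the helper names try_lock and demo are hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(fd, mode):
    """Attempt a non-blocking flock; return True if the lock was granted."""
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def demo():
    results = {}
    with tempfile.NamedTemporaryFile() as tmp:
        # Two independent opens of the same file behave like two lock owners.
        a = os.open(tmp.name, os.O_RDWR)
        b = os.open(tmp.name, os.O_RDWR)
        try:
            results["a_exclusive"] = try_lock(a, fcntl.LOCK_EX)
            # A shared lock cannot coexist with the exclusive lock held via a.
            results["b_shared_while_a_exclusive"] = try_lock(b, fcntl.LOCK_SH)
            fcntl.flock(a, fcntl.LOCK_UN)
            results["b_shared_after_unlock"] = try_lock(b, fcntl.LOCK_SH)
        finally:
            os.close(a)
            os.close(b)
    return results


print(demo())
```

On a Lustre client the same calls would only be coherent across nodes when the file system is mounted in consistent mode.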
343. ll it on the affected nodes. Make sure you are using the most recent version of the diagnostics tool. 1. Once you have a Lustre Bugzilla account, open a new bug and describe the problem you are having. 2. Run the Lustre diagnostics tool, using one of the following commands: lustre-diagnostics -t <bugzilla bug>, or lustre-diagnostics. In case you need to use it later, the output is sent directly to the terminal. Normal file redirection can be used to send the output to a file, which you can manually attach to this bug, if necessary. Common Lustre Problems and Performance Tips. This section describes common issues encountered with Lustre, as well as tips to improve Lustre performance. Recovering from an Unavailable OST. One of the most common problems encountered in a Lustre environment is when an OST becomes unavailable because of a network partition, OSS node crash, etc. When this happens, the OST's clients pause and wait for the OST to become available again, either on the primary OSS or a failover OSS. When the OST comes back online, Lustre starts a recovery process to enable clients to reconnect to the OST. Lustre servers put a limit on the time they will wait in recovery for clients to reconnect. During recovery, clients reconnect and replay their requests serially, in the same order they were done originally. Periodically, a progress messag
344. lover 8 23 Likewise, the Lustre health_check mechanism does not provide perfect protection against any or all failures. It is a sample taken at a time interval, not something that brackets each and every I/O request. There are a few places where health_check could generate a bad status: On a device basis, if there are requests that have not been processed in a very long time (more than the maximum allowed timeout), a CERROR is printed: "service unhealthy, request has been waiting Ns", where Ns is the number of seconds. The CERROR displays a true value for Ns, for example, "request has been waiting 100s". If the backing file system has gone read-only due to file system errors. On a per-device basis, if any of the above failed, it is reported in the /proc/fs/lustre/health_check file: "device <device> reported unhealthy". If ANY device or service on the node is unhealthy, it also prints NOT HEALTHY. If ALL devices and services on the node are healthy, it prints healthy. There will be cases where a user job will die prior to the HA software triggering a failover. You can certainly shorten timeouts, add monitoring, and take other steps to decrease this probability. But there is a serious trade-off: shortening timeouts increases the probability of false-triggering on a busy system. Increasing monitoring takes system resources and can likewise cause a false trigger. Unfortunately, hard failover solutions capable of catching fa
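A monitoring script could interpret the health_check contents described above along the following lines. This is a minimal sketch: the function name parse_health_check and the sample device name are hypothetical, and the exact layout of the file is assumed from the three message forms quoted in the text.

```python
import re


def parse_health_check(text):
    """Classify health_check contents.

    Returns (healthy, unhealthy_devices). The node counts as healthy only
    when the text says "healthy" and reports no unhealthy device.
    """
    unhealthy = re.findall(r"device (\S+) reported unhealthy", text)
    healthy = (
        "NOT HEALTHY" not in text
        and not unhealthy
        and "healthy" in text
    )
    return healthy, unhealthy


# Hypothetical samples; on a server the text would be read from
# /proc/fs/lustre/health_check.
print(parse_health_check("healthy\n"))
print(parse_health_check("device lustre-OST0000 reported unhealthy\nNOT HEALTHY\n"))
```

Because health_check is a periodic sample, a script like this can only report the state at the moment of reading, which is the limitation the passage above points out.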
345. ltiple incremental backups of a Lustre file system. LVM snapshots are most useful as a way to freeze the MDT/OST file systems and then do a consistent backup from the snapshot. Keep in mind that creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted. Also consider that LVM snapshots cost CPU cycles as new files are written, so taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. To get around this problem, create a new, backup file system and periodically back up new/changed files. Take periodic snapshots of this backup file system to create a series of compact "full" backups. To use LVM to make snapshots for backup purposes, the Lustre MDT and/or OST needs to be initially configured on an LVM block device. For information on configuring Lustre using LVM, see Configuring Lustre on LVM Devices. Creating LVM Snapshot Volumes. To make a checkpoint of the Lustre file system, create LVM snapshots of all the target disks in "main". You must set the maximum size of a snapshot ahead of time, although this setting can be changed later. The size of a daily snapshot depends on the amount of data that cha
346. luster to declare that an OST has failed so errors can be immediately returned Cluster File Systems Inc a United States corporation founded in 2001 by Peter J Braam to develop maintain and support Lustre Clustered metadata a collection of metadata targets implementing a single file system namespace An RPC made by an OST or MDT to another system usually a client to indicate that the lock request is now granted An llog file used in a node or retrieved from a management server over the network with configuration instructions for Lustre systems at startup time A lock held by every node in the cluster to control configuration changes When callbacks are received the nodes quiesce their traffic cancel the lock and await configuration changes after which they reacquire the lock before resuming normal operation Glossary 1 D Default stripe pattern Direct I O Directory stripe descriptor EA Eviction Export Extent Lock F Failback Failout OST Information in the LOV descriptor that describes the default stripe count used for new files in a file system This can be amended by using a directory stripe descriptor or a per file stripe descriptor A mechanism which can be used during read and write system calls It bypasses the kernel 1 0 cache to memory copy of data between kernel and application memory address spaces An extended attribute that describes the default stripe pattern for files und
347. lustre mgsnode@tcp0:/bar /mnt/lustre2. Setting Lustre Parameters. There are several options for setting parameters in Lustre. When the file system is created: parameters can be set using the mkfs.lustre command. When a server is stopped: parameters can be set on an idle server using the tunefs.lustre command. When the file system is running: parameters can be set on a live file system using the lctl command. Setting Parameters with mkfs.lustre. When the file system is created, parameters can simply be added as a --param option to the mkfs.lustre command. For example: mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda. Setting Parameters with tunefs.lustre. If a server (OSS or MDS) is stopped, parameters can be added using the --param option to the tunefs.lustre command. For example: tunefs.lustre --param="failover.node=192.168.0.13@tcp0" /dev/sda. With tunefs.lustre, parameters are "additive": new parameters are specified in addition to old parameters; they do not replace them. To erase all old tunefs.lustre parameters and just use newly-specified parameters, run: tunefs.lustre --erase-params --param=<new parameters>. The tunefs.lustre command can be used to set any parameter settable in a /proc/fs/lustre file and that has its own OBD device, so it can be specified as <obd|fsname>.<obdtype>.<proc_file_name>=<valu
348. lustre1 Note: If a client(s) will be mounted on several file systems, add the following line to the /etc/xattr.conf file to avoid problems when files are moved between the file systems: lustre.* skip. Note: The MGS is universal; there is only one MGS per Lustre installation, not per file system. Note: There is only one file system per MDT. Therefore, specify --mdt --mgs on one file system and --mdt --mgsnode=<MGS node NID> on the other file systems. A Lustre installation with two file systems (foo and bar) could look like this, where the MGS node is mgsnode@tcp0 and the mount points are /mnt/lustre1 and /mnt/lustre2:
mgsnode# mkfs.lustre --mgs /mnt/lustre1
mdtfoonode# mkfs.lustre --fsname=foo --mdt --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /mnt/lustre2
mdtbarnode# mkfs.lustre --fsname=bar --mdt --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /mnt/lustre2
To mount a client on file system foo at mount point /mnt/lustre1, run: mount -t lustre mgsnode@tcp0:/foo /mnt/lustre1. To mount a client on file system bar at mount point /mnt/lustre2, run: mount -t
349. m Aug 9 09 50 44 oss161 crmd 4733 info update dc Set DC to lt null gt lt nul1l gt Aug 9 09 50 44 oss161 crmd 4733 info do_election_check Still waiting on 2 non votes 2 total Aug 9 09 50 44 oss161 crmd 4733 info do_election_count_vote Updated voted hash for oss161 clusterfs com to vote Aug 9 09 50 44 oss161 crmd 4733 info do_election_count_vote Election ignore our vote oss161 clusterfs com Aug 9 09 50 44 oss161 crmd 4733 info do_election_check Still waiting on 1 non votes 2 total Aug 9 09 50 44 oss161 crmd 4733 info do_state_transition State transition S_ELECTION gt S_PENDING input I_PENDING cause C_FSA_INTERNAL origin do_election_count_vote Aug 9 09 50 44 oss161 crmd 4733 info update dc Set DC to lt null gt lt null gt Aug 9 09 50 44 oss161 crmd 4733 info do_dc_release DC role released Aug 9 09 50 45 oss161 crmd 4733 info do_election_count_vote Election check vote from oss162 clusterfs com Aug 9 09 50 45 oss161 crmd 4733 info update dc Set DC to lt null gt lt nul1l gt Aug 9 09 50 46 oss161 crmd 4733 info update dc Set DC to oss162 clusterfs com 1 0 9 Aug 9 09 50 47 oss161 crmd 4733 info update dc Set DC to oss161 clusterfs com 1 0 9 Aug 9 09 50 47 oss161 cib 4729 info cib replace notify Local only Replace 0 0 1 from lt null gt Aug 9 09 50 47 oss161 crmd 4733 info do_state_transition State transition S_PENDING g
350. mber of Linux kernels offer software RAID support, by which the kernel organizes disks into a RAID array. All Lustre-supported kernels have software RAID capability, but Lustre has added performance improvements to the RHEL 4 and RHEL 5 kernels that make operations even faster. Therefore, if you are using software RAID functionality, we recommend that you use a Lustre-patched RHEL 4 or 5 kernel to take advantage of these performance improvements, rather than a SLES kernel. Enabling Software RAID on Lustre. This procedure describes how to set up software RAID on a Lustre system. It requires use of mdadm, a third-party tool to manage devices using software RAID. 1. Install Lustre, but do not configure it yet. See Installing Lustre. 2. Create the RAID array with the mdadm command. The mdadm command is used to create and manage software RAID arrays in Linux, as well as to monitor the arrays while they are running. To create a RAID array, use the --create option and specify the MD device to create, the array components, and the options appropriate to the array. Note: For best performance, we generally recommend using disks from as many controllers as possible in one RAID array. To illustrate how to create a software RAID array for Lustre, the steps below include a worked example that creates a 10-disk RAID 6 array from disks /dev/dsk/c0t0d0 through c0t0d4 and /dev/dsk/c1t0d0 through c1t0d4. This RAID array has no spares. For the 10 di
351. mber to indicate the error. llapi Errors. llapi errors are described below.
Errors / Description
EFAULT: qctl is invalid.
ENOSYS: Kernel or Lustre modules have not been compiled with the QUOTA option.
ENOMEM: Insufficient memory to complete operation.
ENOTTY: qc_cmd is invalid.
EBUSY: Cannot process during quotacheck.
ENOENT: UUID does not correspond to OBD or mnt does not exist.
EPERM: The call is privileged and the caller is not the super user.
ESRCH: No disk quota is found for the indicated user. Quotas have not been turned on for this file system.
29.1.5 llapi_path2fid. Use llapi_path2fid to get the FID from the pathname.
Synopsis:
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
int llapi_path2fid(const char *path, unsigned long long *seq, unsigned long *oid, unsigned long *ver)
Description: The llapi_path2fid function returns the FID (sequence : object ID : version) for the pathname. Return Values: llapi_path2fid returns 0 on success, non-zero value on failure. CHAPTER 30 Configuration Files and Module Parameters (man5). This section describes configuration files and module parameters and includes the following sections: Introduction; Module Options. 30.1 Introduction. LNET network hardware and routing are no
352. mbers and location of a specific file Synopsis lfs lfs lfs lfs print p check lt mds osts servers gt df i h pool p lt fsname gt lt pool gt path find atime A N mtime M N ctimel C N print0O P obd 0 lt uuid s gt size S N kMGTPE type t bcdflpsD gid g N uid u N lt dirname filename gt lfs lfs lfs lfs lfs lfs lfs lfs lfs lfs lfs lfs lfs osts EL maxdepth D N name n pattern group G lt name gt user U lt name gt getstripe obd 0 lt uuid gt quiet q verbose v recursive r lt dirname filename gt setstripe size offset lt dirname s stripe size count c stripe cnt o start ost pool p lt pool gt filename gt setstripe d lt dirname gt poollist lt filename lt pool gt lt pathname gt quota v o obd_uuid u lt username gt g lt groupname gt lt filesystem gt quota t lt u g gt lt filesystem gt quotacheck ugf lt filesystem gt quotachown i lt filesystem gt quotaon ugf lt filesystem gt quotaoff ug lt filesystem gt quotainv ug f lt filesystem gt setquota lt u use block r g group gt lt username groupname gt softlimit lt block softlimit gt block hardlimit lt block hardlimit gt inode s
353. me of formatting. This ensures that the file system is optimized for the underlying disk geometry. Use the --mkfsoptions parameter to specify these options when formatting the OST or MDT. For RAID 5, RAID 6 and RAID 1+0 storage, specifying the -E stride=<chunksize> option improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps. The <chunksize> parameter is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is applicable to both MDS and OST file systems. For more information on how to override the defaults while formatting MDS or OST file systems, see Options to Format MDT and OST File Systems. Footnote 2: Client writeback cache improves performance for many small files or for a single large file alike. However, if the cache is filled with small files, cache flushing is likely to be much slower (because of less data being sent per RPC), so there may be a drop-off in total throughput. Creating an External Journal. If you have configured a RAID array and use it directly as an OST, it houses both data and metadata. For better performance, we recommend putting OST metadata on another journal device, by creating a small RAID 1 array and using it as an external journal for the OST. It is not known if external journals improve performance of MDTs. Currently, we recommend ag
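Since the stride value is expressed in 4096-byte blocks, converting a RAID chunk size to a stride is a single division. The helper below is a small sketch of that arithmetic; the name stride_blocks and the example chunk sizes are illustrative and do not come from the manual.

```python
BLOCK_SIZE = 4096  # bytes per block, as stated in the text


def stride_blocks(chunk_bytes, block_size=BLOCK_SIZE):
    """Convert a RAID chunk size in bytes to a stride in 4096-byte blocks."""
    if chunk_bytes <= 0 or chunk_bytes % block_size:
        raise ValueError("chunk size must be a positive multiple of the block size")
    return chunk_bytes // block_size


# Example chunk sizes (assumed for illustration): 64 KiB and 128 KiB.
for chunk_kib in (64, 128):
    print("%d KiB chunk -> -E stride=%d" % (chunk_kib, stride_blocks(chunk_kib * 1024)))
```

For a 128 KiB chunk this gives a stride of 32, which is the value that would be passed through the format options.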
354. mer does NOT mean that the thread OOPSed but rather that it is taking longer time than expected to complete a given operation In some cases this situation is expected For example if a RAID rebuild is really slowing down I O on an OST it might trigger watchdog timers to trip But another message follows shortly thereafter indicating that the thread in question has completed processing after some number of seconds Generally this indicates a transient problem In other cases it may legitimately signal that a thread is stuck because of a software error lock inversion for example Lustre 0 0 watchdog c 122 1lcw_chb The above message indicates that the watchdog is active for pid 933 It was inactive for 100000ms Lustre 0 0 linux debug c 132 portals_debug_dumpstack Showing stack for process 933 ll_ost_25 D F896071A 0 933 1 934 932 L TLB 6d87c60 00000046 00000000 896071a f8def7cc 00002710 00001822 2da48cae 0008cf1a f6d7c220 f6d7c3d0 6d86000 3529648 f6d87cc4 3529640 8961d3d 00000010 6d87c9c ca65al3c 00001fff 00000001 00000001 00000000 00000001 Chapter 21 Lustre Monitoring and Troubleshooting 21 17 21 4 15 Call trace filter_do_bio 0x3dd 0xb90 obdfilter default_wake_function 0x0 0x20 filter_direct_io 0x2fb 0x990 obdfilter filter_preprw_read 0x5c5 0xe00 obdfilter lustre_swab_niobuf_remote 0x0 0x30 ptlrpc ost_brw_read 0x18df 0x2400 ost ost_handle 0x14c2 0x42d0 ost ptlrpc_server_handle_
355. message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted. Maximum number of queue pairs and, therefore, the maximum number of peers that the instance of the LND may communicate with. Boolean that determines whether messages (NB: not RDMAs) should be check-summed. This is a diagnostic feature that should not normally be enabled. 30.2.7 Portals LND (Linux). The Portals LND Linux (ptllnd) can be used as an interface layer to communicate with Sandia Portals networking devices. This version is intended to work on Cray XT3 Linux nodes that use Cray Portals as a network transport. Message Buffers. When ptllnd starts up, it allocates and posts sufficient message buffers to allow all expected peers (set by concurrent_peers) to send one unsolicited message. The first message that a peer actually sends is a (so-called) "HELLO" message, used to negotiate how much additional buffering to set up (typically 8 messages). If 10000 peers actually exist, then enough buffers are posted for 80000 messages. The maximum message size is set by the max_msg_size module parameter (default value is 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint, payload data is sent in the message itself. Above this breakpoint, a buffer descriptor is sent and the receiver gets the actual payload. The buffer size is set by th
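The buffer arithmetic and the bulk transfer breakpoint described above can be checked with a small calculation. This is a rough sketch, not ptllnd code: both function names are hypothetical, the comparison ignores any message header overhead, and treating a payload exactly at the breakpoint as inline is an assumption the text does not settle.

```python
def posted_message_buffers(peers, negotiated_per_peer=8):
    """Buffers posted once every peer has negotiated its extra buffering."""
    return peers * negotiated_per_peer


def transfer_mode(payload_bytes, max_msg_size=512):
    """Pick how a payload travels relative to the bulk transfer breakpoint."""
    # Assumption: a payload exactly at the breakpoint is still sent inline.
    if payload_bytes <= max_msg_size:
        return "inline"       # payload data is carried in the message itself
    return "descriptor"       # a buffer descriptor is sent; the receiver fetches the payload


print(posted_message_buffers(10000))  # 80000
print(transfer_mode(256), transfer_mode(4096))
```

The first result reproduces the 80000-message figure quoted for 10000 peers.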
356. met: Supported Operating System, Platform and Interconnect; Required Lustre Software; Required Tools and Utilities; Optional High-Availability Software; Debugging Tools; Environmental Requirements; Memory Requirements. Supported Operating System, Platform and Interconnect. Lustre 1.8 supports the following operating systems, platforms and interconnects. Make sure you are using a supported configuration.
Configuration Component / Supported Type
Operating system: Red Hat Enterprise Linux 5; SuSE Linux Enterprise Server 10; SuSE Linux Enterprise Server 11 (i686 and x86_64 only, Lustre 1.8.1); Linux kernel 2.6.16 or greater. NOTE: Lustre does not support security-enhanced (SE) Linux (including clients and servers).
Platform: x86, IA-64, x86-64 (EM64 and AMD64), PowerPC architectures (for clients only), and mixed-endian clusters. Footnote 2: We encourage the use of 64-bit platforms.
Interconnect: TCP/IP; Quadrics Elan 3 and 4; Myri-10G and Myrinet-2000; Mellanox InfiniBand, Voltaire, OpenIB, Silverstorm, and any OFED-supported InfiniBand adapter.
Note: Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ia64 clients with large pages (up to 64kB pages) can run with i386 servers 4
357. mine Appropriate Mount Parameters for Clients 2 4 2 4 Configuring LNET 2 5 24 1 Module Parameters 2 5 2 4 2 Module Parameters Routing 2 8 24 3 Downed Routers 2 12 2 5 Starting and Stopping LNET 2 13 2 5 1 Starting LNET 2 13 2 5 2 Stopping LNET 2 14 vi Lustre 1 8 Operations Manual October 2009 Part II Lustre Administration 3 Installing Lustre 3 1 3 1 3 2 3 3 Preparing to Install Lustre 3 2 3 1 1 Supported Operating System Platform and Interconnect 3 2 3 12 Required Lustre Software 3 3 3 1 3 Required Tools and Utilities 3 3 3 14 Optional High Availability Software 3 4 3 15 Debugging Tools 3 4 3 16 Environmental Requirements 3 5 3 17 Memory Requirements 3 6 Installing Lustre from RPMs 3 9 Installing Lustre from Source Code 3 13 3 3 1 Patching the Kernel 3 13 3 3 2 Create and Install the Lustre Packages 3 16 3 3 3 Installing Lustre with a Third Party Network Stack 3 19 Configuring Lustre 4 1 4 1 4 2 4 3 Configuring the Lustre File System 4 2 41 1 Scaling the Lustre File System 4 10 Additional Lustre Configuration 4 11 42 1 Configuring Lustre on LVM Devices 4 11 Basic Lustre Administration 4 13 43 1 Specifying the File System Name 4 14 4 3 2 Mounting a Server 4 14 4 3 3 Unmounting a Server 4 15 434 Working with Inactive OSTs 4 16 4 3 5 Finding Nodes in the Lustre File System 4 16 4 3 6 Mounting a Server Without Lustre Service 4 17 Contents vii 43 7 Specifying Failout Failover Mode
358. mnt/myfilesystem. The Management Service is running on a node reachable from this client via the cfs21@tcp0 NID: mount -t lustre cfs21@tcp0:/testfs /mnt/myfilesystem. Starts the Lustre target service on /dev/sda1: mount -t lustre /dev/sda1 /mnt/test/mdt. Starts the testfs-MDT0000 service (using the disk label), but aborts the recovery process: mount -t lustre -L testfs-MDT0000 -o abort_recov /mnt/test/mdt. Note: If the Service Tags tool (from the sun-servicetag package) can be found in /opt/sun/servicetag/bin/stclient, an inventory service tag is created reflecting the Lustre service being provided. If this tool cannot be found, mount.lustre silently ignores it and no service tag is created. The stclient(1) tool only creates the local service tag. No information is sent to the asset management system until you run the Registration Client to collect the tags and then upload them to the inventory system using your inventory system account. For more information, see Service Tags. Additional System Configuration Utilities. This section describes additional system configuration utilities that were added in Lustre 1.6. lustre_rmmod.sh: The lustre_rmmod.sh utility removes all Lustre and LNET modules (assuming no Lustre services are running). It is located in /usr/bin. Note: The lustre_rmmod.sh utility does not work if Lustre modules are bei
359. more Appendix A Lustre Knowledge Base A 17 A 18 Note CluManager requires two 10M LUNs visible to each member of a failover group For more information see http www redhat com docs manuals enterprise RHEL 3 Manual cluster suite For more download see http ftp redhat com pub redhat linux enterprise 3 en RHCS i386 SRPMS In the future we hope to publish more information and sample scripts to configure Heartbeat and CluManager with Lustre Is there a way to tell which OST is being used by a client process If a process is doing I O to a file use the lfs getstripe command to see the OST to which it is writing Using cat as an example run cat gt foo While that is running on another terminal run readlink proc pidof cat fd 1 barn users jacob tmp foo You can also Is 1 proc lt pid gt fd to find open files using Lustre lfs getstripe readlink proc pidof cat fd 1 OBDS 0 databarn ost1_UUID ACTIVI 1 databarn ost2_UUID ACTIVI 2 databarn ost3_UUID ACTIVI 3 databarn ost4_UUID ACTIVI barn users jacob tmp foo obdidx objid objid group 2 835487 Oxcbf9Ff 0 Li ST ES E The output shows that this file lives on obdidx 2 which is databarn ost3 Lustre 1 8 Operations Manual October 2009 To see which node is serving that OST run cat proc fs lustre osc databarn ost3 ost_conn_uuid NID_oss1 databarn 87k net_UUID The above also works with connections to the
360. mple tmp mdsdb is the database file e2fsck n v mdsdb tmp mdsdb dev mdsdev Lustre 1 8 Operations Manual October 2009 Example e2fsck n v mdsdb tmp mdsdb dev sdb e2fsck 1 39 cfsl 29 May 2006 Warning skipping journal recovery because doing a read only filesystem check lustre MDT0000 contains a file system with errors check forced Pass 1 Checking inodes blocks and sizes MDS ost_idx 0 max_id 288 MDS got 8 bytes 1 entries in lov_objids MDS max_files 13 MDS num_osts 1 mds info db file written Pass 2 Checking directory structure Pass 3 Checking directory connectivity Pass 4 Checking reference counts Pass 5 Checking group summary information Free blocks count wrong 656160 counted 656058 Fix no Free inodes count wrong 786419 counted 786036 Fix no Pass 6 Acquiring information for lfsck MDS max_files 13 MDS num_osts 1 MDS lustre MDTO000 UUID mdt idx 0 compat 0x4 rocomp 0x1 incomp 0x4 lustre MDTO000 WARNING Filesystem still has errors kkkkkkxk 13 inodes used 0 2 non contiguous inodes 15 4 of inodes with ind dind tind blocks 0 0 0 130272 blocks used 16 0 bad blocks 1 large file 296 regular files 91 directories 0 character device files 0 block device files 0 fifos 0 links 0 symbolic links 0 fast symbolic links 0 sockets 387 files Chapter 27 User Utilities man1 27 15 27 16 3 Make this file accessible on
361. mponent at a time and avoid the shutdown, see Performing a Rolling Upgrade.

Tip: In a Lustre upgrade, the package install and file system unmount steps are reversible; you can do either step first. To minimize downtime, this procedure first performs the 1.8.x package installation, and then unmounts the file system.

1. Make a complete, restorable file system backup before upgrading Lustre.

2. Install the 1.8.x packages on the Lustre servers and/or clients. Some or all servers can be upgraded. Some or all clients can be upgraded. For help determining where to install a specific package, see TABLE 3-1 (Lustre packages, descriptions and installation guidance).

   a. Install the kernel, modules and ldiskfs packages. For example:

          rpm -ivh kernel-lustre-smp-<ver> kernel-ib-<ver> lustre-modules-<ver> lustre-ldiskfs-<ver>

   b. Upgrade the utilities/userspace packages. For example:

          rpm -Uvh lustre-<ver>

   c. If a new e2fsprogs package is available, upgrade it. For example:

          rpm -Uvh e2fsprogs-<ver>

      There may or may not be a new e2fsprogs package with a Lustre upgrade. The e2fsprogs release schedule is independent of Lustre releases.

   d. (Optional) If you want to add optional packages to your Lustre system, install them now.

3. Shut down the file system. Shut down the components in this order: clients, then the MDT, then OSTs. Unmounting a block d
362. n OR you can specify reformat in the ninth field of the target line in the csv file m no fstab change don t modify etc fstab to add the new Lustre targets If using this option then the value of mount options item in the csv file will be passed to mkfs lustre else the value will be added into the etc fstab v verbose mode csv file is a spreadsheet that contains configuration parameters separated by commas for each target in a Lustre cluster Lustre 1 8 Operations Manual October 2009 Example 1 Simple Lustre configuration with CSV use the following command lustre config v a f lustre config csv This command starts the execution and configuration on the nodes or targets in lustre_config csv prompting you for the password to log in with root access to the nodes To avoid this prompt configure a shell like pdsh or SSH After completing the above steps the script makes Lustre target entries in the etc fstab file on Lustre server nodes such as dev sdb mnt mdtlustre defaults 0 o0 dev sda mnt ostlustre defaults 0 o0 2 Run mount dev sdb and mount dev sda to start the Lustre services Note Use the usr sbin lustre_createcsv script to collect information on Lustre targets from running a Lustre cluster and generating a CSV file It is a reverse utility compared to lustre_config and should be run on the MGS node Example 2 More complicated Lustre configuration with CSV For RAID and LVM based c
363. n most cases, you do not need to specify the <direction> part.

Examples:

Apply krb5i on ALL connections:

    mgs> lctl conf_param lustre.srpc.flavor.default=krb5i

For nodes in network tcp0, use krb5p. All other nodes use null:

    mgs> lctl conf_param lustre.srpc.flavor.tcp0=krb5p
    mgs> lctl conf_param lustre.srpc.flavor.default=null

For nodes in network tcp0, use krb5p; for nodes in elan1, use plain. Among other nodes, clients use krb5i to MDT/OST, MDTs use null to other MDTs, and MDTs use plain to OSTs:

    mgs> lctl conf_param lustre.srpc.flavor.tcp0=krb5p
    mgs> lctl conf_param lustre.srpc.flavor.elan1=plain
    mgs> lctl conf_param lustre.srpc.flavor.default.cli2mdt=krb5i
    mgs> lctl conf_param lustre.srpc.flavor.default.cli2ost=krb5i
    mgs> lctl conf_param lustre.srpc.flavor.default.mdt2mdt=null
    mgs> lctl conf_param lustre.srpc.flavor.default.mdt2ost=plain

11.2.2.7 Authenticating Normal Users

On client nodes, non-root users must use kinit to access Lustre (just like other Kerberized applications). kinit is used to obtain and cache Kerberos ticket-granting tickets. There are two requirements for authenticating users:

- Before kinit is run, the user must be registered as a principal with the Kerberos server (the Key Distribution Center, or KDC). In the KDC, the username is noted as username@REALM.
- The client and MDT nodes should have the same user database.

To destroy the established security contex
364. n network:

    options lnet networks=tcp1 routes="elan 1 192.168.2.2@tcp1"

The hopcount is used to help choose the best path between multiply-routed configurations.

A simple but powerful expansion syntax is provided, both for target networks and router NIDs, as follows:

    <expansion> :== "[" <entry> { "," <entry> } "]"
    <entry> :== <numeric range> | <non-numeric item>
    <numeric range> :== <number> [ "-" <number> [ "/" <number> ] ]

The expansion is a list enclosed in square brackets. Numeric items in the list may be a single number, a contiguous range of numbers, or a strided range of numbers. For example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent (hopcount defaults to 1), and is accessible via 3 routers on the tcp0 network (192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp).

routes="[tcp,vib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and vib0) are accessible through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means that traffic to both these networks traverses 2 routers: first one of the routers specified in this entry, then one more.

Duplicate entries, entries that route to a local network, and entries that specify routers on a non-local network are ignored. Equivalent entries are resolved in favor of the route with the shorter hopcount. The
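The bracket expansion syntax described above can be illustrated with a short Python sketch. This is not LNET's parser; the function names `expand_entry` and `expand` are hypothetical, and the code only assumes the grammar as stated (single numbers, contiguous ranges `a-b`, strided ranges `a-b/s`, and non-numeric items).

```python
import re
from itertools import product

def expand_entry(entry):
    """Expand one list entry: '22', '22-24', '8-14/2', or a non-numeric item."""
    entry = entry.strip()
    m = re.fullmatch(r"(\d+)(?:-(\d+)(?:/(\d+))?)?", entry)
    if not m:
        return [entry]  # non-numeric item, e.g. 'tcp'
    low = int(m.group(1))
    high = int(m.group(2)) if m.group(2) else low
    stride = int(m.group(3)) if m.group(3) else 1
    return [str(n) for n in range(low, high + 1, stride)]

def expand(text):
    """Expand every [..] list in text and return all resulting strings."""
    # re.split with a capture group alternates literal text and bracket contents.
    parts = re.split(r"\[([^\]]*)\]", text)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])          # literal text outside brackets
        else:
            items = []
            for entry in part.split(","):
                items.extend(expand_entry(entry))
            choices.append(items)           # alternatives from one bracket list
    return ["".join(combo) for combo in product(*choices)]

if __name__ == "__main__":
    print(expand("192.168.1.[22-24]@tcp"))
    print(expand("[8-14/2]@elan"))
    print(expand("[tcp,vib]"))
```

Running the sketch on the two examples from the text is a quick way to check a routes string by eye before placing it in the module options.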
365. n MDT inode ratio of 1024 bytes per inode, a 2 TB MDT holds 2B inodes, and a 4 TB MDT holds 4B inodes, the maximum number of inodes currently supported by ext3.

Inode Size for the MDT

Lustre uses "large" inodes on backing file systems to efficiently store Lustre metadata with each file. On the MDT, each inode is at least 512 bytes in size (by default), while on the OST each inode is 256 bytes in size. Lustre (or more specifically the backing ext3 file system) also needs sufficient space for other metadata, like the journal (up to 400 MB), bitmaps and directories. There are also a few regular files that Lustre uses to maintain cluster consistency.

To specify a larger inode size, use the -I <inodesize> option. We do NOT recommend specifying a smaller-than-default inode size, as this can lead to serious performance problems; you cannot change this parameter after formatting the file system. The inode ratio must always be larger than the inode size.

Footnote 1: An inode stores basic information (metadata) about its associated file.

20.3.3.3 Number of Inodes for OST

For OST file systems, it is normally advantageous to take local file system usage into account. Try to minimize the number of inodes created on each OST. This helps reduce the format and e2fsck runtime, and makes more space available for data.

Note: In addition to the number of inodes, e2fsck runtime on OSTs is affected by a n
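The inode arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `mdt_inode_count` is hypothetical, "TB" is assumed to mean 2^40 bytes, and the 1024-byte ratio and 512-byte inode size are simply the figures quoted in the text.

```python
def mdt_inode_count(mdt_bytes, inode_ratio=1024, inode_size=512):
    """Approximate number of inodes for an MDT of mdt_bytes bytes.

    inode_ratio is bytes of device space per inode; the text requires it
    to be larger than the inode size, so that case is rejected here.
    """
    if inode_ratio <= inode_size:
        raise ValueError("inode ratio must be larger than the inode size")
    return mdt_bytes // inode_ratio

if __name__ == "__main__":
    TIB = 2 ** 40  # assumption: 1 TB taken as 2^40 bytes
    for size_tb in (2, 4):
        print(size_tb, "TB MDT:", mdt_inode_count(size_tb * TIB), "inodes")
```

With these assumptions, a 2 TB MDT comes out at 2^31 inodes and a 4 TB MDT at 2^32, which matches the "2B" and "4B" figures quoted above.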
366. n a cluster means that most of the operations are atomic Clients can not see the metadata POSIX offers strict mandatory locking which gives guarantee of semantics Users do not have control on these locks The current Lustre POSIX is comparable with NFS Future Lustre releases promise strong security with features like GSS Kerberos 5 This enables graceful handling of users from multiple realms which in turn introduce multiple UID and GID databases Note Although used mainly with UNIX systems the POSIX standard can apply to any operating system 16 1 16 1 Installing POSIX To install POSIX used for testing Lustre 1 Download all POSIX files from http downloads clusterfs com public tools benchmarks posix m lits _ vsx pcts 1 0 1 2 tgz m install sh m myscen bld E myscen exec Caution Do not configure or mount a Lustre file system yet 2 Run the install sh script and select home tet for the root directory for the test suite installation 3 Install users and groups Accept the defaults for the packages to be installed 4 To avoid a bug in the installation scripts where the test directory is not created properly create a temporary directory to hold the POSIX tests when they are built mkdir p mnt lustre TESTROOT chown vsx0 vsxg0 5 Log in as the test user su vsx0 6 Build the test suite run setup sh Most of the defaults are correct except the root directory from whi
367. n on all Lustre nodes, and make sure other security extensions (like Novell AppArmor) and network packet filtering tools (such as iptables) do not interfere with Lustre.

3.1.7 Memory Requirements

This section describes the memory requirements of Lustre.

3.1.7.1 Determining the MDS's Memory

MDS memory requirements are determined by the following factors:

- Number of clients
- Size of the directories
- Extent of load

The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven primarily by the number of locks a client can hold at one time. The default maximum number of locks for a compute node is 100 * num_cores, and interactive clients can hold in excess of 10,000 locks at times. For the MDS, this works out to approximately 2 KB per file, including the Lustre DLM lock and kernel data structures for it, just for the current working set.

There is, by default, 400 MB for the file system journal, and additional RAM usage for caching file data for the larger working set that is not actively in use by clients, but should be kept "HOT" for improved access times. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from disk. Approximately 1.5 KB/file is needed to keep a file in cache.

For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes
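The sizing rules above (100 locks per core on compute nodes, roughly 2 KB per locked file, roughly 1.5 KB per cached file, and a 400 MB journal) can be combined into a rough estimate. The Python below is only a sketch under those stated figures; the function name and the sample inputs (4 cores per client, 10,000 locks per interactive node, 1,600,000 cached files) are illustrative assumptions, not values taken from the truncated example.

```python
KB_PER_LOCKED_FILE = 2.0    # approx. KB per file in the working set (from the text)
KB_PER_CACHED_FILE = 1.5    # approx. KB to keep a file in cache (from the text)
LOCKS_PER_CORE = 100        # default max locks per compute node = 100 * num_cores
JOURNAL_MB = 400            # default file system journal size

def estimate_mds_memory_kb(compute_clients, cores_per_client,
                           interactive_clients, locks_per_interactive,
                           cached_files):
    """Return a per-component MDS memory estimate in KB."""
    return {
        "journal": JOURNAL_MB * 1024,
        "compute_locks": (compute_clients * cores_per_client
                          * LOCKS_PER_CORE * KB_PER_LOCKED_FILE),
        "interactive_locks": (interactive_clients * locks_per_interactive
                              * KB_PER_LOCKED_FILE),
        "file_cache": cached_files * KB_PER_CACHED_FILE,
    }

if __name__ == "__main__":
    # Hypothetical inputs chosen for illustration only.
    parts = estimate_mds_memory_kb(1000, 4, 16, 10_000, 1_600_000)
    for name, kb in parts.items():
        print(f"{name:18s} {kb / 1024:10.1f} MB")
    print(f"{'total':18s} {sum(parts.values()) / 1024:10.1f} MB")
```

Keeping the components separate makes it easy to see which term dominates; in this illustrative case the file cache term is the largest.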
368. n route LNET between different networks; the node may be a server, a client, or a standalone router. LNET can route across different network types, such as TCP-to-Elan, or across different topologies, such as bridging two InfiniBand or TCP/IP networks.

Identify Network Interfaces to Include/Exclude from LNET

If not explicitly specified, LNET uses either the first available interface or a pre-defined default for a given network type. If there are interfaces that LNET should not use (such as administrative networks, IP over IB, and so on), then the included interfaces should be explicitly listed.

Determine Cluster-wide Module Configuration

The LNET configuration is managed via module options, typically specified in /etc/modprobe.conf or /etc/modprobe.conf.local (depending on the distribution). To ease the maintenance of large clusters, you can configure the networking setup for all nodes using a single, unified set of options in the modprobe.conf file on each node. For more information, see the ip2nets option in Modprobe.conf.

Users of liblustre should set the accept=all parameter. For details, see Module Parameters.

Determine Appropriate Mount Parameters for Clients

In mount commands, clients use the NID of the MDS host to retrieve their configuration information. Since an MDS may have more than one NID, a client should use the appropriate NID for its local network
369. n two successive regions when all threads are writing to the same file In the case of multiple files threads start writing in files at Offset bytes Parameter Description chunknoise N chunksize N N2 N3 chunksize_low L chunksize_high H chunksize_incr F cleanup directio posixio cowio offset O 02 03 offset_low OL offset_high OH offset_inc PH prerun pre command postrun post command N is a byte specifier When performing an I O task add a random signed integer in the range N N to the chunksize All regions are still fully written This randomizes the I O size to some extent N is a byte specifier and performs I O in chunks of N kilo mega giga or terabyte You can give a comma separated list of multiple values This argument is mutually exclusive with chunksize_low Note that each thread allocates a buffer of size chunksize chunknoise for use during the run Performs a sequence of operations starting with a chunksize of L increasing it by F each time until chunksize exceeds H Removes files that were created during the run If there is an encounter for existing files they are over written One of these arguments must be passed to indicate if DIRECT I O POSIX I O or COW I O is used The argument is a byte specifier or a list of specifiers Each run uses regions at offset multiple of O in a single file If the run targets multiple f
370. nces in both mechanisms:

- The router pinger only checks routers for their health, while LNDs notice all dead peers, regardless of whether they are a router or not.
- The router pinger actively checks the router health by sending pings, but LNDs only notice a dead peer when there is network traffic going on.
- The router pinger can bring a router from alive to dead or vice versa, but LNDs can only bring a peer down.

2.5 Starting and Stopping LNET

Lustre automatically starts and stops LNET, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start Lustre.

2.5.1 Starting LNET

To start LNET, run:

    modprobe lnet
    lctl network up

To see the list of local NIDs, run:

    lctl list_nids

This command tells you if the local node's networks are set up correctly. If the networks are not correctly set up, see the modules.conf "networks=" line and make sure the network layer modules are correctly installed and configured.

To get the best remote NID, run:

    lctl which_nid <NID list>

where <NID list> is the list of available NIDs. This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.

2.5.1.1 Starting Clients

To start a TCP client
371. ndix A Lustre Knowledge Base A 7 A 8 Is it possible to change the IP address of a OST MDS Change the UUID The IP address of any node can be changed as long as the rest of the machines in the cluster are updated to reflect the new location Even if you used hostnames in the xml config file you need to regenerate the configuration logs on your metadata server It is also possible to change the UUID but unfortunately it is not very easy as two binary files would need editing How do I set striping on a file To stripe a file across lt n gt OSTs with stripesize of lt b gt blocks per stripe run lfs setstripe lt new_filename gt lt stripe_size gt lt stripe_offset gt lt stripe_count gt This creates new_filename which must not already exist We strongly recommend that the stripe_size value be 1MB or larger size in bytes Best performance is seen with one or two stripes per file unless it is a file that has shared IO from a large number of clients when the maximum number of stripes is best pass 1 as the stripe count to get maximum striping The stripe_offset OST index which holds the first stripe subsequent stripes are created on sequential stripes should be 1 which means allocate stripes in a round robin manner Abusing the stripe_offset value leads to uneven usage of the OSTs and premature file system usage Most users want to use 1fs setstripe lt new_filename gt 2097152 1 N Or use system wide defa
372. ndle specifically requests a synchronous operation file system modifying operations on the MDS that make up a single file create operation are a Allocate inode inode bitmap group descriptor new inode Create directory entry directory block parent inode for timestamps a Update lov_objids file Lustre file a Update last_rcvd file Lustre file For a single inode each of the above items dirties a single block in the journal 7 blocks 28 KB in total When many new files are created at one time dirty blocks are merged in the journal because each block needs to be dirtied only once per transaction 5s or 1 4 of full journal whichever occurs earlier For 1 000 files created in a single directory this works out to 516 KB if they are all created within the same transaction In 2 6 kernels it is possible to tune the ext3 journal commit interval with o commit seconds This may be desirable for performance testing ext3 code reserves a lot more blocks about 70 for worst case scenarios e g growing a directory which also results in a split of the directory index quota updates adding new indirect blocks for each of the Lustre files modified These are returned to the journal when the transaction is complete most are returned unused To avoid spurious journal commits due to these temporary reservations calculate the journal size based on this formula assuming a default of 32 MDS threads 70 blocks thread 32 threads
373. nes such as:

    ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]

Where:

    Variable         Supported Type
    ost 8            Total number of OSTs being tested.
    sz 67108864K     Total amount of data read or written (in KB).
    rsz 1024         Record size (size of each echo_client I/O, in KB).
    obj 8            Total number of objects over all OSTs.
    thr 8            Total number of threads over all OSTs and objects.
    write            Test name. If more tests have been specified, they all appear on the same line.
    613.54           Aggregate bandwidth over all OSTs (measured by dividing the total number of MB by the elapsed time).
    [64.00, 82.00]   Minimum and maximum instantaneous bandwidths on an individual OST.

Note: Although the numbers of threads and objects are specified per-OST in the customization section of the script, the reported results are aggregated over all OSTs.

Visualizing Results

It is useful to import the obdfilter_survey script summary data (it is fixed-width) into Excel (or any graphing package) and graph the bandwidth versus the number of threads for varying numbers of concurrent regions. This shows how the OSS performs for a given number of concurrently-accessed objects (files) with varying numbers of I/Os in flight.

It is also extremely useful to record average disk I/O sizes during each test. These numbers help locate pathologies in the system when the file system block allocator and the block device elevator The plot-obdfilter script (included) is an exampl
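For graphing, a summary line of the kind shown above can be split into fields with a small parser. The Python below is a sketch, not part of the survey scripts: the function name `parse_summary_line` is hypothetical, and the regular expressions assume the exact field layout of the sample line (keyword followed by value, then one or more "name bandwidth [min, max]" groups).

```python
import re

# Fixed leading fields: ost, sz (KB), rsz (KB), obj, thr; the rest holds test results.
LINE_RE = re.compile(
    r"ost\s+(?P<ost>\d+)\s+sz\s+(?P<sz_kb>\d+)K\s+rsz\s+(?P<rsz_kb>\d+)\s+"
    r"obj\s+(?P<obj>\d+)\s+thr\s+(?P<thr>\d+)\s+(?P<tests>.*)"
)
# One test result: name, aggregate bandwidth, [min, max] per-OST bandwidth.
TEST_RE = re.compile(
    r"(?P<name>[A-Za-z_]+)\s+(?P<agg>[\d.]+)\s+"
    r"\[\s*(?P<min>[\d.]+),\s*(?P<max>[\d.]+)\s*\]"
)

def parse_summary_line(line):
    """Parse one summary line into a dict; returns None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    record = {key: int(m.group(key)) for key in ("ost", "sz_kb", "rsz_kb", "obj", "thr")}
    record["tests"] = {
        t.group("name"): {
            "aggregate": float(t.group("agg")),
            "min": float(t.group("min")),
            "max": float(t.group("max")),
        }
        for t in TEST_RE.finditer(m.group("tests"))
    }
    return record

if __name__ == "__main__":
    sample = "ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]"
    print(parse_summary_line(sample))
```

The resulting dictionaries can be written out as CSV rows and plotted as bandwidth against thread count, which is the comparison the text recommends.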
374. nfiguring Lustre Examples

This chapter provides Lustre configuration examples and includes the following section:

- Simple TCP Network

6.1 Simple TCP Network

This chapter presents several examples of Lustre configurations on a simple TCP network.

6.1.1 Lustre with Combined MGS/MDT

Below is an example of a Lustre setup, "datafs", having a combined MDT/MGS with four OSTs and a number of Lustre clients.

6.1.1.1 Installation Summary

- Combined (co-located) MDT/MGS
- Four OSTs
- Any number of Lustre clients

6.1.1.2 Configuration Generation and Application

1. Install the Lustre RPMS (per Installing Lustre) on all nodes that are going to be part of the Lustre file system. Boot the nodes in the Lustre kernel, including the clients.

2. Change modprobe.conf by adding the following line to it:

       options lnet networks=tcp

3. Configure Lustre on the MGS and MDT node:

       mkfs.lustre --fsname datafs --mdt --mgs /dev/sda

4. Make a mount point on the MDT/MGS for the file system and mount it:

       mkdir -p /mnt/data/mdt
       mount -t lustre /dev/sda /mnt/data/mdt

5. Configure Lustre on all four OSTs:

       mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sda
       mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sdd
       mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sda1
       mkfs.lustre --fsname datafs --ost --mgsnode=mds16@tcp0 /dev/sdb

Note: While creating the file system, make sure you are not using disk w
375. ng used or if you have manually run the lctl network up command.

e2scan

The e2scan utility is an ext2 file system-modified inode scan program. The e2scan program uses libext2fs to find inodes with ctime or mtime newer than a given time and prints out their pathname. Use e2scan to efficiently generate lists of files that have been modified. The e2scan tool is included in e2fsprogs, located at:

    http://downloads.clusterfs.com/public/tools/e2fsprogs/latest

Synopsis

    e2scan [options] [-f file] block_device

Description

When invoked, the e2scan utility iterates all inodes on the block device, finds modified inodes, and prints their inode numbers. A similar iterator, using libext2fs(5), builds a table (called parent database) which lists the parent node for each inode. With a lookup function, you can reconstruct modified pathnames from root.

Options

    Option                    Description
    -b inode buffer blocks    Sets the readahead inode blocks to get excellent performance when scanning the block device.
    -o output file            If an output file is specified, modified pathnames are written to this file. Otherwise, modified parameters are written to stdout.
    -t inode|pathname         Sets the e2scan type. If the type is inode, the e2scan utility prints modified inode numbers to stdout. By default, the type is set as pathname. The e2scan utility lists modified pathnames based on modified inode numbers
376. ng a Resource You can start a resource with the mount command and stop it with the umount command For details see Mounting a Server and Unmounting a Server Active Active Failover Configuration With OSSs it is possible to have a load balanced active active configuration which means that out of all of the OSTs that both machines see use you mount 50 of them on one OSS and the other 50 on the other OSS with the capability of one machine taking 100 of them should the other node die Each OSS is the primary node for a group of OSTs and the failover node for other groups To expand the simple two node example we add ost2 which is primary on nodeB and is on the LUNs nodeB dev sdc1 and nodeA dev sdd1 This demonstrates that the dev identity can differ between nodes but both devices must map to the same physical LUN In this type of failover configuration you can mount two OSTs on two different nodes and format them from either node With failover two OSSs provide the same service to the Lustre network in parallel In case of disaster or a failure in one of the nodes the other OSS can provide uninterrupted file system services Chapter 8 Failover 8 7 8 4 3 8 4 3 1 8 8 For an active active configuration mount one OST on one node and another OST on the other node You can format them from either node Note The two OSS nodes must have shared disks Hardware Requirements for Failover This section describes
377. ng as expected.

Each LNET self-test runs in the context of a session. A node can be associated with only one session at a time, to ensure that the session has exclusive use of the nodes on which it is running. A single node creates, controls and monitors a single session. This node is referred to as the self-test console. Any node may act as the self-test console.

Nodes are named and allocated to a self-test session in groups. This allows all nodes in a group to be referenced by a single name.

Test configurations are built by describing and running test batches. A test batch is a named collection of tests, with each test composed of a number of individual point-to-point tests running in parallel. These individual point-to-point tests are instantiated according to the test type, source group, target group and distribution specified when the test is added to the test batch.

Modules

To run LNET self-test, load the following modules: libcfs, lnet, lnet_selftest and any one of the klnds (ksocklnd, ko2iblnd). To load all necessary modules, run modprobe lnet_selftest, which recursively loads the modules on which lnet_selftest depends.

There are two types of nodes for LNET self-test: console and test. Both node types require all previously-specified modules to be loaded. (The userspace test node does not require these modules.) Test nodes can either be in kernel or in userspace. A console user can invite a kernel test node to join the test session
378. nges daily in the online file system. For example, a two-day old snapshot will likely be twice as big as a one-day old snapshot.

You can create as many snapshots as you have room for in the volume group. If necessary, you can dynamically add disks to the volume group to make it larger.

Snapshots of the target disks (MDT, OSTs) should be taken at the same point in time; make sure that the cronjob updating "main" is not running, since that is the only job writing to the disks. Here is an example:

    cfs21:~# modprobe dm-snapshot
    cfs21:~# lvcreate -L50M -s -n MDTb1 /dev/volgroup/MDT
      Rounding up size to full physical extent 52.00 MB
      Logical volume "MDTb1" created
    cfs21:~# lvcreate -L50M -s -n OSTb1 /dev/volgroup/OST0
      Rounding up size to full physical extent 52.00 MB
      Logical volume "OSTb1" created

After the snapshots are taken, you can continue to back up new/changed files to "main". The snapshots will not contain the new files.

    cfs21:~# cp /etc/termcap /mnt/main
    cfs21:~# ls /mnt/main
    fstab  passwd  termcap

Restoring the File System From a Snapshot

Use this procedure to restore the file system from a snapshot.

1. Rename the snapshot. Rename the file system snapshot from "main" to "back" so you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example:

       cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/volgroup/MDT
379. ning for Inodes 20 5 20 3 2 Sizing the MDT 20 5 20 3 3 Overriding Default Formatting Options 20 6 Network Tuning 20 7 DDN Tuning 20 8 20 5 1 Setting Readahead and MF 20 8 20 5 2 Setting Segment Size 20 9 20 5 3 Setting Write Back Cache 20 10 20 5 4 Setting maxcmds 20 11 20 5 5 Further Tuning Tips 20 11 Large Scale Tuning for Cray XT and Equivalents 20 13 20 6 1 Network Tunables 20 13 Lockless 1 0 Tunables 20 15 Data Checksums 20 15 xiv Lustre 1 8 Operations Manual October 2009 21 Lustre Monitoring and Troubleshooting 21 1 21 1 Monitoring Lustre 21 1 212 Troubleshooting Lustre 21 4 21 2 1 21 2 2 21 2 3 Error Numbers 21 4 Error Messages 21 5 Lustre Logs 21 5 21 3 Submitting a Lustre Bug 21 6 21 4 Common Lustre Problems and Performance Tips 21 7 21 4 1 21 4 2 21 4 3 21 4 4 21 4 5 21 4 6 21 4 7 21 4 8 21 4 9 21 4 10 21 4 11 21 4 12 21 4 13 21 4 14 21 4 15 21 4 16 21 4 17 21 4 18 21 4 19 21 4 20 Recovering from an Unavailable OST 21 7 Write Performance Better Than Read Performance 21 8 OST Object is Missing or Damaged 21 9 OSTs Become Read Only 21 10 Identifying a Missing OST 21 10 Improving Lustre Performance When Working with Small Files 21 12 Default Striping 21 12 Erasing a File System 21 13 Reclaiming Reserved Disk Space 21 13 Considerations in Connecting aSAN with Lustre 21 14 Handling Debugging Bind Address already in use Error 21 15 Replacing An Existing OST or
380. ning that the failed node is now good. Lustre clients can work during a failback, but they are momentarily blocked.

Note: When formatting the MGS, the --failnode option is not available. This is because MGSs do not need to be told about a failover MGS; they do not communicate with other MGSs at any time. However, OSSs, MDSs and Lustre clients need to know about failover MGSs. MDSs and OSSs are told about failover MGSs with the --mgsnode parameter and/or using multi-NID mgsspec specifications. At mount time, clients are told about all MGSs with a multi-NID mgsspec specification. For more details on the multi-NID mgsspec specification and how to tell clients about failover MGSs, see the mount.lustre man page.

Considerations with Failover Software and Solutions

The failover mechanisms used by Lustre and tools such as Heartbeat are soft failover mechanisms. They check system and/or application health at a regular interval, typically measured in seconds. This, combined with the data protection mechanisms of Lustre, is usually sufficient for most user applications.

However, these soft mechanisms are not perfect. The Heartbeat poll interval is typically 30 seconds. To avoid a false failover, Heartbeat waits for a deadtime interval before triggering a failover. In the normal case, a user I/O request should block and recover after the failover completes. But this may not always be the case, given the delay imposed by Heartbeat.
381. nks of data. When the chunk being written to a particular object exceeds the stripe_size, the next chunk of data in the file is stored on the next target.

FIGURE 1-6 Files striped with a stripe count of 2 and 3 with different stripe sizes (legend: File A data, File B data; each gray area is one object)

File striping presents several benefits. One is that the maximum file size is not limited by the size of a single target. Lustre can stripe files over up to 160 targets, and each target can support a maximum disk use of 8 TB by a file. This leads to a maximum disk use of 1.48 PB by a file in Lustre. Note that the maximum file size is much larger (2^64 bytes), but the file cannot have more than 1.48 PB of allocated data; hence a file larger than 1.48 PB must have many sparse sections. While a single file can only be striped over 160 targets, Lustre file systems have been built with almost 5000 targets, which is enough to support a 40 PB file system.

Another benefit of striped files is that the I/O bandwidth to a single file is the aggregate I/O bandwidth to the objects in a file, and this can be as much as the bandwidth of up to 160 servers.

1.4.2 Lustre Storage

The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and formatted as file systems. Lustre OSS and MDS servers read, write and modify data
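The way successive chunks move from one object to the next can be sketched as a small offset calculation. The Python below assumes a plain round-robin (RAID 0 style) layout in which every stripe has the same stripe_size; the function name `locate_stripe` is hypothetical and the code is an illustration of the idea, not Lustre's implementation.

```python
def locate_stripe(file_offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (stripe index, offset inside that object).

    Assumes a simple round-robin layout: chunk 0 goes to object 0, chunk 1 to
    object 1, and so on, wrapping around after stripe_count objects.
    """
    if stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("stripe_size and stripe_count must be positive")
    chunk = file_offset // stripe_size             # which stripe-sized chunk of the file
    stripe_index = chunk % stripe_count            # which object holds that chunk
    chunks_before = chunk // stripe_count          # full chunks already in that object
    object_offset = chunks_before * stripe_size + file_offset % stripe_size
    return stripe_index, object_offset

if __name__ == "__main__":
    MIB = 1024 * 1024
    for offset in (0, MIB, 2 * MIB + 5, 5 * MIB):
        print(offset, "->", locate_stripe(offset, stripe_size=MIB, stripe_count=2))
```

With a 1 MB stripe size and a stripe count of 2, the first megabyte lands in object 0, the second in object 1, and the third wraps back to object 0 at an object offset of 1 MB.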
382. node On the node var log messages holds a log of all messages for at least the past day Lustre Logs The error message initiates with LustreError in the console log and provides a short description of m What the problem is m Which process ID had trouble m Which server node it was communicating with and so on Collect the first group of messages related to a problem and any messages that precede LBUG or assertion failure errors Messages that mention server nodes OST or MDS are specific to that server you must collect similar messages from the relevant server console logs Another Lustre debug log holds information for Lustre action for a short period of time which in turn depends on the processes on the node to use Lustre Use the following command to extract debug logs on each of the nodes run lctl dk lt filename gt Note LBUG freezes the thread to allow capture of the panic stack A system reboot is needed to clear the thread Chapter 21 Lustre Monitoring and Troubleshooting 21 5 21 3 21 6 Submitting a Lustre Bug If after troubleshooting your Lustre system you cannot resolve the problem consider submitting a Lustre bug To do this you will need an account on Bugzilla defect tracking system used for Lustre and download the Lustre diagnostics tool to run and capture the diagnostics output Note Create a Lustre Bugzilla account Download the Lustre diagnostics tool and insta
383. not have to rebuild each time. 16.2 Running POSIX Tests Against Lustre. To run the POSIX tests against Lustre: 1. As root, set up your Lustre file system, mounted on /mnt/lustre (for instance, sh llmount.sh), and untar the POSIX tests back to their home: tar --same-owner -xzpvf /path/to/tarball/TESTROOT.tgz -C /mnt/lustre. As the vsx0 user, you can re-run the tests as many times as you want. If you are newly logged in as the vsx0 user, you need to source the environment with '. profile' so that your path and environment are set up correctly. 2. Run the POSIX tests: . /home/tet/profile; tcc -e -s scen.exec -a /mnt/lustre/TESTROOT -p. New results are placed in new directories at /home/tet/test_sets/results. Each result is given a directory name similar to 0004e (an incrementing number which ends with e for test execution, or b for building tests). 3. To look at a formatted report, run: vrpt results/0004e/journal | less. Some tests are "Unsupported", "Untested" or "Not In Use", which does not necessarily indicate a problem. 4. To compare two test results, run: vrptm results/ext3/journal results/0004e/journal | less. This is more interesting than looking at the result of a single test, as it helps to find test failures that are specific to the file system instead of the Linux VFS or kernel. Up to six test results can be compared at one time. It is often useful to rename the results di
384. not recommended because of resulting bottlenecks. 6. Mount the OST. On the OSS node where the OST was created, run: mount -t lustre <block device name> <mount point>. Note: To create additional OSTs, repeat Step 4 and Step 5. (Footnote 2: When you create the OST, you are defining a storage device ('sd'), a device number (a, b, c, d), and a partition (1, 2, 3) where the OST node lives.) 7. Create the client (mount the file system on the client). On the client node, run: mount -t lustre <MGS node>:/<fsname> <mount point>. Note: To create additional clients, repeat Step 7. 8. Verify that the file system started and is working correctly by running the df, dd and ls commands on the client node. a. Run the lfs df -h command: [root@client1 /] lfs df -h. The lfs df -h command lists space usage per OST and the MDT in human-readable format. b. Run the lfs df -ih command: [root@client1 /] lfs df -ih. The lfs df -ih command lists inode usage per OST and the MDT. c. Run the dd command: [root@client1 /] cd /lustre; [root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2. The dd command verifies write functionality by creating a file containing all zeros (0s). In this command, an 8 MB file is created. d. Run the ls command: [root@client1 /lustre] ls -lsah. The ls -lsah command lists files and directories in the current working director
385. ns is user-adjustable. Querying File System Free Space. Free space is an important consideration in assigning file stripes. The lfs df command shows available disk space on the mounted Lustre file system and space consumption per OST. If multiple Lustre file systems are mounted, a path may be specified, but is not required. Options: -h (human readable; prints sizes in human-readable format, for example 1K, 234M, 5G); -i, --inodes (lists inodes instead of block usage). Note: The df -i and lfs df -i commands show the minimum number of inodes that can be created in the file system. Depending on the configuration, it may be possible to create more inodes than initially reported by df -i. Later, df -i operations will show the current, estimated free inode count. If the underlying file system has fewer free blocks than inodes, then the total inode count for the file system reports only as many inodes as there are free blocks. This is done because Lustre may need to store an external attribute for each new inode, and it is better to report a free inode count that is the guaranteed, minimum number of inodes that can be created. Examples: [lfs df output on lin-cli1 listing the MDS, each OST and a filesystem summary; column values not recoverable]
386. nt (for example, when an OST and a failover OST share a partition). The backing file system for Lustre, ldiskfs, supports the MMP mechanism. A block in the file system is updated by a kmmpd daemon at one-second intervals, and a monotonically increasing sequence number is written in this block. If the file system is cleanly unmounted, then a special "clean" sequence is written in this block. When mounting a file system, ldiskfs checks if the MMP block has a clean sequence or not. Even if the MMP block holds a clean sequence, ldiskfs waits for some interval to guard against the following situations: under heavy I/O, it may take longer for the MMP block to be updated; if another node is also trying to mount the same file system, there may be a race. With MMP enabled, mounting a clean file system takes at least 10 seconds. If the file system was not cleanly unmounted, then mounting the file system may require additional time. Note: The MMP feature is only supported on Linux kernel versions >= 2.6.9. Note: The MMP feature is automatically enabled by mkfs.lustre for new file systems at format time if failover is being used and the kernel and e2fsprogs support it. Otherwise, the Lustre administrator has to manually enable this feature when the file system is unmounted. If failover is being used, the MMP feature is automatically enabled by mkfs.lustre. To determine if MMP is
387. nt N In ext3/ldiskfs file systems, inodes are pre-allocated, so creating a new file does not consume any of the free blocks. However, this also means that the format-time options should be conservative, as it is not possible to increase the number of inodes after the file system is formatted. If there is a shortage of inodes or space on the OSTs, it is possible to add OSTs to the file system. To be on the safe side, plan for 4 KB per inode on the MDT (the default). For the OST, the amount of space taken by each object depends entirely upon the usage pattern of the users/applications running on the system. Lustre, by necessity, defaults to a very conservative estimate for the object size (16 KB per object). You can almost always increase this value for file system installations. Many Lustre file systems have average file sizes over 1 MB per object. Sizing the MDT. When calculating the MDT size, the only important factor is the average size of files to be stored in the file system. If the average file size is, for example, 5 MB and you have 100 TB of usable OST space, then you need at least 100 TB * 1024 GB/TB * 1024 MB/GB / 5 MB/inode = 20 million inodes. We recommend that you have twice the minimum, 40 million inodes in this example. At the default 4 KB per inode, this works out to only 160 GB of space for the MDT. Conversely, if you have a very small average file size (4 KB, for example), Lustre is not very efficient. This is because you consume
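The MDT sizing arithmetic in this excerpt can be checked with a few lines of code. This is a minimal sketch: the function name and parameters are hypothetical, and it simply assumes binary units (1 TB = 1024 x 1024 MB) as the manual's formula does.

```python
def mdt_size_estimate(ost_space_tb, avg_file_mb,
                      bytes_per_inode=4096, safety_factor=2):
    """Return (minimum inodes, planned inodes, MDT bytes) for a file system.

    ost_space_tb  -- usable OST space in TB (binary units assumed)
    avg_file_mb   -- expected average file size in MB
    """
    min_inodes = ost_space_tb * 1024 * 1024 // avg_file_mb
    planned_inodes = min_inodes * safety_factor      # "twice the minimum"
    mdt_bytes = planned_inodes * bytes_per_inode     # default 4 KB per inode
    return min_inodes, planned_inodes, mdt_bytes


if __name__ == "__main__":
    min_inodes, planned, mdt_bytes = mdt_size_estimate(100, 5)
    print(f"minimum inodes: {min_inodes}")
    print(f"planned inodes: {planned}")
    print(f"MDT space     : {mdt_bytes / 2**30:.0f} GB")
```

For the 100 TB, 5 MB example this reproduces the roughly 20 million and 40 million inode figures and the 160 GB MDT size quoted in the text.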
388. nt OBDs cfs21 cat proc fs lustre osc lustre OST0001 osc ce129800 timeouts last reply 1193428639 0Od0h00m00s ago network cur 1 worst 2 at 1193427053 Od0h26m26s ago 1 1 1 1 portal 6 cur 33 worst 34 at 1193427052 O0d0h26m27s ago 33 33 33 2 portal 28 cur 1 worst 1 at 1193426141 0d0h41m38s ago 1 1 1 1 portal 7 cur 1 worst 1 at 1193426141 0d0h41m38s ago 1 0 1 1 portal 17 cur 1 worst 1 at 1193426177 0Od0h41m02s ago 1 0 0 dE In this case RPCs to portal 6 the OST_IO PORTAL see lustre include lustre lustre_idl h shows the history of what the ost_io portal has reported as the service estimate 22 8 Lustre 1 8 Operations Manual October 2009 22 1 4 Server statistic files also show the range of estimates in the normal min max sum sumsq manner cfs21 cat proc fs lustre mdt MDS mds stats req timeout 6 samples sec 1 10 15 105 LNET Information This section describes proc entries for LNET information proc sys lnet peers Shows all NIDs known to this node and also gives information on the queue state cat proc sys lnet peers nid refs state max EET min EX minqueue 0 lo 1 rtr 0 0 0 0 0 0 192 168 10 35 tcpl1 rtr 8 8 8 8 6 0 192 168 10 36 tcp1 rtr 8 8 8 8 6 0 192 168 10 37 tcpl1 rtr 8 8 8 8 6 0 The fields are explained below Field Description refs A reference count principally used for debugging state Only valid to refer to routers Possible values e rtr indicates this no
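The "min max sum sumsq" layout mentioned for server statistics files is enough to recover a mean and a spread. The sketch below assumes the stats line has the shape "name count samples [unit] min max sum sumsq" (the brackets around the unit are reconstructed, not visible in this excerpt) and that a population variance is wanted; the function name `summarize` is hypothetical.

```python
import math


def summarize(stats_line):
    """Derive mean and standard deviation from a min/max/sum/sumsq stats line."""
    fields = stats_line.split()
    name = fields[0]
    count = int(fields[1])
    # The last four fields are min, max, sum and sum of squares.
    lo, hi, total, sumsq = (float(x) for x in fields[-4:])
    mean = total / count
    variance = sumsq / count - mean ** 2     # population variance
    return {
        "name": name,
        "count": count,
        "min": lo,
        "max": hi,
        "mean": mean,
        "stddev": math.sqrt(max(variance, 0.0)),
    }


if __name__ == "__main__":
    print(summarize("req_timeout 6 samples [sec] 1 10 15 105"))
```

For the req_timeout line in the excerpt (6 samples, sum 15, sum of squares 105) the mean service estimate is 2.5 seconds.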
389. nvite a kernel test node to join the test session by running lst add_group NID, but the user cannot actively add a userspace test node to the test session. However, the console user can passively accept a test node to the test session while the test node runs lstclient to connect to the console. Utilities. LNET self-test has two user utilities, lst and lstclient. lst: The user interface for the self-test console (run on the console node). It provides a list of commands to control the entire test system, such as create session, create test groups, etc. lstclient: The userspace LNET self-test program (run on a test node). lstclient is linked with userspace LNDs and LNET. lstclient is not needed if a user just wants to use kernel-space LNET and LNDs. Session. In the context of LNET self-test, a test node can be associated with only one session at a time, which ensures that the session has exclusive use of the node. Almost all operations should be performed in a session context. From the console node, a user can only operate nodes in his own session. If a session ends, the session context in all test nodes is destroyed. The console node can be used to create, change or destroy a session (new_session, end_session, show_session). For more information, see Session. Console. The console node is the user interface of the LNET self-test system, and can be any
390. o 0 read write. The fields are explained below. Field: Description. pages per brw: Number of pages per RPC request, which should match aggregate client rpc_stats. discont pages: Number of discontinuities in the logical file offset of each page in a single RPC. discont blocks: Number of discontinuities in the physical block allocation in the file system for a single RPC. 22.2.6 Using File Readahead and Directory Statahead. Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that read data into memory in anticipation of a process actually requesting the data. File readahead functionality reads file content data into memory. Directory statahead functionality reads metadata into memory. When readahead and/or statahead work well, a data-consuming process finds that the information it needs is available when requested, and it is unnecessary to wait for network I/O. 22.2.6.1 Tuning File Readahead. File readahead is triggered when two or more sequential reads by an application fail to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB. Additional readaheads grow linearly and increment until the readahead cache on the client is full at 40 MB. /proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_mb: This tunable controls the maximum amount of data readahead on a file. Files are read ahead in RPC-sized chunks (1 MB or
391. o configure Lustre, use: --nettype openib --nid <IPoIB address>. Silverstorm: A Silverstorm driver for Lustre is available. OpenIB 1.0: An OpenIB 1.0 driver for Lustre is available. Currently (v1.4.5), the Voltaire IB module (kvibnal) will not work on the Altix system. This is due to hardware differences in the Altix system. To build Silverstorm with Lustre, configure Lustre with: --with-iib=<path to silverstorm sources>. Can the same Lustre file system be mounted at multiple mount points on the same client system? Yes, this is perfectly safe. How do I identify files affected by a missing OST? If an OST is missing for any reason, you may need to know what files are affected. The file system should still be operational, even though one OST is missing, so from any mounted client node it is possible to generate a list of files that reside on that OST. In such situations, we recommend marking the missing OST as unavailable, so clients and the MDS do not time out trying to contact it. On mixed MDS/client nodes: 1. Generate a list of devices and determine the OST's device number: lctl dl. 2. Deactivate the OST (on the OSS at the MDS): lctl --device <OST device name or number> deactivate. If the OST later becomes available, it needs to be reactivated. Run: lctl --device <OST device number> activate. Determine all files striped over the missing OST. Run: lf
392. o their place in the O/0/d* directories. To use ll_recover_lost_found_objs, mount the file system locally (using the -t ldiskfs command), run the utility, and then unmount it again. The OST must not be mounted by Lustre when ll_recover_lost_found_objs is run. Options: -h (prints a help message); -v (increases verbosity); -d directory (sets the lost and found directory path). Example: ll_recover_lost_found_objs -d /mnt/ost/lost+found. CHAPTER 32 System Limits. This chapter describes various limits on the size of files and file systems. These limits are imposed by either the Lustre architecture or the Linux VFS and VM subsystems. In a few cases, a limit is defined within the code and could be changed by re-compiling Lustre. In those cases, the selected limit is supported by Lustre testing and may change in future releases. This chapter includes the following sections: Maximum Stripe Count; Maximum Stripe Size; Minimum Stripe Size; Maximum Number of OSTs and MDTs; Maximum Number of Clients; Maximum Size of a File System; Maximum File Size; Maximum Number of Files or Subdirectories in a Single Directory; MDS Space Consumption; Maximum Length of a Filename and Pathname; Maximum Number of Open Files for Lustre File Systems; OSS RAM Size. 32.1 Maximum Stripe Count
393. ocated in /usr/include/asm/errno.h. Lustre does not use all of the available Linux error numbers. The exact meaning of an error number depends on where it is used. Here is a summary of the basic errors that Lustre users may encounter (error number, error name, description): -1 EPERM: Permission is denied. -2 ENOENT: The requested file or directory does not exist. -4 EINTR: The operation was interrupted (usually CTRL-C or a killing process). -5 EIO: The operation failed with a read or write error. -19 ENODEV: No such device is available. The server stopped or failed over. -22 EINVAL: The parameter contains an invalid value. -28 ENOSPC: The file system is out of space or out of inodes. Use lfs df (query the amount of file system space) or lfs df -i (query the number of inodes). -30 EROFS: The file system is read-only, likely due to a detected error. -43 EIDRM: The UID/GID does not match any known UID/GID on the MDS. Update /etc/hosts and /etc/group on the MDS to add the missing user or group. -107 ENOTCONN: The client is not connected to this server. -110 ETIMEDOUT: The operation took too long and timed out. 21.2.3 Error Messages. As Lustre code runs on the kernel, single-digit error codes display to the application; these error codes are an indication of the problem. Refer to the kernel console log (dmesg) for all recent kernel messages from that
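When triaging logs, the error summary in this excerpt can be kept as a small lookup table. The sketch below is an illustration only: the table is transcribed from the excerpt, while the names `LUSTRE_ERRORS` and `describe` are hypothetical. It accepts either sign, since kernel messages usually report the negated value.

```python
# Error numbers and names as listed in the excerpt (Linux numbering).
LUSTRE_ERRORS = {
    1: ("EPERM", "Permission is denied."),
    2: ("ENOENT", "The requested file or directory does not exist."),
    4: ("EINTR", "The operation was interrupted."),
    5: ("EIO", "The operation failed with a read or write error."),
    19: ("ENODEV", "No such device is available."),
    22: ("EINVAL", "The parameter contains an invalid value."),
    28: ("ENOSPC", "The file system is out of space or out of inodes."),
    30: ("EROFS", "The file system is read-only."),
    43: ("EIDRM", "The UID/GID does not match any known UID/GID on the MDS."),
    107: ("ENOTCONN", "The client is not connected to this server."),
    110: ("ETIMEDOUT", "The operation took too long and timed out."),
}


def describe(rc):
    """Return a one-line description for a return code such as -28 or 28."""
    code = abs(rc)
    name, text = LUSTRE_ERRORS.get(code, ("UNKNOWN", "Not in the summary table."))
    return f"{code} {name}: {text}"


if __name__ == "__main__":
    for rc in (-28, -110, -5):
        print(describe(rc))
```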
394. ode. This chapter includes these sections: Preparing to Install Lustre; Installing Lustre from RPMs; Installing Lustre from Source Code. Lustre can be installed from either packaged binaries (RPMs) or freely-available source code. Installing from the package release is straightforward, and recommended for new users. Integrating Lustre into an existing kernel and building the associated Lustre software is an involved process. For either installation method, the following are required: a Linux kernel patched with Lustre-specific patches; Lustre modules compiled for the Linux kernel; Lustre utilities required for Lustre configuration. Note: When installing Lustre and creating components on devices, a certain amount of space is reserved, so less than 100% of storage space will be available. Lustre servers use the ext3 file system to store user-data objects and system data. By default, ext3 file systems reserve 5% of space that cannot be used by Lustre. Additionally, Lustre reserves up to 400 MB on each OST for journal use. This reserved space is unusable for general storage. For this reason, you will see up to 400 MB of space used on each OST before any file object data is saved to it. (Footnote 1: Additionally, a few bytes outside the journal are used to create accounting data for Lustre.) 3.1 Preparing to Install Lustre. To successfully install and run Lustre, make sure the following installation prerequisites have been
395. oftlimit <inode-softlimit> --inode-hardlimit <inode-hardlimit> <filesystem>. lfs setquota <-u|--user|-g|--group> <username|groupname> -b <block-softlimit> -B <block-hardlimit> -i <inode-softlimit> -I <inode-hardlimit> <filesystem>. lfs setquota -t <-u|-g> --block-grace <block-grace> --inode-grace <inode-grace> <filesystem>. lfs setquota -t <-u|-g> -b <block-grace> -i <inode-grace> <filesystem>. lfs help. Note: In the above example, the <filesystem> parameter refers to the mount point of the Lustre file system. The default mount point is /mnt/lustre. Note: The old lfs quota output was very detailed and contained cluster-wide quota statistics (including cluster-wide limits for a user/group and cluster-wide usage for a user/group), as well as statistics for each MDS/OST. Now, lfs quota has been updated to provide only cluster-wide statistics, by default. To obtain the full report of cluster-wide limits, usage and statistics, use the -v option with lfs quota. Description. The lfs utility is used to create a new file with a specific striping pattern, determine the default striping pattern, gather the extended attributes (object numbers and location) for a specific file, find files with specific attributes, list OST information or set quota limi
396. on all servers and clients. An individual node identifies the locally available networks based on the listed IP address patterns that match the node's local IP addresses. Note that the IP address patterns listed in the ip2nets option are only used to identify the networks that an individual node should instantiate. They are not used by LNET for any other communications purpose. The servers megan and oscar have eth0 IP addresses 192.168.0.2 and 4. They also have IP over Elan (eip) addresses of 132.6.1.2 and 4. TCP clients have IP addresses 192.168.0.5-255. Elan clients have eip addresses of 132.6.[2-3].2, 4, 6, 8. modprobe.conf is identical on all nodes: options lnet 'ip2nets="tcp0(eth0,eth1) 192.168.0.[2,4]; tcp0 192.168.0.*; elan0 132.6.[1-3].[2-8/2]"'. Note: LNET lines in modprobe.conf are only used by the local node to determine what to call its interfaces. They are not used for routing decisions. Because megan and oscar match the first rule, LNET uses eth0 and eth1 for tcp0 on those machines. Although they also match the second rule, it is the first matching rule for a particular network that is used. The servers also match the (only) Elan rule. The [2-8/2] format matches the range 2-8 stepping by 2; that is 2, 4, 6, 8. For example, clients at 132.6.3.5 would not find a matching Elan network. 7.1.2 Start Servers. For the combined MGS/MDT with TCP network, run: mkfs.lustre f
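The range notation used in the ip2nets patterns (a list such as 2,4 or a stepped range such as 2-8/2) can be illustrated with a small expander. This is a sketch of the notation as the excerpt describes it, not LNET's parser: the function names are hypothetical, and treating "*" as 0 through 255 is an assumption.

```python
def expand_octet(expr):
    """Expand one octet expression such as "2-8/2", "2,4", "1-3" or "*"."""
    if expr == "*":
        return list(range(256))          # assumption: wildcard covers 0..255
    values = []
    for part in expr.split(","):
        span, _, step = part.partition("/")
        low, _, high = span.partition("-")
        low = int(low)
        high = int(high) if high else low
        values.extend(range(low, high + 1, int(step) if step else 1))
    return values


def octet_matches(expr, value):
    """True if a concrete octet value falls inside the expression."""
    return value in expand_octet(expr)


if __name__ == "__main__":
    print(expand_octet("2-8/2"))
    # The last octet of 132.6.3.5 is 5, which the stepped range skips.
    print(octet_matches("2-8/2", 5))
```

This matches the manual's remark that a client at 132.6.3.5 finds no matching Elan network: its third octet is within 1-3, but its last octet is not one of 2, 4, 6, 8.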
397. on conflicts. export LST_SESSION=`lst new_session --force liangzhen`. end_session: Stops all operations and tests in the current session and clears the session's status. Example: lst end_session. show_session: Shows the session information. This command prints information about the current session. It does not require LST_SESSION to be defined in the process environment. Example: lst show_session. 18.4.2.2 Group. This section lists lst group commands. add_group NAME NIDs [NIDs...]: Creates the group and adds a list of test nodes to the group. NAME: Name of the group. NIDs: A string that may be expanded into one or more LNET NIDs. Examples: lst add_group servers 192.168.10.[35,40-45]@tcp; lst add_group clients 192.168.1.[10-100]@tcp 192.168.[2,4].[10-20]@tcp. update_group NAME [--refresh] [--clean STATE] [--remove NIDs]: Updates the state of nodes in a group or adjusts a group's membership. This command is useful if some nodes have crashed and should be excluded from the group. --refresh: Refreshes the state of all inactive nodes in the group. --clean STATUS: Removes nodes with a specified status from the group. Status may be: active (the node is in the current session); busy (the node is now owned by another session); down (the node has been marked down); unknown (the node's status has yet to be determined); invalid (any state but active). --remove NIDs: Removes specified nodes from the group
398. on environment, run: sysctl -w lnet.debug="warning dlmtrace error emerg ha rpctrace vfstrace" (output: lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace). The flags above collect enough high-level information to aid debugging, but they do not cause any serious performance impact. To clear all flags and set new ones, run: sysctl -w lnet.debug="warning" (output: lnet.debug = warning). (Footnote 2: This controls the level of Lustre debugging kept in the internal log buffer. It does not alter the level of debugging that goes to syslog.) To add new flags to existing ones, prefix them with a "+": sysctl -w lnet.debug="+neterror +ha" (output: lnet.debug = +neterror +ha); sysctl lnet.debug (output: lnet.debug = neterror warning ha). To remove flags, prefix them with a "-": sysctl -w lnet.debug="-ha" (output: lnet.debug = -ha); sysctl lnet.debug (output: lnet.debug = neterror warning). You can verify and change the debug level using the /proc interface in Lustre. To use the flags with /proc, run: cat /proc/sys/lnet/debug (output: neterror warning); echo "+ha" > /proc/sys/lnet/debug; cat /proc/sys/lnet/debug (output: neterror warning ha); echo "-warning" > /proc/sys/lnet/debug; cat /proc/sys/lnet/debug (output: neterror ha). /proc/sys/lnet/subsystem_debug: This controls the debug logs for subsystems (see S_* definitions). /proc/sys/lnet/debug_path: This indicates the locat
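The add, remove and replace behaviour of the debug flag string can be modelled in a few lines. This is a toy model of the semantics shown in the excerpt, not the kernel's implementation; the function name `apply_debug_spec` is hypothetical, and the rule that a string without any "+" or "-" prefix replaces the whole set is inferred from the "clear all flags and set new ones" example.

```python
def apply_debug_spec(current, spec):
    """Return the flag set that results from writing spec over current.

    "+flag" adds a flag, "-flag" removes one; a spec with no prefixed
    tokens replaces the whole set (assumed from the excerpt's examples).
    """
    tokens = spec.split()
    if not any(tok[0] in "+-" for tok in tokens):
        return set(tokens)
    result = set(current)
    for tok in tokens:
        if tok.startswith("+"):
            result.add(tok[1:])
        elif tok.startswith("-"):
            result.discard(tok[1:])
        else:
            result.add(tok)
    return result


if __name__ == "__main__":
    flags = apply_debug_spec(set(), "warning")
    flags = apply_debug_spec(flags, "+neterror +ha")
    print(sorted(flags))
    flags = apply_debug_spec(flags, "-ha")
    print(sorted(flags))
```

Replaying the excerpt's sequence (set "warning", add "+neterror +ha", then remove "-ha") leaves neterror and warning, as in the sysctl output shown.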
399. 1.4.3 Lustre System Capacity. Lustre file system capacity is the sum of the capacities provided by the targets. As an example, 64 OSSs, each with two 8 TB targets, provide a file system with a capacity of nearly 1 PB. If this system uses sixteen 1 TB SATA disks, it may be possible to get 50 MB/sec from each drive, providing up to 800 MB/sec of disk bandwidth. If this system is used as storage backend with a system network like InfiniBand that supports a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. Note that the OSS must provide inbound and outbound bus throughput of 800 MB/sec simultaneously. The cluster could see aggregate I/O bandwidth of 64x800, or about 50 GB/sec. Although the architectural constraints described here are simple, in practice it takes careful hardware selection, benchmarking and integration to obtain such results. In a Lustre file system, storage is only attached to server nodes, not to client nodes. If failover capability is desired, then this storage must be attached to multiple servers. In all cases, the use of storage area networks (SANs) with expensive switches can be avoided, because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. 1.5 Lustre Configurations. Lustre file systems are easy to configure. First, the Lustre software is installed, and then MDT and OST partitions a
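The capacity and bandwidth figures in this excerpt follow from simple multiplication, which the sketch below reproduces. The function and parameter names are hypothetical, and the reading that the sixteen 1 TB disks belong to each OSS is an assumption drawn from the 2 x 8 TB targets per server.

```python
def cluster_estimate(oss_count=64, targets_per_oss=2, target_tb=8,
                     disks_per_oss=16, mb_per_sec_per_disk=50):
    """Back-of-envelope capacity and bandwidth for the example cluster."""
    capacity_tb = oss_count * targets_per_oss * target_tb
    per_oss_mb_s = disks_per_oss * mb_per_sec_per_disk
    aggregate_mb_s = oss_count * per_oss_mb_s
    return {
        "capacity_tb": capacity_tb,          # sum of all target capacities
        "per_oss_mb_s": per_oss_mb_s,        # disk bandwidth behind one OSS
        "aggregate_mb_s": aggregate_mb_s,    # best case across the cluster
    }


if __name__ == "__main__":
    est = cluster_estimate()
    print(f"capacity : {est['capacity_tb']} TB")
    print(f"per OSS  : {est['per_oss_mb_s']} MB/sec")
    print(f"aggregate: {est['aggregate_mb_s'] / 1024:.0f} GB/sec")
```

These are upper bounds; as the text notes, reaching them in practice depends on hardware selection, benchmarking and integration.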
400. one with old disks that will not be part of the new file system, just do not mount them. Reclaiming Reserved Disk Space. All current Lustre installations run the ext3 file system internally on service nodes. By default, ext3 reserves 5% of the disk space for the root user. In order to reclaim this space, run the following command on your OSSs: tune2fs [-m reserved_blocks_percent] [device]. You do not need to shut down Lustre before running this command or restart it afterwards. 21.4.10 Considerations in Connecting a SAN with Lustre. Depending on your cluster size and workload, you may want to connect a SAN with Lustre. Before making this connection, consider the following: In many SAN file systems without Lustre, clients allocate and lock blocks or inodes individually as they are updated. The Lustre design avoids the high contention that some of these blocks and inodes may have. Lustre is highly scalable and can have a very large number of clients. SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is generally higher than other networking. File systems that allow direct-to-SAN access from the clients have a security risk because clients can potentially read any data on the SAN disks, and misbehaving clients can corrupt the file system for many reasons like improper file system, network or other kernel software, bad cabling, bad memory
401. onfiguration the lustre_config csv file looks like this Configuring RAID 5 on mds16 clusterfs com mds16 clusterfs com MD dev md0 c 128 5 dev sdb dev sdc dev sdd configuring multiple RAID5 on oss161 clusterfs com oss161 clusterfs com MD dev md0 c 128 5 dev sdb dev sdc dev sdd ossi161 clusterfs com MD dev md1 c 128 5 dev sde dev sdf dev sdg configuring LVM2 PV from the RAID5 from the above steps on oss161 clusterfs com oss161 clusterfs com PV dev md0 dev md1 configuring LVM2 VG from the PV and RAIDS from the above steps on oss161 clusterfs com oss161 clusterfs com VG oss_data s 32M dev md0 dev md1 configuring LVM2 LV from the VG PV and RAIDS from the above steps on oss161 clusterfs com oss161 clusterfs com LV ost0 i 2 I 128 2G oss_data oss161 clusterfs com LV ost1 i 2 I 128 2G oss_data Chapter 6 Configuring Lustre Examples 6 11 6 12 configuring LVM2 PV on oss162 clusterfs com oss162 clusterfs com PV dev sdb dev sdc dev sdd dev sde dev sdf dev sdg configuring LVM2 VG from the PV from the above steps on oss162 clusterfs com oss162 clusterfs com VG vg_oss1 s 32M dev sdb dev sdc dev sdd oss162 clusterfs com VG vg_oss2 s 32M dev sde dev sdf dev sdg configuring LVM2 LV from the VG and PV from the above steps on oss162 clusterfs com oss162 clusterfs com LV ost3 i 3 I 64 1G vg_oss2 oss162 clusterfs com LV ost2 i 3 I 64 1G vg_oss1 configu
402. For proper resource fencing, the Heartbeat software must be able to completely power off the server or disconnect it from the shared storage device. It is imperative that no two active nodes access the same storage device, at the risk of severely corrupting data. When Heartbeat detects a server failure, it calls a process (STONITH) to power off the failed node, and then starts Lustre on the secondary node using its built-in file system resource manager. Servers providing Lustre resources are configured in primary/secondary pairs for the purpose of failover. When a server umount command is issued, the disk device is set read-only. This allows the second node to start service using that same disk, after the command completes. This is known as a soft failover, in which case both the servers can be running and connected to the net. Powering off the node is known as a hard failover. The Power Management Software. The Linux-HA package includes a set of power management tools, known as STONITH (Shoot The Other Node In The Head). STONITH has native support for many power control devices, and is extensible. It uses expect scripts to automate control. PowerMan, by the Lawrence Livermore National Laboratory (LLNL), is a tool for manipulating remote power control (RPC) devices from a central location. Several RPC varieties are supported natively by PowerMan. The latest versions of PowerMan are available at: http
403. FIGURE 1-5 shows how a file open operation transfers the object pointers from the MDS to the client when a client opens the file, and how the client uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored. FIGURE 1-5: File open and file I/O in Lustre. If only one object is associated with an MDS inode, that object contains all of the data in that Lustre file. When more than one object is associated with a file, data in the file is "striped" across the objects. The benefits of the Lustre arrangement are clear. The capacity of a Lustre file system equals the sum of the capacities of the storage targets. The aggregate bandwidth available in the file system equals the aggregate bandwidth offered by the OSSs to the targets. Both capacity and aggregate I/O bandwidth scale simply with the number of OSSs. 1.4.1 Lustre File System and Striping. Striping allows parts of files to be stored on different OSTs, as shown in FIGURE 1-6. A RAID 0 pattern, in which data is striped across a certain number of objects, is used; the number of objects is called the stripe_count. Each object contains chu
404. or. When OSTs have approximately the same amount of free space (within 20%), an efficient round-robin allocator is used. The round-robin allocator alternates stripes between OSTs on different OSSs. Here are several sample round-robin stripe orders (the same letter represents the different OSTs on a single OSS): 3: AAA (one 3-OST OSS); 3x3: ABABAB (two 3-OST OSSs); 3x4: BBABABA (one 3-OST OSS (A) and one 4-OST OSS (B)); 3x5: BBABBABA; 3x5x1: BBABABABC; 3x5x2: BABABCBABC; 4x6x2: BABABCBABABC. Weighted Allocator. When the free space difference between the OSTs is significant, then a weighting algorithm is used to influence OST ordering based on size and location. Note that these are weightings for a random algorithm, so the "emptiest" OST is not necessarily chosen every time. On average, the weighted allocator fills the emptier OSTs faster. 24.4.5 Adjusting the Weighting Between Free Space and Location. This priority can be adjusted via the /proc/fs/lustre/lov/lustre-mdtlov/qos_prio_free proc file. The default is 90%. Use the following command to permanently change this weighting on the MGS: lctl conf_param <fsname>-MDT0000.lov.qos_prio_free=90. Increasing the value puts more weighting on free space. When the free space priority is set to 100%, then location is no longer used in stripe-ordering calculations, and weighting is based entirely on free space. Note that setting the priority to 100%
405. ore control over allocation policy will be available in the next version. Field: Description. stats: Enables/disables the collection of statistics. Collected statistics can be found in /proc/fs/ldiskfs2/<dev>/mb_history. max_to_scan: Maximum number of free chunks that mballoc finds before a final decision, to avoid livelock. min_to_scan: Minimum number of free chunks that mballoc finds before a final decision. This is useful for a very small request, to resist fragmentation of big free chunks. order2_req: For requests equal to 2^N (where N >= order2_req), a very fast search via buddy structures is used. small_req, large_req: All requests are divided into 3 categories: < small_req (packed together to form large, aggregated requests); < large_req (allocated mostly linearly); > large_req (very large requests, so the arm seek does not matter). The idea is that we try to pack small requests to form large requests, and then place all large requests (including compound requests built from the small ones) close to one another, causing as few arm seeks as possible. prealloc_table: The amount of space to preallocate depends on the current file size. The idea is that for small files we do not need 1 MB preallocations, and for large files 1 MB preallocations are not large enough; it is better to preallocate 4 MB. group_prealloc: The amount of space preallocated for small requests to be grouped. 22.2.10 Locking
406. ort for failed tests. This includes the test strategy, operations done by the test suite, and the failures. Each subtest (for instance, 'access', 'create') usually contains many single tests. The report shows exactly which single testing fails. In this case, you can find more information directly from the VSX source code. For example, if the fifth single test of subtest chmod failed, you could look at the source /home/tet/test_sets/test/POSIX.os/files/chmod/chmod.c, which contains a single test array: public struct tet_testlist tet_testlist[] [array entries pairing each test function with its number, test1 through test23; the listing was garbled in extraction]. If this single test is causing problems (as in the case of a kernel panic), or if you are trying to isolate a single failure, it may be useful to narrow the tet_testlist array down to the single test in question, and then recompile the test suite. Then you can create a new tarball of the resulting TESTROOT directory, with an appropriate name (like TESTROOT-chmod-5-only.tgz) and re-run the POSIX suite. It may also be helpful to edit the scen.exec file to run only the test set in question: total tests in POSIX.os 1; /tset/POSIX.os/files/chmod/T
407. ossible to mismatch storage devices with their Lustre servers. If Lustre tries to mount such devices incorrectly, it would report a UUID mismatch to the syslog and refuse to mount. Does the mount option "--bind" allow mounting a Lustre file system to multiple directories on the same client system? Yes, this is supported. In fact, it is entirely handled by the VFS. No special file system support is required. What operations take place in Lustre when a new file is created? This is a high-level description of what operations take place in Lustre when a new file is created. It corresponds to Lustre version 1.4.5.
    On the Lustre client: Create (path, file, mode). For every component in path, execute IT_LOOKUP intent (LDLM_ENQUEUE RPC) to MDS. Execute IT_OPEN intent (LDLM_ENQUEUE RPC) to MDS.
    On the MDS: Lock the parent directory. Create the file. Setattr on the file to set desired owner/mode. Setattr on parent to update ATIME/CTIME. Determine the default striping pattern. Set the file's extended attribute to the desired striping pattern. For every OST that this file will have stripes on, see if there is a spare. Assign precreated objects, if any, to the file. Update the extended attribute holding OST oids. Reply to client with no lock in reply.
    On the journal: ext3 journaling is asynchronous unless a ha
408. ost polling (0) determines whether this host will poll or block for MX request completions. A value of 0 blocks, and any positive value will poll that many times before blocking. Since polling increases CPU usage, we suggest that you set this to zero (0) on the client and experiment with different values for servers. CHAPTER 31 System Configuration Utilities (man8). This chapter includes system configuration utilities and includes the following sections: mkfs.lustre; tunefs.lustre; lctl; mount.lustre; Additional System Configuration Utilities. 31.1 mkfs.lustre. The mkfs.lustre utility formats a disk for a Lustre service. Synopsis: mkfs.lustre <target_type> [options] device, where <target_type> is one of the following:
    Option: Description
    --ost: Object Storage Target (OST)
    --mdt: Metadata Storage Target (MDT)
    --mgs: Configuration Management Service (MGS), one per site. This service can be combined with one --mdt service by specifying both types.
    Description: mkfs.lustre is used to format a disk device for use as part of a Lustre file system. After formatting, a disk can be mounted to start the Lustre service defined by this command. When the file system is created, parameters can simply be added as a --param option to the mkfs.lustre command. See Setting Paramet
409. ount -t ldiskfs /dev/MDSDEV /mnt/mds; cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav; cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav; dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1; umount /mnt/mds; mount -t lustre /dev/MDSDEV /mnt/mds. Lustre version 1.6.5 and later should not encounter this problem. How do I determine which Ethernet interfaces Lustre uses? Use the lctl list_nids command to show the interfaces that Lustre is using. Keep in mind that when socklnd bonding is used (e.g., networks=tcp0(eth0,eth1)), the LNET NID only picks up the IP address of the first interface in the network's specification (e.g., the IP address of eth0@tcp), despite LNET trying to make use of both interfaces. Moreover, the Ethernet interface in use is solely determined by the Linux IP routing. For example, if you have two Ethernet interfaces, eth0 and eth1, and you direct LNET to use eth0 only (e.g., networks=tcp(eth0)), traffic can still use eth1 if Linux IP routing selects it because of misconfigured routing (both interfaces are in the same IP network, the routing table entry for eth1 comes first, or by mistake). Glossary (terms listed on this page: ACL, Administrative OST failure, CFS, CMD, Completion Callback, Configlog, Configuration Lock). ACL: Access Control List. An extended attribute associated with a file which contains authorization directives. Administrative OST failure: A configuration directive given to a c
410. out having to back up and restore all OSS data. Caution: If this data is very important to you, we strongly recommend that you try to back it up before you proceed. It is possible to run out of space or inodes in both the MDS and OST file systems. If these file systems reside on some sort of virtual storage device (e.g., LVM Logical Volume, RAID, etc.), it may be possible to increase the storage device size (this is device-specific) and then grow the file system to use this increased space. 1. Prior to doing any sort of low-level changes like this, back up the file system and/or device. See How do I backup/restore a Lustre file system? 2. After the file system or device has been backed up, increase the size of the storage device as necessary. For LVM this would be: lvextend -L {new size} /dev/{vgname}/{lvname} or lvextend -L +{size increase} /dev/{vgname}/{lvname}. 3. Run a full e2fsck on the file system, using the Lustre e2fsprogs (available at the Lustre download site or http://downloads.clusterfs.com/public/tools/e2fsprogs/). Run: e2fsck -f {dev}. 4. Resize the file system to use the increased size of the device. Run: resize2fs -p {dev}. How do I backup/restore a Lustre file system? Several types of Lustre backups are available. CLIENT FILE SYSTEM-LEVEL BACKUPS: It is possible to back up Lustre file systems from a client (or many clients in parallel working in diffe
411. out using Lustre quotas? When mounting an MDT file system, the kernel crashes. What do I do? How do I determine which Ethernet interfaces Lustre uses? How can I check if a file system is active (the MGS, MDT and OSTs are all online)? You can look at /proc/fs/lustre/lov/*/target_obds for ACTIVE versus INACTIVE on MDS/clients. How to reclaim the 5 percent of disk space reserved for root? If your file system normally looks like this:
    df -h /mnt/lustre
    Filesystem  Size  Used  Avail  Use%  Mounted on
    databarn    100G  81G   14G    81%   /mnt/lustre
    You might be wondering: where did the other 5 percent go? This space is reserved for the root user. Currently, all Lustre installations run the ext3 file system internally on service nodes. By default, ext3 reserves 5 percent of the disk for the root user. To reclaim this space for use by all users, run this command on your OSSs: tune2fs [-m reserved_blocks_percent] [device]. This command takes effect immediately. You do not need to shut down Lustre beforehand or restart Lustre afterwards. Why are applications hanging? The most common cause of hung applications is a timeout. For a timeout involving an MDS or failover OST, applications attempting to access the disconnected resource wait until the connection is re-established. In most cases, applications can be interrupted after a timeout with the KILL, INT, TERM, QUIT, or ALRM signals. In some cases, for a
412. p among the OSTs and the MDS within the file system. Only the MDS uses the file system inode quota. This means that the minimum quota for block is 100 MB * (the number of OSTs + the number of MDSs), which is 100 MB * (number of OSTs + 1). The minimum quota for inode is the inode qunit. If you attempt to assign a smaller quota, users may not be able to create files. The default is established at file system creation time, but can be tuned via /proc values (described below). The inode quota is also allocated in a quantized manner on the MDS. This sets a much smaller granularity. It is specified to request a new quota in units of 100 MB and 500 inodes, respectively. If we look at the setquota example again, running this lfs quota command: lfs quota -u bob -v /mnt/lustre displays this command output:
    Disk quotas for user bob (uid 500):
    Filesystem           blocks  quota   limit   grace  files  quota  limit  grace
    /mnt/lustre          207432  307200  30920          1041   10000  11000
    lustre-MDT0000_UUID  992     0       102400         1041   0      5000
    lustre-OST0000_UUID  103204  0       102400
    lustre-OST0001_UUID  103236  0       102400
    The total quota of 30,920 is allotted to user bob, which is further distributed to two OSTs and one MDS with a 102,400 block quota. Note: Values appended with "*" show the limit that has been over-used (exceeding the quota), and receives this message: Disk quota exceeded. For example: cp: writing /mnt/lustre/var/cache/fontconf
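The minimum block quota rule in the excerpt above can be checked with a few lines of Python. This is a minimal sketch, not part of the manual: the function name `min_block_quota_mb` and its parameters are hypothetical, and the 100 MB unit is taken from the text as the block qunit.

```python
def min_block_quota_mb(num_osts, num_mds=1, qunit_mb=100):
    """Smallest block quota (in MB) that gives every target one quota unit.

    Follows the rule quoted above: 100 MB * (number of OSTs + number of MDSs).
    The function and parameter names are illustrative assumptions.
    """
    if num_osts < 1 or num_mds < 1:
        raise ValueError("need at least one OST and one MDS")
    return qunit_mb * (num_osts + num_mds)


# Two OSTs and one MDS, as in the 'bob' example: 100 MB * (2 + 1)
print(min_block_quota_mb(2))
```

With two OSTs and one MDS this gives 300 MB, which matches the three 102,400-block (100 MB) per-target limits shown in the lfs quota output.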
413. p qc_id. qc_type is USRQUOTA or GRPQUOTA. UUID may be filled with OBD UUID string to query quota information from a specific node. dqb_valid may be set nonzero to query information only from MDS. If UUID is an empty string and dqb_valid is zero, then cluster-wide limits and usage are returned. On return, obd_dqblk contains the requested information (block limits unit is kilobyte). Quotas must be turned on before using this command. LUSTRE_Q_SETQUOTA: Sets disk quota limits for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. dqb_valid must be set to QIF_ILIMITS, QIF_BLIMITS or QIF_LIMITS (both inode limits and block limits), dependent on updating limits. obd_dqblk must be filled with limits values (as set in dqb_valid; block limits unit is kilobyte). Quotas must be turned on before using this command. LUSTRE_Q_GETINFO: Gets information about quotas. qc_type is either USRQUOTA or GRPQUOTA. On return, dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in seconds), dqi_flags is not used by the current Lustre version. LUSTRE_Q_SETINFO: Sets quota information (like grace times). qc_type is either USRQUOTA or GRPQUOTA. dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in seconds), dqi_flags is not used by the current Lustre version and must be zeroed. Return Values: llapi_quotactl() returns 0 on success, -1 on failure and sets error nu
414. p1, for example, it is necessary to specify the module parameters and ethX interfaces, like this: options lnet networks=tcp0(eth0),tcp1(eth1). Note: The requirement to specify explicit interfaces is not consistent across all LNDs used with Lustre, and LND behavior may change over time. We recommend that, if your multi-homed settings do not work, try specifying the ethX interfaces in the options lnet networks line. All LNET routers that bridge two networks are equivalent. Their configuration is not primary or secondary. All available routers balance their overall load. With the router checker configured, Lustre nodes can detect router health status, avoid those that appear dead, and reuse the ones that restore service after failures. To do this, LNET routing must correspond exactly with the Linux nodes' map of alive routers. There is no hard limit on the number of LNET routers. Note: When multiple interfaces are available during the network setup, Lustre chooses the best route. Once the network connection is established, Lustre expects the network to stay connected. In a Lustre network, connections do not fail over to the other interface, even if multiple interfaces are available on the same node. Under Linux 2.6, the LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under lnet, and LND-specific parameters under the corresponding
415. pages_per_rpc: This tunable is the maximum number of pages that will undergo I/O in a single RPC to the OST. The minimum is a single page and the maximum for this setting is platform dependent (256 for i386/x86_64, possibly less for ia64/PPC with larger PAGE_SIZE), though generally amounts to a total of 1 MB in the RPC. /proc/fs/lustre/osc/<object name>/max_rpcs_in_flight: This tunable is the maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC tries to initiate an RPC but finds that it already has the same number of RPCs outstanding, it will wait to issue further RPCs until some complete. The minimum setting is 1 and maximum setting is 32. If you are looking to improve small file I/O performance, increase the max_rpcs_in_flight value. To maximize performance, the value for max_dirty_mb is recommended to be 4 * max_pages_per_rpc * max_rpcs_in_flight. Note: The <object name> varies depending on the specific Lustre configuration. For <object name> examples, refer to the sample command output. 22.2.2 Watching the Client RPC Stream. In the same directory is a file that gives a histogram of the make-up of previous RPCs:
    cat /proc/fs/lustre/osc/spfs-OST0000-osc-c45f9c00/rpc_stats
    snapshot_time: 1174867307.156604 (secs.usecs)
    read RPCs in flight: 0
    write RPCs in flight: 0
    pending write pages: 0
    pending read pages: 0
    read write pages per rpc rpcs cums rpcs
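The max_dirty_mb recommendation above multiplies a page count by an RPC count, so a unit conversion is needed to arrive at megabytes. The sketch below assumes 4 KB pages (so 256 pages make the 1 MB RPC mentioned in the text); the function name and the `page_size` parameter are illustrative assumptions, not manual content.

```python
def recommended_max_dirty_mb(max_pages_per_rpc, max_rpcs_in_flight, page_size=4096):
    """Apply the rule 4 * max_pages_per_rpc * max_rpcs_in_flight, in MB.

    page_size is an assumption (4 KB, as on i386/x86_64); the manual
    only states that 256 pages generally amount to 1 MB per RPC.
    """
    if not 1 <= max_rpcs_in_flight <= 32:
        raise ValueError("max_rpcs_in_flight must be between 1 and 32")
    if max_pages_per_rpc < 1:
        raise ValueError("max_pages_per_rpc must be at least 1")
    rpc_size_mb = max_pages_per_rpc * page_size / (1024 * 1024)
    return 4 * rpc_size_mb * max_rpcs_in_flight


# 256 pages per RPC (1 MB) with 8 RPCs in flight
print(recommended_max_dirty_mb(256, 8))
```

Under these assumptions, 1 MB RPCs with 8 RPCs in flight suggest a max_dirty_mb of 32.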
416. part of Lustre file system shutdown process. Users can restart debug_daemon by using the start command after each stop command issued. This is an example using debug_daemon with the interactive mode of lctl to dump debug logs to a 10 MB file: utils/lctl. To start daemon to dump debug_buffer into a 40 MB /tmp/dump file: lctl > debug_daemon start /trace/log 40. To completely shut down the daemon: lctl > debug_daemon stop. To start another daemon with an unlimited file size: lctl > debug_daemon start /tmp/unlimited. The text message "End of debug_daemon trace log" appears at the end of each output file. 23.2.2 Controlling the Kernel Debug Log. The masks in /proc/sys/portals/subsystem_debug and /proc/sys/portals/debug control the amount of information printed to the kernel debug logs. The subsystem_debug mask controls the subsystems (example: obdfilter, net, portals, OSC, etc.) and the debug mask controls the debug types written out to the log (example: info, error, trace, alloc, etc.). To turn off Lustre debugging: sysctl -w lnet.debug=0. To turn on full Lustre debugging: sysctl -w lnet.debug=-1. To turn on logging of messages related to network communications: sysctl -w lnet.debug=net. To turn on logging of messages related to network communications and existing debug flags: sysctl -w lnet.debug=+net. To turn off network logging
417. ping using llapi 24 18 supported interconnects 3 2 operating systems 3 2 platforms 3 2 supported networks cib Cisco Topspin 2 2 Cray Seastar 2 2 Elan Quadrics Elan 2 2 GM and MX Myrinet 2 2 iib Infinicon InfiniBand 2 2 o2ib OFED 2 2 openib Mellanox Gold InfiniBand 2 2 ra RapidArray 2 2 TCP 2 2 vib Voltaire InfiniBand 2 2 system capacity 1 14 T TCP 2 2 timeouts handling 27 23 tips root squash 25 6 Troubleshooting number of OSTs needed for sustained throughput 21 21 troubleshooting consideration in connecting a SAN with Lustre 21 14 default striping 21 12 drawbacks in doing multi client O_APPEND writes 21 20 erasing a file system 21 13 error messages 21 5 handling timeouts on initial Lustre setup 21 18 handling debugging bind address already in use error 21 15 handling debugging Lustre Error xxx went back in time 21 19 handling debugging error 28 21 16 identifying a missing OST 21 10 log message out of memory on OST 21 20 logs 21 5 Lustre Error slow start_page_write 21 19 OST object missing or damaged 21 9 OSTs become read only 21 10 reclaiming reserved disk space 21 13 recovering from an unavailable OST 21 7 replacing an existing OST or MDS 21 16 setting SCSI I O sizes 21 21 slowdown occurs during Lustre startup 21 20 triggering watchdog for PID NNN 21 17 write performance better than read performance 21 8 tunables RPC stream 22 12 tunables lockl
418. plenty of memory. At scale, the Lustre cluster can include up to 1,000 OSSs and 100,000 clients (see FIGURE 1-3). FIGURE 1-3 Lustre cluster at scale [figure; recoverable labels: MDS disk storage containing Metadata Targets (MDT); pool of clustered Metadata Servers (MDS), 1-100; Object Storage Servers (OSS), 1-1000s; OSS storage with Object Storage Targets (OST); Elan, Myrinet, InfiniBand; router; shared storage enables failover; enterprise-class storage arrays and SAN fabric]. 1.4 Files in the Lustre File System. Traditional UNIX disk file systems use inodes, which contain lists of block numbers where file data for the inode is stored. Similarly, for each file in a Lustre file system, one inode exists on the MDT. However, in Lustre, the inode on the MDT does not point to data blocks, but instead points to one or more objects associated with the files. This is illustrated in FIGURE 1-4. These objects are implemented as files on the OST file systems and contain file data. FIGURE 1-4 MDS inodes point to objects, ext3 inodes point to data [figure; recoverable labels: File on MDT; Ordinary ext3 File; Extended Attributes; inode; Direct; Indirect; Double Indirect; Data Block ptrs; Data Blocks].
419. plicable are listed. FIGURE 5-2 Product Data [screenshot of the Sun Microsystems Product Registration client (v2.0.1), "View product data" step, listing located products such as Red Hat Enterprise Linux, SuSE Linux Enterprise Server, Lustre Client 1.6.6, Lustre MGS 1.6.6, Lustre MDS 1.6.6, Lustre OSS 1.6.6, and SUN FIRE X4440; to save this information and register these products with Sun Connection later, click the Save As button]. 4. If the list of located products does not look complete, select Back and enter a more accurate search. Note: Located service tags are not limited to Lustre components. The Registration client locates any Sun product on your system that is supported in the Sun inventory managemen
420. pm; libgnutls > gnutls-1.2.10-1.i386.rpm; liblzo > lzo2-2.02-1.1.fc3.rf.i386.rpm; glib > glib-2.6.1-2.i586.rpm; glib-devel > glib-devel-2.6.1-2.i586.rpm. 8.7.2 Configuring the Hardware. Heartbeat v2 runs well with an unaltered v1 configuration. This makes upgrading simple. You can test the basic function and quickly roll back if issues appear. Heartbeat v2 does not require a virtual IP address to be associated with a resource. This is good, since we do not use virtual IPs. Heartbeat v2 supports multi-node clusters (of more than two nodes), though it has not been tested for a multi-node cluster. This section describes only the two-node case. The multi-node setup adds a score value to the resource configuration. This value is used to decide the proper node for a resource when failover occurs. Heartbeat v2 adds a resource manager (crm). The resource configuration is maintained as an XML file. This file is re-written by the cluster frequently. Any alterations to the configuration should be made with the HA tools or when the cluster is stopped. 8.7.2.1 Hardware Preconditions. The basic cluster assumptions are the same as those for Heartbeat v1. For the sake of clarity, here are the preconditions: The setup must consist of a failover pair, where each node of the pair has access to shared storage. If possible, the storage paths should be identical (d1_q_0: /dev/sda, d2_q_0: /dev
421. ponent type. SNMP also reports total or free numbers for objects like OSD and OSC or other files (LOV, MDC, and so on). The Lustre SNMP module provides one read/write variable, sysStatus, which starts and stops Lustre. The sysHealthCheck object reports status either as "healthy" or "not healthy" and provides information for the failure. The Lustre SNMP module generates traps on the detection of LBUG (lustrePortalsCatastropeTrap), and detection of various OBD-specific healthchecks (lustreOBDUnhealthyTrap). CHAPTER 15 Backup and Restore. This chapter describes how to perform backup and restore on Lustre, and includes the following sections: Lustre Backups; Restoring from a File-level Backup; LVM Snapshots on Lustre Targets. 15.1 Lustre Backups. Lustre provides backups at several levels. Generally, file system-level backups are recommended over device-level backups. 15.1.1 File System-level Backups. File system-level backups give you full control over the files to back up, and allow restoration of individual files as needed. File system-level backups are also the easiest to integrate into existing backup solutions. File system backups are performed from a Lustre client (or many clients working in parallel in different directories) rather than on individual server nodes; this is no different than backing up any other file syst
422. profiles, utilities (e.g., lustre_rmmod, e2scan, l_getgroups, llobdstat, llstat, plot-llstat, routerstat, and ll_recover_lost_found_objs), and tools to manage large clusters, perform application profiling, and debug Lustre. Configuring Lustre on LVM Devices. If you want to use the Linux LVM (Logical Volume Management) to make snapshots for Lustre backup purposes, the MDT and/or OSTs need to be initially configured on an LVM block device. Use this procedure to configure Lustre using LVM. 1. Create LVM volumes for the MDT and OSTs. First, create LVM devices for your MDT and OST targets. Do not use the entire disk for the targets, as some space is required for the snapshots. The snapshots start at size 0, but they increase in size as you change the backup file system. In general, if you expect a 20% change in the file system between backups, then the most recent snapshot will be 20% of the target size, the next older one will be 40%, and so on. Here is an example:
    cfs21:~# pvcreate /dev/sda1
      Physical volume "/dev/sda1" successfully created
    cfs21:~# vgcreate volgroup /dev/sda1
      Volume group "volgroup" successfully created
    cfs21:~# lvcreate -L200M -nMDT volgroup
      Logical volume "MDT" created
    cfs21:~# lvcreate -L200M -nOST0 volgroup
      Logical volume "OST0" created
    cfs21:~# lvscan
      ACTIVE '/dev/volgroup/MDT' [200.00 MB] inherit
      ACTIVE '/dev/volgroup/OST0' [200.00 MB] inherit
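The snapshot sizing rule of thumb above (20% for the most recent snapshot, 40% for the next older one, and so on) can be turned into a small space estimate. This is a rough sketch under the assumption that each backup interval changes the same fraction of the target; the function name and parameters are hypothetical.

```python
def snapshot_sizes_mb(target_mb, change_percent, snapshots):
    """Estimated size of each retained snapshot, newest first.

    Assumes the same percentage of the target changes between every pair
    of backups, so the k-th newest snapshot holds k intervals of changes.
    """
    if snapshots < 0 or change_percent < 0:
        raise ValueError("snapshots and change_percent must be non-negative")
    return [target_mb * change_percent * age / 100 for age in range(1, snapshots + 1)]


# A 200 MB target (as in the lvcreate example) with 20% change per backup
sizes = snapshot_sizes_mb(200, 20, 2)
print(sizes, "total extra space:", sum(sizes), "MB")
```

The total is the space that would need to stay unallocated in the volume group for the snapshots, which is why the text advises against giving the whole disk to the targets.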
423. r Kerberos uses the kernel keyring client upcall mechanism. Configuring Kerberos for Lustre. This section describes supported Kerberos distributions and how to set up and configure Kerberos on Lustre. Kerberos Distributions Supported on Lustre. Lustre supports the following Kerberos distributions: MIT Kerberos 1.3.x; MIT Kerberos 1.4.x; MIT Kerberos 1.5.x; MIT Kerberos 1.6 (not yet verified). On a number of operating systems, the Kerberos RPMs are installed when the operating system is first installed. To determine if Kerberos RPMs are installed on your OS, run: rpm -qa | grep krb. If Kerberos is installed, the command returns a list like this: krb5-devel-1.4.3-5.1, krb5-libs-1.4.3-5.1, krb5-workstation-1.4.3-5.1, pam_krb5-2.2.6-2.2. Note: The Heimdal implementation of Kerberos is not currently supported on Lustre, although support will be added in an upcoming release. 11.2.1.2 Preparing to Set Up Lustre with Kerberos. To set up Lustre with Kerberos: 1. Configure NTP to synchronize time across all machines. 2. Configure DNS with zones. 3. Verify that there are fully-qualified domain names (FQDNs) that are resolvable in both forward and reverse directions for all servers. This is required by Kerberos. 4. On every node, install the following packages: libgssapi (version 0.10 or higher). Some newer Linux distributions include libgssapi by default. If yo
424. r Linux and provides a POSIX-compliant UNIX file system interface. The Lustre architecture is used for many different kinds of clusters. It is best known for powering seven of the ten largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use Lustre as a site-wide global file system, serving dozens of clusters on an unprecedented scale. The scalability of a Lustre file system reduces the need to deploy many separate file systems (such as one for each cluster). This offers significant storage management advantages, for example, avoiding maintenance of multiple data copies staged on multiple file systems. Hand in hand with aggregating file system capacity with many servers, I/O throughput is also aggregated and scales with additional servers. Moreover, throughput (or capacity) can be easily adjusted by adding servers dynamically. Lustre has been integrated with several vendors' kernels. We offer Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise (SUSE) kernels with Lustre patches. 1.1.1 Lustre Key Features. The key features of Lustre include: Scalability: Lustre scales up or down with respect to the number of client nodes, disk storage and bandwidth. Currently, Lustre is running in production environments with up to 26,000 client nodes
425. r redundancy, RAID5 is not acceptable for large clusters and RAID6 is a must. Take a 1 PB file system (2,000 disks of 500 GB capacity). The MTTF (Mean Time to Failure) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is close to 1 day (500 GB at 5 MB/sec = 100,000 sec, or about 1 day). If we have a RAID 5 stripe that is 10 disks wide, then during 1 day of rebuilding, the chance that a second disk in the same array fails is about 9/1000, or roughly 1/100. This means that, in the expected period of 50 days, a double failure in a RAID 5 stripe leads to data loss. So, RAID 6 or another double-parity algorithm is necessary for OST storage. For better performance, we recommend that you create RAID sets with no more than 8 data disks (+1 or +2 parity disks), as this will provide more IOPS from having multiple independent RAID sets. 10.1.2 File system. Use RAID5 with 5 or 9 disks or RAID6 with 6 or 10 disks, each on a different controller. Stripe width is the optimal minimum I/O size; the number of 1 MB Lustre RPCs to fit evenly on one RAID stripe without an expensive read-modify-write cycle. Use this formula to determine stripe_width: <stripe_width> = <chunk_size> * (<disks> - <parity_disks>) <= 1 MB, where <parity_disks> is 1 for RAID5/RAID-Z and 2 for RAID6/RAID-2Z. If the R
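The reliability arithmetic and the stripe_width formula in the excerpt above can be reproduced step by step. The following is a back-of-the-envelope sketch using the figures quoted in the text; all function names are hypothetical, and the model is the manual's simple linear approximation, not a rigorous reliability analysis.

```python
def failures_per_day(disks, mttf_days):
    """Expected disk failures per day across the whole file system."""
    return disks / mttf_days


def rebuild_seconds(disk_mb, rebuild_mb_per_sec):
    """Time to rebuild one disk at the given rebuild bandwidth."""
    return disk_mb / rebuild_mb_per_sec


def second_failure_probability(stripe_disks, rebuild_days, mttf_days):
    """Chance that one of the remaining disks in the stripe fails during rebuild."""
    return (stripe_disks - 1) * rebuild_days / mttf_days


def stripe_width_kb(chunk_kb, disks, parity_disks):
    """stripe_width = chunk_size * (disks - parity_disks); should be <= 1024 KB."""
    return chunk_kb * (disks - parity_disks)


rate = failures_per_day(2000, 1000)          # 2 disks per day
rebuild = rebuild_seconds(500_000, 5)        # 500 GB at 5 MB/sec
p = second_failure_probability(10, 1, 1000)  # 9/1000
days_to_double_failure = 1 / (rate * p)

print(rate, rebuild, p, round(days_to_double_failure, 1))
print(stripe_width_kb(256, 6, 2))            # RAID6, 6 disks, 256 KB chunks (example values)
```

The unrounded result is a little over 55 days; the manual's figure of 50 days follows from rounding 9/1000 to 1/100. The 256 KB chunk size in the last line is an example value chosen so that a 6-disk RAID6 set lands exactly on the 1 MB limit.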
426. r time and I/O rates for all OSTs on a server, the llobdstat utility provides utilization graphs for selectable time scales. Usage: llobdstat <ost_name> <interval>.
    Parameter: Description
    ost_name: The OST name under /proc/fs/lustre/obdfilter
    interval: Sample interval (in seconds)
    Example: llobdstat lustre-OST0000 2. Interpreting MDT Statistics. The MDT stats files can be used to track MDT statistics for the MDS. Here is sample output for an MDT stats file:
    cat /proc/fs/lustre/mds/*-MDT0000/stats
    snapshot_time 1244832003.676892 secs.usecs
    open 2 samples [reqs]
    close 1 samples [reqs]
    getxattr 3 samples [reqs]
    process_config 1 samples [reqs]
    connect 2 samples [reqs]
    disconnect 2 samples [reqs]
    statfs 3 samples [reqs]
    setattr 1 samples [reqs]
    getattr 3 samples [reqs]
    llog_init 6 samples [reqs]
    notify 16 samples [reqs]
    CHAPTER 23 Lustre Debugging. This chapter describes tips and information to debug Lustre, and includes the following sections: Lustre Debug Messages; Tools for Lustre Debugging; Troubleshooting with strace; Looking at Disk Content; Ptlrpc Request History. Lustre is a complex system that requires a rich debugging environment to help locate problems. 23.1 Lustre Debug Messages. Each Lustre debug message has the tag of the subsystem it originated in, the message type, and the location in the source code. The
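The MDT stats output shown above has a simple line-oriented layout (a counter name, a sample count, and the word "samples"), which makes it easy to post-process. The parser below is a minimal sketch based only on that sample; the function name is hypothetical, and real stats files may contain additional columns that this code ignores.

```python
def parse_mdt_stats(text):
    """Return {counter_name: sample_count} from an MDT stats dump.

    Lines are expected to look like 'open 2 samples [reqs]'. The
    snapshot_time line and anything else not in that shape is skipped.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "samples" and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


sample = """\
snapshot_time 1244832003.676892 secs.usecs
open 2 samples [reqs]
close 1 samples [reqs]
getxattr 3 samples [reqs]
llog_init 6 samples [reqs]
notify 16 samples [reqs]
"""

stats = parse_mdt_stats(sample)
print(stats)
```

Comparing two such dictionaries taken a known interval apart gives per-operation request rates for the MDS.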
427. r write calls from a process did not access the next sequential location. The offset field is reset to 0 (zero) whenever a different file is read/written. Read/write offset statistics are off by default. The statistics can be activated by writing anything into the offset_stats file. Example:
    cat /proc/fs/lustre/llite/lustre-57dee00/rw_offset_stats
    snapshot_time: 1155748884.591028 (secs.usecs)
    R/W  PID   RANGE START  RANGE END  SMALLEST EXTENT  LARGEST EXTENT  OFFSET
    R    8385  0            128        128              128             0
    R    8385  0            224        224              224             -128
    W    8385  0            250        50               100             0
    W    8385  100          1110       10               500             -150
    W    8384  0            5233       5233             5233            0
    R    8385  500          600        100              100             -610
    Where:
    Field: Description
    R/W: Whether the non-sequential call was a read or write
    PID: Process ID which made the read/write call
    Range Start/Range End: Range in which the read/write calls were sequential
    Smallest Extent: Smallest extent (single read/write) in the corresponding range
    Largest Extent: Largest extent (single read/write) in the corresponding range
    Offset: Difference from the previous range end to the current range start
    For example, Smallest Extent indicates that the writes in the range 100 to 1110 were sequential, with a minimum write of 10 and a maximum write of 500. This range was started with an offset of -150. That means this is the difference between the last entry's range end and this entry's range start for the same file. The rw_offset_stats f
428. racking of these system components, so administrators can better manage their Lustre environment. Note: Service tags are used solely to provide an inventory list of system and software information to Sun; they do not contain any personal information. Service tag components that communicate information are read-only and contained. They are not capable of accepting information and they cannot communicate with any other services on your system. For more information on service tags, see the Service Tag wiki and Service Tag FAQ. Using Service Tags. To begin using service tags with your Lustre system, download the service tag package and registration client. The entire service tag process can be easily managed from the Sun Inventory webpage. Installing Service Tags. Service tag packages for RedHat and SuSE Linux are downloadable from the Sun Lustre downloads page. To download and install the service tags package: 1. Navigate to the Sun Lustre download page and download the service tag package, sun-servicetag-1.1.4-1.i386.rpm, for Lustre. Install the service tag package on all Lustre nodes (MGSs, MDSs, OSSs and clients). The service tag package includes several init.d scripts which are started on reboot (/etc/init.d/stosreg and /etc/init.d/psn start). This package also adds entries in the [x]inetd's configuration scripts to provide remote access to the nodes needed to collect information. The script restarts
429. rallel by other utilities (such as rsync) using multiple clients. Another useful tool is a modified version of GNU tar (gtar), which can back up and restore extended attributes (i.e., file striping) for Lustre. (Files backed up using the modified version of gtar are restored per the backed-up striping information. The backup procedure does not use default striping rules.) Other current features of Lustre are described in detail in this manual. Future features are described in the Lustre roadmap. 1.2 Lustre Components. A Lustre file system consists of the following basic components (see FIGURE 1-1). Metadata Server (MDS): The MDS server makes metadata stored in one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system(s) and provides network request handling for one or more local MDTs. Metadata Target (MDT): The MDT stores metadata (such as filenames, directories, permissions and file layout) on an MDS. Each file system has one MDT. An MDT on a shared storage target can be available to many MDSs, although only one should actually use it. If an active MDS fails, a passive MDS can serve the MDT and make it available to clients. This is referred to as MDS failover. Object Storage Servers (OSS): The OSS provides file I/O service and network request handling for one or more local OSTs. Typically, an OSS serves between
430. 30.2.9 MX LND. MXLND supports a number of load-time parameters using Linux's module parameter system. The following variables are available:
    Variable: Description
    n_waitd: Number of completion daemons.
    max_peers: Maximum number of peers that may connect.
    cksum: Enables small message (< 4 KB) checksums if set to a non-zero value.
    ntx: Number of total tx message descriptors.
    credits: Number of concurrent sends to a single peer.
    board: Index value of the Myrinet board (NIC).
    ep_id: MX endpoint ID.
    polling: Use zero (0) to block (wait). A value > 0 will poll that many times before blocking.
    hosts: IP-to-hostname resolution file.
    Of the described variables, only hosts is required. It must be the absolute path to the MXLND hosts file. For example: options kmxlnd hosts=/etc/hosts.mxlnd. The file format for the hosts file is: IP HOST BOARD EP_ID. The values must be space and/or tab separated, where: IP is a valid IPv4 address; HOST is the name returned by `hostname` on that machine; BOARD is the index of the Myricom NIC (0 for the first card, etc.); EP_ID is the MX endpoint ID. To obtain the optimal performance for your platform, you may want to vary the remaining options. n_waitd (1) sets the number of threads that process completed MX requests (sends and receives). max_peers (1024) tells MXLND the upper limit o
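The MXLND hosts file format described above (IP HOST BOARD EP_ID, separated by spaces and/or tabs) is easy to validate before loading the module. The checker below is an illustrative sketch: the function and type names are hypothetical, the sample host names and addresses are made up, and skipping blank lines is an assumption that the manual does not state.

```python
import ipaddress
from typing import List, NamedTuple


class MxHost(NamedTuple):
    ip: str
    host: str
    board: int
    ep_id: int


def parse_mxlnd_hosts(text: str) -> List[MxHost]:
    """Parse 'IP HOST BOARD EP_ID' lines; raise ValueError on malformed input."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of spaces and/or tabs
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 fields, got {len(fields)}")
        ip, host, board, ep_id = fields
        ipaddress.IPv4Address(ip)  # raises a ValueError subclass if not valid IPv4
        entries.append(MxHost(ip, host, int(board), int(ep_id)))
    return entries


# Hypothetical example content for a hosts file such as /etc/hosts.mxlnd
sample = "192.168.1.10\tmds1\t0\t3\n192.168.1.11 oss1 0 3\n"
for entry in parse_mxlnd_hosts(sample):
    print(entry)
```

Running a check like this on every node catches a malformed line before the kmxlnd module is loaded with the hosts option.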
431. rderly manner. If two cluster nodes can communicate, they usually shut down properly. This means many tests do not produce a STONITH, for example:
    - Calling init 0 or shutdown or reboot on a node: orderly halt, no STONITH
    - Stopping the heartbeat service on a node: again, orderly halt, no STONITH
    You have to do something drastic (for example, killall -9 heartbeat, or pulling cables) before you trigger STONITH. Also, the alert script does a software failover, which halts Lustre but does not halt or STONITH the system. To use STONITH, edit the fail_lustre.alert script and add your preferred shutdown command after the line:
        /usr/lib/heartbeat/hb_standby local &
    A simple method to halt the system is the sysrq method. Run:
        #!/bin/bash
        # This script forces a reboot.
        # echo s = sync
        # echo u = remount read-only
        # echo b = reboot
        SYST="/proc/sysrq-trigger"
        if [ ! -f "$SYST" ]; then
            echo "$SYST not found!"
            exit 1
        fi
        # sync, unmount, sync, reboot
        echo s > "$SYST"
        echo u > "$SYST"
        echo s > "$SYST"
        echo b > "$SYST"
        exit 0
    8.6 Using MMP
    The multiple mount protection (MMP) feature protects the file system from being mounted more than one time simultaneously. If the file system is mounted, MMP also protects changes by e2fsprogs to the file system. This feature is very important in a shared storage environme
432. re-1.6.5.1
    b. Unpack the kernel. For this procedure, we assume that the resulting source tree (also known as the destination tree) is in /tmp/kernels/linux-2.6.18.
    2. Select a config file for your kernel, located in the kernel_configs directory (lustre/kernel_patches/kernel_configs). The kernel_configs directory contains the .config files, which are named to indicate the kernel and architecture with which they are associated. For example, the configuration file for the 2.6.18 kernel shipped with RHEL 5 (suitable for i686 SMP systems) is kernel-2.6.18-2.6-rhel5-i686-smp.config.
    3. Select the series file for your kernel, located in the series directory (lustre/kernel_patches/series). The series file contains the patches that need to be applied to the kernel.
    4. Set up the necessary symlinks between the kernel patches and the Lustre source. This example assumes that the Lustre source files are unpacked under /tmp/lustre-1.6.5.1 and you have chosen the 2.6-rhel5.series file. Run:
        cd /tmp/kernels/linux-2.6.18
        rm -f patches series
        ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-rhel5.series ./series
        ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/patches .
    5. Use Quilt to apply the patches in the selected series file to the unpatched kernel. Run:
        cd /tmp/kernels/linux-2.6.18
        quilt push -av
    The patched destination tree acts as a base Linux source tree for Lustre.
    3.3.2 C
433. re formatted using the standard UNIX mkfs command. Next, the volumes carrying the Lustre file system targets are mounted on the server nodes as local file systems. Finally, the Lustre client systems are mounted (in a manner similar to NFS mounts).
    The configuration commands listed below are for the Lustre cluster shown in FIGURE 1-7.
    On the MDS (mds.your.org@tcp0):
        mkfs.lustre --mdt --mgs --fsname=large-fs /dev/sda
        mount -t lustre /dev/sda /mnt/mdt
    On OSS1:
        mkfs.lustre --ost --fsname=large-fs --mgsnode=mds.your.org@tcp0 /dev/sdb
        mount -t lustre /dev/sdb /mnt/ost1
    On OSS2:
        mkfs.lustre --ost --fsname=large-fs --mgsnode=mds.your.org@tcp0 /dev/sdc
        mount -t lustre /dev/sdc /mnt/ost2
    FIGURE 1-7 A simple Lustre cluster
    1.6 Lustre Networking
    In clusters with a Lustre file system, the system network connects the servers and the clients. The disk storage behind the MDSs and OSSs connects to these servers using traditional SAN technologies, but this SAN does not extend to the Lustre client system. Servers and clients communicate with one another over a custom networking API known as Lustre Networking (LNET). LNET interoperates with a variety of network transports through Network Abstraction Layers (NAL). Key features of LNET include:
    - RDMA, when supported by underlying networks such
434. re information about PowerMan, go to: http://www.llnl.gov/linux/powerman.html
    Other power management software is available, but PowerMan is the best we have used so far, and the one with which we are most familiar.
    Power Equipment
    A multi-port, Ethernet-addressable RPC is relatively inexpensive. For recommended products, see the list of supported hardware on the PowerMan website. If you can afford them, Linux Network ICEboxes are very good tools. They combine both remote power control and remote serial console in a single unit.
    Cluster management software
    There are two options for cluster management software that have been implemented successfully by Lustre customers. Both software options are open source and available free for download.
    - Heartbeat: The Heartbeat program is one of the core components of the High-Availability Linux (Linux-HA) project. Heartbeat is highly portable and runs on every known Linux platform, as well as FreeBSD and Solaris. For information, see: http://linux-ha.org/heartbeat To download, see: http://linux-ha.org/download
    - Red Hat Cluster Manager (CluManager): Red Hat Cluster Manager allows administrators to connect separate systems (called members or nodes) together to create failover clusters that ensure application availability and data integrity under several failure conditions. Administrators can use Red Hat Cluster Manager with database applications, file sharing services, web servers and
435. re_mds principal and generate and install the keytab kadmin gt addprinc randkey lustre_mds mdthost domain REALM kadmin gt ktadd e aes128 cts normal lustre_mds mdthost domain REALM 6 Configure the OSS nodes For each OST node create a lustre_oss principal and generate and install the keytab kadmin gt addprinc randkey lustre_oss oss_host domain REALM kadmin gt ktadd e aes128 cts normal lustre_oss oss_host domain REALM To save the trouble of assigning a unique keytab for each client node create a general lustre_root principal and its keytab and then install the keytab on as many client nodes as needed kadmin gt addprinc randkey lustre_root REALM kadmin gt ktadd e aes128 cts normal lustre_root REALM Note If one client is compromised all client nodes become insecure For more detailed information on installing and configuring Kerberos see http web mit edu Kerberos krb5 1 6 documentation Chapter 11 Kerberos 11 7 11 2 1 5 Setting the Environment Perform the following steps to configure the system and network to use Kerberos System wide Configuration 1 On each MDT OST and client node add the following line to etc fstab to mount them automatically nfsd proc fs nfsd nfsd defaults 0 0 2 On each MDT and client node dd the following line to etc request key conf oe create lgssc usr sbin lgss_keyring o k t d c u oe H oe ae oe n g Networking
436. reate and Install the Lustre Packages
    After patching the kernel, configure it to work with Lustre, create the Lustre packages (RPMs) and install them.
    1. Configure the patched kernel to run with Lustre. Run:
        cd <path to kernel tree>
        cp /boot/config-`uname -r` .config
        make oldconfig || make menuconfig
        make include/asm
        make include/linux/version.h
        make SUBDIRS=scripts
        make include/linux/utsrelease.h
    2. Run the Lustre configure script against the patched kernel and create the Lustre packages:
        cd <path to lustre source tree>
        ./configure --with-linux=<path to kernel tree>
        make rpms
    This creates a set of .rpms in /usr/src/redhat/RPMS/<arch> with an appended date-stamp. The SuSE path is /usr/src/packages.
    Note: You do not need to run the Lustre configure script on an unpatched kernel.
    Example:
        lustre-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
        lustre-debuginfo-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
        lustre-modules-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
        lustre-source-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
    Note: If the steps to create the RPMs fail, contact Lustre Support by opening a bug.
    Note: Lustre supports several features and packages that extend the core functionality of Lustr
437. rectories files not associated with a pool lfs find mnt lustre pool Finds all directories files associated with pool Chapter 27 User Utilities man1 27 11 27 2 27 12 lfsck The e2fsprogs package contains an lfsck tool which does distributed coherency checking for the Lustre file system after e2fsck is run In most cases e2fsck is sufficient to repair any file system issues and lfsck is not required To avoid lengthy downtime you can also run lfsck once Lustre is already started Synopsis lfsck h help n nofix 1 lostfound d delete force v verbose mdsdb mdsdb ostdb ostidb ost2db lt filesystem gt Note As shown the lt filesystem gt parameter refers to the Lustre file system mount point The default mount point is mnt lustre Note For Ifsck database filenames must be provided as absolute pathnames Relative paths do not work the databases cannot be properly opened Lustre 1 8 Operations Manual October 2009 Options The options and descriptions for the lfsck command are listed below Option Description n Performs a read only check does not repair the file system l Puts orphaned objects into a lost found directory in the root of the file system d Deletes orphaned objects from the file system Since objects on the OST are usually only one of several stripes of a file it is often difficult to put multiple objects back togeth
438. rectory to have more interesting names so that they are meaningful in the future 16 4 Lustre 1 8 Operations Manual October 2009 16 3 Isolating and Debugging Failures In the case of Lustre failures you need to capture information about what is happening at runtime For example some tests may cause kernel panics depending on your Lustre configuration By default debugging is not enabled in the POSIX test suite You need to turn on the VSX debugging options There are two debug options of note in the config file tetexec cfg under the TESTROOT directory VSX_DBUG_FILE output_file If you are running the test under UML with hostfs support use a file on the hostfs as the debug output file In the case of a crash the debug output can be safely written to the debug file Note The default value for this option puts the debug log under your test directory in mnt lustre TESTROOT which is not useful in case of kernel panic and Lustre or your machine crashes VSX_DBUG_FLAGS xxxxx The following example makes VSX output all debug messages VSX_DBUG_FLAGS t d n f F L 1 2 p P VSX is based on the TET framework which provides common libraries for VSX You can also have TET print out verbose debug messages by inserting the T option when running the tests For example tcc Tall5 e s scen exec a mnt lustre TESTROOT p 2 gt amp 1 tee tmp POSIX command line output log VSX prints out detailed messages in the rep
439. rent directories via any number of user-level backup tools like tar, cpio, Amanda, and many enterprise-level backup tools. However, due to the very large size of most Lustre file systems, full backups are not always possible. Doing backups of subsets of the file system (subdirectories, per user, incremental by date, etc.) using normal file backup tools is still recommended, as this is the easiest method from which to restore data.
    TARGET RAW DEVICE-LEVEL BACKUPS
    In some cases, it is desirable to do full device-level backups of an individual MDS or OST storage device for various reasons (before hardware replacement, maintenance or such). Doing full device-level backups ensures that all of the data is preserved in the original state and is the easiest method of doing a backup.
    If hardware replacement is the reason for the backup, or if there is a spare storage device, then it is possible to just do a raw copy of the MDS/OST from one block device to the other, as long as the new device is at least as large as the original device, using the command:
        dd if=/dev/{original} of=/dev/{new} bs=1M
    If hardware errors are causing read problems on the original device, then using the command below allows as much data as possible to be read from the original device while skipping sections of the disk with errors:
        dd if=/dev/{original} of=/dev/{new} bs=4k conv=sync,noerror
    Even in the face of hardware errors, the ext3 file system is very robust and it may be po
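The error-tolerant copy in excerpt 439 (dd with bs=4k conv=sync,noerror) reads fixed-size blocks, continues past read errors, and pads short or unreadable blocks with zeros so that later data keeps its offset. The following is a minimal sketch of that idea, not a replacement for dd: the function name `copy_skipping_errors` is hypothetical, and it assumes a read failure surfaces as `OSError`.

```python
import os


def copy_skipping_errors(src, dst, block_size=4096):
    """Copy src to dst block by block; unreadable blocks become zeros.

    Hypothetical illustration of the intent of `dd bs=4k conv=sync,noerror`.
    Returns the number of blocks that could not be read.
    """
    failed = 0
    with open(src, "rb", buffering=0) as fin, open(dst, "wb") as fout:
        # Seeking to the end gives the size of a regular file or block device.
        size = fin.seek(0, os.SEEK_END)
        for offset in range(0, size, block_size):
            try:
                fin.seek(offset)
                block = fin.read(block_size)
            except OSError:
                block = b""  # treat the whole block as unreadable
                failed += 1
            # Pad short or failed blocks with zeros so offsets stay aligned.
            fout.write(block.ljust(block_size, b"\0"))
    return failed
```

As with conv=sync, the final block is padded to the full block size, so the copy can be slightly larger than the source.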
440. request 0x870 0x10b0 ptlrpc ptlrpc_main 0x42e 0x7c0 ptlrpc
    Handling Timeouts on Initial Lustre Setup
    If you come across timeouts or hangs on the initial setup of your Lustre system, verify that name resolution for servers and clients is working correctly. Some distributions configure /etc/hosts so the name of the local machine (as reported by the hostname command) is mapped to localhost (127.0.0.1) instead of a proper IP address. This might produce this error:
        LustreError:(ldlm_handle_cancel()) received cancel for unknown lock cookie 0xe74021a4b41b954e from nid 0x7f000001 (0:127.0.0.1)
    21.4.16 Handling/Debugging "LustreError: xxx went back in time"
    Each time Lustre changes the state of the disk file system, it records a unique transaction number. Occasionally, when committing these transactions to the disk, the last committed transaction number is reported to other nodes in the cluster to assist the recovery. Therefore, the promised transactions remain absolutely safe on the disappeared disk. This situation arises when:
    - You are using a disk device that claims to have data written to disk before it actually does, as in the case of a device with a large cache. If that disk device crashes or loses power in a way that causes the loss of the cache, there can be a loss of transactions that you believe are committed. This is a very serious event, and you shoul
441. resources2cib.py script, typically found in /usr/lib/heartbeat. If you are starting with V2, we recommend that you create a V1-style configuration and convert it, as the V1 style is human-readable. The heartbeat XML configuration is located at /var/lib/heartbeat/cib.xml and the new resource manager is enabled with the crm yes directive in /etc/ha.d/ha.cf. For additional information on CIB, refer to: http://linux-ha.org/ClusterInformationBase/UserGuide
    Heartbeat log daemon
    Heartbeat V2 adds a logging daemon, which manages logging on behalf of cluster clients. The UNIX syslog API makes calls that can block, and Heartbeat requires log writes to complete as a sign of health. This daemon prevents a busy syslog from triggering a false failover. The logging configuration has been moved to /etc/logd.cf, while the directives are essentially unchanged.
    Basic configuration (No STONITH or monitor)
    Assuming two nodes, d1_q_0 and d2_q_0:
    - d1_q_0 owns ost-alpha
    - d2_q_0 owns ost-beta
    - dedicated Ethernet: eth0
    - serial crossover link: /dev/ttyS0
    - remote host for health ping: 192.168.0.3
    Use this procedure:
    1. Create the basic ha.cf and haresources files. haresources no longer requires the dummy virtual IP address. This is an example of /etc/ha.d/haresources:
        oss161.clusterfs.com 192.168.16.35 Filesystem::/dev/sda::/ost1::lustre
        oss162.clusterfs.com 192.168.16.36 Filesystem::/dev/sda::/ost
442. rfaces are specified by the ip2nets or networks module parameter all non loopback IP interfaces are used The address within network is determined by the address of the first IP interface an instance of the socklnd encounters Consider a node on the edge of an InfiniBand network with a low bandwidth management Ethernet eth0 IP over IB configured ipoib0 and a pair of GigE NICs eth1 eth2 providing off cluster connectivity This node should be configured with networks vib tcp eth1 eth2 to ensure that the socklnd ignores the management Ethernet and IPoIB Variable Description timeout Time in seconds that communications may be stalled before the 50 W LND completes them with failure nconnds Sets the number of connection daemons 4 min_reconnectms 1000 W max_reconnectms 6000 W eager_ack 0 on linux 1 on darwin W typed_conns 1 Wc min_bulk 1024 W tx_buffer_size rx_buffer_size 8388608 Wc nagle 0 Wc Minimum connection retry interval in milliseconds After a failed connection attempt this is the time that must elapse before the first retry As connections attempts fail this time is doubled on each successive retry up to a maximum of max_reconnectms Maximum connection retry interval in milliseconds Boolean that determines whether the sockInd should attempt to flush sends on message boundaries Boolean that determines whether the sockind should use different socket
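The retry behaviour described in excerpt 442 for min_reconnectms and max_reconnectms (wait the minimum interval before the first retry, then double the wait after each failed attempt up to the maximum) can be sketched as follows. This is an illustration of the described schedule only, not socklnd code: the function name `reconnect_intervals` is hypothetical, and the defaults of 1000 ms and 6000 ms are the values listed in the excerpt.

```python
def reconnect_intervals(min_ms=1000, max_ms=6000, attempts=5):
    """Return the wait (in ms) before each successive connection retry.

    Hypothetical sketch: start at min_ms, double after every failed
    attempt, and never exceed max_ms.
    """
    waits = []
    wait = min_ms
    for _ in range(attempts):
        waits.append(wait)
        wait = min(wait * 2, max_ms)
    return waits


schedule = reconnect_intervals()  # [1000, 2000, 4000, 6000, 6000]
```

With the listed defaults the interval reaches its cap on the fourth retry and stays there.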
443. ring Lustre file system on MDS MGS OSS and OST with RAID and LVM created above mds16 clusterfs com options lnet networks tcp dev md0 mnt mdt mgs mdt oss161 clusterfs com options lnet networks tcp dev oss_data ost0 mnt ost0 ost 192 168 16 34 tcp0 oss161 clusterfs com options lnet networks tcp dev oss_data ost1 mnt ost1 ost 192 168 16 34 tcp0 oss162 clusterfs com options lnet networks tcp dev pv_oss1 ost2 mnt ost2 ost 192 168 16 34 tcp0 oss162 clusterfs com options lnet networks tcp dev pv_oss2 ost3 mnt ost3 ost 192 168 16 34 tcp0 lustre_config v a d f lustre_config csv This command creates RAID and LVM and then configures Lustre on the nodes or targets specified in lustre_config csv The script prompts you for the password to log in with root access to the nodes After completing the above steps the script makes Lustre target entries in the etc fstab file on Lustre server nodes such as For MDS MDT dev md0 mnt mdtlustre defaults00 For OSS pv_oss1 ost2 mnt ost2lustre defaults00 Start the Lustre services run mount dev sdb mount dev sda Lustre 1 8 Operations Manual October 2009 CHAPTER 7 More Complicated Configurations This chapter describes more complicated Lustre configurations and includes the following sections Multihomed Servers m Elan to TCP Routing Load Balancing with InfiniBand Multi Rail Configurations with LNET
444. ription path The path of the file lum The returned striping information return A value of zero 0 mean the operation was successful A value of a negative number means there was a failure stripe_count Indicates the number of OSTs that this file will be striped across stripe_pattern Indicates the RAID pattern 29 4 Lustre 1 8 Operations Manual October 2009 29 1 3 llapi_file_open The 1lapi_file_open command opens or creates a file with the specified striping parameters Synopsis int llapi_file_open const char name int flags int mode unsigned long stripe_size int stripe_offset int stripe count int stripe_pattern Description The 1lapi_file_open function opens or creates a file with the specified striping parameters If it returns a zero 0 the operation was successful a negative number means there was a failure Option Description name The name of the file flags This opens flags mode This opens modes stripe_size The stripe size of the file stripe_offset The stripe offset stripe_index of the file stripe_count The stripe count of the file stripe_pattern The stripe pattern of the file Chapter 29 Setting Lustre Properties man3 29 5 29 1 4 29 6 llapi_quotactl Use llapi_quotact1 to manipulate disk quotas on a Lustre file system Synopsis include include include include int llap struct i lt liblustre h gt lt lustre lustre_idl h gt
445. rmanently deactivate the OST on the clients and the MDT. On the MGS, run:
        mgs> lctl conf_param <OST name>.osc.active=0
    4.3.11.2 Restoring an OST to the File System
    Restoring an OST to the file system is as easy as activating it. When the OST is active, it is automatically added to the normal stripe rotation and files are written to it. To restore an OST:
    1. Make sure the OST to be restored is running.
    2. Reactivate the OST. Run:
        mgs> lctl conf_param <OST name>.osc.active=1
    4.3.12 Changing a Server NID
    To change a server NID:
    1. Update the LNET configuration in the /etc/modprobe.conf file so the list of server NIDs (lctl list_nids) is correct.
    2. Use the writeconf command to erase the configuration logs for the file system. On the MDT, run:
        mdt> tunefs.lustre --writeconf <mount point>
    After the writeconf command is run, the configuration logs are re-generated as servers restart, and the current server NIDs are used.
    3. If the MGS's NID was changed, communicate the new MGS location to each server. Run:
        tunefs.lustre --erase-param --mgsnode=<new_nid(s)> --writeconf /dev/..
    4.3.13 Aborting Recovery
    You can abort recovery with either the lctl utility or by mounting the target with the abort_recov option (mount -o abort_recov). When starting a target, run:
        mount -t lustre -L <MDT name> -o abort_recov <mount point>
    No
446. roper with option when configuring the Lustre source code 1 Compile and install the Lustre kernel a Install the necessary build tools GCC and related tools must also be installed For more information see Required Lustre Software yum install rpm build redhat rpm config mkdir p rpmbuild BUILD RPMS SOURCES SPECS SRPMS echo S topdir echo SHOME rpmbuild gt rpmmacros b Install the patched Lustre source code This RPM is available at the Lustre download page rpm ivh kernel lustre source 2 6 18 92 1 10 e15_lustre 1 6 6 x86_64 rpm c Build the Linux kernel RPM make make make make make make H VE UE VE Ur Ur Ur Ur nh make cd usr src linux 2 6 18 92 1 10 e15 lustre 1 6 6 distclean oldconfig dep bzImage modules cp boot config uname r config oldconfig make menuconfig include asm include linux version h SUBDIRS scripts rpm d Install the Linux kernel RPM If you are building a set of RPMs for a cluster installation this step is not necessary Source RPMs are only needed on the build machine rpm ivh rpmbuild kernel lustre 2 6 18 92 1 10 e15_lustre 1 6 6 x86_64 rpm mkinitrd boot 2 6 18 92 1 10 e15_lustre 1 6 6 e Update the boot loader etc grub conf with the new kernel boot information sbin shutdown 0 r Chapter 3 Installing Lustre 3 19 3 20 2 Compile and install the MX stack cd usr src gunzip mx_1 2 7 tar gz can be obtained
447. rtbeat
    8.1.4 Connection Handling During Failover
    8.1.5 Roles of Nodes in a Failover
    8.2 OST Failover
    8.3 MDS Failover
    8.4 Configuring Lustre for Failover
    8.4.1 Starting/Stopping a Resource
    8.4.2 Active/Active Failover Configuration
    8.4.3 Hardware Requirements for Failover
    8.5 Setting Up Failover with Heartbeat V1
    8.5.1 Installing the Software
    8.6 Using MMP
    8.7 Setting Up Failover with Heartbeat V2
    8.7.1 Installing the Software
    8.7.2 Configuring the Hardware
    8.7.3 Operation
    8.8 Considerations with Failover Software and Solutions
    9 Configuring Quotas
    9.1 Working with Quotas
    9.1.1 Enabling Disk Quotas
    9.1.2 Creating Quota Files and Quota Administration
    9.1.3 Quota Allocation
    9.1.4 Known Issues with Quotas
    9.1.5 Lustre Quota Statistics
    10 RAID
    10.1 Considerations for Backend Storage
    10.1.1 Selecting Storage for the MDS or OSTs
    10.1.2 Reliability Best Practices
    10.1.3 Understanding Double Failures with Hardware and Software RAID5
    10.1.4 Performance Tradeoffs
    10.1.5 Formatting Options for RAID Devices
    10.1.6 Handling Degraded RAID Arrays
    10.2 Insights into Disk Performance Measurement
    10.3 Lustre Software RAID Support
    11 Kerberos
    11.1 What is Kerberos
    11.2 Lustre Setup with Kerberos
448. run:
        mount -t lustre mdsnode:/mdsA/client /mnt/lustre
    To start an Elan client, run:
        mount -t lustre 2@elan0:/mdsA/client /mnt/lustre
    2.5.2 Stopping LNET
    Before the LNET modules can be removed, LNET references must be removed. In general, these references are removed automatically when Lustre is shut down, but for standalone routers, an explicit step is needed to stop LNET. Run:
        lctl network unconfigure
    Note: Attempting to remove Lustre modules prior to stopping the network may result in a crash or an LNET hang. If this occurs, the node must be rebooted (in most cases). Make sure that the Lustre network and Lustre are stopped prior to unloading the modules. Be extremely careful using rmmod -f.
    To unconfigure the LNET network, run:
        modprobe -r <any lnd and the lnet modules>
    Tip: To remove all Lustre modules, run:
        lctl modules | awk '{print $2}' | xargs rmmod
    Part II Lustre Administration
    Lustre administration includes the steps necessary to meet pre-installation requirements, and install and configure Lustre. It also includes advanced topics such as failover, quotas, bonding, benchmarking, Kerberos and POSIX.
    CHAPTER 3 Installing Lustre
    Lustre installation involves two procedures: meeting the installation prerequisites and installing the Lustre software, either from RPMs or from source c
449. ry
        Option                        Description
        --quiet                       Prints less information.
        --reformat                    Reformats an existing Lustre disk.
        --stripe-count-hint=stripes   Used to optimize the MDT's inode size.
        --verbose                     Prints more information.
    Examples
    Creates a combined MGS and MDT for file system testfs on node cfs21:
        mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda1
    Creates an OST for file system testfs on any node (using the above MGS):
        mkfs.lustre --fsname=testfs --ost --mgsnode=cfs21@tcp0 /dev/sdb
    Creates a standalone MGS on, e.g., node cfs22:
        mkfs.lustre --mgs /dev/sda1
    Creates an MDT for file system myfs1 on any node (using the above MGS):
        mkfs.lustre --fsname=myfs1 --mdt --mgsnode=cfs22@tcp0 /dev/sda2
    31.2 tunefs.lustre
    The tunefs.lustre utility modifies configuration information on a Lustre target disk.
    Synopsis
        tunefs.lustre [options] <device>
    Description
    tunefs.lustre is used to modify configuration information on a Lustre target disk. This includes upgrading old (pre-Lustre 1.8) disks. This does not reformat the disk or erase the target information, but modifying the configuration information can result in an unusable file system.
    Caution: Changes made here affect a file system when the target is mounted the next time.
    With tunefs.lustre, parameters are "additive": new parameters are specifie
450. s
    - All clients and all servers must get two rails of bandwidth.
        ip2nets="o2ib0(ib0),o2ib2(ib1) 192.168.[0-1].[0-252/2] #even servers; \
                 o2ib1(ib0),o2ib3(ib1) 192.168.[0-1].[1-253/2] #odd servers; \
                 o2ib0(ib0),o2ib3(ib1) 192.168.[2-253].[0-252/2] #even clients; \
                 o2ib1(ib0),o2ib2(ib1) 192.168.[2-253].[1-253/2] #odd clients"
    This configuration includes two additional proxy o2ib networks to work around Lustre's simplistic NID selection algorithm. It connects "even" clients to "even" servers with o2ib0 on rail0, and "odd" servers with o2ib3 on rail1. Similarly, it connects "odd" clients to "odd" servers with o2ib1 on rail0, and "even" servers with o2ib2 on rail1.
    CHAPTER 8 Failover
    This chapter describes failover in a Lustre system and includes the following sections:
    - What is Failover
    - OST Failover
    - MDS Failover
    - Configuring Lustre for Failover
    - Setting Up Failover with Heartbeat V1
    - Using MMP
    - Setting Up Failover with Heartbeat V2
    - Considerations with Failover Software and Solutions
    8.1 What is Failover
    A computer system is "highly available" when the services it provides are available with minimal downtime. In a highly available system, if a failure condition occurs, such as loss of a server or a network or software fault, the services provided remain unaffected. Generally, we measure availability by the percentage of time the system is requir
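The even/odd split in excerpt 450 relies on address patterns with a range and a stride in the last octet. Assuming the server patterns are written as 192.168.[0-1].[0-252/2] (even) and 192.168.[0-1].[1-253/2] (odd), and assuming a [low-high/stride] expression matches every stride-th value starting at low, the selection can be sketched as below. The helper names `octet_matches` and `ip_matches` are hypothetical; this is not LNET's parser and it handles only the simple forms shown.

```python
import re


def octet_matches(pattern, value):
    """Match one octet against 'N', '*', '[low-high]' or '[low-high/stride]'."""
    if pattern == "*":
        return True
    m = re.fullmatch(r"\[(\d+)-(\d+)(?:/(\d+))?\]", pattern)
    if m:
        low, high = int(m.group(1)), int(m.group(2))
        stride = int(m.group(3) or 1)
        # Assumption: the stride counts from the low end of the range.
        return low <= value <= high and (value - low) % stride == 0
    return int(pattern) == value


def ip_matches(pattern, ip):
    """Match a dotted IPv4 address against a dotted four-part pattern."""
    pats = pattern.split(".")
    octets = [int(o) for o in ip.split(".")]
    return (
        len(pats) == 4
        and len(octets) == 4
        and all(octet_matches(p, o) for p, o in zip(pats, octets))
    )


even_servers = "192.168.[0-1].[0-252/2]"
odd_servers = "192.168.[0-1].[1-253/2]"
```

Under these assumptions, 192.168.0.4 falls in the even-server group and 192.168.1.7 in the odd-server group, so each address is matched by exactly one of the two server rules.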
451. s from another node.
    Basic Configuration (With STONITH)
    STONITH automates the process of power control with the expect package. Expect scripts are very dependent on the exact set of commands provided by each hardware vendor, and as a result, any change made in the power control hardware/firmware requires tweaking STONITH. Much must be deduced by running the STONITH package by hand. STONITH has some supplied packages, but can also run with an external script. There are two STONITH modes:
    - Single STONITH command for all nodes found in ha.cf:
        stonith <type> <config file>
    - STONITH command per node:
        stonith_host <hostfrom> <stonith_type> <params...>
    You can use an external script to kill each node:
        stonith_host nodeA external foo /etc/ha.d/reset-nodeB
        stonith_host nodeB external foo /etc/ha.d/reset-nodeA
    Here, foo is a placeholder for an unused parameter.
    To get the proper syntax, run:
        stonith -L
    The above command lists supported models. To list required parameters and specify the config file name, run:
        stonith -l -t <model>
    To attempt a test, run:
        stonith -l -t <model> <fake host name>
    This command also gives data on what is required. To test, use a real hostname. The external STONITH scripts should take the parameters {start|stop|status} and return 0 or 1.
    STONITH only happens when the cluster cannot do things in an o
452. s Systems with multiple CPUs and a single NIC may see increase in the performance with this parameter disabled Determines the minimum message fragment that should be considered for zero copy sends Increasing it above the platform s PAGE_SIZE disables all zero copy sends This option is not available on all platforms Chapter 30 Configuration Files and Module Parameters man5 30 9 30 2 3 30 10 QSW LND The QSW LND qswind is connection less and therefore does not need the acceptor It is limited to a single instance which uses all Elan rails that are present and dynamically load balances over them The address with network is the node s Elan ID A specific interface cannot be selected in the networks module parameter Variable Description tx_maxcontig 1024 mtxmsgs 8 nnblk_txmsg 512 with a 4K page size 256 otherwise nrxmsg_small 256 ep_envelopes_small 2048 nrxmsg_large 64 ep_envelopes_large 256 optimized_puts 32768 W optimized_gets 1 W Integer that specifies the maximum message payload in bytes to copy into a pre mapped transmit buffer Number of normal message descriptors for locally initiated communications that may block for memory callers block when this pool is exhausted Number of reserved message descriptors for communications that may not block for memory This pool must be sized large enough so it is never exhausted Number of small
453. s and has command line switches to produce more graphable output.
    plot-llstat.sh
    The plot-llstat.sh utility plots the output from llstat.sh using gnuplot.
    More /proc Statistics for Application Profiling
    The following utilities provide additional statistics.
    vfs_ops_stats
    The client vfs_ops_stats utility tracks Linux VFS operation calls into Lustre for a single PID, PPID, GID or everything.
        /proc/fs/lustre/llite/*/vfs_ops_stats
        /proc/fs/lustre/llite/*/vfs_track_[pid|ppid|gid]
    extents_stats
    The client extents_stats utility shows the size distribution of I/O calls from the client (cumulative and by process).
        /proc/fs/lustre/llite/*/extents_stats, extents_stats_per_process
    offset_stats
    The client offset_stats utility shows the read/write seek activity of a client by offsets and ranges.
        /proc/fs/lustre/llite/*/offset_stats
    Lustre 1.6 included per-client and improved MDT statistics:
    - Per-client statistics tracked on the servers: Each MDT and OST now tracks LDLM and operations statistics for every connected client, for comparisons and simpler collection of distributed job statistics.
        /proc/fs/lustre/mds|obdfilter/*/exports/
    - Improved MDT statistics: More detailed MDT operations statistics are collected for better profiling.
        /proc/fs/lustre/mds/*/stats
    31.5.6 Testing / Debugging Utilities
    The following utilities are located in /usr/bin.
    loadgen
    The loadgen uti
454. s find -R -o {OST_UUID} /mountpoint
    This returns a simple list of filenames from the affected file system. It is possible to read the valid parts of a striped file, if necessary:
        dd if=filename of=new_filename bs=4k conv=sync,noerror
    Otherwise, it is possible to delete these files with unlink or munlink.
    If you need to know specifically which parts of the file are missing data, you first need to determine the file layout (striping pattern), which includes the index of the missing OST:
        lfs getstripe -v {filename}
    The following computation is used to determine which offsets in the file are affected:
        [(C*N + X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ... }
    where:
        C = stripe count
        S = stripe size
        X = index of bad OST for this file
    Example: for a file with 2 stripes, stripe size = 1M, and bad OST at index 0, you would have holes in your file at:
        [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ... }
    If the file system can't be mounted, there isn't anything currently that would parse metadata directly from an MDS. If the bad OST is definitely not starting, options for mounting the file system anyway are to provide a loop device OST in its place, or to replace it with a newly-formatted OST. In that case, the missing objects are created and will read as zero-filled.
    How-To: New Lustre network configuration
    Updating Lustre's network configuration during an upgrade to version 1.4.6
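The offset computation in excerpt 454 can be worked through in a few lines. The sketch below lists the byte ranges [(C*N + X)*S, (C*N + X)*S + S - 1] that fall inside a file of a given size. The function name `hole_ranges` and the 5 MiB file size are hypothetical choices for illustration; X is taken to be the position of the bad OST within this file's stripe layout, as stated in the excerpt.

```python
def hole_ranges(stripe_count, stripe_size, bad_index, file_size):
    """List (start, end) byte ranges of a file that live on the bad OST.

    Implements [(C*N + X)*S, (C*N + X)*S + S - 1] for N = 0, 1, 2, ...
    with C = stripe_count, S = stripe_size, X = bad_index, and clips
    the final range to the end of the file.
    """
    holes = []
    n = 0
    while True:
        start = (stripe_count * n + bad_index) * stripe_size
        if start >= file_size:
            break
        end = min(start + stripe_size - 1, file_size - 1)
        holes.append((start, end))
        n += 1
    return holes


MiB = 1 << 20
# The manual's example: 2 stripes, 1M stripe size, bad OST at index 0.
holes = hole_ranges(2, MiB, 0, 5 * MiB)
```

For the example layout, every other 1 MiB chunk starting at offset 0 is affected, which is what the formula with C = 2 and X = 0 predicts.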
455. s for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else.

Determines when a message is considered "bulk".

Socket buffer sizes. Setting this option to zero (0) allows the system to auto-tune buffer sizes. WARNING: Be very careful changing this value, as improper sizing can harm performance.

Boolean that determines if nagle should be enabled. It should never be set in production systems.

Variable / Description

keepalive_idle (30, Wc)
    Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives.
keepalive_intvl (2, Wc)
    Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives.
keepalive_count (10, Wc)
    Number of unanswered keepalive probes before pronouncing socket (hence peer) death.
zc_min_frag (2048, W)
enable_irq_affinity (0, Wc)
    Boolean that determines whether to enable IRQ affinity. The default is zero (0). When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system to exist and produces best performance with multiple NIC
456. s lustre ost OSS ost_io stats on 192 168 16 35 tcp snapshot_time 1181074093 276072 proc fs lustre ost OSS ost_io stats 1181074103 284895 Name Cur CountCur Rate EventsUnit last min avg max stddev req waittime8 0 8 usec 2078 34 259 75 868 317 49 req_qdepth 8 0 reqs 1 0 0 12 1 0435 req_active 8 0 8 reqs 11 1 1 38 2 0 52 reqbuf_avail8 0 8 bufs 511 63 63 88 64 0 35 ost_write 8 0 8 bytes 1697677 72914212209 6238757991874 29 proc fs lustre ost OSS ost_io stats 1181074113 290180 Name Cur CountCur Rate EventsUnit lastmin avg max stddev req_waittime31 3 39 usec 30011 34 822 79 12245 2047 71 req_qdepth 31 3 39 reqs 0 0 0 03 1 0 16 req_active 31 3 39 reqs 58 1 PFI 3 0 74 reqbuf_avail31 3 39 bufs 1977 63 63 79 64 0 41 ost_write 30 3 38 bytes 10284679 15019315325 16910694197776 51 proc fs lustre ost OSS ost_io stats 1181074123 325560 Name Cur CountCur Rate Events Unit last minavgmax stddev req waittime21 2 60 usec 14970 34784 32122451878 66 req_qdepth 21 60 reqs 0 0 0 02 1 04 13 req active 21 60 reqs 33 1 1 70 3 0 70 60 bufs 1341 6363 82 64 0 39 59 bytes 7648424 15019332725 08910694 reqbuf_avail21 ost_write 21 180397 87 NNNN 22 34 Lustre 1 8 Operations Manual October 2009 Where Parameter Description Cur Count Number of events of each type sent in the last interval in this example 10s Cur Rate Number of events per second in the last interval Events Total number o
457. s or do I have to set them for each user/group individually?
In that case, the default limit will be 0, which means no limit.

What happens if a user/group already has more files or disk usage than their quota allows?
Given that the limit will be 0 initially, no users will be over quota. To preempt the next question: if a user has a limit set that is less than their existing usage, they will simply start to get -EDQUOT errors on subsequent attempts to write data.

We only want group quotas; do we have to enable user quotas as well?
We do not know of any particular failure if only group quotas are enabled, but the more your use cases match our testing, the better off you will be. For user quotas, even if you do not want to enforce limits, you can enable quotas but not set any limits. Doing this makes enabling limits on users easier when (or if) you decide to, because usage will already be tracked and accounted for, saving you the need to do that initial accounting. It also provides you with a means to quickly assess how much space is being consumed on a user-by-user basis.

When mounting an MDT file system, the kernel crashes. What do I do?
On Lustre versions prior to 1.6.5, use this procedure:
1. Try to mount the file system with -o abort_recovery as an option.
2. If this does not work, try to mount the file system as -t ldiskfs (mount -t ldiskfs).
3. If that works, try to truncate the last_rcvd file m
458. s quotaon -ug /mnt/lustre

To turn off user and group quotas, run:
    lfs quotaoff -ug /mnt/lustre

To display general quota information (disk usage and limits) for the user running the command and his primary group, run:
    lfs quota /mnt/lustre

To display general quota information for a specific user ("bob" in this example), run:
    lfs quota -u bob /mnt/lustre

To display general quota information for a specific user ("bob" in this example) and detailed quota statistics for each MDT and OST, run:
    lfs quota -u bob -v /mnt/lustre

To display general quota information for the group to which a specific user ("bob" in this example) belongs, run:
    lfs quota -g bob /mnt/lustre

To display general quota information for a specific group ("eng" in this example), run:
    lfs quota -g eng /mnt/lustre

To display block and inode grace times for user quotas, run:
    lfs quota -t -u /mnt/lustre

To set user and group quotas for a specific user ("bob" in this example), run:
    lfs setquota -u bob 307200 309200 10000 11000 /mnt/lustre

In this example, the block soft limit for user "bob" is set to 300 MB (307200 * 1024 bytes) and the block hard limit to 309200 KB. The inode soft limit is 10,000 files and the inode hard limit is 11,000 files.

Note: For the Lustre commands lfs setquota and lfs quota, the qunit for block is KB (1024) and the qunit for inode is 1.

The quota command displays the quota allocated and consumed for each Lustre device
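The unit arithmetic behind the `lfs setquota` example in entry 458 can be checked with a short Python sketch. The helper name `describe_setquota` is hypothetical, and the argument order (block soft, block hard, inode soft, inode hard) is the positional order assumed for the command line quoted in the entry. Block limits are taken to be in KB (1024 bytes) and inode limits in single files, as the note in the entry states.

```python
KB = 1024  # block qunit in bytes, per the note in the entry


def describe_setquota(block_soft_kb, block_hard_kb, inode_soft, inode_hard):
    """Translate positional setquota limits into bytes, MB and file counts."""
    return {
        "block_soft_bytes": block_soft_kb * KB,
        "block_hard_bytes": block_hard_kb * KB,
        "block_soft_mb": block_soft_kb / 1024,
        "block_hard_mb": block_hard_kb / 1024,
        "inode_soft_files": inode_soft,
        "inode_hard_files": inode_hard,
    }


# Values from: lfs setquota -u bob 307200 309200 10000 11000 /mnt/lustre
limits = describe_setquota(307200, 309200, 10000, 11000)
for name, value in limits.items():
    print(name, value)
```

Run this way, the soft block limit comes out at exactly 300 MB, which is why 307200 (not 309200) is the figure that corresponds to "300 MB".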
459. s to be combined into the MD device Multiple devices are separated by space or by using shell extensions for example dev sd a b c Chapter 6 Configuring Lustre Examples 6 5 6 6 Linux LVM PV Physical Volume The CSV line format is hostname PV pv names operation mode options Where Variable Supported Type hostname Hostname of the node in the cluster PV Marker of the PV line pv names Devices or loopback files to be initialized for later use by LVM or to operation mode options wipe the label for example dev sda Multiple devices or files are separated by space or by using shell expansions for example dev sd a b c Operations mode either create or remove Default is create A catchall for other pvcreate pvremove options for example vv Linux LVM VG Volume Group The CSV line format is hostname VG vg name operation mode options pv paths Where Variable Supported Type hostname Hostname of the node in the cluster VG Marker of the VG line vg name Name of the volume group for example ost_vg operation mode options pv paths Operations mode either create or remove Default is create A catchall for other vgcreate rgremove options for example s 32M Physical volumes to construct this VG required by the create mode multiple PVs are separated by space or by using shell expansions for example dev sd k m 1 Lustre 1 8 Operations Manu
460. s with more extended interface than block oriented devices such as disks Lustre uses this name to describe to a software module that implements an object storage API in the kernel Lustre also uses this name to refer to an instance of an object storage device created by that driver The OSD device is layered on a file system with methods that mimic create destroy and I O operations on file inodes Object Storage Server A server OBD that provides access to local OSTs Object Storage Target An OSD made accessible through a network protocol Typically an OST is associated with a unique OSD which in turn is associated with a formatted disk file system on the server containing the storage objects A locking protocol introduced in the VFS by CFS to allow for concurrent operations on a single directory inode Glossary 7 pool Portal PTLRPC R Remote user handling Reply Re sent request Revocation Callback Rollback Root squash routing RPC A group of OSTs can be combined into a pool with unique access permissions and stripe characteristics Each OST is a member of only one pool while an MDT can serve files from multiple pools A client accesses one pool on the the file system the MDT stores files from for that client only on that pool s OSTs A concept used by LNET LNET messages are sent to a portal on a NID Portals can receive packets when a memory descriptor is attached to the portal Portals are implem
461. sda
- Shared storage can be arranged in an active/passive (MDS, OSS) or active/active (OSS only) configuration. Each shared resource will have a primary (default) node. The secondary node is assumed.
- The two nodes must have one or more communication paths for heartbeat traffic. A communication path can be:
    - Dedicated Ethernet
    - Serial live (serial crossover cable)
  Failure of all heartbeat communication is not good. This condition is called "split-brain", and the heartbeat software will resolve this situation by powering down one node.
- The two nodes must have a method to control each other's state. The Remote Power Control hardware is the best. There must be a script to start and stop a given node from the other node. STONITH provides soft power control methods (ssh, meatware), but these cannot be used in a production situation.
- Heartbeat provides a remote ping service that is used to monitor the health of the external network. If you wish to use the ipfail service, you must have a very reliable external address to use as the ping target.

8.7.2.2 Configuring Lustre
Configuring Lustre for Heartbeat V2 is identical to the V1 case.

8.7.2.3 Configuring Heartbeat
For details on all configuration options, refer to the Linux HA website: http://linux-ha.org/ha.cf

As mentioned earlier, you can run Heartbeat V2 using the V1 configuration. To convert from the V1 configuration to V2, use the ha
462. se fields Chapter 8 Failover 8 21 8 7 3 8 7 3 1 8 22 Basic Configuration Adding STONITH As per Basic configuration No STONITH or monitor the best way to do this is to add the STONITH options to ha cf and run the conversion script For more information see http linux ha org ExternalStonithPlugins Operation In normal operation Lustre should be controlled by the Heartbeat software Start Heartbeat at the boot time It starts Lustre after the initial dead time Initial startup 1 Stop the Heartbeat software if running If this is a new Lustre file system mkfs lustre fsname spfs ost failnode oss162 mgsnode mds16 tcp0 dev sdb one 2 mount t lustre dev sdb mnt spfs ost 3 etc init d heartbeat start on one node 4 tail f var log ha log to see progress 5 After initdead this node should start all Lustre objects 6 etc init d heartbeat start on second node 7 After heartbeat is up on both the nodes failback the resources to the second node On the second node run usr lib heartbeart hb_takeover local You should see the resources stop on the first node and start up on the second node Lustre 1 8 Operations Manual October 2009 8 7 3 2 8 7 3 3 Testing 1 Pull power from one node 2 Pull networking from one node 3 After Mon is setup pull the connection between the OST and the backend storage Failback Normally do the failback manually after determi
463. ses Access control list ACL Currently the Lustre security model follows a UNIX file system enhanced with POSIX ACLs Noteworthy additional features include root squash and connecting from privileged ports only Quotas User and group quotas are available for Lustre OSS addition The capacity of a Lustre file system and aggregate cluster bandwidth can be increased without interrupting any operations by adding a new OSS with OSTs to the cluster Controlled striping The default stripe count and stripe size can be controlled in various ways The file system has a default setting that is determined at format time Directories can be given an attribute so that all files under that directory and recursively under any sub directory have a striping pattern determined by the attribute Finally utilities and application libraries are provided to control the striping of an individual file at creation time Chapter 1 Introduction to Lustre 1 3 Snapshots Lustre file servers use volumes attached to the server nodes The Lustre software includes a utility using LVM snapshot technology to create a snapshot of all volumes and group snapshots together in a snapshot file system that can be mounted with Lustre Backup tools Lustre 1 6 includes two utilities supporting backups One tool scans file systems and locates files modified since a certain timeframe This utility makes modified files pathnames available so they can be processed in pa
464. shes. The recovery is complete when either all previously connected clients reconnect and their transactions are replayed, or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost.

Note: If you know an OST will not recover a previously connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:
    lctl --device <OST device number> abort_recovery
To determine an OST's device number and device name, run the lctl dl command. Sample lctl dl command output is shown below:
    7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159
In this example, 7 is the OST device number. The device name is ddn_data-OST0009. In most instances, the device name can be used in place of the device number.

19.2.4 Network Partition
The partition can be transient. Lustre recovery occurs in the following sequence:
- Clients can detect "harmless partition" upon reconnecting. Dropped-reply cases require ReplyReconstruction.
- Servers evict clients.
- ClientUpcall may try other routers. The arbitrary configuration change is possible. The message "Failed Recovery" (ENOTCONN) is given for evicted clients.
- Process invalidates all entries and locks.

Eventually, the file system finishes recovering and returns to normal operation. You may check the progress of
465. should not normally be enabled.

30.2.6 OpenIB LND
The OpenIB LND is connection-based and uses the acceptor to establish reliable queue pairs over InfiniBand with its peers. It is limited to a single instance that uses only IB device 0. The address within network is determined by the address of the single IP interface that may be specified by the "networks" module parameter. If this is omitted, the first non-loopback IP interface that is up is used instead. It uses the acceptor to establish connections with its peers.

Variable / Description

n_connd (4)
    Sets the number of connection daemons. The default value is 4.
min_reconnect_interval (1, W)
    Minimum connection retry interval (in seconds). After a failed connection attempt, this sets the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry, up to a maximum of max_reconnect_interval.
max_reconnect_interval (60, W)
    Maximum connection retry interval (in seconds).
timeout (50, W)
    Time (in seconds) that communications may be stalled before the LND completes them with failure.
ntx (64)
    Number of "normal" message descriptors for locally initiated communications that may block for memory (callers block when this pool is exhausted).
concurrent_peers (1024)
cksum (0, W)
ntx_nblk (256)
    Number of reserved
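The retry schedule described for min_reconnect_interval and max_reconnect_interval in entry 465 is a capped doubling backoff. A minimal Python sketch follows, using the defaults listed in the entry (1 and 60 seconds). The function name `reconnect_intervals` and the `attempts` parameter are illustrative assumptions; this models only the timing rule in the description, not the LND's actual code.

```python
def reconnect_intervals(min_interval=1, max_interval=60, attempts=8):
    """Return the wait (in seconds) before each successive retry.

    The first retry waits min_interval; each later retry doubles the
    previous wait, never exceeding max_interval.
    """
    waits = []
    interval = min_interval
    for _ in range(attempts):
        waits.append(interval)
        interval = min(interval * 2, max_interval)
    return waits


schedule = reconnect_intervals()
print(schedule)
```

With the defaults, the wait reaches the 60-second ceiling on the seventh retry and stays there, so a peer that remains down is retried about once a minute.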
466. sign here the IP of the bonded interface ONBOOT yes USERCTL no ifcfg ethx cat etc sysconfig network scripts ifcfg eth0 TYPE Ethernet DEVICE eth0 HWADDR 4c 00 10 ac 61 e0 BOOTPROTO none ONBOOT yes USERCTL no I P PV6INIT no EERDNS yes MASTER bond0 SLAVE yes Chapter12 Bonding 12 9 12 10 In the following example the bondO interface is the master MASTER while eth0 and eth1 are slaves SLAVE Note All slaves of bond0 have the same MAC address Hwaddr bond All modes except TLB and ALB have this MAC address TLB and ALB require a unique MAC address for each slave sbin ifconfig bondOLink encap EthernetHwaddr 00 C0 F0 1F 37 B4 inet addr XXX XXX XXX YYY Bcast XXX XXX XXX 255 Mask 255 UP BROADCAST RUNNING MASTER MULTICAST MTU 1500 Metric 1 RX packets 7224794 errors 0 dropped 0 overruns 0 frame 0 TX packets 3286647 errors 1 dropped 0 overruns 1 carrier collisions 0 txqueuelen 0 ethOLink encap EthernetHwaddr 00 C0 F0 1F 37 B4 inet addr XXX XXX XXX YYY Bcast XXX XXX XXX 255 Mask 255 UP BROADCAST RUNNING SLAVE MULTICAST MTU 1500 Metric 1 RX packets 3573025 errors 0 dropped 0 overruns 0 frame 0 TX packets 1643167 errors 1 dropped 0 overruns 1 carrier collisions 0 txqueuelen 100 Interrupt 10 Base address 0x1080 ethlLink encap EthernetHwaddr 00 C0 F0 1F 37 B4 inet addr XXX XXX XXX YYY Bcast XXX XXX XXX 255 Mask 255 UP BROADCAST RUNNING SLAVE MULTICAST MT
467. sing but the OST has unreferenced objects orphan object Normally this happens if there was a problem with the MDS a Multiple inodes reference the same objects This happens if there was corruption on the MDS or if the MDS storage is cached and loses some but not all writes If the file system is busy 1fsck may report inconsistencies where none exist because of files and objects being created removed after the database files were collected Examined the results closely you probably want to contact Lustre Support for guidance The easiest problem to resolve is orphaned objects Use the 1 option to 1fsck so it links these objects to new files and puts them into lost found in the Lustre file system where they can be examined and saved or deleted as necessary If you are certain that the objects are not necessary 1fsck can run with the d option to delete orphaned objects and free up any space they are using 27 18 Lustre 1 8 Operations Manual October 2009 To fix dangling inodes 1fsck creates new zero length objects on the OSTs if the c option is given These files read back with binary zeros for the stripes that had objects recreated Such files can also be read even without 1fsck repair by using this command run dd if lustre bad file of new file bs 4k conv sync noerror Because it is rarely useful to have files with large holes in them most users delete these files after reading them if useful and or restoring them from
468. sk RAID 6 array, there are 8 active disks. The chunk size must be chosen such that <chunk_size> <= 1024KB/8. Therefore, the largest valid chunk size is 128KB.

[4] These enhancements have mostly improved write performance.

- Create a RAID array for an OST. On the OSS, run:
    mdadm --create <array_device> -c <chunk_size> -l <raid_level> -n <active_disks> -x <spare_disks> <block_devices>
where:
    <array_device>    RAID array to create, in the form of /dev/mdX
    <chunk_size>      Size of each stripe piece on the array's disks (in KB); discussed above.
    <raid_level>      Architecture of the RAID array. RAID 5 and RAID 6 are commonly used for OSTs.
    <active_disks>    Number of active disks in the array, including parity disks.
    <spare_disks>     Number of spare disks initially assigned to the array. More disks may be brought in via spare pooling (see below).
    <block_devices>   List of the block devices used for the RAID array; wildcards may be used.

For the worked example, the command is:
    mdadm --create /dev/md10 -c 128 -l 6 -n 10 -x 0 /dev/dsk/c0t0d[01234] /dev/dsk/c1t0d[01234]
This command output displays:
    mdadm: array /dev/md10 started.

We also want an external journal on a RAID 1 device. We create this from two 400MB partitions on separate disks: /dev/dsk/c9t0d20p1 and /dev/dsk/c1t0d20p1
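The chunk-size bound in entry 468 (<chunk_size> <= 1024KB divided by the number of data disks) can be computed with a small Python sketch. The function name `max_chunk_kb` and its parameters are hypothetical. The sketch assumes that the "8 active disks" in the worked example means the disks holding data, that is, the 10 disks passed to `mdadm -n` minus the 2 parity disks of RAID 6.

```python
def max_chunk_kb(total_disks, parity_disks, io_size_kb=1024):
    """Upper bound on the RAID chunk size (KB) so one full stripe fits in io_size_kb.

    total_disks  -- disks given to mdadm with -n (parity included)
    parity_disks -- 1 for RAID 5, 2 for RAID 6
    """
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("array needs at least one data disk")
    return io_size_kb // data_disks


# Worked example from the text: 10-disk RAID 6 -> 8 data disks.
bound = max_chunk_kb(total_disks=10, parity_disks=2)
print(bound)
```

The result is only an upper bound. When the division is not exact (for instance 6 data disks), pick a chunk size that mdadm accepts at or below the bound rather than the bound itself.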
469. sname spfs mdt mgs dev sda mkdir p mnt test mdt mount t lustre dev sda mnt test mdt OR For the MGS on the separate node with TCP network run mkfs lustre mgs dev sda mkdir p mnt mgs mount t lustre dev sda mnt mgs For starting the MDT on node mds16 with MGS on node mgs16 run mkfs lustre fsname spfs mdt mgsnode mgsl6 tcp0 dev sda mkdir p mnt test mdt mount t lustre dev sda2 mnt test mdt For starting the OST on TCP based network run mkfs lustre fsname spfs ost mgsnode mgs16 tcp0 dev sdas mkdir p mnt test ost0 mount t lustre dev sda mnt test ost0 Chapter 7 More Complicated Configurations 7 3 14129 7 4 Start Clients TCP clients can use the host name or IP address of the MDS run mount t lustre megan tcp0 mdsA client mnt lustre Use this command to start the Elan clients run mount t lustre 2 elan0 mdsA client mnt lustre Note If the MGS node has multiple interfaces for instance cfs21 and 1 elan only the client mount command has to change The MGS NID specifier must be an appropriate nettype for the client for example a TCP client could use uml1 tcp0 and an Elan client could use 1 elan Alternatively a list of all MGS NIDs can be given and the client chooses the correctd one For example mount t lustre mgs16 tcp0 1 elan testfs mnt testfs Lustre 1 8 Operations Manual October 2009 1 2 7 2 1 7 2 2 75259
470. so additional CPU overhead because the client cannot receive data without copying it from the network buffers In the write case the client CAN send data without the additional data copy This means that the client is more likely to become CPU bound during reads than writes Lustre 1 8 Operations Manual October 2009 21 4 3 OST Object is Missing or Damaged If the OSS fails to find an object or finds a damaged object this message appears OST object missing or damaged OST ost1l object 98148 error 2 If the reported error is 2 ENOENT or No such file or directory then the object is missing This can occur either because the MDS and OST are out of sync or because an OST object was corrupted and deleted If you have recovered the file system from a disk failure by using e2fsck then unrecoverable objects may have been deleted or moved to lost found on the raw OST partition Because files on the MDS still reference these objects attempts to access them produce this error If you have recovered a backup of the raw MDS or OST partition then the restored partition is very likely to be out of sync with the rest of your cluster No matter which server partition you restored from backup files on the MDS may reference objects which no longer exist or did not exist when the backup was taken accessing those files produces this error If neither of those descriptions is applicable to your situation then it is possible that you
471. ssible to recover file system data after e2fsck is run on the new device TARGET FILE SYSTEM LEVEL BACKUPS In other cases it is desirable to make a backup of just the file data in an MDS or OST file system instead of backing up the entire device e g if the device is very large but has little data in it if the configuration of the parameters of the ext3 file system need to be changed to use less space for the backup etc In this case it is possible to mount the ext3 file system directly from the storage device and do a file level backup Lustre MUST BE STOPPED ON THAT NODE To back up such a file system properly also requires that any extended attributes EAs stored in the file system be backed up but unfortunately current backup tools do not properly save this data so an extra step is required Appendix A Lustre Knowledge Base A 13 1 Make a mountpoint for the mkdir mnt mds file system 2 Mount the file system there m For 2 4 kernels run mount t ext3 dev mnt mds m For 2 6 kernels run mount t ldiskfs dev mnt mds 3 Change to the mount point being backed up Type cd mnt mds 4 Back up the EAs Type getfattr R d m P gt ea bak The getfattr command is part of the attr package in most distributions If the getfattr command returns errors like Operation not supported then your kernel does not support EAs correctly STOP and use a different backup method or contact us for assistance 5
472. st of available commands, type help at the lctl prompt. To get basic help on command meaning and syntax, type help command. For non-interactive use, use the second invocation, which runs the command after connecting to the device.

Setting Parameters with lctl
Lustre parameters are not always accessible using the procfs interface, as it is platform-specific. As a solution, lctl {get,set}_param has been introduced as a platform-independent interface to the Lustre tunables. Avoid direct references to /proc/{fs,sys}/{lustre,lnet}. For future portability, use lctl {get,set}_param.

When the file system is running, temporary parameters can be set using the lctl set_param command. These parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax:
    lctl set_param [-n] <obdtype>.<obdname>.<proc_file_name>=<value>
For example:
    lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))

Many permanent parameters can be set with the lctl conf_param command. In general, the lctl conf_param command can be used to specify any parameter settable in a /proc/fs/lustre file, with its own OBD device. The lctl conf_param command uses this syntax:
    <obd|fsname>.<obdtype>.<proc_file_name>=<value>
For example:
    lctl conf_param testfs-MDT0000.mdt.group_upcall=NONE
    lctl conf_param testfs.llite.max_read_ahead_mb=16
473. t <filesystem>
    Sets file system quotas for users or groups. Limits can be specified with --{block|inode}-{softlimit|hardlimit} or their short equivalents -b, -B, -i, -I. Users can set 1, 2, 3 or 4 limits. (†) Also, limits can be specified with special suffixes b, k, m, g, t, and p to indicate units of 1, 2^10, 2^20, 2^30, 2^40 and 2^50, respectively. By default, the block limits unit is 1 kilobyte (1,024), and block limits are always kilobyte-grained (even if specified in bytes). See Examples.

setquota -t [-u|-g] [--block-grace <block-grace>] [--inode-grace <inode-grace>] <filesystem>
    Sets file system quota grace times for users or groups. Grace time is specified in "XXwXXdXXhXXmXXs" format or as an integer seconds value. See Examples.

help
    Provides brief help on various lfs arguments.

exit/quit
    Quits the interactive lfs session.

Notes:
- The default stripe-size is 0. The default stripe-start is -1. Do NOT confuse them! If you set stripe-start to 0, all new file creations occur on OST 0 (seldom a good idea).
- The file cannot exist prior to using setstripe. A directory must exist prior to using setstripe.
- (†) The old setquota interface is supported, but it may be removed in a future Lustre release.

Examples
    lfs setstripe -s 128k -c 2 /mnt/lustre/file1
Creates a file striped on two OSTs with 128 KB on each stripe.
    lfs setstripe -d /mnt
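The suffix rules for `setquota` block limits in entry 473 can be illustrated with a small Python parser. This is a sketch of the unit arithmetic only: the function name `block_limit_kb` is hypothetical, and rounding byte values down to whole kilobytes is an assumption (the entry says only that block limits are kilobyte-grained, not which way they round).

```python
# Multipliers for the suffixes listed in the entry: 1, 2^10, ..., 2^50.
SUFFIX_BYTES = {
    "b": 1,
    "k": 2**10,
    "m": 2**20,
    "g": 2**30,
    "t": 2**40,
    "p": 2**50,
}


def block_limit_kb(text):
    """Convert a block limit such as '300m' or '307200' to kilobytes.

    A bare number uses the default unit of 1 kilobyte. Results are
    floored to whole kilobytes (an assumption of this sketch).
    """
    text = text.strip().lower()
    if not text:
        raise ValueError("empty limit")
    if text[-1] in SUFFIX_BYTES:
        number, unit = text[:-1], SUFFIX_BYTES[text[-1]]
    else:
        number, unit = text, 1024
    return int(number) * unit // 1024


for spec in ("307200", "300m", "1g", "2048b"):
    print(spec, block_limit_kb(spec))
```

Under these rules "300m" and a bare "307200" describe the same limit, which is a convenient way to cross-check values typed in the old positional style.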
474. t In this example, failout mode is specified for the OSTs on MGS uml1, file system testfs.
    mkfs.lustre --fsname=testfs --ost --mgsnode=uml1 --param="failover.mode=failout" /dev/sdb

Caution: Before running this command, unmount all OSTs that will be affected by the change in the failover/failout mode.

Note: After initial file system configuration, use the tunefs.lustre utility to change the failover/failout mode. For example, to set the failout mode, run:
    tunefs.lustre --param failover.mode=failout <OST partition>

4.3.8 Running Multiple Lustre File Systems
There may be situations in which you want to run multiple file systems. This is possible as long as you follow specific naming conventions.

By default, the mkfs.lustre command creates a file system named lustre. To specify a different file system name (limited to 8 characters), run:
    mkfs.lustre --fsname=<new file system name>

Note: The MDT, OSTs and clients in the new file system must share the same name (prepended to the device name). For example, for a new file system named foo, the MDT and two OSTs would be named foo-MDT0000, foo-OST0000, and foo-OST0001.

To mount a client on the file system, run:
    mount -t lustre mgsnode:/<new fsname> <mountpoint>
For example, to mount a client on file system foo at mount point /mnt/lustre1, run:
    mount -t lustre mgsnode:/foo /mnt
475. t 24 m For OST file systems use mke2fs j J size 400 I 256 i 16384 dev Enable ext3 file system directory indexing Type tune2fs O dir_index dev Mount the file system Type m For 2 4 kernels run mount t ext3 dev mnt mds m For 2 6 kernels run mount t ldiskfs dev mnt mds Change to the new file system mount point Type cd mnt mds Restore the file system backup Type tar xzvpf backup file Restore the file system EAs Type setfattr restore ea bak Remove the now invalid recovery logs Type rm OBJECTS CATALOGS Again the restore of the EAs described in Step 6 is not currently required for OST devices but this may change in the future If the file system was used between the time the backup was made and when it was restored then the Ifsck tool part of Lustre e2fsprogs can be run to ensure the file system is coherent If all of the device file systems were backed up at the same time after the whole Lustre file system was stopped this is not necessary The file system should be immediately usable even if lfsck is not run though there will be IO errors reading from files that are present on the MDS but not the OSTs and files that were created after the MDS backup will not be accessible visible Appendix A Lustre Knowledge Base A 15 A 16 How do I control multiple services on one node independently You can do this by assigning an OST or MDS to a specific group often with a name
476. t New and old requests are in sleep until:
- The reply arrives (in case of re-activation of the connection and during the re-send request asynchronously).
- The application gets a signal (such as TERM or KILL).
- The server evicts the client, which gives an I/O error (EIO) for these requests, or the connection becomes "failed".

A timeout is effectively infinite, and Lustre waits as long as it needs to avoid giving the application an EIO.

A client process waits until the OST is back alive, unless either the process is killed (which should be possible after the Lustre recovery timeout is exceeded, 100s by default) or the OST is explicitly marked "inactive" on the clients.

Note: If an OST becomes unavailable and you want clients to return EIO if they access files located on the OST, then deactivate the OSC on the client:
    lctl --device <failed OSC device on the client> deactivate
After the OSC is marked inactive, all I/O to this OST should immediately return with EIO, and not hang.

Note: Under heavy load, clients may have to wait a long time for requests sent to the server to complete (100s of seconds in some cases). It is difficult for clients to distinguish between heavy server load (common) and server death (unlikely). In the case where a server dies and fails over, the clients have to wait for their requests to time out, then they re-send and wait again in
477. t S_NOT_DC input I_NOT_DC cause C_HA_ MESSAGE origin do_cl_join_finalize_respond Aug 9 09 50 47 oss161 crmd 4733 info populate_cib_nodes Requesting the list of configured nodes Aug 9 09 50 48 oss161 crmd 4733 notice populate_cib_nodes Node oss162 clusterfs com uuid 00e8c292 2a28 4492 bcfc b2625ab1c61 Sep 7 10 42 40 d1_q 0 heartbeat info Running etc ha d resource d ost1 start 8 12 Lustre 1 8 Operations Manual October 2009 In this example ost1 is the shared resource Common things to watch out for If you configure two nodes as primary for one resource then you will see both nodes attempt to start it This is very bad Shut down immediately and correct your HA resources files a If the commutation between nodes is not correct both nodes may also attempt to mount the same resource or will attempt to STONITH each other There should be many error messages in syslog indicating a communication fault a When in doubt you can set a Heartbeat debug level in ha cf levels above 5 produce huge volumes of data c Try some manual failover failback Heartbeat provides two tools for this purpose by default they are installed in usr lib heartbeat a hb standby local foreign Causes a node to yield resources to another node if a resource is running on its primary node it is local otherwise it is foreign a hb_takeover local foreign Causes a node to grab resource
478. t write inode/block any more. The hard limit is the absolute limit. When a grace period is set, you can exceed the soft limit within the grace period if you are under the hard limits.

Lustre quota allocation is controlled by two values, quota_bunit_sz and quota_iunit_sz, referring to KBs and inodes respectively. These values can be accessed on the MDS as /proc/fs/lustre/mds/*/quota_* and on the OST as /proc/fs/lustre/obdfilter/*/quota_*. The /proc values are bounded by two other variables, quota_btune_sz and quota_itune_sz. By default, the *tune_sz variables are set at 1/2 the *unit_sz variables, and you cannot set *tune_sz larger than *unit_sz. You must set bunit_sz first if it is increasing by more than 2x, and btune_sz first if it is decreasing by more than 2x.

Total number of inodes: To determine the total number of inodes, use lfs df -i (and also /proc/fs/lustre/*/*/filestotal). For more information on using the lfs df -i command and the command output, see Querying File System Space.

Unfortunately, the statfs interface does not report the free inode count directly, but instead reports the total inode and used inode counts. The free inode count is calculated for df from (total inodes - used inodes).

It is not critical to know a file system's total inode count. Instead, you should know (accurately) the free inode count and the used inode count for a file system. Lustre manipulates the total inode count in order to accurately report the other two
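The arithmetic in entry 478 can be sketched in Python: the free-inode calculation, the default relationship between the *tune_sz and *unit_sz values, and the stated ordering rule for changing bunit_sz and btune_sz. All function names are hypothetical, and the unit size used in the usage lines is a made-up illustrative value, not a Lustre default. For changes of 2x or less the entry states no ordering requirement, so the sketch returns None in that case.

```python
def free_inodes(total, used):
    """Free inode count as df derives it: total inodes - used inodes."""
    return total - used


def default_tune_sz(unit_sz):
    """By default, a *tune_sz value is half of the matching *unit_sz value."""
    return unit_sz // 2


def tune_is_allowed(tune_sz, unit_sz):
    """*tune_sz may not be set larger than *unit_sz."""
    return tune_sz <= unit_sz


def block_update_order(old_bunit, new_bunit):
    """Which block value to write first, per the rule quoted in the entry."""
    if new_bunit > 2 * old_bunit:
        return ("quota_bunit_sz", "quota_btune_sz")
    if new_bunit * 2 < old_bunit:
        return ("quota_btune_sz", "quota_bunit_sz")
    return None  # no ordering requirement is stated for smaller changes


# Inode counts as reported for the MDT in the lfs df -i sample in entry 480.
print(free_inodes(total=2211572, used=41924))

# Hypothetical unit size, for illustration only.
old_unit = 100000
print(default_tune_sz(old_unit))
print(block_update_order(old_unit, 4 * old_unit))
print(block_update_order(old_unit, old_unit // 4))
```

The two ordering cases exist because writing the values in the wrong order would briefly leave *tune_sz larger than *unit_sz, which the entry says is not permitted.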
479. t lustre. The mount.lustre utility starts a Lustre client or target service. Synopsis: mount -t lustre [-o options] directory. Description: The mount.lustre utility starts a Lustre client or target service. This program should not be called directly; rather, it is a helper program invoked through mount(8), as shown above. Use the umount(8) command to stop Lustre clients and targets. There are two forms for the device option, depending on whether a client or a target service is started.
Option: <mgsspec>:/<fsname>. Description: This mounts the Lustre file system, <fsname>, by contacting the Management Service at <mgsspec> on the pathname given by <directory>. The format for <mgsspec> is defined below. A mounted file system appears in fstab(5) and is usable like any local file system, providing a full POSIX-compliant interface.
Option: <disk_device>. Description: This starts the target service defined by the mkfs.lustre command on the physical disk <disk_device>. A mounted target service file system is only useful for df(1) operations and appears in fstab(5) to show the device is in use.
Options:
Option: <mgsspec>:=<mgsnode>[:<mgsnode>]. Description: The MGS specification may be a colon-separated list of nodes.
Option: <mgsnode>:=<mgsnid>[,<mgsnid>]. Description: Each node may be specified by a comma-separated list of
480. lfs df
UUID                1K-blocks   Used       Available  Use%  Mounted on
0_UUID              9174328     1020024    8154304    11%   /mnt/lustre[MDT:0]
ost-lustre-0_UUID   94181368    56330708   37850660   59%   /mnt/lustre[OST:0]
ost-lustre-1_UUID   94181368    56385748   37795620   59%   /mnt/lustre[OST:1]
ost-lustre-2_UUID   94181368    54352012   39829356   57%   /mnt/lustre[OST:2]
filesystem summary:                        39829356   57%   /mnt/lustre
lfs df -h
UUID                bytes    Used     Available  Use%  Mounted on
0_UUID              8.7G     996.1M   7.8G       11%   /mnt/lustre[MDT:0]
ost-lustre-0_UUID   89.8G    53.7G    36.1G      59%   /mnt/lustre[OST:0]
ost-lustre-1_UUID   89.8G    53.8G    36.0G      59%   /mnt/lustre[OST:1]
ost-lustre-2_UUID   89.8G    51.8G    38.0G      57%   /mnt/lustre[OST:2]
filesystem summary: 269.5G   159.3G   110.1G     59%   /mnt/lustre
lfs df -i
UUID                Inodes    IUsed   IFree     IUse%  Mounted on
0_UUID              2211572   41924   2169648   1%     /mnt/lustre[MDT:0]
ost-lustre-0_UUID   737280    12183   725097    1%     /mnt/lustre[OST:0]
ost-lustre-1_UUID   737280    12232   725048    1%     /mnt/lustre[OST:1]
ost-lustre-2_UUID   737280    12214   725066    1%     /mnt/lustre[OST:2]
filesystem summary: 2211572   41924   2169648   1%     /mnt/lustre[OST:2]
Using Stripe Allocations: There are two stripe allocation methods, round-robin and weighted. The allocation method is determined by the amount of free-space imbalance on the OSTs. The weighted allocator is used when any two OSTs are imbalanced by more than 20%. Until then, a faster round-robin allocator is used. (The round-robin order maximizes network balancing.) Round Robin Allocat
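The choice between the two allocators can be sketched as follows. The snippet does not say how the 20% imbalance is measured, so this sketch assumes it is the relative difference between the most-free and least-free OST; the function name and that definition are illustrative assumptions, not Lustre's actual implementation.

```python
def choose_allocator(free_space, threshold=0.20):
    """Pick an allocation method from per-OST free space (any consistent unit).

    Assumption: "imbalanced by more than 20%" means the gap between the
    emptiest and fullest OST, relative to the emptiest one, exceeds 0.20.
    """
    largest, smallest = max(free_space), min(free_space)
    if largest == 0:
        return "round-robin"
    imbalance = (largest - smallest) / largest
    return "weighted" if imbalance > threshold else "round-robin"


# Available 1K-blocks of the three OSTs in the lfs df output above
print(choose_allocator([37850660, 37795620, 39829356]))
```

Under this assumed definition, the three OSTs in the example differ by roughly 5%, so the round-robin allocator would still be in use.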
481. t oss161 lsmod Module Size Used by obdfilter 220532 fsfilt_ldiskfs 52228 ost 96712 mgc 60384 ldiskfs 186896 2 fsfilt_ldiskfs lustre 401744 lov 289064 1 lustre lquota 107048 4 obdfilter mdc 95016 1 lustre ksocklnd 111812. The Lustre mount command no longer recognizes the usrquota and grpquota options. If they were previously specified, remove them from /etc/fstab. When quota is enabled, it is enabled for all file system clients (started automatically using quota_type or manually with lfs quotaon). Note: Lustre with the Linux kernel 2.4 does not support quotas. To enable quotas automatically when the file system is started, you must set the mdt.quota_type and ost.quota_type parameters, respectively, on the MDT and OSTs. The parameters can be set to the string u (user), g (group) or ug for both users and groups. You can enable quotas at mkfs time (mkfs.lustre --param mdt.quota_type=ug) or with tunefs.lustre. As an example: tunefs.lustre --param ost.quota_type=ug $ost_dev. Caution: If you are using mkfs.lustre --param mdt.quota_type=ug or tunefs.lustre --param ost.quota_type=ug, be sure to run the command on all OSTs and the MDT. Otherwise, abnormal results may occur. Administrative and Operational Quotas: Lustre has two kinds of quota files: Administrative quotas (for the MDT), which contain limits for users/groups for the entire cluster
482. t program. 6. Register the service tags or save them for later use. There are two options for registering service tags: Click Next to continue with the remaining steps (3-5) of the registration process, including authentication to the Inventory management website and uploading your service tags. Or, save the collected service tags and register them on another machine. This option is good if the system used to collect the service tags does not have Web access. Click Save As and enter a file where the tags should be saved. You can then move this file (using network copy, a USB key, etc.) to a machine with Web access. On the Web-access machine, navigate to Sun Inventory and click Discover & Register to start the Registration client. Select the "Locate Product on Other Subnets, Specific System or Load Previously Saved Data" option and check the File Name box. Enter or navigate to the file where the collected service tags were saved, click Next, and follow the remaining steps (3-5) to complete the registration process, including authentication to the Inventory management website and uploading your service tags. 7. If you wish, navigate to Sun Inventory and log into your account to view and manage your IT assets. Note: For more information about service tags, see https://inventory.sun.com, which links to the http://wikis.sun.com/display/ServiceTag/Home wiki. This wiki includes an FAQ about Sun's service tag program. Ch
483. t up an external HA mechanism. The recommended choice is the Heartbeat package, available at www.linux-ha.org. Heartbeat is responsible for detecting failure of the primary server node and controlling the failover. The HA software controls Lustre using its built-in file system mechanism to unmount and mount file systems. Although Heartbeat is recommended, Lustre works with any HA software that supports resource (I/O) fencing. The hardware setup requires a pair of servers with a shared connection to physical storage (such as SAN, NAS, hardware RAID, SCSI and FC). The method of sharing storage should be essentially transparent at the device level, that is, the same physical LUN should be visible from both nodes. To ensure high availability at the level of physical storage, we encourage the use of RAID arrays to protect against drive-level failures. To have a fully automated, highly available Lustre system, you need power management software and HA software, which must provide the following: Resource fencing: Physical storage must be protected from simultaneous access by two nodes. Resource control: Starting and stopping the Lustre processes as a part of failover, maintaining the cluster state, and so on. Health monitoring: Verifying the availability of hardware and network resources, and responding to health indications given by Lustre. This functionality has been available for some time in third-party tools.
484. t ver> package for SuSE Server 11. X X kernel-lustre-default-extra-<ver> i686 and x86_64 platform.
Lustre module RPMs:
lustre-modules-<ver>: Lustre modules for the patched kernel. X X
lustre-client-modules-<ver>: Lustre modules for patchless clients. X
Lustre utilities:
lustre-<ver>: Lustre utilities package. This includes userspace utilities to configure and run Lustre. X X
TABLE 3-1 Lustre required packages, descriptions and installation guidance (columns: Lustre Package; Description; Install on servers; Install on patchless clients; Install on patched clients)
lustre-ldiskfs-<ver>: Lustre-patched backing file system kernel module package for the ext3 file system. X
e2fsprogs-<ver>: Utilities package used to maintain the ext3 backing file system. X
lustre-client-<ver>: Lustre utilities for patchless clients. X
Only install this kernel RPM if you want to patch the client kernel. You do not have to patch the clients to run Lustre.
b. Install the kernel, modules and ldiskfs packages. Use the rpm -ivh command to install the kernel, module and ldiskfs packages. For example:
    rpm -ivh kernel-lustre-smp-<ver> kernel-ib-<ver> lustre-modules-<ver> lustre-ldiskfs-<ver>
c. Install the utilities/userspace packages. Use the rpm -ivh command to install the utilities packages. For example:
    rpm -ivh lustre-<ver>
485. ta_pending_commit wait_for_ino_quota lquota_pending_commit wait_for_pending_blk_quota_req qctxt_wait_pending_dqacq wait_for_pending_ino_quota_req qctxt_wait_pending_dqacq. Quota slaves send an acquiring_quota request and wait for its return. Quota slaves send a releasing_quota request and wait for its return. Quota slaves send an acquiring_quota request and do not wait for its return. Quota slaves send a releasing_quota request and do not wait for its return. Before data is written to OSTs, the OSTs check if the remaining block quota is sufficient. This is done in the lquota_chkquota function. Before files are created on the MDS, the MDS checks if the remaining inode quota is sufficient. This is done in the lquota_chkquota function. After blocks are written to OSTs, relative quota information is updated. This is done in the lquota_pending_commit function. After files are created, relative quota information is updated. This is done in the lquota_pending_commit function. On the MDS or OSTs, there is one thread sending a quota request for a specific UID/GID for block quota at any time. At that time, if other threads need to do this too, they should wait. This is done in the qctxt_wait_pending_dqacq function. On the MDS, there is one thread sending a quota request for a specific UID/GID for inode quota at any time. If other threads need to do this too, they should wait. This is done in the qctxt_wait_pending
486. te The recovery process is blocked until all OSTs are available. More Complex Configurations: If a node has multiple network interfaces, it may have multiple NIDs. When a node is specified, all of its NIDs must be listed, delimited by commas (,) so other nodes can choose the NID that is appropriate for their network interfaces. When failover nodes are specified, they are delimited by a colon (:) or by repeating a keyword (--mgsnode= or --failnode=). To obtain all NIDs from a node (while LNET is running), run: lctl list_nids. Failover: This example has a combined MGS/MDT failover pair on uml1 and uml2, and an OST failover pair on uml3 and uml4. There are corresponding Elan addresses on uml1 and uml2.
    uml1> mkfs.lustre --fsname=testfs --mdt --mgs --failnode=uml2,2@elan /dev/sda1
    uml1> mount -t lustre /dev/sda1 /mnt/test/mdt
    uml3> mkfs.lustre --fsname=testfs --ost --failnode=uml4 --mgsnode=uml1,1@elan --mgsnode=uml2,2@elan /dev/sdb
    uml3> mount -t lustre /dev/sdb /mnt/test/ost0
    client> mount -t lustre uml1,1@elan:uml2,2@elan:/testfs /mnt/testfs
    uml1> umount /mnt/mdt
    uml2> mount -t lustre /dev/sda1 /mnt/test/mdt
    uml2> cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status
Where multiple NIDs are specified, comma-separation (for example, uml2,2@elan) means that the two NIDs refer to the same host, and that Lustre needs to choose t
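The two delimiters can be illustrated by splitting a client mount device string into its parts. This is a minimal sketch, assuming the device has the form `node[,nid...]:node[,nid...]:/fsname` as in the client mount of the failover example; `parse_client_device` is a hypothetical helper, not part of any Lustre tool.

```python
def parse_client_device(device):
    """Split '<mgsspec>:/<fsname>' into failover nodes, their NIDs, and fsname.

    Colons separate failover nodes; commas separate the NIDs of one node.
    """
    mgsspec, sep, fsname = device.rpartition(":/")
    if not sep or not mgsspec or not fsname:
        raise ValueError("expected <mgsspec>:/<fsname>, got %r" % device)
    nodes = [node.split(",") for node in mgsspec.split(":")]
    return nodes, fsname


nodes, fsname = parse_client_device("uml1,1@elan:uml2,2@elan:/testfs")
for index, nids in enumerate(nodes):
    # Each inner list holds alternative NIDs for the same host
    print("failover node", index, "NIDs:", nids)
print("file system:", fsname)
```

Each inner list holds NIDs of a single host (the client picks the one matching its own network), while the outer list holds the failover alternatives.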
487. tem.
lfs df -h: Lists space usage per OST and MDT in human-readable format.
lfs df -i: Lists inode usage per OST and MDT.
lfs df --pool <filesystem>[.<pool>] | <pathname>: Lists space or inode usage for a specific OST pool.
lfs quotachown -i /mnt/lustre: Changes file owner and group.
lfs quotacheck -ug /mnt/lustre: Checks quotas for user and group. Turns on quotas after making the check.
lfs quotaon -ug /mnt/lustre: Turns on quotas of user and group.
lfs quotaoff -ug /mnt/lustre: Turns off quotas of user and group.
lfs setquota -u bob --block-softlimit 2000000 --block-hardlimit 1000000 /mnt/lustre: Sets quotas of user bob, with a 1 GB block quota hardlimit and a 2 GB block quota softlimit.
lfs setquota -t -u --block-grace 1000 --inode-grace 1w4d /mnt/lustre: Sets grace times for user quotas: 1000 seconds for block quotas, 1 week and 4 days for inode quotas.
lfs quota -u bob /mnt/lustre: Lists quotas of user bob.
lfs quota -t -u /mnt/lustre: Shows grace times for user quotas on /mnt/lustre.
lfs setstripe --pool my_pool /mnt/lustre/dir: Associates a directory with the pool my_pool, so all new files and directories are created in the pool.
lfs find /mnt/lustre --pool poolA: Finds all directories/files associated with poolA.
lfs find /mnt/lustre --pool "": Finds all di
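The two grace-time forms used in the setquota example (a plain number of seconds, and a compound value such as 1w4d) can be converted to seconds with a short helper. This is a sketch only: `grace_seconds` is a hypothetical name, and the set of unit letters (w, d, h, m, s) is an assumption beyond the w and d that appear in the example.

```python
import re

# Seconds per unit letter; only 'w' and 'd' appear in the example above,
# the remaining letters are assumed.
_UNIT_SECONDS = {"w": 7 * 24 * 3600, "d": 24 * 3600, "h": 3600, "m": 60, "s": 1}


def grace_seconds(text):
    """Convert a grace time such as '1000' or '1w4d' to seconds."""
    if text.isdigit():
        return int(text)
    parts = re.findall(r"(\d+)([wdhms])", text)
    # Reject input with characters the pattern did not consume
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError("unrecognized grace time: %r" % text)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


print(grace_seconds("1000"))
print(grace_seconds("1w4d"))
```

Here 1w4d works out to 604800 + 4 × 86400 = 950400 seconds.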
488. the common case, the server is just overloaded, then they try to contact another server listed as a failover server for that node. If a connection goes to the "failed" condition, which happens immediately in "failout" OST mode, new and old requests receive EIOs. In non-failout mode, a connection can only get into this state by using lctl deactivate, which is the only option for the client in the event of an OST failure. Failout means that if an OST becomes unreachable (because it has failed, been taken off the network, unmounted, turned off, etc.), then I/O to get objects from that OST causes a Lustre client to get an EIO. Roles of Nodes in a Failover: A failover pair of nodes can be configured in two ways, active/active and active/passive. An active node actively serves data while a passive node is idle, standing by to take over in the event of a failure. In the following example, using two OSTs (both of which are attached to the same shared disk device), the following failover configurations are possible: active/passive: This configuration has two nodes, out of which only one is actively serving data all the time. In case of a failure, the other node takes over. If the active node fails, the OST in use by the active node will be taken over by the passive node, which now becomes active. This node serves most services that were on the failed node. active/active: This configuration has two nodes actively serving data all the
489. the file system cannot be mounted; currently there is no way that parses metadata directly from an MDS. If the bad OST does not start, options to mount the file system are to provide a loop device OST in its place or replace it with a newly formatted OST. In that case, the missing objects are created and are read as zero-filled. In releases prior to Lustre 1.8, you could not mount a file system with a missing OST. Improving Lustre Performance When Working with Small Files: A Lustre environment where an application writes small file chunks from many clients to a single file will result in bad I/O performance. To improve Lustre's performance: Have the application aggregate writes some amount before submitting them to Lustre. By default, Lustre enforces POSIX coherency semantics, so it results in lock ping-pong between client nodes if they are all writing to the same file at one time. Have the application do 4 kB O_DIRECT sized I/O to the file and disable locking on the output file. This avoids partial-page I/O submissions and, by disabling locking, you avoid contention between clients. Have the application write contiguous data. Add more disks or use SSD disks for the OSTs. This dramatically improves the IOPS rate. Consider creating larger OSTs rather than many smaller OSTs due to less overhead (journal, connections, etc.). Default Striping: These are t
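The first recommendation, aggregating small writes before they reach the file system, can be sketched with a thin buffering wrapper. This is an illustration only: the class name and the 1 MB chunk size are assumptions, and Python's own buffered file objects already provide similar behavior through the `buffering` argument of `open()`.

```python
import io


class AggregatingWriter:
    """Collect small writes and submit them to `raw` in larger chunks."""

    def __init__(self, raw, chunk_size=1 << 20):
        self.raw = raw                # any object with a write(bytes) method
        self.chunk_size = chunk_size  # assumed aggregation size: 1 MB
        self.pending = bytearray()
        self.submitted = 0            # number of write() calls issued to raw

    def write(self, data):
        self.pending += data
        # Submit only full chunks; keep the remainder buffered
        while len(self.pending) >= self.chunk_size:
            self._submit(self.chunk_size)

    def flush(self):
        # Push out whatever is left, even if it is smaller than a chunk
        if self.pending:
            self._submit(len(self.pending))

    def _submit(self, nbytes):
        self.raw.write(bytes(self.pending[:nbytes]))
        del self.pending[:nbytes]
        self.submitted += 1


raw = io.BytesIO()
writer = AggregatingWriter(raw, chunk_size=1024)
for _ in range(300):
    writer.write(b"x" * 10)   # 300 small application writes
writer.flush()
print(writer.submitted, "writes submitted for", len(raw.getvalue()), "bytes")
```

The 300 ten-byte writes reach the underlying file as three larger writes, which is the effect the recommendation is after.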
490. the next polling occurs. The default value is 1, meaning that every network event has only one chance to be processed before polling occurs the next time. N should be set to a positive value. USOCK_TIMEOUT=N: Specifies the network timeout (measured in seconds). Network operations that are not completed in N seconds time out and are canceled. The default value is 50 seconds. N should be a positive value. USOCK_POLL_TIMEOUT=N: Specifies the polling timeout, that is, how long usocklnd sleeps if no network events occur. A larger N results in a slightly lower overhead of checking network timeouts and a longer delay in evicting timed-out events. The default value is 1 second. N should be set to a positive value. USOCK_MIN_BULK=N: This tunable is only used for typed network connections. Currently, liblustre clients do not use this usocklnd facility. OFED InfiniBand Options: For the SilverStorm/Infinicon InfiniBand LND (iiblnd), the network and HCA may be specified, as in this example:
    options lnet networks="o2ib3(ib3)"
This specifies that the node is on o2ib network number 3, using HCA ib3. Module Parameters - Routing: The following parameter specifies a colon-separated list of router definitions. Each route is defined as a network number, followed by a list of routers:
    route=<net type> <router NID(s)>
Examples:
    options lnet networks="o2ib0" routes="tcp0 192.168.10
491. the same procedure as Backing Up the MDS, except skip Step 5 and, for each OST device file system, replace mds with ost in the commands. Performing File-level Backups: In some situations, you may want to back up the individual files on the MDT or OST file system, rather than back up all of the blocks in the device. This may be a preferred backup strategy if the storage device is large but has relatively little data, if parameter configurations on the ext3 file system need to be changed, or to use less space for backup. You can mount the ext3 file system directly from the storage device and do a file-level backup. However, you MUST STOP Lustre on that node. To do this, back up the Extended Attributes (EAs) stored in the file system. As the current backup tools do not properly save this data, perform the following procedure. Restoring from a File-level Backup: To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.
1. Format the device. To get the optimal ext3 parameters, run:
    mkfs.lustre --fsname {fsname} --reformat --mgs|mdt|ost /dev/sda
Caution: Only reformat the device you want to restore.
2. Enable ext3 file system directory indexing. Run:
    tune2fs -O dir_index {dev}
3. Mount the file system. Run:
    mount -t ldiskfs {dev} /mnt/mds
4. Change to the new file system mount point. Run:
    cd /mnt/mds
492. thernet broadcast
    udpport 694
    bcast eth0
    # Use manual failback
    auto_failback off
    # Cluster members, name must match `hostname`
    node oss161.clusterfs.com oss162.clusterfs.com
    # remote health ping
    ping 192.168.16.1
    respawn hacluster /usr/lib/heartbeat/ipfail
Create /etc/ha.d/haresources. This file must be identical on both the nodes. It specifies a virtual IP address and a service. Sample haresources:
    oss161.clusterfs.com 192.168.16.35 Filesystem::/dev/sda::/ost1::lustre
    oss162.clusterfs.com 192.168.16.36 Filesystem::/dev/sda::/ost1::lustre
Create /etc/ha.d/authkeys. Copy the example from /usr/share/doc/heartbeat-<version>. chmod the file 0600; Heartbeat does not start if the permissions on this file are incorrect. Sample authkeys file:
    auth 1
    1 sha1 PutYourSuperSecretKeyHere
a. Start Heartbeat.
    [root@oss161 ha.d]# service heartbeat start
    Starting High-Availability services: [ OK ]
b. Monitor the syslog on both nodes. After the initial deadtime interval, you should see the nodes discovering each other's state, and then they start the Lustre resources they own. You should see the startup command in the log:
    Aug 9 09:50:44 oss161 crmd 4733 info update_dc Set DC to <null> (<null>)
    Aug 9 09:50:44 oss161 crmd 4733 info do_election_count_vote Election check vote from oss162.clusterfs.co
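Because Heartbeat refuses to start when the authkeys permissions are wrong, a quick pre-flight check can save a failed start. The sketch below uses only the Python standard library; `authkeys_mode_ok` is a hypothetical helper name, and the path passed to it is whatever location the authkeys file was created in.

```python
import os
import stat


def authkeys_mode_ok(path):
    """Return True if the file's permission bits are exactly 0600."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


def fix_authkeys_mode(path):
    """Set the file to 0600 if it is not already, and report whether it changed."""
    if authkeys_mode_ok(path):
        return False
    os.chmod(path, 0o600)
    return True
```

Running `fix_authkeys_mode` on the authkeys file on both nodes before starting the service would be one way to apply the chmod step described above.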
493. thread count exceeds RH. Each of these arguments is exclusive with the regioncount argument. When generating the next I/O task, do not select the next chunk in the next stream, but shift a random number with a maximum noise of shifting k regions ahead. The run will complete when all regions are fully written or read. This merely introduces a randomization of the ordering. The argument is a byte specifier or a list of byte specifiers. During the run(s), write S bytes to each region. The arguments are byte specifiers. Generate runs with a range of region sizes starting at TL, increasing P until the region size exceeds RH. Each argument is exclusive with the regionsize argument. PIOS runs with T threads performing I/O. A sequence of values may be given. Generate runs with a range of thread counts starting at TL, increasing TP until the thread count exceeds TH. Each of these arguments is exclusive with the threadcount argument. A random amount of noise not exceeding ms is inserted between the time that a thread identifies as the next chunk it needs to read or write and the time it starts the I/O. Where threads write to files: fpp indicates files-per-process behavior, where threads write to multiple files; sff indicates single shared files, where all threads write to the same file. Verify a written file or set of files. A single timestamp or sequence of timestamps can be given for each run, respectively. If no argument is passed
494. through a bitmap. Lustre log catalog: An llog with records that each point at an llog. Catalogs were introduced to give llogs almost infinite size. llogs have an originator, which writes records, and a replicator, which cancels records (usually through an RPC) when the records are not needed. Logical Metadata Volume: A driver to abstract in the Lustre client that it is working with a metadata cluster instead of a single metadata server. Glossary terms, in page order: LND, LNET, Load-balancing MDSs, Lock Client, Lock Server, LOV, LOV descriptor, Lustre, Lustre client, Lustre file, Lustre lite, Lvfs, M, Mballoc. Definitions, in the same order: Lustre Network Driver. A code module that enables LNET support over a particular transport, such as TCP and various kinds of InfiniBand, Elan or Myrinet. Lustre Networking. A message passing network protocol capable of running and routing through various physical layers. LNET forms the underpinning of LNETrpc. A cluster of MDSs that perform load balancing of system requests. A module that makes lock RPCs to a lock server and handles revocations from the server. A system that manages locks on certain objects. It also issues lock callback requests, calls while servicing or, for objects that are already locked, completes lock requests. Logical Object Volume. The object storage analog of a logical volume in a block device volume management system, such as LVM or EVMS. The LOV is primarily
495. tics files that share a common format and are updated at a specified interval (in seconds). To stop statistics printing, type CTRL-C.
Options:
-c: Clears the statistics file.
-i: Specifies the interval polling period (in seconds).
-g: Specifies graphable output format.
-h: Displays help information.
stats_file: Specifies either the full path to a statistics file or a shorthand reference, mds or ost.
Example: To monitor /proc/fs/lustre/ost/OSS/ost/stats at 1 second intervals, run:
    llstat -i 1 ost
Files: The llstat files are located at:
    /proc/fs/lustre/mdt/MDS/*/stats
    /proc/fs/lustre/mds/*/exports/*/stats
    /proc/fs/lustre/mdc/*/stats
    /proc/fs/lustre/ldlm/services/*/stats
    /proc/fs/lustre/ldlm/namespaces/*/pool/stats
    /proc/fs/lustre/mgs/MGS/exports/*/stats
    /proc/fs/lustre/ost/OSS/*/stats
    /proc/fs/lustre/osc/*/stats
    /proc/fs/lustre/obdfilter/*/exports/*/stats
    /proc/fs/lustre/obdfilter/*/stats
    /proc/fs/lustre/llite/*/stats
lst: The lst utility starts LNET self-test. Synopsis: lst. Description: LNET self-test helps site administrators confirm that Lustre Networking (LNET) has been correctly installed and configured. The self-test also confirms that LNET, the network software and the underlying hardware are performi
496. tics indicates that I/O is not 1 MB, check /sys/block/<device>/queue/max_sectors_kb. If it is less than 1024, set it to 1024 to improve the performance. If changing this setting does not change the I/O size as reported by Lustre, you may want to examine the SCSI driver code. Identifying Which Lustre File an OST Object Belongs To: Use this procedure to identify the file containing a given object on a given OST. 1. On the OST (as root), run debugfs to display the FID of the file associated with the object. For example, if the object is 34976 on /dev/lustre/ost_test2, the debug command is:
    debugfs -c -R "stat /O/0/d$((34976 %32))/34976" /dev/lustre/ost_test2
The command output is:
    debugfs 1.41.5.sun2 (23-Apr-2009)
    /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps
    Inode: 352365 Type: regular Mode: 0666 Flags: 0x80000
    Generation: 1574463214 Version: 0xea020000:00000000
    User: 500 Group: 500 Size: 260096
    File ACL: 0 Directory ACL: 0
    Links: 1 Blockcount: 512
    Fragment: Address: 0 Number: 0 Size: 0
    ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009
    Size of extra inode fields: 24
    Extended attributes stored in inode body:
    fid = e2 00 11 00 00 00 00 00 25
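The max_sectors_kb check at the start of this snippet can be scripted. The sketch below only reads the sysfs value and reports whether it is below the 1024 KB target; the function names and the `sys_root` parameter (added so the logic can be exercised against a scratch directory) are assumptions, and actually writing the new value is left to the administrator.

```python
from pathlib import Path


def max_sectors_kb_path(device, sys_root="/sys"):
    """Build the sysfs path named in the text for a block device such as 'sdb'."""
    return Path(sys_root) / "block" / device / "queue" / "max_sectors_kb"


def needs_raise(device, target_kb=1024, sys_root="/sys"):
    """Return True if the device's max_sectors_kb is below the target."""
    current_kb = int(max_sectors_kb_path(device, sys_root).read_text().strip())
    return current_kb < target_kb
```

On an OSS, `needs_raise("sdb")` returning True would suggest raising the value to 1024 as described above.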
497. tilities (man8). routerstat: The routerstat utility prints Lustre router statistics. Synopsis: routerstat [interval]. Description: The routerstat utility watches LNET router statistics. If no interval is specified, then statistics are sampled and printed only one time. Otherwise, statistics are sampled and printed at the specified interval (in seconds). Options: The routerstat output includes the following fields:
    M  msgs_alloc (msgs_max)
    E  errors
    S  send_length / send_count
    R  recv_length / recv_count
    F  route_length / route_count
    D  drop_length / drop_count
Files: Routerstat extracts statistics data from /proc/sys/lnet/stats. ll_recover_lost_found_objs: The ll_recover_lost_found_objs utility helps recover Lustre OST objects (file data) from a lost+found directory. Synopsis: ll_recover_lost_found_objs [-hv] -d directory. Description: The first time Lustre writes to an object, it saves the MDS inode number and the objid as an extended attribute on the object, so in case of directory corruption of the OST, it is possible to recover the objects. Running e2fsck fixes the corrupted OST directory, but it puts all of the objects into a lost+found directory, where they are inaccessible to Lustre. Use the ll_recover_lost_found_objs utility to recover all (or at least most) objects from a lost+found directory back t
498. ting in a more successful recovery from a downed OST. For more information about the VBR feature, see Version-based Recovery. In Lustre 1.6 and earlier, the success of the recovery process was limited by uncommitted client requests that could not be replayed. Because clients attempted to replay their requests to the OST and MDT in serial order, a client that could not replay its requests caused the recovery stream to stop, and left the remaining clients without an opportunity to reconnect and replay their requests. Write Performance Better Than Read Performance: Typically, the performance of write operations on a Lustre cluster is better than read operations. When doing writes, all clients are sending write RPCs asynchronously. The RPCs are allocated, and written to disk in the order they arrive. In many cases, this allows the back-end storage to aggregate writes efficiently. In the case of read operations, the reads from clients may come in a different order and need a lot of seeking to get read from the disk. This noticeably hampers the read throughput. Currently, there is no readahead on the OSTs themselves, though the clients do readahead. If there are lots of clients doing reads, it would not be possible to do any readahead in any case because of memory consumption (consider that even a single RPC (1 MB) readahead for 1000 clients would consume 1 GB of RAM). For file systems that use socklnd (TCP, Ethernet) as interconnect, there is al
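The memory estimate in the readahead remark is simple arithmetic, sketched here so it can be repeated for other client counts. The function name and the `rpcs_per_client` parameter are illustrative assumptions; the 1 MB RPC size and the 1000 clients come from the text.

```python
def readahead_ram_bytes(clients, rpc_bytes=1 << 20, rpcs_per_client=1):
    """Server RAM needed to hold readahead data for every client at once."""
    return clients * rpc_bytes * rpcs_per_client


needed = readahead_ram_bytes(1000)
print("%.2f GiB for 1000 clients at one 1 MB RPC each" % (needed / 2**30))
```

With one 1 MB RPC per client this comes to a little under 1 GiB, matching the "1 GB of RAM" figure quoted above, and it grows linearly with either the client count or the readahead depth.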
499. tion.
-b: Uses the 1024-byte blocksize for the output. By default, this blocksize is used by Lustre, since OSTs may use different block sizes.
-e: Uses the extent mode when printing the output.
-l: Displays extents in LUN offset order.
-s: Synchronizes the file before requesting the mapping.
-v: Uses the verbose mode when checking file fragmentation.
Examples: Lists default output.
    filefrag /mnt/lustre/foo
    /mnt/lustre/foo: 6 extents found
Lists verbose output in extent format.
    filefrag -ve /mnt/lustre/foo
    Checking /mnt/lustre/foo
    Filesystem type is: bd00bd0
    Filesystem cylinder groups is approximately 5
    File size of /mnt/lustre/foo is 157286400 (153600 blocks)
    ext: device_logical: start..end physical: start..end: length: device: flags:
    0: 0..49151: 212992..262144: 49152: remote
    1: 49152..73727: 270336..294912: 24576: remote
    2: 73728..76799: 24576..27648: 3072: remote
    3: 0..57343: 196608..253952: 57344: remote
    4: 57344..65535: 139264..147456: 8192: remote
    5: 65536..76799: 163840..175104: 11264: remote
    /mnt/lustre/foo: 6 extents found
Mount: Lustre uses the standard Linux mount command and also supports a few extra options. In Lustre 1.4, the server-side options should be added to the XML configuration with the --mountfsoptions= argument. Here are the Lustre-specific options: Server options, Description: exten
500. to this target.
--quiet: Prints less information.
--verbose: Prints more information.
--writeconf: Erases all configuration logs for the file system to which this MDT belongs, and regenerates them. This is very dangerous. All clients and servers should be stopped. All targets must then be restarted to regenerate the logs. No clients should be started until all targets have restarted. In general, this command should only be executed on the MDT, not the OSTs.
Examples: Changing the MGS's NID address. (This should be done on each target disk, since they should all contact the same MGS.)
    tunefs.lustre --erase-param --mgsnode=<new_nid> --writeconf /dev/sda
Adding a failover NID location for this target.
    tunefs.lustre --param="failover.node=192.168.0.13@tcp0" /dev/sda
lctl: The lctl utility is used to directly control Lustre via an ioctl interface, allowing various configuration, maintenance and debugging features to be accessed. Synopsis:
    lctl
    lctl --device <OST device number> <command [args]>
Description: The lctl utility can be invoked in interactive mode by issuing the lctl command. After that, commands are issued as shown below. The most common lctl commands are: dl, device, network <up/down>, list_nids, ping nid, help, quit. For a complete li
501. tre.
1. Make a complete, restorable file system backup before downgrading Lustre.
2. Install the 1.6.x packages on the Lustre component (server or client). For help determining where to install a specific package, see TABLE 3-1 (Lustre packages, descriptions and installation guidance).
a. Install the kernel, modules and ldiskfs packages. For example:
    rpm -ivh kernel-lustre-smp-<ver> kernel-ib-<ver> lustre-modules-<ver> lustre-ldiskfs-<ver>
b. Downgrade the utilities/userspace packages using the --oldpackage option. For example:
    rpm -Uvh --oldpackage lustre-<ver>
Note: You do not need to downgrade or take any action with e2fsprogs.
3. Unload the old Lustre modules by either: Rebooting the node, OR removing the Lustre modules manually. Run lustre_rmmod several times and use lsmod to check the currently loaded modules.
4. If the downgraded component is a server, fail back services to it.
If you have a problem upgrading Lustre, contact us via the Bugzilla bug tracker.
Chapter 14, Lustre SNMP Module: The Lustre SNMP module reports information about Lustre components and system status, and generates traps if an LBUG occurs. The Lustre SNMP module works with net-snmp. The module consists of a plug-in (lustresnmp.so), which is loaded by the snmpd daemon, and a MIB file (Lustre-MIB.txt). This chapter describes
502. tre Properties (man3). This chapter describes how to use llapi to set Lustre file properties. Using llapi: Several llapi commands are available to set Lustre properties: llapi_file_create, llapi_file_get_stripe and llapi_file_open. These commands are described in the following sections: llapi_file_create, llapi_file_get_stripe, llapi_file_open, llapi_quotactl. llapi_file_create: Use llapi_file_create to set Lustre properties for a new file. Synopsis:
    #include <lustre/liblustreapi.h>
    #include <lustre/lustre_user.h>
    int llapi_file_create(char *name, long stripe_size, int stripe_offset, int stripe_count, int stripe_pattern);
Description: The llapi_file_create() function sets a file descriptor's Lustre striping information. The file descriptor is then accessed with open().
llapi_file_create(): If the file already exists, this function returns EEXIST. If the stripe parameters are invalid, this function returns EINVAL.
stripe_size: This value must be an even multiple of the system page size, as shown by getpagesize(). The default Lustre stripe size is 4 MB.
stripe_offset: Indicates the starting OST for this file.
stripe_count: Indicates the number of OSTs that this file will be striped across.
stripe_pattern: Indicates the RAID pattern. Note: Currently, only RAID 0 is supported. To use the system defaults, set these val
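The stripe_size constraint can be checked before calling llapi_file_create. The sketch below is a plain Python illustration of the page-size rule only; it does not call liblustreapi, `valid_stripe_size` is a hypothetical name, and it deliberately does not model the "system default" values, which the truncated text does not spell out.

```python
import mmap


def valid_stripe_size(stripe_size, page_size=mmap.PAGESIZE):
    """Check that an explicit stripe size is a whole multiple of the page size."""
    return stripe_size > 0 and stripe_size % page_size == 0


# 65536 bytes is the stripe size used in the manual's sample program
print(valid_stripe_size(65536))
```

A caller could run this check first and skip the llapi call (which would otherwise report EINVAL) when it fails.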
503. tre tests sub directory These scripts enable quick setup of some simple standard Lustre configurations Note We recommend that you use dotted quad IP addressing IPv4 rather than host names This aids in reading debug logs and helps greatly when debugging configurations with multiple interfaces 1 Define the module options for Lustre networking LNET by adding this line to the etc modprobe conf filel options lnet networks lt network interfaces that LNET can use gt This step restricts LNET to use only the specified network interfaces and prevents LNET from using all network interfaces As an alternative to modifying the modprobe conf file you can modify the modprobe local file or the configuration files in the modprobe d directory Note For details on configuring networking and LNET see Configuring LNET 2 Optional Prepare the block devices to be used as OSTs or MDTs Depending on the hardware used in the MDS and OSS nodes you may want to set up a hardware or software RAID to increase the reliability of the Lustre system For more details on how to set up a hardware or software RAID see the documentation for your RAID controller or see Lustre Software RAID Support 1 The modprobe conf file is a Linux file that lives in etc modprobe conf and specifies what parts of the kernel are loaded 4 2 Lustre 1 8 Operations Manual October 2009 3 Create a combined MGS MDT file system a Consider the
504. tripe 29 4 Ilapi_file open 29 5 Ilapi_quotactl 29 6 Ilapi_path2fid 29 9 XX Lustre 1 8 Operations Manual October 2009 30 31 Configuration Files and Module Parameters man5 30 1 30 1 Introduction 30 1 30 2 Module Options 30 2 30 2 1 30 2 2 30 2 3 30 2 4 30 2 5 30 2 6 30 2 7 30 2 8 30 2 9 LNET Options 30 3 SOCKLND Kernel TCP IP LND 30 8 QSW LND 30 10 RapidArray LND 30 11 VIBLND 30 12 OpenIB LND 30 14 Portals LND Linux 30 15 Portals LND Catamount 30 18 MXLND 30 20 System Configuration Utilities man8 31 1 mkfs lustre 31 2 31 1 31 2 31 3 31 4 31 5 tunefs lustre 31 5 Ictl 31 8 mount lustre 31 15 Additional System Configuration Utilities 31 18 31 5 1 31 5 2 31 5 3 31 5 4 31 5 5 31 5 6 31 5 7 31 5 8 31 5 9 31 5 10 lustre_rmmod sh 31 18 e2scan 31 18 Utilities to Manage Large Clusters 31 19 Application Profiling Utilities 31 20 More proc Statistics for Application Profiling 31 20 Testing Debugging Utilities 31 21 Flock Feature 31 22 l_getgroups 31 23 Ilobdstat 31 24 stat 31 25 Contents xxi 31 5 11 Ist 31 27 31 5 12 plot llstat 31 29 31 5 13 routerstat 31 30 31 5 14 Il_recover_lost_found_objs 31 31 32 System Limits 32 1 32 1 Maximum Stripe Count 32 1 32 2 Maximum Stripe Size 32 2 32 3 Minimum Stripe Size 32 2 32 4 Maximum Number of OSTs and MDTs 32 2 32 5 Maximum Number of Clients 32 2 32 6 Maximum Size of a File System 32 3 32 7 Maximum File Siz
505. ts mballoc Use extent mapped files Use Lustre file system allocator required Lustre 1 6 server options Description abort_recov nosvc exclude Abort recovery when starting a target currently an lconf option Start only MGS MGC servers Start with a dead OST Client options Description flock user_xattr nouser_xattr retry Enable disable flock support Enable disable user extended attributes Number of times a client will retry mount Lustre 1 8 Operations Manual October 2009 27 5 Handling Timeouts Timeouts are the most common cause of hung applications After a timeout involving an MDS or failover OST applications attempting to access the disconnected resource wait until the connection gets established When a client performs any remote operation it gives the server a reasonable amount of time to respond If a server does not reply either due to a down network hung server or any other reason a timeout occurs which requires a recovery If a timeout occurs a message similar to this one appears on the console of the client and in var log messages LustreError 26597 client c 810 ptlrpc_ expire one request timeout req a2d45200 x5886 t0 0o38 gt mds_svc_UUID NID_mds_UUID 12 lens 168 64 ref 1 fl RPC 0 0 rc 0 Chapter 27 User Utilities man1 27 23 27 24 Lustre 1 8 Operations Manual October 2009 CHAPTER 28 Lustre Programming Interfaces man2 This
506. ts. It can be invoked interactively without any arguments or in a non-interactive mode with one of the supported arguments.

Options
The various lfs options are listed and described below. For a complete list of available options, type help at the lfs prompt.

Option     Description
check      Displays the status of the MDS or OSTs (as specified in the command) or all servers (MDS and OSTs).
df         Reports file system disk space usage or inode usage of each MDT/OST. Can limit the scope to a specific OST pool.
find       (sub-options: atime, ctime, mtime, obd) Searches the directory tree rooted at the given directory/filename for files that match the given parameters. The --maxdepth option limits find to descend at most N levels of the directory tree. The --print and --print0 options print the full filename, followed by a new line or NUL character correspondingly. Using ! before an option negates its meaning (files NOT matching the parameter). Using + before a numeric value means files with the parameter OR MORE. Using - before a numeric value means files with the parameter OR LESS.
  atime    File was last accessed N*24 hours ago. (There is no guarantee that atime is kept coherent across the cluster.) OSTs store a transient atime that is updated when clients do read requests. Permanent atime is written to the MDS when the file is closed. However, on-disk atime is only updated if it is more than 60 seconds old
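The numeric matching described for find can be sketched as follows. This is an illustration only, not lfs source code: the function names are hypothetical, and it assumes the conventional prefixes, with '+' meaning "N or more", '-' meaning "N or less", and no prefix meaning "exactly N", where N counts whole 24-hour periods.

```c
#include <stdbool.h>

/* Age of a timestamp in whole 24-hour periods (hypothetical helper). */
long age_in_days(long now_sec, long atime_sec)
{
    return (now_sec - atime_sec) / 86400L;
}

/*
 * prefix '+': age is N or more; prefix '-': age is N or less;
 * any other prefix value (for example 0): age is exactly N.
 */
bool age_matches(char prefix, long n, long age_days)
{
    switch (prefix) {
    case '+':
        return age_days >= n;
    case '-':
        return age_days <= n;
    default:
        return age_days == n;
    }
}
```

Negating an option, as described above, would simply invert the result of age_matches.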
507. ts before logging out, run lfs flushctx:
    lfs flushctx [-k]
Here, -k also means destroy the on-disk Kerberos credential cache. It is equivalent to kdestroy. Otherwise, it only destroys established contexts in the Lustre kernel.

CHAPTER 12 Bonding
This chapter describes how to set up bonding with Lustre, and includes the following sections:
- Network Bonding
- Requirements
- Using Lustre with Multiple NICs versus Bonding NICs
- Bonding Module Parameters
- Setting Up Bonding
- Configuring Lustre with Bonding

12.1 Network Bonding
Bonding, also known as link aggregation, trunking and port trunking, is a method of aggregating multiple physical network links into a single logical link for increased bandwidth. Several different types of bonding are supported in Linux. All these types are referred to as "modes," and use the bonding kernel module. Modes 0 to 3 provide support for load balancing and fault tolerance by using multiple interfaces. Mode 4 aggregates a group of interfaces into a single virtual interface where all members of the group share the same speed and duplex settings. This mode is described under IEEE spec 802.3ad, and it is referred to as either "mode 4" or "802.3ad." (802.3ad refers to mode 4 only. The detail is contained in Clause 43 of the larger IEEE 802.3 specification. For more information, consult IEEE.)
508. ts in messages such as the following appearing on the system console which normally indicates a system configuration error af0ac_mds_scratch_2b27fc413e does not match last_rcvd UUID 8a9c5_mds_scratch_8d2422aa88 In some cases it is possible to get the incorrect UUID in the configuration file for example by regenerating the xml configuration file a second time In this case you must specify the device UUIDs when the configuration file is built with the ostuuid or mdsuuid options to match the original UUIDs instead of generating new ones each time imc add ost node ostnode lov lovi dev dev sdc ostuuid 3dbf8_OST_ostnode_ddd780786b imc add mds node mdsnode mds mds_scratch dev dev sdc mdsuuid 8a9c5_mds_scratch_8d2422aa88 How do I set up multiple Lustre file systems on the same node Assuming you want to have separate file systems with different mount locations you need a dedicated MDS partition and Logical Object Volume LOV for each file system Each LOV requires a dedicated OST s For example if you have an MDS server node mds_server and want to have mount points mnt foo and mnt bar the following lines are an example of the setup leaving out the add net lines Two MDS servers using distinct disks imc m test xml add mds node mds_server mds foo mds group foo mds fstype ldiskfs dev dev sda imc m test xml add mds node mds_server mds bar mds group bar mds fstype ldiskfs
509. u do not have libgssapi build and install it from source http www citi umich edu projects nfsv4 linux libgssapi libssapi 0 10 tar gz keyutils Chapter 11 Kerberos 11 3 11 2 1 3 Configuring Lustre for Kerberos To configure Lustre for Kerberos 1 Configure the client nodes a For each client node create a lustre_root principal and generate the keytab kadmin gt addprinc randkey lustre _root client_host domain REALM kadmin gt ktadd e aes128 cts normal lustre _root client_host domain REALM b Install the keytab on the client node Note For each client OST pair there is only one security context shared by all users on the client This protects data written by one user to be passed to an OST by another user due to asynchronous bulk I O The client OST connection only guarantees message integrity or privacy it does not authenticate users 2 Configure the MDS nodes a For each MDS node create a lustre_mds principal and generate the keytab kadmin gt addprinc randkey lustre_mds mdthost domain REALM kadmin gt ktadd e aes128 cts normal lustre_mds mdthost domain REALM b Install the keytab on the MDS node 3 Configure the OSS nodes a For each OSS node create a lustre_oss principal and generate the keytab kadmin gt addprinc randkey lustre_oss osthost domain REALM kadmin gt ktadd e aes128 cts normal lustre_oss osshost domain REALM b Install the keytab on the OSS node Tip To avoid
510. ues:
    stripe_size = 0
    stripe_offset = -1
    stripe_count = 0
    stripe_pattern = 0

Examples
    char *tfile = TESTFILE;
    /* System default size is 4 MB; this example uses 64 KB. */
    int stripe_size = 65536;
    /* -1 starts at the default OST. */
    int stripe_offset = -1;
    /* A single stripe for this example. */
    int stripe_count = 1;
    /* Currently, only RAID 0 is supported. */
    int stripe_pattern = 0;
    int rc, fd;

    rc = llapi_file_create(tfile, stripe_size, stripe_offset, stripe_count, stripe_pattern);
    /* Result code is inverted; you may return with EINVAL or an ioctl error. */
    if (rc) {
        fprintf(stderr, "llapi_file_create failed: %d (%s)\n", rc, strerror(-rc));
        return -1;
    }

    /* llapi_file_create closes the file descriptor. You must re-open the descriptor. */
    fd = open(tfile, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0) {
        fprintf(stderr, "Can't open %s file: %s\n", tfile, strerror(errno));
        return -1;
    }

29.1.2 llapi_file_get_stripe
Use llapi_file_get_stripe to get striping information.

Synopsis
    int llapi_file_get_stripe(const char *path, struct lov_user_md *lum);

Description
The llapi_file_get_stripe function returns the striping information to the caller. If it returns a zero (0), the operation was successful; a negative number means there was a failure.

Option Desc
511. uired OSTs.
2. Verify that the obdecho.ko module is present.
3. Run the obdfilter_survey script with the parameter case=disk. For example:
    nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey

To perform a manual run:
1. List all OSTs you want to test. (You do not have to specify an MDS or LOV.)
2. On all OSSs, run:
    mkfs.lustre --fsname spfs --mdt --mgs /dev/sda
Caution: Write tests are destructive. This test should be run before the Lustre file system is started. If you do this, you will not need to reformat to restart the Lustre system. However, if the obdfilter_survey test is terminated before it completes, you may have to remove objects from the disk.
(Footnote 1: The sgpdd-survey profiles individual disks. This script is destructive and should not be run anywhere you want to preserve existing data.)
3. Determine the obdfilter instance names on all Lustre clients. The device names appear in the fourth column of the lctl dl command output. For example:
    pdsh -w oss[01-02] lctl dl | grep obdfilter | sort
    oss01: 0 UP obdfilter oss01-sdb oss01-sdb_UUID 3
    oss01: 2 UP obdfilter oss01-sdd oss01-sdd_UUID 3
    oss02: 0 UP obdfilter oss02-sdi oss02-sdi_UUID 3
In this example, the obdfilter instance names are oss01-sdb, oss01-sdd and oss02-sdi. Since you are driving obdfilter instances directly, set the shell array variable targets to the names of the obdfilter instances
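Picking the instance name out of a device-list line can be sketched in C as below. This is an illustrative helper, not part of the Lustre tools: the function name is hypothetical, and it assumes each input line is raw lctl dl output (index, state, type, name, UUID, refcount) without the host prefix that pdsh adds.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the fourth column (the device name) of one
 * "lctl dl" line into name, but only for obdfilter devices.
 * Returns 0 on success, -1 if the line does not describe an obdfilter
 * device or the name does not fit in the buffer.
 */
int obdfilter_name_from_dl_line(const char *line, char *name, size_t name_len)
{
    char type[64];
    char inst[64];

    if (name_len == 0)
        return -1;
    /* Skip index and state, then read the type and the device name. */
    if (sscanf(line, "%*s %*s %63s %63s", type, inst) != 2)
        return -1;
    if (strcmp(type, "obdfilter") != 0)
        return -1;
    if (strlen(inst) >= name_len)
        return -1;
    strcpy(name, inst);
    return 0;
}
```

The names collected this way are the values the survey script expects in its targets variable.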
512. ult stripe size):
    lfs setstripe <new_filename> 0 -1 N
You may want to make a simple wrapper script that only accepts the <stripe_count> parameter. (Usage info via lfs help setstripe.)

How do I set striping for a large number of files at one time?
You can set a default striping on a directory, and then any regular files created within that directory inherit the default striping configuration. To do this, first create a directory if necessary, and then set the default striping in the same manner as you do for a regular file:
    lfs setstripe <directory> <stripe_size> -1 <stripe_count>
If the stripe_size value is zero (0), it uses the system-wide stripe size. If the stripe_count value is zero (0), it uses the default stripe count. If the stripe_count value is -1, it stripes across all available OSTs. The best performance for many clients writing to individual files is at 1 or 2 stripes per file, and maximum stripes for large shared I/O files (i.e., many clients reading or writing the same file at one time).

If I set the striping of N and B for a directory, do files in that directory inherit the striping or revert to the default?
All new files get the new striping parameters, and existing files keep their current striping (even if overwritten). To undo the default striping on a directory (to use system-wide defaults again), set the striping to 0 -1 0.
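The special values described above can be resolved as in the following sketch. It is not Lustre code: the function names are hypothetical, and it takes 0 as "use the default" and -1 as "stripe across all available OSTs".

```c
/*
 * Hypothetical helpers that resolve requested striping values to the
 * values that would take effect.
 */
long effective_stripe_size(long requested, long system_default)
{
    /* 0 selects the system-wide stripe size. */
    return requested == 0 ? system_default : requested;
}

int effective_stripe_count(int requested, int default_count, int num_osts)
{
    if (requested == 0)
        return default_count;   /* 0: default stripe count */
    if (requested == -1)
        return num_osts;        /* -1: all available OSTs */
    return requested;
}
```

For example, with a 4 MB system default, a default count of 1 and 8 OSTs, a request of (0, -1) resolves to a 4 MB stripe size across 8 OSTs.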
513. ultaneous availability of multiple network types with routing between them Management Server MGS The MGS stores configuration information for all Lustre file systems in a cluster Each Lustre target contacts the MGS to provide information and Lustre clients contact the MGS to retrieve information The MGS requires its own disk for storage However there is a provision that allows the MGS to share a disk co locate with a single MDT The MGS is not considered part of an individual file system it provides configuration information for all managed Lustre file systems to other Lustre components Chapter 1 Introduction to Lustre 1 7 1 3 1 8 Lustre Systems Lustre components work together as coordinated systems to manage file and directory operations in the file system see FIGURE 1 2 FIGURE 1 2 Lustre system interaction in a file system File open Directory Operations file operi close metadata and concurrency Recovery file status and file creation File 1 0 and file locking The characteristics of the Lustre system include Typical number of systems Performance Clients 1 100 000 1 GB sec I O 1 000 metadata ops sec OSS 1 1 000 500 2 5 GB sec MDS 2 3 000 15 000 2 100 in future metadata ops sec Required attached storage None File system capacity OSS count 1 2 of file system capacity Desirable hardware characteristics None Good bus bandwidth Adequate CPU power
514. ults should match lfs getstripe printf Confirming our results with lfs getsrtipe n sprintf sys_cmd usr bin lfs getstripe s s MY_LUSTRE_DIR TESTFILE system sys_cmd printf All done n exit rc Makefile for sample application gcc g 02 Wall o lustredemo libtest c llustreapi clean rm f core lustredemo o run make rm f mnt lustre ftest lustredemo rm f mnt lustre ftest lustre_dummy cp lustredemo mnt lustre ftest 24 22 Lustre 1 8 Operations Manual October 2009 CHAPTER 25 Lustre Security This chapter describes Lustre security and includes the following sections a Using ACLs m Using Root Squash 25 1 25 1 1 Using ACLs An access control list ACL is a set of data that informs an operating system about permissions or access rights that each user or group has to specific system objects such as directories or files Each object has a unique security attribute that identifies users who have access to it The ACL lists each object and user access privileges such as read write or execute How ACLs Work Implementing ACLs varies between operating systems Systems that support the Portable Operating System Interface POSIX family of standards share a simple yet powerful file system permission model which should be well known to the Linux Unix administrator ACLs add finer grained permissions to this model allowing for more complicated permission schemes For a det
515. umber of other variables: size of the file system, number of allocated blocks, distribution of allocated blocks on the disk, disk speed, CPU speed, and amount of RAM on the server. Reasonable e2fsck runtimes (without serious file system problems) are expected to be five minutes to two hours.
Presently, Lustre has 1 inode per 16 KB of space in the OST file system (by default). In many environments, this is far too many inodes for the average file size. As a general guideline, the OSTs should have at least a number of inodes indicated by this formula:
    num_ost_inodes = 4 * <num_mds_inodes> * <default_stripe_count> / <number_osts>
To specify the number of inodes on OST file systems, use the -N <num_inodes> option to --mkfsoptions. Alternately, if you know the average file size, you can also specify the OST inode count for the OST file systems using:
    -i <average_file_size / (number_of_stripes * 4)>
For example, if the average file size is 16 MB and there are, by default, 4 stripes per file, then --mkfsoptions='-i 1048576' would be appropriate.
For more details on formatting MDT and OST file systems, see Formatting Options for RAID Devices.

20.4 Network Tuning
During IOR runs, especially reads, one or more nodes may become CPU-bound (which may slow down the remaining nodes and compromise read rates). This issue is likely related to RX overflow errors on the nodes caused by an upstream e1000 driver. To resolv
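The two sizing rules above reduce to simple integer arithmetic. The sketch below assumes the formula reads 4 * num_mds_inodes * default_stripe_count / number_osts and that the bytes-per-inode value is average_file_size / (number_of_stripes * 4); the function names are illustrative.

```c
#include <stdint.h>

/* Guideline minimum number of inodes per OST (assumed reading of the formula). */
uint64_t ost_inodes_needed(uint64_t num_mds_inodes,
                           uint64_t default_stripe_count,
                           uint64_t number_osts)
{
    if (number_osts == 0)
        return 0;
    return 4 * num_mds_inodes * default_stripe_count / number_osts;
}

/* Bytes-per-inode value for the -i option, from the average file size. */
uint64_t bytes_per_inode(uint64_t average_file_size, uint64_t number_of_stripes)
{
    if (number_of_stripes == 0)
        return 0;
    return average_file_size / (number_of_stripes * 4);
}
```

With a 16 MB average file size and 4 stripes, bytes_per_inode returns 1048576, which matches the worked example in the text.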
516. un bulk_rw
    # display server stats for 30 seconds
    lst stat servers & sleep 30; kill $!
    # tear down
    lst end_session

31.5.12 plot-llstat
The plot-llstat utility plots Lustre statistics.

Synopsis
    plot-llstat results_filename [parameter_index]

Options
Option              Description
results_filename    Output generated by plot-llstat
parameter_index     Value of parameter_index can be:
                    1 - count per interval
                    2 - count per second (default setting)
                    3 - total count

Description
The plot-llstat utility generates a CSV file and instruction files for gnuplot from llstat output. Since llstat is generic in nature, plot-llstat is also a generic script. The value of parameter_index can be 1 for count per interval, 2 for count per second (default setting) or 3 for total count.
The plot-llstat utility creates a .dat (CSV) file using the number of operations specified by the user. The number of operations equals the number of columns in the CSV file. The values in those columns are equal to the corresponding value of parameter_index in the output file.
The plot-llstat utility also creates a .scr file that contains instructions for gnuplot to plot the graph. After generating the .dat and .scr files, the plot-llstat tool invokes gnuplot to display the graph.

Example
    llstat -i2 -g -c lustre-OST0000 > log
    plot-llstat log 3
517. upply ext3 mkfs options when we create the OST like j J and so on in the following manner where dev sdj has been formatted before as a journal The journal size should not be larger than 1 GB 262144 4 KB blocks as it can consume up to this amount of RAM on the OSS node per OST mke2fs O journal dev b 4096 dev sdj optional size Chapter 20 Lustre Tuning 20 11 Tip A very important tip on the S2A DDN 8500 storage array you need to create one OST per TIER especially in write through see output below This is of concern if you have 16 tiers Create 16 OSTs consisting of one tier each instead of eight made of two tiers each Performance is significantly better on the S2A DDN 9500 and 9550 storage arrays with two tiers per LUN Do NOT partition the DDN LUNs as this causes all I O to the LUNs to be misaligned by 512 bytes The DDN RAID stripes and cachelines are aligned on 1 MB boundaries Having the partition table on the LUN causes all 1 MB writes to do a read modify write on an extra chunk and ALL 1 MB reads to instead read 2 MB from disk into the cache causing a noticeable performance loss You are not obliged to lock in cache the small LUNs Configure the MDT on a separate volume that is configured as RAID 1 0 This reduces the MDT I O and doubles the seek speed For example one OST per tier LUNLabel Owner Status Capacity Block Tiers Tier list Mbytes Size 0 1 Ready 102400 512 1 T 1
518. uring a write or sync operation, the file in question resides on an OST which is already full. New files that are created do not use full OSTs, but existing files continue to use the same OST. You need to expand the specific OST or copy/stripe the file over to an OST with more space available. You encounter this situation occasionally when creating files, which may indicate that your MDS has run out of inodes and needs to be enlarged. To check this, use df -i.
You may also receive this error if the MDS runs out of free blocks. Since the output of df is an aggregate of the data from the MDS and all of the OSTs, it may not show that the file system is full when one of the OSTs has run out of space. To determine which OST or MDS is running out of space, check the free space and inodes on a client:
    grep '[0-9]' /proc/fs/lustre/osc/*/kbytes{free,avail,total}
    grep '[0-9]' /proc/fs/lustre/osc/*/files{free,total}
    grep '[0-9]' /proc/fs/lustre/mdc/*/kbytes{free,avail,total}
    grep '[0-9]' /proc/fs/lustre/mdc/*/files{free,total}
You can find other numeric error codes in /usr/include/asm/errno.h along with their short name and text description.

21.4.14 Triggering Watchdog for PID NNN
In some cases, a server node triggers a watchdog timer and this causes a process stack to be dumped to the console along with a Lustre kernel debug log being dumped into /tmp (by default). The presence of a watchdog ti
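The point about aggregate df output hiding a single full OST can be illustrated with a short sketch. The code below is not a Lustre utility: the function names are hypothetical, and the input array stands for per-target kbytesavail values that an administrator has already read from the proc files on a client.

```c
#include <stddef.h>

/* Sum of available space over all targets, as an aggregate df would show. */
unsigned long long total_avail(const unsigned long long *kbytes_avail, size_t n)
{
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += kbytes_avail[i];
    return sum;
}

/* Index of the target with the least available space, or -1 if n is 0. */
long fullest_target(const unsigned long long *kbytes_avail, size_t n)
{
    if (n == 0)
        return -1;
    size_t min_idx = 0;
    for (size_t i = 1; i < n; i++) {
        if (kbytes_avail[i] < kbytes_avail[min_idx])
            min_idx = i;
    }
    return (long)min_idx;
}
```

With values such as {500000, 0, 750000}, the aggregate still shows 1250000 KB available even though the target at index 1 is full, which is why the per-target values need to be checked.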
519. us errors. All of these macros depend on having the DEBUG_SUBSYSTEM variable set at the top of the file:
    #define DEBUG_SUBSYSTEM S_PORTALS

Macro       Description
LBUG        A panic-style assertion in the kernel which causes Lustre to dump its circular log to the /tmp/lustre-log file. This file can be retrieved after a reboot. LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.
LASSERT     Validates a given expression as true, otherwise calls LBUG. The failed expression is printed on the console, although the values that make up the expression are not printed.
LASSERTF    Similar to LASSERT but allows a free-format message to be printed, like printf/printk.

Macros in the continued table: CDEBUG, CERROR, ENTRY and EXIT, LDLM_DEBUG and LDLM_DEBUG_NOLOCK, DEBUG_REQ, OBD_FAIL_CHECK, OBD_FAIL_TIMEOUT, OBD_RACE, OBD_FAIL_ONCE

Macro       Description
CDEBUG      The basic, most commonly used debug macro that takes just one more argument than standard printf: the debug type. This message adds to the debug log with the debug mask set accordingly. Later, when a user retrieves the log for troubleshooting, they can filter based on this type.
                CDEBUG(D_INFO, "This is my debug message: the number is %d\n", number);
CERROR      Behaves similarly to CDEBUG, but unconditionally prints the message in the debug log and to the console. This is appropriate for serious errors or fatal conditions
520. used to present a collection of OSTs as a single device to the MDT and client file system drivers A set of configuration directives which describes which nodes are OSS systems in the Lustre cluster providing names for their OSTs The name of the project chosen by Peter Braam in 1999 for an object based storage architecture Now the name is commonly associated with the Lustre file system An operating instance with a mounted Lustre file system A file in the Lustre file system The implementation of a Lustre file is through an inode on a metadata server which contains references to a storage object on OSSs A preliminary version of Lustre developed for LLNL in 2002 With the release of Lustre 1 0 in late 2003 Lustre Lite became obsolete A library that provides an interface between Lustre OSD and MDD drivers and file systems this avoids introducing file system specific abstractions into the OSD and MDD drivers Multi Block Allocate Lustre functionality that enables the ext3 file system to allocate multiple blocks with a single request to the block allocator Normally an ext3 file system only allocates only one block per request Glossary 5 MDC MDD MDS MDT Metadata Write back Cache MGS Mountconf N NAL NID NIO API O OBD OBD API OBD type Glossary 6 MetaData Client Lustre client component that sends metadata requests via RPC over LNET to the Metadata Target MDT MetaData Disk Dev
521. Can I change the striping of a file or directory after it is created?
You cannot change the striping of a file after it is created. If this is important (e.g., performance of reads on some widely-shared large input file), you need to create a new file with the desired striping and copy the data from the old file into it. It is possible to change the default striping on a directory at any time, although you must have write permission on this directory to change the striping parameters.

How do I replace an OST or MDS?
The OST file system is simply a normal ext3 file system, so you can use any number of methods to copy the contents to the new OST. If possible, connect both the old OST disk and new OST disk to a single machine, mount them, and then use rsync to copy all of the data between the OST file systems. For example:
    mount -t ldiskfs /dev/old /mnt/ost_old
    mount -t ldiskfs /dev/new /mnt/ost_new
    rsync -aSv /mnt/ost_old/ /mnt/ost_new
    # note trailing slash on ost_old/
If you are unable to connect both sets of disk to the same computer, use rsync to copy over the network using rsh (or ssh with -e ssh):
    rsync -aSvz /mnt/ost_old/ new_ost_node:/mnt/ost_new
The same can be done for the MDS, but it needs an additional step:
    cd /mnt/mds_old; getfattr -R -e base64 -d . > /tmp/mdsea
    <copy all MDS files as above>
    cd /mnt/mds_new; setfattr --restore=/tmp/mdsea
522. ustre OST0000 osc ffff81012b2c48e0 checksum type cat proc fs lustre osc lustre OST0000 osc ffff81012b2c48e0 checksum type crc32 adler Chapter 24 Striping and I O Options 24 17 24 8 24 18 Striping Using Ilapi Use llapi_file_create to set Lustre properties for a new file For a synopsis and description of 1lapi_file_ create and examples of how to use it see Setting Lustre Properties man3 You can set striping from inside programs like ioctl To compile the sample program you need to download libtest c and liblustreapi c files from the Lustre source tree A simple C program to demonstrate striping API libtest c mode c c basic offset 8 indent tabs mode nil vim expandtab shiftwidth 8 tabstop 8 lustredemo simple code examples of liblustreapi functions include lt stdio h gt include lt fcntl h gt include lt sys stat h gt include lt sys types h gt include lt dirent h gt include lt errno h gt include lt string h gt include lt unistd h gt include lt stdlib h gt include lt lustre liblustreapi h gt include lt lustre lustre_user h gt define MAX _OSTS 1024 define LOV_EA_SIZE lum num sizeof lum num sizeof lum gt lmm_objects define LOV_EA_MAX lum LOV_EA_SIZE lum MAX_OSTS This program provides crude examples of using the liblustre API functions 7 Change these definitions to suit
523. uters live_router_check_interval dead_router_check_interval auto_down check_routers_before_use and router_ping_timeout In a routed Lustre setup with nodes on different networks such as TCP IP and Elan the router checker checks the status of a router The auto_down parameter enables disables 1 0 the automatic marking of router state The live_router_check_interval parameter specifies a time interval in seconds after which the router checker will ping the live routers In the same way you can set the dead_router_check_interval parameter for checking dead routers You can set the timeout for the router checker to check the live or dead routers by setting the router_ping_timeout parameter The Router pinger sends a ping message to a dead live router once every dead live_router_check_interval seconds and if it does not get a reply message from the router within router_ping_timeout seconds it considers the router to be down The last parameter is check_routers_before_use which is off by default If it is turned on you must also give dead_router_check_interval a positive integer value The router checker gets the following variables for each router m Last time that it was disabled m Duration of time for which it is disabled The initial time to disable a router should be one minute enough to plug in a cable after removing it If the router is administratively marked as up then the router checker clears the timeout When
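The timeout rule and the parameter dependency described above can be summarized in a small sketch. This is not LNET source code: the struct and function names are hypothetical, times are plain seconds, and a reply timestamp earlier than the ping timestamp is taken to mean that no reply has arrived yet.

```c
#include <stdbool.h>

/* Hypothetical container for the module parameters named in the text. */
struct router_check_params {
    int live_router_check_interval;  /* seconds between pings to live routers */
    int dead_router_check_interval;  /* seconds between pings to dead routers */
    int router_ping_timeout;         /* seconds to wait for a reply */
    bool check_routers_before_use;   /* off by default */
};

/* check_routers_before_use requires a positive dead_router_check_interval. */
bool params_consistent(const struct router_check_params *p)
{
    if (p->check_routers_before_use && p->dead_router_check_interval <= 0)
        return false;
    return true;
}

/* A router is considered down when no reply arrived within the timeout. */
bool router_considered_down(long ping_sent_at, long reply_at, long now,
                            int router_ping_timeout)
{
    if (reply_at >= ping_sent_at)
        return false;                 /* a reply to this ping was received */
    return now - ping_sent_at >= router_ping_timeout;
}
```

For instance, a ping sent at t=100 with a 50-second timeout and no reply by t=150 marks the router as down, while a reply at t=120 keeps it up.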
524. utput Permanent disk data Target temp OSTffff Index unassigned Lustre FS temp Mount type ldiskfs Flags 0x72 OST needs_index first_time update Persistent mount opts errors remount ro extents mballoc Parameters mgsnode 10 2 0 1 tcp Chapter 4 Configuring Lustre 4 7 4 8 checking for existing Lustre data not found device size 16MB 2 6 18 formatting backing filesystem ldiskfs on dev sdd target name temp OST fff 4k blocks 0 options I 256 q O dir_index uninit_groups F mkfs_cmd mkfs ext2 j b 4096 L temp OSTffff I 256 q O dir_index uninit_groups F dev sdc Writing CONFIGS mountdata Mount the OSTs Mount each OST ost1 and ost2 on the OSS where the OST was created a Mount ost1 On oss1 node run root oss1i mount t lustre dev sdc mnt ostl The command generates this output LDISKFS fs file extents enabled LDISKFS fs mballoc enabled Lustre temp OST0000 new disk initializing Lustre Server temp OST0000 on device dev sdb has started Shortly afterwards this output appears Lustre temp OST0000 received MDS connection from 10 2 0 1 tcp0 Lustre MDS temp MDT0000 temp OSTO0000_ UUID now active resetting orphans b Mount ost2 On oss2 node run root oss2 mount t lustre dev sdd mnt ost2 The command generates this output LDISKFS fs file extents enabled LDISKFS fs mballoc enabled Lustre temp OST0000 new disk initializing Lustre Server temp OST0000
525. very because doing a read only file system check Pass 1 Checking inodes blocks and sizes root mds exit Script done file is tmp foo Appendix A Lustre Knowledge Base A 31 A 32 In many cases the extent of corruption is small some unlinked files or directories or perhaps some parts of an inode table have been wiped out If there are serious file system problems e2fsck may need to use a backup superblock reports if it does This causes all of the group summary information to be incorrect In and of itself this is not a serious error as this information is redundant and e2fsck can reconstruct this data If the primary superblock is not valid then there is some corruption at the start of the device and some amount of data may be lost The data is somewhat protected from beginning of device corruption which is one of the more common cases because of the large journal placed at the start of the file system The amount of time taken to run such a check is usually 4 hours for a 1 TB MDS device or a 2 TB OST device but varies with the number of files and the amount of data in the file system If there are severe problems with the file system it can take 8 12 hours to complete the check Depending on the type of corruption it is sometimes helpful to use debugfs to examine the file system directly and learn more about the corruption root mds script root debugfs sda root mds debugfs dev sda debugfs 1 35 lfsk8 05 Feb 2
526. w configured via module parameters Parameters should be specified in the etc modprobe conf file for example alias lustre llite options lnet networks tcp0 elan0 The above option specifies that this node should use all the available TCP and Elan interfaces Module parameters are read when the module is first loaded Type specific LND modules for instance ksockind are loaded automatically by the LNET module when LNET starts typically upon modprobe ptlrpc Under Linux 2 6 LNET configuration parameters can be viewed under sys module generic and acceptor parameters under LNET and LND specific parameters under the name of the corresponding LND Under Linux 2 4 sysfs is not available but the LND specific parameters are accessible via equivalent paths under proc 30 1 Important All old pre v 1 4 6 Lustre configuration lines should be removed from the module configuration files and replaced with the following Make sure that CONFIG_KMOD is set in your linux config so LNET can load the following modules it needs The basic module files are modprobe conf for Linux 2 6 alias lustre llite options lnet networks tcp0 elan0 modules conf for Linux 2 4 alias lustre llite options lnet networks tcp0 elan0 For the following parameters default option settings are shown in parenthesis Changes to parameters marked with a W affect running systems Unmarked parameters can only be set when LNET loads for the first time Chan
527. w files and sub-directories is done per the striping parameter settings of the root directory. Once you set striping on the root directory, then, by default, it applies to any new child directories created in that root directory (unless they have their own striping settings).
Using a Specific Striping Pattern/File Layout for a Single File
To use a specific striping pattern (file layout) for a specific file, lfs setstripe creates a file with a given stripe pattern (file layout). lfs setstripe fails if the file already exists.
24.3.3 Creating a File on a Specific OST
You can use lfs setstripe to create a file on a specific OST. In the following example, the file "bob" will be created on the first OST (id 0).
    $ lfs setstripe --count 1 --index 0 bob
    $ dd if=/dev/zero of=bob count=1 bs=100M
    1+0 records in
    1+0 records out
    $ lfs getstripe bob
    OBDS:
    0: home-OST0000_UUID ACTIVE
    [...]
    bob
    obdidx objid objid group
    0 33459243 0x1fe8c2b 0
24.4 Managing Free Space in Lustre
In Lustre 1.6, the MDT assigns file stripes to OSTs based on location (which OSS) and size considerations (free space) to optimize file system performance. Emptier OSTs are preferentially selected for stripes, and stripes are preferentially spread out between OSSs to increase network bandwidth utilization. The weighting factor between these two optimizatio
528. with changing existing flags:
    sysctl -w lnet.debug=-net
The various options available to print to kernel debug logs are listed in lnet/include/libcfs/libcfs.h.
The lctl Tool
Lustre's source code includes debug messages which are very useful for troubleshooting. As described above, debug messages are subdivided into a number of subsystems and types. This subdivision allows messages to be filtered, so that only messages of interest to the user are displayed. The lctl tool is useful to enable this filtering and manipulate the logs to extract the useful information from them. Use lctl to obtain the necessary debug messages:
1. To obtain a list of all the types and subsystems:
    lctl > debug_list <subs | types>
2. To filter the debug log:
    lctl > filter <subsystem name | debug type>
Note: When lctl filters, it removes unwanted lines from the displayed output. This does not affect the contents of the debug log in the kernel's memory. As a result, you can print the log many times with different filtering levels without worrying about losing data.
3. To show debug messages belonging to a certain subsystem or type:
    lctl > show <subsystem name | debug type>
debug_kernel pulls the data from the kernel logs, filters it appropriately, and displays or saves it as per the specified options:
    lctl > debug_kernel [output filename]
If the debugging is being done on User Mode L
529. wrong for a shared target. Soon, file system naming will be made as fail-safe as possible. Currently, Linux disk labels are limited to 16 characters. To identify the target within the file system, 8 characters are reserved, leaving 8 characters for the file system name:
    <fsname>-MDT0000 or <fsname>-OST0a19
To mount by label, use this command:
    $ mount -t lustre -L <file system label> <mount point>
This is an example of mount-by-label:
    $ mount -t lustre -L testfs-MDT0000 /mnt/mdt
Caution: Mount-by-label should NOT be used in a multi-path environment.
Although the file system name is internally limited to 8 characters, you can mount the clients at any mount point, so file system users are not subjected to short names. Here is an example:
    mount -t lustre uml1@tcp0:/shortfs /mnt/<long-file_system-name>
Mounting a Server
Starting a Lustre server is straightforward and only involves the mount command. Lustre servers can be added to /etc/fstab:
    mount -t lustre
The mount command generates output similar to this:
    /dev/sda1 on /mnt/test/mdt type lustre (rw)
    /dev/sda2 on /mnt/test/ost0 type lustre (rw)
    192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)
In this example, the MDT, an OST (ost0) and file system (testfs) are mounted.
    LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
    LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0
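The 16-character label budget described above (8 characters for the file system name, 8 for the target identifier) can be checked with a short sketch. The function name target_label is hypothetical; the layout it assumes (a dash, MDT or OST, then a four-digit lowercase hexadecimal index) is inferred from the <fsname>-MDT0000 and <fsname>-OST0a19 examples, not taken from Lustre source code.

```python
MAX_LABEL_LEN = 16   # Linux disk label limit quoted in the text
MAX_FSNAME_LEN = 8   # characters left for the file system name


def target_label(fsname, kind, index):
    """Build a label such as 'testfs-MDT0000' and enforce the limits.

    Hypothetical helper: 'kind' is 'MDT' or 'OST', 'index' is the
    target index, rendered as four lowercase hex digits.
    """
    if not 1 <= len(fsname) <= MAX_FSNAME_LEN:
        raise ValueError("fsname must be 1-8 characters: %r" % fsname)
    if kind not in ("MDT", "OST"):
        raise ValueError("kind must be 'MDT' or 'OST'")
    if not 0 <= index <= 0xFFFF:
        raise ValueError("index must fit in four hex digits")
    label = "%s-%s%04x" % (fsname, kind, index)
    # 8 (fsname) + 1 (dash) + 3 (kind) + 4 (index) = 16 at most
    assert len(label) <= MAX_LABEL_LEN
    return label


if __name__ == "__main__":
    print(target_label("testfs", "MDT", 0))
    print(target_label("testfs", "OST", 0x0A19))
```

An 8-character name such as shortfs1 produces a label of exactly 16 characters, which is why longer file system names cannot be accommodated.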
530. x inetd
    killall -HUP xinetd 1>/dev/null 2>&1
If this is a new installation, format the OSTs, MDTs, MGSs and Lustre clients. Mount the OSTs, MDTs, MGSs and Lustre clients, and verify that the Lustre file system is running normally.
(Footnote: This is the current service tag package. The version number is subject to change.)
5.2.2 Discovering and Registering Lustre Components
After installing the service tag package on all of your Lustre nodes, discover and register the Lustre components. To perform this procedure, Lustre must be fully configured and running.
1. Navigate to the Sun Lustre download page and download the Registration client, eis-regclient.jar.
2. Install the Registration client on one node (the collection node) that can reach all Lustre clients and servers over a TCP/IP network.
3. Install Java Virtual Machine (Java VM) on the collection node. Java VM is available at the Sun Java download site.
4. Start the Registration client. Run:
    java -jar eis-regclient.jar
The Registration Client utility launches.
[FIGURE 5-1: Registration Client. Screenshot of the Sun Microsystems Product Registration window.]
531. xt3. Currently, testing is underway to allow file systems up to 16 TB. You can have multiple OST file systems on a single node. Currently, the largest production Lustre file system has 448 OSTs in a single file system. There is a compile-time limit of 8150 OSTs in a single file system, giving a theoretical file system limit of nearly 64 PB. Several production Lustre file systems have around 200 OSTs in a single file system. The largest file system in production is at least 1.3 PB (184 OSTs). All these facts indicate that Lustre would scale just fine if more hardware is made available.
32.7 Maximum File Size
Individual files have a hard limit of nearly 16 TB on 32-bit systems, imposed by the kernel memory subsystem. On 64-bit systems, this limit does not exist, so file sizes can be 64-bit values. Lustre imposes an additional size limit equal to the number of stripes multiplied by the per-stripe limit of 2 TB. A single file can have a maximum of 160 stripes, which gives an upper single-file limit of 320 TB for 64-bit systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.
32.8 Maximum Number of Files or Subdirectories in a Single Directory
Lustre uses the ext3 hashed directory code, which has a limit of about 25 million files. On reaching this limit, the directory grows to more than 2 GB, depending on the length of the filenames. The limit on subdirectories is the
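The single-file limit quoted in 32.7 follows from simple arithmetic: stripes multiplied by the per-stripe limit. The sketch below works through it using only the figures from the text (2 TB per stripe, at most 160 stripes); the function name max_file_size_tb is hypothetical, and free space on the OSTs is deliberately not modeled.

```python
STRIPE_LIMIT_TB = 2   # per-stripe size limit from the text
MAX_STRIPES = 160     # maximum stripes for a single file


def max_file_size_tb(stripe_count):
    """Upper bound on file size, in TB, for a given stripe count.

    Hypothetical helper; it ignores free space on each OST, which
    the text notes determines how much data can really be stored.
    """
    if not 1 <= stripe_count <= MAX_STRIPES:
        raise ValueError("stripe count must be between 1 and 160")
    return stripe_count * STRIPE_LIMIT_TB


if __name__ == "__main__":
    for count in (1, 2, MAX_STRIPES):
        print(count, "stripes ->", max_file_size_tb(count), "TB")
```

A file striped over a single OST is therefore capped at 2 TB, and only a file striped across all 160 OSTs reaches the 320 TB ceiling.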
532. y. If you have a problem mounting the file system, check the syslogs for errors.
Tip: Now that you have configured Lustre, you can collect and register your service tags. For more information, see Service Tags.
4.1.0.1 Simple Lustre Configuration Example
To see the steps in a simple Lustre configuration, follow this worked example in which a combined MGS/MDT and two OSTs are created. Three block devices are used, one for the combined MGS/MDS node and one for each OSS node. Common parameters used in the example are listed below, along with individual node parameters.
Common Parameters | Value | Description
    MGS node | 10.2.0.1@tcp0 | Node for the combined MGS/MDS
    file system | temp | Name of the Lustre file system
    network type | TCP/IP | Network type used for Lustre file system temp
Node Parameters | Value | Description
  MGS/MDS node
    MGS/MDS node | mdt1 | MDS in Lustre file system temp
    block device | /dev/sdb | Block device for the combined MGS/MDS node
    mount point | /mnt/mdt | Mount point for the mdt1 block device (/dev/sdb) on the MGS/MDS node
  First OSS node
    OSS node | oss1 | First OSS node in Lustre file system temp
    OST | ost1 | First OST in Lustre file system temp
    block device | /dev/sdc | Block device for the first OSS node (oss1)
    mount point | /mnt/ost1 | Mount point for the ost1 block device (/dev/sdc) on the oss1 node
  Second OSS node
    OSS node | oss2
    OST | ost2
