HowTo Build a Diskless Debian Cluster

...with the following content (do not forget to postmap it):

    external_domain    smtp:external_mail_host.external_fqdn
    cluster            error:relay to other domains forbidden

3. Create the file /etc/postfix/generic with the following content (do not forget to postmap it):

    nas1@domain    nas1@external_domain
    nas2@domain    nas2@external_domain

Restart postfix. Emails from nas1 will now be relayed to external_domain. The external_mail_host only accepts mail from resolvable hostnames, so we have to rewrite the From addresses; the rewrite rules live in /etc/postfix/generic, the relay routing in /etc/postfix/transport.

In order to check the configuration, try to send some mails: from inside to the master node, to some external mail address, and so on. Check /var/log/mail.log and inspect the mail queue with the mailq command. You can delete queued mails with

    postsuper -d ALL

GlusterFS

Our Problem: we need more disk space, but we have no money for another NAS.

Possible Solution: put a disk in each node and combine them all into one big volume using GlusterFS. Unfortunately, we need to store information for the gluster daemon per node.

Prepare Semi-stateful nodes

Yes, this is hilarious: we go through hoops to create a diskless, stateless cluster, only to go back and put disks in it. We create a script which we will run after the system has booted, via /etc/rc.local.
This means the card does not get the maximum speed (Speed 5GT/s, Width x8) but only half of it (Speed 2.5GT/s, Width x8): this mainboard is only capable of PCIe 1.0 (2.5 GTransfers/s), so the card will not reach its theoretical bandwidth of 20 Gbit/s but only half of that. The data rate is given per direction, which is still impressive. For our use case the latency is more important anyway.

5.1.2 Necessary Software

We need to install drivers and several network test packages:

    aptitude install perftest ibutils libmlx4-1 opensm

In order to make our InfiniBand cards able to talk to each other there needs to be a so-called subnet manager (SM). One such SM is included in the package opensm.

We can load the driver for our IB cards with modprobe mlx4_core if it is not loaded yet. Confirm successful detection of the card with dmesg | grep Mellanox. The card's status can be viewed with the ibstatus command:

    root@testserver:~# ibstatus
    Infiniband device 'mlx4_0' port 1 status:
        default gid:   fe80:0000:0000:0000:0002:c903:000d:b717
        base lid:      0x2
        sm lid:        0x1
        state:         4: ACTIVE
        phys state:    5: LinkUp
        rate:          20 Gb/sec (4X DDR)

Upon installing opensm the card's state should change to ACTIVE. If you need the port GUID of your card for opensm, you can use the command ibstat -p and add the port in /etc/default/opensm:

    root@test-master-node:~# ibstat -p
    0x0002c903000db717

Now we need to load the IP-over-InfiniBand kernel modules; they are not loaded automatically.
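A small sanity check can save a node reboot later. The following sketch (the init script name and exact ibstatus output format are assumptions and may differ on your system) loads the driver and waits until the port reports ACTIVE:

    #!/bin/sh
    # minimal sketch: bring up the HCA and wait for the subnet manager
    modprobe mlx4_core                       # load the driver if it is not loaded yet
    dmesg | grep -i mellanox | tail -n 3     # confirm the card was detected
    /etc/init.d/opensm start                 # an SM has to run somewhere on the fabric
    # poll until the port state becomes ACTIVE
    until ibstatus | grep -q ACTIVE; do
        echo "waiting for the IB port to become ACTIVE ..."
        sleep 2
    done
    ibstatus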
As all of the cluster is in our private subnet, there has to be a gateway for mail. Using the master node is obviously the first choice. Another point is that we want to be able to send these mails to any address, which means we need to relay outbound mail to our institute's mail server.

Here is a short abstract of what we want our mail server to be able to do:

- receive mails from anybody on our local private subnet
    ------------------------------------------------------------
    Client connecting to 10.0.0.102, TCP port 5001
    TCP window size: 23.5 KByte (default)
    ------------------------------------------------------------
    [  5] local 10.0.0.161 port 41893 connected with 10.0.0.102 port 5001
    [  3] local 10.0.0.161 port 41890 connected with 10.0.0.102 port 5001
    [  4] local 10.0.0.161 port 41891 connected with 10.0.0.102 port 5001
    [  6] local 10.0.0.161 port 41892 connected with 10.0.0.102 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  5] 0.0-10.0 sec   188 MBytes   157 Mbits/sec
    [  3] 0.0-10.0 sec   203 MBytes   170 Mbits/sec
    [  4] 0.0-10.0 sec   174 MBytes   146 Mbits/sec
    [  6] 0.0-10.3 sec   200 MBytes   164 Mbits/sec
    [SUM] 0.0-10.3 sec   766 MBytes   627 Mbits/sec

It is usually around 1000 Mbit/s; the reason for the discrepancy is maybe that the NAS was not idle during this test.

Here are the results for ib_write_bw and ib_write_lat:

    root@nas-2:~# ib_write_bw 192.168.1.1
    ------------------------------------------------------------------
                        RDMA_Write BW Test
    Number of qp's running 1
    Connection type : RC
    Each Qp will post up to 100 messages each time
    Inline data is used up to 1 bytes message
      local address:  LID 0x01, QPN 0x004b, PSN 0x2984f8, RKey 0x10001b00, VAddr 0x007fbee52f7000
      remote address: LID 0x02, QPN 0x2a004b, PSN 0x97f807, RKey 0xa6001b00, VAddr 0x007f5d2ba1f000
    Mtu : 2048
    ------------------------------------------------------------------
     #bytes   #iterations   BW peak [MB/sec]   BW average [MB/sec]
      65536      5000           1497.12             1497.03
    ------------------------------------------------------------------

    root@nas-2:~# ib_write_lat 192.168.1.1
    ------------------------------------------------------------------
                        RDMA_Write Latency Test
    Inline data is used up to 400 bytes message
    Connection type : RC
[Figure 1: Ganglia cluster overview (screenshot of the ganglia web frontend on dozor.fkp.physik.tu-darmstadt.de; 20 hosts up, 416 CPUs)]

- send and relay emails to any outbound addresses
- do this only for mails originating from our local subnet, obviously

Following are the important parts of our configuration in the postfix configuration file /etc/postfix/main.cf:

    myorigin = /etc/mailname
    myhostname = master-node.cluster
    mydestination = dozor.fkp.physik.tu-darmstadt.de, master-node, $myhostname,
        localhost.cluster.agvogel, localhost, localhost.localdomain
    relayhost = relayhost_fqdn
    mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128 10.0.0.0/24

Short explanation:

- myorigin specifies the domain that appears in mail that is posted on this machine.
- mydestination lists the domains this machine will deliver locally. Do not forget to accept mail to localhost.
- relayhost will handle our non-local emails.
- mynetworks: forward mail from clients in mynetworks to any destination.

A very detailed description of this configuration can be found on the postfix web site.
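To verify that postfix actually picked up these settings, something along the following lines can be used (a sketch; the test address is only a placeholder, and the mail command needs bsd-mailx or a similar package):

    postconf -n                                             # show settings that differ from the defaults
    postfix check                                           # sanity-check configuration and permissions
    echo test | mail -s "relay test" someone@example.org    # try the relay path
    mailq                                                   # anything stuck here points to a relay problem
    tail -f /var/log/mail.log                               # watch the delivery attempt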
That means high bandwidth as well as low latency. InfiniBand is optimal for these cases, especially for low latency. The fastest way to access NFS via InfiniBand is to use RDMA (Remote Direct Memory Access). In our case I decided, for simplicity, to use IP over InfiniBand and regular NFS over TCP/IP. A short instruction on how to actually make our InfiniBand work follows. Background information about InfiniBand can be found in the Wikipedia article and the references therein; a very good HowTo is offered by inqbus.

5.1.1 Physical Installation

The cards have a PCIe 2.0 x8 interface. Open the computer and select a slot where they fit. In our NAS the lowest port is only PCIe 2.0 x4 and would slow down the card unnecessarily. You can check with lspci for the installed card:

    root@test-master-node:~# lspci
    04:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0 5GT/s - IB DDR / 10GigE] (rev a0)

The card is installed; now let's look at it more thoroughly with lspci -vv:

    root@test-master-node:~# lspci -s 04:00.0 -vv
    04:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0 5GT/s - IB DDR / 10GigE] (rev a0)
        Subsystem: Mellanox Technologies Device 0001
        Control: I/O Mem BusMaster SpecCycle MemWINV VGASnoop ParErr Stepping SERR FastB2B DisINTx
   (b) 4 dual quad-core AMD Opteron CPUs with 8 GB RAM (one of these is the master node),
   (c) 2 NAS with quad Xeon processors, one with 22 GB RAM and one with 12 GB RAM,
3. shared home directories will be provided by a NAS,
4. shared data directories will be provided by the other NAS.

The master node is connected via 20 Gbit/s InfiniBand (actually only 10 Gbit/s due to old mainboards).

We are, as of now (2013-12), still running our cluster on Debian Lenny and we really need to upgrade to the current stable version, Wheezy. This new setup should work with native Debian packages (with one exception allowed: the ganglia web frontend), and ideally it should still work with the next Debian release, Jessie.

This HowTo intends to explain the basic steps to get this cluster up and computing; it includes a description of setting up the master node as well as how to create the NFS root for the diskless compute nodes. This short document cannot give all the background information that may be needed, but an effort is made to explain the "why" for the critical parts.

Why do we do this manually anyway? Aren't there a number of tools to do that?

- perceus
- warewulf
- kestrelhpc
- pelicanhpc
- oneSIS
- and more

I found that most of the tools either lack documentation or are way too complex for our simple needs. Diskless nodes are sometimes too complicated to set up, or not even possible. Some of the packages have a lot of external dependencies
[Figure: IB performance statistics via the IB-mounted NFS share from a 12-disk RAID6 NAS; iozone throughput vs. record size and file size (65.5 kByte to 68.7 GByte)]

At smaller file sizes the caching of the server is the only limiting factor (CPU cache and RAM speed, which are much faster than disks). According to various sources on the net, one could expect an improvement of about 10% to 20% by using so-called RDMA NFS (Remote Direct Memory Access). We are already quite close to the local throughput, so I do not think it is worth the complication in our case. I found no numbers about the latency differences, but I would expect that RDMA NFS performs better.

5.1.6 BTW

- nmap fails on wheezy due to the InfiniBand card.

5.2 MRTG Setup on master node

Here is a short installation note about MRTG. This little program collects statistics about the traffic on each port of the switch via SNMP. Do not forget to enable SNMP on the switch, or this will not work. This can be done either with the router's web interface or by ssh, if the switch supports it. The following commands enable SNMP access from the master node on an SMC TigerSwitch:

    ssh admin@router
    configure
    management snmp-client master_node_ip
    exit
    copy running-config startup-config

Then install mrtg:

    aptitude install mrtg mrtgutils    # with recommends

    mkdir /var/www/mrtg
    chown www-data:www-data /var/www/mrtg
      local address:  LID 0x01, QPN 0x2004b, PSN 0xe85578, RKey 0x12001b00, VAddr 0x000000023a0002
      remote address: LID 0x02, QPN 0x2c004b, PSN 0x916bac, RKey 0xa8001b00, VAddr 0x00000000962002
    Mtu : 2048
    ------------------------------------------------------------------
     #bytes   #iterations   t_min [usec]   t_max [usec]   t_typical [usec]
      2          1000          1.30           66.57          1.33
    ------------------------------------------------------------------

5.1.4 Tuning

You can change the connection mode from datagram to connected, which allows MTU sizes of up to 65520 bytes instead of 2044 bytes, but drops multicast packets. This settings change is accomplished via the /sys virtual file system:

    echo connected > /sys/class/net/ib0/mode
    ifconfig ib0 mtu 65520

Further information about InfiniBand and tuning can be found in the "Mellanox OFED for Linux User's Manual" (PDF) and the "Performance Tuning Guide for Mellanox Network Adapters" (PDF).

5.1.5 NFS over IP over IB

Using iozone to check the performance of the InfiniBand-mounted NFS share, we get the performance statistics shown in the figures. You can see that once the file size gets bigger than the RAM size (22 GB in this case), the write speed is consistently at around 500 MBytes/s. This corresponds to the local speed on the NAS of about 600 MBytes/s.

[Figure 2: serial write; iozone throughput (400 MByte/s to 1.6 GByte/s) vs. record size and file size]
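For reference, a typical iozone run for this kind of measurement could look like the sketch below. The parameters are examples, not the exact ones used for the figures, and the share is assumed to be mounted on /data:

    aptitude install iozone3
    # sequential write (-i 0) and read (-i 1), record sizes up to 16 MB,
    # file sizes up to 64 GB so that the server's 22 GB RAM cache is exceeded
    iozone -a -i 0 -i 1 -g 64G -q 16M -f /data/iozone.tmp -R -b /root/iozone-ib.xls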
Our basic configuration, the complete dnsmasq configuration file /etc/dnsmasq.conf: this is the file used during the transition from the old master node to the new master node; the outbound NIC is eth1 with the IP 10.0.0.161.

    domain-needed              # never forward plain names without a dot or domain part
    bogus-priv                 # never forward addresses in the non-routed address spaces
    no-resolv                  # do not get nameservers from /etc/resolv.conf
    local=/cluster/            # queries in these domains are answered from /etc/hosts or DHCP only
    server=10.0.0.254@eth1     # send queries to 10.0.0.254 via eth1
    listen-address=192.168.0.254           # listen on this address
    addn-hosts=/etc/hosts_cluster          # read host names from these files
    addn-hosts=/etc/hosts_icluster
    domain=cluster                         # domain for dnsmasq
    dhcp-range=192.168.0.248,192.168.0.253,24h   # we need one range for the DHCP server to work
    dhcp-ignore=tag:!known     # ignore any clients which are not specified in dhcp-host lines
    dhcp-option=42,0.0.0.0     # the NTP time server address: the same machine that runs dnsmasq
    dhcp-option=40,cluster     # set the NIS domain name
    dhcp-boot=pxelinux.0       # set the boot filename for netboot (PXE)
    enable-tftp
    tftp-root=/srv/tftp

Configure the IP <-> host name assignment. This is done in /etc/dnsmasq.d/nodes. We want the nodes' physical position to correspond easily to a specific IP, which means, for example, high IP numbers at the top of the rack. Skip this if you do not want it.
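After editing the file it is worth checking the configuration before rebooting any node; a minimal sketch (node names as assigned further below):

    dnsmasq --test                         # syntax check of dnsmasq.conf and /etc/dnsmasq.d/*
    /etc/init.d/dnsmasq restart
    dig @127.0.0.1 +short node-05.cluster  # should return 192.168.0.5
    dig @127.0.0.1 +short -x 192.168.0.5   # reverse lookup of the node IP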
The following steps are needed before starting the glusterfs server:

1. Mount the glusterfs-relevant directories via NFS.
2. Mount the bricks (create proper mount points).
3. Only then start the glusterfs daemon.

    #!/bin/bash

    echo "Mounting glusterfs related directories"
    mkdir -p /var/lib/glusterd
    mount -t nfs4 nas1:/stateful/$HOSTNAME/var/lib/glusterd /var/lib/glusterd
    mkdir -p /var/log/glusterfs
    mount -t nfs4 nas1:/stateful/$HOSTNAME/var/log/glusterfs /var/log/glusterfs

    XFSDEVS=$(blkid -o device -t TYPE=xfs | sort)
    echo "Found xfs devices: $XFSDEVS"

    BRICK_NUM=1
    for dev in $XFSDEVS; do
        echo "Mounting brick $dev to /gluster/brick$BRICK_NUM"
        mkdir -p /gluster/brick$BRICK_NUM
        mount $dev /gluster/brick$BRICK_NUM
        BRICK_NUM=$(( $BRICK_NUM + 1 ))
    done

    if pgrep gluster; then
        /etc/init.d/glusterfs-server restart
    else
        /etc/init.d/glusterfs-server start
    fi

GlusterFS Volume Configuration

If you have bricks of different sizes, you should set the cluster.min-free-disk quota to a specific value instead of a percentage:

    gluster volume set <volname> cluster.min-free-disk 30GB

Best is to avoid uneven brick sizes, as that is not tested much; furthermore, the algorithm used to spread the files is not meant for that use case.

If you see the following error in the logs:

    0-management: connection attempt failed (Connection refused)

check whether the UUIDs of all hosts match; if not, detach the offending host and probe it again.
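The volume creation itself is not shown above; as a rough sketch (volume name and node names are made up, plain distribute over one brick per node):

    # on one node, after all bricks are mounted and glusterfs-server is running
    gluster peer probe node-04
    gluster peer probe node-05
    gluster peer status

    gluster volume create scratch \
        node-03:/gluster/brick1 \
        node-04:/gluster/brick1 \
        node-05:/gluster/brick1
    gluster volume start scratch

    # mount the volume (any peer can be named as the server)
    mkdir -p /scratch
    mount -t glusterfs node-03:/scratch /scratch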
2.2 Optional: Setup eatmydata for faster debootstrapping

Using eatmydata speeds up the build process quite a bit by ignoring the multiple fsync() etc. calls from dpkg, apt-get and friends. This means that the data is not yet committed to the hard disk; in case of a hard reset this unwritten data could be lost. Eventually the kernel will write the data to disk. This simple two-line patch does the trick (found somewhere on the net; note that the patch is actually reversed):

    --- /usr/share/debootstrap/scripts/wheezy (sid)  2013-02-15 11:03:15.384977238 -0500
    +++ sid.orig                                     2013-02-15 10:50:23.381293976 -0500
    @@ -16,7 +16,7 @@
         esac
     
         work_out_debs () {
    -        required="$(get_debs Priority: required) eatmydata"
    +        required="$(get_debs Priority: required)"
     
             if doing_variant fakechroot; then
                 required="$required $(get_debs Priority: important)"
    @@ -68,7 +68,7 @@
     second_stage_install () {
         setup_devices
    -    export LD_PRELOAD=/usr/lib/libeatmydata/libeatmydata.so
     
         x_core_install () {
             smallyes '' | in_target dpkg --force-depends --install $(debfor ...

2.3 Important Preparations for Successful Network Boot

In order to successfully boot from the network, some configuration details have to be obeyed strictly. When setting this up for the first time it was really frustrating, because every configuration change requires a node reboot. Without IPMI and console redirection this becomes even more tedious. The following setup works and is in actual use.
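The same trick also speeds up later package installations into the finished image. A sketch, assuming eatmydata has been installed inside the image (the package list is only an example):

    chroot /srv/nfsroot apt-get install -y eatmydata
    # preload the library for everything dpkg/apt does inside the chroot
    chroot /srv/nfsroot /usr/bin/eatmydata apt-get install -y nfs-common munge slurm-llnl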
Another similar file in the same directory serves the IP addresses for the IPMI network interfaces.

- One way to get the MAC addresses is to switch a compute node on and check the ARP table with arp. When a new MAC is found, power on the next node, and so on. This is how it is done by the real cluster tools like perceus, kestrel, warewulf, etc.
- Another is to read the MAC during boot-up (tedious).
- Yet another one is to type in the numbers from the delivery documents, if available (e.g. burn-in test protocols).
- If you have an existing cluster, try arp -ni <private_iface>. This will get you a list of currently configured MAC and IP addresses.

Add the node names to /etc/hosts. The dnsmasq daemon will use this file to answer DNS requests. The entries can be created by script, e.g. like this:

    for i in $(seq 1 20); do
        echo "192.168.0.$i linux-$(printf %02i $i).cluster linux-$(printf %02i $i)"
    done

This will create entries along the following pattern, which need to be appended to the hosts file (>> /etc/hosts):

    192.168.0.1 linux-01.cluster linux-01
    192.168.0.2 linux-02.cluster linux-02

Check the DNS resolver configuration. The file /etc/resolv.conf contains, besides the domain and possibly the domain search name, a list of up to three nameservers. Depending on the configuration of the outbound NIC this file could be overwritten by the DHCP client daemon. To make sure that the local DNS server (dnsmasq) is asked first, one should modify the DHCP client configuration.
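Once the MAC addresses are collected, the dhcp-host lines for dnsmasq can be generated the same way; a sketch, assuming a hypothetical file macs.txt with one "MAC nodenumber" pair per line:

    # macs.txt, e.g.:  00:25:90:13:c3:96 5
    while read mac num; do
        printf 'dhcp-host=%s,node-%02d,192.168.0.%d,12h\n' "$mac" "$num" "$num"
    done < macs.txt >> /etc/dnsmasq.d/nodes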
    LENGTH:           5

    test

A common pitfall is to simply copy the key from the master to the node's nfsroot: the UID and GID of the munge daemon on the master and on the nodes do not necessarily match. Make sure that the owner of the file in the nfsroot is indeed the UID/GID of the nfsroot's munge user. You can check the UID/GID and set the owner and mode with:

    chroot /srv/nfsroot id munge
    chroot /srv/nfsroot chown munge:munge /etc/munge/munge.key
    chroot /srv/nfsroot chmod 600 /etc/munge/munge.key

2.5.2 SLURM

There is a web configuration script in /usr/share/doc/slurm-llnl, but it seems to be slightly outdated: gang scheduling is no longer a separate module, it is now built in. In order for gang scheduling to work one must set the PreemptMode appropriately, see below for an example (slurm.conf):

    PreemptMode=GANG

Following are the currently active settings:

    root@test-master-node:~# cat /etc/slurm-llnl/slurm.conf | grep -v "^#" | grep -v "^$"
    ControlMachine=test-master-node
    ControlAddr=192.168.0.254
    AuthType=auth/munge
    CacheGroups=0
    CryptoType=crypto/munge
    JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
    MpiDefault=none
    ProctrackType=proctrack/pgid
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
    SlurmdPort=6818
Markus Rosenstihl

Contents

1 HowTo Build a Diskless Debian Cluster
  1.1 Objective
  1.2 Basic Master Node Setup
      1.2.1 SSH Setup
      1.2.2 DNSmasq
      1.2.3 NIS
      1.2.4 RSYSLOG
      1.2.5 Ganglia
      1.2.6 Ganglia Web
      1.2.7 NFS Exports
      1.2.8 Postfix Mailserver
2 The Live Image for the Compute Nodes
  2.1 Some general remarks
  2.2 Optional: Setup eatmydata for faster debootstrapping
  2.3 Important Preparations for Successful Network Boot
      2.3.1 Configure PXE boot parameters
      2.3.2 Initramfs configuration
      2.3.3 Overlay File System
      2.3.4 Script for Overlay File System
      2.3.5 Prevent reconfiguration of the node network interfaces
      2.3.6 These directories should be created
      2.3.7 Silence /etc/mtab errors
      2.3.8 We also need the NFS client software
    total 0
    lrwxrwxrwx 1 33 Dec 10 14:18 10-fastcgi.conf -> ../conf-available/10-fastcgi.conf
    lrwxrwxrwx 1 37 Dec 10 14:19 15-fastcgi-php.conf -> ../conf-available/15-fastcgi-php.conf

Installing the ganglia web frontend itself is fairly straightforward. Download the tarball from their SourceForge site and uncompress it. Then edit the Makefile to your needs:

    # Location where gweb should be installed to (excluding conf, dwoo dirs).
    GDESTDIR = /var/www/ganglia
    # Gweb statedir (where conf dir and Dwoo templates dir are stored)
    GWEB_STATEDIR = /var/lib/ganglia-web
    # Gmetad rootdir (parent location of rrd folder)
    GMETAD_ROOTDIR = /var/lib/ganglia
    APACHE_USER = www-data

Finally, execute make install and the files will be copied to the given directory. Make sure the owner and permissions are right, then go to http://master-node/ganglia and watch the status of your cluster; you should see some pretty graphs.

1.2.7 NFS Exports

We use NFSv4 exports, i.e. do not forget the fsid=0 parameter for the root (/srv) directory. Do not forget that NFS booting itself does NOT work with NFSv4; adjust the mount point in the PXE configuration file accordingly, which means you have to use host_ip:/srv/nfsroot instead of host_ip:/nfsroot on the kernel command line.

1.2.8 Postfix Mailserver

SLURM can send status emails for jobs; we also want to send emails for certain events like high room temperature, failed hard drives and similar things.
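For reference, a matching /etc/exports on the master could look like the following sketch; the export options are assumptions, only the fsid=0 on /srv is essential here:

    # /etc/exports on the master: /srv is the NFSv4 pseudo-root
    /srv           192.168.0.0/24(ro,fsid=0,no_subtree_check,async)
    /srv/nfsroot   192.168.0.0/24(ro,no_root_squash,no_subtree_check,async)

    exportfs -ra    # re-export after editing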
  2.4 NIS
  2.5 SLURM and Munge
      2.5.1 Munge
      2.5.2 SLURM
  2.6 RSYSLOG
  2.7 Mounted file systems
  2.8 Prevent Starting of Daemons
3 Troubleshooting
4 Useful Packages
5 Miscellaneous Stuff
  5.1 InfiniBand Setup and Testing
      5.1.1 Physical Installation
      5.1.2 Necessary Software
      5.1.3 InfiniBand Performance
      5.1.4 Tuning
      5.1.5 NFS over IP over IB
      5.1.6 BTW
  5.2 MRTG Setup on master node
6 Postfix Mail setup on master node
7 Printer Setup

1 HowTo Build a Diskless Debian Cluster

PDF version of this document

1.1 Objective

We want to build a diskless Debian cluster for high performance computing (HPC). The main purpose of the cluster is molecular dynamics simulations using GROMACS.

Our current cluster consists of

1. one master node (head node),
2. about 20 compute nodes without any hard disk drives:
   (a) 16 dual dodeca-core (12-core) AMD Opteron CPUs with 16 GB RAM,
The ramdisk will be writable and contains all the changes to the filesystem in memory until the next reboot. The nfsroot will stay read-only.

2.3.5 Prevent reconfiguration of the node network interfaces

Make sure that the $IMGDIR/etc/network/interfaces file has this entry:

    iface eth0 inet manual

This ensures that the NIC configuration will be left as it is and not reconfigured, which would break the connection to the NFS root.

2.3.6 These directories should be created

The aufs script needs these directories to exist on the live image:

    mkdir /srv/node-image/nfsroot /srv/node-image/ramdisk

2.3.7 Silence /etc/mtab errors

To prevent the error "/etc/mtab not found" we simply link /proc/mounts to /etc/mtab:

    chroot /srv/node-image ln -s /proc/mounts /etc/mtab

2.3.8 We also need the NFS client software

Of course we need the NFS client to connect to our shared home and data directories:

    apt-get install nfs-common

2.4 NIS

The NIS clients need the "+::" entries (and similar) added to the /etc/passwd, /etc/shadow and /etc/group files. This will make the clients ask the master server for user credentials like UID, GID etc. The file /etc/nsswitch.conf should be changed slightly: change all occurrences for passwd etc. from "compat" to "nis compat". Make the NIS client start on system start-up (/etc/default/nis):

    NISCLIENT=true

Then the client needs to know which server to ask (/etc/yp.conf):

    ypserver 192.168.0.254
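The nsswitch change for the image can be scripted so it is not forgotten on a rebuild; a sketch (the image path matches the one used above):

    # switch passwd/group/shadow lookups from "compat" to "nis compat" in the node image
    sed -i 's/^\(passwd\|group\|shadow\):[[:space:]]*compat/\1: nis compat/' \
        /srv/node-image/etc/nsswitch.conf

    # after a node has booted, verify on the node:
    ypwhich          # should print the master node
    getent passwd    # should now also list the NIS users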
Create the config:

    cfgmaker public@router_ip > /etc/mrtg/router.cfg

Run mrtg once to create the initial RRDs and graphs:

    env LANG=C /usr/bin/mrtg /etc/mrtg/router.cfg

Create the HTML index file:

    indexmaker /etc/mrtg/router.cfg > /var/www/mrtg/index.html

Add the following line to your crontab (crontab -e) to update the statistics every 5 minutes:

    */5 * * * * env LANG=C /usr/bin/mrtg /etc/mrtg/smc-switch-cluster.cfg > /dev/null

The output can be viewed with a browser at http://host/mrtg.

6 Postfix Mail setup on master node

Goals:

- Accept emails from the private network, so we can send status mails from various hosts and services (like from a NAS or RAID management software).
- Relay mails to our own mailserver (external_mail_host), and only to our mailserver with the domain external_domain.

This setup can be done in the following way with the postfix MTA:

1. Edit the /etc/postfix/main.cf file:

    transport_maps = hash:/etc/postfix/transport
    mydestination = external_hostname, master-node, master-node.cluster, localhost
    mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128 192.168.0.0/24
    smtp_generic_maps = hash:/etc/postfix/generic

2. Create the file /etc/postfix/transport
In the file /etc/dhcp/dhclient.conf, add the entry

    prepend domain-name-servers 127.0.0.1;

This ensures that the DNS server on the localhost will always be asked first. If the outbound NIC is configured manually this is not necessary, as the resolv.conf file will not change. I recommend not installing network-manager, as it could interfere with the manual configuration in /etc/network/interfaces.

1.2.3 NIS

There are a lot of HowTos floating around on the net describing the process of setting up NIS. Good ones are:

- Arch Linux
- FreeBSD Handbook (which is absolutely marvellous and, despite being BSD based, still useful for Linux)
- the original NIS HowTo
- Debian specific

Ensure that the NIS domain is set correctly. The NIS domain has to be the same for all the computers in the cluster accessed by the users:

- dpkg-reconfigure nis
- check/set with the commands ypdomainname or nisdomainname
- check the content of /etc/defaultdomain

Master node as master NIS server. The master node will be the master NIS server for our cluster; one or both of the NAS could be set up as slave servers to provide redundancy. Check the NIS server configuration in the file /etc/default/nis:

    NISSERVER=master    # or slave, or false

Initialize the yp database. In order to initialize the yp database, issue the following command:

    ypinit -m

If something does not work: sometimes it helps to reinitialize/update the server's yp maps.
Do not forget to add them to /etc/modules:

    mlx4_ib
    ib_ipoib
    ib_umad
    rdma_ucm
    rdma_cm

Load them with modprobe. Then we can finally assign an IP to our new card:

    ifconfig ib0 up 192.168.2.1

Following this instruction on the second host should allow you to ping the cards. Create a new entry in /etc/network/interfaces to make the assigned IP permanent:

    auto ib0
    iface ib0 inet static
        address 192.168.1.1
        netmask 255.255.255.0

5.1.3 InfiniBand Performance

You can use iperf to test the IP network bandwidth, as well as ib_read_lat, ib_write_bw etc. to test the read latency or the write bandwidth, respectively. Here are some results from iperf:

    root@testserver:~# iperf -c 192.168.1.2 -P4
    ------------------------------------------------------------
    Client connecting to 192.168.1.2, TCP port 5001
    TCP window size: 649 KByte (default)
    ------------------------------------------------------------
    [  5] local 192.168.1.1 port 56344 connected with 192.168.1.2 port 5001
    [  4] local 192.168.1.1 port 56341 connected with 192.168.1.2 port 5001
    [  3] local 192.168.1.1 port 56342 connected with 192.168.1.2 port 5001
    [  6] local 192.168.1.1 port 56343 connected with 192.168.1.2 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  4] 0.0- 9.0 sec   1.40 GBytes  1.34 Gbits/sec
    [  3] 0.0- 9.0 sec   1.42 GBytes  1.35 Gbits/sec
    [  6] 0.0- 9.0 sec   1.41 GBytes  1.34 Gbits/sec
    [  5] 0.0-10.0 sec   2.04 GBytes  1.75 Gbits/sec
    [SUM] 0.0-10.0 sec   6.27 GBytes  5.39 Gbits/sec

For comparison, look at the speed of 1 Gbit ethernet:

    root@testserver:~# iperf -c 10.0.0.102 -P4
7 Printer Setup

We do not yet have our own subnet and we do not yet use a configuration management system like puppet, salt or cfengine, so we have to manage some things via the command line, for example setting up a printer:

    dsh -a -c -M -- lpadmin -p colorqube -E -v ipp://130.83.32.235/ipp \
        -P xr_ColorQube8570.ppd -o printer-is-shared=false \
        -o PageSize=A4 -o XRXOptionFeatureSet=DN

    dsh -a -c -M -- lpoptions -d colorqube
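A quick check that the queue actually arrived on every node, as a sketch:

    dsh -a -c -M -- lpstat -p colorqube -d    # printer present and set as default everywhere?
    echo "test page" | lp -d colorqube        # print a short test job from the master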
To that effect, issue the following commands in a terminal:

    cd /var/yp
    make all

You can modify the /var/yp/Makefile to suit your needs; as an example, you can serve only UIDs in a certain numerical range with the MINUID and MAXUID variables.

Define our subnet. For security reasons, define our subnet in /etc/ypserv.securenets. This will make sure that only requests from within our subnet will be answered:

    root@testserver:~# cat /etc/ypserv.securenets
    # Always allow access for localhost
    255.0.0.0       127.0.0.0
    # This line gives access to our subnet
    255.255.255.0   192.168.0.0

1.2.4 RSYSLOG

We want the compute nodes to log everything to the head node and nothing on the node itself, i.e. all log data will be forwarded to the head node. Otherwise it could happen that the small RAM filesystem fills up due to logging.

The following configuration makes the rsyslogd process on the master listen for incoming log messages:

    # cat /etc/rsyslog.conf | grep -v -e "^#" -e "^$"
    $ModLoad imuxsock   # provides support for local system logging
    $ModLoad imklog     # provides kernel logging support
    $ModLoad immark     # provides --MARK-- message capability
    $ModLoad imudp
    $UDPServerRun 514
    $ModLoad imtcp
    $InputTCPServerRun 514
    $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
    # the rest is left unchanged

1.2.5 Ganglia

Ganglia is used to monitor the health of our cluster. It stores collected data in RRD files, for example network usage, uptime, load, etc. Basically, ganglia consists of one or more data collectors (gmetad) and multiple data sources (gmond).
    $ModLoad immark     # provides --MARK-- message capability
    $ActionFileDefaultTemplate RSYSLOG_ForwardFileFormat
    $FileOwner root
    $FileGroup adm
    $FileCreateMode 0640
    $DirCreateMode 0755
    $Umask 0022
    $WorkDirectory /var/spool/rsyslog
    $IncludeConfig /etc/rsyslog.d/*.conf
    *.* @192.168.0.254

2.7 Mounted file systems

The filesystems needed on the nodes should be defined in the file /etc/fstab. Here is an example; do not change the first three entries:

    # /etc/fstab: static file system information.
    # <file system>       <mount point>  <type>  <options>                                  <dump> <pass>
    proc                  /proc          proc    defaults                                   0      0
    sysfs                 /sys           sysfs   defaults                                   0      0
    devpts                /dev/pts       devpts  rw,nosuid,noexec,relatime,gid=5,mode=620   0      0
    192.168.0.254:/home   /home          nfs4    defaults                                   0      0
    192.168.0.200:/data   /data          nfs4    defaults                                   0      0

2.8 Prevent Starting of Daemons

The following command will prevent the start of daemons/services during installation, as we do not want to start a second ssh daemon in the NFS root chroot. More details of how this actually works can be found here and in the man page of invoke-rc.d (man invoke-rc.d). Use the following two commands to create the file and set its mode to executable:

    echo -e '#!/bin/sh\necho "Not starting daemon"\nexit 101' > $IMAGEDIR/usr/sbin/policy-rc.d
    chmod 755 $IMAGEDIR/usr/sbin/policy-rc.d

3 Troubleshooting

Here are some useful commands for troubleshooting.
Parameters starting with $ are variables you have to change to your needs.

    cd /var/yp; make all                       # recreate yp maps
    dig +short $hostname                       # check DNS; returns the IP of $hostname
    dig @$dns_server_ip +short $hostname       # check DNS using $dns_server_ip as server
    ypcat passwd.byname                        # check NIS on client/server; returns part of /etc/passwd
    lynis                                      # check the system for obvious configuration oversights (security-wise)
    ifconfig eth0 192.168.0.1                  # set the IP of NIC eth0 to 192.168.0.1
    mpirun -n 4 -H linux-20,localhost $cmd     # execute $cmd on linux-20 and locally
    dsh -a -c -M -- $cmd                       # execute $cmd on all nodes concurrently, prepend the node name to the output
    usermod -R $IMAGEDIR ...                   # -R gives the chroot environment, for usermod, pwck and friends

4 Useful Packages

Here is a list of useful packages with a short explanation of what they do:

- etckeeper manages the /etc directory. I use the following options: AVOID_DAILY_AUTOCOMMITS=1, AVOID_COMMIT_BEFORE_INSTALL=1
- lynis is an auditing tool; it helps to find obvious security problems.
- schroot makes it easy to maintain chroot environments.
- dsh executes commands remotely on several different machines at the same time.
- vim is my favorite editor for config files (YMMV).

5 Miscellaneous Stuff

5.1 InfiniBand Setup and Testing

I wanted to use InfiniBand and bought two Mellanox ConnectX 20 Gbit/s DDR NICs. The idea was to use one node exclusively for data analysis, which needs fast I/O.
A node can also be identified with IPMI (chassis identify): ipmi-chassis -i 10 -h ilinux-10 will turn the identify LED on for 10 seconds. A blinking LED will identify the node; on our nodes the light is blue and thus easy to distinguish from the other blinking LEDs, which are green and red.

    # this file is read automatically; it assigns an IP and hostname to each node
    dhcp-host=00:23:54:91:86:61,node-03,192.168.0.3,12h
    dhcp-host=00:23:54:91:86:64,node-04,192.168.0.4,12h
    dhcp-host=00:25:90:13:c3:96,node-05,192.168.0.5,12h
    dhcp-host=00:25:90:13:c0:ce,node-06,192.168.0.6,12h
    dhcp-host=00:25:90:13:c1:ba,node-07,192.168.0.7,12h
    dhcp-host=00:25:90:12:84:60,node-08,192.168.0.8,12h
    dhcp-host=00:25:90:57:48:70,node-09,192.168.0.9,12h
    dhcp-host=00:25:90:57:48:b6,node-10,192.168.0.10,12h
    dhcp-host=00:25:90:57:48:60,node-11,192.168.0.11,12h
    dhcp-host=00:25:90:57:48:3a,node-12,192.168.0.12,12h
    dhcp-host=00:25:90:57:46:ca,node-13,192.168.0.13,12h
    dhcp-host=00:25:90:57:48:aa,node-14,192.168.0.14,12h
    dhcp-host=00:25:90:57:48:??,node-15,192.168.0.15,12h
    dhcp-host=00:25:90:57:48:52,node-16,192.168.0.16,12h
    dhcp-host=00:25:90:57:48:da,node-17,192.168.0.17,12h
    dhcp-host=00:25:90:57:46:08,node-18,192.168.0.18,12h
    dhcp-host=00:25:90:57:45:8?,node-19,192.168.0.19,12h
    dhcp-host=00:25:90:57:48:ce,node-20,192.168.0.20,12h
    SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
    SlurmUser=slurm
    StateSaveLocation=/var/lib/slurm-llnl/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/none
    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0
    FastSchedule=1
    SchedulerTimeSlice=60
    SchedulerType=sched/builtin
    SchedulerPort=7321
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    PreemptMode=GANG
    AccountingStorageType=accounting_storage/none
    AccountingStoreJobComment=YES
    ClusterName=cluster.agvogel
    JobCompType=jobcomp/none
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/none
    SlurmctldDebug=3
    SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
    SlurmdDebug=3
    SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
    NodeName=linux-[01-04] RealMemory=8000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
    NodeName=linux-[05-20] RealMemory=16000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
    PartitionName=nodes Nodes=linux-[01-20] Default=YES MaxTime=INFINITE State=UP Shared=FORCE:2

2.6 RSYSLOG

This will configure rsyslog on the nodes to send all messages to the head node. The complete file, without comments and empty lines:

    root@test-master-node:~# grep -v -e "^#" -e "^$" /srv/test-bootstrap/etc/rsyslog.conf
    $ModLoad imuxsock   # provides support for local system logging
    $ModLoad imklog     # provides kernel logging support
(mysql-server, for example), whereas some are just outdated and do not offer Debian Wheezy packages, if they are packaged at all; others are end-of-life and won't be supported anymore.

Debian Live would be another, actually rather appealing, possibility, but for some reason the boot process never went past the ipconfig part. I think now that the reason was the second NIC interfering in the IP discovery process; see this bug report, which seems to be finally fixed after this setup was made to work.

Anyway, let's start building our diskless cluster.

1.2 Basic Master Node Setup

1.2.1 SSH Setup

SSH for the cluster nodes: fewer questions, less secure. This can be removed when the live image is finalized and the host key is not changing as often. The better alternative would be to change the image creation script and use a pre-existing host key, as it is done now.

    root@test-master-node:~# cat /etc/ssh/ssh_config | grep -v "^#"
    Host 192.168.0.*
        StrictHostKeyChecking no
    Host linux-*
        StrictHostKeyChecking no

1.2.2 DNSmasq

We use dnsmasq to provide host name resolution for the nodes. A nice side effect is that we also get a TFTP server for network boot. The biggest advantage, though, is the very simple configuration compared to isc-bind and isc-dhcp-server. For our two dozen nodes dnsmasq is more than sufficient.

Make sure that the daemon is enabled; check in /etc/default/dnsmasq:

    ENABLED=1
Installing the Debian packages:

    apt-get install ganglia-monitor gmetad

Ganglia daemon (gmond). Delete the lines with mcast, because we do not use/need broadcast addresses, and replace the line in the send section with a host entry in /etc/ganglia/gmond.conf:

    udp_send_channel {
      host = 192.168.0.254
      port = 8649
    }
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }

Configuration of gmetad. Change the trusted data source for the gmetad collection daemon in /etc/ganglia/gmetad.conf, i.e. allow gmetad to receive data on localhost and our internal IP:

    data_source "my cluster" localhost 192.168.0.254

1.2.6 Ganglia Web

The ganglia gmetad daemon is now running and collects data in RRD archives. In order to visualize the data one needs the ganglia web frontend. Installing it via apt-get leads to the installation of apache2; I wanted to use a smaller HTTP server and chose lighttpd for this task. To this end one can also install the ganglia website directly from the sources; the dependencies then need to be fulfilled manually.

To make the ganglia web interface work one needs to install the package php5-cgi. Then one has to enable lighttpd to execute PHP CGI scripts. This is accomplished with the following links in the configuration directory /etc/lighttpd/conf-enabled:

    root@test-master-node:/etc/lighttpd/conf-enabled# ls -lgG
That's it for the NIS client setup. If something is not working, first check whether DNS is working, then recreate the yp maps on the master node.

2.5 SLURM and Munge

CRUCIAL: The time on all hosts of the cluster has to be correct, otherwise munge will not work. Use ntpdate for the first clock setup, then install the ntp daemon to keep it synchronized.

2.5.1 Munge

SLURM will use the munge daemon to issue commands securely on the nodes. The installation is very simple:

    aptitude install munge

Create a new key:

    create-munge-key

The key file /etc/munge/munge.key needs to be accessible and identical on every node. The permissions have to be set to 0600, otherwise munge won't start. Check /var/log/munge/munged.log for errors; check with pgrep munge whether the daemon is running.

Testing the munge installation is easy once a node is running:

    root@test-master-node:~# echo test | munge
    MUNGE:AwQDAADCETqjEZ3xHGxnSQIaZk16NK35T42vf3O3YxHa6z3lCzxZz0OMAYXq9uZV8pBYrSYAVtatYbPtxIrx3Ke6DgiAIzt127038ABmIyTk104bB8I:

    root@test-master-node:~# echo test | munge | ssh linux-20 unmunge
    STATUS:           Success (0)
    ENCODE_HOST:      testserver.cluster.agvogel (10.0.0.161)
    ENCODE_TIME:      2013-12-09 15:02:41 (1386597761)
    DECODE_TIME:      2013-12-09 15:02:43 (1386597763)
    TTL:              300
    CIPHER:           aes128 (4)
    MAC:              sha1 (3)
    ZIP:              none (0)
    UID:              root (0)
    GID:              root (0)
                echo "Root tmpfs size is $ROOTTMPFSSIZE"
                sleep 1
                ;;
            esac
        done

        modprobe nfs
        modprobe af_packet
        modprobe aufs

        udevadm trigger
        wait_for_udev 5
        configure_networking

        test -d /nfsroot || mkdir /nfsroot
        test -d /ramdisk || mkdir /ramdisk
        test -d ${rootmnt} || mkdir ${rootmnt}
        sleep 3

        mount -t tmpfs -o rw,size=$ROOTTMPFSSIZE tmpfs /ramdisk

        retry_nr=0
        max_retry=30
        while [ ${retry_nr} -lt ${max_retry} ] && [ ! -e /nfsroot/init ]; do
            log_begin_msg "Trying nfs mount"
            nfsmount -o nolock,ro ${NFSOPTS} ${NFSROOT} /nfsroot
            /bin/sleep 1
            retry_nr=$(( ${retry_nr} + 1 ))
            log_end_msg
        done

        # overlay /ramdisk (rw) over /nfsroot (ro) and mount it on ${rootmnt}
        mount -t aufs -o dirs=/ramdisk=rw:/nfsroot=ro none ${rootmnt}

        echo $HOSTNAME > ${rootmnt}/etc/hostname
        echo cluster > ${rootmnt}/etc/defaultdomain
        echo live-node > ${rootmnt}/etc/debian_chroot
    }

This script will first load a couple of kernel modules:

1. nfs, to be able to mount NFS volumes,
2. af_packet, which allows the user to implement protocol modules in user space,
3. aufs, our overlay file system driver.

Afterwards it waits for udev to populate the devices before it creates a tmpfs file system of 500 MBytes in RAM. It then goes on and tries to mount the NFS root directory until it succeeds. Upon success, the ramdisk and the NFS root are united into the new root mount point.
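After changing initramfs.conf, the modules file or this script, the initrd inside the image has to be rebuilt and copied to the TFTP root, roughly like this (kernel version and paths as used in the PXE example above):

    chroot /srv/node-image update-initramfs -u -k all
    cp /srv/node-image/boot/initrd.img-3.2.0-4-amd64 /srv/tftp/
    cp /srv/node-image/boot/vmlinuz-3.2.0-4-amd64    /srv/tftp/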
        Status: Cap 66MHz UDF FastB2B ParErr DEVSEL=fast >TAbort <TAbort <MAbort >SERR <PERR INTx
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin routed to IRQ 19
        Region 0: Memory at 4400000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at fd000000 (64-bit, prefetchable) [size=8M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk DSI D1 D2 AuxCurrent=0mA PME(D0,D1,D2,D3hot,D3cold)
                Status: D0 NoSoftRst PME-Enable DSel=0 DScale=0 PME
        Capabilities: [48] Vital Product Data
                Product Name: Eagle DDR
                Read-only fields:
                        [PN] Part number:
                        [EC] Engineering changes: A1
                        [SN] Serial number: MT1045X00466
                        [V0] Vendor specific: PCIe Gen2 x8
                        [RV] Reserved: checksum good, 0 byte(s) reserved
                Read/write fields:
                        [V1] Vendor specific: N/A
                        [YA] Asset tag: N/A
                        [RW] Read-write area: 111 byte(s) free
                End
        Capabilities: [9c] MSI-X: Enable Count=256 Masked
                Vector table: BAR=0 offset=0007c000
                PBA: BAR=0 offset=0007d000
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 64ns, L1 unlimited
                        ExtTag AttnBtn AttnInd PwrInd RBE FLReset
                DevCtl: Report errors: Correctable Non-Fatal Fatal Unsupported
                        RlxdOrd ExtTag PhantFunc AuxPwr NoSnoop
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr UncorrErr FatalErr UnsuppReq
                        AuxPwr TransPend
                LnkCap: Port #8, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
                        ClockPM Surprise LLActRep BwNot
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled Retrain CommClk
                        ExtSynch ClockPM AutWidDis BWInt AutBWInt
                LnkSta: Speed 2.5GT/s, Width x8, TrErr Train SlotClk DLActive BWMgmt ABWMgmt
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance SpeedDis, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance ComplianceSOS
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete EqualizationPhase1
                         EqualizationPhase2 EqualizationPhase3 LinkEqualizationRequest
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC ACS, Next Function: 1
                ARICtl: MFVC ACS, Function Group: 0
        Kernel driver in use: mlx4_core

The important lines are the ones beginning with LnkCap and LnkSta, which tell us the card's capability and the current link status:

    root@test-master-node:~# lspci -s 04:00.0 -vv | grep -e LnkSta -e LnkCap
    LnkCap: Port #8, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
    LnkSta: Speed 2.5GT/s, Width x8, TrErr Train SlotClk DLActive BWMgmt ABWMgmt
2.3.1 Configure PXE boot parameters

The PXE boot environment configuration is done in the file /srv/tftp/pxelinux.cfg/default. Here is an example APPEND line:

    APPEND boot=aufs nfsroot=192.168.0.254:/srv/node-image ro initrd=initrd.img-3.2.0-4-amd64

CRUCIAL: Leave out the ip= kernel parameter.

2.3.2 Initramfs configuration

If the DEVICE parameter is left empty, the ipconfig command of the kernel will request an IP address from all NICs. The second NIC is not connected and thus waits forever for an answer. DEVICE=eth0 ensures that only eth0 will request an IP. Parameters to change in /srv/node-image/etc/initramfs-tools/initramfs.conf:

    DEVICE=eth0
    NFSROOT=auto

2.3.3 Overlay File System

Add aufs to /srv/node-image/etc/initramfs-tools/modules. We will need this module to overlay the read-only NFS root directory so that some important files can be written; the file will be overlaid over the original file.

    echo aufs >> /srv/node-image/etc/initramfs-tools/modules

2.3.4 Initrd Script for Overlay File System

The kernel parameter boot=aufs in the PXE config above starts the script "aufs" in the folder /srv/node-image/etc/initramfs-tools/scripts (adapted from here):

    #!/bin/sh

    mountroot()
    {
        ROOTTMPFSSIZE=500M

        for x in $(cat /proc/cmdline); do
            case $x in
            roottmpfssize=*)
                ROOTTMPFSSIZE=${x#roottmpfssize=}
2 The Live Image for the Compute Nodes

2.1 Some general remarks

Following is an abbreviated description of the script used to build the NFS root for the diskless compute nodes. The most important steps are outlined here; the actual procedure is written in a script so that no step is forgotten and the result is really reproducible (i.e. no typos etc.).

It is important that the head node is configured and works appropriately. This is especially important for the NIS/NFS/DNS stuff, which can take a while to debug. Furthermore, it is nice if one has access to IPMI-enabled nodes. This makes debugging the start-up procedure not really more comfortable, but at least somewhat bearable.

The script to build the live NFS root, bootstrap.sh, is hosted here on BitBucket. Usage is very simple:

    cd debian-diskless-cluster
    ./bootstrap.sh -d <nfsroot_directory>

Check out the test.sh script if you need to rebuild often:

1. it deletes the preset nfsroot,
2. debootstraps into it,
3. reboots a node via IPMI.

Another helper script is diskless.lib. It provides mount_chroot_image and umount_chroot_image commands to (un)mount a chroot and the proc, sys and devpts special directories. That script is adapted from kestrel-hpc.

The bootstrap.sh script has been tested with sid (2013-12-10) as the target version and the node boots properly. The mismatch of the SLURM versions prevents usage in a mixed system, though.
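The IPMI part of test.sh boils down to something like the following sketch (ipmitool assumed; the host name pattern follows the IPMI hosts file mentioned above, the credentials are placeholders):

    ipmitool -I lanplus -H ilinux-10 -U ADMIN -P secret chassis power cycle   # reboot node 10
    ipmitool -I lanplus -H ilinux-10 -U ADMIN -P secret sol activate          # watch the boot via serial-over-LAN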
