KIT - Karlsruhe Institute of Technology

Building and optimizing a small cluster for the Student Cluster Competition

Thesis of Jos Ewert

At the Faculty of Mathematics, Institute for Applied and Numerical Mathematics 4

Reviewer: Prof. Dr. Vincent Heuveline
Second reviewer: Prof. Dr. Wolfgang Karl

KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu

Declaration of Independent Work

I hereby declare that I wrote this thesis independently and that only the stated aids and sources were used.

Karlsruhe, 15 October 2012

Signature

Contents

1 Introduction
2 Rules of the Student Cluster Competition
3 Hardware
3.1 Initial Systems
3.2 The Intel Xeon System
4 Software
4.1 Operating System
4.2 Kite
4.3 FhGFS
4.4 Zabbix
4.5 SLURM
4.6 Environment Modules
4.7 Intel MPI
5 HPCC and LINPACK
6 OpenFOAM
6.1 OpenFOAM on GPU
RAM        clock speed (GHz)   motherboard      power (W)   avg. runtime (s)   work (Wh)
Kingston   1.8                 S2600CP          236         596                39.1
Kingston   2.6                 S2600CP          326         592                53.6
Samsung    1.8                 S2600CP          221         550                33.7
Samsung    2.6                 S2600CP          385         487                52.1
Samsung    1.8                 Jefferson Pass   265         528                38.8
Samsung    1.8                 X9DRi-F          234         513                33.3

Table 1: Work of OpenFOAM on the test systems

DDR3-1600 memory from Samsung. The memory could be used at 1.35 V as opposed to the regular 1.5 V, which we believed would gain us an increase in power efficiency.

We had two different models of CPU at our disposal, the Intel Xeon E5-2650L clocked at 1.8 GHz with a TDP of 70 W and the E5-2670 clocked at 2.6 GHz with a TDP of 115 W. We also tested two different kinds of RAM: DDR3-1333 from Kingston running at 1.5 V and the Samsung green memory DDR3-1600 with a lower voltage of 1.35 V. Initially we had an S2600CP dual-socket server motherboard from Intel, later we tested with a Jefferson Pass system having a dual-socket S2600JF motherboard, and finally with an X9DRi-F dual-socket motherboard from Supermicro.

We gathered the power usage data with a power measuring device and only considered the highest peak power usage, as we wanted to avoid tripping the circuit breaker at 3 kW. In Figure 3.1 you can see the measurements we took on several systems by running OpenFOAM several times in a row.

I tried to gauge the efficiency of the CPUs with our applications and thus determine which CPU model would b
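The work values in Table 1 follow directly from the peak power and the average runtime (Wh = W x s / 3600). A minimal sketch to reproduce the table's work column; the measurements are taken from Table 1, the script itself is only an illustration:

```python
# Reproduce the "work" column of Table 1: work [Wh] = peak power [W] * runtime [s] / 3600
measurements = [
    ("Kingston", 1.8, "S2600CP", 236, 596),
    ("Kingston", 2.6, "S2600CP", 326, 592),
    ("Samsung", 1.8, "S2600CP", 221, 550),
    ("Samsung", 2.6, "S2600CP", 385, 487),
    ("Samsung", 1.8, "Jefferson Pass", 265, 528),
    ("Samsung", 1.8, "X9DRi-F", 234, 513),
]

for ram, clock, board, power_w, runtime_s in measurements:
    work_wh = power_w * runtime_s / 3600.0
    print(f"{ram:9s} {clock} GHz {board:15s} {work_wh:5.1f} Wh")
```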
[Figure 6.1: Distribution of processes. (a) Runtime (s) over the number of processes on the cluster; (b) process pinning.]

node, the runtime for a modified damBreak problem with 64 processes took on average 1635 s. If I used only 4 cores per node, so only every second core, the runtime was 564 s on average. I nearly had a 3x speedup by doubling the number of nodes used but only using half of the cores on each.

During the competition, I used fewer processes distributed over the same number of nodes. As one can see in Figure 6.1a, the runtime with the maximum number of processes is not the best. I achieved the best runtime for backStep with 52 processes and around 10000 cells per process, with two cores per CPU used. With square, the best runtime was achieved with 104 processes, four cores per CPU, with around 30000 cells per process. The runtime of backStep dropped from 88 seconds with 208 processes to 34 s with 52 processes, decreasing the runtime by about 60%. On square it dropped from 245 s with 208 processes to 169 s with 104 processes, decreasing the runtime by about 30%.

As we can also see in HPCC's results for StarSTREAM and SingleSTREAM, the memory bandwidth is nearly doubled when using only 1 core instead of having all 8 competing. Only in the latter case, where the number of processes is changed, do the increased amounts of I/O and communication play a role, as in the former case the number of processes, and thus the communication, stayed stable. Intel's Turbo Boost, which increases the
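The best runtimes above correspond to a fairly narrow band of cells per process. A quick check of those figures, using the cell counts given in the profiling discussion (backStep about 500000 cells, square about 3 million cells); the process counts are the ones plotted in Figure 6.1a:

```python
# Rough check of the cells-per-process figures quoted above
cases = {
    "backStep": 500_000,   # approximate cell count from the profiling section
    "square": 3_000_000,
}
process_counts = [208, 104, 52, 26]

for name, cells in cases.items():
    for nproc in process_counts:
        print(f"{name:8s}: {nproc:3d} processes -> ~{cells // nproc:6d} cells per process")
```

For backStep, 52 processes give roughly 9600 cells per process, and for square, 104 processes give roughly 29000, matching the "around 10000" and "around 30000" figures quoted above.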
a sufficiently low error with some compiler options. I thus recommend first testing the cluster's hardware to match the applications and the goals.

Now that we have a better understanding of the goals, which is to run small jobs as fast as possible, for a future project like this I would most likely choose CPUs with a higher clock and maybe fewer nodes. If we exceed the power limit, it is easy to clock the CPUs down; this is considering the 10% higher power usage at low clocks compared to actual low-power CPUs. Most likely only the LINPACK score would suffer from the decreased efficiency, an acceptable trade-off to speed up the applications.

Bibliography

[1] TOP500 Introduction and Objectives. http://www.top500.org/project/introduction
[2] ISC'12 Student Cluster Competition. http://www.hpcadvisorycouncil.com/events/2012/ISC12-Student-Cluster-Competition
[3] Student Cluster Competition Rules. http://www.hpcadvisorycouncil.com/events/2012/ISC12-Student-Cluster-Competition/rules.php
[4] Student Cluster Competition FAQ. http://www.hpcadvisorycouncil.com/events/2012/ISC12-Student-Cluster-Competition/attendee_reg.php
[5] Einar Rustad. NumaConnect: A high level technical overview of the NumaConnect technology and products. http://www.numascale.com/numa_pdfs/numaconnect-white-paper.pdf
[6] Carlos Sosa and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/P Application Development. IBM, fourth edition, August 2009.
[7] Bo
cate which resources they want. For example, they can specify how many processes (nodes), how much RAM, or how much time is needed. SLURM allocates resources and schedules jobs once the requested resources are available. It also tracks the state of the cluster by monitoring how many nodes are allocated, in a ready state, or unavailable. In addition, SLURM can also do accounting on how many resources a user needed.

There are three ways for a user to use SLURM. They can use the srun command, which allocates the resources and executes the application with as many processes in parallel as specified. With sbatch and salloc, SLURM only allocates resources and then executes the specified program once, so the software needs to take care of spawning itself on the different nodes. Srun and salloc block until the resources are freed and then write the output to the console, while sbatch returns immediately and, once the job is scheduled, writes the application's output to files. In typical cluster usage MPI is responsible for spawning the right processes on the correct nodes, and users only schedule jobs to be executed and do not need immediate feedback written to their console, so in most cases sbatch is used.

SLURM supports scripts that can be run at different points during a job's execution. I configured it to call echo 3 > /proc/sys/vm/drop_caches to clear the file system caches at the end of each job. This was because if the file system caches grew over a
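The cache-dropping hook described above can be wired in through SLURM's epilog mechanism. A minimal configuration sketch; the script path and file name are assumptions, only the Epilog parameter and the cache-dropping command itself come from the text:

```
# Excerpt from slurm.conf (sketch): run a small root script on every compute
# node when a job finishes.
Epilog=/etc/slurm/epilog.sh

# where /etc/slurm/epilog.sh (hypothetical path) contains the command quoted above:
#   echo 3 > /proc/sys/vm/drop_caches
```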
like done by Michael Moylesa, Peter Nash and Ivan Girotto [30] or Massimiliano Culpo [14]. Their studies determined that on our size of cluster around 30-35% of the time would be spent on communications and 25-35% on I/O, leaving in the best case only 45% of the time for computation, which would drastically reduce the percentages gathered while profiling just one time step.

(Footnotes: 1. Generalized geometric-algebraic multi-grid. 2. Preconditioned conjugate gradient on GPU. 3. Algebraic multigrid. 4. The first directory in the path indicates the solver used in the tutorial applications.)

According to the cycle estimation of Valgrind, the cavity problem spent 3.5% running PBiCG::solve() and 10.4% running PCG::solve(), so about 14% of the time was in the linear solvers. DamBreak spent 0.5% running PCG::solve(); the PBiCG solver ran too shortly to be measured. PlateHole with 1000 cells spent 13.2% of its time in the GAMG solver; with the cell count increased to 100 thousand it was 20% of the time. BackStep used 50% of its time with GAMG and 1% with PBiCG, square 63% with GAMG and 1% with PBiCG.

From the data we can see on the plateHole test that the larger the dataset, the more time the application will spend solving the linear algebra. It, however, also depends on what problem is solved: the smaller cavity problem spent a larger percentage of its time in the linear solvers than damBreak and plateHole in its default 1000-cell set
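These shares put a hard ceiling on what a GPU-accelerated linear solver could buy: even an infinitely fast solver only removes the profiled fraction of the runtime. A minimal Amdahl's-law sketch using the fractions measured above; the 5x acceleration factor is an arbitrary illustrative assumption, not a vendor figure:

```python
# Amdahl's law: overall speedup when a fraction p of the runtime is sped up by a factor s.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Fractions of time spent in the linear solvers, from the Valgrind estimates above
solver_share = {"cavity": 0.14, "plateHole (100k cells)": 0.20, "backStep": 0.51, "square": 0.64}

for name, p in solver_share.items():
    ceiling = 1.0 / (1.0 - p)  # limit for an infinitely fast linear solver
    print(f"{name:24s} max speedup {ceiling:4.2f}x, with a 5x faster solver {amdahl(p, 5):4.2f}x")
```

For backStep this gives a ceiling of about 2, which matches the maximum theoretical speedup discussed in Subsection 6.1.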
CPU's clock in certain cases, is not the deciding factor, as it was disabled on the IC1 cluster. My theory is that, while in the case of changing the number of processes the change in communications does influence the runtime, the main speedup in both cases is due to the higher memory bandwidth available to the single cores, which allows for more effective processing. As the work by Michael Moylesa, Peter Nash and Ivan Girotto [30] hints, these tests show that there is an optimal number of cells per process, where the communication and I/O overhead, the processing speed of the single threads, and the memory bandwidth are in an optimal balance. With the limited data I cannot make a prediction about how many cells lead to the most efficient execution time. It also seems to depend on the application that is run. I would have to do more tests to effectively reduce the search space to find the optimal distribution. Right now a guess for our system would be somewhere between 10000 and 30000 cells.

I also made tests with process pinning, to prevent processes from being scheduled on different processor cores during execution. The default of Intel MPI is to enable process pinning, while for OpenMPI it is to disable it. The tests in Figure 6.1b were made with Intel MPI, but the results with OpenMPI were quite similar. The "cell" in Intel MPI defines what it should consider as independent processor cells to bind processes to: unit means any logical CPU, c
a GPU cluster from NUDT which had 2.6 TFlops. We received the "best LINPACK on x86 cluster" award for this. Here are a few highlights of the HPCC run at the competition:

- 2.27 TFlops with HPL
- 109 GFlops with MPIFFT, the parallel MPI version of FFT
- SingleSTREAM add: 7.27 GB/s memory bandwidth with 1 process
- StarSTREAM add: 3.83 GB/s memory bandwidth with 16 independent processes per node
- 2.6 µs average ping-pong latency with InfiniBand
- 5.3 GB/s ping-pong bandwidth with InfiniBand

6 OpenFOAM

One of the required applications at the Cluster Competition was OpenFOAM [25] (Open Source Field Operation and Manipulation), which I was tasked to optimize. It is an open-source set of libraries, distributed and promoted by the OpenFOAM Foundation [26], that can be used to create Computational Fluid Dynamics (CFD) applications and comes with a wide variety of ready-to-use solvers and tutorial jobs. Solvers in OpenFOAM are programs that use the OpenFOAM libraries to calculate a certain kind of problem, like interFoam for incompressible fluids, and can use linear solvers like GAMG or PCG to solve linear problems.

In the following subsections, I will first present my research into optimizing OpenFOAM to use GPUs, then in Subsection 6.2 I give a quick description of which compilers I used. In Subsection 6.3, I describe the effects that careful process distribution and pinning have on OpenFOAM's performance, followed, in
rank interleaving. Channel interleaving distributes accesses to subsequent memory pages over the CPU's memory channels, and rank interleaving distributes them over the DIMMs' ranks, which are sets of DRAM (that save the data) attached to their chip select (which determines where the data is saved). This means that accesses to a sequence of memory pages will likely be distributed over all channels and ranks, thus hiding latency, as one module can already serve data while other modules are still processing their requests. I started a test with both set to 1, but canceled it after 900 s because the baseline with rank interleaving set to 8 and channel interleaving to 4 was 494 s; it seemed useless to continue at this point. I did not test how the performance changes if rank interleaving is set to 2, which was the number of ranks on our DIMMs.

Our 2-socket system was a NUMA system [15]; that means that both CPUs had their own local memory and could only access the other CPU's memory with a performance penalty. Since accessing the local memory is faster, it is important that programs are NUMA-aware, which means that they will try to allocate their data on the local node. Usually the Linux kernel allocates memory for newly created processes on the correct node; however, in case multiple threads are used, the application has to tell the kernel where to allocate memory by linking to libnuma. Our applications used MPI, which para
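One way to verify that allocations really land on the local node is to watch the kernel's per-node counters, the same numbers the numastat tool reports. A minimal sketch, assuming a Linux host that exposes /sys/devices/system/node; this is an illustration, not necessarily how the miss counts mentioned in Section 4.5 were gathered:

```python
# Read the kernel's per-NUMA-node allocation counters (numa_hit, numa_miss, ...)
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    # each numastat file contains "counter value" pairs, one per line
    stats = dict(line.split() for line in (node / "numastat").read_text().splitlines())
    print(f"{node.name}: numa_hit={stats['numa_hit']} numa_miss={stats['numa_miss']} "
          f"other_node={stats['other_node']}")
```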
their team changed a few times. I joined the team several months before the competition, and the members of the team for the last few months, who eventually went to the competition at the ISC'12, were Markus Grotz, Pascal Kraft, Robin Lutz, Claudia Pfreundt, Christian Winkler, and myself, Jos Ewert.

I was mostly concerned with the cluster's hardware, installing the operating system, and optimizing OpenFOAM, which was one of the benchmark applications, and I also configured the HPCC benchmark.

In this paper, I will describe what the team and I tested and planned in order to have the best cluster for the competition. Section 2 starts with the rules of the competition and ends with the conclusions we drew from these rules. In Section 3, I present the choices we had for our hardware and the optimizations I tried on our final system's hardware. Section 4 covers our software stack to manage and run the cluster. Section 5 contains a short presentation of our HPC Challenge Benchmark (HPCC) results, and finally OpenFOAM is presented in Section 6.

2 Rules of the Student Cluster Competition

For the competition we had to follow certain rules, which could be reviewed on the competition's website [3] and were later clarified but also changed in the FAQ [4].

Any team of 6 undergraduate students could compete with a cluster that was limited to a maximum of 3 kW power usage but was otherwise not restricted in software or hardware. We were free to use experime
000 cells per CPU core, which, according to our consultant, would utilize the CPU cores efficiently and simulate typical usage. I changed the amount of time simulated during the job to have a runtime of around ten minutes and repeated each run at least three times.

During the tests on the S2600CP motherboard, I had issues setting the Samsung RAM to the correct frequency and voltage and noticed strange variations in the test results after a reboot. For example, in Table 1 it is recorded that with the Kingston RAM and the 2.6 GHz CPU the power usage was 326 W, but with the same setup after a reboot the test needed 375 W and only took 546 seconds, leading to a work of 56.9 Wh. The power usage increased by around 50 W and the runtime decreased by around 50 s just by rebooting. We had no way of reading the sensor data provided by IPMI because the tools did not support this model of motherboard yet, so we could never verify what the actual voltage and memory clock were. There was no official support for DDR3-1600 at a voltage of 1.35 V on this motherboard [13], so I assume that the motherboard had conflicting data, with the RAM requesting 1.35 V but the spec requesting 1.5 V, and would change the frequency and maybe the voltage of the RAM randomly and thus change the results. As OpenFOAM's performance is bound by the memory bandwidth [14], these effects were quite significant. This issue did not appear with the X9DRi-F motherboard from Supermicro, wh
6.2 Compiler
6.3 Process Distribution and Pinning
6.4 Other Notes
6.5 Results Compared to our Competitors
7 Conclusion
References

1 Introduction

The Student Cluster Competition or Student Cluster Challenge (SCC) is a biannual event organized by the HPC Advisory Council where international teams of 6 undergraduate students build and manage their own small cluster. The clusters are presented at the Supercomputing Conference (SC) in the USA or at the International Supercomputing Conference (ISC) in Germany. During these conferences the clusters are benchmarked with HPCC, a collection of benchmarks including LINPACK, which is also used to rate the TOP500 supercomputers [1], and several applications, "to demonstrate the incredible capabilities of state-of-the-art high performance cluster hardware and software" [2]. The competition was held for the first time at the ISC in 2012, where the competing teams were the University of Colorado and Stony Brook University from the USA, Tsinghua University and NUDT from China, and the Karlsruhe Institute of Technology (KIT) from Germany.

The Engineering Mathematics and Computing Lab (EMCL) at the KIT took part in the competition for the first time, and during the approximately 9 months of preparation for the SCC
Mellanox ConnectX-3 VPI FDR InfiniBand card
- 1 Enermax Platimax 500 W power supply

We received the Supermicro boards after the deadline for the submission of the final architecture, so we based the decision on how many nodes we needed on the tests from the S2600CP system with the 1.8 GHz CPU. We calculated that we wanted around 5% safety towards the power limit; we hoped to gain another 5% in power usage by using a better power supply and another 5% by using the Samsung RAM at 1.35 V. In addition we reserved around 200 W for the peripherals, like the switches and the head node. By calculating

(3000 W - P_peripherals) / (P_OpenFOAM * 1.05 * 0.95 * 0.95) = (3000 W - 200 W) / (221 W * 1.05 * 0.95 * 0.95) ≈ 13,

we determined that our power limit could support 13 nodes, a Mellanox SX6025 36-port 56 Gb/s InfiniBand switch, an Ethernet switch, and one head node used to control the cluster. The head node had one Intel Core i7 processor, 16 GB of RAM, and a 600 GB SSD. In total we had 208 cores, 1.6 TB of RAM, and 3.3 TB of storage for computing.

Our final system had enough buffer in power to accommodate another node; unfortunately it was too late to add another node, and we believed that we could no longer change the architecture of our cluster. We assume that this was because the final power supplies were more power efficient than we anticipated, compared to the one we had to use during our tests.

To optimize the hardware I briefly tested settings on channel interleaving and
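A quick sanity check of the node-count estimate above; the wattages and correction factors are the ones stated in the text, the function itself is only an illustration:

```python
# Node-count estimate from the power budget described above.
def max_nodes(limit_w, peripherals_w, node_w, safety=1.05, psu_gain=0.95, ram_gain=0.95):
    # the safety factor inflates the measured per-node draw,
    # the two gain factors shrink it (better PSU, 1.35 V Samsung RAM)
    effective_node_w = node_w * safety * psu_gain * ram_gain
    return int((limit_w - peripherals_w) // effective_node_w)

print(max_nodes(3000, 200, 221))  # -> 13 nodes, matching the calculation in the text
```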
Subsection 6.4, by some small observations I made while preparing OpenFOAM, and I conclude this section by comparing our results with those of our competitors in Subsection 6.5.

6.1 OpenFOAM on GPU

Early plans were to use a hybrid solution, so I investigated several ways to optimize OpenFOAM on the GPU. I found several implementations: an open-source library called ofgpu [27] and two proprietary solutions, SpeedIT [28] and Culises [29]. All of them had the drawback that they did not support drop-in replacements for all the linear solvers, which in our case was problematic, as we did not know which problems would be run at the competition. It was also unclear if we could use different linear algebra solvers than the ones defined in the system/fvSolution file provided during the competition. In the worst case we could not modify the problem to suit our needs, for example by replacing GAMG with PCGGPU as a linear solver and the AMG preconditioner. At least SpeedIT needed to use the binaries provided by the vendor for the solvers; for example, we would have to use their binary version of interFoam to be able to use their library. This could mean that in some cases we would have to run the problem on CPUs, as we either lacked binaries for the solver or the linear algebra solver, leaving the GPUs idle. Moreover, no data on speedup provided by the GPU solution providers was convincing enough to take such a risk. This was the main reason we decided against the usage of GPUs fo
ame to the conclusion that we would have had an underperforming fabric that used more power than the actual processors, thus nullifying any kind of efficiency gains we could have had by using these processors.

[Figure 3.1: The power usage (W) over time (s) of the S2600CP and Jefferson Pass systems, showing the 1.8 GHz CPU with Samsung RAM on the S2600CP, the 2.6 GHz CPU with Samsung RAM on the S2600CP, and the 1.8 GHz CPU with Samsung RAM on the Jefferson Pass system.]

We also considered hybrid approaches combining CPUs and GPUs, and we noticed that software support for CUDA from Nvidia was more mature and widespread than for OpenCL used by AMD. We could gain Nvidia as a sponsor for several Tesla M2090 cards to test on our system and applications. As we evaluated the applications, we realized that only LINPACK would be able to efficiently use GPUs without extensive modification of the source code, which was beyond our team's capabilities as we lacked manpower and time. While there were several implementations for GPU support for our applications, most of them were in early stages and had rather large drawbacks. For example, some only supported a single GPU, which could be an issue if the data that needed to be proces
b Walkup. Blue Gene/P System and Optimization Tips. IBM Watson Research Center.
[8] TOP500 List June 2011. http://www.top500.org/list/2011/06/100
[9] AMD Opteron 6100 models. http://products.amd.com/en-us/OpteronCPUResult.aspx?f1=AMD+Opteron%E2%84%A2+6100+Series+Processor
[10] HPC Advisory Council. OpenFOAM Performance Benchmark and Profiling. http://www.hpcadvisorycouncil.com/pdf/OpenFOAM_Analysis_and_Profiling_Intel.pdf, October 2010.
[11] HPC Advisory Council. Interconnect Analysis: 10GigE and InfiniBand in High Performance Computing. http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf, 2009.
[12] Mellanox Technologies. 216-Port InfiniBand FDR Switch Platform Hardware User Manual. 2012.
[13] Intel Server Board S2600CP Family Technical Product Specification.
[14] Massimiliano Culpo. Current Bottlenecks in the Scalability of OpenFOAM on Massively Parallel Clusters. Technical report, Partnership for Advanced Computing in Europe, 2011.
[15] Supermicro. X9DR3-F / X9DRi-F User's Manual.
[16] Operating System share for 06/2011. http://top500.org/charts/list/37/os
[17] SystemImager. http://systemimager.sourceforge.net
[18] Parallel BZIP2. http://compression.ca/pbzip2
[19] FraunhoferFS. http://www.fhgfs.com/cms
[20] Zabbix. http://www.zabbix.com
[21] SLURM: A Highly Scalable Resource Manager. http://www.schedmd.com/slurmdocs
[22] Environment Modules Pr
bout 90 GB, there was an issue with the Linux kernel not being completely NUMA-aware. The caches were allocated in a non-uniform way over the two NUMA nodes, which eventually led to new memory allocations, for example by an application, on one node having to be moved to the other, thus creating a NUMA miss. This large number of NUMA misses led to a significant drop in performance, which increased OpenFOAM's runtime from about 500 s to 1200 s. Clearing the caches after each job resolved the issue and should not have impacted the performance of subsequent jobs, as they usually had different data. However, this was only a temporary fix, as a large enough job could fill up the file system caches and again cause the same problem.

4.6 Environment Modules

To manage the different installations of various versions of libraries, MPI, compilers, etc., we used Environment Modules, an open-source project which "provides for the dynamic modification of a user's environment via modulefiles" [22]. For example, we installed various versions of GCC into their own directories and configured modules to change the PATH environment variable to point to the version of GCC that was selected. It could also be used with libraries by changing LD_LIBRARY_PATH. We also configured Intel MPI by setting its configuration environment variables in its modulefile. Thanks to modules we could easily and quickly switch compilers or MPI implementations and build the neede
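As an illustration of how such a modulefile can look, a minimal sketch for one of the GCC installations; the install prefix and version are made up, not the ones actually used on the cluster:

```tcl
#%Module1.0
## gcc/4.7.1 -- hypothetical modulefile for a GCC installed under its own prefix
set prefix /opt/gcc/4.7.1

prepend-path PATH            $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib64
prepend-path MANPATH         $prefix/share/man
```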
cations. The tuning job that I started on OpenFOAM was canceled after 14 hours, with a job to be benchmarked that had a runtime of 10 minutes. I assume that a good test for the autotuner would be 2 time steps in OpenFOAM, to ensure some kind of communication, and maybe a step that writes to disk too. However, reducing OpenFOAM's grid resolution (to shorten the time) is detrimental to the results, as it does not simulate typical usage. The default settings created by the autotuner, which were not specific to OpenFOAM, had a negative impact on OpenFOAM: there was a 2-3% increase in runtime.

5 HPCC and LINPACK

The HPC Challenge Benchmark (HPCC) [23] is a benchmark used on clusters to test a wide variety of aspects that are important for a cluster's performance, like the raw processing power, but also the performance of the network and memory. It is a collection of benchmarks. To measure processing power it includes HPL, which is the LINPACK benchmark, DGEMM, and FFT. LINPACK measures the floating point rate at solving a linear system of equations, DGEMM measures the floating point rate of execution of double precision real matrix-matrix multiplications, and FFT measures the floating point rate of execution of the Discrete Fourier Transform. To benchmark the memory, HPCC includes STREAM to measure sustainable memory bandwidth and RandomAccess for the rate of integer random updates of memory. Finally, it also uses PTRANS to test the tota
ch a task. In addition to the hardware problems, the sheer complexity of this setup and the rules of the competition, which meant no powering down the system, made such an effort impossible.

We evaluated interconnects and found studies from the HPC Advisory Council [10, 11] that show that InfiniBand had advantages even at smaller-scale clusters and allowed for a faster execution of jobs. We realized quickly that a specialized low-latency solution like InfiniBand was preferable over Ethernet.

While looking at low-power CPUs, like the ARM and Atom processors, we realized that they required a large number of nodes to reach the 3 kW limit, which needed a large InfiniBand fabric to communicate that might even use more power than the processing cores. For example, if we took an Intel Atom N2600 processor, we would likely have had less than 15 W of power usage per node: 3.5 W for the processor, several watts for the InfiniBand card, and a few watts for the rest of the electronics. Such a low power usage would mean that around 200 nodes could be supported; however, a switching system of that size needed around 2300 W [12]. In addition to the very high percentage of power that the interconnect would need, the small memory bandwidth of low-power CPUs, compared to server CPUs, would slow down the interconnect and processing in general. Moreover, it seemed that there were no, or only few, ARM processors that supported the needed PCI Express bandwidth for the InfiniBand cards. We c
d environment in scripts or our shell.

4.7 Intel MPI

For MPI we mainly used Intel MPI, as we expected a gain in performance from using its automatic tuning utility. Even though it is likely that most MPI implementations, if tuned correctly, have similar performance, we did not have the experience and time to do that properly. We chose to configure Intel MPI to have no fallback option if the InfiniBand fabric failed, to avoid running into unexpected performance issues if InfiniBand was not available. As with our file system setup, we preferred failure of our cluster over a complicated setup with unexpected problems. We used the SHM:DAPL fabric to use shared memory for communication on the local node and InfiniBand RDMA for communication with other nodes. For MPI to spawn itself on other nodes we configured it to use SLURM, to avoid having to distribute public keys to each node. As illustrated in Subsection 6.3, process pinning can lead to a significant performance increase, so we enabled it explicitly.

The autotuner's default configuration runs over a multitude of benchmarks and tries every fabric; this process can potentially take days. We limited it to SHM:DAPL and it took around 1 hour to finish on our entire cluster. The autotuner can also be used to tune individual applications; however, they should be configured to have a very short runtime, as it takes too long otherwise. Unfortunately, time was too short to use the MPI tuning utility on our appli
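The settings described above map to a handful of Intel MPI environment variables, which, as mentioned in Subsection 4.6, we set in the Intel MPI modulefile. A sketch of how that part of the modulefile could look; the variable names are standard Intel MPI ones, but the exact values and any further tuning options are assumptions:

```tcl
#%Module1.0
## intel-mpi -- hypothetical excerpt of the Intel MPI modulefile
setenv I_MPI_FABRICS          shm:dapl   ;# shared memory locally, DAPL/InfiniBand RDMA between nodes
setenv I_MPI_FALLBACK         0          ;# fail instead of silently falling back if InfiniBand is unavailable
setenv I_MPI_PIN              1          ;# enable process pinning explicitly
setenv I_MPI_HYDRA_BOOTSTRAP  slurm      ;# let SLURM start the remote processes, no SSH keys needed
```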
d interconnects like 10-gigabit Ethernet and InfiniBand. Finally, we considered ways to power down components depending on the currently needed resources.

While we evaluated the IBM PowerPC solutions, a Blue Gene/P, we quickly realized that their software stack, with a custom proprietary kernel [6] that only supported a subset of POSIX (for example no fork()) [7], and their XL compiler differed quite substantially from x86 clusters running a Linux operating system (OS) with the Intel or GNU compilers. Neither our team nor our valuable assistance from the Steinbuch Centre for Computing, who host x86-based clusters powered by a Linux OS, had any experience with this toolchain yet. Additionally, most of the open-source software we needed to run for the competition was only tested with GNU or Intel compilers on Linux. We realized that we would have to learn the particularities of the system and quite possibly spend valuable time, which we could otherwise use to optimize the system, debugging and modifying the source code to adapt the software to the new toolchain.

Researching the hardware side, we realized that the IBM systems were quite capable, with the Roadrunner system still in 10th place of the TOP500 [8], but their power efficiency of around 0.5 GFlops/W was not convincing compared to other systems that could reach over 1 GFlops/W.

We also looked into a Numascale system, which is a cluster connected with a 3d or 2d torus netw
e the most beneficial. As shown in Figure 3.1, using the same motherboard, the 2.6 GHz model peaked at about 325 W and the 1.8 GHz model at about 220 W. As power usage alone does not indicate the system's efficiency, I decided to use GFlops/W for LINPACK and Wh for the other software packages to gauge their efficiency.

In early tests we noticed that OpenFOAM and LINPACK were the most power-consuming applications. Initially LINPACK needed around 10% more power than OpenFOAM, which translated into 300 W on our entire cluster, while the difference between OpenFOAM and the other applications was around 5%, around 150 W on the entire system. Early tests also showed that our final nodes would each need around 230 to 380 W, and we were unsure how well we could tune the power usage by clocking down the CPUs or how that would affect the applications' performance. We decided to use OpenFOAM instead of LINPACK as the benchmark for power usage because we could potentially gain an additional node that ran at the maximum clock, whereas using another benchmark instead of OpenFOAM would likely not. We decided to clock down the CPUs during a LINPACK run, so we could have the maximum clock with OpenFOAM and the other applications. This arrangement would give us the necessary safety towards the power limit without sacrificing too much performance with the other applications.

I used the damBreak fine tutorial job of OpenFOAM and increased its resolution to have around 100
howed no change in power usage or performance.

After having issues with the S2600CP motherboard, we heard that there might be a developer motherboard, which would allow us to force certain voltages and clocks, built into a preview Jefferson Pass system. It is an integrated 2U four-node server system with a redundant 1200 W power supply, but we could only use one of the four nodes as we lacked parts. Unfortunately the BIOS did not allow us to set the voltages, and we experienced the same unpredictable behavior of the memory clock. Again I compared the numbers to a run not recorded in Table 1, but one that had a similar runtime using the same RAM and CPU on the S2600CP board, with 524 s at 236 W (34.5 Wh). The performance was about the same, as expected due to the very similar motherboard and other hardware; however, the power usage rose by about 10%, which is likely due to the system not being used efficiently with only one node.

With this Jefferson Pass system not solving our problems, I ran the remaining tests on the S2600CP motherboard again. With 1 Samsung DIMM on each CPU, the runtime of the OpenFOAM test case was around 1500 s at 161 W (67.1 Wh). Using 2 DIMMs per CPU, the runtime dropped to 840 s at 175 W and needed 40.8 Wh. Finally, with 4 DIMMs per CPU, the job was done in about 540 s, peaked at 210 W, and thus used 31.5 Wh. Later, after the system had been built, another test was done with 8 DIMMs per CPU; on that system the baseline with 4 DIMMs was a runti
ich could be forced to use the values for clock and voltage requested by the DIMM, and we could verify the RAM voltage with IPMI and the clock using the BIOS. In Table 1 I only recorded the most efficient tests.

We can see in Table 1 that using the 1.8 GHz CPU instead of the 2.6 GHz CPU in conjunction with the Kingston RAM results in a 25% increase in efficiency, and about 40% with the Samsung RAM. When clocking down the 2.6 GHz CPUs to 1.8 GHz, the runtime of the application was the same as on the 1.8 GHz CPU, but there was an increase of about 10% in power usage compared to the 1.8 GHz CPU.

When comparing the two tests with the Kingston RAM, we can see that there is barely a difference between the runtimes; however, the same tests with the Samsung RAM show a significant change in runtime. This is likely due to the random change in clock, as I also had a test with the Kingston RAM that needed 546 s to finish, thus showing a similar behavior to the tests with the Samsung RAM. This made it hard to compare results, and I decided to use the time that mirrors the behavior of the system with the Samsung RAM, that is 546 s at 375 W, leading to a work of 56.9 Wh for the Kingston RAM with the 2.6 GHz CPU. With that result instead of the one in the table, the runtime is decreased by roughly 10% when the Samsung RAM is used instead of the Kingston RAM, increasing efficiency by 14% on the 1.8 GHz CPU and 9% with the 2.6 GHz CPU. Comparing Samsung 8 GB DIMMs and 16 GB DIMMs s
ion I made during tests and was not explored much further.

6.5 Results Compared to our Competitors

A while after the competition we received the anonymized results. For OpenFOAM's square job we took 2nd place with our 33 s, and for backStep we took 3rd. Comparing our results to the team that was closest to us, I saw that we beat them with 33 seconds against their 51 s in square, but they beat our 169 s on backStep with 114 s. So we beat each other roughly by 35-40% runtime improvements.

I can only speculate on why that is. During the execution of our runs, our cluster was nowhere near the 3 kW we could use; the power usage was somewhere around 2.2 kW or less. This of course meant that we could have used the remaining 800 W for higher clocks, maybe more nodes too. I believe in this case, however, a higher CPU clock would have been more beneficial. Since only every second or even fourth core was used, each core had a much higher memory bandwidth. As I noticed, OpenFOAM is quite memory bound, which explains the only small runtime improvement in Table 1 with a higher CPU clock: a 45% increase in CPU clock led to around a 12% speedup when all 8 cores are used.

However, having twice the memory bandwidth would most likely have helped the cores perform closer to their maximum capacity, likely leading to a more significant speedup. I believe now that using CPUs with around 3 GHz would have benefited us here. The low usage of the CPU
ith the official OpenFabrics OFED stack included in SLES, I noticed its lack of support for FDR InfiniBand, so I switched to the Mellanox OFED stack, which is a modified version of the OpenFabrics OFED, to have FDR support.

4.2 Kite

Once the operating system with the OFED stack was installed on our head node, we configured Kite to be able to manage our cluster and reimage the nodes if needed. We configured our head node as DHCP server and PXE boot server as a prerequisite for Kite, and it also acted as a router for the compute nodes' internal Ethernet network and as subnet manager for the InfiniBand fabric.

All compute nodes were configured to boot into PXE, which they would either leave immediately to boot from the local hard disk or, in case a new OS image was available, use to run Kite. Kite uses SystemImager [17] to create and install a new image. SystemImager creates a copy of the entire system it is installed on and creates a compressed tarball with Parallel BZIP2 (PBZIP2) [18]. On bootup, the PXE environment first copies and executes a small image that includes a Bittorrent client (aria2c) to download a small system that will execute all the operations needed to install the final system. Bittorrent is used to reduce the load on the head node and avoid a bottleneck on its Ethernet connections by having each compute node distribute the data it has already received to other nodes. Once the small system is downloaded, it will f
l communications capacity of the network.

I configured HPCC according to an article published by Intel to use Intel MKL's BLAS and FFTW implementations [24]. HPCC can be configured to use larger problem sizes, by changing the "N" value of the configuration, to adapt it to the cluster's size. For most tests the only configuration needed is the "N" parameter, and most of HPCC's configuration file is for tuning HPL. Unlike most other benchmarks included in HPCC, HPL's performance is heavily influenced by the matrix size N. With N = 1000 we could only achieve 20 GFlops on one node; however, if it was increased to N = 100000, we could achieve over 200 GFlops. To configure HPL, we used a process distribution of P = 13 by Q = 16 due to our cluster's layout of 13 nodes with 16 cores each. At the competition there were issues with the power measurement, which did not measure the current voltage, so we could not run the cluster at full capacity, and we reached 2.27 TFlops with N = 100000 in HPCC and, in a dedicated HPL run, 2.37 TFlops with N = 250000. Once the size of N was reasonably large, tuning N (the matrix size) and NB (the block size used in the algorithm) could gain a few GFlops, but the improvements were a few percent at best.

We had the following results with our cluster at the competition. We only had about 2.7 kW at our disposal because of the power measurement issues. We took 2nd place in LINPACK with 2.4 TFlops and were only beaten by
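For reference, a common rule of thumb for choosing N is to let the N x N double-precision matrix fill a large fraction of the cluster's total memory, i.e. N ≈ sqrt(fraction x total RAM / 8 bytes). This is a generic heuristic, not the exact procedure used at the competition:

```python
import math

# Rule-of-thumb HPL problem size: the N x N double-precision matrix (8 bytes per
# element) should fill a large fraction of the cluster's total memory.
def hpl_problem_size(nodes, ram_per_node_gb, fraction=0.8):
    total_bytes = nodes * ram_per_node_gb * 1024**3
    return int(math.sqrt(fraction * total_bytes / 8))

# 13 compute nodes with 128 GB each, as in our final system
print(hpl_problem_size(13, 128))  # roughly 420000; the runs above used N = 100000 and N = 250000
```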
llelizes applications by creating new processes, and to libnuma, so we could safely assume that their memory was allocated on the correct node. This was confirmed by the low number of NUMA misses I observed (around 20 a day). Thus I chose to set the motherboard to NUMA mode as opposed to interleaved memory access or Sufficiently Uniform Memory Access (SUMA). However, I did not test if there was a difference in performance.

4 Software

While we were still deliberating our hardware setup, we had to choose what software stack would power our cluster. At the bare minimum we needed an operating system that supported our InfiniBand network, a job scheduler, and an MPI implementation. We also chose a cluster management system to be able to change our cluster's setup quickly, and a monitoring solution to be able to see issues and record performance. I present some of these software packages in the following subsections.

4.1 Operating System

Like most applications and drivers used in high performance computing, the required applications for the SCC were all thoroughly tested with the GNU Compiler Collection (GCC) and, due to Linux's ubiquity on clusters [16], had excellent Linux support. Linux systems are also heavily used at the Steinbuch Centre for Computing, and we decided to use the same system used by the KIT. It was based on SUSE Linux Enterprise Server 11 SP2 (SLES) and also includes their own cluster management system called Kite. After some testing w
me 511 s and increased to 525 s with 8 DIMMs. We more than doubled the efficiency by using 4 DIMMs. However, a slight increase in runtime is seen when more memory modules are used than there are channels. The CPU supports four memory channels to access up to four memory modules at the same time [15]. Because each DIMM, if plugged into the correct socket, takes one channel, it follows that four DIMMs use all four channels and lead to the best performance. The decrease in performance with more DIMMs than channels is due to higher congestion on each channel, which can lead to more memory accesses blocking each other, thus decreasing the memory's performance.

Thanks to the greater performance and energy efficiency of the Samsung RAM, we decided to use four 16 GB Samsung DDR3-1600 DIMMs for each CPU.

We had problems with the unpredictable behavior of the memory clock on both Intel systems, and we were not able to test the Jefferson Pass system under load, so we chose to try the X9DRi-F motherboard from Supermicro. We managed to have the system running reliably with the Samsung RAM running as DDR3-1600 at 1.35 V, and we had the most efficient test result with this setup. Our final setup was as follows, with each node having:

- a Supermicro X9DRi-F motherboard
- 2 Intel Xeon E5-2650L CPUs at 1.8 GHz
- 8 Samsung Green Memory DDR3-1600 16 GB registered DIMMs, forming a total of 128 GB
- 1 Samsung Green Memory SSD with 256 GB storage
- 1
ntal or unreleased software and hardware. During the 3 days at the ISC'12, no one was allowed to touch or power down the systems. Several benchmarks were executed to determine the winners of the "Overall Winner" and the "Highest LINPACK" trophies; in addition, a "Fan Favorite" trophy was awarded to the team with the highest number of votes from the ISC's audience. The benchmarks were the HPC Challenge benchmark (HPCC), which includes LINPACK (HPL), four known applications, which were OpenFOAM, CPMD, CP2K, and Nemo, and also two unknown applications that were revealed during the event at the ISC. HPCC defined the hardware baseline for the rest of the benchmarks, making only benchmarks valid that were executed on the exact same hardware. HPCC also produced the LINPACK score, which is the number of floating point operations per second the system can perform, and counted for 10% of the overall score; the performance of the six applications added 80% to the score, and the last 10% were interviews taken during the competition. The applications' performance was at first to be gauged by counting the number of jobs a cluster could complete over the course of the day. However, this was changed only a few weeks before the competition to measuring the runtime of a few selected jobs that could, if necessary, be executed several times with different configurations to achieve the best time possible. In addition, HPL could be executed on its own t
o achieve a higher LINPACK score; that was, however, only used for the "Highest LINPACK" trophy.

The rules stated that the power usage was measured with an APC AP7821 PDU (power distribution unit) and also enforced with a circuit breaker limit of 3 kW; no power source other than the provided circuit was allowed for the cluster. Each team additionally had a separate circuit for systems and equipment that were not actively involved in the execution of benchmarks, like screens or monitoring systems. Differing from the rules, during the competition at the ISC we were informed that we could have power usage peaks between 3 kW and 3.1 kW that could last for up to one minute and that the circuit breaker would limit at 3.5 kW.

Teams could apply for the competition until Friday, October 21, 2011, and, if they were accepted into the competition, had to submit the final architecture of the cluster before Monday, April 30, 2012. The competition itself ran from Monday, June 18, 2012 to Wednesday, June 20, 2012.

After evaluating the rules, we tried to find a design that was capable of executing the highest number of jobs and, if necessary, of executing in parallel those jobs whose datasets were too small to utilize the cluster efficiently. This meant that we did not try to achieve the best possible runtime of a job, but the highest possible efficiency, in order to have a greater number of nodes in our cluster. As we were not allowed to touch or shut down part
oject. http://modules.sourceforge.net
[23] HPC Challenge Benchmark. http://icl.cs.utk.edu/hpcc
[24] Vipin Kumar. Use of Intel MKL in HPCC benchmark. http://software.intel.com/en-us/articles/performance-tools-for-software-developers-use-of-intel-mkl-in-hpcc-benchmark, May 2012.
[25] OpenFOAM. http://www.openfoam.com
[26] The OpenFOAM Foundation. http://www.openfoam.org
[27] GPU v0.2 Linear Solver Library for OpenFOAM. http://www.symscape.com/gpu-0-2-openfoam
[28] SpeedIT. http://speedit.vratis.com
[29] Culises. http://www.fluidyna.com/content/culises
[30] Michael Moylesa, Peter Nash, and Ivan Girotto. Performance Analysis of Fluid-Structure Interactions using OpenFOAM. Technical report, Partnership for Advanced Computing in Europe, 2012.
[31] FluiDyna. Revolutionising high performance computing: Employing FluiDyna's hard- and software solutions based on Nvidia Tesla. 2012.
ompiler has completely implemented the standard yet. GCC is quite popular, so it is likely that OpenFOAM uses GCC's incomplete implementation of C++11 for testing. Recent additions to Intel's compiler led to a more complete implementation of C++11, and broke the build for OpenFOAM. It is often advisable to start out testing with GCC and then, when you have more experience with the application, test with other compilers.

Tests with both icpc and g++ showed no difference in execution speed; however, as we had more information about tuning with icpc, I decided to use the Intel suite. It seems that the defaults of OpenFOAM were quite well tuned in the "Opt" build, as any change to the compiler flags resulted in a drop in performance. For example, using -xHOST instead of the default -xSSE3 to tune the binary to the machine resulted in an increase of 2% in runtime.

6.3 Process Distribution and Pinning

I did several tests on the KIT's IC1 cluster and on our cluster, changing the distribution of processes. During the initial tests on the university's cluster, I kept the same number of processes but used a different distribution over the cluster.

[Figure 6.1 (panels): (a) runtime (s) of square and backStep over 208, 104, 52, and 26 processes; (b) runtime (s) with process pinning off, cell=unit, cell=none, and cell=core, for 8 and 16 processes.]

With all 8 cores used per
only a small performance penalty; however, for each migration the scheduler has to interfere, and a greater share of time is not used to process data. Moreover, an uneven spread of processes will likely impact the performance greatly, as they have to be synchronized and wait for the slowest process to finish, and thus the processes on the congested CPU would slow everything down.

6.4 Other Notes

I gained a 2-3% speedup using FhGFS instead of NFS, and by linking to a library provided by the FhGFS developers that buffers calls to stat() I could gain a few seconds in runtime. These effects would probably be more noticeable on a larger system with more I/O.

I had a problem with the ptscotch decomposition method in the decomposeParDict, which is used by snappyHexMesh to map a 3d model to the blockMesh of a single process. If it was used with more than 128 processes, snappyHexMesh hung silently or, if I was lucky, crashed. So far I have found no solution. Neither changing the resolution nor using a multiLevel distribution, where one level for example has 4 domains that each get split into 64, solved the problem. However, at least now there is an error message and the job reliably crashes with a stacktrace, which should ease debugging.

I also want to note that the better our RAM, the closer the power usage of OpenFOAM approached that of HPL, so we could possibly gauge the efficiency of OpenFOAM by measuring the power usage. This, however, was just a small observat
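For reference, the decomposition method mentioned above is selected in system/decomposeParDict. A minimal sketch of the relevant entries; the subdomain count and header details are illustrative, not the exact dictionary used at the competition:

```
// system/decomposeParDict (sketch)
FoamFile
{
    version 2.0;
    format  ascii;
    class   dictionary;
    object  decomposeParDict;
}

numberOfSubdomains 104;   // e.g. 104 processes, as in the square runs in Subsection 6.3

method ptscotch;          // the method that caused the snappyHexMesh hangs beyond 128
                          // processes; scotch, simple, or hierarchical are alternatives
```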
ore means any physical core, and none means that we did not use that parameter specifically. The difference in runtime is quite noticeable, with an improvement of about 25% for both 8 processes and 16 processes when process pinning was enabled. Also, the standard deviation between the runtimes dropped from around 20 s with no pinning to 1 s with pinning. The cell parameter did not influence the runtime, and if I looked at the distribution using htop, it was clear that in both cases all processes were mapped to physical cores and distributed evenly over the processors.

With the default Intel MPI setup, processes were wandering between the two logical processors on the physical core. The standard deviation between runtimes was higher, and so was the average runtime. Defining anything as a cell mitigated that migration and stabilized the runtimes. Also, if I used the defaults with anything more than 8 processes and less than 16, the processes were scheduled in a way that fills up one processor (in our case 8 processes for 8 cores) with the remaining processes on the other. I avoided this by using the map scatter parameter to spread the processes out. Unfortunately, there was not enough time to thoroughly test the impact these two modifications had on the runtime. I theorize, however, that it was beneficial to avoid both scenarios. In the case of just switching logical processors on the same core, the caches are still valid, so there should be no or
ork that is completely transparent to the operating system and thus behaves like a large non-uniform memory access (NUMA) system. This makes it possible that software does not need MPI or other systems to communicate between the nodes, and applications scale up by just creating more threads like they would on a single system. In our case, however, that was no benefit, as all the required software already used MPI to communicate between nodes.

Researching the hardware, we found out that the network adapters were connected via AMD HyperTransport. This limited that technology to AMD processors, which, as our research showed, often had a higher thermal design power (TDP) [9] than Intel processors and did not support the AVX feature, which greatly improves the floating point operation rate.

After having rejected the turnkey solutions rather early on, we looked at more custom configurations, and in early stages of the design we had several proposals to power down idle pieces of hardware. We could use controller chips with each CPU that could measure the power usage of the processor or even power it down, giving us a detailed view of which parts of the system were active and which parts could be shut down. For example, we could turn off GPUs for applications that could not utilize the GPU. However, we were uncertain how the hardware and software would react to components being powered up and down during runtime, as most of our hardware was not designed for su
ormat the drives and run any predefined scripts, for example setting the hardware clock, before mounting the drives. The actual system's image is also distributed with Bittorrent and is unpacked with PBZIP2 onto the local storage, and scripts are executed to configure the nodes. After the system has been successfully installed, the node will reboot and start from the local drive.

Thanks to this system we could have our final cluster up and running in less than 20 minutes, as we had prepared our head node a few weeks prior to getting the system. All it took was to plug in the Ethernet cables, add the compute nodes' MAC addresses to the configuration, and start the imaging process.

4.3 FhGFS

For our distributed file system we used FhGFS, or FraunhoferFS, developed by the Fraunhofer Competence Center for High Performance Computing. It is a closed-source but free-to-use file system. One of its advantages is "its distributed metadata architecture, that has been designed to provide ... scalability and flexibility" [19]. It can use InfiniBand to communicate or fail over to Ethernet. We configured it to have data and metadata striped over all nodes excluding the head node, which only acted as a client, as its lower performance in InfiniBand throughput and memory bandwidth could have impacted the compute nodes. We did not include any kind of redundancy, as we preferred simplicity over safety for our small system. As per the recommendation of the FhGFS de
r OpenFOAM.

In retrospect, I created a few profiles using Valgrind's Callgrind utility on some of the provided example applications; it counts the number of calls to a function and estimates the cycles needed to execute it. OpenFOAM uses a blockMesh to map the problem to cells that change their properties; these could be pressure, heat, velocity, or anything the user wishes to simulate. It also divides the problem into discrete time steps; for example, you could have it calculate each millisecond and write the results to disk in steps of 10 ms. After each simulated time step, the data is updated and then the processing of the next time step begins, resulting in many iterations of the algorithm.

I profiled several tutorial jobs: icoFoam cavity with 400 cells, interFoam laminar damBreak with around 2300 cells, and solidDisplacementFoam plateHole with 1000 or 100000 cells. I also profiled the larger applications of the competition; these were backStep with around 500000 cells and square with 3 million cells, both using the icoFoam solver. Each was executed with just one process, and on the latter two I only executed 1 time step. This means that no communication step happened, which was still sufficient to determine if a GPU would be of benefit. I wanted to determine what kind of computations are performed and if, according to those, a GPU would be a viable choice. And in case that was true, I would have to profile the applications with more communications
s of the cluster, we could not build specialized nodes for each application; thus we had to prioritize which application we would optimize the power usage for. After some testing, we decided that OpenFOAM would be our target application, as it had significantly less power usage than LINPACK but only slightly more power usage than the other applications. This is explained more in depth in Subsection 3.2. We assumed that we could not exceed the 3 kW limit without risking the circuit breakers shutting down the system, so we always considered the highest peak power usage while evaluating the applications and the system.

3 Hardware

We had nearly free rein over what kind of hardware we would use, so we spent much time trying to find the best hardware setup to suit the competition's rules and applications. The following presents the first systems we considered in Subsection 3.1 and the various setups for the final system I tested in Subsection 3.2.

3.1 Initial Systems

During the preparation for the competition we considered several systems, including turnkey solutions like a large shared-memory cluster from Numascale [5] or a PowerPC (PPC) cluster from IBM. We also considered more custom setups with low-power CPUs, like an ARM-based cluster or an Intel Atom-based cluster, as opposed to regular server CPUs like the Intel Xeon. We looked at hybrid approaches for our custom setups, combining x86 processors and GPUs from either Nvidia or AMD. We evaluate
…sed did not fit into the memory of one card. We also saw the issue that we could likely not use our GPUs on the unknown applications unless they already had the support built in; we did not believe that we could modify these applications fast enough to produce benchmark results. You can find a more thorough analysis of OpenFOAM on the GPU in Subsection 6.1.

Apart from the software issues, we had problems getting our systems into a stable state when we used the GPUs. For example, the BIOS crashed, the InfiniBand network randomly failed, or the operating system had issues with a GPU installed in the system. The issues would only cease when we removed the card. The support staff at Intel speculated that the firmware of the motherboards and the drivers for their new PCI Express controller integrated into the CPU were not yet completely tested and that we had triggered a bug. Lastly, we measured that the power draw of the GPU solutions was quite high, somewhere around 250 W per card, which was more than an entire node with two CPUs. This meant that the GPUs needed to provide a rather significant speedup to justify their power usage.

With all these issues and unconvincing performance predictions from vendors, we decided against the use of GPUs.

3.2 The Intel Xeon System

After having evaluated various systems, we decided to use a setup with two Sandy Bridge Intel Xeon CPUs on a regular server motherboard, FDR InfiniBand from Mellanox, and…
…up. In the case of backStep, the maximum theoretical speedup, if we could accelerate the linear algebra solver to be infinitely fast, would be two. However, our tests showed that the power usage of one GPU was about the same as an entire node with only CPUs, so in this case there would be no gain in efficiency if we were running the linear algebra solver on a GPU. Tests done by FluiDyna [31] showed that with a grid of 3M cells on one CPU, about 55 % of the time was spent in the GAMG solver. Other linear solvers on the same problem made up a larger percentage of the execution time; however, it is unclear how those affected the overall runtime. Also, the problems that are actually calculated by one core are much smaller than the total number of cells, as the grid typically gets split up into about 30 to 50 thousand cells per core, which would likely reduce the share of time spent in the linear solver. Even ignoring communication and I/O, and using the problems calculated at the competition, it is unlikely that we would have had a gain in efficiency with GPUs.

6.2 Compilers

Like many open source projects, OpenFOAM can easily be compiled with GCC (g++ for C++), as most testing is done with the free GNU compilers. I ran into a few issues with the Intel composer (icpc for C++), where versions after 12.1.0 would break the build. This is due to OpenFOAM's extensive use of the only recently released C++11 standard (most likely actually based on a draft of C++0x), as no c…
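Returning to the GPU discussion above: the maximum theoretical speedup of two is an instance of Amdahl's law, where only a fraction p of the runtime is accelerated and the rest is not. The small sketch below uses the rough figures quoted above as inputs (about half of the backStep runtime in the linear algebra solver, and a GPU drawing roughly as much power as a whole CPU-only node); these are illustrative values, not exact measurements.

```python
#!/usr/bin/env python3
"""Sketch: Amdahl-style bound for offloading the linear solver to a GPU."""

def speedup(p: float, s: float) -> float:
    """Overall speedup if a fraction p of the runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

p = 0.5             # share of runtime in the linear algebra solver (backStep, rough figure)
power_factor = 2.0  # node + GPU draws about twice the power of a CPU-only node

for s in (2.0, 5.0, 10.0, float("inf")):
    sp = speedup(p, s)
    # Energy efficiency only improves if the speedup outweighs the extra power.
    eff = sp / power_factor
    print(f"solver {s:>4}x faster -> overall {sp:4.2f}x, "
          f"perf/W vs. CPU-only: {eff:4.2f}")
```

Under these assumptions even an infinitely fast solver only reaches break-even in performance per watt, which is the efficiency argument made above.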
As per the recommendation of the FhGFS developers, we formatted the file system's data storage with XFS and the metadata storage with Ext4. We did not expect to gain much in performance due to the small size of our system, but nonetheless we saw a 2.3 % decrease in runtime for OpenFOAM compared to NFS.

4.4 Zabbix

To monitor the cluster we used Zabbix [20], an open source monitoring solution. We monitored our Ethernet switch to see if there were any bottlenecks or other performance issues. Unfortunately our InfiniBand switch was unmanaged, and we did not find another reliable way to check the performance and utilization of the InfiniBand network. With Zabbix we could provide live graphs of the current voltage and power usage by polling our power measuring device and the competition's official power measuring PSU. We could also monitor temperature sensors, CPU load, RAM usage, file system usage, and much more. We configured Zabbix to have an agent running on each node that reported to a Zabbix proxy installed on our head node. This proxy sent the information to the main Zabbix server running on one of our computers to plot graphs, raise alerts, and provide other information such as the layout of our network. Tests on one node running LINPACK with and without the activated agent showed no difference in performance.

4.5 SLURM

Once the cluster was running, we used SLURM [21] as our resource manager. It is an open source project from LLNL, SchedMD, HP, and many others. Users submit jobs and indi…
…would have triggered its Turbo Boost, increasing the clock, and thus could have used the increased memory bandwidth more effectively. I did not, however, test how the 2.6 GHz CPUs would behave compared to the 1.8 GHz ones when they are only partially used, as we never intended to use only part of the cluster's capacity.

7 Conclusion

Finally, I can say that my goal of presenting a working cluster with all of the known applications running was fulfilled. None of us had any prior experience with clusters, so I am even more astonished that we managed to do this in such a short time. Our final cluster was only ready two days before the competition due to bottlenecks in getting the parts. Despite all these issues, we won the "Best LINPACK on x86" and "Fan Favorite" awards.

During the many tests, several effects manifested themselves. It seems apparent that high CPU clocks do not mean best efficiency, as shown by OpenFOAM running on the cluster and using every core. But if the workload is small enough that you can reduce the number of cores needed, higher clocks might be beneficial. The greatest gains in performance can be achieved by carefully choosing the hardware best suited to the application and effectively distributing the applications' processes over the system. Compiler optimizations might help if the application comes with a bad default, but often compiler optimizations risk falsifying the application's results; LINPACK, for example, does not compute with…