Figure 13: LU performance/Watt (performance/Watt relative to the PowerEdge M620; higher is better; M620 2.7GHz 256c, R820 2.7GHz 128c, M420 2.3GHz 512c).

Figure 13 plots the energy efficiency of the three clusters when executing the LU benchmark. The observation is that the PowerEdge M620 cluster and the PowerEdge M420 cluster have similar energy efficiencies; the differences in the values are within the statistical variation. The PowerEdge R820 cluster shows slightly better energy efficiency than the other two clusters because of the lower core count. The metric used here is rating/Watt, which translates to the number of jobs of LU that can be run during a period of one day per Watt of power consumed.

Figure 14 shows the energy efficiency of WRF. The PowerEdge M420 cluster has 13 percent better energy efficiency when compared to the PowerEdge M620 cluster. The performance metric used here is average time step/Watt. The lower-wattage processors and the lower memory configuration account for this improvement in energy efficiency on the PowerEdge M420 cluster. The PowerEdge R820 is 12 percent less energy efficient when compared to the PowerEdge M620.
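The rating/Watt metric described above (jobs completed per day, divided by cluster power draw) can be sketched as follows. The runtime and wattage figures in the example are hypothetical placeholders, not measurements from this paper.

```python
def rating(runtime_seconds: float) -> float:
    """Rating = number of times a job can run in one day (86400 s)."""
    return 86400.0 / runtime_seconds

def rating_per_watt(runtime_seconds: float, cluster_watts: float) -> float:
    """Energy-efficiency metric of the form used for LU and MILC: rating/Watt."""
    return rating(runtime_seconds) / cluster_watts

# Hypothetical example: a 1200 s LU run on a cluster drawing 5000 W
eff = rating_per_watt(1200.0, 5000.0)
print(round(eff, 4))  # 86400/1200 = 72 jobs/day; 72/5000 = 0.0144
```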
In this technical white paper, different server designs are compared using several high performance computing workloads. At a cluster level, a quantitative study is undertaken, analyzing both performance and energy efficiency across the different server models. The paper analyses the measured results and concludes with recommended configurations for each application.

1 Introduction

The Dell PowerEdge 12th generation server line-up, armed with the latest processors from Intel, has been well received by the High Performance Computing (HPC) community. The new servers provide more choice than before; however, with these choices comes a need for quantitative recommendations and guidelines to match an application's requirements to the ideal cluster configuration. The latest Dell servers can support the Intel Xeon processor E5-2400 product family, the Intel Xeon processor E5-2600 product family, or the Intel Xeon processor E5-4600 product family, giving HPC users numerous choices to configure a server for specific CPU, memory, and I/O requirements. It is a daunting, although necessary, task for HPC users to understand the performance characteristics of each of these server models in order to make well-informed decisions regarding which server platform is best for their purposes. This white paper analyses the performance and power consumption characteristics of these server platforms.
If this extra read were counted by the benchmark, the effective memory bandwidth of the PowerEdge R820 would be approximately two times that of the PowerEdge R620. This study uses the actual measured memory bandwidth as reported by Stream; an application may have the same behavior and incur the same RFO penalty, so this measured value provides a baseline for the analysis.

At 4.8 GB/s per core, the PowerEdge R620 has the highest memory bandwidth per core, whereas the memory bandwidth per core on the PowerEdge R820 is measured to be 30 percent lower. Because the PowerEdge M420 has three memory channels, compared to the four memory channels of the PowerEdge R620 and PowerEdge R820, its total memory bandwidth is 22 percent lower and its memory bandwidth per core is 25 percent lower than the PowerEdge R620.

Figure 4: Memory bandwidth (bandwidth per core: R620 4.8 GB/s, R820 3.4 GB/s, M420 3.7 GB/s). BIOS options: System Profile set to Max Performance; C-states and C1E disabled.

4.2 HPL

Moving on to application-level performance, High Performance Linpack (HPL), a popular computationally intensive application, is analyzed first. The problem size used for all the runs is maintained at 90 percent of the entire memory, as described in Table 3.
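The per-core bandwidth deltas quoted in the memory-bandwidth discussion above follow directly from total bandwidth divided by core count. A quick check, using the Stream totals reported elsewhere in this paper (78 GB/s for the 16-core R620, 110 GB/s for the 32-core R820):

```python
def per_core(total_gbs: float, cores: int) -> float:
    """Memory bandwidth per core from a system-level Stream total."""
    return total_gbs / cores

r620 = per_core(78.0, 16)    # ~4.9 GB/s per core
r820 = per_core(110.0, 32)   # ~3.4 GB/s per core

# Percent drop in per-core bandwidth relative to the R620
drop = 100.0 * (1.0 - r820 / r620)
print(round(r620, 1), round(r820, 1), round(drop))  # 4.9 3.4 29
```

This reproduces the "approximately 30 percent lower" figure for the R820 within rounding.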
5 Conclusion

This white paper provides a quantitative analysis of the performance and energy efficiency of three different Intel Xeon processor E5 family (Sandy Bridge) based clusters using a sample of HPC workloads. Table 4 and Table 5 summarize the performance and energy efficiency characteristics of each HPC application on the three types of clusters, presenting the results of this study in an easy-to-read format. The Dell PowerEdge M620 cluster is used as the baseline for performance and energy efficiency comparisons. Clusters of the same size, for example with the same number of cores, are used for performance comparisons. As described in Power consumption and energy efficiency, power and energy efficiency measurements were conducted on clusters of different sizes; Table 5 explicitly notes the cluster size used for power measurements for easy reading. Details of the test bed were provided in Test bed and applications.

Table 4. Results summary: performance
  HPL:   M620 similar performance to the R820; R820 similar performance to the M620; M420 performance 17% lower
  LU:    M620 best performance; R820 performance 8% lower; M420 performance 11% lower
  WRF:   M620 best performance; R820 performance up to 20% lower; M420 performance up to 9% lower
(Table 2, continued)
  PowerEdge M420 cluster: 32 servers, 512 cores; Turbo: 5 bins when 8 cores active; 1 x 50GB SSD (PERC H310); BIOS 1.2.4; iDRAC 1.20.20 Build 24; Mellanox OFED 1.5.3-3.0.0
  PowerEdge M620 cluster: Mellanox ConnectX-3 FDR with a Mellanox M4001F FDR IO module for the PowerEdge M1000e blade chassis; 16 servers, 256 cores; Turbo: 4 bins when 8 cores active; 1 x 146GB 15K SAS; BIOS 1.1.2; iDRAC 1.06.06 Build 15; RHEL 6.2 (kernel 2.6.32-220.el6.x86_64)
  PowerEdge R820 cluster: PowerEdge R820 rack server (Sandy Bridge EP 4S); quad Intel Xeon E5-4650 @ 2.7GHz; 16 x 8GB @ 1600 MT/s; Mellanox ConnectX-3 FDR with a Mellanox SX6036 FDR rack switch; 4 servers, 128 cores; Turbo: 2 bins when 8 cores active; 1 x 146GB 15K SAS; BIOS 1.1.5; iDRAC 1.20.20 Build 24

Even though the absolute memory configurations appear different, all servers contain a balanced one-DIMM-per-channel configuration running at a memory speed of 1600 MT/s. For the PowerEdge R820 and PowerEdge M620 clusters, the amount of memory per core is also identical. The Intel Xeon processor E5-2680 on the PowerEdge M620 is the highest-bin 130W part available in that product family. The Intel Xeon processor E5-4650 is the highest-bin processor available in the EP 4-socket product family, and the Intel Xeon processor E5-2470 is the highest-bin processor available in the EN product family that supports an 8 GT/s QPI speed and 1600 MT/s memory.
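The claim that memory per core is identical on the PowerEdge R820 and M620 can be verified from the configurations above (8 x 8GB over 16 cores, 16 x 8GB over 32 cores, 6 x 8GB over 16 cores):

```python
configs = {
    # server: (total memory in GB, cores per server)
    "M620": (8 * 8, 16),   # dual E5-2680, 8 x 8GB DIMMs
    "R820": (16 * 8, 32),  # quad E5-4650, 16 x 8GB DIMMs
    "M420": (6 * 8, 16),   # dual E5-2470, 6 x 8GB DIMMs
}
for name, (gb, cores) in configs.items():
    print(name, gb / cores, "GB/core")
# The M620 and R820 both work out to 4.0 GB/core; the M420 has 3.0 GB/core
```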
4.6 MILC

Figure 10 illustrates that, as the core count increases, the Dell PowerEdge M620 outperforms the PowerEdge M420 and the PowerEdge R820 when running the MILC application. MILC is sensitive to memory bandwidth, while core speed and Turbo Boost do not contribute to the difference in performance in this scenario. The memory bandwidth per core is 30 percent lower on the PowerEdge R820 when compared to the PowerEdge M620, and 25 percent lower on the PowerEdge M420 when compared to the PowerEdge M620. When fully subscribed, the PowerEdge M420 performs 21 to 33 percent lower than the PowerEdge M620, and the PowerEdge R820 performs 11 to 39 percent lower than the PowerEdge M620s. However, for a fixed cluster size, fewer PowerEdge R820 servers will be needed because it is a four-socket system.

Figure 10: MILC performance (performance relative to the PowerEdge M620 versus number of cores; higher is better; M620 2.7GHz, R820 2.7GHz, M420 2.3GHz).

4.7 NAMD

Figure 11 plots NAMD performance on the three clusters. From the figure, it is observed that the Dell PowerEdge M420 performs consistently lower than the PowerEdge M620 by 10 percent. NAMD is not memory-bandwidth sensitive, and the delta in performance can be attributed to the 15 percent drop in core frequency.
The power consumed by the Gigabit Ethernet management switch is not taken into account.

Note that the energy efficiency data presented here is not an apples-to-apples comparison. First, each type of cluster has a different total core count, that is, a different number of servers in the cluster. Additionally, the results compare rack servers to blade servers. These results are an attempt to extract trends and report measured results; an ideal comparison would use similar types of servers and an identical number of cores in each of the three clusters.

Figure 12 presents the energy efficiency of HPL. The metric used here is GFLOPS/Watt. The PowerEdge R820 cluster with 128 cores has 4 percent better efficiency than the 256-core PowerEdge M620 cluster. This can be attributed to the lower core count of the PowerEdge R820 cluster: lower total performance, but lower power consumption as well. The 512-core PowerEdge M420 cluster is 15 percent more energy efficient than the 256-core PowerEdge M620 cluster. This is due to two factors: the doubled core count on the PowerEdge M420 cluster boosts the performance, and additionally the lower-wattage processors and the lower DIMM configuration on the PowerEdge M420 contribute to lower power consumption.

Figure 12: HPL performance/Watt (performance/Watt relative to the PowerEdge M620; higher is better; M620 2.7GHz 256c, R820 2.7GHz 128c, M420 2.3GHz 512c).
Table 2. Test bed details
Table 3. Application and benchmark details
Table 4. Results summary: performance
Table 5. Results summary: energy efficiency

Figures

Figure 1. Platform architecture for Sandy Bridge EP (Intel Xeon E5-2600)
Figure 2. Platform architecture for Sandy Bridge EN (Intel Xeon E5-2400)
Figure 3. Platform architecture for Sandy Bridge EP 4-socket (Intel Xeon E5-4600)
Figure 4. Memory bandwidth
Figure 5. HPL performance
Figure 6. LU performance
Figure 7. WRF performance
Figure 8. ANSYS Fluent performance, truck_poly_14m
Figure 9. ANSYS Fluent performance, truck_111m
Figure 10. MILC performance
Figure 11. NAMD performance
Figure 12. HPL performance/Watt
Figure 13. LU performance/Watt
Figure 14. WRF performance/Watt
Figure 15. ANSYS Fluent performance/Watt
Figure 16. MILC performance/Watt
Figure 17. NAMD performance/Watt

Executive summary

In the last six months, a variety of new servers have become available in the market. These servers have several architectural differences, as well as support for different amounts of memory, PCI-E slots, hard disks, and so on. All of these models are good candidates for High Performance Computing clusters, but certain questions remain unanswered: Which server model is best suited for a specific application? What features of the architecture make it the ideal choice?
Executive summary
1 Introduction
2 Dell PowerEdge 12th generation server platforms and Intel processor architecture
3 Test bed and applications
4 Results and analysis
  4.1 Memory bandwidth
  4.2 HPL
  4.3 LU
  4.4 WRF
  4.5 ANSYS Fluent
  4.6 MILC
  4.7 NAMD
  4.8 Power consumption and energy efficiency
5 Conclusion
6 References

Tables

Table 1. Intel architecture comparison
Performance Analysis of HPC Applications on Several Dell PowerEdge 12th Generation Servers

This Dell technical white paper evaluates and provides recommendations for the performance of several HPC applications across three server architectures: the Dell PowerEdge M620, M420, and R820.

Nishanth Dandapanthula and Garima Kochhar
High Performance Computing Engineering

This document is for informational purposes only and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.

© 2012 Dell Inc. All rights reserved. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dell, the Dell logo, and PowerEdge are trademarks of Dell Inc. Intel and Xeon are registered trademarks of Intel Corporation in the U.S. and other countries. Microsoft, Windows, and Windows Server are either trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.

September 2012 | Rev 1.0

Contents
The PowerEdge M620 and PowerEdge M420 are blade-based servers. The PowerEdge M620 is a half-height blade; the PowerEdge M1000e chassis can house up to 16 such blades. The PowerEdge M420 is a denser quarter-height blade, and the same PowerEdge M1000e chassis can house up to 32 such blades. A full chassis of servers was used in each case to allow meaningful power measurements and to properly amortize the shared infrastructure cost of power supplies, fans, and so on. The other differences in the size of the clusters are due to resource limitations; however, the results sections compare performance based on the number of cores to eliminate total cluster size as a factor.

The PowerEdge M420 supports only SSD drives, and the operating system for the server was installed on this drive. None of the applications were configured to write local files on each compute node; therefore, the choice of SSD versus SAS is not relevant to the results in this study.

The BIOS on all the servers is set to Dell HPC defaults, which include the Performance per Watt Optimized (DAPC) System Profile, Node Interleaving disabled, and Logical Processor disabled. This System Profile balances power-saving and performance options by enabling Turbo Boost, C-states, and C1E. The Power Profile is set to DAPC (Dell Advanced Power Controller), and the Memory Frequency is set to maximum performance.

StackIQ Rocks 6.0.1 Dell edition [5] was used to deploy and manage the cluster. Table 3 illustrates the applications that were studied, the benchmarks used, and their characteristics.
(Table 1, continued) Max memory speed: 1333 MHz / 1600 MHz / 1600 MHz / 1600 MHz. Max QPI speed: 6.4 GT/s / 8 GT/s / 8 GT/s / 8 GT/s. Max DIMMs per channel (DPC): 2 / 3 / 3 DPC.

Figure 2 outlines the block diagram of the Sandy Bridge EN platform architecture. Compared to the Sandy Bridge EP platform, the differences lie in the number of QPI links, the number of memory channels, and the number of DIMMs per channel. The EN-based processors operate at a lower maximum wattage compared to the EP-based processors. InfiniBand FDR is not supported on the EN-based processors; in its place, InfiniBand FDR10 [2] is used. The EN processor is a balanced configuration in terms of bandwidth: the processors can support a theoretical maximum of 32 GB/s through the QPI link between the sockets, and the theoretical maximum memory bandwidth from a socket to its memory is 38.4 GB/s.

Figure 1: Platform architecture for Sandy Bridge EP (Intel Xeon E5-2600): two processors connected by two bi-directional QPI links (8 GT/s, 32 GB/s); four DDR3 memory channels per socket at 1600 MHz (51.2 GB/s per socket); PCI-E Gen3 slots; 56 Gbps FDR InfiniBand (13.6 GB/s bi-directional).
At 256 cores, the PowerEdge M420s perform 11 percent better than the PowerEdge M620s. This data point is repeatable and is not explained by the difference in the InfiniBand network architecture between these two clusters; at the time of writing, this aspect was still under study.

Figure 7: WRF performance (performance relative to the PowerEdge M620 versus number of cores; higher is better; M620 2.7GHz, R820 2.7GHz, M420 2.3GHz).

4.5 ANSYS Fluent

Two benchmark datasets, truck_poly_14m and truck_111m, are used for performance evaluation with ANSYS Fluent. ANSYS Fluent is not sensitive to memory bandwidth but is sensitive to core speed. From Figure 8, the Dell PowerEdge M420 performs 9 to 12 percent lower than the PowerEdge M620. This can be attributed to the 15 percent drop in CPU frequency when compared to the PowerEdge M620. For the same dataset, the PowerEdge R820 performs 7 to 11 percent lower than the PowerEdge M620 when the server is fully subscribed. The difference in the number of QPI links, the number of Turbo bins, and the Turbo headroom available are possible factors that contribute to the drop in performance on the PowerEdge R820. The 30 percent drop in memory bandwidth per core on the PowerEdge R820 when compared to the PowerEdge M620 is not considered a significant factor, since ANSYS Fluent is not memory-bandwidth sensitive.
[1] Optimal BIOS settings for HPC with Dell PowerEdge 12th generation servers.
    http://www.dellhpcsolutions.com/assets/pdfs/Optimal_BIOS_HPC_Dell_12G_v1.0.pdf
[2] Mellanox FDR10 product line description.
    http://www.mellanox.com/pdf/products/oem/RG_IBM.pdf
[3] Intel Xeon Processor E5 family servers.
    http://ark.intel.com/products/family/59138/Intel-Xeon-Processor-E5-Family/server
[4] Intel Xeon Processor E5 family specifications.
    https://www-ssl.intel.com/content/www/us/en/processors/xeon/xeon-e5-family-spec-update.html
[5] StackIQ Rocks.
    http://www.stackiq.com
[6] Stream benchmark for memory bandwidth.
    https://www.cs.virginia.edu/stream/
[7] Measuring memory bandwidth on the Intel Xeon Processor 7500 platform (RFO penalty).
    https://www-ssl.intel.com/content/www/za/en/benchmarks/resources/xeon-7500-measuring-memory-bandwidth-paper.html
[8] Memory selection guidelines for High Performance Computing with Dell PowerEdge 11G servers.
    http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/11g-memory-selection-guidelines.pdf
Information regarding any of the Intel Sandy Bridge processors can be obtained from [3].

3 Test bed and applications

The previous section presented the differences between the three Sandy Bridge based architectures. This section details the test bed used in the study, explains the choices made in configuring the test bed, and describes the HPC applications used in this study. Subsequent sections evaluate the performance of these HPC workloads on the different architectures.

Three types of HPC clusters were configured for this purpose; the details of this test bed are provided in Table 2. A 16-server Dell PowerEdge M620 cluster was deployed to represent Sandy Bridge EP, while a 32-server PowerEdge M420 cluster represented Sandy Bridge EN. A four-server PowerEdge R820 cluster was used for Sandy Bridge EP 4S. A PowerEdge R620 rack server was used as the master node of the cluster.

Table 2. Test bed details
  PowerEdge M420 cluster: PowerEdge M420 blade server, 32 in a PowerEdge M1000e chassis (Sandy Bridge EN); dual Intel Xeon E5-2470 @ 2.3GHz; 6 x 8GB @ 1600 MT/s; Mellanox ConnectX-3 FDR10 with two Mellanox M4001T FDR10 IO modules for the PowerEdge M1000e blade chassis
  PowerEdge M620 cluster: PowerEdge M620 blade server, 16 in a PowerEdge M1000e chassis (Sandy Bridge EP); dual Intel Xeon E5-2680 @ 2.7GHz; 8 x 8GB @ 1600 MT/s, one DIMM per channel at 1600 MHz
The performance of the 128-core PowerEdge R820 cluster is lower than half of the 256-core PowerEdge M620 cluster's performance, but the power consumed by the PowerEdge R820 cluster is higher than half the power drawn by the PowerEdge M620 blades. This contributes to the 12 percent drop in energy efficiency.

Figure 14: WRF performance/Watt (performance/Watt relative to the PowerEdge M620; higher is better; M620 2.7GHz 256c, R820 2.7GHz 128c, M420 2.3GHz 512c).

Figure 15 shows the energy efficiency with ANSYS Fluent. The metric used here is rating/Watt. The PowerEdge R820 cluster has 14 to 16 percent better energy efficiency than the PowerEdge M620 cluster; its performance is approximately half that of the PowerEdge M620 cluster, but the power consumption is accordingly lower too because of the lower total core count. The 512-core PowerEdge M420 cluster has similar energy efficiency to the 256-core PowerEdge M620 cluster for the truck_poly_14m dataset. Interestingly, the PowerEdge M420 cluster is 23 percent more energy efficient than the PowerEdge M620 cluster for the truck_111m dataset.

Figure 15: ANSYS Fluent performance/Watt (performance/Watt relative to the PowerEdge M620 for truck_poly_14m and truck_111m; higher is better; M620 2.7GHz 256c, R820 2.7GHz 128c, M420 2.3GHz 512c).
4 Results and analysis

The memory bandwidth of the three platforms is studied first. The impact each server's architecture has on system memory bandwidth is demonstrated at a micro-benchmark level using the Stream benchmark [6]. Subsequent sections analyze and explain the application-level performance.

4.1 Memory bandwidth

The memory bandwidth and memory bandwidth per core for the three platforms, measured using the Stream benchmark, are plotted in Figure 4. The height of each bar indicates the total memory bandwidth of the system; the value above each bar marks the memory bandwidth per core. The Dell PowerEdge R620 is a rack-based server with a similar architecture and expected performance as the PowerEdge M620 blade server. As expected, the PowerEdge R820 has the maximum total memory bandwidth, measured at 110 GB/s. The corresponding bandwidth for the 2-socket PowerEdge R620 is 78 GB/s.

The Stream Triad benchmark performs two reads and one write to memory. If additional data is transferred to or from memory during the benchmark measurement period, it is not counted towards the total memory bandwidth capability; therefore, the memory bandwidth available to certain applications may be higher than reported by Stream. On the Intel Xeon processor E5-4600 product family, an issued non-cacheable write instruction still triggers a read for ownership (RFO) due to the cache coherency protocol. This extra read is not counted when running the benchmark, but consumes memory bandwidth. This is explained in more detail in [7].
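The Triad bandwidth accounting described above (two reads plus one write, i.e. 24 bytes counted per double-precision element) can be sketched in plain Python. This only illustrates the bookkeeping; it is not a substitute for the optimized C Stream benchmark, and pure-Python loop overhead dominates the timing.

```python
import time

def triad_bandwidth(n: int = 1_000_000, scalar: float = 3.0) -> float:
    """Return the counted bandwidth in GB/s for a(i) = b(i) + scalar*c(i)."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = [b[i] + scalar * c[i] for i in range(n)]
    elapsed = time.perf_counter() - start
    # Stream counts 3 arrays x 8 bytes per element; an RFO read of `a`
    # before the write (as on the E5-4600) would NOT be counted here.
    counted_bytes = 3 * 8 * n
    assert a[0] == 7.0  # sanity check: 1.0 + 3.0*2.0
    return counted_bytes / elapsed / 1e9

print(f"{triad_bandwidth():.3f} GB/s (pure Python, far below hardware limits)")
```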
Figure 16 illustrates the energy efficiency of MILC. The metric used here is rating/Watt. The PowerEdge M620 cluster provides approximately double the performance for less than double the power consumed by the PowerEdge R820 cluster; thus, the PowerEdge R820 cluster measures 16 percent lower energy efficiency than the PowerEdge M620 cluster. The 512-core PowerEdge M420 cluster provides double the performance for less than double the power consumption of the 256-core PowerEdge M620 cluster. This results in the 37 percent better energy efficiency of the PowerEdge M420 cluster over the PowerEdge M620 cluster. The shared infrastructure of the 32-node PowerEdge M420 chassis clearly provides a benefit.

Figure 16: MILC performance/Watt (performance/Watt relative to the PowerEdge M620; higher is better; M620 2.7GHz 256c, R820 2.7GHz 128c, M420 2.3GHz 512c).

Figure 17 plots the energy efficiency of NAMD. The PowerEdge M420 cluster performs better in this scenario, providing 5 percent better energy efficiency than the PowerEdge M620 cluster. As mentioned in NAMD, there were issues running NAMD on the PowerEdge R820 cluster; therefore, that data point is missing from this graph.

Figure 17: NAMD performance/Watt (performance/Watt relative to the PowerEdge M620; higher is better; M620 2.7GHz 256c, M420 2.3GHz 512c).
Figure 6: LU performance (performance relative to the PowerEdge M620 versus number of cores; higher is better; M620 2.7GHz, R820 2.7GHz, M420 2.3GHz).

4.4 WRF

The performance of WRF on the three clusters is shown in Figure 7. The Dell PowerEdge M620 performs increasingly better than the PowerEdge R820 as the cluster size increases. From a previous study of the impact of memory bandwidth [8], it is observed that a 16 percent drop in memory bandwidth translates to a 4 percent drop in WRF performance. In this case, the drop in memory bandwidth per core of the PowerEdge R820 when compared to the PowerEdge M620 is 30 percent; thus, a portion of this performance drop on the PowerEdge R820s can be attributed to the lower memory bandwidth per core. However, for a fixed cluster size, fewer PowerEdge R820 servers will be needed to achieve this level of performance, since it is a four-socket system.

WRF is impacted by the difference in processor frequency on the PowerEdge M420. The PowerEdge M420 cluster performs consistently lower than the PowerEdge M620s by a factor of 15 percent until 128 cores. There is also a 23 percent drop in memory bandwidth per core on the PowerEdge M420 when compared to the PowerEdge M620, which also contributes to this drop in performance.
(Table 4, continued)
  ANSYS Fluent:  M620 best performance; R820 performance 8% lower; M420 performance 10% lower
  MILC:          M620 best performance; R820 performance 39% lower; M420 performance 33% lower
  NAMD:          M620 best performance; R820 issues when executing the program; M420 performance 9% lower

The PowerEdge M620 cluster is used as the baseline for performance comparisons; higher is better.

Table 5. Results summary: energy efficiency (EE)
  HPL:           R820 (128c) EE 4% higher; M420 (512c) best EE, 15% higher
  LU:            R820 best EE, 6% higher; M420 similar EE to the PowerEdge M620
  WRF:           R820 EE 12% lower; M420 best EE, 13% higher
  ANSYS Fluent:  R820 best EE for truck_poly_14m, 16% higher; M420 best EE for truck_111m, 23% higher
  MILC:          R820 EE 18% lower; M420 best EE, 37% higher
  NAMD:          R820 issues when executing the program; M420 better EE, 5% higher

EE is shorthand for energy efficiency. The PowerEdge M620 cluster (256 cores) is used as the baseline for comparison; higher is better for EE.

From an engineering and design perspective, performance and energy efficiency considerations are important to best fit a cluster to an application's requirements. However, other factors do influence the final decision. These include total cost of ownership and aspects that differ from data center to data center, such as the total number of servers, number of switches, power and cooling availability, ease of administration, and cost.

6 References
There were issues running NAMD fully subscribed on more than one server on the PowerEdge R820 platform; the authors are working towards a resolution with the NAMD developers.

Figure 11: NAMD performance (performance relative to the PowerEdge M620 at 32, 64, and 128 cores; higher is better; M620 2.7GHz, M420 2.3GHz).

4.8 Power consumption and energy efficiency

This section discusses the power consumption and energy efficiency of the three clusters described in Test bed and applications. Energy efficiency is used as the metric for power comparisons and is computed as the performance obtained for each Watt of power consumed.

The Dell PowerEdge M1000e chassis supports multiple server blades and includes shared infrastructure such as fans, power supplies, management modules, and switches. The power consumption of the chassis includes the power consumed by these shared resources, so power measurements for a smaller number of servers that do not fully make use of the chassis would not be a fair measure of the actual power consumed. Therefore, only a fully populated PowerEdge M1000e blade chassis, with 16 PowerEdge M620 servers or 32 PowerEdge M420 servers, is used for this portion of the study. For the PowerEdge R820 cluster, the power consumed by the four PowerEdge R820 servers (128 cores) and the InfiniBand switch is taken into consideration.
Figure 9 compares the performance among the three clusters when using the truck_111m benchmark dataset. The minimum amount of memory required for the benchmark is 128GB. The PowerEdge M620 and PowerEdge M420 were configured with 64GB each; allowing some space for the operating system, a minimum of three servers is needed to run this benchmark dataset. Similarly, a minimum of two PowerEdge R820 servers, configured with 128GB of memory each, is required to run truck_111m. Thus, the first data point plotted is at 64 cores, and not 8, 16, or 32 cores. The PowerEdge R820 performs 8 percent lower than the PowerEdge M620, and the PowerEdge M420s perform 8 to 11 percent lower than the PowerEdge M620s. The trends and analysis are similar to those observed with the truck_poly_14m dataset.

Figure 8: ANSYS Fluent performance, truck_poly_14m (performance relative to the PowerEdge M620 versus number of cores; higher is better; M620 2.7GHz, R820 2.7GHz, M420 2.3GHz).

Figure 9: ANSYS Fluent performance, truck_111m (performance relative to the PowerEdge M620 versus number of cores; higher is better; M620 2.7GHz, R820 2.7GHz, M420 2.3GHz).
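The server-count arithmetic above (a 128GB benchmark spread across nodes whose usable memory is somewhat less than their installed 64GB or 128GB) can be sketched as follows. The 90 percent usable-memory fraction is an illustrative assumption standing in for "allowing some space for the operating system," not a figure from the paper.

```python
import math

def min_servers(required_gb: float, installed_gb: float,
                usable_fraction: float = 0.9) -> int:
    """Smallest node count whose usable memory covers the benchmark's footprint."""
    return math.ceil(required_gb / (installed_gb * usable_fraction))

print(min_servers(128, 64))   # blades with 64GB each: 3 servers, matching the text
print(min_servers(128, 128))  # R820s with 128GB each: 2 servers, matching the text
```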
Figure 2: Platform architecture for Sandy Bridge EN (Intel Xeon E5-2400): two processors connected by one bi-directional QPI link (8 GT/s, 32 GB/s); three DDR3 memory channels per socket at 1600 MHz (38.4 GB/s per socket); PCI-E Gen3 lanes; internal PERC; 50 Gbps FDR10 InfiniBand (12.1 GB/s bi-directional).

Figure 3 describes the platform architecture of the Sandy Bridge EP 4-socket platform. Each socket has two QPI links, but any two adjacent sockets are connected by just one QPI link in a ring structure. There is no cross-link between processors one and three, or between processors zero and two; thus, any communication between these two socket pairs needs to traverse two QPI links. Only two of the sockets have PCI lanes and can therefore be local to PCI cards installed in the system. Other than the differences in the number of QPI links, the 4-socket platform architecture is very close to the 2-socket EP platform architecture.

Figure 3: Platform architecture for Sandy Bridge EP 4-socket (Intel Xeon E5-4600): four processors in a QPI ring (two bi-directional QPI links per socket, 8 GT/s, 32 GB/s); four DDR3 memory channels per socket at 1600 MHz (51.2 GB/s per socket); PCI-E Gen3 slots on two of the sockets; 56 Gbps FDR InfiniBand (13.6 GB/s bi-directional).
se server platforms at an application level, to help HPC users make this choice with confidence.

The latest Dell PowerEdge 12 generation servers include support for the new processors from Intel. The PowerEdge M420 servers, armed with the Intel Xeon processor E5-2400 product family, cater to users who need a dense, compute-intensive platform by accommodating 32 servers in 10U of rack space. This allows 512 cores in 10U, doubling the typical rack density. The 4-socket PowerEdge R820 servers tap into the processing power of the Intel Xeon processor E5-4600 product family and provide massive processing power and memory density; these characteristics are attractive to users who need fat nodes in their clusters. Finally, the PowerEdge M620 server strikes a balance between performance, energy efficiency, scalability, and density with the Intel Xeon processor E5-2600 product family.

This white paper describes the behavior of select HPC workloads on these three Intel Xeon processor families, with a focus on performance and energy efficiency. The focus is on a cluster-level analysis, as opposed to a single-server study. The paper first introduces each of the three Intel architectures and compares the three different processor families. It then provides cluster-level results for different HPC workloads. Subsequent sections analyze the results in order to provide a better understanding and recommendations regarding which type of server platform best fits a particular workload. The st
sed and their characteristics. The applications chosen are a mix of open source and commercial applications.

Table 3. Application and benchmark details
- High Performance Linpack (HPL): floating point, CPU-intensive system benchmark; Intel MKL v10.3.9.293; problem size set to 90 percent of total memory
- Stream: memory bandwidth micro-benchmark; v5.9; array size 160,000,000
- ANSYS Fluent: computational fluid dynamics application; v14.0.0; truck_poly_14m and truck_111m benchmark data sets
- WRF: weather modeling application; v3.1; Conus 12k
- NAMD: molecular dynamics application; v2.9; STMV
- MILC: quantum chromodynamics application; v7.6.3; fnl_2009_intel.in input, based on Medium NSF
- LU: lower-upper decomposition of physical systems; NPB v3.3.1; Class D

For HPL, the performance metric used for comparison is GFLOPS, and for WRF the performance metric used is the average time step. For NAMD, the performance metric used is days per nanosecond. For all other applications, the metric used is rating. Rating is defined as the number of times an application can be executed in a single day. In addition to quantifying the performance of the above-mentioned server platforms, the power consumed was also measured, using a rack power distribution unit (PDU). Because an apples-to-apples comparison is not possible with the test bed configuration, a cluster-level comparison of power consumption is provided in Power Consumption and Energy Efficiency. A previous study [1] characterized
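The rating metric defined above is straightforward to compute from a measured wall-clock runtime; a minimal sketch (the helper name is illustrative, not from the paper):

```python
SECONDS_PER_DAY = 86400

def rating(runtime_seconds):
    """Number of back-to-back runs that fit in one day (higher is better)."""
    return SECONDS_PER_DAY / runtime_seconds

# e.g. a job whose wall-clock time is 2 hours:
print(rating(7200))  # -> 12.0 runs per day
```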
slates to different problem sizes (values of N) for each configuration. The results are plotted in Figure 5. From the figure it is clear that the Dell PowerEdge M620 and the PowerEdge R820 perform similarly when compared at the same number of cores. This is attributed to the similar core frequency and memory frequency of the two configurations. Clearly, HPL is not affected by the difference in memory bandwidth at these core counts. HPL also scales well, and the interconnect is not a bottleneck at these core counts. This is apparent from the graph because the number of PowerEdge R820 servers needed to achieve a certain core count is half that of the PowerEdge M620, yet the performance of the two clusters is similar. The PowerEdge M420 performs consistently lower than the M620, by 15 to 19 percent, irrespective of core count. The difference in core frequency between the PowerEdge M420 (2.3GHz) and the PowerEdge M620 (2.7GHz) is 15 percent. The PowerEdge M420 also has a lower total memory configuration and uses InfiniBand FDR10, which is slower than the InfiniBand FDR used in the PowerEdge M620. This explains the consistently lower performance of the PowerEdge M420.

Figure 5. HPL performance (performance relative to PowerEdge M620, higher is better; x-axis: number of cores, 16 to 256; series: M620 2.7GHz, R820 2.7GHz, M420 2.3GHz).
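The problem size N follows from the memory budget, since HPL's double-precision matrix occupies 8*N^2 bytes. A common sizing sketch is shown below; the 90 percent fraction comes from Table 3, but the formula is a standard rule of thumb, not the paper's exact sizing script.

```python
import math

def hpl_problem_size(total_mem_gib, fraction=0.90):
    """Largest N whose 8*N^2-byte matrix fits in the given memory fraction."""
    budget_bytes = fraction * total_mem_gib * 1024**3
    return int(math.sqrt(budget_bytes / 8))

# e.g. a hypothetical 64GB node:
print(hpl_problem_size(64))
```

This is why clusters with different total memory, as in the three test beds, end up running different values of N.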
the performance and energy impact of different BIOS tuning options on the PowerEdge M620 servers. The PowerEdge M620 cluster test bed and applications used in that study were identical to this one, and therefore data from that analysis is leveraged for the Sandy Bridge EP portion of this work.

4 Results and analysis

This section compares the performance characteristics of each of the above-mentioned applications on the three different server platforms. Because the Dell PowerEdge R820 has double the number of cores per server compared to the PowerEdge M620 and the PowerEdge M420, the performance comparison is made on the basis of core count rather than the number of servers. This comparison is also helpful when studying applications that have per-core licensing costs; for example, the PowerEdge R820 needs double the number of ANSYS Fluent licenses per server (32) when compared to the 16 needed for a PowerEdge M620 or M420. For all tests, the cores in the servers were fully subscribed. For example, a 32-core result indicates that the test used two PowerEdge M620s (2 x 16 cores per server), one PowerEdge R820, or two PowerEdge M420s, depending on the cluster. All application results in this section are plotted relative to the performance of the PowerEdge M620 cluster. Before jumping into application performance, the obvious differences in the memory subsystem of the three server platfor
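The core-count and per-core-license bookkeeping above can be sketched as follows. This is an illustrative helper assuming fully subscribed nodes; the names are not from the paper.

```python
# Cores per server for the three platforms under test.
CORES_PER_SERVER = {"M620": 16, "R820": 32, "M420": 16}

def servers_and_licenses(platform, total_cores):
    """Servers needed for a target core count, and per-core licenses used."""
    per_node = CORES_PER_SERVER[platform]
    servers = total_cores // per_node
    return servers, servers * per_node

for p in ("M620", "R820", "M420"):
    print(p, servers_and_licenses(p, 32))  # 32-core data point
```

At the 32-core data point this yields two M620s, one R820, or two M420s, matching the example in the text, with 32 per-core licenses consumed in every case.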
udy characterizes each server platform based not only on its performance, but also on its energy efficiency. The behavior and guidelines presented here apply to HPC workloads similar to those tested as part of this study. The recommendations in this document may not be appropriate for general enterprise workloads.

2 Dell PowerEdge 12 generation server platforms and Intel processor architecture

A detailed comparison of the latest Intel processor architectural variants (Xeon E5-2400, E5-2600, and E5-4600, codenamed Sandy Bridge) is provided in Table 1, which also includes the previous generation Intel Xeon processor 5600 series architecture, codenamed Westmere. At a glance, the major improvements of the new Sandy Bridge based servers over the previous generation Westmere servers are a 33 percent increase in core count, an increase in memory channels, support for higher memory speeds, and higher QPI speeds. A previous study by the authors [1] describes the Dell PowerEdge 12 generation server models and the architecture of Sandy Bridge EP (the Intel Xeon processor E5-2600 product family) in great detail. It also explains the differences between Westmere EP and Sandy Bridge EP at a more granular level. A block diagram of the Sandy Bridge EP processor architecture is included in this document as Figure 1 for reference.

Table 1. Intel archite
4.3 LU

Figure 6 presents the performance of the LU benchmark from the NAS Parallel Benchmarks (NPB) suite on the three clusters. When the servers are fully subscribed, the Dell PowerEdge M620 performs 8 percent better than the PowerEdge R820, and 6 to 12 percent better than the PowerEdge M420. A previous study analyzing various memory configurations on Dell PowerEdge 11 generation servers [8] found that a 16 percent drop in measured memory bandwidth led to only a 2 percent drop in LU performance, indicating that LU is not a memory-intensive application. The PowerEdge R820 has a single QPI link connecting adjacent sockets, whereas the PowerEdge M620 has two QPI links. The extent of intra-node communication is higher on the PowerEdge R820 because of its higher core count. Recall that there are no crosslinks between sockets zero and two on the PowerEdge R820, and thus messages between these sockets need to traverse two QPI links, as described in Figure 3. This difference in QPI bandwidth can be associated with the lower performance of the PowerEdge R820. However, the value of the PowerEdge R820 is that fewer servers are needed to achieve a given performance level or core count, because it is a quad-socket system. The performance drop on the PowerEdge M420 compared to the PowerEdge M620 can be attributed to its 15 percent lower clock speed, single QPI link, and lower mem
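Percentage comparisons such as the 8 percent figure above are plain ratios of the rating metric. As a sketch (the inputs below are hypothetical ratings, not the paper's measurements):

```python
def percent_better(rating_a, rating_b):
    """How much better, in percent, cluster A's rating is than cluster B's."""
    return (rating_a / rating_b - 1.0) * 100.0

# Hypothetical ratings for two clusters (runs per day):
print(percent_better(108.0, 100.0))  # A is about 8 percent better than B
```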
