PFFT User Manual - Technische Universität Chemnitz
Contents
    void pfft_printf(MPI_Comm comm, const char *format, ...);

6.9 Generating Periodic Cartesian Communicators

Based on the processes that are part of the given communicator comm, the routine

    int pfft_create_procmesh_1d(
        MPI_Comm comm, int np0,
        MPI_Comm *comm_cart_1d);

allocates and creates a one-dimensional, periodic, Cartesian communicator comm_cart_1d of size np0. A non-zero error code is returned whenever np0 does not fit the size of comm. The memory of the generated communicator should be released with MPI_Comm_free after usage. Analogously, use

    int pfft_create_procmesh_2d(
        MPI_Comm comm, int np0, int np1,
        MPI_Comm *comm_cart_2d);

in order to allocate and create a two-dimensional, periodic, Cartesian communicator comm_cart_2d of size np0 x np1, or

    int pfft_create_procmesh(
        int rnk_np, MPI_Comm comm, const int *np,
        MPI_Comm *comm_cart);

in order to allocate and create a rnk_np-dimensional, periodic, Cartesian communicator of size np[0] x np[1] x ... x np[rnk_np-1]. Hereby, np is an array of length rnk_np. Again, the memory of the generated communicator should be released with MPI_Comm_free after usage.

7 Developers Guide

7.1 Search and replace patterns

Correct the alignment of the pfft.h header. Expand most macros of pfft.h to generate the function reference of this manual.
    for (ptrdiff_t k0 = 0; k0 < local_ni[0]; k0++)
      for (ptrdiff_t k1 = 0; k1 < local_ni[1]; k1++)
        for (ptrdiff_t k2 = 0; k2 < local_ni[2]; k2++)
          in[(k0*local_ni[1] + k1)*local_ni[2] + k2] =
              func(k0 + local_i_start[0],
                   k1 + local_i_start[1],
                   k2 + local_i_start[2]);

The parallel FFT is computed when we execute the generated plan via

    void pfft_execute(const pfft_plan plan);

Now the results can be read from out with an analogous three-dimensional loop. If we do not want to execute another parallel FFT of the same type, we free the allocated memory of the plan with

    void pfft_destroy_plan(pfft_plan plan);

Additionally, we use

    int MPI_Comm_free(MPI_Comm *comm);

to free the communicator allocated by pfft_create_procmesh_2d and

    void pfft_free(void *ptr);

to free memory allocated by pfft_alloc_complex. Finally, we exit MPI via

    int MPI_Finalize(void);

2.2 Porting FFTW-MPI based code to PFFT

We illustrate the close connection between FFTW-MPI and PFFT at a three-dimensional MPI example analogous to the example given in the FFTW manual [2]. Exactly the same task can be performed with PFFT as given in the following listing.

    #include <pfft.h>

    int main(int argc, char **argv)

Listing 2.2: Minimal parallel c2c FFT test program.

    #include <fftw3-mpi.h>

    int main(int argc, char **argv)
5.3 Preliminary: Skip Serial Transformations
6 PFFT Reference
6.1 Files and Data Types
6.2 MPI Initialization
6.3 Using PFFT Plans
6.4 Data Distribution Functions
6.4.1 Complex-to-Complex FFT
6.4.2 Real-to-Complex FFT
6.4.3 Complex-to-Real FFT
6.4.4 Real-to-Real FFT
6.5 Plan Creation
6.5.1 Complex-to-Complex FFT
6.5.2 Real-to-Complex FFT
6.5.3 Complex-to-Real FFT
6.5.4 Real-to-Real FFT
6.6 FFT Execution Timer
6.6.1 Basic Run Time Measurements
6.6.2 Advanced Timer Manipulation
6.7 Ghost Cell Communication
6.7.1 Using Ghost Cell Plans
6.7.2 Data Distribution
6.7.3 Memory Allocation
6.7.4 Plan Creation for Complex Data
6.7.5 Plan Creation for Real Data
6.7.6 Influential Flags
6.7.7 Ghost Cell Execution Timer
6.8 Useful Tools
6.8.1 Initializing Complex Inputs and Checking Outputs
Note that these functions can be combined for a quick consistency check of the FFT. Since a forward FFT followed by a backward FFT reproduces the inputs up to a scaling factor, the following code snippet should give a result equal to zero up to machine precision.

    /* Initialize input with random numbers */
    pfft_init_input_complex_3d(n, local_ni, local_i_start, in);

    /* Execute parallel forward FFT */
    pfft_execute(plan_forw);

    /* Clear the old input (only needed if in != out) */
    pfft_clear_input_complex_3d(n, local_ni, local_i_start, in);

    /* Execute parallel backward FFT */
    pfft_execute(plan_back);

    /* Scale data */
    for (ptrdiff_t l = 0; l < local_ni[0] * local_ni[1] * local_ni[2]; l++)
      in[l] /= (n[0] * n[1] * n[2]);

    /* Print error of back-transformed data */
    err = pfft_check_output_complex_3d(n, local_ni, local_i_start, in, comm_cart_2d);
    pfft_printf(comm_cart_2d,
        "Error after one forward and backward FFT of size n = [%td, %td, %td]: %e\n",
        n[0], n[1], n[2], err);

Hereby, we set all inputs equal to zero after the forward FFT in order to be sure that all the final results are actually computed by the backward FFT instead of being a buggy relic of the forward transform.

6.8.2 Initializing Real Inputs and Checking Outputs

To fill a real array data with reproducible real values, use one of the functions

    void pfft_init_input_real_3d(
        const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
        double *data);
6.2 MPI Initialization

Initialization and cleanup of PFFT is done in the same way as for FFTW-MPI, see [6]. In order to keep a clean name space, PFFT offers the wrapper functions

    void pfft_init(void);
    void pfft_cleanup(void);

that can be used as substitutes for fftw_mpi_init and fftw_mpi_cleanup, respectively.

6.3 Using PFFT Plans

PFFT follows exactly the same workflow as FFTW-MPI. A plan created by one of the functions given in Section 6.5 is executed with

    void pfft_execute(const pfft_plan plan);

and freed with

    void pfft_destroy_plan(const pfft_plan plan);

Note that you cannot apply fftw_mpi_execute or fftw_destroy_plan to PFFT plans. The new-array execute functions are given by

    void pfft_execute_dft(const pfft_plan plan, pfft_complex *in, pfft_complex *out);
    void pfft_execute_dft_r2c(const pfft_plan plan, double *in, pfft_complex *out);
    void pfft_execute_dft_c2r(const pfft_plan plan, pfft_complex *in, double *out);
    void pfft_execute_r2r(const pfft_plan plan, double *in, double *out);

The arrays given by in and out must have the correct size and the same alignment as the arrays that were used to create the plan, just as is the case for FFTW, see [4].

6.4 Data Distribution Functions

6.4.1 Complex-to-Complex FFT

    ptrdiff_t pfft_local_size_dft_3d(
        const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t
*local_o_start);

    ptrdiff_t pfft_local_size_dft(
        int rnk_n, const ptrdiff_t *n,
        MPI_Comm comm_cart, unsigned pfft_flags,
        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);

    ptrdiff_t pfft_local_size_many_dft(
        int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
        ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
        MPI_Comm comm_cart, unsigned pfft_flags,
        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);

Compute the data distribution of a parallel, complex-input/complex-output discrete Fourier transform (DFT) in two or more dimensions, returning the number of complex numbers that must be allocated to hold the parallel transform.

Arguments:

- rnk_n is the rank of the transform (typically the size of the arrays n, ni, no), which can be any integer >= 2. The 3d planner corresponds to a rnk_n of 3.
- The array n of size rnk_n specifies the transform dimensions. They can be any positive integer.
- The array ni of size rnk_n specifies the input array dimensions. They can be any positive integer with ni[t] <= n[t] for all dimensions t = 0, ..., rnk_n-1. For ni[t] < n[t] the inputs will be padded with zeros up to size n[t] along the t-th dimension before the transform, see Section 4.2.
- The array no of size rnk_n specifies the output array dimensions.
6.8.2 Initializing Real Inputs and Checking Outputs
6.8.3 Initializing r2c/c2r Inputs and Checking Outputs
6.8.4 Operations on Arrays of Type ptrdiff_t
6.8.5 Print Three-Dimensional Arrays in Parallel
6.8.6 Reading Command Line Arguments
6.8.7 Parallel Substitutes for vprintf, fprintf and printf
6.9 Generating Periodic Cartesian Communicators

7 Developers Guide
7.1 Search and replace patterns

8 ToDo
8.1 Measuring parallel run times

This user manual describes the usage of PFFT 1.0.7-alpha [18, 20], an MPI-based, parallel software library for the computation of equispaced fast Fourier transforms (FFT) on parallel, distributed memory architectures. The reader of this manual should be familiar with the basic usage of FFTW and MPI. For further information we refer to the well-written FFTW user manual [1] and the MPI Standard [15]; see also [12] for detailed explanations.

1 Introduction

A popular software library for computing FFTs is FFTW [11, 10]. This library also includes a parallel FFT implementation, FFTW-MPI, based on the Message Passing Interface (MPI). FFTW-MPI parallelizes multi-dimensional FFTs by a mixture of serial lower-dimensional FFTs and parallel data transpositions. However, FFTW-MPI makes use of a one-dimensional data decomposition, which turns out to be a scalability bottleneck.
It can be -1 (= PFFT_FORWARD) or +1 (= PFFT_BACKWARD).
- pfft_flags is a bitwise OR of zero or more planner flags, as defined in the section on planner flags.

PFFT computes an unnormalized transform: computing a forward followed by a backward transform (or vice versa) will result in the original data multiplied by the size of the transform, i.e., the product of the dimensions n[t].

6.5.2 Real-to-Complex FFT

    pfft_plan pfft_plan_dft_r2c_3d(
        const ptrdiff_t *n, double *in, pfft_complex *out,
        MPI_Comm comm_cart, int sign, unsigned pfft_flags);

    pfft_plan pfft_plan_dft_r2c(
        int rnk_n, const ptrdiff_t *n, double *in, pfft_complex *out,
        MPI_Comm comm_cart, int sign, unsigned pfft_flags);

    pfft_plan pfft_plan_many_dft_r2c(
        int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
        ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
        double *in, pfft_complex *out,
        MPI_Comm comm_cart, int sign, unsigned pfft_flags);

    pfft_plan pfft_plan_many_dft_r2c_skipped(
        int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
        ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
        const int *skip_trafos, double *in, pfft_complex *out,
        MPI_Comm comm_cart, int sign, unsigned pfft_flags);

Plan a parallel, real-input/complex-output discrete Fourier transform (DFT) in two or more dimensions, returning a pfft_plan. The planner returns NULL if the plan cannot be created.
        const ptrdiff_t *gc_above, double *data,
        MPI_Comm comm_cart, unsigned gc_flags);

    pfft_gcplan pfft_plan_rgc(
        int rnk_n, const ptrdiff_t *n,
        const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
        double *data, MPI_Comm comm_cart, unsigned gc_flags);

    pfft_gcplan pfft_plan_many_rgc(
        int rnk_n, const ptrdiff_t *n,
        ptrdiff_t howmany, const ptrdiff_t *block,
        const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
        double *data, MPI_Comm comm_cart, unsigned gc_flags);

Hereby, rnk_n, n, howmany and comm_cart must be the variables that were used for the PFFT plan creation. Remember that n is the logical FFT size, just as is the case for FFT planning. The block size block must be equal to iblock or oblock, depending on whether the ghost cell plan works on the FFT input or output array. Analogously, data becomes in or out. Set the number of ghost cells by gc_below and gc_above as described in Section 6.7.2. The flags gc_flags must be set appropriately to the flags that were passed to the FFT planner. Table 6.2 shows the ghost cell planner flags that must be set in dependence on the listed FFT planner flags.

    FFT flag                  ghost cell flag
    PFFT_TRANSPOSED_NONE      PFFT_GC_TRANSPOSED_NONE
    PFFT_TRANSPOSED_IN        PFFT_GC_TRANSPOSED
    PFFT_TRANSPOSED_OUT       PFFT_GC_TRANSPOSED
    PFFT_PADDED_R2C           PFFT_GC_PADDED_R2C
    PFFT_PADDED_C2R           PFFT_GC_PADDED_C2R

Table 6.2: Mapping of FFT and real-valued ghost cell planner flags.
    pfft_complex *cdata = pfft_alloc_complex(
        (alloc_local_gc > alloc_local) ? alloc_local_gc : alloc_local);

Here, alloc_local gives the number of data elements that are necessary to hold all steps of the parallel FFT, while alloc_local_gc gives the number of data elements that are necessary to hold all steps of the ghost cell communication. Note that we take the maximum of these two numbers as the argument of pfft_alloc_complex. The code snippet for real-valued arrays looks very similar:

    /* Get parameters of data distribution */
    /* (alloc_local, local_no, local_o_start are given in complex units) */
    /* (local_ni, local_i_start are given in real units) */
    alloc_local = pfft_local_size_dft_r2c_3d(
        n, comm_cart_2d, PFFT_TRANSPOSED_NONE,
        local_ni, local_i_start, local_no, local_o_start);

    /* (alloc_local_gc, local_ngc, local_gc_start are given in real units) */
    alloc_local_gc = pfft_local_size_gc_3d(
        local_ni, local_i_start, gc_below, gc_above,
        local_ngc, local_gc_start);

    /* Allocate enough memory for FFT and ghost cells */
    double *rdata = pfft_alloc_real(
        (alloc_local_gc > 2 * alloc_local) ? alloc_local_gc : 2 * alloc_local);

Note that the number of real-valued data elements is given by two times alloc_local for r2c transforms, whereas the last line would change into

    double *rdata = pfft_alloc_real(
        (alloc_local_gc > alloc_local) ? alloc_local_gc : alloc_local);

for r2r transforms.
command line argument is given. The array parameter must be of sufficient size to hold neededArgs elements of the given data type. For example, a program containing the following code snippet

    double x = 0.1;
    pfft_get_args(argc, argv, "-pfft_x", 1, PFFT_DOUBLE, &x);

    int np[2] = {2, 2};
    pfft_get_args(argc, argv, "-pfft_np", 2, PFFT_INT, np);

    ptrdiff_t n[3] = {8, 8, 8};
    pfft_get_args(argc, argv, "-pfft_n", 3, PFFT_PTRDIFF_T, n);

    int on = 0;
    pfft_get_args(argc, argv, "-pfft_on", 1, PFFT_SWITCH, &on);

that is executed via

    test -pfft_x 3.1 -pfft_np 2 3 -pfft_n 8 16 32 -pfft_on

will read x = 3.1, np[2] = {2, 3}, n[3] = {8, 16, 32}, and turn on the switch (on = 1). Note the address operator in front of x in the second line. Furthermore, note that the initialization of all variables with default values before the call of pfft_get_args avoids trouble if the user does not provide all the command line arguments.

6.8.7 Parallel Substitutes for vprintf, fprintf and printf

The following functions are similar to the standard C functions vfprintf, fprintf and printf with the exception that only rank 0 within the given communicator comm will produce output. The intention is to avoid the flood of messages that is produced when simple printf statements are run in parallel.

    void pfft_vfprintf(
        MPI_Comm comm, FILE *stream, const char *format, va_list ap);
    void pfft_fprintf(
        MPI_Comm comm, FILE *stream, const char *format, ...);
communicator cart_comm.

2.6 Parallel data decomposition

In the following, we use the notation n/P to symbolize that an array of length n is broken into disjoint blocks and distributed on P MPI processes. Hereby, the data is distributed in compliance with the FFTW-MPI data decomposition [5], i.e., the first n/block (rounded down) processes get a contiguous chunk of block elements, the next process gets the remaining data elements, and all further processes get nothing. Thereby, the block size defaults to n/P rounded up, but can also be user-defined.

2.6.1 Non-transposed and transposed data layout

The standard PFFT data decomposition of h interleaved d-dimensional arrays of equal size n_0 x n_1 x ... x n_{d-1} on an r-dimensional process mesh of size P_0 x ... x P_{r-1} is given by the blocks

    n_0/P_0 x n_1/P_1 x ... x n_{r-1}/P_{r-1} x n_r x ... x n_{d-1} x h.

A PFFT plan created with planning flag PFFT_TRANSPOSED_NONE requires the inputs to be decomposed in this standard way and produces outputs that are decomposed in the same way. PFFT can save half of the global communication amount if the data reordering to standard decomposition is omitted. The transposed data decomposition is given by

    n_1/P_0 x n_2/P_1 x ... x n_r/P_{r-1} x n_0 x n_{r+1} x ... x n_{d-1} x h.

A PFFT plan created with planning flag PFFT_TRANSPOSED_OUT produces outputs that are decomposed in this transposed way.
iblock = PFFT_DEFAULT_BLOCK.
- pfft_flags is a bitwise OR of zero or more planner flags, as defined in the section on planner flags.
- The array local_ni of size rnk_n returns the size of the local input array block in every dimension, counted in units of complex numbers.
- The array local_i_start of size rnk_n returns the offset of the local input array block in every dimension, counted in units of complex numbers.
- The array local_no of size rnk_n returns the size of the local output array block in every dimension, counted in units of complex numbers.
- The array local_o_start of size rnk_n returns the offset of the local output array block in every dimension, counted in units of complex numbers.

In addition, the following local_block functions compute the local data distribution of the process with MPI rank pid. The local_size interface can be understood as a call of local_block where pid is given by MPI_Comm_rank(comm_cart, &pid), i.e., each MPI process computes its own data block. However, local_block functions have a void return type, i.e., they omit the computation of the local array size that is necessary to hold the parallel transform. This makes local_block functions substantially faster in execution.

    void pfft_local_block_dft_3d(
        const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);
1. For $k = 0, \ldots, n-1$ set $\tilde f_k = 0$.
2. For $k = -n_i/2, \ldots, n_i/2-1$ compute $\tilde f_{k \bmod n} = f_k (-1)^k$.
3. For $l = 0, \ldots, n-1$ compute $\hat g_l = \sum_{k=0}^{n-1} \tilde f_k \mathrm{e}^{-2\pi\mathrm{i} kl/n}$ using PFFT.
4. For $l = -n_o/2, \ldots, n_o/2-1$ compute $g_l = \hat g_{l \bmod n} (-1)^{l+n/2}$.

Note that this shift implies that the library deals with pruned FFTs in a special way, i.e., half of the zeros are added at the beginning of the inputs and the other half is added at the end.

4.1.2 Arbitrary shifts

More general shifts must be done by the user. In a more general setting, we are interested in the computation of FFTs with shifted index sets, i.e., assume $k_s, l_s \in \mathbb{Z}$ and compute

    $g_l = \sum_{k=-k_s}^{n_i-k_s-1} f_k \mathrm{e}^{-2\pi\mathrm{i} kl/n}, \quad l = -l_s, \ldots, n_o-l_s-1.$

Algorithm 4.1: Shifted FFT with explicit data movement.
1. For $k = 0, \ldots, n-1$ assign $\tilde f_k = f_{(k-k_s) \bmod n}$.
2. For $l = 0, \ldots, n-1$ compute $\hat g_l = \sum_{k=0}^{n-1} \tilde f_k \mathrm{e}^{-2\pi\mathrm{i} kl/n}$ using PFFT.
3. For $l = 0, \ldots, n_o-1$ assign $g_{l-l_s} = \hat g_{l \bmod n}$.

Because of the periodicity of the FFT, this can easily be performed by Algorithm 4.1. However, this involves explicit data movement, since the sequence of data changes. For our parallel data decomposition, the change of data layout requires data communication. A simple index shift results in the computation of

    $g_l = \sum_{k=-k_s}^{n_i-k_s-1} f_k \mathrm{e}^{-2\pi\mathrm{i} kl/n}
         = \sum_{k=0}^{n_i-1} f_{k-k_s} \mathrm{e}^{-2\pi\mathrm{i} (k-k_s)l/n}
         = \mathrm{e}^{2\pi\mathrm{i} k_s l/n} \sum_{k=0}^{n_i-1} f_{k-k_s} \mathrm{e}^{-2\pi\mathrm{i} kl/n}$

for all $l = -l_s, \ldots, n_o-l_s-1$. The resulting Algorithm 4.2 preserves the sequence of data at the price of some extra computation.
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);

Compute the data distribution of a parallel, real-input/complex-output discrete Fourier transform (DFT) in two or more dimensions, returning the number of complex numbers that must be allocated to hold the parallel transform. Arguments are the same as for c2c transforms (see Section 6.4.1) with the following exceptions:

- The logical input array size ni will differ from the physical array size of the real inputs if the flag PFFT_PADDED_R2C is included in pfft_flags. This results from the padding at the end of the last dimension that is necessary to align the real-valued inputs and complex-valued outputs for in-place transforms, see [7]. In contrast to FFTW-MPI, PFFT does not pad the r2c inputs per default.
- local_ni is counted in units of real numbers. It will include padding.
- local_i_start is counted in units of real numbers.

The corresponding local_block functions compute the local data distribution of the process with MPI rank pid:

    void pfft_local_block_dft_r2c_3d(
        const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);

    void pfft_local_block_dft_r2c(
        int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);
Use local_ni and local_i_start if you are interested in ghost cell communication of the input array. For ghost cell communication of the output array, substitute local_n and local_start by local_no and local_o_start.

6.7.1 Using Ghost Cell Plans

We introduce a new datatype pfft_gcplan that stores all the necessary information for ghost cell communication. Using a ghost cell plan follows the typical workflow: At first, determine the parallel data distribution, cf. Section 6.7.2. Next, create a ghost cell plan, cf. Section 6.7.4 and Section 6.7.5. Execute the ghost cell communication with one of the following two collective functions:

    void pfft_exchange(pfft_gcplan ths);
    void pfft_reduce(pfft_gcplan ths);

Hereby, a ghost cell exchange creates duplicates of local data elements on next-neighboring processes, while a ghost cell reduce is the adjoint counterpart of the exchange, i.e., it adds the sum of all the duplicates of a local data element to the original data element. Finally, free the allocated memory with

    void pfft_destroy_gcplan(pfft_gcplan ths);

if the plan is not needed anymore. Passing a freed plan to pfft_exchange or pfft_reduce results in undefined behavior.

6.7.2 Data Distribution

Corresponding to the three interface layers for FFT planning, there are the following three layers for computing the ghost cell data distribution:

    ptrdiff_t pfft_local_size_gc_3d(
        const ptrdiff_t *local_n, const ptrdiff_t *local_start,
        const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
        ptrdiff_t *local_ngc, ptrdiff_t *local_gc_start);
- PFFT does not have full SIMD support. All serial FFT computations and global communications are implemented with FFTW, which offers SIMD support. However, most of the PFFT-only features, such as pruned FFT, ghost cell send, and 3d decomposition of 3d FFTs, are not yet parallelized with SIMD.
- PFFT does not overlap communication and computation. The code of PFFT is built in a very modularized structure, and most of these modules consist of FFTW's routines. Therefore, the global transposition does not support non-blocking communication.
- Similar to FFTW, we do not provide any parallel IO routines. The user is responsible for load and store of parallel data.
- PFFT depends on FFTW to perform its serial transforms and does not support different vendor FFTs such as Intel's MKL or IBM's ESSL. However, this is not assumed to be a big drawback, since FFTW seems to perform very well on most platforms.
- The global communication routines cannot be called separately. However, it should be possible to implement a user interface to our global transposition routines.
- PFFT does not support GPU parallelization.

You are welcome to propose new PFFT features at https://github.com/mpip/pfft.

1.1 Alternative parallel FFT implementations

There have been several FFT implementations that aim to circumvent the scalability bottleneck for at least three-dimensional FFTs.
2.2 Porting FFTW-MPI based code to PFFT
2.3 Error code for communicator creation
2.4 Inplace transforms
2.5 Higher dimensional data decomposition
2.6 Parallel data decomposition
2.6.1 Non-transposed and transposed data layout
2.6.2 Three-dimensional FFTs with three-dimensional data decomposition
2.7 Planning effort
2.8 Preserving input data
2.9 FFTs with shifted index sets
2.10 Pruned FFT and Shifted Index Sets
2.10.1 Pruned FFT
2.10.2 Shifted Index Sets
2.11 Precisions
2.12 Ghost cell communication
2.13 Fortran interface

3 Installation and linking
3.1 Install of the latest official FFTW release
3.2 Install of the PFFT library
3.3 How to include PFFT in your program

4 Advanced Features
4.1 How to Deal with FFT Index Shifts in Parallel
4.1.1 Shift with half the FFT size
4.1.2 Arbitrary shifts
4.2 Parallel pruned FFT

5 Interface Layers of the PFFT Library
5.1 Basic Interface
5.2 Advanced Interface
    void pfft_init_input_real(
        int rnk_n, const ptrdiff_t *n,
        const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
        double *data);

Hereby, the arrays n, local_n and local_n_start give the size of the FFT, the local array size, and the local array offset, as computed by the array distribution functions described in Section 6.4. The functions

    double pfft_check_output_real_3d(
        const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
        const double *data, MPI_Comm comm);

    double pfft_check_output_real(
        int rnk_n, const ptrdiff_t *n,
        const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
        const double *data, MPI_Comm comm);

compute the error norm between the elements of array data and the values produced by pfft_init_input_real_3d / pfft_init_input_real. In addition, we supply the following functions for setting all the input data to zero at once:

    void pfft_clear_input_real_3d(
        const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
        double *data);

    void pfft_clear_input_real(
        int rnk_n, const ptrdiff_t *n,
        const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
        double *data);

Note that both pfft_init_input_real functions set all array elements whose index exceeds the logical FFT size n to zero. In addition, both pfft_check_output_real functions ignore all errors resulting from these elements.
the array plus ghost cells.

Note that the array distribution functions do not distinguish between real- and complex-valued data. That is because local_n and local_start count array elements in units of complex or real numbers, depending on the transform. In addition, it does not matter if the local array is transposed or not, i.e., it is not necessary to pass the flags PFFT_TRANSPOSED_IN and PFFT_TRANSPOSED_OUT to the ghost cell distribution functions. In contrast, the ghost cell plan creation depends on the transform type as well as on the transposition flags.

6.7.3 Memory Allocation

In most applications, we must ensure that the data array is large enough to suit the memory requirements of a parallel FFT and the ghost cell communication. The following two code snippets illustrate the correct allocation of memory for complex-valued and real-valued arrays.

    /* Get parameters of data distribution */
    /* (alloc_local, local_no, local_o_start are given in complex units) */
    /* (local_ni, local_i_start are given in complex units) */
    alloc_local = pfft_local_size_dft_3d(
        n, comm_cart_2d, PFFT_TRANSPOSED_NONE,
        local_ni, local_i_start, local_no, local_o_start);

    /* (alloc_local_gc, local_ngc, local_gc_start are given in complex units) */
    alloc_local_gc = pfft_local_size_gc_3d(
        local_ni, local_i_start, gc_below, gc_above,
        local_ngc, local_gc_start);

    /* Allocate enough memory for FFT and ghost cells */
    ptrdiff_t alloc_local;
    ptrdiff_t local_ni[3], local_i_start[3];
    ptrdiff_t local_no[3], local_o_start[3];
    pfft_complex *data;
    pfft_plan plan;
    MPI_Comm comm_cart_2d;

    MPI_Init(&argc, &argv);
    pfft_init();

    /* Create two-dimensional process grid of size np0 x np1 */
    pfft_create_procmesh_2d(MPI_COMM_WORLD, np0, np1, &comm_cart_2d);

    /* Get local data size and allocate */
    alloc_local = pfft_local_size_dft_3d(
        n, comm_cart_2d, PFFT_TRANSPOSED_NONE,
        local_ni, local_i_start, local_no, local_o_start);
    data = pfft_alloc_complex(alloc_local);

    /* Create plan for in-place forward DFT */
    plan = pfft_plan_dft_3d(n, data, data, comm_cart_2d,
        PFFT_FORWARD, PFFT_TRANSPOSED_NONE | PFFT_ESTIMATE);

    /* Initialize data to some function my_function(x, y, z) */
    for (ptrdiff_t k0 = 0; k0 < local_ni[0]; k0++)
      for (ptrdiff_t k1 = 0; k1 < local_ni[1]; k1++)
        for (ptrdiff_t k2 = 0; k2 < local_ni[2]; k2++)
          data[(k0*local_ni[1] + k1)*local_ni[2] + k2] =
              my_function(k0 + local_i_start[0],
                          k1 + local_i_start[1],
                          k2 + local_i_start[2]);

    /* Compute transforms, in-place, as many times as desired */
    pfft_execute(plan);

    pfft_destroy_plan(plan);
    MPI_Finalize();

2.3 Error code for communicator creation

As we have seen, the function

    int pfft_create_procmesh_2d(
        MPI_Comm comm, int np0, int np1,
        MPI_Comm *comm_cart_2d);

creates a two-dimensional, periodic, Cartesian communicator. The int return value is non-zero whenever the process mesh does not fit the size of comm.
Expand the PFFT_EXTERN declarations of pfft.h (substituting the ptrdiff_t, double and pfft_complex type macros) to generate the function reference of this manual.

8 ToDo

- PFFT_FORWARD is defined as FFTW_FORWARD, and FFTW_FORWARD is defined as -1. PFFT allows to choose between FFTW_FORWARD and FFTW_BACKWARD, which is not implemented by FFTW. Matlab uses the same sign convention, i.e., -1 for fftn and +1 for ifftn.

8.1 Measuring parallel run times

Use MPI_Barrier in front of every call to a pfft_ function to avoid unbalanced run times.

Acknowledgments

We are thankful to Yu Feng, who implemented the new-array execute and the clear-input functions.

Bibliography

[1] M. Frigo and S. G. Johnson. FFTW users manual. http://www.fftw.org/fftw3_doc/
[2] M. Frigo and S. G. Johnson. FFTW users manual: 2d MPI example. http://www.fftw.org/fftw3_doc/2d-MPI-example.html
[3] M. Frigo and S. G. Johnson. FFTW users manual: Complex numbers. http://www.fftw.org/fftw3_doc/Complex-numbers.html
[4] M. Frigo and S. G. Johnson. FFTW users manual: Memory allocation. http://www.fftw.org/fftw3_doc/Memory-Allocation.html
[5] M. Frigo and S. G. Johnson. FFTW users manual: MPI data distribution. http://www.fftw.org/fftw3_doc/MPI-Data-Distribution.html
[6] M. Frigo and S. G. Johnson. FFTW users manual: MPI initialization. http://www.fftw.org/fftw3_doc/
Feng for the new-array execute patch. Furthermore, we added some special features to support repeated tasks that often occur in practical applications of parallel FFTs:

- PFFT includes a very flexible ghost cell exchange module. A detailed description of this module is given in Section 6.7.
- PFFT accepts three-dimensional data decomposition even for three-dimensional FFTs. However, the underlying parallel FFT framework is still based on two-dimensional decomposition. A more detailed description can be found in Section 2.6.2.
- PFFT explicitly supports the parallel calculation of pruned FFTs. Details are given in Section 4.2.

Finally, we complete this overview with a list of features that are not yet implemented in PFFT:

- Parallel one-dimensional FFT based on MPI: FFTW-MPI uses another parallelization strategy for one-dimensional FFTs, which is not implemented in PFFT. The reason is that we cannot achieve a scalability benefit from higher-dimensional data decomposition if the FFT has only one dimension. Therefore, one can also call FFTW directly in this case.
- There is no equivalent of FFTW wisdom in PFFT, i.e., you cannot save a PFFT plan to disk and restore it for later use.
- PFFT does not have full OpenMP support. All serial FFT computations and global communications are implemented with FFTW, which offers OpenMP support. However, most of the PFFT-only features, such as pruned FFT, ghost cell send, and 3d decomposition of 3d FFTs, are not yet parallelized with OpenMP.
    MPI_Comm comm_cart_2d;

    /* Set size of FFT and process mesh */
    ptrdiff_t n[3];
    int np[2];
    n[0] = n[1] = n[2] = ...;
    np[0] = np[1] = ...;

    /* Initialize MPI and PFFT */
    MPI_Init(&argc, &argv);
    pfft_init();

    /* Create two-dimensional process grid of size np[0] x np[1] */
    pfft_create_procmesh_2d(MPI_COMM_WORLD, np[0], np[1], &comm_cart_2d);

    /* Get parameters of data distribution */
    alloc_local = pfft_local_size_dft_3d(
        n, comm_cart_2d, PFFT_TRANSPOSED_NONE,
        local_ni, local_i_start, local_no, local_o_start);

    /* Allocate memory */
    in  = pfft_alloc_complex(alloc_local);
    out = pfft_alloc_complex(alloc_local);

    /* Plan parallel forward FFT */
    plan = pfft_plan_dft_3d(n, in, out, comm_cart_2d,
        PFFT_FORWARD, PFFT_TRANSPOSED_NONE);

    /* Initialize input with random numbers */
    pfft_init_input_complex_3d(n, local_ni, local_i_start, in);

    /* Execute parallel forward FFT */
    pfft_execute(plan);

    /* Free mem and finalize MPI */
    pfft_destroy_plan(plan);
    MPI_Comm_free(&comm_cart_2d);
    pfft_free(in); pfft_free(out);
    MPI_Finalize();
    return 0;

        ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
        ptrdiff_t *local_no, ptrdiff_t *local_o_start);

Hereby, n, local_ni, local_i_start, local_no, local_o_start are arrays of length 3 that must be allocated. The return value of this function equals the number of complex numbers that must be allocated to hold the local data.
PFFT User Manual
for version 1.0.7-alpha, December 2, 2014

Michael Pippig
Technische Universität Chemnitz, Faculty of Mathematics
09107 Chemnitz, Germany
michael.pippig@mathematik.tu-chemnitz.de

Download the Parallel Fast Fourier Transform Software Library at
www.tu-chemnitz.de/~mpip/software.php
https://github.com/mpip/pfft

This work was supported by the BMBF grant 011H08001B from 01.01.2010 until 31.03.2013.

Todo list

- finish FFTW2PFFT porting example
- Describe shifted input and output
- Describe pruned FFT with shifted input and output
- explain ghost cell communication with a test file
- explain F03 interface with a test file
- this flag can be used for local_size and planning
- implement getters and setters for pfft timer
- Does anybody need non-3d ghost cell communication?
- Does anybody need r2c ghost cell communication with correct boundary conditions?
- explain PFFT_GC_SENDRECV and PFFT_GC_RMA
- Do we need getters and setters for ghost cell timers?

Contents

1 Introduction
1.1 Alternative parallel FFT implementations
1.2 Parallel nonequispaced FFT

2 Tutorial
2.1 A first parallel transform: Three-dimensional FFT with two-dimensional data decomposition
alues you can use one of the functions

void pfft_init_input_complex_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);
void pfft_init_input_complex(
    int rnk_n, const ptrdiff_t *n,
    const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);

Hereby, the arrays n, local_n and local_n_start of length rnk_n (rnk_n = 3 for 3d) give the size of the FFT, the local array size and the local array offset as computed by the array distribution functions described in Section 6.4. The functions

double pfft_check_output_complex_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    const pfft_complex *data, MPI_Comm comm);
double pfft_check_output_complex(
    int rnk_n, const ptrdiff_t *n,
    const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    const pfft_complex *data, MPI_Comm comm);

compute the norm between the elements of array data and the values produced by pfft_init_input_complex_3d / pfft_init_input_complex. In addition, we supply the following functions for setting all the input data to zero at once:

void pfft_clear_input_complex_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);

6 PFFT Reference

void pfft_clear_input_complex(
    int rnk_n, const ptrdiff_t *n,
    const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);
atlab. However, with distributed memory these fftshift operations require more complex data movements and result in a global communication. For example, the first index of the array moves to the middle, and therefore the corresponding data move to another MPI process. Fortunately, this communication can be avoided at the cost of a little extra computation. At the end of the section we present two PFFT library functions that perform the necessary pre- and postprocessing for shifted input and output index sets.

4.1.1 Shift with half the FFT size

The special case of input shift ks = n/2 and/or output shift ls = n/2 is supported by PFFT. The user can choose to shift the input (PFFT_SHIFTED_IN) and/or to shift the output (PFFT_SHIFTED_OUT). Here we are interested in the computation of

    g_l = sum_{k = -ni/2}^{ni/2 - 1} ghat_k e^{-2 pi i k l / n},   l = -no/2, ..., no/2 - 1,

with n, ni, no in 2N and n >= ni, n >= no. With an index shift of n/2 both in k and l this is equivalent to the computation of

    g_{l - n/2} = sum_{k = 0}^{n - 1} ghat_{k - n/2} e^{-2 pi i (k - n/2)(l - n/2) / n}
                = e^{-pi i n/2} e^{pi i l} sum_{k = 0}^{n - 1} ( e^{pi i k} ghat_{k - n/2} ) e^{-2 pi i k l / n}

for l = 0, ..., n - 1, since e^{-2 pi i (k - n/2)(l - n/2)/n} = e^{-2 pi i k l / n} e^{pi i k} e^{pi i l} e^{-pi i n/2}. Therefore, we get the following algorithm: multiply the inputs with the twiddle factors e^{pi i k} = (-1)^k, compute an unshifted FFT, and multiply the outputs with (-1)^{n/2} (-1)^l. The special case ks = ls = n/2 corresponds to the shifts that FFTSHIFT applies to the arrays [1]. For
ays. Assume a three-dimensional array data of size n[3] that is distributed in blocks, such that each process has a local copy of data[k[0], k[1], k[2]] with

    local_start[t] <= k[t] < local_start[t] + local_n[t].

Here and in the following we assume t = 0, 1, 2. The classical ghost cell exchange communicates all the necessary data between neighboring processes such that each process gets a local copy of data[k[0], k[1], k[2]] with

    local_gc_start[t] <= k[t] < local_gc_start[t] + local_ngc[t],

where

    local_gc_start[t] = local_start[t] - gc_below[t],
    local_ngc[t]      = local_n[t] + gc_below[t] + gc_above[t],

i.e., the local array block is increased in every dimension by gc_below[t] elements below and gc_above[t] elements above. Hereby, the data is wrapped periodically whenever k[t] exceeds the array dimensions. The number of ghost cells in every dimension can be chosen independently and can be arbitrarily large, i.e., PFFT ghost cell communication also handles the case where the requested data exceeds next-neighbor communication. The number of ghost cells can even be bigger than the array size, which results in multiple local copies of the same data elements at every process. However, the arrays gc_below and gc_above must be equal among all MPI processes. PFFT ghost cell communication can work on both the input and output array distributions. Substitute local_n and local_start by local_ni and
block_many_dft_r2c(
    int rnk_n, const ptrdiff_t *ni, const ptrdiff_t *no,
    const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);

6.4.3 Complex to Real FFT

ptrdiff_t pfft_local_size_dft_c2r_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t pfft_local_size_dft_c2r(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t pfft_local_size_many_dft_c2r(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);

Compute the data distribution of a parallel complex-input, real-output discrete Fourier transform (DFT) in two or more dimensions, returning the number of complex numbers that must be allocate
    const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
    ptrdiff_t *local_ngc, ptrdiff_t *local_gc_start);
ptrdiff_t pfft_local_size_gc(
    int rnk_n, const ptrdiff_t *local_n, const ptrdiff_t *local_start,
    const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
    ptrdiff_t *local_ngc, ptrdiff_t *local_gc_start);
ptrdiff_t pfft_local_size_many_gc(
    int rnk_n, const ptrdiff_t *local_n, const ptrdiff_t *local_start,
    ptrdiff_t howmany,
    const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
    ptrdiff_t *local_ngc, ptrdiff_t *local_gc_start);

Hereby, rnk_n and howmany must be exactly the same variables that were used for the PFFT plan creation. However, only the case rnk_n = 3 is completely implemented at the moment. The local array size local_n must be equal to local_ni or local_no as computed by an appropriate call of pfft_local_size (cf. Section 6.4), depending on whether the ghost cell plan works on the FFT input or output array. Analogously, local_start becomes local_i_start or local_o_start. The number of ghost cells is given by the two arrays gc_below and gc_above that must be equal among all MPI processes. All the ghost cell data distribution functions return the local array plus ghost cell size local_ngc and the corresponding offset local_gc_start as two arrays of length rnk_n. In addition, the ptrdiff_t return value gives the number of data elements that are necessary in order to store
    const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    pfft_complex *in, pfft_complex *out, MPI_Comm comm_cart,
    int sign, unsigned pfft_flags);
pfft_plan pfft_plan_many_dft_skipped(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    const int *skip_trafos, pfft_complex *in, pfft_complex *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);

Plan a parallel complex-input/output discrete Fourier transform (DFT) in two or more dimensions, returning a pfft_plan. The planner returns NULL if the plan cannot be created. Arguments:

- rnk_n, n, ni, no, howmany, iblock, oblock, comm_cart must be the same as passed to the corresponding pfft_local_size_dft function, see Section 6.4.1.
- The array skip_trafos of size rnk_pm + 1 specifies the serial transforms that will be omitted. For t = 0, ..., rnk_pm set skip_trafos[t] = 1 if the t-th serial transformation should be computed, otherwise set skip_trafos[t] = 0; see Section 5.3 for more details.
- in and out point to the complex-valued input and output arrays of the transform, which may be the same, yielding an in-place transform. These arrays are overwritten during planning, unless PFFT_ESTIMATE | PFFT_NO_TUNE is used in the flags. The arrays need not be initialized, but they must be allocated.
- sign is the sign of the exponent in the formula that defines the Fourier transform.
corded run times with

void pfft_reset_timer(pfft_plan ths);

This function is called per default at the end of every PFFT plan creation function.

6.6.2 Advanced Timer Manipulation

In order to access the run times directly, a new typedef pfft_timer is introduced. The following function returns a copy of the timer corresponding to a PFFT plan ths:

pfft_timer pfft_get_timer(const pfft_plan ths);

Note that the memory of the returned pfft_timer must be released with

void pfft_destroy_timer(pfft_timer ths);

as soon as the timer is not needed anymore. In the following we introduce some routines to perform basic operations on timers. For all functions with a pfft_timer return value you must use pfft_destroy_timer in order to release the allocated memory of the timer. Create a copy of a PFFT timer orig with

pfft_timer pfft_copy_timer(const pfft_timer orig);

Compute the average local time over all runs of a timer ths with

void pfft_average_timer(pfft_timer ths);

Create a new timer that contains the sum of two timers sum1 and sum2 with

pfft_timer pfft_add_timers(const pfft_timer sum1, const pfft_timer sum2);

Create a timer that contains the maximum times of all the timers ths from all processes belonging to communicator comm with

pfft_timer pfft_reduce_max_timer(const pfft_timer ths, MPI_Comm comm);

Since this function calls MPI_Reduce, only the first process (rank 0) of comm will get the desir
d to hold the parallel transform. Arguments are the same as for c2c transforms (see Section 6.4.1) with the following exceptions:

- The logical output array size no will differ from the physical array size of the real outputs if the flag PFFT_PADDED_C2R is included in pfft_flags. This results from the padding at the end of the last dimension that is necessary to align the real-valued outputs and complex-valued inputs for in-place transforms, see [7]. In contrast to FFTW-MPI, PFFT does not pad the c2r outputs per default.
- local_no is counted in units of real numbers.
- local_o_start is counted in units of real numbers.

The corresponding local_block functions compute the local data distribution of the process with MPI rank pid:

void pfft_local_block_dft_c2r_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_dft_c2r(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_many_dft_c2r(
    int rnk_n, const ptrdiff_t *ni, const ptrdiff_t *no,
    const ptrdiff_t *iblock, const ptrd
diff_t *vec);

Returns the product over all elements of vec.

ptrdiff_t pfft_sum_INT(int d, const ptrdiff_t *vec);

Returns the sum over all elements of vec.

int pfft_equal_INT(int d, const ptrdiff_t *vec1, const ptrdiff_t *vec2);

Returns 1 if both arrays have equal entries, 0 otherwise.

void pfft_vcopy_INT(int d, const ptrdiff_t *vec1, ptrdiff_t *vec2);

Copies the elements of vec1 into vec2.

void pfft_vadd_INT(int d, const ptrdiff_t *vec1, const ptrdiff_t *vec2, ptrdiff_t *sum);

Fills sum with the componentwise sum of vec1 and vec2.

void pfft_vsub_INT(int d, const ptrdiff_t *vec1, const ptrdiff_t *vec2, ptrdiff_t *sum);

Fills sum with the componentwise difference of vec1 and vec2.

6.8.5 Print Three-Dimensional Arrays in Parallel

Use the following routines to print the elements of a block-decomposed, three-dimensional, real or complex valued array data in a nicely formatted way:

void pfft_apr_real_3d(
    const double *data, const ptrdiff_t *local_n, const ptrdiff_t *local_start,
    const char *name, MPI_Comm comm);
void pfft_apr_complex_3d(
    const pfft_complex *data, const ptrdiff_t *local_n, const ptrdiff_t *local_start,
    const char *name, MPI_Comm comm);

Obviously, this only makes sense for arrays of moderate size. The block decomposition is given by local_n, local_start as returned by the array distribution functions described in Section 6.4. Furthermore, some arbitrary string name can be added at the beginni
doc/MPI-Initialization.html#MPI-Initialization

[7] M. Frigo and S. G. Johnson. FFTW users manual: Real-data DFT Array Format. http://www.fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format
[8] M. Frigo and S. G. Johnson. FFTW users manual: Precision. http://www.fftw.org/fftw3_doc/Precision.html#Precision
[9] M. Frigo and S. G. Johnson. FFTW users manual: SIMD alignment and fftw_malloc. http://www.fftw.org/fftw3_doc/SIMD-alignment-and-fftw_005fmalloc.html#SIMD-alignment-and-fftw_005fmalloc
[10] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93:216-231, 2005.
[11] M. Frigo and S. G. Johnson. FFTW, C subroutine library. http://www.fftw.org, 2009.

Bibliography

[12] W. Gropp, E. Lusk, and R. Thakur. Using MPI-2: Advanced Features of the Message-Passing Interface. MIT Press, Cambridge, MA, USA, 1999.
[13] N. Li. 2DECOMP&FFT, Parallel FFT subroutine library. http://www.2decomp.org
[14] N. Li and S. Laizet. 2DECOMP&FFT - A Highly Scalable 2D Decomposition Library and FFT Interface. In Cray User Group 2010 conference, pp. 1-13, Edinburgh, Scotland, 2010.
[15] MPI Forum. MPI: A Message-Passing Interface Standard, Version 2.2, 2009. http://www.mpi-forum.org
[16] D. Pekurovsky. P3DFFT, Parallel FFT subroutine library. http://code.google.com/p/p3dfft
[17] D. Pe
duces the concept of communicators to store all the topological information of the physical process layout. PFFT requires to be called on a process mesh that corresponds to a periodic, Cartesian communicator. We assist the user in creating such a communicator with the following routine:

int pfft_create_procmesh_2d(
    MPI_Comm comm, int np0, int np1,
    MPI_Comm *comm_cart_2d);

This routine uses the processes within the communicator comm to create a two-dimensional process grid of size np0 x np1 and stores it into the Cartesian communicator comm_cart_2d. Note that comm_cart_2d is allocated by the routine and must be freed with MPI_Comm_free after usage. The input parameter comm is a communicator, indicating which processes will participate in the transform. Choosing comm as MPI_COMM_WORLD implies that the FFT is computed on all available processes. At the next step, we need to know the data decomposition of the input and output arrays, which depends on the array sizes, the process grid and the chosen parallel algorithm. Therefore, we call

ptrdiff_t pfft_local_size_dft_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart_2d, unsigned pfft_flags,

Listing 2.1: Minimal parallel c2c FFT test program.

#include <pfft.h>

int main(int argc, char **argv)
{
  int np[2];
  ptrdiff_t n[3];
  ptrdiff_t alloc_local;
  ptrdiff_t local_ni[3], local_i_start[3];
  ptrdiff_t local_no[3], local_o_start[3];
  pfft_complex *in, *out;
  pfft_plan plan = NULL;
  MP
ed data, while all the other processes have timers with undefined values. Note that you cannot access the elements of a timer directly, since it is only a pointer to a struct. However, PFFT offers a routine that creates an array and copies all the entries of the timer into it:

double *pfft_convert_timer2vec(const pfft_timer ths);

Remember to use free in order to release the allocated memory of the returned array at the moment it is not needed anymore. The entries of the returned array are ordered as follows:

- dimension of the process mesh (rnk_pm)
- number of serial trafos (rnk_trafo)
- number of global remaps (rnk_remap)
- number of pfft_execute runs (iter)
- local run time of all runs
- rnk_trafo local times of the serial trafos
- rnk_remap local times of the global remaps
- 2 times of the global remaps that are only necessary for three-dimensional FFTs on three-dimensional process meshes
- time for computing twiddled input as needed for PFFT_SHIFTED_OUT
- time for computing twiddled output as needed for PFFT_SHIFTED_IN

The complementary function

pfft_timer pfft_convert_vec2timer(const double *times);

creates a timer and fills its entries with the data from array times. Thereby, the entries of times must be in the same order as above.

6.7 Ghost Cell Communication

In the following we describe the PFFT ghost cell communication module. At the moment, PFFT ghost cell communication is restricted to three-dimensional arr
ements. Therefore, it is safe to use all these functions for a consistency check of a r2c transform followed by a c2r transform, since all padding elements will be ignored.

6.8.3 Initializing r2c/c2r Inputs and Checking Outputs

The real inputs of a r2c transform can be initialized with the functions described in Section 6.8.2. However, generating suitable inputs for a c2r transform requires more caution. In order to get real-valued results of a DFT, the complex input coefficients need to satisfy a radial Hermitian symmetry, i.e., X(k) = conj(X(-k)). We use the following trick to generate the complex input values for c2r transforms. Assume any N-periodic, complex-valued function f. It can be easily shown that the values X(k) := (f(k) + conj(f(-k))) / 2 satisfy the radial Hermitian symmetry. To fill a complex array data with reproducible complex values that fulfill the radial Hermitian symmetry, use one of the functions

void pfft_init_input_complex_hermitian_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);
void pfft_init_input_complex_hermitian(
    int rnk_n, const ptrdiff_t *n,
    const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);

Hereby, the arrays n, local_n and local_n_start give the size of the FFT, the local array size and the local array offset as computed by the array distribution functions described in Section 6.4. The functions d
be created. Arguments:

- rnk_n, n, ni, no, howmany, iblock, oblock, comm_cart must be the same as passed to the corresponding pfft_local_size_dft_r2c function, see Section 6.4.2.
- in and out point to the real-valued input and complex-valued output arrays of the transform, which may be the same, yielding an in-place transform. These arrays are overwritten during planning, unless PFFT_ESTIMATE | PFFT_NO_TUNE is used in the flags. The arrays need not be initialized, but they must be allocated.
- sign is the sign of the exponent in the formula that defines the Fourier transform. It can be -1 (PFFT_FORWARD) or +1 (PFFT_BACKWARD). Note that this parameter is not part of the FFTW-MPI interface, where r2c transforms are defined to be forward transforms. However, the backward transform can be easily realized by an additional conjugation of the complex outputs, as done by PFFT.

6.5.3 Complex to Real FFT

pfft_plan pfft_plan_dft_c2r_3d(
    const ptrdiff_t *n, pfft_complex *in, double *out, MPI_Comm comm_cart,
    int sign, unsigned pfft_flags);
pfft_plan pfft_plan_dft_c2r(
    int rnk_n, const ptrdiff_t *n, pfft_complex *in, double *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);
pfft_plan pfft_plan_many_dft_c2r(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    pfft_complex *in, double *out, MPI_Comm comm_cart, int s
fft_plan pfft_plan_r2r_3d(
    const ptrdiff_t *n, double *in, double *out, MPI_Comm comm_cart,
    const pfft_r2r_kind *kinds, unsigned pfft_flags);
pfft_plan pfft_plan_r2r(
    int rnk_n, const ptrdiff_t *n, double *in, double *out,
    MPI_Comm comm_cart, const pfft_r2r_kind *kinds, unsigned pfft_flags);
pfft_plan pfft_plan_many_r2r(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    double *in, double *out, MPI_Comm comm_cart,
    const pfft_r2r_kind *kinds, unsigned pfft_flags);
pfft_plan pfft_plan_many_r2r_skipped(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    const int *skip_trafos, double *in, double *out,
    MPI_Comm comm_cart, const pfft_r2r_kind *kinds, unsigned pfft_flags);

Plan a parallel real-input/output (r2r) transform in two or more dimensions, returning a pfft_plan. The planner returns NULL if the plan cannot be created. Arguments:

- rnk_n, n, ni, no, howmany, iblock, oblock, comm_cart must be the same as passed to the corresponding pfft_local_size_r2r function, see Section 6.4.4.
- in and out point to the real-valued input and output arrays of the transform, which may be the same, yielding an in-place transform. These arrays are overwritten during planning, unless PFFT_ESTIMATE | PFFT_NO_TUNE is u
flag FFTW_PRESERVE_INPUT and ensures that the input values are not overwritten. In fact, this flag implies that only the first serial transform is executed out-of-place and all successive steps are performed in-place on the output array. In compliance to FFTW, this is the default behavior for out-of-place plans. The second flag behaves analogously to the FFTW flag FFTW_DESTROY_INPUT and tells the planner that the input array can be used as a scratch array. This may give some speedup for out-of-place plans, because all the intermediate transforms and transposition steps can be performed out-of-place. Finally, the flag PFFT_BUFFERED_INPLACE can be used for out-of-place plans that store their inputs and outputs in the same array, i.e., array out is used for intermediate out-of-place transforms and transpositions, but the PFFT inputs and outputs are stored in array in.

2.9 FFTs with shifted index sets

- PFFT_SHIFTED_IN
- PFFT_SHIFTED_OUT

2.10 Pruned FFT and Shifted Index Sets

2.10.1 Pruned FFT

Pruned r2r and c2c FFTs are defined as

    ghat_l = sum_{k=0}^{ni-1} g_k e^{-2 pi i k l / n},   l = 0, ..., no - 1,

where ni <= n and no <= n.

2.10.2 Shifted Index Sets

For N in 2N we define the FFT with shifted inputs. For K, L in 2N, K <= N, L <= N, we define

2.11 Precisions

PFFT handles multiple precisions exactly in the same way as FFTW. Therefore, we quote part [8] of the FFTW manual in the c
ging to communicator comm with

pfft_gctimer pfft_reduce_max_gctimer(const pfft_gctimer ths, MPI_Comm comm);

Since this function calls MPI_Reduce, only the first process (rank 0) of comm will get the desired data, while all the other processes have timers with undefined values. Note that you cannot access the elements of a timer directly, since it is only a pointer to a struct. However, PFFT offers a routine that copies all the entries of the timer into an array:

void pfft_convert_gctimer2vec(const pfft_gctimer ths, double *times);

Remember to use free in order to release the allocated memory of the array at the moment it is not needed anymore. The entries of the array are ordered as follows:

- number of pfft_execute runs (iter)
- local run time of all runs
- local run time of zero padding (make room for incoming ghost cells and init with zeros)
- local run time of the ghost cell exchange or reduce, depending on the timer

The complementary function

pfft_gctimer pfft_convert_vec2gctimer(const double *times);

creates a timer and fills its entries with the data from array times. Thereby, the entries of times must be in the same order as above.

6.8 Useful Tools

The following functions are useful tools, but are not necessarily needed to perform parallel FFTs.

6.8.1 Initializing Complex Inputs and Checking Outputs

To fill a complex array data with reproducible complex v
h-fftw3-includedir=$FFTWINC --with-fftw3-libdir=$FFTWLIB

At the end, this is equivalent to

./configure CPPFLAGS=-I$FFTWINC LDFLAGS=-L$FFTWLIB

which is more common to experienced users of the Autotools. To install PFFT in a user-specified directory $PFFTINSTDIR, call configure with the option

./configure --prefix=$PFFTINSTDIR

However, this option is mandatory whenever you do not have root permissions on your machine, since the default install paths of configure are not accessible by standard users. The PFFT library must be built with an MPI compiler. In Section 3.1 we already described how to hand the right compilers to the configure script. Some more options are:

- --enable-float: Produces a single-precision version of PFFT (float) instead of the default double precision (double), see Section 2.11.
- --enable-long-double: Produces a long-double precision version of PFFT (long double) instead of the default double precision (double), see Section 2.11.
- --disable-fortran: Disables inclusion of the Fortran wrapper routines in the standard PFFT libraries.
- --disable-tests: Disables build of the test programs.

For more details on the options of the configure script, call ./configure --help.

3.3 How to include PFFT in your program

All programs using PFFT should include its header file:

#include <pfft.h>

3 Installation and linking

This header includes the FFTW headers fftw3.h and fftw3-mpi.h automatically. Make sure that the compiler can find them b
h planning flag PFFT_TRANSPOSED_OUT produces outputs with transposed data decomposition. Analogously, a PFFT plan created with planning flag PFFT_TRANSPOSED_IN requires its inputs to be decomposed in the transposed way. Typically, one creates a forward plan with PFFT_TRANSPOSED_OUT and a backward plan with planning flag PFFT_TRANSPOSED_IN. Note that the flags PFFT_TRANSPOSED_OUT and PFFT_TRANSPOSED_IN must be passed to the array distribution function (see Section 6.4) as well as to the planner (see Section 6.5).

2.6.2 Three-dimensional FFTs with three-dimensional data decomposition

Many applications work with three-dimensional block decompositions of three-dimensional arrays. PFFT supports decompositions of the kind

    n0/P0 x n1/P1 x n2/P2 x h.

However, PFFT applies parallel algorithms that need at least one non-distributed transform dimension (we do not transform along h). Therefore, we split the number of processes along the last dimension into two factors P2 = Q0 Q1, remap the data to the two-dimensional decomposition

    n0/(P0 Q0) x n1/(P1 Q1) x n2 x h,

and compute the parallel FFT with this two-dimensional decomposition. Note that the 3d-to-2d remap implies some very special restrictions on the block sizes for n0 and n1, i.e., the blocks must be divisible by Q0 and Q1. More precisely, the default blocks of the 2d decomposition are given by (n0/P0)/Q0 and (n1/P1)/Q1, both divisions
iff_t *oblock, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);

6.4.4 Real to Real FFT

ptrdiff_t pfft_local_size_r2r_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t pfft_local_size_r2r(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t pfft_local_size_many_r2r(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);

Compute the data distribution of a parallel real-input/output transform in two or more dimensions, returning the number of real numbers that must be allocated to hold the parallel transform. Arguments are the same as for c2c transforms (see Section 6.4.1) with the following exceptions:

- local_ni is counted in units of real numbers.
- local_i_start is counted in units of real numbers.
- local_no i
ifies the output array dimensions. They can be any positive integers with no[t] <= n[t] for all dimensions t = 0, ..., rnk_n - 1. For no[t] < n[t] the outputs will be pruned to size no[t] along the t-th dimension after the transform, see Section 2.10.

- howmany is the number of transforms to compute. The resulting plan computes howmany transforms, where the input of the k-th transform is at location in+k (in C pointer arithmetic) with stride howmany, and its output is at location out+k with stride howmany. The basic pfft_plan_dft interface corresponds to howmany = 1.
- comm_cart is a Cartesian communicator of dimension rnk_pm that specifies the parallel data decomposition, see Section 6.9. Most of the time, PFFT requires rnk_pm < rnk_n. The only exception is the case rnk_pm = rnk_n = 3, see Section 2.6.2. If an ordinary (i.e., non-Cartesian) communicator is passed, PFFT internally converts it into a one-dimensional Cartesian communicator while retaining the MPI ranks; this results in the FFTW-MPI data decomposition.
- The arrays iblock and oblock of size rnk_pm + 1 specify the block sizes for the first rnk_pm + 1 dimensions of the input and output data, respectively. These must be the same block sizes as were passed to the corresponding local_size function. You can pass PFFT_DEFAULT_BLOCKS to use PFFT's default block sizes. Furthermore, you can use PFFT_DEFAULT_BLOCK to set the default block size in separate
ign, unsigned pfft_flags);
pfft_plan pfft_plan_many_dft_c2r_skipped(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    const int *skip_trafos, pfft_complex *in, double *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);

Plan a parallel complex-input, real-output discrete Fourier transform (DFT) in two or more dimensions, returning a pfft_plan. The planner returns NULL if the plan cannot be created. Arguments:

- rnk_n, n, ni, no, howmany, iblock, oblock, comm_cart must be the same as passed to the corresponding pfft_local_size_dft_c2r function, see Section 6.4.3.
- in and out point to the complex-valued input and real-valued output arrays of the transform, which may be the same, yielding an in-place transform. These arrays are overwritten during planning, unless PFFT_ESTIMATE | PFFT_NO_TUNE is used in the flags. The arrays need not be initialized, but they must be allocated.
- sign is the sign of the exponent in the formula that defines the Fourier transform. It can be -1 (PFFT_FORWARD) or +1 (PFFT_BACKWARD). Note that this parameter is not part of the FFTW-MPI interface, where c2r transforms are defined to be backward transforms. However, the forward transform can be easily realized by an additional conjugation of the complex inputs, as done by PFFT.

6.5.4 Real to Real FFT

p
itted to the FFTW developers but are not yet included in the official FFTW releases.

wget http://www.fftw.org/fftw-3.3.3.tar.gz
tar xzvf fftw-3.3.3.tar.gz
cd fftw-3.3.3
./configure --enable-mpi --prefix=$HOME/local/fftw3_mpi
make
make install

The MPI algorithms of FFTW must be built with an MPI C compiler. Add the statement MPICC=$MPICCOMP at the end of line 4 if the configure script fails to determine the right MPI C compiler $MPICCOMP. Similarly, the MPI Fortran compiler $MPIFCOMP is set by MPIFC=$MPIFCOMP.

3.2 Install of the PFFT library

In the simplest case, the hardware platform and the FFTW-3.3.3 library are recognized by the PFFT configure script automatically, so all we have to do is

wget http://www.tu-chemnitz.de/~mpip/software/pfft-1.0.7-alpha.tar.gz
tar xzvf pfft-1.0.7-alpha.tar.gz
cd pfft-1.0.7-alpha
./configure
make
make check
make install

Hereby, the optional call make check builds the test programs. If the FFTW-3.3.3 software library is already installed on your system but not found by the configure script, you can provide the FFTW installation directory $FFTWDIR to configure by

./configure --with-fftw3=$FFTWDIR

This call implies that the FFTW header files are located in $FFTWDIR/include and the FFTW library files are located in $FFTWDIR/lib. Otherwise, one should specify the FFTW include path $FFTWINC and the FFTW library path $FFTWLIB separately by

./configure --wit
kurovsky. P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions. SIAM J. Sci. Comput., 34:C192-C209, 2012.
[18] M. Pippig. PFFT, Parallel FFT subroutine library, 2011. http://www.tu-chemnitz.de/~mpip/software.php
[19] M. Pippig. PNFFT, Parallel Nonequispaced FFT subroutine library, 2011. http://www.tu-chemnitz.de/~mpip/software.php
[20] M. Pippig. PFFT - An extension of FFTW to massively parallel architectures. SIAM J. Sci. Comput., 35:C213-C236, 2013.
[21] M. Pippig and D. Potts. Parallel three-dimensional nonequispaced fast Fourier transforms and their application to particle simulation. SIAM J. Sci. Comput., accepted, 2013.
[22] S. J. Plimpton. Parallel FFT subroutine library. http://www.sandia.gov/~sjplimp/docs/fft/README.html
[23] S. J. Plimpton, R. Pollock, and M. Stevens. Particle-Mesh Ewald and rRESPA for Parallel Molecular Dynamics Simulations. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, 1997. SIAM, Philadelphia, 1997.
l FFTs by using the two-dimensional decomposition approach. However, these implementations are often fitted to special problems and were not published as stand-alone software libraries. Remarkable exceptions are the parallel FFT software library by S. Plimpton [23, 22], the P3DFFT software library by D. Pekurovsky [17, 16], and the 2DECOMP&FFT software library by N. Li [14, 13].

1.2 Parallel nonequispaced FFT

If you are interested in a parallel implementation of nonequispaced fast Fourier transforms (NFFT) for distributed memory architectures, you should have a look at our PNFFT software library [19, 21] that is also available at https://github.com/mpip/pnfft.git.

2 Tutorial

The following chapter describes the usage of the PFFT library with the example of a simple test file in the first section, followed by the more advanced features of PFFT in the next sections.

2.1 A first parallel transform - Three-dimensional FFT with two-dimensional data decomposition

We explain the basic steps for computing a parallel FFT with the PFFT library using the example of the short test program given by Listing 2.1. This test computes a three-dimensional c2c FFT on a two-dimensional process mesh. The source code manual_c2c_3d.c can be found in the directory tests of the library's source code tree. After initializing MPI with MPI_Init and before calling any other PFFT routine, initialize the parallel FFT computations via

void pfft_init(void);

MPI intro
  /* get local data size and allocate */
  alloc_local = pfft_local_size_dft_3d(n, MPI_COMM_WORLD, PFFT_TRANSPOSED_NONE,
      local_ni, local_i_start, local_no, local_o_start);
  data = pfft_alloc_complex(alloc_local);

  /* create plan for in-place forward DFT */
  plan = pfft_plan_dft_3d(n, data, data, MPI_COMM_WORLD,
      PFFT_FORWARD, PFFT_ESTIMATE);

  /* initialize data to some function my_function(x,y,z) */
  ptrdiff_t m = 0;
  for(ptrdiff_t k0 = 0; k0 < local_ni[0]; ++k0)
    for(ptrdiff_t k1 = 0; k1 < local_ni[1]; ++k1)
      for(ptrdiff_t k2 = 0; k2 < local_ni[2]; ++k2)
        data[m++] = my_function(local_i_start[0] + k0,
                                local_i_start[1] + k1,
                                local_i_start[2] + k2);

  /* compute transforms in place as many times as desired */
  pfft_execute(plan);

  pfft_destroy_plan(plan);
  MPI_Finalize();

In comparison to the FFTW program, the following changes were necessary:
- substitute fftw3-mpi.h by pfft.h,
- substitute all prefixes fftw_ and fftw_mpi_ by pfft_,
- substitute all prefixes FFTW_ by PFFT_,
- the integers n0, n1, n2, local_n0, local_0_start become arrays of length 3,
- pfft_local_size_dft_3d has the additional input pfft_flags and the additional outputs local_no and local_o_start,
- the loop that initializes the data becomes split along all three dimensions (alternatively, we could also use pfft_init_input_complex_3d).

Now, the changes in order to use a two-dimensional process mesh are marginal, as can be seen in Listing 2.3:

#include <pfft.h>

int main(int argc, char **argv)
{
  const ptrdiff_t n[3] = ...;
  const int np0 = ...;
  pfft_plan plan;
  pfft_complex
PFFT. This header automatically includes fftw3.h and fftw3-mpi.h. Therefore, PFFT can use the fftw_complex data type defined in fftw3.h; see [3]. Note that fftw_complex is defined to be the C99 native complex whenever complex.h is included before fftw3.h, fftw3-mpi.h and pfft.h. Otherwise, it is defined as

typedef double fftw_complex[2];

For the sake of a clean namespace, we define the wrapper data type pfft_complex as

typedef fftw_complex pfft_complex;

that can be used equivalently to fftw_complex. Furthermore, we define the wrapper functions

void *pfft_malloc(size_t n);
double *pfft_alloc_real(size_t n);
pfft_complex *pfft_alloc_complex(size_t n);
void pfft_free(void *p);

as substitutes for their corresponding FFTW equivalents; see [4]. Note that memory allocated by one of these functions must be freed with pfft_free or its equivalent fftw_free. Because of the performance reasons given in [9], we recommend using one of the pfft_ (or equivalent fftw_) allocation functions for all arrays containing FFT inputs and outputs. However, PFFT will also work (possibly slower) with any other memory allocation method. Different precisions are handled as in FFTW. That is, pfft_ functions and datatypes become pfftf_ (single precision) or pfftl_ (long double precision) prefixed. Quadruple precision is not yet supported. The main problem is that we do not know about a suitable MPI datatype to represent __float128.
    pfft_complex *out, MPI_Comm comm_cart, int sign, unsigned pfft_flags);

Therefore, n, local_ni, local_i_start, local_no and local_o_start become arrays of length rnk_n.

5.2 Advanced Interface

The advanced interface introduces the arrays ni and no of length rnk_n that give the pruned FFT input and output size. Furthermore, the arrays iblock and oblock of length rnk_pm (rnk_pm being the dimension of the process mesh) serve to adjust the block size of the input and output block decomposition. The additional parameter howmany gives the number of transforms that will be computed simultaneously.

ptrdiff_t pfft_local_size_many_dft(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_many_dft(
    int rnk_n, const ptrdiff_t *ni, const ptrdiff_t *no,
    const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
pfft_plan pfft_plan_many_dft(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    pfft_complex *in, pfft_complex *out, MPI_Comm comm_cart, int sign,
    unsigned pfft_flags);

5.3 Preliminary: Skip Serial Transformations

The skipped interface extends the many interface by adding the possibility to skip some of the serial FFTs.

pfft_plan pfft_plan_many_dft_skipped(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    const int *skip_trafos, pfft_complex *in, pfft_complex *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);

Hereby, skip_trafos is an int array of length rnk_pm + 1 (rnk_pm being the mesh dimension of the communicator comm_cart). For t = 0, ..., rnk_pm set skip_trafos[t] = 1 if the t-th serial transformation should be computed, otherwise set skip_trafos[t] = 0. Note that the local transpositions are always performed, since they are a prerequisite for the global communication to work. At the moment, it is only possible to skip the whole serial transform along the last rnk_n - rnk_pm - 1 dimensions. However, this behaviour can be realized by a call of a (rnk_pm + 1)-dimensional PFFT with

howmany' = howmany * n[rnk_pm+1] * ... * n[rnk_n-1]

and manual computation of the desired serial transforms along the last rnk_n - rnk_pm - 1 dimensions.

6 PFFT Reference

6.1 Files and Data Types

You must include the PFFT header file by

#include <pfft.h>

in the preamble of each source file that calls
n[1] x n[2] is done by the function

pfft_plan pfft_plan_dft_3d(
    const ptrdiff_t *n, pfft_complex *in, pfft_complex *out,
    MPI_Comm comm_cart_2d, int sign, unsigned pfft_flags);

We provide the addresses of the input and output arrays by the pointers in and out, respectively. An in-place transform is assumed if these pointers are equal. The integer sign gives the sign in the exponential of the FFT. Possible values are PFFT_FORWARD (-1) and PFFT_BACKWARD (+1). Flags passed to the planner via pfft_flags must coincide with the flags that were passed to pfft_local_size_dft_3d. Otherwise, the data layout of the parallel execution may not match the calculated local array sizes. As return value we get a PFFT plan, some structure that stores all the information needed to perform a parallel FFT. Once the plan is generated, we are allowed to fill the input array in. Note that, per default, the planning step pfft_plan_dft_3d will overwrite the input array in. Therefore, you should not write any sensitive data into in until the plan has been generated. For simplicity, our test program makes use of the library function

void pfft_init_input_complex_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_ni, const ptrdiff_t *local_i_start,
    pfft_complex *in);

to fill the input array with some numbers. Alternatively, one can fill the array with a function func of choice and the following loop that takes account of the parallel data layout:

ptrdiff_t m = 0;
for(ptrdiff_t k0 = 0; k0 <
of each output; typically this will be the name of the array. Communicator comm must be suitable to the block decomposition and is used to synchronize the outputs over all processes. Generalizations for the case where the dimensions of the local arrays are permuted are given by

void pfft_apr_real_permuted_3d(
    const double *data, const ptrdiff_t *local_n, const ptrdiff_t *local_start,
    int perm0, int perm1, int perm2, const char *name, MPI_Comm comm);
void pfft_apr_complex_permuted_3d(
    const pfft_complex *data, const ptrdiff_t *local_n, const ptrdiff_t *local_start,
    int perm0, int perm1, int perm2, const char *name, MPI_Comm comm);

Hereby, perm0, perm1 and perm2 give the array's permutation of dimensions.

6.8.6 Reading Command Line Arguments

The following function offers a simple way to read command line arguments into an array parameter:

void pfft_get_args(
    int argc, char **argv, const char *name, int neededArgs,
    unsigned type, void *parameter);

Hereby, argc and argv are the standard arguments of the main routine. Furthermore, name, neededArgs and type give the name, number of entries, and type of the command line argument. Supported types are PFFT_INT, PFFT_PTRDIFF_T, PFFT_FLOAT, PFFT_DOUBLE and PFFT_UNSIGNED, which denote the standard C type that is used for typecasting. In addition, you can use the special type PFFT_SWITCH, which is an integer type equal to one if the corresponding
Discard all the recorded run times with

void pfft_reset_gctimers(pfft_gcplan ths);

This function is called per default at the end of every ghost cell plan creation function. In order to access the run times directly, a new typedef pfft_gctimer is introduced. The following functions return a copy of the timer corresponding to ghost cell plan ths that accumulated the time for ghost cell exchange or ghost cell reduce, respectively:

pfft_gctimer pfft_get_gctimer_exg(const pfft_gcplan ths);
pfft_gctimer pfft_get_gctimer_red(const pfft_gcplan ths);

Note that the memory of the returned pfft_gctimer must be released with

void pfft_destroy_gctimer(pfft_gctimer ths);

as soon as the timer is not needed anymore. In the following we introduce some routines to perform basic operations on timers. For all functions with a pfft_gctimer return value, you must use pfft_destroy_gctimer in order to release the allocated memory of the timer. Create a copy of a ghost cell timer orig with

pfft_gctimer pfft_copy_gctimer(const pfft_gctimer orig);

Compute the average local time over all runs of a timer ths with

void pfft_average_gctimer(pfft_gctimer ths);

Create a new timer that contains the sum of two timers sum1 and sum2 with

pfft_gctimer pfft_add_gctimers(const pfft_gctimer sum1, const pfft_gctimer sum2);

Create a timer that contains the maximum times of all the timers ths from all processes belon
must be at least one dimension smaller.
- PFFT offers portable performance; e.g., it will perform well on most platforms.
- The application of PFFT is split into a time consuming planning step and a high performance execution step.
- Installing the library is easy. It is based on the common sequence of configure, make and make install.
- The interface of PFFT is very close to the MPI interface of FFTW. In fact, we tried to add as few extra parameters as possible.
- PFFT is written in C but also offers a Fortran interface; see Section 2.13.
- FFTW includes shared memory parallelism for all serial transforms. This enables us to benefit from hybrid parallelism to a certain amount; see Section
- All steps of our parallel FFT can be performed completely in place. This is especially remarkable for the global transposition routines.
- Conforming to good MPI programming practice, all PFFT transforms can be performed on user defined communicators. In other words, PFFT does not force the user to work with MPI_COMM_WORLD.
- PFFT uses the same algorithm to compute the size of the local array blocks as FFTW. This implies that the FFT size need not be divisible by the number of processes.
- PFFT supports single, double, and long double precision.
- PFFT supports the new-array execute feature, i.e., a PFFT plan can be planned and executed on different arrays, up to some restrictions; see Section
Thanks to Yu
context of PFFT. You can install single and long double precision versions of PFFT, which replace double with float and long double, respectively. To use these interfaces you must:
- Link to the single/long double libraries; on Unix, -lpfftf or -lpfftl instead of (or in addition to) -lpfft. You can link to the different precision libraries simultaneously.
- Include the same pfft.h header file.
- Replace all lowercase instances of pfft_ with pfftf_ or pfftl_ for single or long double precision, respectively. (pfft_complex becomes pfftf_complex, pfft_execute becomes pfftf_execute, et cetera.)
- Uppercase names, i.e., names beginning with PFFT_, remain the same.
- Replace double with float or long double for subroutine parameters.

2.12 Ghost cell communication

2.13 Fortran interface

3 Installation and linking

The install of PFFT is based on the Autotools and follows the typical workflow:

./configure
make
make install

3.1 Install of the latest official FFTW release

PFFT depends on release 3.3.3 of the FFTW library [11]. For the sake of completeness, we show the command line based install procedure in the following. However, note that we provide install scripts on www.tu-chemnitz.de/~mpip/software.php that simplify the install a lot. We highly recommend using these install scripts, since they additionally apply several performance patches and bugfixes that have been subm
double pfft_check_output_complex_hermitian_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    const pfft_complex *data, MPI_Comm comm);
double pfft_check_output_complex_hermitian(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *local_n,
    const ptrdiff_t *local_start, const pfft_complex *data, MPI_Comm comm);

compute the norm of the difference between the elements of array data and the values produced by pfft_init_input_complex_hermitian_3d and pfft_init_input_complex_hermitian. In addition, we supply the following functions for setting all the input data to zero at once:

void pfft_clear_input_complex_hermitian_3d(
    const ptrdiff_t *n, const ptrdiff_t *local_n, const ptrdiff_t *local_n_start,
    pfft_complex *data);
void pfft_clear_input_complex_hermitian(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *local_n,
    const ptrdiff_t *local_start, pfft_complex *data);

Note that these functions can also be used in order to generate complex inputs with radial Hermitian symmetry for ordinary c2c transforms. Of course, the results of such a c2c DFT will have all imaginary parts equal to zero (up to machine precision).

6.8.4 Operations on Arrays of Type ptrdiff_t

The following routines are shortcuts for the elementwise manipulation of ptrdiff_t valued arrays. In the following, all arrays vec, vec1 and vec2 are of length d and type ptrdiff_t:

ptrdiff_t pfft_prod_INT(int d, const ptr
planner flags. The flag PFFT_GC_PADDED_R2C (or its equivalent PFFT_GC_PADDED_C2R) implies an ordinary real valued ghost cell communication on an array of size n[0] x ... x n[rnk_n-2] x 2(n[rnk_n-1]/2+1). Especially, the padding elements will be handled as normal data points, i.e., you must be aware that the numbers of ghost cells gc_below[rnk_n-1] and gc_above[rnk_n-1] include the number of padding elements.

6.7.6 Inofficial Flags

6.7.7 Ghost Cell Execution Timer

PFFT ghost cell plans automatically accumulate the local run times of every call to pfft_exchange and pfft_reduce. For most applications it is sufficient to print the run time of a plan ths averaged over all runs with

void pfft_print_average_gctimer(const pfft_gcplan ths, MPI_Comm comm);

Note that for each timer the maximum time over all processes is reduced to rank 0 of communicator comm (i.e., a call to MPI_Reduce is performed) and the output is only printed on this process. The following function works in the same way but prints more verbose output:

void pfft_print_average_gctimer_adv(const pfft_gcplan ths, MPI_Comm comm);

To write the averaged run time of a ghost cell plan ths into a file called name, use

void pfft_write_average_gctimer(const pfft_gcplan ths, const char *name, MPI_Comm comm);
void pfft_write_average_gctimer_adv(const pfft_gcplan ths, const char *name, MPI_Comm comm);

Again, the output is only written on rank 0 of communicator
interface layer. It is suitable for the planning of three-dimensional FFTs.

ptrdiff_t pfft_local_size_dft_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_dft_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
pfft_plan pfft_plan_dft_3d(
    const ptrdiff_t *n, pfft_complex *in, pfft_complex *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);

Hereby, n, local_ni, local_i_start, local_no and local_o_start are ptrdiff_t arrays of length 3. The basic interface generalizes the 3d interface to FFTs of arbitrary dimension rnk_n.

ptrdiff_t pfft_local_size_dft(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_dft(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
pfft_plan pfft_plan_dft(
    int rnk_n, const ptrdiff_t *n, pfft_complex *in,
complex array that needs to be allocated by every process. In most cases this coincides with the product of the local array sizes, but may be bigger whenever the parallel algorithm needs some extra storage. The input value n gives the three-dimensional FFT size and the flag pfft_flags serves to adjust some details of the parallel execution. For the sake of simplicity, we restrict ourselves to the case pfft_flags = PFFT_TRANSPOSED_NONE for a while and explain the more sophisticated flags at a later point. The output arrays local_ni and local_i_start give the size and the offset of the local input array that result from the parallel block distribution of the global input array, i.e., every process owns the input data in[k[0], k[1], k[2]] with local_i_start[t] <= k[t] < local_i_start[t] + local_ni[t] for t = 0, 1, 2. Analogously, the output parameters local_no and local_o_start contain the size and the offset of the local output array. Afterward, the input and output arrays must be allocated. Hereby,

pfft_complex* pfft_alloc_complex(size_t size);

is a simple wrapper of fftw_alloc_complex, which in turn allocates the memory via fftw_malloc to ensure proper alignment for SIMD. Have a look at the FFTW user manual [9] for more details on SIMD memory alignment and fftw_malloc. Nevertheless, you can also use any other dynamic memory allocation. The planning of a single three-dimensional parallel FFT of size n[0] x
computation.

Algorithm 4.2 (Shifted FFT without explicit data movement). For given shifts k_s and l_s:
1. For k = 0, ..., n-1 compute f~_k := f_k e^(-2 pi i k l_s / n).
2. For l = 0, ..., n-1 compute g^_l := sum_{k=0}^{n-1} f~_k e^(-2 pi i k l / n) using PFFT.
3. For l = 0, ..., n-1 compute g_l := g^_l e^(-2 pi i k_s (l + l_s) / n).

The special case k_s = l_s = -n/2 corresponds to the fftshift of both arrays:
1. For k = 0, ..., n-1 compute f~_k := f_k e^(+pi i k) = f_k (-1)^k.
2. For l = 0, ..., n-1 compute g^_l := sum_{k=0}^{n-1} f~_k e^(-2 pi i k l / n) using PFFT.
3. For l = 0, ..., n-1 compute g_l := g^_l e^(+pi i (l - n/2)).

4.2 Parallel pruned FFT

Within PFFT we define a pruned FFT as

g_l = sum_{k=0}^{ni-1} f_k e^(-2 pi i k l / n),   l = 0, ..., no-1.

Formally, this is equivalent to the following regular size-n FFT

g^_l = sum_{k=0}^{n-1} f^_k e^(-2 pi i k l / n)

with f^_k = f_k for k = 0, ..., ni-1, f^_k = 0 for k = ni, ..., n-1, and g_l = g^_l for l = 0, ..., no-1. I.e., we add n - ni zeros at the end of the input array and throw away n - no entries at the end of the output array. The definition of the pruned FFT changes for PFFT_SHIFTED_IN and PFFT_SHIFTED_OUT.

5 Interface Layers of the PFFT Library

We give a quick overview of the PFFT interface layers in the order of increasing flexibility at the example of c2c FFTs. For r2c, c2r and r2r FFTs, similar interface layer specifications apply. A full reference list of all PFFT functions is given in Chapter 6.

5.1 Basic Interface

The _3d interface is the simplest
rounded down. This implies that the default blocks of the 3d decomposition must be n[0]/(P0 Q0), n[1]/(P1 Q1) and n[2]/(Q0 Q1), all divisions rounded down.

2.7 Planning effort

Pass one of the following flags
- PFFT_ESTIMATE,
- PFFT_MEASURE,
- PFFT_PATIENT, or
- PFFT_EXHAUSTIVE
to the PFFT planner in order to plan all internal FFTW plans with FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT, or FFTW_EXHAUSTIVE, respectively. The default value is PFFT_MEASURE. PFFT uses FFTW plans for the parallel array transpositions and the serial transforms. In fact, every serial transform is a combination of strided lower-dimensional FFTs and a serial array transposition (necessary to prepare the global transposition), which can be done by a single FFTW plan. However, it turns out that FFTW sometimes performs better if the serial transposition and the strided FFTs are executed separately. Therefore, PFFT introduces the flag PFFT_TUNE that enables extensive run time tests in order to find the optimal sequence of serial strided FFT and serial transposition for every serial transform. These tests are disabled by default, which corresponds to the flag PFFT_NO_TUNE.

2.8 Preserving input data

The following flags
- PFFT_PRESERVE_INPUT,
- PFFT_DESTROY_INPUT, and
- PFFT_BUFFERED_INPLACE
only take effect for out-of-place transforms. The first one behaves analogously to the FFTW
by gc_below and gc_above, as described in Section 6.7. The flags gc_flags must be set appropriately to the flags that were passed to the FFT planner. Table 6.1 shows the ghost cell planner flags that must be set in dependence on the listed FFT planner flags.

FFT flag                 ghost cell flag
PFFT_TRANSPOSED_NONE     PFFT_GC_TRANSPOSED_NONE
PFFT_TRANSPOSED_IN       PFFT_GC_TRANSPOSED
PFFT_TRANSPOSED_OUT      PFFT_GC_TRANSPOSED

Table 6.1: Mapping of FFT and complex valued ghost cell planner flags.

In addition, we introduce the flag PFFT_GC_R2C and its equivalent PFFT_GC_C2R to handle the complex array storage format of r2c and c2r transforms. In fact, these two flags imply an ordinary complex valued ghost cell communication on an array of size n[0] x ... x n[rnk_n-2] x (n[rnk_n-1]/2+1). Please note that we wrongly assume periodic boundary conditions in this case. Therefore, you should ignore the data elements with the last index beyond n[rnk_n-1]/2.

6.7.5 Plan Creation for Real Data

The following functions create ghost cell plans that operate on real valued arrays, i.e.,
- r2r inputs,
- r2r outputs,
- r2c inputs, and
- c2r outputs.
Corresponding to the three interface layers for FFT planning, there are the following three layers for creating a real valued ghost cell plan:

pfft_gcplan pfft_plan_rgc_3d(
    const ptrdiff_t *n, const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
is counted in units of real numbers.
- local_o_start is counted in units of real numbers.

The corresponding local_block functions compute the local data distribution of the process with MPI rank pid:

void pfft_local_block_r2r_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_r2r(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_many_r2r(
    int rnk_n, const ptrdiff_t *ni, const ptrdiff_t *no,
    const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);

6.5 Plan Creation

6.5.1 Complex-to-Complex FFT

pfft_plan pfft_plan_dft_3d(
    const ptrdiff_t *n, pfft_complex *in, pfft_complex *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);
pfft_plan pfft_plan_dft(
    int rnk_n, const ptrdiff_t *n, pfft_complex *in, pfft_complex *out,
    MPI_Comm comm_cart, int sign, unsigned pfft_flags);
pfft_plan pfft_plan_many_dft(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany,
used in the flags. The arrays need not be initialized, but they must be allocated.
- The array kinds of length rnk_n specifies the kind of r2r transform that is computed in the corresponding dimensions. Just like FFTW-MPI, we compute the separable product formed by taking each transform kind along the corresponding dimension, one dimension after another.

6.6 FFT Execution Timer

PFFT offers an easy way to perform run time measurements and print/write the results.

6.6.1 Basic Run Time Measurements

PFFT plans automatically accumulate the local run times of every call to pfft_execute. For most applications it is sufficient to print the run time of a plan ths averaged over all runs with

void pfft_print_average_timer(const pfft_plan ths, MPI_Comm comm);

Note that for each timer the maximum time over all processes is reduced to rank 0 of communicator comm (i.e., a call to MPI_Reduce is performed) and the output is only printed on this process. The following function works in the same way but prints more verbose output:

void pfft_print_average_timer_adv(const pfft_plan ths, MPI_Comm comm);

To write the averaged run time of plan ths into a file called name, use

void pfft_write_average_timer(const pfft_plan ths, const char *name, MPI_Comm comm);
void pfft_write_average_timer_adv(const pfft_plan ths, const char *name, MPI_Comm comm);

Again, the output is only written on rank 0 of communicator comm. Discard all the re
used in Listing 2.1 is the forwarded error code of MPI_Cart_create. It is equal to zero if the communicator was created successfully. The most common error is that the number of processes within the input communicator comm does not fit np0 x np1. In this case, the Cartesian communicator is not generated and the return value is unequal to zero. Therefore, a typical sanity check might look like

/* create two-dimensional process grid of size np[0] x np[1], if possible */
if( pfft_create_procmesh_2d(MPI_COMM_WORLD, np[0], np[1], &comm_cart_2d) ){
  pfft_fprintf(MPI_COMM_WORLD, stderr,
      "Error: This test file only works with %d processes.\n", np[0]*np[1]);
  MPI_Finalize();
  return 1;
}

Hereby, we use the PFFT library function

void pfft_fprintf(MPI_Comm comm, FILE *stream, const char *format, ...);

to print the error message. This function is similar to the standard C function fprintf, with the exception that only the process with MPI rank 0 within the given communicator comm will produce some output; see Section 6.8.7 for details.

2.4 In-place transforms

Similar to FFTW, PFFT is able to compute parallel FFTs completely in place, which means that beside some constant buffers no second data array is necessary. Especially, the global data communication can be performed in place. As far as we know, there is no other parallel FFT library beside FFTW and PFFT that supports this feature. This feature
transforms.

6.7.4 Plan Creation for Complex Data

The following functions create ghost cell plans that operate on complex valued arrays, i.e.,
- c2c inputs,
- c2c outputs,
- r2c outputs (use flag PFFT_GC_C2R), and
- c2r inputs (use flag PFFT_GC_R2C).
Corresponding to the three interface layers for FFT planning, there are the following three layers for creating a complex valued ghost cell plan:

pfft_gcplan pfft_plan_cgc_3d(
    const ptrdiff_t *n, const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
    pfft_complex *data, MPI_Comm comm_cart, unsigned gc_flags);
pfft_gcplan pfft_plan_cgc(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *gc_below,
    const ptrdiff_t *gc_above, pfft_complex *data,
    MPI_Comm comm_cart, unsigned gc_flags);
pfft_gcplan pfft_plan_many_cgc(
    int rnk_n, const ptrdiff_t *n, ptrdiff_t howmany, const ptrdiff_t *block,
    const ptrdiff_t *gc_below, const ptrdiff_t *gc_above,
    pfft_complex *data, MPI_Comm comm_cart, unsigned gc_flags);

Hereby, rnk_n, n, howmany and comm_cart must be the variables that were used for the PFFT plan creation. However, only the case rnk_n = 3 is completely implemented at the moment. Remember that n is the logical FFT size, just as it is the case for FFT planning. The block size block must be equal to iblock or oblock, depending on whether the ghost cell plan works on the FFT input or output array. Analogously, data becomes in or out. Set the number of ghost cells
    ptrdiff_t *local_o_start);
void pfft_local_block_dft(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
void pfft_local_block_many_dft(
    int rnk_n, const ptrdiff_t *ni, const ptrdiff_t *no,
    const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, int pid, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);

6.4.2 Real-to-Complex FFT

ptrdiff_t pfft_local_size_dft_r2c_3d(
    const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t pfft_local_size_dft_r2c(
    int rnk_n, const ptrdiff_t *n, MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
    ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t pfft_local_size_many_dft_r2c(
    int rnk_n, const ptrdiff_t *n, const ptrdiff_t *ni, const ptrdiff_t *no,
    ptrdiff_t howmany, const ptrdiff_t *iblock, const ptrdiff_t *oblock,
    MPI_Comm comm_cart, unsigned pfft_flags,
    ptrdiff_t *local_ni, ptrdiff_t *local_i_start, ptrdiff_t
bottleneck on large scale parallel computers. For example, a three-dimensional FFT of size 1024^3 can be computed with at most 1024 MPI processes. In contrast, using a two-dimensional data decomposition would increase the maximum number of MPI processes to 1024^2 in this case. The main goal of PFFT is to extend the MPI part of the FFTW software library to multi-dimensional data decompositions, i.e., d-dimensional FFTs of size N^d can be computed in parallel with at most N^(d-1) MPI processes. In addition, PFFT offers several extra features that are particularly useful for parallel distributed memory FFTs but are not yet present in FFTW-MPI. We refer to the publication [20] for a closer look on the different data decompositions and the underlying algorithms of the PFFT library. The interface of PFFT is as close as possible to the FFTW-MPI interface. In fact, we consider every difference between PFFT and FFTW that is not explicitly mentioned within this manual as a bug that should be reported to https://github.com/mpip/pfft.git. Therefore, porting code that uses FFTW-MPI to PFFT is almost trivial, e.g., see Section 2.2. Most features of PFFT are inherited from FFTW or similarly implemented. These include the following:
- We employ the fast O(N log N) algorithms of FFTW to compute arbitrary-size discrete Fourier transforms of complex data, real data, and even- or odd-symmetric real data.
- The dimension of the FFT can be arbitrary. However, the parallel data decomposition
is enabled as soon as the pointer to the output array out is equal to the pointer to the input array in. E.g., in Listing 2.1 we would call

/* Plan parallel forward FFT */
plan = pfft_plan_dft_3d(n, in, in, comm_cart_2d,
    PFFT_FORWARD, PFFT_TRANSPOSED_NONE);

2.5 Higher dimensional data decomposition

The test program given in Listing 2.1 used a two-dimensional data decomposition of a three-dimensional data set. Moreover, PFFT supports the computation of any d-dimensional FFT with r-dimensional data decomposition as long as r <= d-1. For example, one can use a one-dimensional data decomposition for any two- or higher-dimensional data set, while the data set must be at least four-dimensional to fit to a three-dimensional data decomposition. The case r = d is not supported efficiently, since during the parallel computations there is always at least one dimension that remains local, i.e., one dimension stays non-decomposed. The only exception from this rule is the case d = r = 3, which is supported by PFFT in a special way; see Section 2.6.2 for details. The dimensionality of the data decomposition is given by the dimension of the Cartesian communicator that goes into the PFFT planning routines. Therefore, we present a generalization of the communicator creation function:

int pfft_create_procmesh(
    int rnk_np, MPI_Comm comm, const int *np, MPI_Comm *comm_cart);

Hereby, the array np of length rnk_np gives the size of the Cartesian
*argv)
{
  const ptrdiff_t n0 = 4, n1 = 4, n2 = 4;
  fftw_plan plan;
  fftw_complex *data;
  ptrdiff_t alloc_local, local_n0, local_0_start;

  MPI_Init(&argc, &argv);
  fftw_mpi_init();

  /* get local data size and allocate */
  alloc_local = fftw_mpi_local_size_3d(n0, n1, n2, MPI_COMM_WORLD,
      &local_n0, &local_0_start);
  data = fftw_alloc_complex(alloc_local);

  /* create plan for in-place forward DFT */
  plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
      FFTW_FORWARD, FFTW_ESTIMATE);

  /* initialize data to some function my_function(x,y,z) */
  for(ptrdiff_t i = 0; i < local_n0; ++i)
    for(ptrdiff_t j = 0; j < n1; ++j)
      for(ptrdiff_t k = 0; k < n2; ++k)
        data[(i*n1 + j)*n2 + k] = my_function(local_0_start + i, j, k);

  ptrdiff_t local_ni[3] = {local_n0, n1, n2},
            local_i_start[3] = {local_0_start, 0, 0};
  pfft_apr_complex_3d(data, local_ni, local_i_start, "Input", MPI_COMM_WORLD);

  /* compute transforms in place as many times as desired */
  fftw_execute(plan);

  ptrdiff_t local_no[3] = {local_n0, n1, n2},
            local_o_start[3] = {local_0_start, 0, 0};
  pfft_apr_complex_3d(data, local_no, local_o_start, "Output", MPI_COMM_WORLD);

  fftw_destroy_plan(plan);
  MPI_Finalize();
}

  pfft_plan plan;
  pfft_complex *data;
  ptrdiff_t alloc_local, local_ni[3], local_i_start[3],
            local_no[3], local_o_start[3];
  unsigned pfft_flags = 0;

  MPI_Init(&argc, &argv);
  pfft_init();

  /* get local
by setting the include flags appropriately. You must also link to the PFFT, FFTW and FFTW-MPI libraries. On Unix, this means adding

-lpfft -lfftw3_mpi -lfftw3 -lm

at the end of the link command. For example, to build pfft_test.c use the following compiler invocation:

mpicc pfft_test.c -I$PFFTINC -I$FFTWINC -L$PFFTLIB -L$FFTWLIB \
    -lpfft -lfftw3_mpi -lfftw3 -lm

Substitute mpicc by any other MPI C compiler if you like. $PFFTINC, $FFTWINC, $PFFTLIB and $FFTWLIB denote the PFFT and FFTW include and library paths, respectively. If you use the install scripts mentioned in Sect. 3.2, these paths will be

PFFTINC=$HOME/local/pfft-1.0.7-alpha/include
FFTWINC=$HOME/local/fftw-3.3.3/include
PFFTLIB=$HOME/local/pfft-1.0.7-alpha/lib
FFTWLIB=$HOME/local/fftw-3.3.3/lib

4 Advanced Features

4.1 How to Deal with FFT Index Shifts in Parallel

Let n be even. A common problem is that the indices of the FFT input and/or output array run between -n/2, ..., n/2-1, but the FFT library requires them to run between 0, ..., n-1. With serial program execution one can easily remap the input data in a way that is suitable for the library, i.e.,

f~_k := f_((k - n/2) mod n),   k = 0, ..., n-1.

Similarly, one could remap the outputs g^_l, l = 0, ..., n-1, of the library in the opposite direction in order to get the required outputs, i.e.,

g_l := g^_((l + n/2) mod n),   l = -n/2, ..., n/2-1.

These shifts are also known as fftshift in M