Home
        GF-Complete: A Comprehensive Open Source Library for
         Contents
1.       PGM13b   The Paper describes all of those techniques as well     If You Would Like Help With the Software    Please contact the first author of this manual     CONTENTS    Three Simple Command Line Tools  gf_mult  gf divandgfadd                      Quick Starting Example  1  Simple multiplication and division        2    2 20 0      000   Quick Starting Example  2  Multiplying a region by aconstant                 000   Quick Starting Example  3  Using w   64            0 00 00 00000000022 eee  Quick Starting Example  4  Using w   128             2 00    0 0 000000  00000     6 1 1 Changing the Components of a Galois Field with create_gffrom_argv                 6 1 3 Changing the Multiplication Technique      2       0 0 2 0    2 0    redre  6 1 4 Changing the Division Technique                  0    0000 02 00 000   6 1 5 Changing the Region Technique  lt s p posadia 0 2 0 0    0000000000 0000   Determining Supported Techniques with gf  methods                    00000005    Inlining Single Multiplication and Division for Speed                      0000   Using different techniques for single and region multiplication                             CARRY_FREE    and the Primitive Polynomial               0    0000000000     7 8 1 Primitive Polynomials that are not Primitive    2    2    2 0    e  7 8 2 Default Polynomials for Composite Fields    2    2    ee  7 8 3 The Program gf_poly for Verifying Irreducibility of Polynomials                     Contents  1 Introducti
2.     m TABLE        For w      32  64  128   all    SPLIT    implementations are unsafe for region_multiply    This means that if the  default uses    SPLIT     see Table 1 for when that occurs   then region_multiply   is not thread safe     The    COMPOSITE    operations are only safe if the implementations of the underlying fields are safe     9 Listing of Procedures    The following is an alphabetical listing of the procedures  data types and global variables for users to employ in  GF complete     GF_W16_INLINE_DIV   in gf complete h  This is a macro for inline division when w   16  See section 7 1     GF_W16INLINE_MULTQ   in gf complete h  This is a macro for inline multiplication when w   16  See  section 7 1     GF_W4_INLINE_MULTDIV     in gf_complete h  This is a macro for inline multiplication division when w    4  See section 7 1     GF_W8_INLINE_MULTDIV   in gf_complete h  This is a macro for inline multiplication division when w    8  See section 7 1     MOA Fill_Random_Region   in gf_rand h  Fills a region with random numbers   MOA Random 128   in gf_rand h  Creates a random 128 bit number    MOA Random 32   in gf_rand h  Creates a random 32 bit number   MOA_Random_64   in gf_rand h  Creates a random 64 bit number     MOA_Random_W   in gf_rand h  Creates a random w bit number  where w  lt  32        MOA Seed   in gf_rand h  Sets the seed for the random number generator    _gf_errno in gf_complete h  This is to help figure out why an initialization call failed  See
3.    3 925 MB s                   7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 28       multiply   multiply_region    Full white is 25 5 Mega ops   second Full white is 720 MB   second                                              234567 8 910111213141516                Figure 3  The performance of multiply   and multiply_region   using    GROUP     and varying the arguments gs  and gr  All graphs are heat maps with black equaling zero  The region size is 100KB     There is support for performing multiply   inline for the    TABLE    implementations for w      4 8  and for the     LOG    implementation for w   16  see section 7 1   These are leveraged by multiply   in    COMPOSITE     and  by multiply_region   if you are not using    ALTMAP     To demonstrate this  in the table below  you can see that the  performance of multiply   with    SPLIT 8 4    is 88 percent as fast than the default in w   8  which is    TABLE       When you use each as a base field for    COMPOSITE    with w   16  the one with    SPLIT 8 4    is now just 37 percent  as fast  The difference is the inlining of multiplication in the base field when    TABLE    is employed     gftime 8 M 0 1048576 100   Speed   501 Mega ops s  gftime 8 M 0 1048576 100  m SPLIT 8 4   Speed   439 Mega ops s       gftime 8 M 0 1048576 100  m COMPOSITE 2     Speed   207 Mega ops s    gf_time 8 M 0 1048576 100  m COMPOSITE 2  m SPLIT 8 4       Speed  77 Mega ops s                You can keep making recursive definitions of co
4.    Only for GF  24       SPLIT    Only when f ae argument equals 4        SPLIT    SSE4   When w   64 and not using    ALTMAP         BYTWO_p         BYTWO_b         GROUP    Only for GF 2  8         Table 2  Multiplication techniques which can leverage SSE instructions when they are available     e    NOSSE     Force non SSE version     e    DOUBLE     Use a table that is indexed on two words rather than one  This applies only to w   4  where  the table is indexed on bytes rather than 4 bit quantities  and to w   8  where the table is indexed on shorts    6 THE DEFAULTS 20    rather than bytes  In each case  the table lookup performs two multiplications at a time  which makes region  multiplication faster  It doubles the size of the lookup table     e    QUAD     Use a table that is indexed on four words rather than two or one  This only applies to w   4  where  the table is indexed on shorts  The    Quad    table may be lazily created or created ahead of time  the default   If  the latter  then it consumes 24 x 216 x 2   2 MB of memory     e    LAZY     Typically it   s clear whether tables are constructed upon initialization or lazily when a region operation  is performed  There are two times where it is ambiguous     QUAD    when w   4 and    DOUBLE    when w   8   If you don   t specify anything  these tables are created upon initialization  consuming a lot of memory  If you  specify    LAZY     then the necessary row of the table is created lazily when you call       mu
5.   1061 497   m LOG   503 310   m SHIFT   157 749   m CARRY _FREE   86 202       Table 3  Speed of various calls to multiply region   for w   4     References     Anv09  H  P  Anvin  The mathematics of RAID 6  http   kernel org pub linux kernel people hpa   raid6 pdf  2009      BKK 95  J  Blomer  M  Kalfane  M  Karpinski  R  Karp  M  Luby  and D  Zuckerman  An XOR based erasure   resilient coding scheme  Technical Report TR 95 048  International Computer Science Institute  August  1995      GMS08  K  Greenan  E  Miller  and T  J  Schwartz  Optimizing Galois Field arithmetic for diverse processor  architectures and applications  In MASCOTS 2008  16th IEEE Symposium on Modeling  Analysis and  Simulation of Computer and Telecommunication Systems  Baltimore  MD  September 2008      GP97  S  Gao and D  Panario  Tests and constructions of irreducible polynomials over finite fields  In Founda   tions of Computational Mathematics  pages 346 361  Springer Verlag  1997     REFERENCES 43     LBOX12      LD00      LHy08      LSXP13      Mar94      PGM13a      PGM13b      Pla97     Speed  MBI      m SPLIT 8 4  Default    13279 146   m COMPOSITE 2    r ALTMAP   5516 588   m TABLE  r CAUCHY   4968 721   m BYTWO_b   2656 463   m TABLE  r DOUBLE   2561 225   m TABLE   1408 577   m BYTWO_b  r NOSSE   1382 409   m BYTWO_p   1376 661   m LOG_ZERO_EXT   1175 739   m LOG_ZERO   1174 694   m LOG   997 838   m SPLIT 8 4  r NOSSE   885 897   m BYTWO_p  r NOSSE   589 520   m COMPOSITE 2     327 039   m
6.   September 1997     REFERENCES 44    Speed  MBI      m SPLIT 16 4  r ALTMAP   10460 834   m SPLIT 16 4  r SSE  Default    8473 793   m COMPOSITE 2    r ALTMAP   5215 073   m LOG  r CAUCHY   2428 824   m TABLE   2319 129   m SPLIT 16 8   2164 111   m SPLIT 8 8   2163 993   m SPLIT 16 4  r NOSSE   1148 810   m LOG   1019 896   m LOG ZERO   1016 814   m BYTWO b   738 879   m COMPOSITE 2     596 819   m BYTWO_p   560 972   m GROUP 44   450 815   m BYTWO_b  r NOSSE   332 967   m BYTWO_p  r NOSSE   249 849   m CARRY FREE   111 582   m SHIFT   95 813       Table 5  Speed of various calls to multiply_region   for w   16      PSR12  J  S  Plank  C  D  Schuman  and B  D  Robison  Heuristics for optimizing matrix based erasure codes for  fault tolerant storage systems  In DSN 2012  The International Conference on Dependable Systems and  Networks  Boston  MA  June 2012  IEEE      Rab89  M  O  Rabin  Efficient dispersal of information for security  load balancing  and fault tolerance  Journal  of the Association for Computing Machinery  36 2  335    348  April 1989     REFERENCES    Speed  MBA      m SPLIT 32 4  r SSE  r ALTMAP   7185 440   m SPLIT 32 4  Default  5063 966   m COMPOSITE 2  m SPLIT 16 4  r ALTMAP    r ALTMAP   4176 440   m COMPOSITE 2    r ALTMAP   3360 860   m SPLIT 8 8   1345 678   m SPLIT 32 8   1340 656   m SPLIT 32 16   1262 676   m SPLIT 8 8  r CAUCHY   1143 263   m SPLIT 32 4  r NOSSE   480 859   m CARRY _FREE  p 0xc5   393 185   m COMPOSITE 2     332 964   m BYTWO_
7.   This takes the inverse of a gf_general_t   gf_general_is_one   in gf_general h  This tests whether a gf_general_t is one   gf_general_is_two   in gf_general h  This tests whether a gf_general_t is two   gf_general_is_zero   in gf_general h  This tests whether a gf_general _t is zero     gf_general_multiply   in gf_general h  This multiplies two gf_general_t   s  See the implementation of gfmult c  for an example     gf_general_s_to_val   in gf_general h  This converts a string to a gf_general_t  See the implementation of  gf_mult c for an example     gf_general_set_one   in gf_general h  This sets a gf_general_t to one   gf_general_set_random   in gf_general h  This sets a gf_general_t to a random number   gf_general_set_two   in gf_general h  This sets a gf_general_t to two   gf_general_set_up_single_timing_test   in gf_general h  Used in gf_time c   gf_general_set_zero   in gf_general h  This sets a gf_general_t to zero     gf_general_t in gf_general h  This is a general data type for all values of w  See the implementation of gfmult c  for examples of using these     gf_general_val_to_s   in gf_general h  This converts a gf_general_t to a string  See the implementation of  gf_mult c for an example     gf_init_easy   in gf_complete h  This is how you initialize a default gf_t  See 4 2 through 4 5 for examples of  calling gf _init_easy       LISTING OF PROCEDURES 38    gf_init_hard   in gf_complete h  This allows you to initialize a gf_t without using the defaults  See 6 4
8.   We   ll give you one example of calling gf_init_hard    Suppose you want to make a gf_init_hard   call to be  equivalent to     m SPLIT 16 4  r SSE  r ALTMAP      and you want to allocate the scratch space yourself  Then you   d  do the following     gft gf   void x scratch   int size     7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 24             size   gf_scratch_size 16  GF_MULT_SPLIT_TABLE   GF_REGION_SSE   GF_REGION_ALTMAP   GF_DIVIDE_DEFAULT                             16  4    if  size    0    gf_error    exit 1        It failed  That shouldn   t happen x     scratch    void    malloc size    if  scratch    NULL    perror   malloc    exit 1           if   gf_init_hard  amp gf  16  GF_MULT_SPLIT_TABLE   GF_REGION_SSE   GF_REGION_ALTMAP   GF_DIVIDE_DEFAULT   0  16  4  NULL  seratch                                   gf_error     exit 1         6 5 gf_size      You can call gf_size gft  gf  to learn the memory consumption of the gf_t  It returns all memory consumed by the  gf_t  including the gf_t itself  any scratch memory required by the gf_t  and the memory consumed by the sub field if  the field is    COMPOSITE     If you provided your own memory to gf_init_hard Q   it does not report the size of this  memory  but what the size should be  as determined by gf_scratch_size       7 Further Information on Options and Algorithms    7 1 Inlining Single Multiplication and Division for Speed    Obviously  procedure calls are more expensive than single instructions  and th
9.   We  recommend calling create_gf_from_argv   when you can  instead of gf_init_hard       gf mult_type_t in gf complete h  the different ways to specify multiplication when using gf_init_hard    See  section 6 4     gf _region_type_t in gf_complete h  the different ways to specify region multiplication when using gf_init_hard     See section 6 4     gf_region in gf_complete h  This is the data type of multiply_region   in a gf_t  See section 4 3 for an example  of how to use multiply_region       gf_scratch_size   in gf_complete h  This is how you calculate how much memory a gf_t needs  See section 6 4   gf_size   in  gf complete h  Returns the memory consumption of a gf_t  See section 6 5     gf_val_128_t in gf_complete h  This is how you store a value where w  lt  128  It is a pointer to two 64 bit  unsigned integers  See section 4 4     gf_val_32_t in gf_complete h  This is how you store a value where w  lt  32  It is equivalent to a 32 bit unsigned  integer  See section 4 2     gf_val_64_t in gf_complete h  This is how you store a value where w  lt  64  It is equivalent to a 64 bit unsigned  integer  See section 4 5     gf_w16_get_div_alog_table   in gf_complete h  This returns a pointer to an inverse logarithm table that can be  used for inlining division when w   16  See section 7 1     gf_w16_get_log_table   in gf_complete h  This returns a pointer to a logarithm table that can be used for inlining  when w   16  See section 7 1     gf_w16_get_mult_alog_table   in gf_co
10.   t let this error discourage you     7 8 2 Default Polynomials for Composite Fields  GF Complete will successfully select a default polynomial in the following composite fields   e w   8 and the default polynomial  0x13  is employed for GF  24    e w   16 and the default polynomial  0x11d  is employed for GF  28    e w   32 and the default polynomial  0x1100b  is employed for GF  218    e w   32 and 0x1002d is employed for GF  216    e w   32 and the base field for GF  w16  is a composite field that uses a default polynomial   e w   64 and the default polynomial  0x100400007  is employed for GF  2      e w   64 and 0x1000000c5 is employed for GF  2      e w   64 and the base field for GF  w     is a composite field that uses a default polynomial   e w   128 and the default polynomial  0x1b  is employed for GF  264      e w   128 and the base field for GF  w64  is a composite field that uses a default polynomial     7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 31    7 8 3 The Program gf_poly for Verifying Irreducibility of Polynomials    The program gf_poly uses the Ben Or algorithm  GP97  to determine whether a polynomial with coefficients in GF  2      is reducible  Its syntax is        gf_poly w method power coef power coef        You can use it to test for irreducible polynomials with binary coefficients by specifying w   1  For example  from  the discussion above  we know that zt   x   1 and zt   x3   x     1 are both irreducible  but zt   1 is reducible   gf_poly con
11.  23  Oxldbc   0x1234   Ox7f5f  Word 9  Ox3a3f   0x1234   0x49b3 Word 24  Oxl3le   0x1234   0x8974  Word 10  0xl8ea x 0x1234   Oxfb99 Word 25  0x47e0   0x1234   0x05el  Word 11  Oxc5b3 x 0x1234   0xb216 Word 26  Oxclla   0x1234   Oxcff3  Word 12  0x9753   0x1234   Oxeb3e Word 27  Ox7f07   0x1234   Oxa09c  Word 13  Oxld8a   0x1234   0x463a Word 28  0x76e0   0x1234   Oxde3c  Word 14  Oxa7ff   0x1234   0x01fb Word 29  Oxfe86   0x1234   0x4ac0       In the first region are words 0 and 1  which are identical to how they appear in memory  0x640b and 0x07e5  In  the second region are words 2 through 17  These words are split among the two sixteen byte regions  For example   word 2  which extract_word   reports is 0xba59  is constructed from the low byte in word 2  Oxba  and the low byte  in word 10  0x59   Since 0xba59   0x1234   Oxf827  we see that the low byte in word 2 of b is Oxf8  and the low byte  in word 10 is 0x27    When we reach word 22  we are in the third region of memory  and words are once again identical to how they  appear in memory    While this is confusing  we stress that that so long as you call multiply_region   with pointers of the same align   ment and regions of the same size  your results with ALTMAP will be consistent  If you call it with pointers of    7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 34    different alignments  or with different region sizes  then the results will not be consistent  To reiterate  if you don   t use  ALTMAP  you don   t have 
12.  400 500 600 700 800 900 1000       0 100 200 300 400 500 600 700 800 900 1000  Speed  Mega ops   s     Figure 5  Speed of doing single multiplications for w      32  64  128      because it uses a special application of Euclid   s method  which relies on division in GF  2      which is very fast     11 3 Multiply Region      Tables 3 through 8 show the performance of the various region operations  It should be noted that for GF  218   through G F  21 8   the default is not the fastest implementation of multiply_region    The reasons for this are outlined  in section 6    For these tables  we performed 1GB worth of multiply_region   calls for all regions of size 2   bytes for 10  lt  i  lt   30  In the table  we plot the fastest speed obtained    We note that the performance of    CAUCHY    can be improved with techniques from  LSXP13  and  PSR12      REFERENCES 42    Default   m CARRY_FREE  p Oxc5     32   m COMPOSITE 2     ws    0 1 2 3 4 5 6 7 8 9 10 11 12 13    Default   m COMPOSITE 2      64   m COMPOSITE 2  m COMPOSITE 2       w     0 1 2 3 4 5 6 7 8 9 10 11 12 13    Default m    0 1 2 3 4 5 6 7 8 9 10 11 12 13  Speed  Mega ops   s     Figure 6  Speed of doing single divisions for w      32  64  128      Speed  MBI      m TABLE  Default    11879 909   m TABLE  r CAUCHY   9079 712   m BYTWO_b   5242 400   m BYTWO_p   4078 431   m BYTWO_b  r NOSSE   3799 699     m TABLE  r QUAD   3014 315     m TABLE  r DOUBLE   2253 627   m BYTWO_p  r NOSSE   2021 237   m TABLE  r NOSSE 
13.  Figures 4 for w      4 8 16  and 5 for w      32  64 128   These  numbers were obtained by calling gf_time with the size and iterations both set to 10240  We plot the speed in mega   ops per second    As would be anticipated  the inlined operations  see section 7 1  outperform the others  Additionally  in all  cases with the exception of w   32  the defaults are the fastest performing implementations  With w   32      CARRY_FREE    is the fastest with an alternate polynomial  see section 7 7   Because we require the defaults to  use a    standard    polynomial  we cannot use this implementation as the default     11 2 DivideQ     For the    TABLE    and    LOG    implementations  the performance of division is the same as multiplication  This means  that for w      4 8  16   it is very fast indeed  For the other implementations  division is implemented with Euclid   s  method  and is several factors slower than multiplication    In Figure 6  we plot the speed of a few implementations of the larger word sizes  Compared to the    TABLE    and     LOG    implemenations for the smaller word sizes  where the speeds are in the hundreds of mega ops per second   these are very slow  Of note is the    COMPOSITE    implementation for w   32  which is much faster than the others    11 TIMINGS 41     m CARRY_FREE  p 0xc5       m FT  _   m BYTWO_b   w  32  0 100 200 300 400 500 600 700 800 900 1000   m CARRY_FREE   m GROUP 44   m GROUP 4 8     m BYTWO_p   m COMPOSITE 2      0 100 200 300
14.  GF  216  when  source mod 16     dest mod 16    6  The  alignment parameters are s   32 and t   16  The multiplication is in three phases  which correspond to the initial  unaligned region  10 bytes   the aligned region of s byte chunks  256 bytes   and the final leftover region  8 bytes      chunks of s bytes  until the remaining part of the region is less than s bytes  At that point  the third phase calls  multiply wxx   on the last part of the region   A detailed example helps to illustrate  Suppose we make the following call in GF  216  with SSE enabled     multiply_region w32 g f  0x10006  0x20006  a  274  0     First  note that source and dest are aligned on two byte quantities  which they must be in GF  218   Second  note  that size is a multiple of  2    2  And last  note that  source mod 16  equals  dest mod 16   We illustrate the three  phases of region multiplication in Figure 2  Because  source mod 16    6  there are 10 bytes of unaligned words that  are multiplied with five calls to multiply w32   in the first phase  The second phase multiplies 256 bytes  eight chunks  of s   32 bytes  using the SSE instructions  That leaves 8 bytes remaining for the third phase    When we describe the defaults and the various implementation options  we specify s and    as    alignment parame   ters       One of the advanced region options is using an alternate mapping of words to memory     ALTMAP         These interact  in a more subtle manner with alignment  Please see Sectio
15.  SHIFT   106 115   m CARRY FREE   104 299       Table 4  Speed of various calls to multiply region   for w   8     J  Luo  K  D  Bowers  A  Oprea  and L  Xu  Efficient software implementations of large finite fields  GF  2      for secure storage applications  ACM Transactions on Storage  8 2   February 2012     J  Lopez and R  Dahab  High speed software multiplication in fom  In Annual International Conference  on Cryptology in India  2000     H  Li and Q  Huan yan  Parallelized network coding with SIMD instruction sets  In International Sympo   sium on Computer Science and Computational Technology  pages 364 369  IEEE  December 2008     J  Luo  M  Shrestha  L  Xu  and J  S  Plank  Efficient encoding schedules for XOR based erasure codes   IEEE Transactions on Computing  May 2013     G  Marsaglia  The mother of all random generators  ftp    ftp taygeta com pub c mother   c  October 1994     J  S  Plank  K  M  Greenan  and E  L  Miller  A complete treatment of software implementations of  finite field arithmetic for erasure coding applications  Technical Report UT CS 13 717  University of  Tennessee  September 2013     J  S  Plank  K  M  Greenan  and E  L  Miller  Screaming fast Galois Field arithmetic using Intel SIMD  instructions  In FAST 2013  11th Usenix Conference on File and Storage Technologies  San Jose  February  2013     J  S  Plank  A tutorial on Reed Solomon coding for fault tolerance in RAID like systems  Software      Practice  amp  Experience  27 9  995   1012
16.  Sometimes it is easier to find a number   s inverse than it is to divide  In that case  you can divide by multiplying  by an inverse     4  Adding two regions of numbers in GF  2       which will be explained along with       5  Mutiplying a region of numbers in GF  2     by a constant in GF  2         Erasure coding typically boils down  to performing dot products in G F  2      For example  you may define a coding disk using the equation     Co   do   2d    4d2   8d3     That looks like three multiplications and three additions However  the way that   s implemented in a disk system  looks as in Figure 1  Large regions of disks are partitioned into w bit words in GF  2    In the example  let us  suppose that w   8  and therefore that words are bytes  Then the regions pictured are 1 KB from each disk   The bytes on disk D  are labeled d  o  di1     di 1023  and the equation above is replicated 1024 times  For  0  lt  j  lt  1024   Co j   do j   2d1     4d2     8d3 j    While it   s possible to implement each of these 1024 equations independently  using the single multiplication  and addition operations above  it is often much more efficient to aggregate  For example  most computer archi   tectures support bitwise exclusive or of 64 and 128 bit words  Thus  it makes much more sense to add regions  of numbers in 64 or 128 bit chunks rather than as words in GF  2      Multiplying a region by a constant can  leverage similar optimizations     GF Complete supports multiplication 
17.  THE SSE INSTRUCTIONS 7    The following C files compose gf_complete a  You shouldn   t have to mess with these files  but we include them in  case you have to     gf c  This implements all of the procedures in both gf_complete h and gf_int h   gf_w4 c  Procedures specific to w   4    gf_w8 c  Procedures specific to w   8    gf_w16 c  Procedures specific to w   16    gf_w32 c  Procedures specific to w   32    gf_w64 c  Procedures specific to w   64    gf_w128 c  Procedures specific to w   128    gf_wgen c  Procedures specific to other values of w between 1 and 31     gf_general c  Procedures that let you manipulate general values  regardless of whether w  lt  32  w   64  or w   128   Le  the procedures defined in gf_general h      gf_method c  Procedures to help you switch between the various implementation techniques   I e  the proce   dures defined in gf_method h      gf_rand c  The    Mother of all    random number generator   I e  the procedures defined in gf_rand h      Finally  the following C files are example applications that use the library     gf_mult c  gf_div c and gf_add  Command line tools to do multiplication  division and addition by single num   bers     gf_example_z c  Example programs to help you use the library     gf_unit c  A unit tester to verify that all of the procedures work for given values of w and implementation  options     gf_time c  A program that times the procedures for given values of w and implementation options   gf_methods c  A program t
18.  environment by trying to compile some programs that use those instructions  If it succeeds  we will run the programs and compare the output to the known correct output  Based on the support it discovers  it  will then print out two lines of compilation flags that you should use to replace the defaults in your GNUmakefile   For example  this is the output on a machine that supports SSE4 2 and PCLMUL  which is needed to use ALL of the  SSE optimizations  using clang for the compiler     UNIX gt    which_compile_flags sh clang                               CFLAGS O3  msse4  DINTEL_SSE4  maes  mpclmul     DINTEL_PCLMUL  LDFLAGS O3  msse4  maes  mpclmul  UNIX gt     You should put those lines into GNUmakefile in place of the CFLAGS and LDFLAGS lines that are currently  there  and you should make sure that GNUmakefile uses clang as its compiler    Here   s another machine   s output that only supports up to SSE3  not the SSSE3 we are looking for      UNIX gt  grep model  proc cpuinfo                            model   4   model name   Intel R  Pentium R  4 CPU 3 40GHz  model   4   model name   Intel R  Pentium R  4 CPU 3 40GHz  UNIX gt    which_compile_flags sh   CFLAGS O3  msse2    DINTEL_SSE2   LDFLAGS    03  msse2   UNIX gt     After you make  the library will be in gf_complete a  To use the library  include gf_complete h in your C programs  and create your executable with gf_complete a    Keep in mind that even though a given machine may have processor support for certain SSE 
19.  ll try the following        w 32   m SPLIT 32 4  r NOSSE  d EUCLID               You should first run gf_unit to ensure this set of options works on your environment        UNIX gt  gf_unit 32 A  1  m SPLIT 32 4  r NOSSE  d EUCLID    Size  bytes   684  UNIX gt              gf_unit first reports on the memory consumption of the gf_t  If the field is a composite field  this includes the  memory consumption of the base field s   see Section 6 5   Then _unit runs all tests for the specified parameters  If  gf_unit does not return an error  then you can be confident that there are no environmental correctness issues    Next  to get an idea of how the methods perform  you should run gf_time        UNIX gt  gf_time 32 A  1 1024 1024  m SPLIT 32 4  r NOSSE  d EUCLID    Seed  1375822062             Multiply  0 013587 s Mops  0 250 18 400 Mega ops s   Divide  0 293074 s Mops  0 250 0 853 Mega ops s   Inverse  0 280164 s Mops  0 250 0 892 Mega ops s  Region Random  XOR  0 0 002397 s MB  1 000 417 178 MB s    6 THE DEFAULTS 22       Region Random  XOR  1 0 002379 s MB  1 000 420 271 MB s  Region By Zero  XOR  0 0 000053 s MB  1 000 18978 751 MB s  Region By Zero  XOR  1 0 000032 s MB  1 000 31536 120 MB s   Region By One  XOR  0 0 000068 s MB  1 000 14614 300 MB s   Region By One  XOR  1 0 000094 s MB  1 000 10591 677 MB s   Region By Two  XOR  0 0 002124 s MB  1 000 470 741 MB s   Region By Two  XOR  1 0 002148 s MB  1 000 465 517 MB s    UNIX gt     The first column of output displays
20.  memory  With alternate mappings  each word may be split over several memory regions  so  extract_word   grabs the relevant parts of each memory region to extract the word  Below  we go over each of the  above situations in detail  Please refer to Figure 2 in Section 5 for reference     7 9 1 Alternate mappings with    SPLIT       The alternate mapping with    SPLIT    is employed so that we can best leverage mm_shuffle_epi8    Please read  PGM1 3b   for details as to why  Consider an example when w   16  In the main region of memory  the middle region in Fig   ure 2   multiplication proceeds in units of 32 bytes  which are each broken into two 16 byte regions  The first region  holds the high bytes of each word in GF  2168   and the second region holds the low bytes    Let   s look at a very detailed example  from gf_example_5 c  This program makes the following call  where gf has  been initialized for w   16  using    SPLIT    and    ALTMAP        gf multiply_region w32  amp gf  a  b  0x1234  30 2  0      In other words  it is multiplying a region a of 60 bytes  30 words  by the constant 0x1234 in GF  218   and placing  the result into b  The pointers a and b have been set up so that they are not multiples of 16  The first line of output  prints a and b     a  0x10010008c b  0x10010015c    As described in Section 5  the regions of memory are split into three parts     7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 33    1  4 bytes starting at 0x1001008c   0x10010015c   2  
21.  section 6 1   gf_create_gf from_argv   in gf method h  Creates a gf_t using C style argc argv  See section 6 1 1     gf_division_typet in gfcomplete h  the different ways to specify division when using gf_init_hard    See  section 6 4     gf_error   in gf_complete h  This prints out why an initialization call failed  See section 6 1     gf_extract in gf_complete h  This is the data type of extract_word   in a gf_t  See section 7 9 for an example  of how to use extract_word       LISTING OF PROCEDURES 37    ef _free   in gfcomplete h  If gf_init_easy    gfinit_hard   or create_gffrom_argv   allocated memory  this  frees it  See section 6 4     gf func_a_b in gf_complete h  This is the data type of multiply   and divide   in a gf_t  See section 4 2 for  examples of how to use multiply   and divide       gf func_a_b in gf_complete h  This is the data type of multiply   and divide   in a gf t  See section 4 2 for  examples of how to use multiply   and divide       gf _func_a in gf_complete h  This is the data type of inverse   in a gft    gf_general_add   in gf_general h  This adds two gf_general_t   s    gf _general_divide   in gf_general h  This divides two gf_general_t   s   gf_general_do_region_check   in gf_general h  This checks a region multiply of gf_general_t   s   gf_general_do_region_multiply   in gf_general h  This does a region multiply of gf_general_t   s   gf_general_do_single_timing_test   in gf_general h  Used in gf_time c    gf_general_inverse   in gf_general h
22.  the name of the test performed  Region tests will test with and without the XOR  flag being set  see Section 4 3 for an example   The second column displays the total time the test took to complete  measured in seconds  s   The third column displays the size of the test measured in millions of operations  Mops  for  single tests and in Megabytes  MB  for the region tests  The final column displays the speed of the tests calculated  from the second and third columns  and is where you should look to get an idea of a method   s performance    If the output of gf_unit and gf_time are to your satisfaction  you can incorporate the method into application code  using create_gf_from_argv   or gf_init_hard      The performance of    Region By Zero    and    Region By One       will not change from test to test  as all methods make  the same calls for these     Region By Zero    with    XOR  1    does nothing except set up the tests  Therefore  you may  use it as a control    Because this implementation creates the tables lazily  one would expect the performance of region_multiply   to  be faster for larger regions  because the cost of creating the tables is amortized  Indeed  that is the case        UNIX gt  gf_time 32 G  1 10240 1024  m SPLIT 32 4  r NOSSE  d EUCLID    Seed  1375822694             Region Random  XOR  0 0 021350 s MB  10 000 468 381 MB s  Region Random  XOR  1 0 021171 s MB  10 000 472 353 MB s  UNIX gt     6 4 Calling gf_init_hard      We recommend that you use crea
23.  there are  seven  and in GF  264  there are fifteen  Thus  this implementation can be a space hog  However  for w   32   this is the fastest way to perform multiply   on some machines     7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 27    When this option is employed  multiply_region   is implemented in an identical fashion to when wa   w  and wy   8     5  Wa   32 and w    2   w   32 only   I was playing with a different way to use mm_shuffle_epi8    It works   but it   s slower than when wy   4     7 5 Arguments to    GROUP       The    GROUP    multiplication option takes two arguments  gs and g    It implements multiplication in the same manner  as    SHIFT     except it uses a table of size 29s to perform gs shifts at a time  and a table of size 2  to perform gr  reductions at at time  The program gf_methods only prints the options 4 4 and 4 8 as arguments for    GROUP      However  other values of g  and gr are legal and sometimes desirable     e Forw  lt  32 and w   64  any values of gs and g  may be used  so long as they are less than or equal to w and so  long as the tables fit into memory  There are four exceptions to this  listed below     e For w   4     GROUP    is not supported    e For w   8     GROUP    is not supported    e For w   16     GROUP    is only supported for gs   gr   4    e For w   128    GROUP    requires SSE  and only supports gs   4 and gr      4 8  16      The way that gs and gy impact performance is as follows  The    SHIFT    implementat
24. 2  O0x9ee7cb3e    0x12345678 0xa6793274  Word 8  0x15331ldlc    0x12345678   0xda453482 Word 23  0x131e9660    0x12345678 Oxf7d034ee  Word 9  0x46dc3ab8 x   0x12345678   0x2e183746 Word 24  Oxde4elb12   0x12345678 Oxla98d9dd  Word 10  Oxf74635d7   0x12345678 0x909993cc Word 25  0x46bc47e0 x   0x12345678 0x6038765f  Word 11  0xc504c518    0x12345678 0x0902288b Word 26  Ox5bc9clla    0x12345678 Oxb2f  333    Word 12  0x35aa6493 x   0x12345678 Oxb  5ebfd9 Word 27  0x931d7f07    0x12345678 Oxe7937e49             8 THREAD SAFETY 35    Word 13  0x72c21d97   0x12345678   0x69b541d1 Word 28  O0xd40676e0   0x12345678   Oxfa5a5867  Word 14  O0x98f9b37c   0x12345678   0xd9107639 Word 29  Oxc85cfe86   0x12345678   0x79c00ea2    As with    SPLIT     using multiply_region   with    COMPOSITE    and    ALTMAP       will be consistent only if the  alignment of pointers and region sizes are identical     7 9 3 The mapping of    CAUCHY       With    CAUCHY     the region is partitioned into w subregions  and each word in the region is broken into w bits   each of which is stored in a different subregion  To illustrate  gfexample_7 multiplies a region of three bytes by 5  in GF 23  using    CAUCHY        UNIX gt  gf_example_7  a  0x100100190 b  0x1001001a0    a  0x0b Oxe5 Oxba  b  Oxee Oxba 0x0b    a bits  00001011 11100101 10111010  b bits  11101110 10111010 00001011    Word  Word  Word  Word  Word  Word  Word  Word  UNIX gt     YHOU BWNHE OC  ANA BON OW  annann a  ll  WrRPwWNTPA SD        FF 
25. 32 bytes starting at 0x10010090  0x100100160   3  24 bytes starting at 0x100100b0  0x100100180     In the first and third parts  the bytes are laid out according to the standard mapping  However  the second part is  split into two 16 byte regions     one that holds the high bytes of each word and one that holds the low bytes  To help  illustrate  the remainder of the output prints the 30 words of a and b as they appear in memory  and then the 30 return  values of extract_word w32       0 1 2 3 4 5 6 7 8 9  640b O7e5 2fba ce5d f1f9 3ab8 c518 1d97 45a7 0160  b  lba3 644e 84f8 be3c 4318 4905 b2fb 46eb ef01 a503    w    10 11 12 13 14 I5 16 17 18 19  a  3759 b107 9660 3fde b3ea 8a53 75ff 46dc c504 72c2  b  da27 e166 a0d2 b3a2 1699 3a3e 47fb 39af 1314 8e76    20 21 22 23 24 25 26 27 28 29  a  b469 1b97 e91d 1dbc 131e 47e0 clla 7f07 76e0 fe86  b  937c a5db Olb7 7f5f 8974 05e1 cff3 a09c de3c 4ac0             Word 0  0x640b   0x1234   Oxlba3 Word 15  0x4575   0x1234   0xef47  Word 1  0x07e5   0x1234   0x644e Word 16  O0x60dc   0x1234   0x03af  Word 2  Oxba59   0x1234   Oxf827 Word 17  0x0146   0x1234   0xa539  Word 3  Ox2f37   0x1234   0x84da Word 18  Oxc504   0x1234   0x1314  Word 4  0x5d07   0x1234   0x3c66 Word 19  0x72c2   0x1234   0x8e76  Word 5  QOxcebl x 0x1234   Oxbeel Word 20  O0xb469    0x1234   0x937c  Word 6  Oxf960   0x1234   0x18d2 Word 21  0x1b97   0x1234   Oxa5db  Word 7  0xf196   0x1234   0x43a0 Word 22  Oxe9ld   0x1234   0x01b7  Word 8  Oxb8de   0x1234   0x05a2 Word
26. 874149 128h  b1e34d34b031660676965b868b892043   UNIX gt  gf_div 382  12719ffe3978385f5d97540al3al b4c06abladbbec2f4b0ffc68e43008cbh 128h   e25209c145c0bfF29b85b21lalae2921fa   UNIX gt     5 Important Information on Alignment when Multiplying Regions    In order to make multiplication of regions fast  we often employ 64 and 128 bit instructions  This has ramifications  for pointer alignment  because we want to avoid bus errors  and because on many machines  loading and manipulating  aligned quantities is much faster than unalinged quantities     When you perform multiply_region wxx gf  source  dest  value  size  add   there are three requirements     1  The pointers source and dest must be aligned for w bit words  For w   4 and w   8  there is no restriction   however for w   16  the pointers must be multiples of 2  for w   32  they must be multiples of 4  and for  w E  64  128   they must be multiples of 8     2  The size must be a multiple of  3   With w   4 and w   8   3    1 and there is no restriction  The other  sizes must be multiples of  z  because you have to be multiplying whole elements of GF  2         3  The source and dest pointers must be aligned identically with respect to each other for the implementation  chosen  This is subtle  and we explain it in detail in the next few paragraphs  However  if you   d rather not figure  it out  the following recommendation will always work in GF Complete     If you want to be safe  make sure that source and dest are both mu
27. F F F F FH    The program prints the three bytes of a and b in hexadecimal and in binary  To see how words are broken up   consider word 0  which is the lowest bit of each of the three bytes of a  and b   These are the bits 1  1 and 0 in a  and  0  0  and 1 in b  Accordingly  the word is 3 in a  and 3 5   4 in b  Similarly  word 7 is the high bit in each byte  0  1  1   6  in a  and 1  1  0  3  in b    With    CAUCHY     multiply_region   may be implemented exclusively with XOR operations  Please see  BKK 95   for more information on the motivation behind    CAUCHY        8 Thread Safety    Once you initialize a gf_t  you may use it wontonly in multiple threads for all operations except for the ones below   With the implementations listed below  the scratch space in the gf_t is used for temporary tables  and therefore you  cannot call region_multiply  and in some cases multiply from multiple threads because they will overwrite each  others    tables  In these cases  if you want to call the procedures from multiple threads  you should allocate a separate  gf_t for each thread     e All    GROUP    implementations are not thread safe for either region_multiply   or multiplyQ  Other than     GROUP     multiply  is always thread safe     9 LISTING OF PROCEDURES 36    For w   4  region_multiply w32   is unsafe in in     m TABLE  r QUAD  r LAZY      For w   8  region_multiply w32   is unsafe in in     m TABLE  r DOUBLE  r LAZY      For w   16  region_multiply w32   is unsafe in in 
28. GF Complete  A Comprehensive Open Source Library for Galois  Field Arithmetic    Version 1 01    James S  Plank  Ethan L  Miller Kevin M  Greenan Benjamin A  Arnold  John A  Burnum Adam W  Disney Allen C  McBride    November 12  2013    Technical Report UT CS 13 716  Department of Electrical Engineering and Computer Science  University of Tennessee  Knoxville  TN 37996  http   www cs utk edu  plank plank papers CS 13 716 html    This is a user   s manual for GF Complete  version 1 0  This release supersedes version 0 1 and represents the first  major release of GF Complete  To our knowledge  this library implements every Galois Field multiplication technique  applicable to erasure coding for storage  which is why we named it GF Complete  The primary goal of this library is  to allow storage system researchers and implementors to utilize very fast Galois Field arithmetic for Reed Solomon  coding and the like in their storage installations  The secondary goal is to allow those who want to explore different  ways to perform Galois Field arithmetic to be able to do so effectively     If You Use This Library or Document    Please send me an email to let me know how it goes  Or send me an email just to let me know you are using the  library  One of the ways in which we are evaluated both internally and externally is by the impact of our work  and if  you have found this library and or this document useful  we would like to be able to document it  Please send mail to  plank cs utk e
29. INLINE_MULT log   alog  a  b  to multiply a and b  and the macro GF_W16_INLINE_DIV log  alog  a  b  to divide a and b  Make  sure you use the alog table returned by gf_w16_get_mult_alog_table   for multiplication and the one returned by  gf_w16_get_div_alog_table   for division  Here are some timings     UNIX gt  gf_time 4 M 0 10240 10240    Seed  0  Multiply  0 228860 s Mops  100 000 436 949 Mega ops s  UNIX gt  gf_inline_time 4 0 10240 10240  Seed  0  Inline mult  0 096859 s Mops  100 000 1032 424 Mega ops s  UNIX gt  gf_time 8 M 0 10240 10240    Seed  0  Multiply  0 228931 s Mops  100 000 436 812 Mega ops s  UNIX gt  gf_inline_time 8 0 10240 10240  Seed  0  Inline mult  0 114300 s Mops  100 000 874 889 Mega ops s  UNIX gt  gf_time 16 M 0 10240 10240    Seed  0  Multiply  0 193626 s Mops  50 000 258 229 Mega ops s  UNIX gt  gf_inline_time 16 0 10240 10240  Seed  0  Inline mult  0 310229 s Mops  100 000 322 342 Mega ops s  UNIX gt                 7 2 Using different techniques for single and region multiplication    You may want to    mix and match    the techniques  For example  suppose you   d like to use     m SPLIT 8 8    for  multiply   in GF  23    because it   s fast  and you don   t mind consuming all of that space for tables  However  for  multiply_region    you   d like to use     m SPLIT 32 4  r ALTMAP     because that   s the fastest way to implement  multiply_region    Unfortunately  There is no way to create a gf_t that does this combination  In this case  you 
30. SITE 2  m BYTWO_p           In the above example  the red text applies to the base field  and the black text applies to the composite field     Composite fields have two defining polynomials     one for the composite field  and one for the base field  Thus  if  you want to change polynomials  you should change both  The polynomial for the composite field must be of the  form x    sx 1  where s is an element of GF  2    To change it  you specify s  in hexadecimal  with     p     In the  example below  we multiply 20000 and 30000 in GF   28     setting s to three  and using 2     2   3  g   1  as the polynomial for the base field       gfmult 20000 30000 16  m COMPOSITE 2  p Oxlld    p 0x3         6 THE DEFAULTS 19    If you use composite fields  you should consider using    ALTMAP    as well  The reason is that the region  operations will go much faster  Please see section 7 6     As with changing the polynomial  when you use a composite field  GF   2      you are using a different field  than the    standard    field for GF 2     All Galois Fields are isomorphic to each other  so they all have the  desired properties  however  the fields themselves change when you use composite fields     With the exception of    COMPOSITE     only one multiplication technique can be provided for a given Galois  Field instance  Composite fields may use composite fields as their base fields  in which case the specification will be  recursive     6 1 4 Changing the Division Technique    There a
31. able to verify the gf_add  statements below in your head  As for the other gf_mult   s  you can simply verify that division and multiplication work  with each other as you hope they would     IX gt  gf_mult 5 4 4    sI c    UNIX gt  gf_div 75 4    P    IX gt  gf_div 744    U C    IX gt  gf_mult 8000 2 16h   100b   NIX gt  gf_add fOfOfOfOfFOfFOfOfO 1313131313131313 64h   3e3e3e3e3e3e3e3   NIX gt  gf_mult fOfOfOfOfOfFOfFOFO 1313131313131313 64h   da08da08da08da0   IX gt  gf_div 8da08da08da08da0 1313131313131313 64h   OfOfFOFOFOFOFOFO   UNIX gt  gf_add fOfOfOfOfOfOfOf01313131313131313 1313131313131313f0fOfOfOfOfOfOfO 128h  3e3e3e3e3e3e3e3e3e3e3e3e3e3e3e3   IX gt  gf_mult fOfOfOfOfOfFOfOf01313131313131313 1313131313131313f0fOfOfOfOfOfOfO 128h  86278627862784982d782d782d7816e   UNIX gt  gf_div 786278627862784982d782d782d7816e fOf0FOF0F0F0  0F01313131313131313 128h  313131313131313f0f  0f  0  0FOF0FOFO   UNIX gt     C    C       C   o cC    m        I c    hb                Don   t bother trying to read the source code of these programs yet  Start with some simpler examples like the ones  below     4 2 Quick Starting Example  1  Simple multiplication and division    The next two examples are intended for those who just want to use the library without getting too complex  The  first example is gf_example_1  and it takes one command line argument     w  which must be between 1 and 32  It  generates two random non zero numbers in GF  2     and multiplies them  After doing that  it div
32. alois  Field are the integers 0  1       2        1  Galois Field arithmetic defines addition and multiplication over these closed  sets of integers in such a way that they work as you would hope they would work  Specifically  every number has a  unique multiplicative inverse  Moreover  there is a value  typically the value 2  which has the property that you can  enumerate all of the non zero elements of the field by taking that value to successively higher powers    Addition in a Galois Field is equal to the bitwise exclusive or operation  That   s nice and convenient  Multiplication  is a little more complex  and there are many  many ways to implement it  The Paper describes them all  and the  following references provide more supporting material   Anv09  GMS08  LHy08  LD00  LBOX12  Pla97   The intent  of this library is to implement all of the techniques  That way  their performance may be compared  and their tradeoffs  may be analyzed    When used for erasure codes  there are typically five important operations     1  Adding two numbers in GF  2       That   s bitwise exclusive or     2  Multiplying two numbers in GF  2    Erasure codes are usually based on matrices in GF  2    and construct   ing these matrices requires both addition and multiplication     3  Dividing two numbers in GF  2      Sometimes you need to divide to construct matrices  for example  Cauchy  Reed Solomon codes  BKK 95  Rab89    More often  though  you use division to invert matrices for decoding  
33. an use create_gf_from_argv    included from gfmethod h  which uses an argy style array of  strings to specify the options that you want  The procedure in gf_method c parses the array and makes the proper  gf init_hard   procedure call  This is the technique used to parse the command line in gf_mult  gf_div  gf_unit et al     6 1 1 Changing the Components of a Galois Field with create_gf_from_argv    There are five main components to every Galois Field instance     w  Multiplication technique  Division technique  Region technique s   Polynomial    The procedures gf_init_hard   and create_gf_from_argv   allow you to specify these parameters when you create  your Galois Field instance  We focus first on create_gf_from_argv    because that is how the tools allow you to specify  the components  The prototype of create_gf_from_argv   is as follows     int create_gf_from_argv gf_t   gf  int w  int argc  char  xargv  int starting      6 THE DEFAULTS 16    You pass it a pointer to a gf_t  which it will initialize  You specify the word size with the parameter w  and then you  pass it an arge argv pair as in any C or C   program  You also specify a starting argument  which is where in argv  the specifications begin  If it successfully parses argc and argv  then it creates the gf_t using gf_init_hard    described  below in section 6 4   It returns one past the last index of argv that it considered when creating the gf_t  If it fails  then  it returns zero  and the gf_t is unmodified    F
34. and division of single values for all values of w  lt  32  plus w   64 and w    128  It also supports adding two regions of memory  for any value of w  since addition equals XOR   and multiplying  a region by a constant in GF  24   GF  28   GF  216   GF  232   GF  264  and GF 21 8   These values are chosen  because words in GF  2   fit into machine words with these values of w  Other values of w don   t lend themselves  to efficient multiplication of regions by constants  although see the    CAUCHY    option in section 6 1 5 for a way to  multiply regions for other values of w      2 FILES IN THE LIBRARY 6    Coo   doo   2d  9   4d    dzo  Co   dy    2d     4d     dz   a Cm doz a 2d    4d      8d   Co3   Cha   2d     4d     8d      Co 1022   49 1022   24  1022   42 1022   8d3 1022  Co 1023   do 1023   24  1023   44 gt  1023   8d3 1023    Figure 1  An example of adding two regions of numbers  and multiplying a region of numbers by a constant  in GF 2      In this example  w   8  and each disk is holding a 1KB region  The same coding equation      coj   do j   ad   j   a   dz2 j   a  d3   is applied 1024 times  However  rather than executing this equation 1024  times  it is more efficient to implement this with three region constant multiplications and three region region addi   tions                                                  2 Files in the Library    The following files compose GF Complete  First  the header files     e gf_complete h  This is the header file that applic
35. and selectively  XOR ing the multiplicand  See The Paper for more detail  It can leverage Anvin   s optimization that multiplies  64 and 128 bits of numbers in GF  2   by two with just a few instructions  The SSE version requires SSE2     6 THE DEFAULTS 18       BYTWO_b     This implements multiplication by successively multiplying the multiplicand by two and selec   tively XOR ing it into the product  It can also leverage Anvin   s optimization  and it has the feature that when  you   re multiplying a region by a very small constant  like 2   it can terminate the multiplication early  As such   if you are multiplying regions of bytes by two  as in the Linux RAID 6 Reed Solomon code  Anv09    this is  the fastest of the techniques  regardless of the value of w  The SSE version requires SSE2        SPLIT     Split multiplication tables  like the LR tables in  GMS08   or the SIMD tables for w  gt  8 in  LHy08   Anv09  PGM13b    This argument must be followed by two more arguments  wa and wy  which are the index  sizes of the sub tables  This implementation reduces the size of the table from    TABLE     but requires multiple  table lookups  For example  the following multiplies 100 and 200 in GF  28  using two 4K tables  as opposed  to one 64K table when you use    TABLE        UNIX gt    gf_mult 100 200 8  m SPLIT 8 4    79  UNIX gt     See section 7 4 for additional information on the arguments to    SPLIT     The SSE version typically requires  SSSE3        GROUP     This im
36. and then a number  in hexadecimal  the leading    Ox    is optional   It is assumed that the w th bit of the polynomial is set     you may include  it or omit it  For example  if you wish to set the polynomial for GF  216  to 21     2     x3   x   1  rather than its  default of x16   xt    x3  x  1  you may say     p 0x1002d         p 1002d         p 0x2d    or     p 2d       We discuss changing the polynomial for three reasons in other sections     e Leveraging carry free multiplication  section 7 7    e Defining composite fields  section 7 6    e Implementing rings  section 7 8 1      Some words about nomenclature with respect to the polynomial  A Galois Field requires the polynomial to be  irreducible   That means that it cannot be factored  For example  when the coefficients are binary  the polynomial x      x  a 1 may be factored as  2  1  x 1   Therefore it is not irreducible and cannot be used to define a Galois Field   It may  however  be used to define a ring  Please see section 7 8 1 for a discussion of ring support in GF Complete    There is a subset of irreducible polynomials called primitive  These have an important property that one may enu   merate all of the elements of the field by raising 2 to successive posers  All of the default polynomials in GF Complete  are primitive  However  so long as a polynomial is irreducible  it defines a Galois Field  Please see section 7 7 for a  further discussion of the polynomial    One thing that we want to stress here is t
37. ase field  in base_gf  The base field is itself a gf_t  which should have been created previously with create_gf_from_argv     gf_init_easy   or gf_init_hard    Note that this base_gf has its own base_gf member and can be a composite field  itself    You can specify an alternate polynomial in prim_poly  For w  lt  32  the leftmost one  the one in bit position w  is  optional  If you omit it  it will be added for you  For w   64  there   s no room for that one  so you have to leave it off   For w   128  your polynomial can only use the bottom most 64 bits  Fortunately  the standard polynomial only uses  those bits  If you set prim_poly to zero  the library selects the    standard    polynomial    Finally  scratch_memory is there in case you don   t want gf_init_hard   to call malloc    You may call gf_scratch_size    to find out how much extra memory each technique uses  and then you may pass it a pointer for it to use in scratch_memory   If you set scratch_memory to NULL  then the extra memory is allocated for you with malloc    If you use gf_init_easy    or create_gf_from_argv    or you use gf_init_hard   and set scratch_memory to NULL  then you should call gf_free    to free memory  If you use gf_init_hard   and use your own scratch_memory you can still call gf_free    and it will  not do anything    Both gfinit_hard   and gf scratch_size   return zero if the arguments don   t specify a valid gf_t  When that hap   pens  you can call gf_error   to print why the call failed  
38. ations should include  It defines the gf_t type  which holds all  of the data that you need to perform the various operations in GF  2    It also defines all of the arithmetic oper   ations  For an application to use this library  you should compile the library gf_complete a and then applications  include gf_complete h and compile with the library     gf_method h  If you are wanting to modify the implementation techniques from the defaults  this file provides  a    helper    function so that you can do it from the Unix command line     gf_general h  This file has helper routines for doing basic Galois Field operations with any legal value of w   The problem is that w  lt  32  w   64 and w   128 all have different data types  which is a pain  The procedures  in this file try to alleviate that pain  They are used in gf_unit and gf_time  I   m guessing that most applications  won   t use them  as most applications use w  lt  32     gf_rand h  I   ve learned that srand48   and its kin are not supported in all C installations  Therefore  this file  defines some random number generators to help test the programs  The random number generator is the    Mother  of All    random number generator  Mar94  which we   ve selected because it has no patent issues  gf_unit and  gf_time use these random number generators     gf_int h  This is an internal header file that the various source files use  This is not intended for applications to  include     3 COMPILATION  ESPECIALLY WITH REGARD TO
39. b   309 971   m BYTWO_p   258 623   m GROUP 4 8   242 076   m GROUP 4 4   227 399   m CARRY _FREE   226 785   m BYTWO_b  r NOSSE   143 403   m BYTWO_p  r NOSSE   111 956   m SHIFT   52 295       Table 6  Speed of various calls to multiply_region   for w   32     Speed  MBA      m SPLIT 64 4  r ALTMAP   3522 798   m SPLIT 64 4  r SSE  Default    2647 862   m COMPOSITE 2  m SPLIT 32 4  r ALTMAP    r ALTMAP   2461 572   m COMPOSITE 2    r ALTMAP   1860 921   m SPLIT 64 16   1066 490   m SPLIT 64 8   998 461   m CARRY_FREE   975 290   m SPLIT 64 4  r NOSSE   545 479   m GROUP 4 4   230 137   m GROUP 4 8   153 947   m BYTWO_b   144 052   m BYTWO_p   124 538   m SPLIT 8 8   98 892   m BYTWO_p  r NOSSE   77 912   m COMPOSITE 2     77 522   m BYTWO_b  r NOSSE   36 391   m SHIFT   25 282       Table 7  Speed of various calls to multiply_region   for w   64     45    REFERENCES    Speed  MBA      m SPLIT 128 4  r ALTMAP   1727 683   m COMPOSITE 2  m SPLIT 64 4  r ALTMAP    r ALTMAP   1385 693   m COMPOSITE 2    r ALTMAP   1041 456   m SPLIT 128 8  Default  872 619   m CARRY_FREE   814 030   m SPLIT 128 4   500 133     m COMPOSITE 2     289 207   m GROUP 4 8   133 583   m GROUP 4 4   116 187   m BYTWO_p   25 162   m BYTWO_b   25 157   m SHIFT   14 183       Table 8  Speed of various calls to multiply_region   for w   128     46    
40. ct of 0x3fde07e5 and 0x12345678 is 0x21 1c880d  which is stored in the lower  two bytes of words 1 and 13 of b           a  0x10010011ic b  0x1001001lec  0 T 2 3 4 5 6 7 8 9  a  562c640b 959407e5 56592fba cbadce5d ldlcf1f9 35d73ab8 6493c518 b37c1d97 8e4545a7 c0d80160  b    589f36c   146880d 74  7b349 7ea7c5c6 34827cla 93cc3746 bfd9288b 763941d1 bcd33a5d da695e64  10 11 12 13 14 15 16 17 18 19  a  965b3759 cb3eb107 1b129660 95a33fde 95a7b3ea dl6c8a53 153375ff f74646dc 35aac504 98f972c2  b  fd70f125 3274fa8f d9dd34ee c01la211c d4402403 8b55c08b da45f0ad 90992e18 b65e0902 d91069b5  20 21 22 23 24 25 26 27 28 29  a  5509b469 7  8a1b97 3472e91d 9ee71dbc de4el3le 46bc47e0 S5Sbc  9clla 931d7f07 d40676e0 c85cfe86  b  fc92b8f5 edd59668 b4bc0d90 a679e4ce 1a98f7dqd0 6038765f b2ff333f e7937e49 fa5a5867 79c00ea2  Word 0  0x562c640b   0x12345678 Oxf589f36c Word 15  0xb46945a7    0x12345678 Oxb8f  53a5d  Word 1  Ox3fde07e5    0x12345678 0x211c880d Word 16  0x55098e45    0x12345678 Oxfc92bcd3  Word 2  0x95a39594    0x12345678 OxcOlaf146 Word 17  0x1b970160    0x12345678 0x96685e64  Word 3  Oxb3ea2fba    0x12345678   0x2403b349 Word 18  0x7f8ac0d8    0x12345678 Oxedd5da69  Word 4  0x95a75659   0x12345678   0xd44074  7 Word 19  0xe91d3759 x   0x12345678 Ox0d90f125  Word 5  0x8a53ce5d   0x12345678 Oxc08bc5c6 Word 20  0x3472965b   0x12345678 Oxb4bcfd70  Word 6  Oxdl6ccbad    0x12345678 Ox8b557ea7 Word 21  Oxldbcb107    0x12345678 Oxe4cefa8s8ft  Word 7  Ox75fff1f9     0x12345678 Oxf0ad7cla Word 2
41. du  Please send bug reports to that address as well    The library itself is protected by the New BSD License  It is free to use and modify within the bounds of this  license  To the authors    knowledge  none of the techniques implemented in this library have been patented  and the  authors are not pursing patents      plank cs utk edu  University of Tennessee   elm cs ucsc edu  UC Santa Cruz   kmgreen2 gmail com  Box   This material  is based upon work supported by the National Science Foundation under grants CNS 0917396  IIP 0934401 and CSR 1016636  plus REU sup   plements CNS 1034216  CSR 1128847 and CSR 1246277  Thanks to Jens Gregor for helping us wade through compilation issues  and for Will  Houston for his initial work on this library     Finding the Code    Please see http    www cs utk edu  plank plank papers CS 13 716 htm1 to get the tar file for this code   This code is actively maintained on bitbucket at https    bitbucket org jimplank gf complete     Two Related Papers    This software acccompanies a large paper that describes these implementation techniques in detail  PGM13a   We  will refer to this as    The Paper     You do not have to read The Paper to use the software  However  if you want to  start exploring the various implementations  then The Paper is where you   ll want to go to learn about the techniques  in detail    This library implements the techniques described in the paper    Screaming Fast Galois Field Arithmetic Using Intel  SIMD Instructions
42. e  larger w than    TABLE     If the polynomial is not primitive  see section 6 1 2   then you cannot use    LOG    as  an implementation  In that case  gf_init_hard   or create_gf_from_argv   will fail        LOG_ZERO     Discrete logarithm tables which include extra room for zero entries  This more than doubles  the memory consumption to remove an if statement  please see  GMS08  or The Paper for more description   It  doesn   t really make a huge deal of difference in performance        LOG_ZERO_EXT     This expends even more memory to remove another if statement  Again  please see The  Paper for an explanation  As with    LOG_ZERO     the performance difference is negligible        SHIFT     Implementation straight from the definition of Galois Field multiplication  by shifting and XOR ing   then reducing the product using the polynomial  This is slooooooo0ow  so we don   t recommend you use it        CARRY_FREE     This is identical to    SHIFT     however it leverages the SSE instruction PCLMUL to perform  carry free multiplications in single instructions  As such  it is the fastest way to perform multiplication for large  values of w when that instruction is available  Its performance depends on the polynomial used  See The Paper  for details  and see section 7 7 below for the speedups available when w   16 and w   32 if you use a different  polynomial than the default one        BYTWO_p     This implements multiplication by successively multiplying the product by two 
43. e  then XOR   s work 128 bits at a time  For    CAUCHY    to work correctly   size must be a multiple of w     It is possible to combine region multiplication options  This is fully supported as long as gf_methods has the combi   nation listed  If multiple region options are required  they should be specified independently  as flags for gf_init_hard    and independent options for command line tools and create_gf_from_argv        6 2 Determining Supported Techniques with gf methods    The program gf_methods prints a list of supported methods on standard output  It simply enumerates flag combinations  and tests to see if they initialize without error  gfmethods does not test whether a method is working correctly   although it should be correct     we simply haven   t tested every combination of flags on every machine compiler that  exists in the world   it only prints out the method if it is supported by the machine and compiler flags  Some method  combinations are not printed  such as multi layer composite methods  gfmethods does not take any command line  arguments    gf_methods prints a line for each supported combination of flags  using the following format           w  lt w value gt    m  lt mult gt   r  lt region 1 gt   r  lt region 2 gt       r  lt region n gt   d  lt division gt       gf_methods prints the format of the string argv in create_gf_from_argv       6 THE DEFAULTS 21    6 3 Testing with gf_unit and gf_time    gf_unit and gf_time may be used to verify that a c
44. e mechanics of multiplication in    TA   BLE    and    LOG    are pretty simple  For that reason  we support inlining for    TABLE    when w   4 and w   8  and  for    LOG    when w   16  We elaborate below    When w   4  you may inline multiplication and division as follows  The following procedures return pointers to  the multiplication and division tables respectively     uint8_t xgf_w4_get_mult_table gf_t    gf    uint8_t   gf_w4_get_div_table gf_t    gf            The macro GF_W4_INLINE_MULTDIV table  a  b  then multiplies or divides a by b using the given table  This  of course only works if the multiplication technique is    TABLE     which is the default for w   4  If the multiplication  technique is not    TABLE     then gf_w4_get_mult_table   will return NULL    When w   8  the procedures gf_w8_get_mult_table   and gf_w8_get_div_table    and the macro  GF_W8_INLINE_MULTDIV table  a  b  work identically to the w   4 case    When w   16  the following procedures return pointers to the logarithm table  and the two inverse logarithm tables  respectively     uintl6_t    gf_wl6_get_log_table gf_t    gf    uintl6_t  gf_wl6_get_mult_alog_table gf_t   gf    uint1l6_t  gf_wl6_get_div_alog_table gf_t    gf               7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 25    The first inverse logarithm table works for multiplication  and the second works for division  They actually point  to the same table  but to different places in the table  You may then use the macro GF_W16_
45. e the    standard    polynomials are sub optimal with respect to     CARRY_FREE     For w   16  the polynomial 0x1002d has the desired property  It   s less important  of course   with w   16  because    LOG    is so much faster than    CARRY FREE           7 8 More on Primitive Polynomials  7 8 1 Primitive Polynomials that are not Primitive    The library is willing to work with most polynomials  even if they are not primitive or irreducible  For example  the  polynomial             x    x  1 is irreducible  and therefore generates a valid Galois Field for GF  24   However  it  is not primitive  because 2     1  For that reason  if you use this polynomial  you cannot use the    LOG    method  The  other methods will work fine     NIX gt  gf_mult 2 2 4  p Oxf      md C    IX gt  gf_mult 4 2 4  p Oxf      co C    NIX gt  gf_mult 8 2 4  p Oxf      Race    5  UNIX gt  gf_mult 15 2 4  p Oxf    1  UNIX gt  gf_div 1 15 4  p Oxf    2    NIX gt  gf_div 1 15 4  p Oxf  m LOG   sage  gf_div a b w  method    does division of a and b in GF  2 w    ad Method Specification  Cannot use Log tables because the polynomial is not primitive   UNIX gt        wW cc                   If a polynomial is reducible  then it does not define a Galois Field  but instead a ring  GF Complete attempts to  work here where it can  however certain parts of the library will not work     1  Division is a best effort service  The problem is that often quotients are not unique  If divide   returns a non zero  numbe
46. eir rough performance  The performance tests are on an Intel Core i7 3770 running at  3 40 GHz  and are included solely to give a flavor of performance on a standard microprocessor  Some processors  will be faster with some techniques and others will be slower  so we only put numbers in so that you can ballpark it   For other values of w between 1 and 31  we use table lookup when w  lt  8  discrete logarithms when w  lt  16 and     Bytwo      for w  lt  32     With SSE  Memory multiply   Performance multiply _region   Performance  Mid pager Lt Se  Table Table 11 659  Table Split Table  8 4  11 824  Log Split Table  16 4  7 749  Carry Free Split Table  32 4  5 011  Carry Free Split Table  64 4  2 402  Carry Free Split Table  128 8  833    Without SSE  Memory multiply  Performance multiply region   Performance  Ma Ten   fe  Double Table  Table  Split Table  16 8   Split Table  32 8   Split Table  64 8   Split Table  128 8        Table 1  The default implementations  memory consumption and rough performance when w is a power of two  The  variables s and t are alignment variables described in Section 5     A few comments on Table 1 are in order  First  with SSE  the performance of multiply   is faster when w   64  than when w   32  That is because the primitive polynomial for w   32  that has historically been used in Galois  Field implementations  is sub ideal for using carry free multiplication  PCLMUL   You can change this polynomial   see section 7 7  so that the performance 
47. ems   even though the polynomial is reducible     UNIX gt  gf_unit 64 A 0  m COMPOSITE 2  p Oxc5    p 2    UNIX gt        How can we demonstrate that this particular field has a problem  Well  when the polynomial is Oxc5  we can factor  x   22  1 as  x   Ox7f6f95  9   x   Ox7f6  95fb   Thus  in the composite field  when we multiply 0x17f6f95f9 by  0x17f6f95fb  we get zero  That   s the problem     7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 32    UNIX gt  gf_mult 7f  6f95f9 7  6f95fb 32h  p Oxc5     1   UNIX gt  gf_mult 17f6f95f9 17f6f95fb 64h  m COMPOSITE 2  p Oxc5    p 2    0   UNIX gt        7 9    ALTMAP    considerations and extract_word    There are two times when you may employ alternate memory mappings   1  When using    SPLIT    and w    4   2  When using    COMPOSITE        Additionally  by default  the    CAUCHY    region option also employs an alternate memory mapping    When you use alternate memory mappings  the exact mapping of words in GF  2   to memory depends on the  situation  the size of the region  and the alignment of the pointers  To help you figure things out  we have included the  procedures extract_word wxx   as part of the gf_t struct  This procedure takes four parameters     e A pointer to the gf t   e The beginning of the memory region   e The number of bytes in the memory region     e The desired word number  n     It then returns the n th word in memory  When the standard mapping is employed  this simply returns the n   th contiguous word in
48. example_2 expands on gf_example_1  If w is equal to 4  8  16 or 32  it performs a region multiply  operation  It allocates two sixteen byte regions  r1 and r2  and then multiples r1 by a and puts the result in r2 using  the multiply_region w32 function pointer     gf multiply_region w32  amp gf  rl  r2  a  16  0      That last argument specifies whether to simply place the product into r2 or to XOR it with the contents that are already  in r2  Zero means to place the product there  When we run it  it prints the results of the multiply_region w32 in  hexadecimal  Again  you can verify it using gf_mult     4 SOME TOOLS AND EXAMPLES TO GET YOU STARTED 11    UNIX gt  gf_example_2 4  12 x 2   11  11   12   2  11   2   12    multiply_region by Oxc  12     R1  the source   02d9d68  a8  gdb35c188eb0615a2c4b3936  R2  the product   Ob3zs63eatla3dade7T7T9  VFcaadgaAVdec I lL bHES5SadTE67 e  UNIX gt  gf_example_2 16   49598 x   35999   19867   19867   49598   35999   19867   35999   49598   multiply_region by Oxclbe  49598    R1  the source   8c9f b30e 5bf3 7cbb 16a9 105d 9368 4Abbe    R2  the product   4d9b 992d 02f2 c95c 228e ec82 324e 35e4  UNIX gt  gf_mult clbe 8c9f 16h   4d9b   UNIX gt  gf_mult clbe b30e 16h   992d   UNIX gt     4 4 Quick Starting Example  3  Using w   64    The program in gf_example_3 c is identical to the previous program  except it uses GF  2      Now a  b and    are  uint64_t   s  and you have to use the function pointers that have w64 extensions so that the lar
49. firms     IX gt  gf_poly 1   4 1 1 1 0 1  oly  x  4 x 1  rreducible   IX gt  gf_poly 1   4 1 3 1 2 1 1 1 0 1    H U                      Poly  x  4   X73   x72  x 1  Irreducible    UNIX gt  gf_poly 1   4 1 0 1  Poly  x  4   1   Reducible    UNIX gt     For composite fields GF   2      we are looking for a value s such that x    sa   1 is irreducible  That value  depends on the base field  For example  for the default field GF  28    a value of s   2 makes the polynomial  irreducible  However  if the polynomial Oxc5 is used  so that PCLMUL is fast     see section 7 7   then s   2 yields a  reducible polynomial  but s   3 yields an irreducible one  You can use gf_poly to help verify these things  and to help  define s if you need to stray from the defaults                    UNIX gt  gf_poly 32   2 1 1 2 0 1   Poly  x72    0x2 x   1   Irreducible    UNIX gt  gf_poly 32  p Oxc5   2 1 1 2 0 1  Poly  x  2    0x2 x     Reducible    UNIX gt  gf poly 32  p Oxc5   2 1 1 3 0 1  Poly  x 2    0x3 x   1   Irreducible    UNIX gt           gf_unit does random sampling to test for problems  In particular  it chooses a random a and a random b  multiplies  them  and then tests the result by dividing it by a and b  When w is large  this sampling does not come close to  providing complete coverage to check for problems  In particular  if the polynomial is reducible  there is a good  chance that gf_unit won   t discover any problems  For example  the following gf_unit call does not flag any probl
50. ger types may be em   ployed     UNIX gt  gf_example_3   a9Yaf3adef0d23242    61fd8433b25fe7cd   bf5acdde4c4lee0c  bf5acdde4c4lee0c   a9Yaf3adef0d23242   61f  d8433b25feTcd  bf5acdde4c4lee0c   61fd8433b25fe7cd   aYaf3adef0d23242    multiply_region by a9af3adef0d23242    R1  the source   61fd8433b25feT7cd 272d5d4b19ca44b7 3870bf7e63c345la 08992149b3e2f8b7  R2  the product   bfS5acdde4c4lee0c ad2d786cb6e4d66b7 43a7d857503fd261 d3d29c7be46bl1lfic  UNIX gt  gf_mult a9af3adef0d23242 61fd8433b25fe7cd 64h   bf5acdde4c4lee0c   UNIX gt     4 5 Quick Starting Example  4  Using w   128    Finally  the program in gf_example_4 c uses GF  21 8   Since there is not universal support for uint128_t  the library  represents 128 bit numbers as arrays of two uint64_t   s  The function pointers for multiplication  division and region  multiplication now accept the return values as arguments     5 IMPORTANT INFORMATION ON ALIGNMENT WHEN MULTIPLYING REGIONS 12    gf  multiply wl128  amp gf  a  b  c    Again  we can use gf_mult and gf_div to verify the results     UNIX gt  gf_example_4  e252d9c145c0bf29b85b21alae2921fa    b23044e7f45daft4d70695fb7b  F249432    7883669ef3001d7  fab  83784d52eb414    multiply_region by e252d9c145c0bf29b85b2lalae2921fa    R1  the source   f  4f56f08fa92494c5faa57ddcd874149 b4c06abladbbec2 f4b0ffc68e43008cb  R2  the product   b1le34d34b031660676965b868b892043 382f12719ffe3978385f5d97540al3al  UNIX gt  gf_mult e252d9c145c0bDf29b85b2lalae2921fa f4F56f08Ffa92494c5faa57ddcd
51. hat changing the polynomial changes the field  so fields with different  polynomials may not be used interchangeably  So long as the polynomial is irreducible  it generates a Galois Field that    6 THE DEFAULTS 17    is isomorphic to all other Galois Fields  however the multiplication and division of elements will differ  For example   the polynomials 0x13  the default  and 0x19 in GF 2   are both irreducible  so both generate valid Galois Fields   However  their multiplication differs     oOo C w C    N       2                6 1 3    NIX gt  gf_mult 8 2 4  p 0x13      IX gt  gf mult 8 2 4  p 0x19      IX gt  gf_div 3 8 4  p 0x13      UNIX gt  gf_div 9 8 4  p 0x19      UNIX gt     Changing the Multiplication Technique    The following list describes the multiplication techinques that may be changed with     m     We keep the description  here brief  Please refer to The Paper for detailed descriptions of these techniques        TABLE     Multiplication and division are implemented with tables  The tables consume quite a bit of memory  QY x 2    x Ea bytes   so they are most useful when w is small  Please see    SSE        LAZY        DOUBLE    and     QUAD    under region techniques below for further modifications to    TABLE    to perform multiply_region          LOG     This employs discrete  or    Zeph     logarithm tables to implement multiplication and division  The  memory usage is roughly  3 x 2    x  5  bytes   so they are most useful when w is small  but they tolerat
52. hat enumerates most of the implementation methods supported by GF Complete     gf_poly c  A program to identify irreducible polynomials in regular and composite Galois Fields     3 Compilation  especially with regard to the SSE instructions    This is not a complex library  so simply using make should be sufficient  This will compile the library and the example  programs  By default this does not utilize SSE optimizations  so that it can compile on as many machines as possible   However  to do truly fast operations  if your CPU supports SSE instructions  it is best to compile GF Complete so that  it may leverage them     4 SOME TOOLS AND EXAMPLES TO GET YOU STARTED 8    The directory flag_tester contains supporting programs for figuring out which SSE instructions GF Complete can  use on your machine  The shell script which_compile_flags sh performs a battery of tests on your machine  with your  compiler  and emits the proper flags for you to put into the GNUmakefile    By default  which_compile_flags sh uses cc to build its test programs  because it is fairly standard for ce to be  symlinked to the machine   s default compiler  If you would like to specify which compiler to use  simply pass its name  as a command line argument when running the script  If you do this  don   t forget to specify the CC variable to the  same compiler in GNUmakefile    In which_compile_flags sh  we first test whether your CPU supports SSE2  SSSE3  SSE4 2  and PCLMUL  We  then test your compilation
53. ides the product by  each number    To perform multiplication and division in GF  2      you must declare an instance of the gf_t type  and then initialize  it for GF  2     by calling gf_init_easy    This is done in gf_example_l c with the following lines     4 SOME TOOLS AND EXAMPLES TO GET YOU STARTED 10    gf_t of     if    gf_init_easy    amp gf  w      fprintf stderr   Couldn   t initialize GF structure  n     exit  0          Once gf is initialized  you may use it for multiplication and division with the function pointers multiply w32 and  divide w32  These work for any element of GF  2     so long as w  lt  32     c   gf multiply w32  amp gf  a  b      printf  Su   Su   Su n   a  b  c    printf  Su   Su   Su n   c  a  gf divide w32  amp gf  c  a     printf  Su   Su   Su n   c  by gf divide w32  amp gf  c  b       Go ahead and test this program out  You can use gf_mult and gf_div to verify the results     UNIX gt  gf_example_1 4    12   4 5  5   12   4  5   4   12    NIX gt  gf_mult 12 4 4    U   5   UNIX gt  gf_example_1 16   14411   60911   44568   44568   14411   60911   44568   60911   14411   UNIX gt  gf_mult 14411 60911 16  44568   UNIX gt        gf _init_easy    and later gf_init_hard    do call malloc   to implement internal structures  To release memory  call  gf _free    Please see section 6 4 to see how to call gf_init_hard   in such a way that it doesn   t call malloc       4 3 Quick Starting Example  2  Multiplying a region by a constant    The program gf_
54. instructions  a compi   lation environment may not be able to produce code with the needed machine instructions  For example  the default  compiler in Mac OS 10 8 4  as of this writing  failed to compile PCLMUL instructions even on Core i series CPUs  which support the instructions  Using the latest version of clang  3 3  via MacPorts solved this issue  For that reason   which_compile_flags sh tests both the CPU and and compilation environment     4 Some Tools and Examples to Get You Started  4 1 Three Simple Command Line Tools  gf_mult  gf_div and gf_add    Before delving into the library  it may be helpful to explore Galois Field arithmetic with the command line tools   gf_mult  gf_div and gf_add  These perform multiplication  division and addition on elements in G F  2      The syntax  is     4 SOME TOOLS AND EXAMPLES TO GET YOU STARTED 9    e gf mult a b w   Multiplies a and b in GF 2       e gf div a b w   Divides a by bin GF 2        e gf_add a b w   Adds a and b in GF  2        You may use any value of w from 1 to 32  plus 64 and 128  By default  the values are read and printed in decimal   however  if you append an    h    to w  then a  b and the result will be printed in hexadecimal  For w   128  the  h    is  mandatory  and all values will be printed in hexadecimal    Try them out on some examples like the ones below  You of course don   t need to know that  for example  5   4   7  inGF  24   however  once you know that  you know that z   4 and 7   5  You should be 
55. ion works by performing a  carry free multiplication in w steps  and then performing reduction in w steps  In    GROUP     the carry free multipli   cation is reduced to ex   steps  and the reduction is reduced to  4    Both require tables  The table for the carry free  multiplication must be    created at the beginning of each multiply    or multiply_region    while the table for reduction  is created when the gf_t is initialized  For that reason  it makes sense for gr to be bigger than gs    To give a flavor for the impact of these arguments  Figure 3 shows the performance of varying gs and gr for  single multiplication and region multiplication respectively  in GF  232  and GF  264   As the graphs demonstrate   multiply   performs better with smaller values of gs  while multiply_region   amortizes the creation of the shifting  table  and can tolerate larger values of gs  When gs equals g   there are some optimizations that we hand encode   These can be seen clearly in the multiply_region   graphs     7 6 Considerations with    COMPOSITE       As mentioned above  using    ALTMAP    with    COMPOSITE    allows multiply_region   to recursively call multi   ply_region    rather than simply calling multiply Q on every word in the region  The difference can be pronounced     gf_time 32 G 0 10240 10240  m COMPOSITE  Speed   322 MB s  gftime 32 G 0 10240 10240  m COMPOSITE             Speed   3 368 MB s  gf time 32 G 0 10240 10240  m COMPOSITE 2  m SPLIT LTMAP    r ALTMAP    Speed
56. iplication with    ALTMAP    and then convert back to the standard representation   The performance difference using    ALTMAP    can be significant     gf_time 16 G 0 1048576 100  m SPLIT Speed   8 389 MB s  gf time 16 G 0 1048576 100  m SPLIT 16 4  r ALT Speed   9 212 MB s    gf_time 32 G 0 1048576 100  m SPLIT 32 4  r ALTMAP   Speed   7 146 MB s  gf_time 64 G 0 1048576 100  m SPLIT 64 4   Speed   2 595 MB s  gf_time 64 G 0 1048576 100  m SPLIT T Speed   3 436 MB s    2  Wa is equal to w and wy is equal to eight  Now  bis broken into bytes  each of these is used in its own 256 element  lookup table  This is typically the best way to perform multiply_region   without SSE     gf_time 32 G 0 1048576 100  m SPLIT 32 4   Speed   5 304 MB s                   Because this is a region optimization  when you specify these options  you get a default multiply       see  Table 1 for a listing of the defaults  See section 7 2 for using a different multiply   than the defaults     3  Wa is equal to w and wy is equal to 16  This is only valid for w   32 and w   64  Now  b is broken into shorts   each of these is used in its own 64K element lookup table  This is typically slower than when wy equals 8  and  requires more amortization  larger buffer sizes  to be effective     4  Wa and w  are both equal to eight  Now both a and b are broken into bytes  and the products of the various bytes  are looked up in multiple 256 x 256 tables  In G F  218   there are three of these tables  In GF 23   
57. lignment errors  see Section 5     Mutually exclusive region types  Some combinations of region types are invalid  All valid and implemented  combinations are printed by gf_methods c     Incompatible division types  Some choices of multiplication type constrain choice of divide type  For ex   ample     COMPOSITE    methods only allow the default division type  which divides by finding inverses  i e    neither    EUCLID    nor    MATRIX    are allowed   For each multiplication method printed by gf_methods c  the  corresponding valid division types are also printed     Arbitrary    GROUP    arguments  The legal arguments to    GROUP     are specified in section 7 5     Arbitrary    SPLIT    arguments  The legal arguments to    SPLIT    are specified in section 7 4     Threading problems  For threading questions  see Section 8     No default polynomial  If you change the polynomial in a base field using    COMPOSITE     then unless it is  a special case for which GF Complete finds a default polynomial  you    ll need to specify the polynomial of the  composite field too  See 7 8 2 for the fields where GF Complete will support default polynomials     Encoding decoding with different fields  Certain fields are not compatible  Please see section 7 2 for an  explanation        ALTMAP     is confusing  We agree  Please see section 7 9 for more explanation     I used    ALTMAP    and it doesn   t appear to be functioning correctly  With 7 9  the size of the region and  its alignment b
58. ltiples of 16  That is not a  strict requirement  but it will always work     If you want to relax the above recommendation  please read further    When performing multiply_region wxx    the implementation is typically optimized for a region of bytes whose  size must be a multiple of a variable s  and which must be aligned to a multiple of another variable t  For example   when doing multiply_region w32   in GF  216  with SSE enabled  the implementation is optimized for regions of  32 bytes  which must be aligned on a 16 byte quantity  Thus  s   32 and t   16  However  we don   t want multi   ply_region w32   to be too restrictive  so instead of requiring source and dest to be aligned to 16 byte regions  we  require that  source mod 16  equal  dest mod 16   Or  in general  that  source mod t  equal  dest mod t     Then  multiply_region wxx   proceeds in three phases  In the first phase  multiply wxx   is called on successive  words until  source mod t  equals zero  The second phase then performs the optimized region multiplication on    6 THE DEFAULTS 13    0x10006 0x10010  which is aligned on      16 bytes  0x10100 0x10108       1 gt     Source                                        10 bytes multiplied 256 bytes multiplied in chunks 8 bytes multiplied  with 5 calls to of s   32 bytes at a time  with 4 calls to  r multiply w32    r multiply w32     dest  0x20006 0x20010  which is aligned on 16 bytes  0x20100 0x20108    Figure 2  Example of multiplying a region of 274 bytes in
59. ltiply_region       e    ALTMAP     Use an alternate mapping  where words are split across different subregions of memory  There  are two places where this matters  The first is when implementing    SPLIT w 4    using SSE when w  gt  8  In  these cases  each byte of the word is stored in a different 128 bit vector  which allows the implementation to  better leverage 16 byte table lookups  See section 7 4 for examples  and The Paper or  PGM13b  for detailed  explanations     The second place where it matters is when using    COMPOSITE     In this case  it is advantageous to split each  memory region into two chunks  and to store half of each word in a different chunk  This allows us to call  region_multiply   recursively on the base field  which is much faster than the alternative  See Section 7 6 for  examples  and The Paper for an explanation     It is important to note that with    ALTMAP     the words are not    converted    from a standard mapping to an  alternate mapping and back again  They are assumed to always be in the alternate mapping  This typically  doesn   t matter  so long as you always use the same    ALTMAP     calls  Please see section 7 9 for further details  on    ALTMAP     especially with respect to alignment        CAUCHY     Break memory into w subregions and perform only XOR   s as in Cauchy Reed Solomon cod   ing  BKK7T95   also described in The Paper   This works for any value of w  lt  32  even those that are not  powers of two  If SSE2 is availabl
60. matches w   64     6 THE DEFAULTS 15    The region operations for w   4 and w   8 without SSE have been selected to have a low memory footprint  There  are better options that consume more memory  or that only work on large memory regions  see section 6 1 5      6 1 Changing the Defaults  There are times that you may want to stray from the defaults  For example     You may want better performance    You may want a lower memory footprint    You may want to use a different Galois Field or even a ring   You only care about multiplying a region by the value two     Our command line tools allow you to deviate from the defaults  and we have two C functions     gf_init_hard    and create_gf_from_argv       that can be called from application code to override the default methods  There are six  command line tools that can be used to explore the many techniques implemented in GF Complete     gf_methods is a tool that enumerates most of the possible command line arguments that can be sent to the other  tools     gf_mult and gf_div are explained above  You may change the multiplication and division technique in these  tools if you desire     gf_unit performs unit tests on a set of techniques to verify correctness     gf_time measures the performance of a particular set of techniques     gf_poly tests the irreducibility of polynomials in a Galois Field     To change the default behavior in application code  you need to call gf_init_hard   rather than gf_init_easy     Alternatively  you c
61. mplete h  This returns a pointer to an inverse logarithm table that can be  used for inlining multiplication when w   16  See section 7 1     gf_w4_get_div_table   in gf_complete h  This returns a pointer to a division table that can be used for inlining  when w   4  See section 7 1     gf_w4_get_mult_table   in gfcomplete h  This returns a pointer to a multiplication table that can be used for  inlining when w   4  See section 7 1     gf_w8_get_div_table   in gf complete h  This returns a pointer to a division table that can be used for inlining  when w   8  See section 7 1     gf_w8_get_mult_table   in gf_complete h  This returns a pointer to a multiplication table that can be used for  inlining when w   8  See section 7 1     10 TROUBLESHOOTING 39    10 Troubleshooting    SSE support  Leveraging SSE instructions requires processor support as well as compiler support  For exam   ple  the Mac OS 10 8 4  and possibly earlier versions  default compile environment fails to properly compile  PCLMUL instructions  This issue can be fixed by installing an alternative compiler  see Section 3 for details     Initialization segfaults  You have to already have allocated your gft before you pass a pointer to it in  gf _init_easy    create_gf_from_argv    or gf_init_hard       GF Complete is slower than it should be  Perhaps your machine has SSE  but you haven   t specified the SSE  compilation flags  See section 3 for how to compile using the proper flags     Bad alignment  If you get a
62. mposites field if you want  For example  this one   s not too slow for  region operations  641 MB s      gf_time 128 G 0 1048576 100  m COMPOSITE 2  m COMPOSITE 2  m COMPOSITE 2   m SPLIT 16 4  r ALTMAP    r ALTMAP    r ALTMAP    r ALTMAP               Please see section 7 8 1 for a discussion of polynomials in composite fields     777    CARRY FREE    and the Primitive Polynomial    If your machine supports the PCLMUL instruction  then we leverage that in    CARRY FREE     This implementation  first performs a carry free multiplication of two w bit numbers  which yields a 2w bit number  It does this with  one PCLMUL instruction  To reduce the 2w bit number back to a w bit number requires some manipulation of the  polynomial  As it turns out  if the polynomial has a lot of contiguous zeroes following its leftmost one  the number of  reduction steps may be minimized  For example  with w   32  we employ the polynomial 0x 100400007  because that    7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 29    is what other libraries employ  This only has 9 contiguous zeros following the one  which means that the reduction  takes four steps  If we instead use 0x1000000c5  which has 24 contiguous zeros  the reduction takes just two steps   You can see the difference in performance           gf_time 32 M 0 1048576 100  m CARRY_FREE   Speed   48 Mega ops s                gf time 32 M 0 1048576 100  m CARRY_FREE  p Oxc5     Speed   81 Mega ops s   This is relevant for w   16 and w   32  wher
63. n 7 9 for details     6 The Defaults    GF Complete implements a wide variety of techniques for multiplication  division and region multiplication  We have  set the defaults with three considerations in mind     1  Speed  Obviously  we want the implementations to be fast  Therefore  we choose the fastest implementations  that don   t violate the other considerations  The compilation environment is considered  For example  if SSE is    6 THE DEFAULTS 14    enabled  region multiplication in GF  2   employs a single multiplication table  If SSE is not enabled  then a     double    table is employed that performs table lookup two bytes at a time     2  Memory Consumption  We try to keep the memory footprint of GF Complete low  For example  the fastest  way to perform multiply w32   in GF  23   is to employ 1 75 MB of multiplication tables  see Section 7 4  below   We do not include this as a default  however  because we want to keep the default memory consumption  of GF Complete low     3  Compatibility with    standard    implementations  While there is no de facto standard of Galois Field arith   metic  most libraries implement the same fields  For that reason  we have not selected composite fields  alternate  polynomials or memory layouts for the defaults  even though these would be faster  Again  see section 7 7 for  more information     Table   shows the default methods used for each power of two word size  their alignment parameters s and t  their  memory consumption and th
64. ombination of arguments works correctly and efficiently on your  platform  If you plan to stray from the defaults  it is probably best to run both tools to ensure there are no issues with  your environment  gf_unit will run a set of unit tests based on the arguments provided to the tool  and gf_time will  time Galois Field methods based on the provided arguments    The usage of gf_unit is     gf_unit w tests seed  method     The usage of gf_time is        gf_time w tests seed buffer size iterations  method  params       The seed is an integer     negative one uses the current time  The tests are specified by a listing of characters  The  following tests are supported  all are supported by gf_time and    A        S    and    R    are supported by gf_unit         M     Single multiplications       D     Single divisions       I     Single inverses     G     Region multiplication of a buffer by a random constant       0     Region multiplication of a buffer by zero  does nothing and bzero          T     Region multiplication of a buffer by one  does memcpy   and XOR        2     Region multiplication of a buffer by two     sometimes this is faster than general multiplication      S     All three single tests       R     All four region tests       A     All seven tests     For example  suppose you want to explore using the    SPLIT    technique in GF  2        You run gf_methods and  grep for    w 32    SPLIT     It prints quite a few lines  and from them  you decide that you   
65. on  2 Files in the Library  3 Compilation  especially with regard to the SSE instructions  4 Some Tools and Examples to Get You Started  4 1  4 2  4 3  4 4  4 5  5 Important Information on Alignment when Multiplying Regions  6 The Defaults  6 1 Changing the Defaults           6 1 2 Changing the Polynomial       6 2  6 3 Testing with gf_unit and gftime       6 4 Calling gfinithard            6 5 SfS1ZE0  s 2a eee Soe eS  7 Further Information on Options and Algorithms  7 1  7 2  Ta General w i vie ca lee ee  7 4 Arguments to    SPLIT              7 5 Arguments to    GROUP             7 6 Considerations with    COMPOSITE     TT  7 8 More on Primitive Polynomials       7 9       ALTMAP    considerations and extract_word     2    00  a  7 9 1 Alternate mappings with    SPLIT          2    0 0 0 0    0 00002 eee eee  7 9 2 Alternate mappings with    COMPOSITE                    2 000000 00    7 9 3 The mapping of    CAUCHY       8 Thread Safety    o0 o     o    10  11  11    12    13    15  16  17  19  19  20  21  22  24    24  24  25  25  26  27  27  28  29  29    31  32  32  34  35    35    CONTENTS    9 Listing of Procedures  10 Troubleshooting    11 Timings  11 1 MultiplyO       11 2 Divide         11 3 Multiply_Region      1 INTRODUCTION 5    1 Introduction    Galois Field arithmetic forms the backbone of erasure coded storage systems  most famously the Reed Solomon  erasure code  A Galois Field is defined over w bit words and is termed GF  2      As such  the elements of a G
66. or example  gf_mult c calls create_gf_from_argv   by simply passing argc and argv from its main   declaration   and setting starting to 4    To choose defaults  argv starting  should equal          Otherwise  you specify the component that you are chang   ing with     m    for multiplication technique      d    for division technique      r    for region technique  and     p    for the  polynomial  You may change multiple components  You end your specification with a single dash  For example  the  following call multiplies 6 and 5 in GF  24  with polynomial 0x19 using the    SHIFT    technique for multiplication   we   ll explain these parameters later      UNIX gt    gf_mult 6 5 4  p 0x19  m SHIFT    7  UNIX gt     If create_gf_from_argv   fails  then you can call the procedure gf_error    which prints out the reason why cre   ate_gf_from_argv   failed     6 1 2 Changing the Polynomial    Galois Fields are typically implemented by representing numbers as polynomials with binary coefficients  and then  using the properties of polynomials to define addition and multiplication  You do not need to understand any of that to  use this library  However  if you want to learn more about polynomial representations and how they construct fields   please refer to The Paper    Multiplication is based on a special polynomial that we will refer to here as the    defining polynomial     This  polynomial has binary coefficients and is of degree w  You may change the polynomial with     p    
67. oth matter in terms of how    ALTMAP     performs multiply_region    Please see section 7 9 for  detailed explanation     Where are the erasure codes   This library only implements Galois Field arithmetic  which is an underlying  component for erasure coding  Jerasure will eventually be ported to this library  so that you can have fast erasure  coding     11 Timings    We don   t want to get too detailed with timing  because it is quite machine specific  However  here are the timings on  an Intel Core 17 3770 CPU running at 3 40 GHz  with 4 x 256 KB L2 caches and an 8MB L3 cache  All timings are  obtained with gf_time or gf_inline_time  in user mode with the machine dedicated solely to running these jobs     11 TIMINGS 40     m TABLE INLINE    TA     m BLE      m LOG      m SHIFT     CARRY PREE     M   _    0 100 200 300 400 500 600 700 800 900 1000   m TABLE INLINE     m TABLE     m LOG_ZERO_ EXT     m SPLIT 8 4     m LOG_ZERO     m LOG     m COMPOSITE 2       m SHIFT     m BYTWO p    m CARRY_FREE     8   m BYTWO _b   y     0 100 200 300 400 500 600 700 800 900 1000   m LOG INLINE       m LOG     m SPLIT 16 4     m LOG_ZERO     m COMPOSITE 2       m SPLIT 8 8     m CARRY_FREE  p 0x1002d     m GROUP 4 4     m CARRY_FREE     m SHIFT      M BYTWO      N 16    m BYTWO_b   wa    0 100 200 300 400 500 600 700 800 900 1000  Speed  Mega ops   s     Figure 4  Speed of doing single multiplications for w      4  8  16      11 1 Multiply     The performance of multiply   is displayed in
68. plements the    left to right comb    technique  LBOX12   I   m afraid we don   t like that name   so we call it    GROUP     because it performs table lookup on groups of bits for shifting  left  and reducing  right    It takes two additional arguments     gs  which is the number of bits you use while shifting  left  and g   which  is the number of bits you use while reducing  right   Increasing these arguments can you higher computational  speed  but requires more memory  SSE version exists only for w   128 and it requires SSE4  For more  description on the arguments gs and gr  see section 7 5  For a full description of    GROUP    algorithm  please  see The Paper        COMPOSITE     This allows you to perform operations on a composite Galois Field  GF   2     as described  in  GMS08    LBOX12  and The Paper  The field size w is equal to lk  It takes one argument  which is k  and  then a specification of the base field  Currently  the only value of k that is supported is two  However  that may  change in a future revision of the library     In order to specify the base field  put appropriate flags after specifying k  The single dash ends the base field   and after that  you may continue making specifications for the composite field  This process can be contin   ued for multiple layers of    COMPOSITE     As an example  the following multiplies 1000000 and 2000000  in GF  21       where the base field uses BYTWO_p for multiplication       gf mult 1000000 2000000 32  m COMPO
69. r  then that number will be a valid quotient  but it may be one of many  If the multiplication technique is     TABLE     then if a quotient exists  one is returned  Otherwise  zero is returned  Here are some examples     the  polynomial       1 is reducible  and therefore produces a ring  Below  we see that with this polynomal  1 6   6  and 14 6   6  Therefore     has two valid quotients  1 and 14  GF Complete returns 14 as the quotient     UNIX gt  gf_mult 1 6 4  p 0x1    6   UNIX gt  gf_mult 14 6 4  p 0x1    6   UNIX gt  gf_div 6 6 4  p 0x1      7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 30    14  UNIX gt     2  When    EUCLID    is employed for division  it uses the extended Euclidean algorithm for GCD to find a number   s  inverse  and then it multiplies by the inverse  The problem is that not all numbers in a ring have inverses  For  example  in the above ring  there is no number a such that 6a   1  Thus  6 has no inverse  This means that even  though    has quotients in this ring     EUCLID    will fail on it because it is unable to find the inverse of 6  It will  return 0        UNIX gt  gf_div 6 6 4  p 0x1  m TABLE  d EUCLID    0  UNIX gt              3  Inverses only work if a number has an inverse  Inverses may not be unique     4     LOG    will not work  In cases where the default would be    LOG        SHIFT    is used instead     Due to problems with division  gf_unit may fail on a reducible polynomial  If you are determined to use such a  polynomial  don 
70. re two techniques for division that may be set with     d     If     d    is not specified  then appropriate defaults  are employed  For example  when the multiplication technique is    TABLE     a table is created for division as well as  multiplication  When    LOG     is specified  the logarithm tables are used for division  With    COMPOSITE     a special  variant of Euclid   s algorithm is employed that performs division using multiplication and division in the base field   Otherwise  Euclid   s algorithm is used  Please see The Paper for a description of Euclid   s algorithm applied to Galois  Fields    If you use     d     you must also specify the multiplication technique with     m   To force Euclid   s algorithm instead of the defaults  you may specify it with     d EUCLID     If instead  you would  rather convert elements of a Galois Field to a binary matrix and find an element   s inverse by inverting the matrix   then specify     d MATRIX     In all of our tests     MATRIX    is slower than    EUCLID        MATRIX    is also not defined  for w  gt  32     6 1 5 Changing the Region Technique    The following are the region multiplication options      r         e    SSE     Use SSE instructions  Initialization will fail if the instructions aren   t supported  Table 2 details the  multiplication techniques which can leverage SSE instructions and which versions of SSE are required     Multiplication    multiply    multiply_region   Comments  Technique ea       TABLE 
71. should  simply create two gf_t   s  and use one for multiply   and the other for multiply_region    All of the implementations  may be used interchangably with the following exceptions     e    COMPOSITE    implements a different Galois Field   e If you change a field   s polynomial  then the resulting Galois Field will be different     e If you are using    ALTMAP    to multiply regions  then the contents of the resulting regions of memory will  depend on the multiplication technique  the size of the region and its alignment  Please see section 7 9 for a  detailed explanation of this     e If you are using    CAUCHY    to multiply regions  then like    ALTMAP     the contents of the result regions of  memory the multiplication technique and the size of the region  You don   t have to worry about alignment     7 3 General w    The library supports Galois Field arithmetic with 2  lt  w  lt  32  Values of w which are not whole number powers of  2 are handled by the functions in gf_wgen c   For these values of w  the available multiplication types are    SHIFT        7 FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 26       BYTWO_p        BYTWO_b        GROUP        TABLE    and    LOG        LOG    is only valid for w  lt  28 and    TABLE     is only valid for w  lt  15  The defaults for these values of w are    TABLE    for w  lt  8     LOG    for w  lt  16  and     BYTWO_p    for w  lt  32     7 4 Arguments to    SPLIT       The    SPLIT    technique is based on the distributi
72. te_gf_from_argv   instead of gf_init_hard    However  there are extra things that  you can do with gf_init_hard    Here   s the prototype     int gf_init_hard gf_t   gf     int w   int mult_type   int region_type     int divide_type   uint64_t prim_poly   int argl    int arg2    GFP base_gf    void xscratch_memory          The arguments mult_type  region_type and divide_type allow for the same specifications as above  except the  types are integer constants defined in gf_complete h           GF_MULT_SHIFT     typedef enum  GF_MULT_DEFAULT              6 THE DEFAULTS 23       GF_MULT_CARRY_FREE   GF_MULT_GROUP   GF_MULT_BYTWO_p   GF_MULT_BYTWO_b                                                                 GF_MULT_TABLE   GF_MULT_LOG_TABLE   GF_MULT_LOG_ZERO   GF_MULT_LOG_ZERO_EXT   GF_MULT_SPLIT_TABLE                             GF_MULT_COMPOSITE   gf_mult_type_t                                                           define GF_REGION_DEFAULT  0x0   define GF_REGION_DOUBLE_TABLE  0x1   define GF_REGION_QUAD_TABLE  0x2   define GF_REGION_LAZY  0x4   define GF_REGION_SSE  0x8   define GF_REGION_NOSSE  0x10   define GF_REGION_ALTMAP  0x20   define GF_REGION_CAUCHY  0x40   typedef enum   GF_DIVIDE_DEFAULT           GF_DIVIDE_MATRIX   GF_DIVIDE_EUCLID   gf_division_type_t                       You can mix the region types with bitwise or  The arguments to GF_MULT_GROUP  GF_MULT_SPLIT_TABLE  and GF_MULT_COMPOSITE are specified in arg1 and arg2  GF _MULT_COMPOSITE also takes a b
73. to worry about any of this     words will always be laid out contiguously in memory    When w   32  the middle region is a multiple of 64  and each word in the middle region is broken into bytes  each  of which is in a different 16 byte region  When w   64  the middle region is a multiple of 128  and each word is  stored in eight 16 byte regions  And finally  when w   128  the middle region is a multiple of 128  and each word is  stored in 16 16 byte regions     7 9 2 Alternate mappings with    COMPOSITE       With    COMPOSITE     the alternate mapping divides the middle region in half  The lower half of each word is stored  in the first half of the middle region  and the higher half is stored in the second half  To illustrate  gfexample_6  performs the same example as gf_example_5  except it is using    COMPOSITE    in GF   2        and it is multiplying  a region of 120 bytes rather than 60  As before  the pointers are not aligned on 16 bit quantities  so the region is broken  into three regions of 4 bytes  96 bytes  and 20 bytes  In the first and third region  each consecutive four byte word is a  word in GF  23    For example  word 0 is 0x562c640b  and word 25 is 0x46bc47e0  In the middle region  the low two  bytes of each word come from the first half  and the high two bytes come from the second half  For example  word 1  as reported by extract_word   is composed of the lower two bytes of word 1 of memory  0x07e5   and the lower two  bytes of word 13  Ox3fde   The produ
74. ve property of multiplication and addition   ax  b c    a b   ax c      This property allows us to  for example  split an eight bit word into two four bit components and calculate the product  by performing two table lookups in 16 element tables on each of the compoents  and adding the result  There is much  more information on    SPLIT    in The Paper  Here we describe the version of    SPLIT    implemented in GF Complete       SPLIT    takes two arguments  which are the number of bits in each component of a  which we call wa  and the  number of bits in each component of b  which we call wp  If the two differ  it does not matter which is bigger     the  library recognizes this and performs the correct implementation  The legal values of wa and wy fall into five categories     1  Wa is equal to w and wp is equal to four  In this case  b is broken up into 4 four bit words which are used  in 16 element lookup tables  The tables are created on demand in multiply_region   and the SSSE3 instruc   tion mm_shuffle_epi8   is leveraged to perform 16 lookups in parallel  Thus  these are very fast implementa   tions  When w  gt  16  you should combine this with    ALTMAP    to get the best performance  see The Paper  or  PGM13b  for explanation   If you do this please see section 7 9 for information about    ALTMAP    and    alignment     If you don   t use    ALTMAP       the implementations for w      16  32  64  convert the standard representation into     ALTMAP     perform the mult
    
Download Pdf Manuals
 
 
    
Related Search
    
Related Contents
Service Manual - Homer City Automation  GE EAMD Installation Guide  VPCF13MGX/B - Clearance Club  Grundig HIFI STEREO MICRO SYSTEM VERTIGA UMS 5100 User's Manual  Cirkuit Planet Pendrive 3D Natty Bob, 8GB  iVent101 - Medigroba  TVIP62000  MVI69-AFC User Manual    OM, KLIPPO, Brilliant, BrilliantSSelfstart, Cobra, CobraS, ProSCobra    Copyright © All rights reserved. 
   Failed to retrieve file