TABLE OF CONTENTS (CONTINUED)

    Neural Signal Data from a Cricket
    Practical Problems Related to the EM Algorithm
        Model Number Selection
        Initialization of the EM Algorithm
        Evaluation of the Algorithm
    Numeric Considerations in Matlab
        The Log EM Algorithm
        Using a Diagonal Covariance Matrix
    Implementation Result in Matlab
    Matlab Performance

4. PARALLEL DSP IMPLEMENTATION
    Introduction
    Hardware
        SHARC ADSP-21160M DSP Processor
        Bittware Hammerhead PCI board and VisualDSP
    Analysis of the EM Algorithm
    Parallel Transformation of the EM Algorithm
    Performance on the Parallel DSP Platform
    /* --- Moving Data --- */
    if (DMA == 1) {
        extmem_to_intmem_s2m((int *)(PR_BLOCK_1 + LoopCount * DSP_JUMP
                                     + P_BLOCK_LENGTH + CANDC_LENGTH),  /* Source buffer */
                             candc_start,                               /* Destination buffer */
                             CANDC_LENGTH,                              /* Buffer size */
                             10,                                        /* DMA channel */
                             TRUE);                                     /* Wait for DMA to complete */
    } else {
        for (i2 = 0; i2 < CANDC_LENGTH; i2++)
            candc_start[i2] = *((int *)(PR_BLOCK_1 + LoopCount * DSP_JUMP
                                        + P_BLOCK_LENGTH + CANDC_LENGTH) + i2);
    }
    /* --- Finish Moving Data --- */

    /********** Begin updating the covars **********/
    for (i = 0; i < nclusters; i++)
        vecvadd(logcovars_master[i], candcTemp[i], logcovars_master[i], ndim);
    for (i = 0; i < nclusters; i++)
        vecsmlt(logcovars_master[i], logpriors_master[0][i], covars[i], ndim);

    /* Move the updated covars back out to the per-slave blocks */
    if (DMA == 1) {
        intmem_to_extmem_m2s((int *)covars,                                   /* Source buffer */
                             (int *)(P_BLOCK_1 + P_BLOCK_LENGTH + CANDC_LENGTH),
                             CANDC_LENGTH,                                    /* Buffer size */
                             10,                                              /* DMA channel */
                             TRUE);                                           /* Wait for DMA to complete */
        intmem_to_extmem_m2s((int *)covars,
                             (int *)(P_BLOCK_1 + P_BLOCK_LENGTH + CANDC_LENGTH + DSP_JUMP),
                             CANDC_LENGTH, 10, TRUE);
        intmem_to_extmem_m2s((int *)covars,
                             (int *)(P_BLOCK_1 + P_BLOCK_LENGTH + CANDC_LENGTH + 2 * DSP_JUMP),
                             CANDC_LENGTH, 10, TRUE);
    } else {
        for (i2 = 0; i2 < CANDC_LENGTH; i2++)
TABLE OF CONTENTS (CONTINUED)

5. FPGA IMPLEMENTATION
    Introduction
    The FPGA and Its Structure
        Field Programmable Gate Arrays
        Structure of Xilinx's Virtex II
    AccelFPGA
    The Fixed Point Representation of the EM Algorithm
    Using Lookup Tables
    Parallelism Needed for FPGA Implementation
    Synthesis and Simulation
    FPGA Result Evaluation
    FPGA Performance

6. CONCLUSION AND FUTURE WORK
    Conclusion
    Future Work

REFERENCES

APPENDICES
    APPENDIX A: SIMPLIFYING THE EXPRESSION OF Q(\Theta, \Theta^g)
    APPENDIX B: FLOATING POINT MATLAB CODE
    APPENDIX C: C CODE ON THE PARALLEL DSP PLATFORM
    APPENDIX D: FIXED POINT MATLAB CODE USING ACCELFPGA

LIST OF FIGURES

FIGURE

1.1 This extracellular recording waveform shows different action potentials from an unknown number of local neurons
The form of Q(\Theta, \Theta^g) can still be greatly simplified. We note that, for l = 1, ..., M,

    \sum_{y_1=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) = p(l \mid x_i, \Theta^g),

since \sum_{j=1}^{M} p(j \mid x_i, \Theta^g) = 1. We can therefore write Q(\Theta, \Theta^g) as

    Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big( \alpha_l \, p_l(x_i \mid \theta_l) \big) \, p(l \mid x_i, \Theta^g)
                        = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l) \, p(l \mid x_i, \Theta^g)
                          + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big( p_l(x_i \mid \theta_l) \big) \, p(l \mid x_i, \Theta^g).

This is the same equation as Eq. (2.16). As shown in Chapter 2, from this equation we can use a Lagrange multiplier and other methods to get Eqs. (2.20), (2.21) and (2.22), which together form the basis of the EM algorithm for iteratively estimating the parameters of a Gaussian Mixture Model.

APPENDIX B

FLOATING POINT MATLAB CODE FOR THE EM ALGORITHM

diaghalfLogEmGmm.m:

    function [StoreLike, LoopCount, OldMgmm] = diaghalfLogEmGmm(ncluster, data, OldMgmm)

    min_std = 0.0001;
    [ndatadim, ndata] = size(data);

    % Initialize the covariance to be diagonal
    OldMgmm.covars = ones(ncluster, ndatadim);

    PrevLike = -100;
    StopCri = 0.0001;
    MaxLoop = 100;
    LoopCount = 0;
    StopOrNot = 100;
    StoreLike = zeros(1, MaxLoop);

    range_max = 0;
    range_min = 1000;
    range_per = 1000;
    track = zeros(3, 1000);

    % The main EM Loop
    while (St...
[Table 5.2, continued]

    Device      System Gates    Slices    Multiplier Blocks    SelectRAM Blocks    DCMs    Max I/O Pads
    XC2V3000    3M              14336     96                   96                  8       720

AccelFPGA

FPGA programming can be done using a Hardware Description Language (HDL) such as VHDL or Verilog. These HDLs are low level languages which give very fine control over the actual hardware implementation. However, for complex applications the HDL coding process is tedious and error prone, and requires costly debugging iterations. Furthermore, a lot of the low level work is repetitive and can be automated. To solve this problem, many researchers have focused on how to raise the level of abstraction to a general purpose programming language such as C/C++ [32] or Java [33]. Recently, a compiler has been developed that can take Matlab code and generate optimized hardware for an FPGA. The AccelFPGA Compiler V1.6 was used for this thesis.

Figure 5.3 shows the basic steps in the process of converting an algorithm in MATLAB into an FPGA implementation using AccelFPGA. The first step is to convert the floating point model of the algorithm into a fixed point model that meets the algorithm's signal fidelity criterion. Next, the user adds compiler directives for AccelFPGA. These directives can specify the usage of the FPGA's embedded hardware resources, such as the BlockRAM and the embedded multiplier blocks. The third step is the creation of the RTL model in VHDL or Verilog, with the associated testbenches automatically created by the AccelFPGA compiler. The
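To make the second step concrete, here is a minimal sketch of fixed point Matlab annotated with AccelFPGA directives. The directive names (SHAPE, MEM_MAP, USE embedmults) are the ones that appear in the Appendix D code; the variables x, y and coef are hypothetical examples, not taken from the thesis.

    % Hedged sketch: a fixed point loop with AccelFPGA directives.
    % Directive names follow Appendix D; x, y, coef are made up.
    q1 = quantizer('floor', 'wrap', [18 13]);   % 18-bit word, 13 fraction bits
    coef = 0.5;                                 % hypothetical constant coefficient
    x = rand(1, 160);                           % hypothetical input vector
    y = zeros(1, 160);
    %!ACCEL SHAPE x(160)
    %!ACCEL MEM_MAP x to ram_s18_s18(1) at 0
    %!ACCEL USE embedmults(1)
    for k = 1:160
        y(k) = quantize(q1, x(k) * coef);       % maps to an embedded 18x18 multiplier
    end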
LIST OF FIGURES (CONTINUED)

5.1 Virtex II architecture overview [31]
5.2 Slice structure of the Virtex II FPGA [31]
5.3 Top-down design process using AccelFPGA
5.4 Histogram of the input and output data range for the LUT table: (a) the inputs, (b) the outputs
5.5 Input and output curve of the exponential operation
5.6 Diagram of implementing the LUT in block memory
5.7 Diagram of the EM algorithm implementation in the FPGA
5.8 Floorplan of the EM algorithm implementation on a Virtex II FPGA
5.9 The mean vectors of the GMMs: (a) is the floating point version, (b) is the output of the FPGA implementation
5.10 The clustering results of both floating point and fixed point implementations

LIST OF TABLES

TABLE

3.1 Performance of the spike sorting system on several ...
4.1 List of semaphores and their functions
4.2 Execution time for 8 iterations of the EM algorithm on single and 4-DSP systems
4.3 Performance of the EM algorithm on DSP system and ...
Figure 1.1 This extracellular recording waveform shows different action potentials from an unknown number of local neurons. The data were recorded from an adult female cricket's cercal sensory system. (Data courtesy of Alex [7].)

Spike sorting provides an alternative to physical isolation for multi-neuron recording. In this approach, no effort is made to isolate a single cell; rather, the spikes due to several cells are recorded simultaneously and sorted into groups according to their waveforms. Each group is presumed to represent a single cell, since waveforms change as a function of position relative to the electrode: the closer an electrode is to a particular neuron, the greater the amplitude of its signal will be compared with the other waveforms.

Spike Sorting System

A typical system that measures and analyses extracellular neural signals is shown in Figure 1.2. There are four stages between the electrode and the identification of spike trains from individual neurons.

Figure 1.2 A typical neural signal recording, detecting and spike sorting system (stages include a detector, a feature extractor and a classifier).

In the first stage, neural signals are picked up by extracellular electrodes, amplified and then filtered. Low pass filters are used to smooth the spike waveforms and provide an anti-aliasing filter before sampling. The frequency content of the waveform is typically less than 10 kHz. Oth
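A minimal sketch of this first-stage filtering, assuming a hypothetical 40 kHz sampling rate; the filter design is illustrative, not taken from the thesis, and requires Matlab's Signal Processing Toolbox:

    % Hedged sketch of anti-aliasing before sampling; fs, the filter order
    % and x_raw are assumptions, not values from the thesis.
    fs = 40000;                           % hypothetical sampling rate (Hz)
    x_raw = randn(1, fs);                 % hypothetical one-second raw trace
    [b, a] = butter(4, 10000/(fs/2));     % 4th-order low pass, 10 kHz cutoff
    x_filt = filtfilt(b, a, x_raw);       % zero-phase smoothing of the trace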
        /* Also calculate the likelihood of each data point */
        for (j = 0; j < nclusters; j++) {
            logproXiLTemp = logproXiL_master[j][i] - maxTemp;
            logproXiLTemp = expf(logproXiLTemp);
            logproXiLTemp = logproXiLTemp * priors[0][j];
            loglhXi_master[0][i] = loglhXi_master[0][i] + logproXiLTemp;
        }
        loglhXi_master[0][i] = logf(loglhXi_master[0][i]) + maxTemp;
    }

    lhdata_master = vsum_f_D(loglhXi_master[0], 1, nblockdata) / ndata;  /* Likelihood of data in this block */
    lhdatanew = lhdata_master;

    /********** Calculate the likelihood of the whole data set **********/

    /* --- End and Wait and Start --- */
    while ((GetIOP(SEM_LOCAL_1) & GetIOP(SEM_LOCAL_2) & GetIOP(SEM_LOCAL_3)) != 1) {
    }

    for (i = 0; i < nprocessors - 1; i++)
        lhdatanew = lhdatanew + *((float *)(LH_BLOCK_1 + i * DSP_JUMP));

    thres = lhdatanew - lhdataold;
    thres = fabsf(thres);
    if (thres < setThres)
        break;
    lhdataold = lhdatanew;
    loopCountAll++;

    /* --- End and Wait and Start --- */
    SetIOP(SEM_GLOBAL, 1);
    while ((GetIOP(SEM_LOCAL_1) | GetIOP(SEM_LOCAL_2) | GetIOP(SEM_LOCAL_3)) != 0) {
    }
    SetIOP(SEM_GLOBAL, 0);

    /********** Calculate probability of being in each cluster **********/
    /********** Update the priors **********/
    for (i = 0; i < nclusters; i++) {
        logpriors_block_1[0][i] = vsum_sf_D(logproLXi_block_1[i], 1, nblockdata);
        logpriors_block_1[0][i] = logpriors_block_1[0][i] / ndata;
    }

    /********** Update the centers **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ndim; j++)
            blocksums[j] = 0;
        block_sum(blocksums, blockdata, logproLXi_block_1[i], offset);
        vecvadd(logcenters_block_1[i], blocksums, logcenters_block_1[i], ndim);
    }

    /********** Transfer priors and centers; get new priors and centers back **********/

    /* --- End and Wait and Start --- */
    SetIOP(SEM_LOCAL, 1);
    while (GetIOP(SEM_GLOBAL) == STOP) {
    }
    SetIOP(SEM_LOCAL, 0);

    /********** Update the covars **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < nblockdata; j++) {
            vecvsub(blockdata[j], centers[i], blockdata_diff[j], ndim);
            vecvmlt(blockdata_diff[j], blockdata_diff[j], blockdata_diff[j], ndim);
        }
        for (j = 0; j < ndim; j++)
            blocksums[j] = 0;
        block_sum(blocksums, blockdata_diff, logproLXi_block_1[i], offset);
Expectation Maximization for the Gaussian Mixture Model

In the spike sorting case, the GMM consists of M classes; each class represents a neuron that has been recorded. Let X = {x_1, x_2, ..., x_N} be the sequence of observation vectors, i.e. the measured waveforms. In this case, we assume the following probabilistic model:

    p(x_i \mid \Theta) = \sum_{l=1}^{M} \alpha_l \, p_l(x_i \mid \theta_l)    Eq. (2.11)

where the parameters are \Theta = (\alpha_1, ..., \alpha_M, \theta_1, ..., \theta_M), such that \sum_{l=1}^{M} \alpha_l = 1, and each p_l is a single Gaussian density function. Then the log-likelihood expression for this density from the data set X = {x_1, x_2, ..., x_N} is given by:

    \log L(\Theta \mid X) = \log \prod_{i=1}^{N} p(x_i \mid \Theta) = \sum_{i=1}^{N} \log\Big( \sum_{l=1}^{M} \alpha_l \, p_l(x_i \mid \theta_l) \Big)    Eq. (2.12)

which is difficult to optimize because it contains the log of a sum. Now we introduce a new set of data Y. Y is a sequence of data Y = {y_1, y_2, ..., y_N} whose values inform us which neuron generated the observation vector x_i. This variable Y is considered to be a hidden variable. If we assume that we know the values of Y, the likelihood becomes

    \log L(\Theta \mid X, Y) = \sum_{i=1}^{N} \log\big( P(x_i \mid y_i) \, P(y_i) \big) = \sum_{i=1}^{N} \log\big( \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \big)    Eq. (2.13)

Since the values of Y are unknown to us, we don't actually know which neuron generated each spike. However, if we assume Y is a random vector which follows a known distribution, we can still proceed.

The problem now is to find the distribution of the hidden variable Y. First we guess that parameters \Theta^g are
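Before the hidden-variable machinery, a direct (numerically naive) evaluation of Eqs. (2.11)-(2.12) can be sketched in Matlab; all names here are hypothetical rather than taken from the appendix code, and the covariances are taken to be diagonal as in the rest of the thesis:

    % Hedged sketch: direct evaluation of Eqs. (2.11)-(2.12), diagonal GMM.
    % X, alpha, mu, sigma2 are hypothetical names.
    function ll = gmm_loglike(X, alpha, mu, sigma2)
    % X: D-by-N data; alpha: 1-by-M priors; mu, sigma2: D-by-M means/variances
    [D, N] = size(X);
    M = numel(alpha);
    ll = 0;
    for i = 1:N
        pxi = 0;
        for l = 1:M
            d = X(:, i) - mu(:, l);
            % p_l(x_i | theta_l): single Gaussian density, diagonal covariance
            pl = exp(-0.5 * sum(d.^2 ./ sigma2(:, l))) ...
                 / sqrt((2*pi)^D * prod(sigma2(:, l)));
            pxi = pxi + alpha(l) * pl;       % Eq. (2.11)
        end
        ll = ll + log(pxi);                  % Eq. (2.12): log of a sum
    end
    end

As written, this underflows for high dimensional inputs, which is exactly the problem the log version of the algorithm in Chapter 3 addresses.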
#define SEM_DMA_READ (int *)MSGR4   /* in EXTERNAL_BASE */

/* --- GLOBAL DECLARATIONS --- */
segment("seg_global_flag") int global_flag = 0;
segment("seg_local_flag")  int local_flag  = 0;
float priorsTemp[1][nclusters] = {0};
float candcTemp[nclusters][ndim] = {0};
segment("seg_data") float data[ndata][ndim] = {
#include "rescal_720_data_directload.dat"
};  /* Data to be transmitted to external memory */
float blockdata[nblockdata][ndim] = {0};
segment("seg_pmda") float blockdata_diff[nblockdata][ndim] = {0};
float centers[nclusters][ndim] = {
#include "rescal_720_centers_directload.dat"
};
float priors[1][nclusters] = {0};
float covars[nclusters][ndim] = {0};
float logproXiL_master[nclusters][nblockdata] = {0};
float logproLXi_master[nclusters][nblockdata] = {0};
float blocksums[ndim] = {0};
float loglhXi_master[1][nblockdata] = {0};
float logpriors_master[1][nclusters] = {0};
float logcenters_master[nclusters][ndim] = {0};
float logcovars_master[nclusters][ndim] = {0};

void main()
{
    int i = 0, j = 0, i2 = 0, j2 = 0;
    int loopCount = 0;
    int loopCountAll = 0;
    float testlog = 0;
    float testlog2;
    float cycle_count = 0;
    int ntasks = 0;
    int offset = 0;
    int flagsum = 0;
    int offsetTemp = 0;
    int data_offset = 0;
    int LoopCount = 0;
    int *data_start = 0;
    int *priors_start = 0;
    int *candc_start = 0;
    int *global = 0;
    int *blockdata_slave[3] = {0};
    \frac{\partial}{\partial \alpha_l} \Big[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l) \, p(l \mid x_i, \Theta^g) \Big] = 0    Eq. (2.17)

subject to the constraint

    \sum_{l=1}^{M} \alpha_l = 1    Eq. (2.18)

We introduce the Lagrange multiplier \lambda and solve the following equation:

    \frac{\partial}{\partial \alpha_l} \Big[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l) \, p(l \mid x_i, \Theta^g) + \lambda \Big( \sum_{l=1}^{M} \alpha_l - 1 \Big) \Big] = 0    Eq. (2.19)

and we get the expression

    \alpha_l = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)    Eq. (2.20)

To get the expressions for \mu_l and \Sigma_l, we have to go through a similar procedure, but the computation is much more complicated. To calculate these two parameters of the Gaussian distribution function, we'll need to take derivatives of a function of matrices. The detailed procedure can be found in [25], and here we just give the results:

    \mu_l = \frac{\sum_{i=1}^{N} x_i \, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}    Eq. (2.21)

    \Sigma_l = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g) \, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}    Eq. (2.22)

Collectively, Eqs. (2.20), (2.21) and (2.22) form the basis of the EM algorithm for iteratively estimating the parameters of a GMM. The detailed steps to implement these equations will be discussed in the next section.

Implementation of the EM algorithm

As shown in Figure 2.5, the procedure to implement the iterative EM algorithm consists of the following steps (a sketch of the resulting update step follows the list):

1. Initialization: Initialize the model parameters \Theta(0).
2. E-step: Calculate the expected likelihood of the given data based on the current parameters \Theta(i). Calculate the increase in likelihood; if the difference is below a certain threshold, stop.
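A compact sketch of how Eqs. (2.20)-(2.22) turn into the update step, restricted to diagonal covariances as in the thesis code; the function and variable names (em_step, X, r, alpha, mu, sigma2) are hypothetical:

    % Hedged sketch of the M-step, Eqs. (2.20)-(2.22), diagonal covariance.
    % r(l,i) = p(l | x_i, Theta^g) is the responsibility from the E-step.
    function [alpha, mu, sigma2] = em_step(X, r)
    [D, N] = size(X);                  % X: D-by-N observation vectors
    M = size(r, 1);                    % M clusters
    alpha = sum(r, 2)' / N;            % Eq. (2.20): new priors
    mu = zeros(D, M);
    sigma2 = zeros(D, M);
    for l = 1:M
        w = sum(r(l, :));                          % total responsibility of cluster l
        mu(:, l) = (X * r(l, :)') / w;             % Eq. (2.21): new mean
        d = X - repmat(mu(:, l), 1, N);            % deviations from the new mean
        sigma2(:, l) = ((d.^2) * r(l, :)') / w;    % Eq. (2.22), diagonal entries only
    end
    end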
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < nblockdata; j++) {
            logproLXi_master[i][j] = logproXiL_master[i][j] + logf(priors[0][i])
                                     - loglhXi_master[0][j];
            logproLXi_master[i][j] = expf(logproLXi_master[i][j]);  /* Use float instead of log */
        }
    }

    /********** Update the priors **********/
    for (i = 0; i < nclusters; i++) {
        logpriors_master[0][i] = vsum_sf_D(logproLXi_master[i], 1, nblockdata);
        priors[0][i] = logpriors_master[0][i] / ndata;
    }

    /********** Update the centers **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ndim; j++)
            blocksums[j] = 0;
        block_sum(blocksums, blockdata, logproLXi_master[i], offset);
        vecvadd(logcenters_master[i], blocksums, logcenters_master[i], ndim);
    }

    /********** Update priors and centers **********/
    while ((GetIOP(SEM_LOCAL_1) & GetIOP(SEM_LOCAL_2) & GetIOP(SEM_LOCAL_3)) != 1) {
    }
    if (DMA == 1) {
        extmem_to_intmem_s2m((int *)(PR_BLOCK_1 + LoopCount * DSP_JUMP),  /* Source buffer */
                             priors_start,                                /* Destination buffer */
                             P_BLOCK_LENGTH + CANDC_LENGTH,               /* Buffer size */
                             10,                                          /* DMA channel */
                             TRUE);                                       /* Wait for DMA to complete */
    } else {
        for (i2 = 0; i2 < P_BLOCK_LENGTH + CANDC_LENGTH; i2++)
Figure 5.2 Slice structure of the Virtex II FPGA [31].

The Virtex II devices incorporate large amounts of 18 Kbit block memories, called Block SelectRAM or BlockRAM. Each Virtex II BlockRAM is an 18 Kbit true dual-port RAM with two independently controlled synchronous ports that access a common storage area. The BlockRAM supports various configurations, including single- and dual-port RAM and various data/address aspect ratios. Supported memory configurations are shown in Table 5.1.

Table 5.1 Supported configurations of the block SelectRAM

    16K x 1 bit     2K x 9 bits
    8K x 2 bits     1K x 18 bits
    4K x 4 bits     512 x 36 bits

The Virtex II devices also incorporate a number of embedded multiplier blocks. These multiplier blocks are 18-bit by 18-bit 2's complement signed multipliers. Some of the interconnection to the switch matrix is shared between the BlockRAM and the embedded multiplier. This sharing of the interconnection is optimized for an 18-bit-wide BlockRAM feeding the multiplier, and allows a fast implementation of a multiplier-accumulator (MAC) function, which is commonly used in DSP applications.

The Virtex II family is comprised of 11 members, ranging from 40K to 8M system gates. The FPGA we used for the thesis was the XC2V3000, which features 3M system gates. Table 5.2 shows the specifications of the XC2V3000 FPGA.

Table 5.2 The specifications of the Virtex II XC2V3000 FPGA

    Device      System Gates    Slices    Multiplier Blocks    SelectRAM Blocks    DCMs    Max I/O Pads
MONTANA STATE UNIVERSITY

Using FPGAs to accelerate the training process of a Gaussian mixture model based spike sorting system

by Yongming Zhu

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

Montana State University

© Copyright by Yongming Zhu (2003)

Abstract

The detection and classification of neural spike activity is an indispensable step in the analysis of extracellular signal recordings. We introduce a spike sorting system based on the Gaussian Mixture Model (GMM) and show that it can be implemented in Field Programmable Gate Arrays (FPGA). The Expectation Maximization (EM) algorithm is used to estimate the parameters of the GMM. In order to handle the high dimensional inputs in our application, a log version of the EM algorithm was implemented. Since the training process of the EM algorithm is very computationally intensive and runs slowly on Personal Computers (PC) and even on parallel DSPs, we mapped the EM algorithm to a Field Programmable Gate Array (FPGA). It trained the models without a significant degradation of the model integrity when using 18-bit and 30-bit fixed point data. The FPGA implementation using a Xilinx Virtex II 3000 was 16.4 times faster than a 3.2 GHz Pentium 4. It was 42.5 times faster than a parallel floating point DSP implementation using 4 SHARC 21160 DSPs.

USING FPGAS TO ACCELERATE THE TRAINING PROCESS OF A GAUSSIAN MIXTURE MODEL BASED SPIKE SORTING SYSTEM
Xilinx Virtex II FPGA and compares the performance of the FPGA with the previous implementations. Finally, Chapter 6 summarizes the work of this thesis and suggests possible future research directions.

CHAPTER 2

THE GAUSSIAN MIXTURE MODEL AND THE EM ALGORITHM

Introduction

This chapter introduces the Gaussian Mixture Model (GMM) and a maximum likelihood parameter estimation procedure, the Expectation Maximization (EM) algorithm, which is used to estimate the GMM parameters. The discussion in this chapter will focus on the motivation, advantages and limitations of using this statistical approach. Several considerations for implementing the EM algorithm on actual hardware will be discussed at the end of this chapter.

The chapter is organized as follows. Section 2.2 serves to describe the form of the GMM and the motivation for using the GMM in clustering the action potentials from different neurons. In Section 2.2, first a description of the Gaussian Mixture Model is presented; then several assumptions and limitations of using the GMM in spike sorting are discussed. Section 2.3 introduces the clustering of the action potentials based on Bayesian decision boundaries. Finally, the EM algorithm and the derivation of the GMM parameter estimation procedure are presented.

Mixture Model for Spike Sorting

The Gaussian Mixture Model (GMM)

A Gaussian mixture density is a weighted sum of a number of component densities. The Gaussian Mixture Model has the ability
        *((int *)(GLOBAL_P + P_BLOCK_LENGTH + CANDC_LENGTH) + i2) = *((int *)covars + i2);
    }
    /* --- Finish Moving Data --- */

    stop_count = timer_off();
    cycle_count = ((float)start_count - (float)stop_count) / 80000000;  /* divide by 80 MHz */
}

For the Slave DSPs: Slave.c

#include <stdio.h>
#include <matrix.h>
#include <math.h>
#include <stdlib.h>
#include <vector.h>
#include <21160.h>
#include <def21160.h>
#include "speeddsp.h"
#include "diag_DMA_slave_2.h"

#define DMA 1               /* Use DMA channel for external memory data transactions */
#define processorID 3
#define START 1
#define STOP 0
#define MASTER_BASE 0x0100000
#define MULTI_BASE 0x0700000
#define EXTERNAL_BASE 0x0800000
#define SEM_GLOBAL MSGR0
#define SEM_LOCAL MSGR1
#define SEM_DMA_WRITE (int *)MSGR4   /* in EXTERNAL_BASE */

segment("seg_local_flag")  int local_flag  = 0;
segment("seg_global_flag") int global_flag = 0;
segment("seg_lh_block")      float lhdata_block_1 = 0;
segment("seg_prior_block")   float logpriors_block_1[1][nclusters] = {0};
segment("seg_centers_block") float logcenters_block_1[nclusters][ndim] = {0};
segment("seg_covars_block")  float logcovars_block_1[nclusters][ndim] = {0};
segment("seg_p_block") float centers[nclusters][ndim] = {0};
segment("seg_p_block") float covars[nclusters][ndim] = {0};
segment("seg_p_block") fl
multivariate Gaussian model of the spike waveforms generated from a particular neuron. The solid line shows the mean vector of this model, while the dashed line shows the three-sigma boundary according to the covariance matrix. All the waveforms in this model are shown in the background.

Furthermore, if we assume that action potentials from different neurons are independent and identically distributed [16], the observation vector distribution can be given by a Gaussian Mixture Model (Eq. 2.1). Thus, the GMM results directly from a probabilistic modeling of the underlying spike generating process.

Spike sorting using the GMM

In the last section, we described how to use the GMM to model the spike generating process. From the perspective of spike sorting, the goal is to estimate the parameters of the GMM and classify each action potential with respect to the neuron class in which the action potential was generated. To do this, the first step is to estimate all the parameters, including all the prior probabilities \alpha_l together with the means and covariances of each Gaussian component. Maximum likelihood (ML) parameter estimation is the method we will use to find these parameters from a set of training data. It is based on finding the model parameters that are most likely to produce the set of observed samples. The maximum likelihood GMM parameter estimates can be found via an iterative parameter estimation procedure, namely the EM algorithm.
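In symbols (standard ML notation rather than the thesis's own), the estimate is the parameter set that maximizes the log likelihood of Eq. (2.12):

    \hat{\Theta}_{ML} = \arg\max_{\Theta} \log L(\Theta \mid X) = \arg\max_{\Theta} \sum_{i=1}^{N} \log p(x_i \mid \Theta)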
            for m_9 = quantize(q_int, 1):ncluster
                centers(d_3, m_9) = quantize(q_data, mid_centers(d_3, m_9) * mid_priors_inv(m_9));
                mid_centers(d_3, m_9) = quantize(q_mid_centers, const_zero);
            end
        end

        %%%% Function to get the log of priors %%%%
        % >>>>>>>>>>> priors_log = fun_lut_priors_log(priors) <<<<<<<<<<<
        %!ACCEL SHAPE mid_priors_log_addr(1)
        %!ACCEL SHAPE priors_log(5)
        for m_10 = quantize(q_int, 1):ncluster
            mid_priors_log_addr = accel_bitshl(priors(m_10), 11, q1, q_addr);
            priors_log(m_10) = quantize(q_lut_log, lut_log_content(mid_priors_log_addr + 1));
        end

        loop_flag = quantize(q_int, loop_flag + 1);
        id = quantize(q_int, 1);
        LogLikeHoodX = 0;
    end

elseif loop_flag == 2

    %%%% Function to get the new covars %%%%
    % mid_covars = fun_accum_cov(indatabuf, mid_covars, centers, ProLXi, id)
    %!ACCEL SHAPE sub_cov_1(ncluster)
    %!ACCEL SHAPE sub_cov_2(ncluster)
    %!ACCEL SHAPE square_cov(ncluster)
    %!ACCEL SHAPE mult_cov(ncluster)
    %!ACCEL SHAPE mid_covars_temp(ncluster)
    %!ACCEL REGISTER sub_cov_1
    %!ACCEL REGISTER sub_cov_2
    %!ACCEL REGISTER square_cov
    %!ACCEL REGISTER mult_cov
    %!ACCEL REGISTER mid_covars_temp
    for d_4 = quantize(q_int, 1):ndim
        %!ACCEL TILE m_i_3
        %!ACCEL USE embedmults(30) m_i_3, embedmults(35) m_i_3
        for m_i_3 = quantize(q_int, 1):ncluster
            sub_cov_1(m_i_3) = quantize(q_small, indatabuf(d_4, m_i_3) - centers(d_4, m_i_3));
            sub_cov_2(m_i_3) = quantize(q_small, indatabuf(d_4, m_i_3) - centers(d_4, m_i_3));
                maxTemp = logproXiL_block_1[0][i];
            if (logproXiL_block_1[j][i] > maxTemp)
                maxTemp = logproXiL_block_1[j][i];
        }

        /* Normalize the probabilities to (0, 1];
           also calculate the likelihood of each data point */
        for (j = 0; j < nclusters; j++) {
            logproXiLTemp = logproXiL_block_1[j][i] - maxTemp;
            logproXiLTemp = expf(logproXiLTemp);
            logproXiLTemp = logproXiLTemp * priors[0][j];
            loglhXi_block_1[0][i] = loglhXi_block_1[0][i] + logproXiLTemp;
        }
        loglhXi_block_1[0][i] = logf(loglhXi_block_1[0][i]) + maxTemp;
    }

    lhdata_block_1 = vsum_f_D(loglhXi_block_1[0], 1, nblockdata) / ndata;  /* Likelihood of data in this block */

    /********** Transfer the likelihood of this block;
                if necessary, continue the calculation **********/

    /* --- End and Wait and Start --- */
    SetIOP(SEM_LOCAL, 1);
    while (GetIOP(SEM_GLOBAL) == STOP) {
    }
    SetIOP(SEM_LOCAL, 0);

    /********** Calculate probability of being in each cluster **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < nblockdata; j++) {
            logproLXi_block_1[i][j] = logproXiL_block_1[i][j] + logf(priors[0][i])
                                      - loglhXi_block_1[0][j];
            logproLXi_block_1[i][j] = expf(logproLXi_block_1[i][j]);  /* Use float instead of log */
        }
    }
        indatabuf(d_0, 4) = indatabuf_4(d_0);
        indatabuf(d_0, 5) = indatabuf_5(d_0);
    end

    %!ACCEL SYNTHESIS OFF
    indatabuf(:, 1) = indatabuf_1;
    indatabuf(:, 2) = indatabuf_2;
    indatabuf(:, 3) = indatabuf_3;
    indatabuf(:, 4) = indatabuf_4;
    indatabuf(:, 5) = indatabuf_5;
    %!ACCEL SYNTHESIS ON

    %%%% Memory map for the look-up tables %%%%
    %!ACCEL MEM_MAP lut_exp_content to ram_s18_s18(56) at 0
    %!ACCEL MEM_MAP lut_log_content to ram_s18_s18(58) at 0
    %!ACCEL MEM_MAP lut_inv_content to ram_s18_s18(57) at 0
    %!ACCEL MEM_MAP indatabuf_1 to ram_s18_s18(1) at 0
    %!ACCEL MEM_MAP indatabuf_2 to ram_s18_s18(2) at 0
    %!ACCEL MEM_MAP indatabuf_3 to ram_s18_s18(3) at 0
    %!ACCEL MEM_MAP indatabuf_4 to ram_s18_s18(4) at 0
    %!ACCEL MEM_MAP indatabuf_5 to ram_s18_s18(5) at 0

    % This part tricks AccelChip into initializing the block memories
    %!ACCEL MEM_MAP centers_1 to ram_s18_s18(16) at 0
    %!ACCEL MEM_MAP centers_2 to ram_s18_s18(17) at 0
    %!ACCEL MEM_MAP centers_3 to ram_s18_s18(18) at 0
    %!ACCEL MEM_MAP centers_4 to ram_s18_s18(19) at 0
    %!ACCEL MEM_MAP centers_5 to ram_s18_s18(20) at 0

    if loop_flag == 1

        %%%% Function to calculate the log version probability in each cluster %%%%
        % >>>> LogProXiL = fun_logproxil(indatabuf, covars_inv, covars_det, centers) <<<<
        %!ACCEL SHAPE acc_1(1)
        %!ACCEL SHAPE acc_2(1)
        %!ACCEL SHAPE acc_3(1)
Figure 5.7 Diagram of the EM algorithm implementation in the FPGA.

Synthesis and Simulation

After generating the VHDL code from AccelFPGA, the next step was to simulate and synthesize the VHDL modules. The RTL VHDL code could be simulated in simulation tools such as Modelsim, using the testbenches generated by AccelFPGA. It could also be put into synthesis tools like ISE from Xilinx, to generate the bit stream to be downloaded into the FPGAs. The toolchain used can be seen in Table 5.5.

Table 5.5 The toolchain used in the FPGA implementation

    Toolchain    Version
    Matlab       V6.5 Release 13
    AccelFPGA    V1.6
    ISE          V5.2.02i
    Modelsim     Starter 5.6e

First, an RTL level simulation was made to verify that the VHDL code was working correctly. Then a post-synthesis model was generated by the ISE synthesis tool from Xilinx. This post-synthesis model was simulated again in Modelsim, and the simulation results were the same as the output of the fixed point Matlab implementation. This showed that the FPGA implementation of the EM algorithm was working correctly.

The system performance of the FPGA implementation can be seen in the synthesis report from the ISE tool. The highest clock rate of this EM system on the Xilinx Virtex II 3000 platform is 62 MHz (Table 5.7). The hardware resources used are shown in Table 5.6, and the actual floorplan inside the FPGA is shown in Figure 5.8.

Table 5.6 Device utilization summary of the Xilinx Virtex II FPGA
A. R. Martin & B. G. Wallace. 1992. From Neuron to Brain. Sinauer Associates, Inc., Sunderland, Massachusetts.

Abeles, M. 1991. Corticonics. Cambridge University Press, Cambridge.

Evarts, E. V. 1968. A technique for recording activity of subcortical neurons in moving animals. Electroencephalogr Clin Neurophysiol.

Georgopoulos, A. P., Taira, M., & Lukashin, A. 1993. Cognitive neurophysiology of the motor cortex. Science, 260.

Alexander G. Dimitrov. 2002. Spike sorting the other way. Neurocomputing, 2003.

Carroll, Michael. 2002. Spike Sorting for Neurosurgical Localization. Learning From Data, Winter 2002.

Kyung Hwan Kim & Sung June Kim. 2000. Neural Spike Sorting Under Nearly 0-dB Signal-to-Noise Ratio Using Nonlinear Energy Operator and Artificial Neural Network Classifier. IEEE Transactions on Biomedical Engineering, 47.

Susumu Takahashi, Yuichiro Anzai, & Yoshio Sakurai. 2002. Automatic Sorting for Multi-Neuronal Activity Recorded With Tetrodes in the Presence of Overlapping Spikes. J Neurophysiol, 89.

Abeles, M. & Goldstein, M. H. 1977. Multispike train analysis. Proc IEEE, 65.

Kyung Hwan Kim & Sung June Kim. 2003. Method for Unsupervised Classification of Multiunit Neural Signal Recording Under Low Signal-to-Noise Ratio. IEEE Transactions on Biomedical Engineering, 50.

McNaughton, B. L., J. O'Keefe, & C. A. Barnes. 1983. The stereotrode: a new technique for simultaneous isolation of several single units in the central nervous
    %!ACCEL MEM_MAP ... to ram_s18_s18(15) at 0
    %!ACCEL USE embedmults(25) m_9
    for m_9 = quantize(q_int, 1):5
        centers(d_3, m_9) = quantize(q_data, mid_centers(d_3, m_9) * mid_priors_inv(m_9));
        mid_centers(d_3, m_9) = quantize(q_mid_centers, const_zero);
    end

In the FPGA, all the computations among the five clusters were parallelized and ran at the same time. A lot of BlockRAMs were used to ensure that multiple uses of a single resource were possible. The embedded dedicated 18-bit 2's complement multipliers were used wherever multiplication was needed, to give the highest system performance. Registers were added to pipeline the operations and obtain a high system clock rate.

The training data was streamed into the FPGA to implement the training process of the EM algorithm. The data has to be streamed in twice for each iteration step, since both the E-step and the M-step need the training data for their computations. The advantage of streaming the data is that it makes the algorithm independent of the data size, since there is relatively little on-chip memory compared to PCs. The final results were stored in the block memories inside the FPGA (Figure 5.7).

[Figure 5.7: the data samples (data 1 ... data 720) stream through per-cluster pipelines; when the likelihood improvement falls below the threshold, the iteration stops and the current parameters are stored to the block memories.]
discrimination during simultaneous recording from several multi-unit nerve filaments. J Neurosci Methods, 64, 181-187.

Susumu Takahashi, Yuichiro Anzai, & Yoshio Sakurai. 2003. Automatic Sorting for Multi-Neuronal Activity Recorded With Tetrodes in the Presence of Overlapping Spikes. J Neurophysiol, 89, 2245-2258.

Robin Hanson, John Stutz, & Peter Cheeseman. 1996. Bayesian Classification Theory. Technical Report FIA-90-12-7-01.

Sahani, M. 1999. Latent Variable Models for Neural Data Analysis. Ph.D. Thesis.

Richard A. Redner & Homer F. Walker. 1984. Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review, 26, 195-239.

Lei Xu & Michael I. Jordan. 1995. On Convergence Properties of the EM Algorithm for Gaussian Mixtures. C.B.C.L. Paper No. 111.

Alexander G. Dimitrov, John P. Miller, Tomas Gedeon, Zane Aldworth, & Albert E. Parker. 2000. Analysis of neural coding using quantization with an information-based distortion measure. Network: Computation in Neural Systems.

[28] Paul M. Baggenstoss. 2002. Statistical Modeling Using Gaussian Mixtures and HMMs with MATLAB.

[29] Lewicki, Michael S. 1995. Bayesian Modeling and Classification of Neural Signals. Neural Computation.

[30] Dominique Lavenier. 2000. FPGA implementation of the K-means clustering algorithm for hyperspectral images. Los Alamos National Laboratory LAUR 00-3079.

[31] 2002. Virtex II Platform FPGAs: Introduction and Overview.

[32]
/********** Initialize the covars and priors **********/
initGmm(priors, centers, covars);
local_flag = 1;

/* Start timer */
timer_set(0xFFFFFFFF, 0xFFFFFFFF);
start_count = timer_on();   /* enable timer */

while (START) {

    /* --- End and Wait and Start --- */
    while (GetIOP(SEM_GLOBAL) == 0) {
    }
    SetIOP(SEM_LOCAL, 0);

    /********** Clear all the log version matrices **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ndim; j++) {
            logcovars_block_1[i][j] = 0;   /* Clear the logcovars matrix */
            logcenters_block_1[i][j] = 0;  /* Clear the logcenters matrix */
        }
    }

    /********** Calculate the probability in each cluster **********/
    for (i = 0; i < nblockdata; i++) {
        for (j = 0; j < nclusters; j++) {
            logproXiL_block_1[j][i] = getproXi(centers[j], blockdata[i], covars[j]);
        }
    }

    /********** Calculate the likelihood of each data point in this model **********/
    for (i = 0; i < nblockdata; i++) {
        loglhXi_block_1[0][i] = 0;   /* Clear the store matrix */
        /* Get the max number of each column */
        for (j = 0; j < nclusters; j++) {
            if (j == 0) {
            } else {
                maxTemp = logproXiL_block
fourth step is to synthesize the VHDL or Verilog models using a logic synthesis tool. The gate-level netlist is then simulated to ensure functional correctness against the system specification. Finally, the gate-level netlist is placed and routed by the place-and-route tools appropriate for the FPGAs used in the implementation.

One advantage of AccelFPGA is that bit-true simulations can be done in the Matlab environment. Testbenches are automatically generated along with the simulations. You can use these testbenches later in the hardware simulation and compare the results with the Matlab simulation. As a result, it is very easy to know whether your hardware implementation is correct or not.

Since Matlab is a high level language, the compiled implementation may not give as good performance as direct HDL coding. However, by using proper directives you can specify the parallelism of your algorithm and maximize the usage of the on-chip hardware resources. Hence, you can get reasonably good performance from your implementation.

Figure 5.3 Top-down design process using AccelFPGA (design flow: Matlab design in floating point representation; convert to fixed point representation; apply AccelFPGA compiler directives; AccelFPGA generates RTL models and testbench; logic synthesis to map RTL models to FPGA primitives; logic simulation for bit-true verification). The versions of all the tools used in
[Figure 4.3, continued: Phase 4 — update the priors and means for each cluster (data transferred: New_priors(L), size 5 per node; New_centers(L), size 800 per node). Phase 5 — calculate middle results for updating the covariance matrices (data transferred: Mid_covars(L), size 800 per node). Phase 6 — update the covariance matrices for each cluster (data transferred: New_covars(L), size 800 per node).]

Figure 4.3 Diagram of the multiple-DSP implementation of the EM algorithm.

Performance on the Parallel DSP Platform

Table 4.2 gives the execution time for 8 iterations of the EM algorithm, until it converged. Performance on a single DSP and on multiple DSPs are both shown, and different data transfer methods are compared. The results show that the computational speed of the 4-DSP system was almost 4 times better than the single DSP. Also, the DMA transfer sped up the whole system and gave the best performance. Notice that the bus contention auto-resolution method in the multiple-DSP system significantly slowed down the performance of the whole system; using semaphores was necessary to accelerate the data transfer process.

Table 4.2 Execution time for 8 iterations of the EM algorithm on single and 4-DSP systems

    Data Transfer Method    Single DSP (sec)    Four DSPs (sec)
    Normal Read             14.5                13.5
    DMA transfer            15.9                3.4
    Bus Contention          N/A                 > 7000

Table 4.3 shows the performance of the same EM application between the DSP and Personal Computer (PC) configurations. The perfor
/********** Use DMA to move data to internal memory and calculate **********/
data_start = (int *)data;
offset = 0;

/********** Read data from SRAM to internal memory **********/

/* --- Moving Data --- */
if (DMA == 1) {
    extmem_to_intmem(data_start,                                /* Source buffer */
                     (int *)blockdata,                          /* Destination buffer */
                     Blocksize,                                 /* Buffer size */
                     10,                                        /* DMA channel */
                     TRUE);                                     /* Wait for DMA to complete */
    extmem_to_intmem((int *)(data_start + Blocksize),           /* Source buffer */
                     (int *)(blockdata_slave[0] + SLAVE_BASE_1),
                     Blocksize, 10, TRUE);
    extmem_to_intmem((int *)(data_start + 2 * Blocksize),       /* Source buffer */
                     (int *)(blockdata_slave[1] + SLAVE_BASE_2),
                     Blocksize, 10, TRUE);
    extmem_to_intmem((int *)(data_start + 3 * Blocksize),       /* Source buffer */
                     (int *)(blockdata_slave[2] + SLAVE_BASE_3),
                     Blocksize, 10, TRUE);
} else {
    for (i2 = 0; i2 < nblockdata; i2++)
        for (j2 = 0; j2 < ndim; j2++)
            blockdata[i2][j2] = data[offset + i2][j2];
}
/* --- Finish Moving Data --- */

while (loopCountAll < maxLoop) {
    /* --- End and Wait and Start --- */
of all the clusters are set to be equal. All the EM training processes in this paper start from this same initialization, so that all the results are comparable.

Figure 3.3 Initialization of the EM algorithm.

Evaluation of the Algorithm

One hard problem of spike sorting is that it is hard to evaluate the clustering results. Signals from different cells may be very similar and difficult to distinguish, even for trained neurophysiologists. It is nearly impossible to create reliable expert-based validation data sets. Hence, simulated neural data are often used for evaluating spike sorting algorithms. Recently, simultaneous intra- and extracellular recording technology has provided the possibility to evaluate the algorithm on real data.

In this paper, we focus on how to implement the GMM based spike sorting algorithm in hardware. Since evaluations on both simulated and real data show that the GMM approach gives very good results [16, 29], we concentrated on the software and hardware implementations. When targeting FPGAs, the use of fixed point data types is important because of the increase in speed that can be attained. Since using fixed point data loses a certain amount of precision as compared to floating point data, we evaluated the FPGA implementation with respect to the floating point implementations.
    \ln p(x_i \mid \Theta) = \ln \sum_{l=1}^{M} \alpha_l \, e^{\ln p_l(x_i \mid \theta_l)}    Eq. (3.6)

    \ln L(\Theta \mid X) = \sum_{i=1}^{N} \ln p(x_i \mid \Theta)    Eq. (3.7)

After changing into the log domain, the products in Eq. (3.2) and Eq. (3.4) are turned into additions and subtractions. Exponential operations are also eliminated, except in Eq. (3.6). Since all the e^{\ln p_l(x_i \mid \theta_l)} terms are very small, this can still cause underflow problems. To handle this, we need to add another normalization process to calculate Eq. (3.5). We do a normalization across all the clusters, subtracting from every \ln p_l(x_i \mid \theta_l) term the maximum value across the clusters. We store this maximum, B_i = \max_l \ln p_l(x_i \mid \theta_l), for future use. After the subtraction, the exponential value e^{\ln p_l(x_i \mid \theta_l) - B_i} is guaranteed to lie in (0, 1]. We then perform the multiplications and additions; the log version of the middle result can be obtained by applying ln to it. The last step is to add the maximum \ln p_l(x_i \mid \theta_l) back to the middle result. This way, without changing the final result, we go around the numerical zero problems and get the same results:

    \ln p(x_i \mid \Theta) = B_i + \ln \sum_{l=1}^{M} \alpha_l \, e^{\ln p_l(x_i \mid \theta_l) - B_i}    Eq. (3.8)

    B_i = \max_{l} \ln p_l(x_i \mid \theta_l)    Eq. (3.9)

After getting the likelihood of the whole data set, and before updating the parameters, the probability of being in one cluster given input x_i has to be calculated. This probability can be calculated by Bayes' rule in Eq. (3.10):

    p(l \mid x_i, \Theta^g) = \frac{\alpha_l \, p_l(x_i \mid \theta_l^g)}{\sum_{k=1}^{M} \alpha_k \, p_k(x_i \mid \theta_k^g)}    Eq. (3.10)

However, this equatio
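The same shift, written out for one observation: a minimal sketch of Eqs. (3.8)-(3.9), where logp and alpha are hypothetical names for the vector of \ln p_l(x_i \mid \theta_l) values and the priors.

    % Hedged sketch of Eqs. (3.8)-(3.9): the log-sum-exp normalization.
    logp = [-1050; -1052; -1049];    % hypothetical ln p_l(x_i | theta_l) values
    alpha = [0.3; 0.3; 0.4];         % hypothetical priors
    B = max(logp);                   % Eq. (3.9): per-sample shift
    lnpxi = B + log(sum(alpha .* exp(logp - B)));   % Eq. (3.8); exp() stays in (0,1]

Note that the naive form log(sum(alpha .* exp(logp))) would underflow to log(0) here, while the shifted form returns the correct value of about -1049.64.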
    acc_4 = quantize(q1, acc_4 + log_const + covars_det(4));
    acc_5 = quantize(q1, acc_5 + log_const + covars_det(5));
    LogProXiL(1) = quantize(q1, acc_1);
    LogProXiL(2) = quantize(q1, acc_2);
    LogProXiL(3) = quantize(q1, acc_3);
    LogProXiL(4) = quantize(q1, acc_4);
    LogProXiL(5) = quantize(q1, acc_5);

    %%%% Function to calculate the log likelihood of one data point %%%%
    % >>>> LogLikeHoodXi = fun_loglikehoodxi(LogProXiL, priors) <<<<
    %!ACCEL REGISTER maxa
    %!ACCEL SHAPE maxa(1)
    %!ACCEL SHAPE a(5)
    maxa = quantize(q1, max(LogProXiL));
    %!ACCEL REGISTER a
    % NOTE: This loop will not ... after synthesis in ISE
    a(1) = quantize(q1, maxa - LogProXiL(1));   % positive differences, per the LUT convention
    a(2) = quantize(q1, maxa - LogProXiL(2));
    a(3) = quantize(q1, maxa - LogProXiL(3));
    a(4) = quantize(q1, maxa - LogProXiL(4));
    a(5) = quantize(q1, maxa - LogProXiL(5));

    for m_2 = quantize(q_int, 1):ncluster
        if a(m_2) > 4
            a(m_2) = quantize(q1, 3.998);
        end
    end

    for m_3 = quantize(q_int, 1):ncluster
        %!ACCEL SHAPE a_addr(1)
        a_addr = accel_bitshl(a(m_3), 8, q1, q_addr);   % Use shift left to get the LUT address
        %!ACCEL SHAPE lut_exp_content(1024)
        %!ACCEL SHAPE a_lut_out(5)
        a_lut_out(m_3) = quantize(q_lut_exp, lut_exp_content(a_addr + 1));
    end

    %!ACCEL SHAPE c(1)
    c = quantize(q1, 0);
    %!ACCEL REGISTER c_temp
    %!ACCEL REGISTER c
    %!ACCEL USE embedmults(10), embedmults(11), embedmults(12), embedmults(13), embedmults(14)
    for m_4 = qu
the convergence of the EM algorithm.

Table 5.3 The error between the original floating point implementation and the LUT simulations

                      The fixed point LUT output
    Error per data    0.0126

Figure 5.6 shows how this LUT was implemented in the FPGA hardware. The input bits, starting from bit 4, represent the least precision 2^-8 and are used as the input to the LUT; the bits below bit 4 are neglected, since they are not required for the input. The content of the LUT is determined by Eq. (5.3):

    LUT(k) = \exp(-k \times 2^{-8})    Eq. (5.3)

in which k represents the decimal address of the block memory. Notice that the inputs of the LUT are positive, i.e. the negatives of the real inputs; this negation was done in the previous calculation without additional overhead. Another thing to notice is that the last element in the LUT is 0 instead of the result from Eq. (5.3); the zero accounts for all the outputs whose real inputs are less than -4.

Figure 5.6 Diagram of implementing the LUT in block memory. (An 18-bit LUT input addresses a bank of Block SelectRAM, 18 bits x 1024 entries; 10 bits of the input connect to the LUT address line. Sample entries: address 0 -> exp(0) = 1.0000; 1 -> exp(-2^-8) = 0.9961; 2 -> 0.9922; ...; 1022 -> 0.0184; 1023 -> 0.)

All the other LUTs were determined and tested in M
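The table contents can be generated offline. A minimal sketch of Eq. (5.3) in Matlab follows; the variable name lut_exp_content echoes the Appendix D code, while the rest is illustration.

    % Hedged sketch: building the 1024-entry exponential LUT of Eq. (5.3).
    k = 0:1023;
    lut_exp_content = exp(-k * 2^-8);   % entries cover inputs in [0, 4) at 2^-8 steps
    lut_exp_content(end) = 0;           % last entry forced to 0 for inputs beyond the range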
the structure and components on the Hammerhead board will be described. The software we used to develop the system will also be introduced. In the rest of Chapter 4, we focus on how to parallelize the EM algorithm and how to effectively implement it on a parallel DSP system. At the end of this chapter, the performance of the EM implementation on this DSP system will be presented.

Hardware

SHARC ADSP-21160M DSP Processor

On the Hammerhead PCI board there are four high performance 32-bit floating point DSP processors, namely Analog Devices SHARC ADSP-21160s. This DSP features 4 Mbits of on-chip dual-ported SRAM and the Super Harvard Architecture with a Single Instruction Multiple Data (SIMD) computational engine. Running at 80 MHz, the ADSP-21160 can perform 480 million math operations (peak) per second.

The ADSP-21160 has two independent, parallel computation units. Each unit consists of an ALU, multiplier, and shifter. In SIMD mode, the parallel ALU and multiplier operations occur in both processing elements. The ADSP-21160 has two independent memories, one for data and the other for program instructions. Two independent data address generators (DAGs) and a program sequencer supply addresses for memory access. The ADSP-21160 can access both data memory and program memory while fetching an instruction. In addition, it has a zero-overhead loop facility with single-cycle setup and exit. The on-chip DMA controller allows zero-overhead data transf
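One way to read the peak figure (my arithmetic, not the datasheet's derivation): with two processing elements, each containing three units, six operations can issue per cycle, so

    80\,\text{MHz} \times 2 \times 3 = 480 \text{ million operations per second.}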
values of the covariance matrices are approximately 10^-7, so during the training process the matrices easily become singular because of limited precision.

We tried three different approaches to solve this problem. The first method was to recondition the covariance matrices whenever a singular matrix appeared. Every time there was a numeric zero in the covariance matrix, we added a very small number to prevent it from being zero. This way, the algorithm could go on to the next step until it converged. In the second approach, we reduced the data dimension by using Principal Component Analysis (PCA). As discussed in Chapter 1, PCA can find the directions that represent the most variation of the original data. We used the first 10 components, which could represent 98% of the variation of the data set. By reducing the dimension to 10, there was much less chance that the full covariance matrix could become singular. The last method was to use a diagonal matrix instead of the full covariance matrix. Using a diagonal matrix enabled log calculation of the determinant and prevented numerical zeros. By using the diagonal covariance matrix, we made the assumption that there is no correlation between the dimensions of the 160-dimensional space. Making this assumption can cause a loss of information about the original data set; however, the first two methods also lost information, by reconditioning and by reducing the vector dimensions. We tested all these methods, and the best method was to use the diagonal covariance matrix.
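A minimal sketch of the reconditioning idea for a diagonal covariance matrix; the floor value is an assumption here, echoing the min_std = 0.0001 constant in the Appendix B code:

    % Hedged sketch of reconditioning; the floor value is an assumption
    % echoing min_std = 0.0001 from Appendix B.
    min_std = 0.0001;
    covars(covars < min_std^2) = min_std^2;   % replace numeric zeros with a tiny variance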
was about 700 times slower than the other approaches because of the bus contention issue.

We then set a semaphore to exclude concurrent use of the cluster bus. By reading this semaphore, only one DSP could access the bus at any time. This gave a much better performance than the first approach.

Finally, we used the DMA transfer mode to transfer parameters between the DSPs. The master processor was in charge of the DMA controllers of all the processors. Whenever a data transfer was needed, the master first made sure that no DSP was using the bus. The master then wrote into the registers of both the transmitting and receiving DMA controllers. After the DMA transfer was initialized, no other DSP could access the cluster bus until the DMA transfer finished. This sped up the process and gave the best performance. The performance comparison will be shown in the next section.

[Figure 4.3: Phase 1 — (1) calculate p(x_i | l), (2) calculate the log likelihood of each data point (data transferred: LogLike_Xi(L), size 180 per node). Phase 2 — (1) calculate the log likelihood of the whole data set, (2) test whether to stop the iteration or not (data transferred: LogPro_L_Xi, size 180 per node). Phase 3 — calculate middle results for updating the priors and means (data transferred: Mid_priors(L), size 5 per node; Mid_centers(L), size 800 per node). Phase 4 — update the priors and]
while the average speed was around 3 seconds. This is quite slow compared to the average neural spike period. The slow speed is due to the intensive computation needed for the training of the Gaussian Mixture Model based on the EM algorithm. We needed the training process to be as fast as possible, so that minimum time is spent in the training phase. As a result, PCs are not as fast as we would like in order to do this.

In the rest of the thesis we will focus on how to speed up the training process of the EM algorithm. In the next chapter, the parallel DSP implementation will be introduced; later, an FPGA implementation will also be presented.

CHAPTER 4

PARALLEL DSP IMPLEMENTATION

Introduction

To speed up the GMM based spike sorting system, the first approach we used was to implement the EM algorithm on a parallel DSP system with 4 DSPs. Digital Signal Processors have become more powerful with the increase in speed and size of their on-chip memory. Furthermore, some modern DSPs are optimized for multiprocessing. The Analog Devices ADSP-21160M DSP we used for our project has two types of integrated multiprocessing support: the link ports and the cluster bus. We used Bittware's Hammerhead PCI board to test our implementation. The board contains four Analog Devices ADSP-21160 floating point DSPs. We wrote a parallel version of the EM algorithm to take advantage of the multiprocessing capability of the board.

In the next section,
    % quantize for pure integer numbers
    wl_int = 18; fp_int = 0;
    % quantize for pure fractional numbers
    wl_fra = 18; fp_fra = 17;

    % The word length for the LUTs; the binary point can be varied due to the
    % precision requirement. Most fixed points are the same, but in case I
    % change one later, just leave each as a different fp.
    wl_lut = 18;
    fp_lut_cov_inv = 3;
    fp_lut_cov_log = 13;
    fp_lut_exp = 13;
    fp_lut_log = 13;
    fp_lut_prior = 13;
    fp_lut_inv = 13;

    q_data = quantizer('floor', [wl_data fp_data]);
    q1 = quantizer('floor', 'wrap', [wl fp]);
    q1_int = quantizer('floor', 'wrap', [wl 0]);
    q_lut_cov_inv = quantizer('floor', 'wrap', [wl_lut fp_lut_cov_inv]);
    q_lut_cov_log = quantizer('floor', 'wrap', [wl_lut fp_lut_cov_log]);
    q_lut_exp = quantizer('floor', 'wrap', [wl_lut fp_lut_exp]);
    q_lut_log = quantizer('floor', 'wrap', [wl_lut fp_lut_log]);
    q_lut_prior = quantizer('floor', 'wrap', [wl_lut fp_lut_prior]);
    q_int = quantizer('floor', 'wrap', [wl_int fp_int]);
    q_fra = quantizer('floor', 'wrap', [wl_fra fp_fra]);
    q_lut_inv = quantizer('floor', 'wrap', [wl_lut fp_lut_inv]);
    q_mid_centers = quantizer('floor', 'wrap', [18 8]);
    q_mid_priors = quantizer('floor', 'wrap', [18 8]);
    q_small = quantizer('floor', 'wrap', [18 13]);
    q_square_cov = quantizer('floor', 'wrap', [18 15]);
    q_covars = q_square_cov;
    q_priors = q_covars;
    q_mid_covars = quantizer('floor', 'wrap', [18 12]);
    q_loglike = quantizer('floor', 'wrap', [18 7]);

    % fixed point f
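For reference, a one-line usage sketch of these quantizer objects (x is a hypothetical value; quantize and quantizer come from Matlab's fixed point tooling of that era — an assumption about the exact toolbox):

    % Hedged usage sketch; x is a hypothetical value.
    x = 0.123456;
    y = quantize(q_small, x);   % floor-round into s18.13, wrapping on overflow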
        priors_start[i2] = *(int *)(PR_BLOCK_1 + LoopCount*DSP_JUMP + i2);
    }
}
/* Finish moving data */

/* Begin updating the priors and centers */
for (i = 0; i < nclusters; i++) {
    logpriors_master[0][i] = logpriors_master[0][i] + priorsTemp[0][i];
    vecvadd(logcenters_master[i], candcTemp[i], logcenters_master[i], ndim);
}
for (i = 0; i < nclusters; i++) {
    priors[0][i] = logpriors_master[0][i] / ndata;
    logpriors_master[0][i] = 1.0 / logpriors_master[0][i];
    vecsmlt(logcenters_master[i], logpriors_master[0][i], centers[i], ndim);
}
/* End updating the centers */

/* Moving data to the slave DSPs */
if (DMA == 1) {
    intmem_to_extmem_m2s((int *)priors,                  /* Source buffer */
                         (int *)P_BLOCK_1,               /* Destination buffer */
                         P_BLOCK_LENGTH - CANDC_LENGTH,  /* Buffer size */
                         10,                             /* DMA channel */
                         TRUE);                          /* Wait for DMA to complete */
    intmem_to_extmem_m2s((int *)priors,                  /* Source buffer */
                         (int *)(P_BLOCK_1 + DSP_JUMP),  /* Destination buffer */
                         P_BLOCK_LENGTH - CANDC_LENGTH,  /* Buffer size */
                         10,                             /* DMA channel */
                         TRUE);                          /* Wait for DMA to complete */
    intmem_to_extmem_m2s((int *)priors,                    /* Source buffer */
                         (int *)(P_BLOCK_1 + 2*DSP_JUMP),  /* Destination buffer */
                         P_BLOCK_LENGTH - CANDC_LENGTH,    /* Buffer size */
                         10,                               /* DMA channel */
                         TRUE);                            /* Wait for DMA to complete */
} else {
    for (i2 = 0;
         i2 < P_BLOCK_LENGTH - CANDC_LENGTH; i2++) {
        *(int *)(GLOBAL_P + i2) = priors[i2];
    }
}

/* End and Wait and Start */
SetIOP(SEM_GLOBAL, 1);
while ((GetIOP(SEM_LOCAL_1) | GetIOP(SEM_LOCAL_2) | GetIOP(SEM_LOCAL_3)) != 0)
    ;
SetIOP(SEM_GLOBAL, 0);

/* Get middle data for the covars */
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < nblockdata; j++) {
        vecvsub(blockdata[j], centers[i], blockdata_diff[j], ndim);
        vecvmlt(blockdata_diff[j], blockdata_diff[j], blockdata_diff[j], ndim);
    }
    for (j = 0; j < ndim; j++)
        blocksums[j] = 0;
    block_sum(blocksums, blockdata_diff, logproLXi_master[i], offset);
    vecvadd(logcovars_master[i], blocksums, logcovars_master[i], ndim);
}

/* Update the covars */
/* End and Wait and Start */
while ((GetIOP(SEM_LOCAL_1) & GetIOP(SEM_LOCAL_2) & GetIOP(SEM_LOCAL_3)) != 1)
    ;

for (LoopCount = 0; LoopCount < nprocessors - 1; LoopCount++) {
    /* Get mid results of priors and centers from the slaves */
    /* Moving
4.4 .......................................................................... 49

5.1 Supported configurations of the block SelectRAM ......................... 55

5.2 The resources of the Virtex II XC2V3000 FPGA

5.3 The error between the original floating point implementation and
the LUT implementation ...................................................... 63

5.4 List of directives that can be used in AccelFPGA ........................ 65

5.5 The toolchain used in the FPGA implementation ........................... 65

5.6 Device utilization summary of the Xilinx Virtex II FPGA ................. 68

5.7 The post synthesis timing report ........................................ 68

5.8 The difference between the floating point output and the FPGA output

5.9 Confusion matrix of the fixed point spike sorting result for 720 neural
spikes from 5 neurons. The overall correct rate is 97.25% compared
to the floating point result

5.10 Performance comparison between all the platforms ....................... 72

ABSTRACT

The detection and classification of neural spike activity is an indispensable step in the analysis of extracellular signal recordings. We introduce a spike sorting system based on the Gaussian Mixture Model (GMM) and show that it can be implemented in Field Programmable Gate Arrays (FPGAs). The Expectation Maximization (EM) algorithm is used to estimate the parameters of the GMM. In order to handle the high dimensional inputs in our application, a log version
A. Ghosh, J. Kunkel, & S. Liao, 1999. Hardware Synthesis from C/C++. Proc. Design, Automation and Test in Europe Conference and Exhibition.

[33] R. Helaihel & K. Olukotun, 1997. Java as a Specification Language for Hardware-Software Systems. Proc. International Conference on Computer-Aided Design, 690-697.

[34] 2003. AccelFPGA User's Manual V1.7.

APPENDICES

APPENDIX A

SIMPLIFYING THE EXPRESSION OF $Q(\Theta, \Theta^g)$

The procedure for estimating the parameters of the Gaussian Mixture Model using the Expectation Maximization algorithm is introduced in Chapter 2. In this section, we present the mathematical details of how to simplify the expression for the expected value of the log likelihood based on the whole data set and the current model parameters. We define it as $Q(\Theta, \Theta^g)$.

To find the equations to iteratively estimate the parameters of a GMM, one critical step is to simplify the expression of $Q(\Theta, \Theta^g)$. From Eq. 2.9 we know that

$$Q(\Theta, \Theta^g) = \sum_{\mathbf{y} \in \Upsilon} \log \mathcal{L}(\Theta \mid X, \mathbf{y})\, p(\mathbf{y} \mid X, \Theta^g)$$

Combining the above equation with Eq. 2.12, Eq. 2.13, Eq. 2.14 and Eq. 2.15, we can get the following equation:

$$
\begin{aligned}
Q(\Theta, \Theta^g)
&= \sum_{\mathbf{y} \in \Upsilon} \log p(X, \mathbf{y} \mid \Theta)\, p(\mathbf{y} \mid X, \Theta^g) \\
&= \sum_{\mathbf{y} \in \Upsilon} \sum_{i=1}^{N} \log\!\big(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\big) \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) \\
&= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\!\big(\alpha_\ell\, p_\ell(x_i \mid \theta_\ell)\big) \sum_{\mathbf{y} \in \Upsilon} \delta_{\ell, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) \\
&= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\!\big(\alpha_\ell\, p_\ell(x_i \mid \theta_\ell)\big)\, p(\ell \mid x_i, \Theta^g)
\end{aligned}
$$
Array (FPGA) is an ASIC whose logic functions can be changed to suit a user's needs. The circuit elements within these devices can be configured repeatedly because the current configuration is stored in SRAM cells [31]. FPGAs can be configured to operate in a nearly unlimited number of user-controlled ways by downloading various configuration bit streams.

FPGA devices are usually organized into generic logic blocks that can perform a limited number of logic functions. The specific function being performed is determined by the configuration bits preloaded into the hardware. These logic blocks are connected by programmable routing, which is also determined by the configuration bits.

The particular structures of the logic blocks and routing differ among FPGAs from different manufacturers. The FPGA we use for our application is from the Xilinx Virtex II series. The structure of this FPGA is described in the following section.

Structure of Xilinx's Virtex II FPGA

The Xilinx Virtex II family is an FPGA developed for high-speed, high-density logic designs with low power consumption. As shown in Figure 5.1, the programmable device is comprised of input/output blocks (IOBs) and internal configurable logic blocks (CLBs). In addition to the configurable logic blocks, there are three major elements:

1. Block SelectRAM memory blocks provide 18 Kbit dual-port RAMs.
2. Embedded 18 x 18 bit dedicated multipliers.
3. Digital C
%!ACCEL SHAPE acc_4(1)
%!ACCEL SHAPE acc_5(1)
%!ACCEL SHAPE LogProXiL(ncluster)
%!ACCEL SHAPE acc(ncluster)
%!ACCEL SHAPE sub_1(ncluster)
%!ACCEL SHAPE sub_2(ncluster)
%!ACCEL SHAPE mult(ncluster)
%!ACCEL SHAPE squa(ncluster)
%!ACCEL SHAPE sub_temp(ncluster)

acc_1 = quantize(q1, 0);
acc_2 = quantize(q1, 0);
acc_3 = quantize(q1, 0);
acc_4 = quantize(q1, 0);
acc_5 = quantize(q1, 0);

%!ACCEL REGISTER sub_1
%!ACCEL REGISTER sub_2
%!ACCEL REGISTER squa
%!ACCEL REGISTER mult
%!ACCEL REGISTER acc

for d_1 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_i_0
    %!ACCEL MEM_MAP indatabuf(m_i_0) TO ram_s18_s18(m_i_0+5) AT 0
    %!ACCEL USE embedmults(1+m_i_0), embedmults(4+m_i_0)
    for m_i_0 = quantize(q_int, 1:ncluster)
        sub_1(m_i_0) = quantize(q_small, indatabuf(d_1) - centers(d_1, m_i_0));
        sub_2(m_i_0) = quantize(q_small, indatabuf(d_1) - centers(d_1, m_i_0));
        squa(m_i_0) = quantize(q_small, sub_1(m_i_0) * sub_2(m_i_0));
        mult(m_i_0) = quantize(q1, squa(m_i_0) * covars_inv(d_1, m_i_0));
        acc(m_i_0) = quantize(q1, acc(m_i_0) + mult(m_i_0));
    end
end

acc_1 = quantize(q1, acc(1)/2);
acc_2 = quantize(q1, acc(2)/2);
acc_3 = quantize(q1, acc(3)/2);
acc_4 = quantize(q1, acc(4)/2);
acc_5 = quantize(q1, acc(5)/2);

acc_1 = quantize(q1, acc_1 + log_const + covars_det(1));
acc_2 = quantize(q1, acc_2 + log_const + covars_det(2));
acc_3 = quantize(q1, acc_3 + log_const + covars_det(3));
acc_4 = quantize
Figure 3.7 Clustering result for the training data. Five clusters are labeled using different colors.

Matlab Performance

Since we want to implement a high speed spike sorting system where the models are trained as fast as possible, we have to know how fast this algorithm can run on a PC. As described in Chapter 2, the whole spike sorting process can be divided into two phases: the model training phase and the spike classification phase. We tested the algorithm using the newest version of Matlab (Version 6.5, Release 13) on different PCs. The test results are shown in Table 3.1.

Table 3.1 Performance of the spike sorting system on several PCs.

Execution Time              Pentium III   AMD     AMD        Pentium IV   Average   Best
                            1.2G          1.37G   1.2G Duo   3.3G
Training (s)                8.33          3.42    1.45       1.31         3.22      1.31
Classification (720         0.42          0.14    0.05       0.04         0.16      0.04
data points) (s)
Classification (per         0.58          0.19    0.06       0.05         0.21      0.05
data point) (ms)

1. The Pentium III 1.2G and AMD 1.37G systems both have 512 MB DDRAM.
2. The AMD 1.2G Duo system has 4 GB DDRAM.
3. The Pentium IV 3.3G system has 1 GB DDRAM.

The performance of making classifications on a PC is pretty good. With the existing models, a spike waveform can be classified in less than 0.2 ms. The classification process can therefore meet the real time requirement, since the typical neural spike period is around 2-3 ms. However, the training process ran much slower than the classification process. The best performance we got was 1.31 seconds
%!ACCEL SHAPE LogProLXi_addr(5)
    LogProLXi_addr(m_6_1) = accel_bitshl(LogProLXi(m_6_1), 8, q1, q_addr);
end

%!ACCEL SHAPE lut_exp_ProLXi(1024, ncluster)
%!ACCEL TILE m_6
%!ACCEL MEM_MAP ProLXi(m_6) TO ram_s18_s18(m_6+5) AT 0
%!ACCEL MEM_MAP lut_exp_ProLXi(m_6) TO ram_s18_s18(m_6+10) AT 0
for m_6 = quantize(q_int, 1:ncluster)
    %!ACCEL SHAPE ProLXi(ndata, ncluster)
    ProLXi(id, m_6) = quantize(q_lut_exp, lut_exp_ProLXi(LogProLXi_addr(m_6)+1, m_6));
end

%!ACCEL SHAPE mid_priors(5)
%!ACCEL TILE m_7
for m_7 = quantize(q_int, 1:ncluster)
    mid_priors(m_7) = quantize(q_mid_priors, mid_priors(m_7) + ProLXi(id, m_7));
end

%!ACCEL SHAPE mid_centers(ndim, ncluster)
%!ACCEL SHAPE mid_centers_temp_1(ncluster)
%!ACCEL SHAPE mid_centers_temp_2(ncluster)
%!ACCEL REGISTER mid_centers_temp_1
%!ACCEL REGISTER mid_centers_temp_2
for d_2 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_i_1
    %!ACCEL USE embedmults(15+m_i_1)
    for m_i_1 = quantize(q_int, 1:ncluster)
        mid_centers_temp_1(m_i_1) = quantize(q1, indatabuf(d_2) * ProLXi(id, m_i_1));
        mid_centers_temp_2(m_i_1) = quantize(q1, mid_centers_temp_1(m_i_1) + mid_centers(d_2, m_i_1));
        mid_centers(d_2, m_i_1) = quantize(q_mid_centers, mid_centers_temp_2(m_i_1));
    end
end

id = quantize(q_int, id + 1);
if id == quantize(q_int, ndata + 1)
    % Here is a 30 bit multiplier
    step_incr = quantize(q1, LogLikeHoodX - newLikeHood);
    % To calculate the increase of the likelihood
    newL
Table 5.6 Device utilization summary of the Xilinx Virtex II FPGA.

FPGA resources          Number of used units   Number of available units   Percentage of the available resources
Slices                  6378                   14336                       44%
Slice Flip Flops        4790                   28672                       16%
4 input LUTs            11473                  28672                       40%
Bonded IOBs             293                    720                         40%
Embedded Multipliers    42                     96                          43%
BlockRAMs               58                     96                          60%
GCLKs                   1                      16                          6%

Table 5.7 The post synthesis timing report.

System Parameters               Performance
Minimum Period                  15.987 ns
Maximum Frequency               62.551 MHz
Maximum input arrival time      9.901 ns
Maximum output required time    7.333 ns

Figure 5.8 Floorplan of the EM algorithm implementation on a Virtex II FPGA.

FPGA Result Evaluation

Since we used fixed point data in the FPGA implementation, the precision of the output was not as good as in the floating point implementation. The LUTs used for the math operations further decreased the precision of the output. We compared the final GMM parameters from the floating point implementation and the fixed point FPGA implementation.

Figure 5.9(b) shows the plot of the resulting means of each cluster from the FPGA implementation. The difference is quite small compared to the floating point results (Figure 5.9(a)). Table 5.8 shows the average errors between the FPGA implementation and the floating point implementation. We can see that the differences are small.

Figure 5.9 The mean vectors of the GMMs. (a) is the floating point result; (b) is the fixed point result.
    SetIOP(DMAC10 + (dma_channel - 10), DMAC_FLUSH);
    SetIOP(DSP_offset + DMAC10 + (dma_channel - 10), DMAC_FLUSH);

    return TRUE;
}

int SetSem(int *sem)
{
    while (set_semaphore((int *)sem, 1, 1) == 0)
        ;
    return 0;
}

int ClearSem(int *sem)
{
    while (set_semaphore((int *)sem, 0, 1) == 0)
        ;
    return 0;
}

/* End of lib.c */

APPENDIX D

FIXED POINT MATLAB CODE USING ACCELFPGA

AccelChip.m

warning off
% Directive to indicate the target for synthesis
%!ACCEL TARGET XC2V3000

% package quantizer constants
% q_data = [18 6]          for data input
% q1 = [30 13]             for all the accumulators and middle results
% q_lut_cov_inv = [18 3]   for cov_inv look up table
% q_lut_cov_log = [18 13]  for cov_log look up table
% q_lut_exp = [18 13]      for the first exp look up table
% q_lut_log = [18 13]      for log look up table
% q_lut_prior = [18 13]    for prior inv look up table
% q_int = [18 0]           for integers
% q_fra = [18 17]          for fractional numbers

% Define the width of the word to be 30 with the binary point at 13.
% Just simplify the whole process using one fixed point length.
% May be able to divide the algorithm into different precisions.
wl = 30;
fp = 13;

% This word length is for the input data; 18 bits can fit in the block memory.
wl_data = 18;
fp_data = 6;

% quantize for integer
TURE MODEL BASED SPIKE SORTING SYSTEM

by

Yongming Zhu

A thesis submitted in partial fulfillment
of the requirements for the degree

of

Master of Science

in

Electrical Engineering

MONTANA STATE UNIVERSITY
Bozeman, Montana

November 2003

COPYRIGHT

by

Yongming Zhu

2003

All Rights Reserved

APPROVAL

of a thesis submitted by

Yongming Zhu

This thesis has been read by each member of the thesis committee and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the College of Graduate Studies.

Dr. Ross K. Snider
Signature                                        Date

Approved for the Department of Electrical and Computer Engineering

Dr. James Peterson
Signature                                        Date

Approved for the College of Graduate Studies

Dr. Bruce
Signature                                        Date

STATEMENT OF PERMISSION TO USE

In presenting this thesis in partial fulfillment of the requirements for a master's degree at Montana State University, I agree that the Library shall make it available to borrowers under rules of the Library.

If I have indicated my intention to copyright this thesis by including a copyright notice page, copying is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for permission for extended quotation from or
        sub_cov_2(m_i_3) = quantize(q_small, indatabuf(d_4) - centers(d_4, m_i_3));
        square_cov(m_i_3) = quantize(q_square_cov, sub_cov_1(m_i_3) * sub_cov_2(m_i_3));
        mult_cov(m_i_3) = quantize(q1, square_cov(m_i_3) * ProLXi(id, m_i_3));
        mid_covars_temp(m_i_3) = quantize(q_mid_covars, mid_covars(d_4, m_i_3) + mult_cov(m_i_3));
        mid_covars(d_4, m_i_3) = mid_covars_temp(m_i_3);
    end % i
end

id = quantize(q_int, id + 1);
if id == quantize(q_int, ndata + 1)
    loop_flag = quantize(q_int, loop_flag + 1);
    id = quantize(q_int, 1);

    for d_5 = quantize(q_int, 1:ndim)
        %!ACCEL SHAPE mid_covars(ndim, ncluster)
        %!ACCEL TILE m_12
        %!ACCEL MEM_MAP mid_covars(m_12) TO ram_s18_s18(m_12+30) AT 0
        %!ACCEL MEM_MAP covars(m_12) TO ram_s18_s18(m_12+25) AT 0
        %!ACCEL USE embedmults(m_12+40)
        for m_12 = quantize(q_int, 1:ncluster)
            covars(d_5, m_12) = quantize(q_covars, mid_covars(d_5, m_12) * mid_priors_inv(m_12));
            covars_test(m_12) = quantize(q1, covars(d_5, m_12));
        end
    end

    % Look up table for inv and log of covars
    % >>>>>>>>> [covars_inv, covars_log] = fun_lut_covars(covars)
    %!ACCEL SHAPE covars_addr(1)
    %!ACCEL SHAPE covars_inv(ndim, ncluster)
    %!ACCEL SHAPE covars_log(ndim, ncluster)
    for d_6 = quantize(q_int, 1:ndim)
        if covars(d_6, 1) > upper_const
            covars(d_6, 1) = quantize(q_covars, upper_const);
        end
    end

    for d_9 = quantize(q_int, 1:ndim)
        if covars(d_9, 2) > upper_const
            covars(d_9, 2) = quantize(q_covars, upper_cons
acellular recording waveform shows different action potentials from an unknown number of local neurons. The data were recorded from an adult female cricket's cercal sensory system. Data courtesy of Alex ................ 2

1.2 A typical neural signal recording, detecting and spike sorting
system ...................................................................... 3

2.1 Non-Gaussian density estimation using a GMM ............................ 10

2.2 Illustration of the spike generating process ............................ 12

2.3 A multi-variate Gaussian model of the spike waveforms generated
from a particular neuron. The solid line shows the mean vector of
this model while the dashed lines show the three-sigma boundary
according to the covariance matrix. All the waveforms in this
model are shown in the ..................................................... 13

2.4 Block diagram of the Bayesian decision rule used for the spike
sorting system. The log probability for a spike being in a cluster is
computed for each individual cluster and the highest scoring
cluster is the identified .................................................. 15

2.5 Diagram of EM algorithm for GMM parameter estimation ................... 23

3.1 Plot of all the 720 neural spike waveforms .............................. 26

3.2 The increase in likelihood due to the increase of cluster
ally include more waveforms in this cluster where some spike shapes are noticeably different. Using the diagonal covariance matrix approach, the small variance was captured and a small cluster was generated to represent this specific spike shape. For our data set, the diagonal covariance approach lost the least amount of information in the original spike waveforms. We chose this approach for all hardware implementations.

Implementation Result in Matlab

Using the initialization condition mentioned in Section 3.3.1, the training process for the 720 160-dimensional inputs took 8 iterations to converge. The mean waveform and priors of each cluster are shown in Figure 3.6. Figure 3.7 shows the clustering results for the 720 training data using the GMM.

As you can see in Figure 3.7, all the main clusters have been recognized. Cluster 1 and Cluster 2 are the clusters with the most data. Cluster 3 and Cluster 4 can be regarded as noisy versions of Cluster 1 and Cluster 2: their mean vectors are very close to those of Cluster 1 and Cluster 2, but their waveforms are noisier, so the covariances of these two clusters are bigger. Cluster 5 has its own specific spike shape. It has the fewest spikes among the five clusters, so it has the smallest prior probability. Overall, this 5-component Gaussian Mixture Model models the whole data set quite well from visual inspection.

[Figure 3.6 The mean waveform and prior probability of each cluster. The cluster priors are 0.041667, 0.16245, 0.1903, 0.25275 and 0.35282.]
for m_4 = quantize(q_int, 1:ncluster)
    c_temp = quantize(q1, priors(m_4) * a_lut_out(m_4));
    c = quantize(q1, c + c_temp);
end

if c > 0.5
    c = quantize(q1, 0.4995);
end

%!ACCEL REGISTER c_addr
%!ACCEL REGISTER c_lut_out
%!ACCEL SHAPE c_addr(1)
c_addr = accel_bitshl(c, 11, q1, q_addr);

%!ACCEL SHAPE lut_log_content(1024)
%!ACCEL SHAPE c_lut_out(1)
c_lut_out = quantize(q_lut_log, lut_log_content(c_addr+1));

%!ACCEL REGISTER LogLikeHoodXi
%!ACCEL SHAPE LogLikeHoodXi(1)
%!ACCEL SHAPE LogLikeHoodX(1)
LogLikeHoodXi = quantize(q_loglike, c_lut_out + maxa);

%!ACCEL REGISTER LogLikeHoodX_temp
%!ACCEL SHAPE LogLikeHoodX_temp(1)
%!ACCEL USE embedmults(15)
LogLikeHoodX_temp = quantize(q1, LogLikeHoodX + LogLikeHoodXi * ndata_inv);
LogLikeHoodX = quantize(q1, LogLikeHoodX_temp);

%!ACCEL SHAPE LogProLXi(5)
LogProLXi(1) = quantize(q1, LogProXiL(1) + priors_log(1) - LogLikeHoodXi);
LogProLXi(2) = quantize(q1, LogProXiL(2) + priors_log(2) - LogLikeHoodXi);
LogProLXi(3) = quantize(q1, LogProXiL(3) + priors_log(3) - LogLikeHoodXi);
LogProLXi(4) = quantize(q1, LogProXiL(4) + priors_log(4) - LogLikeHoodXi);
LogProLXi(5) = quantize(q1, LogProXiL(5) + priors_log(5) - LogLikeHoodXi);

for m_5_2 = quantize(q_int, 1:5)
    if LogProLXi(m_5_2) > 4
        LogProLXi(m_5_2) = quantize(q1, 3.998);
    end
end
for m_5_3 = quantize(q_int, 1:5)
    if LogProLXi(m_5_3) < 0
        LogProLXi(m_5_3) = quantize(q1, 0);
    end
end

for m_6_1 = quantize(q_int, 1:ncluster)
    %!ACC
centers_4 = quantize(q_data, load('centers_4.dat'));
centers_5 = quantize(q_data, load('centers_5.dat'));
centers(:,1) = centers_1;
centers(:,2) = centers_2;
centers(:,3) = centers_3;
centers(:,4) = centers_4;
centers(:,5) = centers_5;

% Define constants
log_const = quantize(q1, 147.0302);   % log_const = log(2*pi) * ndim / 2
ndim = quantize(q_int, 160);
ndata = quantize(q_int, 720);         % 720 training spikes (ndata_inv below is 1/720)
ncluster = quantize(q_int, 5);
stop_thre = quantize(q_int, 6.25);
MaxLoop = quantize(q_int, 27);
LoopCount = quantize(q_int, 1);
loop_flag = quantize(q_int, 1);
id = quantize(q_int, 1);

centers_test = zeros(1, 5);
covars_test = zeros(1, 5);
upper_const = quantize(q_fra, 0.2498);
covars = quantize(q_covars, zeros(ndim, ncluster) + 1);
priors = quantize(q_priors, zeros(1, ncluster) + 0.2);
ndata_inv = quantize(q_fra, 0.0014);
const_zero = quantize(q_int, 0);

% The middle results we need
indatabuf = zeros(1, ndim);
LogProXiL = quantize(q1, zeros(1, ncluster));
LogLikeHoodXi = quantize(q_loglike, 0);
LogLikeHoodX = quantize(q1, 0);
ProLXi = quantize(q_lut_exp, zeros(ndata, ncluster));
mid_priors = quantize(q_mid_priors, zeros(1, ncluster));
mid_centers = quantize(q_mid_centers, zeros(ndim, ncluster));
mid_covars = quantize(q_mid_covars, zeros(ndim, ncluster));
mid_priors_inv = quantize(q_lut_inv, zeros(1, ncluster));
newLikeHood = quantize(q1, 0);
covars_inv = quantize(q_lut_cov_inv, zeros(ndim, ncluster) + 1);   % initialize the cov_inv
atlab in a similar way. A list of the LUTs is found in Appendix D.

Parallelism Needed for FPGA Implementation

To fully use the computational power of the FPGA, we needed to parallelize the EM algorithm. In Matlab, programs are executed sequentially according to the program sequence. AccelChip offers directives which tell the compiler when to parallelize the algorithm and when to use the embedded FPGA resources.

The following are some directives that have been used extensively in the Matlab program:

Table 5.4 List of directives that can be used in AccelFPGA.

REGISTER   Use hardware registers for specific variables to add pipeline stages.
MEM_MAP    Map an array variable to a specific block memory on the FPGA.
TILE       Specify an array of computations to happen concurrently.
USE        Specify the hardware resources, such as embedded multipliers, for particular operations.

The code in List 5.1 shows how to actually use these directives in the Matlab code. The operation in the for loop is parallelized using the TILE directive. Block memories are used for each array variable and an embedded multiplier is used for each multiplication. By using these directives, the multiplication operations applied to all 5 clusters can be executed in one clock cycle.

List 5.1 Example code for using directives in Matlab.

%!ACCEL TILE m_9
%!ACCEL MEM_MAP mid_centers(m_9) TO ram_s18_s18(m_9+20) AT 0
%!ACCEL MEM_MAP centers(m_9
/* End and Wait and Start */
SetIOP(SEM_GLOBAL, 1);
while ((GetIOP(SEM_LOCAL_1) | GetIOP(SEM_LOCAL_2) | GetIOP(SEM_LOCAL_3)) != 0)
    ;
SetIOP(SEM_GLOBAL, 0);

/* Clear all the log version matrices */
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ndim; j++) {
        logcovars_master[i][j] = 0;   /* Clear the logcovars matrix */
        logcenters_master[i][j] = 0;  /* Clear the logcenters matrix */
    }
}

/* Calculate the probability in each cluster */
for (i = 0; i < nblockdata; i++) {
    for (j = 0; j < nclusters; j++) {
        logproXiL_master[j][i] = getproXi(centers[j], blockdata[i], covars[j]);
    }
}

/* Calculate the likelihood of each data point in this model */
for (i = 0; i < nblockdata; i++) {
    /* Clear the store matrix */
    loglhXi_master[0][i] = 0;
    /* Get the max number of each column */
    for (j = 0; j < nclusters; j++) {
        if (j == 0) {
            maxTemp = logproXiL_master[0][i];
        } else {
            if (logproXiL_master[j][i] > maxTemp)
                maxTemp = logproXiL_master[j][i];
        }
    }
    /* Normalize the probabilities to [0, 1] */
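The max-finding loop above is the heart of the log version of the algorithm: the per-cluster log probabilities of a 160-dimensional spike are large negative numbers, so exponentiating them directly would underflow to zero. Subtracting the column maximum first keeps every exponential in (0, 1]. A self-contained sketch of the idea (illustrative names, not the DSP code itself):

    #include <math.h>

    /* Sketch: numerically stable log(sum_j exp(lp[j])) over n clusters. */
    float log_sum_exp(const float *lp, int n)
    {
        float m = lp[0];
        for (int j = 1; j < n; j++)
            if (lp[j] > m)
                m = lp[j];            /* column maximum */
        float s = 0.0f;
        for (int j = 0; j < n; j++)
            s += expf(lp[j] - m);     /* each term is now in (0, 1] */
        return m + logf(s);
    }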
d hardware will reduce the training time needed to develop sophisticated models, which will allow more time to be devoted to data collection. This is important in electrophysiology, where limited recording times are the norm.

We implement a Gaussian Mixture Model based spike sorting system on both a DSP system and a Field Programmable Gate Array (FPGA) platform. The FPGA implementation speeds up the system dramatically, which will allow an online spike sorting system to be implemented.

Gaussian Mixture Model

If we view the measured spike signal as the combination of the original action potentials with additive random background noise, we can use a statistical process to model the spike signal generating process. Clustering can then be viewed as modeling the statistical distribution of the observed data. The whole data set can then be modeled with a mixture model, with each cluster being a multivariate Gaussian distribution.

Gaussian mixture models have been found very useful in many signal processing applications, such as image processing, speech signal processing and pattern recognition [17, 18]. For the spike sorting problem, a number of studies based on the Gaussian Mixture Model have also been tried to provide statistically plausible, complete solutions. These studies have shown that better performance can be obtained by using a GMM than by other general clustering methods, such as K-means [19], fuzzy c-means [20] and ne
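For reference, the mixture density being described (Eq. 2.1 and Eq. 2.2 of the thesis) has the standard form

$$p(\mathbf{x} \mid \Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x} \mid \theta_i), \qquad \sum_{i=1}^{M} \alpha_i = 1,$$

where each component is a $d$-dimensional Gaussian

$$p_i(\mathbf{x} \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i)\right).$$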
d increase is shown in the bar plot. The dashed line shows the threshold value.

Initialization of the EM Algorithm

The selection of the initial model parameters will affect the final results. This is because the likelihood surface associated with a GMM tends to exhibit multiple local maxima; different initialization points will lead to different local maximum results. A generally used method is to apply the K-means clustering algorithm first and then use its output as the initial value of the mixture model's parameters [16]. However, the K-means method itself is also sensitive to initialization and is not guaranteed to provide the right initialization for the mixture model. Furthermore, the K-means algorithm is computationally expensive to implement.

Another, more straightforward method is to randomly select the initialization from the data set. It has been shown that this method gives the same performance as the K-means method [24]. In this paper, we initialize the means to randomly chosen data points and pick a single covariance matrix for all the clusters. We set the values of the initialized covariance matrix to be reasonably large so that each cluster can see all the data at the beginning of the training. We randomly picked 100 starting points from the data set and chose the result that provided the largest likelihood. Figure 3.3 shows the means of the 5 initialization clusters. The covariance matrix Σ and prior probabilities
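A minimal sketch of the initialization just described, with the means drawn from the data and one broad shared diagonal covariance. The uniform starting priors and the covariance value of 100 are illustrative assumptions; the text above only requires the covariance to be reasonably large:

    #include <stdlib.h>

    #define NDATA 720
    #define NDIM  160
    #define M     5    /* number of clusters */

    /* Sketch: one random restart of the GMM initialization. */
    void init_gmm(float data[NDATA][NDIM], float means[M][NDIM],
                  float covar[NDIM], float priors[M])
    {
        for (int k = 0; k < M; k++) {
            int idx = rand() % NDATA;       /* randomly chosen data point */
            for (int d = 0; d < NDIM; d++)
                means[k][d] = data[idx][d];
            priors[k] = 1.0f / M;           /* assumed uniform starting priors */
        }
        for (int d = 0; d < NDIM; d++)
            covar[d] = 100.0f;              /* large shared diagonal covariance */
    }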
d was the diagonal matrix, as shown in the next section.

[Figure 3.5 Results of the three different approaches. The diagonal covariance approach was the only method that captured the small cluster.]

As we mentioned in the previous section, it was hard to validate the clustering result since we didn't actually know the typical spike shapes that came from individual neurons. However, by looking at the mean vectors of each cluster, we had a good idea about the ability of each approach to capture the clusters in the data. Figure 3.5 shows all the cluster means for the three methods. All the spikes are in yellow and the mean vector of each cluster is shown in a different color. While all three approaches caught the main clusters, the diagonal matrix approach did the best at catching the smallest cluster (red line). The other two methods failed to catch this smallest cluster. Notice that in the reconditioned approach, the prior probabilities for the small clusters tend to be big. This is because adding a number to the covariance prevents the cluster from being small. Also notice that the biggest cluster in the reconditioned and PCA methods is different from the one in the diagonal covariance matrix method. This is due to the inability of these two methods to distinguish small variances in the low amplitude waveforms. They actu
e Expectation Maximization (EM) algorithm. EM algorithms have been widely used in estimating the parameters of a stochastic process from a set of observed samples. Details about the ML and EM algorithms will be discussed in the next section.

After estimation of the model parameters, classification is performed by the Bayesian decision rule. The probability that each observation $x_i$ belongs to a particular class $\omega_k$ is calculated by using Bayes' rule (Eq. 2.3):

$$p(\omega_k \mid x_i) = \frac{p(x_i \mid \omega_k)\, p(\omega_k)}{\sum_{j=1}^{M} p(x_i \mid \omega_j)\, p(\omega_j)} \qquad \text{(Eq. 2.3)}$$

Each individual density $p(x_i \mid \omega_k)$ is determined by the means and covariance of each component density using Eq. 2.2. The prior probability $p(\omega_k)$ is identical to the weight $\alpha_k$ in Eq. 2.1.

Since the cluster membership is probabilistic, the cluster boundaries can be computed as a function of a confidence level. The spikes that don't fall into any cluster will be considered noise, or overlapping spikes that can be dealt with later. Mathematically, it can be shown that if the model is accurate, the Bayesian decision boundaries will be optimal, resulting in the fewest number of misclassifications [23].

[Figure 2.4 Block diagram of the Bayesian decision rule used for the spike sorting system. The log probability is computed for each individual cluster (Cluster 1 through Cluster M), and a Max cluster-selection stage picks the highest scoring cluster as the result.]
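Because the denominator of Eq. 2.3 is the same for every class, the decision in Figure 2.4 reduces to an argmax over the per-cluster log scores. A sketch (names are illustrative, not the thesis code):

    /* Sketch: pick the cluster with the highest log p(x|w_k) + log p(w_k). */
    int classify_spike(const float *log_px_given_k, const float *log_prior, int M)
    {
        int best = 0;
        float best_score = log_px_given_k[0] + log_prior[0];
        for (int k = 1; k < M; k++) {
            float s = log_px_given_k[k] + log_prior[k];
            if (s > best_score) {
                best_score = s;
                best = k;
            }
        }
        return best;   /* index of the identified cluster */
    }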
#include <vector.h>
#include <21160.h>
#include <def21160.h>
#include "speeddsp.h"
#include "diag.h"
#include "master_2.h"

#define DMA 1                 /* Use DMA channel for external memory data transactions */
#define GLOBAL 0x0750000      /* Global flag */
#define DSP_JUMP 0x0100000    /* Memory address jump between DSPs */
#define LH_BLOCK_1 0x0250002  /* Base address of likelihood of 1st block */
#define LC_FLAG_1 0x0250001   /* Base address of local flag */
#define MULTI_BASE 0x0700000
#define EXTERNAL_BASE 0x0800000
#define SLAVE_BASE_1 0x0200000
#define SLAVE_BASE_2 0x0300000
#define SLAVE_BASE_3 0x0400000
#define START 1               /* Define start for global flag */
#define STOP 0                /* Define stop for global flag */
#define nprocessors 4         /* How many processors in the system */
#define PR_BLOCK_1 0x0250004  /* Base address for middle results of DSP 1 */
#define P_BLOCK_1 0x025064C   /* Base address for transfer results to DSP 1 */
#define GLOBAL_P 0x075064C    /* Global address for transfer results to all DSPs */
#define PRIOR_LENGTH 3        /* Length of priors array */
#define CANDC_LENGTH 800      /* Length of centers and covars array */
#define P_BLOCK_LENGTH 1606   /* Length of the parameters block */
#define SEM_GLOBAL ((int)MSGR0 + (int)MULTI_BASE)
#define SEM_LOCAL_1 ((int)MSGR1 + (int)SLAVE_BASE_1)
#define SEM_LOCAL_2 ((int)MSGR2 + (int)SLAVE_BASE_2)
#define SEM_LOCAL_3 ((int)MSGR3 + (int)SLAVE_BASE_3)
#define SEM_DMA_WRITE
e appropriate for the likelihood $\mathcal{L}(\Theta \mid X, \mathbf{y})$. Given $\Theta^g$, we can easily compute $p(x_i \mid \Theta^g)$ for each data vector $x_i$ and each component density $y_i$. In addition, the mixing parameters $\alpha_y$ can be thought of as prior probabilities of each mixture component, that is, $\alpha_y = p(\text{component } y)$. Therefore, using Bayes's rule, we can compute

$$p(y_i \mid x_i, \Theta^g) = \frac{\alpha_{y_i}^g\, p_{y_i}(x_i \mid \theta_{y_i}^g)}{\sum_{k=1}^{M} \alpha_k^g\, p_k(x_i \mid \theta_k^g)} \qquad \text{(Eq. 2.14)}$$

and

$$p(\mathbf{y} \mid X, \Theta^g) = \prod_{i=1}^{N} p(y_i \mid x_i, \Theta^g) \qquad \text{(Eq. 2.15)}$$

Here we assume that each neuron fires independently of the others. Now we have obtained the desired marginal density by assuming the existence of the hidden variables and making a guess at the initial parameters of their distribution. We can now start the E-step of the EM algorithm and use this marginal density function to calculate the expected value of the log likelihood of the whole data set. The expected value is found to be (the derivation can be found in Appendix A):

$$Q(\Theta, \Theta^g) = \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log(\alpha_\ell)\, p(\ell \mid x_i, \Theta^g) + \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\big(p_\ell(x_i \mid \theta_\ell)\big)\, p(\ell \mid x_i, \Theta^g) \qquad \text{(Eq. 2.16)}$$

With the expression for the expected value of the likelihood, we can go to the M-step to maximize this expected likelihood with respect to the parameters $\Theta = (\alpha, \mu, \Sigma)$. We can maximize the term containing $\alpha_\ell$ and the term containing $\theta_\ell$ independently, since they are not related.

To find the expression for $\alpha_\ell$, we have to solve Eq. 2.17 with the constraint of Eq. 2.18:

$$\frac{\partial}{\partial \alpha_\ell}\left[\sum_{\ell=1}^{M}\sum_{i=1}^{N} \log(\alpha_\ell)\, p(\ell \mid x_i, \Theta^g)\right] = 0 \qquad \text{(Eq. 2.17)}$$

$$\sum_{\ell=1}^{M} \alpha_\ell = 1 \qquad \text{(Eq. 2.18)}$$
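Introducing a Lagrange multiplier for the constraint of Eq. 2.18 and solving gives the standard mixture weight update, consistent with the derivation in Appendix A:

$$\alpha_\ell^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} p(\ell \mid x_i, \Theta^g).$$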
e status of the Master processor. Each slave processor had its own local semaphore which told the Master DSP its status.

Table 4.1 List of semaphores and their functions.

                        SET (1)                         CLEAR (0)
Master Semaphore        Master processor is running     Master task finished
Local Semaphores 1-3    Slave processor is running      Slave task finished
DMA Semaphore           DMA bus available               DMA bus occupied

The system diagram is shown in Figure 4.3. The algorithm was divided into 6 phases. The phases in shaded blocks were implemented on multiple DSPs. Notice that the master processor also acts like a slave while doing the parallel computation. Most of the computationally intensive parts, which were discussed in Section 4.2, were implemented on multiple DSPs. The parts implemented solely on the master processor were basically just additions. So, during the whole training process of the EM algorithm, most of the time all four processors were running at full speed.

Another significant amount of time was spent on transferring data between DSPs. The cluster bus ran at 50 MHz and was shared by all four DSPs as well as the on-board SDRAM. We first let the DSPs resolve the bus contention automatically. The on-chip arbitration logic of the ADSP-21160 processors determined which DSP was the current bus master based on the Bus Request (BR) pins of the DSPs. Multiple DSP implementations can be done in this mode. However, the approach
er methods, such as wavelet denoising [8], can also be used to remove the recording noise. The second stage is spike detection. This is usually achieved by a simple threshold method, in which spikes are picked up when the maximum amplitude is bigger than a manually set threshold value. However, other methods, including nonlinear energy operators [9], wavelet based detectors [10], and slope shape detectors [8], have been used to improve the detection performance, especially in low SNR situations. A new approach, in which a noise model is first generated and then the distance from this noise model is computed to identify spikes [7], is used to obtain the spikes for this paper. However, even in this approach a threshold must be chosen.

Once the data has been collected, features must be extracted to be used in classification. The features can be simple user-defined features such as peak spike amplitude, spike width, slope of initial rise, etc. Another widely used method for feature extraction is Principal Components Analysis (PCA) [11]. PCA is used to find an ordered set of orthogonal basis vectors that can capture the directions of largest variation in the data. The first K principal components describe the most variation in the data and can be used as features to classify the spikes. Wavelet decomposition has also been used to define spike features in a combined time-frequency domain. Under high background noise, another linear trans
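To illustrate the noise-model detection stage described at the start of this passage: under a Gaussian noise model, a natural distance is the variance-normalized squared deviation, and a spike is declared when it exceeds the user-set threshold. This sketch is an assumption about the shape of such a detector, not the exact method of [7]:

    /* Sketch: flag a spike when a window's distance from the Gaussian
       noise model (per-sample mean and variance) exceeds a threshold. */
    int is_spike(const float *window, const float *noise_mean,
                 const float *noise_var, int len, float threshold)
    {
        float dist = 0.0f;
        for (int i = 0; i < len; i++) {
            float d = window[i] - noise_mean[i];
            dist += d * d / noise_var[i];
        }
        return dist > threshold;
    }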
er mode, the cluster bus is still the bottleneck for speed in the multiple DSP system.

Performance on the DSP system could probably be enhanced by more manual tuning at the assembly language level. Using DSPs with a higher core speed or clustering more DSPs could also improve the performance. However, the improvement would be very limited and too much time and effort would be required. So, to gain better performance for the real time spike sorting system, we turned to another solution, namely Field Programmable Gate Arrays (FPGAs).

In the next chapter, we will implement our EM algorithm on a Xilinx Virtex II FPGA system. To implement the EM algorithm on FPGAs, a fixed point version of the algorithm was first developed. Also, specialized math operations had to be transformed to facilitate the implementation of the algorithm on an FPGA. These problems will be addressed in the next chapter and the implementation results will be presented.

CHAPTER 5

FPGA IMPLEMENTATION

Introduction

Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware that can be changed in electronic structure either statically or dynamically. Because the implementation is on the actual hardware, FPGAs can obtain a significant performance improvement over software implementations based on DSP chips or PCs [30].

One drawback of an FPGA implementation is that, for complex algorithms like the EM algorithm, it usually takes a long time to design and optimize the alg
ers without processor intervention.

The ADSP-21160 offers powerful features for implementing multiprocessing DSP systems. The external bus supports a unified address space that allows direct interprocessor access to each ADSP-21160's internal memory. Six link ports provide a second method of multiprocessing. Based on the link ports, a large multiprocessor system can be constructed in a 2D or 3D fashion.

Bittware Hammerhead PCI board and VisualDSP++

The structure of the Bittware Hammerhead PCI board is shown in Figure 4.1. This board has four Analog Devices ADSP-21160 processors, up to 512 MB of SDRAM, 2 MB of Flash memory, and two PMC mezzanine sites. Bittware's FIN ASIC is used to interface the ADSP-21160 DSPs to the 64-bit, 66 MHz PCI bus, the Flash memory, and a peripheral bus. The Hammerhead PCI's four DSPs can communicate with the host PC through the PCI bus. Four of the link ports are connected to other DSPs, which allows the DSPs to communicate through the link ports. Each processor is also connected to a common 50 MHz, 64-bit ADSP-21160 cluster bus, which gives it access to the other three processors and to up to 512 MB of SDRAM.

In our application, we used all four ADSP-21160 processors on the board as well as the 256 MB SDRAM. We used the cluster bus to communicate between the four DSPs and SDRAM. The cluster bus approach is relatively easy to implement. It was also faster than using the link ports.

We used Vis
estimation procedure is given.

Maximum Likelihood Estimation

Maximum likelihood parameter estimation is a general and powerful method for estimating the parameters of a stochastic process from a set of observed samples. We have a density function $p(\mathbf{x} \mid \Theta)$ that is governed by the set of parameters $\Theta$. We also have a data set of size $N$, supposedly drawn from this distribution, i.e., $X = \{x_1, x_2, \ldots, x_N\}$. We assume that these data vectors are independent and identically distributed (i.i.d.) with distribution $p(\mathbf{x} \mid \Theta)$. Therefore, the resulting probability for all the samples is

$$p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta) = \mathcal{L}(\Theta \mid X) \qquad \text{(Eq. 2.4)}$$

This function $\mathcal{L}(\Theta \mid X)$ is called the likelihood of the parameters given the data, or the likelihood function. The likelihood is treated as a function of the parameters $\Theta$ where the data $X$ is fixed. For maximum likelihood parameter estimation, our goal is to find the $\Theta$ that maximizes $\mathcal{L}(\Theta \mid X)$. That is, we wish to find $\Theta^*$ where

$$\Theta^* = \arg\max_{\Theta}\, \mathcal{L}(\Theta \mid X) \qquad \text{(Eq. 2.5)}$$

If $p(\mathbf{x} \mid \Theta)$ is simply a single Gaussian distribution with $\Theta = (\mu, \sigma^2)$, then we can set the derivative of the likelihood function to zero and solve Eq. 2.6 directly for $\mu$ and $\sigma^2$:

$$\frac{\partial\, \mathcal{L}(\Theta \mid X)}{\partial \Theta} = 0 \qquad \text{(Eq. 2.6)}$$

However, in our case using the GMM, $p(\mathbf{x} \mid \Theta)$ is a weighted sum of several single Gaussian distributions. Attempting to solve Eq. 2.6 directly for the GMM parameters $\Theta = (\alpha_i, \mu_i, \Sigma_i),\ i = 1, \ldots, M$, does not yield a closed form solution [25]. Thus we mu
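For the single-Gaussian case mentioned above, solving Eq. 2.6 gives the familiar closed-form estimates

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2,$$

which is exactly the kind of closed form that the mixture case lacks.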
forming method, based on entropy maximization, has been reported to generate good features for classification [12].

People use extracted features instead of the fully sampled waveform because of the "Curse of Dimensionality," which means the computational costs and the amount of training data needed grow exponentially with the dimension of the problem. In high dimensional space, more data are needed to estimate the model densities. Hence, with limited computational ability and training data, high dimensional problems are practically impossible to solve, especially in regards to statistical modeling. A person's inability to view high dimensional data is another reason to decrease the data dimension to two or three for manual clustering.

In this paper, we present a log version of the Expectation Maximization (EM) algorithm which solves the numerical problems arising from the high dimensional input based on the Gaussian Mixture Model. In this case, we can use the fully sampled spike waveforms, which preserve all the information in the neural spikes, as the classifier input.

The last stage is the clustering of spikes. In this stage, spikes with similar features are grouped into clusters. Each cluster represents the spikes that arise from a single neuron. In most laboratories, this stage is done manually with the help of visualization software. All the spikes are plotted in some feature space and the user simply draws ellipses or polygons around set
he specimens' bodies. Details about the experiments can be found in [27]. After the raw data was collected, neural spikes were detected by first identifying segments having no spikes. A noise model was generated by assuming that the noise was Gaussian. Then, spikes were detected if the distance between the recorded data and the noise model was bigger than a manually set threshold [7]. After the threshold was set, we got rid of any waveforms that were obviously not action potentials, leaving 720 spikes for training the EM algorithm. The 720 spikes are shown in Figure 3.1. Each spike lasts 3.2 ms and has 160 sampling points. This set of 720 160-dimensional vectors was used in the EM algorithm.

Figure 3.1 Plot of all the 720 neural spike waveforms.

Practical Problems Related to the EM Algorithm

Model Number Selection

The number of Gaussians for the mixture model must be selected before using the EM method. However, in spike sorting, we do not have a priori information about how many neurons are producing action potentials. So we have to estimate the model number using the whole data set.

Based on the probabilistic foundation of the GMM, we can view the choice of the optimal cluster number as the problem of fitting the data with the best model. This problem can be solved by applying the maximum likelihood estimate. However, the likelihood is a monotonically increasing function of the number of parameters. In other wo
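The selection rule implied by Figure 3.2 (grow the model until the gain in log likelihood from one more cluster drops below a threshold) can be sketched as follows; the array layout and names are illustrative, not the thesis code:

    /* Sketch: loglike[m] is the trained log likelihood using (m+1) clusters.
       Return the first model order whose improvement falls below thresh. */
    int select_model_order(const float *loglike, int max_models, float thresh)
    {
        for (int m = 1; m < max_models; m++) {
            if (loglike[m] - loglike[m - 1] < thresh)
                return m;   /* m clusters are enough */
        }
        return max_models;
    }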
newLikeHood = quantize(q1, LogLikeHoodX);

%!ACCEL SYNTHESIS OFF
newLikeHood_bin = num2bin(q1, newLikeHood);
newLikeHood_num = bin2num(q1_int, newLikeHood_bin);
fprintf('%d: LH is %d, Improve is %d, Num is %d, Bin is %s\n', ...
    LoopCount, newLikeHood, step_incr, newLikeHood_num, num2str(newLikeHood_bin));
%!ACCEL SYNTHESIS ON

if step_incr < 0
    step_incr = -step_incr;
end

% Function to get the inv of the mid_priors
% >>>>>>> mid_priors_inv = fun_lut_priors_mid_inv(mid_priors)
%!ACCEL SHAPE mid_priors_inv_addr(1)
%!ACCEL SHAPE mid_priors_inv(5)
%!ACCEL SHAPE lut_inv_content(1024)
for m_8 = quantize(q_int, 1:ncluster)
    mid_priors_inv_addr = accel_bitshl(mid_priors(m_8), 1, q1, q_addr);
    mid_priors_inv(m_8) = quantize(q_lut_inv, lut_inv_content(mid_priors_inv_addr+1));
    % Change to +2 for simulation
end

%!ACCEL TILE m_11
%!ACCEL USE embedmults(m_11+20)
for m_11 = quantize(q_int, 1:ncluster)
    priors(m_11) = quantize(q_priors, mid_priors(m_11) * ndata_inv);
    mid_priors(m_11) = quantize(q_mid_priors, const_zero);
end

q_test_mult = quantizer('floor', 'wrap', [36 21]);

for d_3 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_9
    %!ACCEL MEM_MAP mid_centers(m_9) TO ram_s18_s18(m_9+20) AT 0
    %!ACCEL MEM_MAP centers(m_9) TO ram_s18_s18(m_9+15) AT 0
    %!ACCEL USE embedmults(m_9+25)
    %!ACCEL SHAPE centers_test(5)
    for m_9
imized for high speed fixed point addition, subtraction and multiplication. However, a number of other math operations, such as division or exponentiation, have to be customized.

Using a Taylor series is one option, which can transform a complicated operation into simple additions and multiplications. However, for the EM algorithm, the dynamic ranges of these operations are quite large, so a very high order Taylor series would be required to obtain good approximations. Another way is to use a lookup table (LUT). Knowing the input range of these math operations, we can use a one-to-one lookup table to approximate them. By dividing the input dynamic range into substantially smaller intervals, this approach gains enough precision for the algorithm to converge. The BlockRAM on the Virtex II FPGA can be used to store the contents of the lookup tables, and their address buses can be used as the lookup table inputs. The lookup table operations can be done within one system clock cycle based on the BlockRAMs.

Nonlinear mapping and output interpolation can be used to improve LUT performance. Here, we just used a single linear LUT approach, in which we divided the input data space evenly into small intervals and mapped every point in the input space to the output space. Two factors, the input precision and the LUT size, have to be determined for each LUT depending on the input dynamic range. These two factors have the following relat
ing inside the FPGA. However, this would take much more work.

Another way to speed up the current system is to find the critical path of the FPGA implementation and add more registers. By increasing the pipeline depth of the critical path, we can increase the clock rate of the whole system. This could be explored in AccelFPGA by rewriting the Matlab code to produce more intermediate results and adding more compiler directives.

The current implementation on the Virtex II XC2V3000 FPGA used about half of the embedded hardware resources. More parallelism in the EM algorithm could be exploited by using more of the hardware resources in the FPGA. For example, in the training process for each cluster, the whole training data set could be divided into smaller blocks, and the computation on all the data blocks could be parallelized on the FPGA. Also, the use of multiple FPGAs could enable more complex implementations requiring more hardware resources than what is contained on a single FPGA.

BIBLIOGRAPHY

[1] A. L. Hodgkin & A. F. Huxley, 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500-544.

[2] F. Rieke, D. Warland, R. de Ruyter van Steveninck, & W. Bialek, 1997. Spikes: Exploring the Neural Code. MIT Press, Cambridge, Massachusetts.

[3] J. G. Nicholls
73. ionship     Dynamic Range    LUT size        Precision    Eq  5 1    Ideally  we want the dynamic range to be as large as possible while the precision  as small as possible  which leads to an infinitely large LUT size  However  in most of  the real world applications  the input dynamic range is limited to a certain range  The  precision requirement may vary for different applications  For the EM algorithm  all  we needed was to estimate the parameters for the GMM and be able to classify the  neural spikes  As long as the EM algorithm converges to the right solution  the  precision of each variable is not extremely important  Hence  a reasonable size of the    LUT can be used for each math operation in the EM algorithm and can be    implemented on the Virtex II FPGA platform              61    We built the LUTs using the BlockRAM in the Xilinx Virtex II FPGA  As  described in section 5 2 2  the BlockKRAM are 18 Kbit block memories  Each 18 Kbit  block memory can be configured into various address data aspect ratios as shown in  Table 5 1  We chose the 1K 18bit configuration for our LUT implementation  The  18bit output bit width gave enough precision for the LUT outputs  Furthermore  the  output width matched the embedded multipliers    input width which maximized the  usage of the embedded multiplier resources    After choosing the LUT size  which was 1024  1K  in our case  we found the  input precision by tracking the input dynamic range  Once the input dynamic range  was de
74. ize q1 priors 1     outdatabuf 7   quantize q1 priors 2     outdatabuf 8   quantize q1 priors 3     outdatabuf 9   quantize q1 priors 4     outdatabuf 10   quantize q1 priors 5     outdatabuf 11   quantize q1 newLikeHood      outdata_1 n    quantize ql outdatabuf 1    outdata 2 n    quantize q1 outdatabuf_2    outdata 3 n    quantize q1 outdatabuf_3    outdata 4 n    quantize q1 outdatabuf 4    outdata  5 n    quantize q1 outdatabuf 5    outdata_6 n    quantize ql outdatabuf 6    outdata 7 n    quantize q1 outdatabuf 7    outdata  8 n    quantize ql outdatabuf 8    outdata_9 n    quantize ql outdatabuf 9    outdata_10 n    quantize q1 outdatabuf 10   1  outdata 11 n    quantize q1 outdatabuf 11      ACCEL END HARDWARE  outdata l outdata 2 0utdata 3 outdata 4 outdata S outdata 11  end   end                             762 10392217 3          
lgorithm.

Figure 4.2 shows the most computationally intensive parts of the EM algorithm. The calculation of $p(x_i \mid \theta_\ell)$ and the update of the means and covariance matrices take about 70% of all the execution time. These two parts should be implemented on multiple DSPs. The calculation of $p(\omega_\ell \mid x_i)$ and $\mathcal{L}(\Theta \mid X)$ and the rest of the algorithm are relatively less computationally intensive. A close look at parts (a) and (b) shows that most of the computational operations are multiply and multiply-accumulate type instructions. Thus the floating point MAC inside the ADSP-21160 will be able to speed up the operations.

Parallel Transformation of the EM Algorithm

To fully use the computational power of the four DSPs on the Bittware board, the EM algorithm was transformed into a parallel version. The most straightforward way is to map the computation of each cluster onto a single DSP. However, there are several drawbacks to this approach.

First, there are five clusters in our model while there are only 4 DSPs on the Hammerhead board. Hence we would have to put the computation of two clusters on one DSP, creating an uneven computational load. This would reduce the speed of the whole system, since three DSPs would have to wait while the fourth dealt with the extra computation for the second cluster. Secondly, the update of the parameters is not independent among the clusters. Before updating the parameters of each cluster, the probability of each data point in the current
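The alternative suggested by the phase diagram in Figure 4.3 is to split the training data evenly across the processors, so each DSP evaluates all five clusters on its own 180-point block. A sketch of the bookkeeping (illustrative, not the Appendix C code):

    #define NPROCESSORS 4

    /* Sketch: assign each DSP a contiguous block of the 720 training points. */
    void partition_data(int ndata, int proc_id, int *start, int *count)
    {
        int block = ndata / NPROCESSORS;       /* 720 / 4 = 180 per DSP */
        *start = proc_id * block;
        *count = (proc_id == NPROCESSORS - 1)  /* last DSP takes any remainder */
                     ? ndata - *start
                     : block;
    }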
76. likely to generate a large number after many             Using  30 bit word length prevented overflow problems for the accumulation results    After deciding on the word length  the next step was to choose the binary point  for each variable  This step actually determined the precision we got from the FPGA    implementation  Since some parts of the algorithm have been transformed into the log                59    domain  the dynamic ranges of each variable are quite different  So with the same  word length  each variable will need its own binary point to meet the precision  requirement  This  would be quite hard to implement in a FPGA if an HDL coding  approach was used  Whenever a math operation occurs between variables with  different binary points  a shift is needed to align these variables  A lot of work has to  be done in the low level HDL programming to make sure every math operation has  the binary point aligned correctly  But in AccelFPGA  the compiler will automatically  align the binary point  This allows the binary point of each variable to be easily  changed and the algorithm tested to make sure the fixed point implementation is  sod correctly  The newest version of AccelFPGA  V2 0  can automatically find  the suitable ond length and binary point for your algorithm  However  at the time the  thesis was done  this feature was not available and all the binary points were tuned    manually in AccelFPGA V1 6     Using Lookup Tables    The structure of the FPGAs is opt
77. lloc  ndim  sizeof  float      meanTemp    float    calloc  ndim  sizeof  float         detcovTemp    float    calloc  ndim  sizeof  float    1    for  j 0 j lt ndim j         1  invcovTemplj    1 covars j                log version of the diagonal covar   vlog    DD detcovTemp covars ndim        det   prod cov   logdet   sum logcov   divider   vsum  sf  D detcovTemp 1 ndim    divider   divider   0 5     vecvsub blockdata centers meanTemp ndim        vecvmlt meanTemp meanTemp meanTemp ndim    vecvmlt meanTemp invcovTemp meanTemp ndim    proXiTemp   vsum  sf  D meanTemp 1 ndim    proXiTemp    0 5 proXiTemp   proXi   proXiTemp   divider   log_normal     free invcovTemp      free mean Temp    free detcovTemp      return  proXi           void block_sum float blocksums   float blockdata   ndim  float logproLXi    int offset        int i j 0     for G 0 i lt ndim i                      110    for       0 2  lt nblo ckdata j           blocksums i    blocksums i    blockdata j  i    logproLXi j      int extmem_to_intmem  int  extsrc  int  intdest  int num to copy  int dma  channel  int wait         int chan_offset    int DSP_offset    int New_intdest   int dma  status  bit   int testwait        if internal memory is in MASTER  no offset and intdest need not change         offset   0  5  New  intdest    int  intdest        Check for valid DMA channel    if   dma  channel  lt  10      dma  channel  gt  13     return  FALSE      chan  offset    dma  channel   10   lt  lt  3        if in
loating point result. (b) Fixed point result.

Figure 5.10 The clustering results of both floating-point and fixed-point implementations.

FPGA Performance

In the post-synthesis simulation, it took 0.6M clock cycles to finish one step of the EM algorithm in the FPGA. Running at 60 MHz, it took 0.6M x (1/60M) x 8 = 0.08 second to finish the training process. Table 5.10 shows that the FPGA implementation is 16.4 times faster than the fastest PC and 42.5 times faster than the parallel DSP implementation.

Table 5.10 Performance comparison between all the platforms.

                                    Single DSP  Four DSP  PC Average(1)  PC Best(2)  FPGA(3)
  Execution time (s)                13.5        3.4       3.22           1.31        0.08
  Speedup relative to best PC       0.1         0.4       0.4            1           16.4
  (times)

(1) Average over the following four systems: (1) AMD 1.37 GHz, 512 MB memory; (2) Pentium III, 1.0 GHz, 512 MB memory; (3) AMD 1.2 GHz Duo Processor, 4 GB memory; (4) Pentium IV, 3.2 GHz, 1 GB memory.
(2) Best performance, from the Pentium IV 3.2 GHz system with 1 GB memory.
(3) AccelFPGA V1.6, ISE V5.2.02i and Modelsim Starter 5.6e were used in the FPGA implementation.

CHAPTER 6

CONCLUSION AND FUTURE WORK

Conclusion

This thesis presents the implementation results of a Gaussian Mixture Model training process on three different platforms: PCs, multiple DSPs, and an FPGA. The Expectation Maximization algorithm was used to estimate the parameters of the GMM. The fastest implementation was obtained b
lock Manager (DCM) blocks. The functionality of these elements will be introduced later. All of these elements are interconnected by routing resources. The general routing matrix is an array of routing switches; each programmable element is tied to a switch matrix, allowing multiple connections to the general routing matrix. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. These values are loaded into the memory cells during configuration and can be reloaded to change the functions of the programmable elements.

Configurable Logic Blocks (CLBs) are the basic functional units of Xilinx FPGAs. These blocks are organized in an array and are used to build combinational and synchronous logic designs. Each CLB element comprises four similar slices. The four slices are split into two columns, with two independent carry-logic chains and one common shift chain. Figure 5.2 shows the details of a slice inside one CLB. Each slice includes two 4-input function generators, carry logic, arithmetic logic gates, wide-function multiplexers and two storage elements. Each 4-input function generator can be programmed as a 4-input LUT, 16 bits of distributed SelectRAM memory, or a 16-bit variable-tap shift register. Each CLB is wired to the switch matrix to access the general routing matrix, so it can be interconnected to other CLBs.
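The 4-input function generators are the key to this flexibility: a 4-input LUT can realize any Boolean function of its inputs simply by storing the function's 16-entry truth table. The small C sketch below is our own illustration of that idea, not code from the thesis; the lut4 helper and the XOR configuration constant are assumptions made for the example.

    #include <stdio.h>

    /* Evaluate a 4-input LUT: 'config' holds one output bit per input pattern. */
    static int lut4(unsigned short config, int a, int b, int c, int d)
    {
        int index = (d << 3) | (c << 2) | (b << 1) | a;  /* pack inputs into 0..15 */
        return (config >> index) & 1;
    }

    int main(void)
    {
        /* Example config: output 1 when an odd number of inputs are 1 (4-way XOR). */
        unsigned short xor4 = 0x6996;
        printf("%d\n", lut4(xor4, 1, 0, 1, 0));  /* prints 0, since 1^0^1^0 = 0 */
        return 0;
    }

Reprogramming the 16-bit constant changes the implemented logic function without touching the evaluation circuit, which is exactly what loading the FPGA's configuration memory does in hardware.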
mances on the PC were from Matlab implementations. The performance of the parallel DSP implementation was at the same level as the average PC configuration, but not as good as the high-end PCs.

Table 4.3 Performance of the EM algorithm on the DSP system and PCs.

                        Single DSP(3)  Four DSP(3)  PC Average(1)  PC Best(2)
  Execution time (s)    13.5           3.4          3.22           1.31

(1) Average over the following four systems: (1) AMD 1.37 GHz, 512 MB memory; (2) Pentium III, 1.0 GHz, 512 MB memory; (3) AMD 1.2 GHz Duo Processor, 4 GB memory; (4) Pentium IV, 3.2 GHz, 1 GB memory.
(2) Best performance, from the Pentium IV 3.2 GHz system with 1 GB memory.
(3) VisualDSP++ 3.0 was used in the DSP implementation.

There are two reasons why the parallel DSP method was similar to, if not slower than, the PCs. First, the core speed of the DSP is much lower than that of high-end PC CPUs. The core of the ADSP-21160 runs at 80 MHz, while the core clock of the Pentium IV runs at 3.2 GHz; the high-end PC is about 40 times faster than the DSP. In spite of the DSP core's structure being optimized for signal-processing operations, the overall floating-point performance of a high-end PC is much better than that of an individual DSP. The second reason is the bus speed. Because of the lack of on-chip memory on the DSPs, data has to be transferred between DSPs through the cluster bus. The cluster bus runs at 50 MHz, which is much slower than the front-side bus on the PCs, which runs at up to 133 MHz. Although using the DMA transf
model has to be calculated. In Eq. 4.1, the probability of each data point in each cluster has to be added together across all the clusters, so at this point all the middle results have to be sent to one DSP. Last but not least, the limited on-chip memory requires the DSPs to read data from external memory. Looking closely at the algorithm, the calculation of p(x_i | Θ_l), as well as the update of the means and covariances, needs the whole data set. In other words, during these calculations the DSPs have to fetch the value of every data point and feed it to the ALU for computing. However, with the limitation of the internal memory size, it is impossible to store the whole data set inside a single DSP. The only option is to read the data set from external RAM, i.e. the on-board SDRAM. These operations require extra overhead and slow down the process. They can also disable the SIMD capability of the DSP. Since all the DSPs and the SDRAM share the same 50 MHz cluster bus, multiple simultaneous uses of the bus will cause extra delays and even bus contention.

Due to these drawbacks, we decided to choose another approach. We divide the whole data set into small blocks, and each processor handles one block of data. This way, we eliminate the overhead of transferring data from external memory. With all the data inside the DSPs, we fully use the SIMD mode of the DSPs to accelerate the process. This also allows us to divide the computatio
n can also exhibit underflow problems if all of the terms p(x_i | ω_l, Θ) α_l are very small, which will occur if a spike doesn't belong to any cluster. So we still need to convert Eq. 3.10 into the log domain, as follows:

\ln p(\omega_l \mid x_i, \Theta) = \ln p(x_i \mid \omega_l, \Theta) + \ln \alpha_l - \ln p(x_i \mid \Theta)        Eq. 3.11

Since we have already computed \ln p(x_i \mid \omega_l, \Theta) from Eq. 3.8, it is fairly easy to get the result of Eq. 3.11. We can use exponentiation to get p(ω_l | x_i, Θ) back into the normal domain and begin the parameter updates. All of the parameter updates are done in the normal domain, since no underflow problems occur in this process. The new parameters are then used to calculate the likelihood of the whole data set, again in the log domain, to determine whether the iterative process should stop. The whole process can be seen in Figure 3.4.

Figure 3.4 Diagram of the revised EM algorithm. The E step implemented in the log domain is shown inside the dashed rectangle.

Using a Diagonal Covariance Matrix

While computing Eq. 3.2, we need the determinant of the covariance matrix |Σ_l|. However, it is difficult to get the determinant of the full covariance matrix since the data dimension is large. All the
n evenly for each DSP, regardless of the cluster number. By being independent of the cluster number, this approach also allows scaling to more DSPs. We still need to transfer data between DSPs, because the parameter updates are not independent, but the transfer is limited to the middle results of the parameters, which are much smaller in size than the whole data set. This transmission can be monitored by one master DSP. DMA mode was used to accelerate the parameter transfers.

We divided the whole data set into four blocks. Each block contained 180 data points. These blocks fitted into a single DSP's internal memory. One of the four DSPs was designated the master, while the other three were slaves. The master performed the same computational tasks as the slaves, except for two additional duties: cross-data calculation and the management of DMA transfers. Whenever a cross-data calculation was needed, all the middle results were sent from the slaves to the master through the cluster bus. When the master finished the calculation, the results were sent back to the slaves. When DMA mode was used, the master was the only DSP that had control of all the DSPs' DMA controllers. Synchronization between the DSPs was handled via semaphores. Semaphores were also used to prevent DMA bus contention. Table 4.1 shows all the semaphores that were used in the program. The master semaphore was a global semaphore that all slave processors read to know th
nt in a Gaussian Mixture Model. Details about using GMMs to model the spike sorting process will be described in the next section.

GMMs for spike sorting

Since action potentials are observed with no labeling of the neurons they arise from, spike generation can be considered a hidden process. To see how the hidden process lends itself to modeling the spike waveforms, consider the generative process in Figure 2.2.

Figure 2.2 Illustration of the spike generating process (neurons No. 1 to No. 4, each with a characteristic spike shape plus white noise, feeding the recorded spike train).

The neural spike train is considered to be generated from a hidden stochastic process. Each spike is generated by one of the neurons according to an a priori discrete probability distribution α_i for selecting neuron i. If we assume that the action potentials from each neuron have their own characteristic waveform shape (mean value), with the addition of white noise of different variance (covariance), then the observations from each neuron can be modeled by a multivariate Gaussian model, as seen in Figure 2.3. The observation sequence thus consists of feature vectors drawn from different statistical populations (one per neuron) in a hidden manner.

Figure 2.3
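To make this generative view concrete, the following C sketch (our illustration, not thesis code; the randn helper, array names and sizes are assumptions) draws a single spike from such a mixture: it selects a neuron with probability α_i and adds white Gaussian noise to that neuron's mean waveform.

    #include <math.h>
    #include <stdlib.h>

    #define NDIM 160   /* samples per spike waveform, as in the thesis data */
    #define M    5     /* number of neurons (mixture components)            */

    /* One standard-normal sample via the Box-Muller transform. */
    static double randn(void)
    {
        double u1 = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
    }

    /* Draw one spike: select neuron i with probability alpha[i], then emit
       its mean waveform plus white noise with per-sample variance var[i]. */
    static void draw_spike(const double alpha[M], const double mu[M][NDIM],
                           const double var[M][NDIM], double spike[NDIM])
    {
        double u = rand() / ((double) RAND_MAX + 1.0);
        double acc = 0.0;
        int i = 0, d;

        while (i < M - 1 && (acc += alpha[i]) < u)   /* inverse-CDF draw */
            i++;
        for (d = 0; d < NDIM; d++)
            spike[d] = mu[i][d] + sqrt(var[i][d]) * randn();
    }

The per-sample (diagonal) noise variance in this sketch mirrors the diagonal covariance matrices adopted later in the thesis.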
ntly isolate the action potentials arising from individual neurons.

Single nerve cell recording is possible using intracellular electrodes. Glass pipettes and high-impedance sharp electrodes are used to penetrate a particular nerve cell and monitor its electrical activity directly. This has been used to understand the mechanism of action potentials and many other neuronal phenomena [5, 6]. However, these electrodes are not practical for multi-neuron recording. For intact, freely moving animals, a small movement of an intracellular electrode will easily damage the nerve cell tissue. Furthermore, it is very difficult to isolate a large number of neurons in a small local region, and in awake animals the isolation sometimes lasts for only a very short period of time. Fortunately, since all that we need is the timing of the action potentials, it is possible to use extracellular electrodes to acquire this information. With larger tips than intracellular electrodes, extracellular electrodes can simultaneously record the signals of a small number (3 to 5) of neurons in a local region. Figure 1.1 shows a waveform comprised of action potentials recorded by one extracellular microelectrode. Each voltage spike in the waveform is the result of an action potential from one or more neurons near the electrode. The process of identifying and distinguishing the spikes that arise from different neurons is called "spike sorting".
numbers. The amount of the log likelihood increase is shown in the bar plot. The dashed line shows the threshold value .......... 28

3.3 Initialization of the EM algorithm .......... 29

3.4 Diagram of the revised EM algorithm. The E step implemented in the log domain is shown inside the dashed rectangle .......... 33

3.5 Results of three different approaches .......... 35

3.6 Clustering result from Matlab .......... 37

3.7 Clustering result for the training data. Five clusters are labeled using different colors ..........

LIST OF FIGURES (CONTINUED)

FIGURE                                                                                      PAGE

4.1 Block diagram of Hammerhead PCI system .......... 42

4.2 The diagram shows the most computationally intensive parts in the EM algorithm and their execution percentage in the whole process. (a) Calculation of p(x_i | Θ_l), probability of each data point in each cluster based on the current parameters. (b) Update of means and covariance matrices. (c) Calculation of p(ω_l | x_i, Θ) and L(x_i | Θ), probability of being in each cluster given an individual input data point, and likelihood of each data point in the current Gaussian Mixture Model. (d) Rest of the algorithm .......... 43

4.3 Diagram of multiple DSP implementation of EM algorithm ..........
nv, zeros(ndim, ncluster));
covars_log = quantize(q_lut_cov_log, zeros(ndim, ncluster));
covars_det = quantize(q1, zeros(1, ncluster));
priors_log = quantize(q_lut_log, zeros(1, ncluster) - 1.6094);   % ln(1/5)
result_covars = quantize(q_covars, zeros(ndim, ncluster));
result_centers = quantize(q_small, zeros(ndim, ncluster));
result_priors = quantize(q_priors, zeros(1, ncluster));

BUFSIZE = 160;              % one spike at a time
% Revise: use a frame input buffer, 160 points at once, to see how it works.
NUMSAMPS = 160 * 720;       % input data: 720 spikes, 160 samples each

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Main LOOP
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
while 1
    % data stream in
    %!ACCEL STREAM n
    for n = 1 : ndim : NUMSAMPS - ndim + 1
        % section targeted for hardware starts here
        %!ACCEL BEGIN HARDWARE indata_1 indata_2 indata_3 indata_4 indata_5
        %!ACCEL SHAPE indatabuf_1(ndim)
        %!ACCEL SHAPE indatabuf_2(ndim)
        %!ACCEL SHAPE indatabuf_3(ndim)
        %!ACCEL SHAPE indatabuf_4(ndim)
        %!ACCEL SHAPE indatabuf_5(ndim)

        indatabuf_1 = quantize(q_data, indata_1(n : n+ndim-1));
        indatabuf_2 = quantize(q_data, indata_2(n : n+ndim-1));
        indatabuf_3 = quantize(q_data, indata_3(n : n+ndim-1));
        indatabuf_4 = quantize(q_data, indata_4(n : n+ndim-1));
        indatabuf_5 = quantize(q_data, indata_5(n : n+ndim-1));

        %!ACCEL SHAPE indatabuf(ndim, ncluster)
        for d_0 = quantize(q_int, 1:ndim)
            indatabuf(d_0, 1) = indatabuf_1(d_0);
            indatabuf(d_0, 2) = indatabuf_2(d_0);
            indatabuf(d_0, 3) = indatabuf
oat priors[1][nclusters] = {0};

segment("seg_data") float data[ndata][ndim] = {
#include "rescal_720_data_directload.dat"
};

/* Data to be transmitted to external memory */
float blockdata[nblockdata][ndim] = {0};
segment("seg_pmda") float blockdata_diff[nblockdata][ndim] = {0};

float logproXiL_block_1[nclusters][nblockdata] = {0};
float logproLXi_block_1[nclusters][nblockdata] = {0};
float blocksums[ndim] = {0};
float cycle_count = 0;
float loglhXi_block_1[1][nblockdata] = {0};

/*------------------------------------------------
  Processor identification
  ------------------------------------------------*/
void main()
{
    int i = 0, j = 0;
    int i2 = 0;

    int loopCount = 0;
    int loopCountAll = 0;
    float testlog = 0;
    float testlog2;
    int ntasks = 0;
    int offset = 0;
    int flagsum = 0;
    int offsetTemp = 0;
    int data_offset = 0;
    int *data_start = 0;

    float start_count;
    float stop_count;

    int proTemp = 0;
    int prepro = 0;

    float logproXiLTemp;
    float maxTemp;

    int *global;
    float logproLXiTemp[ndata] = {0};
    float maxTempProLXi[nclusters] = {0};

    SetIOP( SEM_LOCAL, 0 );
    offset = 0;
    loopCount = 0;
    prepro = 0;
    testlog2 = logf( 2.71 );
    testlog = nclusters;

    /*------------------------------------------------
      Get the address of blockdata and transfer to Master
      ------------------------------------------------*/
    SetIOP( SEM_LOCAL, (int) blockdata );

    /*------------------------------------------------
ogLikeHoodXi);
LogLikeHoodX = LogLikeHoodX / ndata;
StoreLike(LoopCount) = LogLikeHoodX;
StopOrNot = abs(PrevLike - LogLikeHoodX);
PrevLike = LogLikeHoodX;

if StopOrNot < StopCri
    break;
end

ProLXi = exp(LogProLXi);

NewPrior2 = ones(1, ndata) * ProLXi';    % NO NEED TO DIVIDE BY NDATA
NewPrior = NewPrior2 / ndata;
NewMean = (data' * ProLXi') ./ repmat(NewPrior2, ndatadim, 1);

% Calculate the covariance
for i = 1:ncluster
    Smeandata = data' - repmat(NewMean(:, i), 1, ndata);
    Smeandata = Smeandata .* Smeandata;
    OldMgmm.covars(i, :) = (ProLXi(i, :) * Smeandata') / NewPrior2(i);
    % Prevent the covariance matrix going singular
    if OldMgmm.covars(i, :) < eps
        fprintf('&&\n');
        OldMgmm.covars(i, :) = ones(1, ndatadim);
    end
end
% End of covariance calculation

% Update the parameters
OldMgmm.centers = NewMean';
OldMgmm.priors = NewPrior;

end    % of the EM whole loop

StoreLike = StoreLike(1:LoopCount);
track = track(:, 1:LoopCount);

% Put the parameters in descending order according to the prior probabilities
[OldMgmm.priors, new_in] = sort(OldMgmm.priors);
OldMgmm.centers = OldMgmm.centers(new_in, :);
OldMgmm.covars = OldMgmm.covars(new_in, :);

APPENDIX C

C CODE FOR THE EM ALGORITHM ON THE PARALLEL DSP PLATFORM

/*------------------------------------------------
  For the Master DSP: Master.c
  ------------------------------------------------*/
#include <stdio.h>
#include <matrix.h>
#include <math.h>
#include <stdlib.h>
#includ
opOrNot > StopCri & LoopCount < MaxLoop)

    % Get old a, u and Sigma
    OldCov = OldMgmm.covars;      % old Sigma's
    OldMean = OldMgmm.centers;    % old u's
    OldPrior = OldMgmm.priors;    % old a's

    % Calculate the probability of each data point Xi in each cluster m = 1..L.
    % The ProXiL matrix stores the probability of each data point Xi in each
    % cluster m. Each column is a data point and each row is a cluster.
    % ProXiL is an L x N matrix.
    LogProXiL = zeros(ncluster, ndata);
    log_normal = log(2*pi) * ndatadim / 2;
    for m = 1:ncluster
        covTemp = OldCov(m, :) .^ (-1);
        detTemp = 0.5 * sum(log(OldCov(m, :)));
        for i = 1:ndata
            poten = (1/2) * (data(i, :) - OldMean(m, :)) .* (data(i, :) - OldMean(m, :)) .* covTemp;
            poten = sum(poten);
            LogProXiL(m, i) = -poten - detTemp - log_normal;
        end
    end

    % Calculate the likelihood of each data point.
    % The likelihood of each data point is stored in a row vector;
    % LikeHoodXi is a 1 x n matrix.
    % The likelihood can not be zero, or the priors will blow up.
    % Use the log-likelihood to avoid zeros.
    LogLikeHoodXi = Logx(LogProXiL, OldPrior);
    % End of calculation of the likelihood of each data point

    % Calculate the probability of omega given Xi (ProLXi).
    % ProLXi is an L x N matrix.
    LogProLXi = LogProXiL + log(repmat(OldPrior', 1, ndata)) - LogLikeHoodXi(ones(ncluster, 1), :);

    LoopCount = LoopCount + 1;

    % Log version of the likelihood of the whole data set, to judge convergence
    LogLikeHoodX = sum(L
or the address line: 10 bits.
% Changed to 30 bits to meet the compiler's requirement.
q_addr = quantizer('ufixed', 'floor', 'wrap', [30 0]);

% These lookup table contents are written to txt files and loaded:
lut_exp_content = quantize(q_lut_exp, load('lut_exp_content.dat'));            % range (0, -4), step 2^-8
lut_log_content = quantize(q_lut_log, load('lut_log_content.dat'));            % range (0, 0.5), step 2^-11
lut_inv_content = quantize(q_lut_inv, load('lut_inv_content.dat'));            % step 0.5

covars_inv_content = quantize(q_lut_cov_inv, load('lut_covars_inv_content.dat'));       % range (0, 0.25), step 2^-12
lut_covars_log_content = quantize(q_lut_cov_log, load('lut_covars_log_content.dat'));   % range (0, 0.25), step 2^-12
lut_exp_ProLXi = quantize(q_lut_exp, load('lut_exp_ProLXi.dat'));
lut_exp_ProLXi = quantize(q_lut_exp, lut_exp_ProLXi);
lut_covars_inv_tile = quantize(q_lut_cov_inv, lut_covars_inv_tile);
lut_covars_log_tile = quantize(q_lut_cov_log, lut_covars_log_tile);

% Load the input data
indata = load('q_indata.dat');
indata = quantize(q_data, indata);
indata_1 = indata;
indata_2 = indata;
indata_3 = indata;
indata_4 = indata;
indata_5 = indata;

break_flag = 0;

% Load the initial values of the model
centers = quantize(q_data, zeros(160, 5));
centers = quantize(q_data, load('centers.dat'));
centers_1 = quantize(q_data, load('centers_1.dat'));
centers_2 = quantize(q_data, load('centers_2.dat'));
centers_3 = quantize(q_data, load('centers_3.d
orithms for the FPGA structures. We used a recently developed tool, namely AccelFPGA [34], to shorten the design time. AccelFPGA can compile Matlab code directly into VHDL code. By using AccelFPGA, we could focus on optimizing the algorithm in Matlab without dealing with the hardware details inside the FPGA. The design cycle of an FPGA implementation using this method can be reduced from months to weeks. An FPGA implementation produced through AccelFPGA is not as optimized as direct VHDL coding. However, for a prototype design, AccelFPGA can provide a time-efficient FPGA solution with reasonably good performance. We will show that our implementation on an FPGA gives much better performance than either the PC or the parallel DSP platform.

The structure of Chapter 5 is as follows. In the next section, the internal structure of the Xilinx Virtex II 3000 FPGA is introduced. Then we introduce the AccelFPGA compiler for FPGAs. In Section 5.4, we address some practical problems of FPGA implementations. A fixed-point version of the EM algorithm is presented and compared to the floating-point version. The transformation of the mathematical operations and the parallelization of the algorithm for the FPGA implementation are also discussed. In the last section, the performance of the EM algorithm on the FPGA platform is presented and compared to the parallel DSP platform and PCs.

The FPGA and Its Structure

Field Programmable Gate Arrays

A Field Programmable Gate
our FPGA implementation can be found in Table 5.5.

The Fixed Point Representation of the EM Algorithm

To use AccelFPGA for the hardware implementation, the first step is to convert the algorithm from a floating-point into a fixed-point representation. The main difference between these two representations is the limited precision. For the EM implementation, the question becomes whether the algorithm will still converge to the right result as it does in the floating-point domain. We used the quantizer function in Matlab to convert the algorithm into a fixed-point version and tested the whole algorithm to make sure it gave the correct results.

The most important part of the fixed-point implementation is to determine the word length and binary point of each variable in the algorithm. AccelFPGA can support words up to 50 bits wide. Although a long word length can provide better precision, more bits will be needed to store and process each variable; thus more resources on the FPGA will be required and the system performance will be reduced. To maximize the system performance, we should use as short a word length as possible while still maintaining enough precision for the algorithm. We chose to use 18 bits for normal variables and 30 bits for accumulation results. Variables with an 18-bit word length can be fitted into the BlockRAM blocks and are easy to feed into the embedded multipliers in the FPGA. Since the data dimension was high for our algorithm, it was
p the iterative process.

3. M step: Estimate new model parameters Θ(i+1) using the current model parameters Θ(i) via the estimation equations. Replace the current model parameters with the new model parameters, which become the parameters for the next E step.

Figure 2.5 Diagram of the EM algorithm for GMM parameter estimation.

Since the new model parameters obtained from the estimation equations guarantee a monotonically increasing likelihood function, the algorithm is guaranteed to converge to a stationary point of the likelihood function. It has been argued that the convergence of the EM algorithm to the optimal parameters of the mixture is slow if the component distributions overlap considerably [25]. But if the clusters are reasonably well separated, which is typically the case in spike sorting, the convergence of the likelihood of the mixture model is rapid and the mixture density approximates the true density.

CHAPTER 3

IMPLEMENTATION IN MATLAB

Introduction

We have shown how to use the EM algorithm to estimate the parameters of a Gaussian Mixture Model. In this chapter, we will find that a number of practical issues have to be examined closely before we can achieve robust parameter estimates. Moreover, since we will u
rds, the likelihood will increase as the model number becomes larger. A number of methods have been proposed to determine the model number using maximum likelihood criteria by adding a penalty function that compensates for the monotonic increase of the log likelihood with the model number [24]. However, it is hard to define a suitable penalty term, and such methods are not very efficient to implement.

To determine the model number in this paper, we use the method presented in [12]. We calculate the likelihood of the data set for models ranging in size from 1 to 7 clusters. As shown in Figure 3.2, the initial increment of the likelihood is large but then becomes small. This is because as the number of clusters approaches the true number in the model, the increment of the likelihood becomes small. So we picked the point where the increase in likelihood fell below a certain threshold δ for 2 consecutive points. Eq. 3.1 shows the actual calculation. The threshold δ was set to 4:

\log L(t+2) - \log L(t+1) < \delta \;\text{ and }\; \log L(t+1) - \log L(t) < \delta        Eq. 3.1

From Figure 3.2, we can see that 5 models is a reasonable choice for the model number. Also, keep in mind that clusters can always be merged or split using more sophisticated methods [12, 28].

Figure 3.2 The increase in likelihood due to the increase of cluster numbers. The amount of the log likelihood increase is shown in the bar plot. The dashed line shows the threshold value.
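As a sketch of how this stopping rule can be coded (our illustration, not thesis code; the array layout and the fallback behavior are assumptions), the following C fragment returns the first cluster count after which two consecutive likelihood increments stay below the threshold:

    #define MAXM   7     /* largest model size tried (1..7 clusters) */
    #define DELTA  4.0   /* threshold delta from Eq. 3.1 */

    /* logL[m-1] holds the training log-likelihood of the m-cluster model. */
    static int select_model_number(const double logL[MAXM])
    {
        int t;
        for (t = 0; t + 2 < MAXM; t++) {
            /* two consecutive likelihood increments below the threshold */
            if (logL[t + 1] - logL[t] < DELTA &&
                logL[t + 2] - logL[t + 1] < DELTA)
                return t + 1;   /* cluster count at which growth levels off */
        }
        return MAXM;   /* no knee found: fall back to the largest model */
    }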
reproduction of this thesis in whole or in parts may be granted only by the copyright holder.

Signature: ______________    Date: ______________

TABLE OF CONTENTS

1. INTRODUCTION .......... 1
   What is Neural Spike Sorting? .......... 1
   Spike Sorting System .......... 3
   Gaussian Mixture Model .......... 6
   Overview of the Thesis .......... 7

2. THE GAUSSIAN MIXTURE MODEL AND THE EM ALGORITHM .......... 9
   Introduction .......... 9
   Mixture Model for Spike Sorting .......... 9
   The Gaussian Mixture Model .......... 9
   GMMs for spike sorting .......... 11
   Spike sorting using the GMM .......... 13
   The advantage and limitation of the GMM spike sorting system .......... 15
   Maximum Likelihood and the Expectation Maximization Algorithm .......... 16
   Maximum Likelihood Estimation .......... 17
   The Expectation Maximization (EM) algorithm .......... 18
   Expectation Maximization for the Gaussian Mixture Model .......... 20
   Implementation of the EM algorithm .......... 23

3. IMPLEMENTATION IN MATLAB ..........
rtex II XC2V3000 FPGA as our system target. The Virtex II XC2V3000 FPGA has 96 18-bit two's-complement multipliers and 96 block memories. By using these embedded hardware resources, we efficiently parallelized the EM algorithm in the FPGA. The FPGA implementation was 16.4 times faster than a Pentium IV 3.2 GHz PC and 42.5 times faster than the 4-DSP implementation.

The work of this thesis shows the significant promise of using FPGAs for high-speed, computationally intensive DSP applications. By using FPGAs, a significant performance improvement was achieved over the PCs and DSP processors without sacrificing system flexibility. It appears that, with intensive I/O and parallel computational ability, FPGAs will become more widely used in the DSP world.

Future work

One direction for future work could be to improve the precision of the fixed-point implementation using a larger word length and different binary point positions. However, a larger word length will decrease the system performance of the FPGA, since more hardware resources will be needed. The tradeoff between word length and system speed has to be evaluated in order to find the optimal balance, which will depend on the implementation constraints.

The FPGA implementation in this thesis was translated directly from Matlab code using the AccelFPGA compiler. At this stage, the implementation could be improved by directly hand-coding the algorithm in VHDL and even manually rout
s system from multiple unit records. Journal of Neuroscience Methods, 8, 391-397.

14. Recce, M. L., & O'Keefe, J. 1989. The tetrode: An improved technique for multi-unit extracellular recording. Society for Neuroscience Abstracts, 15, 1250.

15. Harris, K. D., Henze, D. A., Csicsvari, J., Hirase, H., & Buzsaki, G. 2000. Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. J Neurophysiol, 84.

16. Lewicki, M. S. 1998. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems.

17. Yang, M. H., & Ahuja, N. 1999. Gaussian Mixture Model for human skin color and its application in image and video databases. Proc. of the SPIE, 3635.

18. Reynolds, D. A. 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17.

19. Salganicoff, M., & Sarna, M. 1988. Unsupervised waveform classification for multi-neuron recordings: a real-time, software-based system. J Neurosci Methods, 25, 181-187.

20. Zouridakis, G., & Tam, D. C. 2000. Identification of reliable spike templates in multi-unit extracellular recording using fuzzy clustering. Comput Methods Programs Biomed, 61, 91-98.

21. Ohberg, F., & Johansson, H. 1996. A neural network approach to real-time spike
s_inv(:, m_14) TO ram_s18_s18(m_14+45) AT 0
    %!ACCEL MEM_MAP lut_covars_log_tile(:, m_14) TO ram_s18_s18(m_14+50) AT 0
    for m_14 = quantize(q_int, 1:ncluster)
        covars_addr = accel_bitshl(covars(d_7, m_14), 12);   % form the LUT address
        covars_inv(d_7, m_14) = quantize(q_lut_cov_inv, lut_covars_inv_tile(covars_addr + 1, m_14));
        covars_log(d_7, m_14) = quantize(q_lut_cov_log, lut_covars_log_tile(covars_addr + 1, m_14));
    end
end

% Sum of covars_log to get det_covars:
%   covars_det = fun_covars_det(covars_log)
%!ACCEL SHAPE covars_det(5)
%!ACCEL TILE m_16
for m_16 = quantize(q_int, 1:ncluster)
    covars_det(m_16) = quantize(q1, const_zero);
end

for d_8 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_15
    for m_15 = quantize(q_int, 1:ncluster)
        covars_det(m_15) = quantize(q1, covars_det(m_15) + covars_log(d_8, m_15) * 0.5);
        mid_covars(d_8, m_15) = quantize(q_mid_covars, const_zero);
    end
end

id = quantize(q_int, 1);
end

% This part tricks AccelChip into correctly initializing the centers block memory
id_1 = quantize(q_data, centers_1(1));
id_2 = quantize(q_data, centers_2(1));
id_3 = quantize(q_data, centers_3(1));
id_4 = quantize(q_data, centers_4(1));
id_5 = quantize(q_data, centers_5(1));

outdatabuf_1 = quantize(q1, covars_det(1));
outdatabuf_2 = quantize(q1, covars_det(2));
outdatabuf_3 = quantize(q1, covars_det(3));
outdatabuf_4 = quantize(q1, covars_det(4));
outdatabuf_5 = quantize(q1, covars_det(5));
outdatabuf_6 = quant
s of data, which assigns data to each cluster. This process is extremely time consuming and is affected by human bias and inconsistencies.

The development of microtechnology enables multi-tip electrode recording, such as stereotrode [13] or tetrode [14] recording. By adding spatial information about the action potentials, multi-electrode recording can improve the accuracy of spike sorting [15]. Obviously, the amount of data taken from multi-channel recording can be too overwhelming to deal with manually. To solve this problem, some type of automated clustering method is required. During the past three decades, various unsupervised classification methods have been applied to the spike sorting problem. The application of general unsupervised clustering methods, such as K-means, fuzzy C-means, and neural-network-based classification schemes, has achieved some success. Methods such as template matching and Independent Component Analysis have also been used. A complete review can be found in [16].

In most cases, the classification is done off-line. There are two reasons: (1) most clustering methods need user involvement; (2) most automated clustering methods are very computationally intensive, and general hardware, such as personal computers or workstations, can't meet the real-time requirement of the data streams. However, online spike sorting is needed for many applications of computational neuroscience [8]. Implementing a classification method on high-spee
se the full sampled waveforms as input, the high-dimensional training data will cause numeric problems, such as underflow, due to limitations in precision. We will present a revised version of the EM algorithm to solve these problems, which we implement in software and hardware.

The chapter is organized as follows. First, the training data used in this paper is introduced. The practical problems related to the EM algorithm and its application to the GMM spike sorting system are discussed in the next section. These problems include initialization of the GMM, cluster number selection, and algorithm evaluation. In the following section, the numeric underflow problem caused by high-dimensional input data is addressed. We compare three different approaches to handle this problem and present a log version of the EM algorithm with diagonal covariance matrices. This new EM algorithm gave the best result among the three methods. At the end of this chapter, we show the resulting model trained by this log EM algorithm and its performance on a PC.

Neural Signal Data from a Cricket

The signals we use in this thesis were recorded from the cricket cercal sensory system [27]. Data was recorded using an NPI SEC-05L amplifier and sampled at a 50 kHz rate with a digital acquisition system running on a Windows 2000 platform. The stimuli for these experiments were produced by a specially designed device which generated laminar air currents across t
set;
    int New_extsrc;
    int DSP_offset;

    /* Check for valid DMA channel */
    if ( (dma_channel < 10) || (dma_channel > 13) )
        return( FALSE );

    chan_offset = (dma_channel - 10) << 3;

    /* Prepare DMA for slave DSP */
    if ( (int) extsrc >= (int) 0x0200000 )
    {
        DSP_offset = (int) extsrc & 0x0F00000;
        New_extsrc = (int) extsrc & 0x00FFFFF;

        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
        SetIOP( (DSP_offset + II10) + chan_offset, (int) New_extsrc );
        SetIOP( (DSP_offset + IM10) + chan_offset, 1 );
        SetIOP( (DSP_offset + C10) + chan_offset, num_to_copy );
        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_SLA_MDO ); /* Enable: write to extmem */
    }

    /* Prepare DMA from external memory to EPB0, via DMAC10 */
sion of the EM algorithm was implemented. Since the training process of the EM algorithm is very computationally intensive and runs slowly on personal computers (PCs) and even on parallel DSPs, we mapped the EM algorithm to a Field Programmable Gate Array (FPGA). It trained the models without significant degradation of the model integrity when using 18-bit and 30-bit fixed-point data. The FPGA implementation, using a Xilinx Virtex II 3000, was 16.4 times faster than a 3.2 GHz Pentium 4, and 42.5 times faster than a parallel floating-point DSP implementation using 4 SHARC 21160 DSPs.

CHAPTER 1

INTRODUCTION

What is Neural Spike Sorting?

Most neurons communicate with each other by means of short local perturbations in the electrical potential across the cell membrane, called action potentials or spikes [1]. By using extracellular glass pipettes, single etched (sharp) electrodes, or multiple-site probes, scientists have been able to record the electrical activity of neurons as they transmit and process information in the nervous system. It is widely accepted that information is coded in the firing frequency or firing time of action potentials [2, 3]. It is further believed that information is also coded in the joint activities of large neural ensembles [4]. Therefore, to understand how a neural system transmits and processes information, it is critical to simultaneously record from a population of neurons as well as to efficie
st resort to more elaborate techniques.

The Expectation Maximization (EM) algorithm

The maximum likelihood GMM parameter estimates can be found via an iterative parameter estimation procedure called the Expectation-Maximization (EM) algorithm [25]. The EM algorithm is a general method for finding the maximum likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. There are two main applications of the EM algorithm. The first occurs when the data indeed has missing values. The second occurs when optimizing the likelihood function is analytically intractable. The latter application applies to the GMM case, where an analytical solution for the likelihood equation can not be found.

The idea behind the second application of the EM algorithm is to simplify the likelihood function by assuming the existence of additional but hidden parameters. We assume that the data X is observed and is generated by some distribution. We also assume that there is another, hidden data set Y, which will be defined in section 2.3.3. We call Z = (X, Y) the complete data set. The density function for the complete data set is

p(z \mid \Theta) = p(X, Y \mid \Theta) = p(Y \mid X, \Theta)\, p(X \mid \Theta)        Eq. 2.7
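For the GMM case, a worked instance of this definition (our addition, using the notation of Eqs. 2.1 and 2.2) takes the hidden data Y to be the component labels y_i of the observations, so that the complete-data likelihood factors over the data points:

    \begin{align*}
    p(X, Y \mid \Theta) &= \prod_{i=1}^{N} p(x_i, y_i \mid \Theta)
                         = \prod_{i=1}^{N} \alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i}), \\
    \log p(X, Y \mid \Theta) &= \sum_{i=1}^{N} \Bigl( \log \alpha_{y_i}
                         + \log p_{y_i}(x_i \mid \theta_{y_i}) \Bigr).
    \end{align*}

The product over mixture components that makes the ordinary likelihood intractable is replaced by a simple sum of logarithms once the labels are treated as known.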
start_count;
    int stop_count;

    float proTemp = 0;
    float prepro = 0;
    float logproXiLTemp;
    float maxTemp;
    float lhdata_master;
    float lhdatanew;
    float lhdataold;
    float thres;
    float logproLXiTemp[ndata] = {0};
    float maxTempProLXi[nclusters] = {0};

    offset = 0;
    loopCount = 0;
    prepro = 0;
    testlog2 = logf( 2.71 );
    testlog = nclusters;
    lhdataold = -100;

    global = (int *) GLOBAL;
    priors_start = (int *) priorsTemp;
    candc_start = (int *) candcTemp;

    SetIOP( SEM_GLOBAL, 0 );
    *global = STOP;

    /*------------------------------------------------
      Initialize the covars and priors
      ------------------------------------------------*/
    initGmm( priors, centers, covars );

    /* Start timer */
    timer_set( 0xFFFFFFFF, 0xFFFFFFFE );
    start_count = timer_on();    /* enable timer */

    for ( i = 0; i < 100000; i++ )
        ;

    /*------------------------------------------------
      Get the start address of blockdata on the slave DSPs
      ------------------------------------------------*/
    while ( (GetIOP( SEM_LOCAL_1 ) == 0) || (GetIOP( SEM_LOCAL_2 ) == 0) || (GetIOP( SEM_LOCAL_3 ) == 0) )
        ;

    blockdata_slave[0] = (int *) GetIOP( SEM_LOCAL_1 );
    blockdata_slave[1] = (int *) GetIOP( SEM_LOCAL_2 );
    blockdata_slave[2] = (int *) GetIOP( SEM_LOCAL_3 );

    SetIOP( SEM_LOCAL_1, 1 );
    SetIOP( SEM_LOCAL_2, 1 );
    SetIOP( SEM_LOCAL_3, 1 );

    /*------------------------------------------------
stem. The log probability of a spike being in a cluster is computed for each individual cluster, and the highest-scoring cluster identifies the neuron.

Figure 2.4 shows a block diagram of the Bayesian decision rule applied to spike sorting using a GMM. The input to the system is a vector representing an action potential. The probability that the action potential belongs to each class is calculated using Eq. 2.3 over all the classes. The vector is assigned to a class whenever it falls into that class with a high confidence level. If the waveform doesn't belong to any cluster with high probability, the vector is considered noise and thrown away. It is also possible that the waveform results from overlapping spikes, and it could be stored for further investigation.

The advantage and limitation of the GMM spike sorting system

The GMM is a statistical approach to the spike sorting problem. It defines a probabilistic model of the spike waveforms. A big advantage of this framework is that it is possible to quantify the certainty of the classification. By using Bayesian probability theory, the probability of both the spike form and the number of spike shapes can be quantified. This is useful in spike sorting because it allows experimenters to make decisions about the isolation of spikes. Also, by monitoring the classification probabilities, experimenters will be able to detect possible changes in experimental conditions. For e
sum( blocksums, blockdata_diff, logproLXi_block_1[i], offset );
        vecvadd( logcovars_block_1[i], blocksums, logcovars_block_1[i], ndim );
    }

    SetIOP( SEM_LOCAL, 1 );

    stop_count = timer_off();
    cycle_count = ((float) start_count - (float) stop_count) / 80000000;   /* divide by 80 MHz */
}

/*------------------------------------------------
  The library included in both C codes for the master
  and slave processors: lib.c
  ------------------------------------------------*/
#include <stdio.h>
#include <matrix.h>
#include <math.h>
#include <stdlib.h>
#include <vector.h>
#include <def21160.h>
#include <21060.h>
#include <signal.h>
#include "speeddsp.h"
#include "diag_DMA_master_2.h"

/* Global variables */
int dmac_ext_to_int = DMAC_MDI;
int dmac_int_to_ext = DMAC_MDO;

void initGmm( float priors[1][nclusters], float centers[][ndim], float covars[][ndim] )
{
    int i = 0;
    int j = 0;
    float a = 0;

    for ( i = 0; i < nclusters; i++ )
        for ( j = 0; j < ndim; j++ )
            covars[i][j] = 1;

    for ( i = 0; i < nclusters; i++ )
        priors[0][i] = 0;

    a = 1.0 / nclusters;
    vecsadd( priors[0], a, priors[0], nclusters );   /* just for testing */
}

float getproXi( float centers[ndim], float blockdata[ndim], float covars[ndim] )
{
    int i = 0, j = 0;
    float *meanTemp;
    float *invcovTemp = 0;
    float *detcovTemp;
    float proXiTemp = 0;
    float divider = 0;
    float log_normal = 0;
    float proXi;

    /* Normalization constant */
    log_normal = logf( 2 * 3.1416 ) * ndim / 2;

    invcovTemp = (float *) ca
t);   % 2^2 .. 2^-12
    end
end

for d_10 = quantize(q_int, 1:ndim)
    if covars(d_10, 3) > upper_const
        covars(d_10, 3) = quantize(q_covars, upper_const);   % 2^2 .. 2^-12
    end
end

for d_11 = quantize(q_int, 1:ndim)
    if covars(d_11, 4) > upper_const
        covars(d_11, 4) = quantize(q_covars, upper_const);   % 2^2 .. 2^-12
    end
end

for d_12 = quantize(q_int, 1:ndim)
    if covars(d_12, 5) > upper_const
        covars(d_12, 5) = quantize(q_covars, upper_const);   % 2^2 .. 2^-12
    end
end

for d_6_1 = quantize(q_int, 1:ndim)
    if covars(d_6_1, 1) < const_zero
        covars(d_6_1, 1) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_9_1 = quantize(q_int, 1:ndim)
    if covars(d_9_1, 2) < const_zero
        covars(d_9_1, 2) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_10_1 = quantize(q_int, 1:ndim)
    if covars(d_10_1, 3) < const_zero
        covars(d_10_1, 3) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_11_1 = quantize(q_int, 1:ndim)
    if covars(d_11_1, 4) < const_zero
        covars(d_11_1, 4) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_12_1 = quantize(q_int, 1:ndim)
    if covars(d_12_1, 5) < const_zero
        covars(d_12_1, 5) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_7 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_14
    %!ACCEL MEM_MAP covars_log(:, m_14) TO ram_s18_s18(m_14+35) AT 0
    %!ACCEL MEM_MAP covars_inv(:, m_14) TO ram_s18_s18(m_14+40) AT 0
    %!ACCEL MEM_MAP lut_covar
t >= (int) 0x0200000 )
    {
        DSP_offset = (int) extdest & 0x0F00000;
        New_intdest = (int) extdest & 0x00FFFFF;

        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
        SetIOP( (DSP_offset + II10) + chan_offset, (int) New_intdest );
        SetIOP( (DSP_offset + IM10) + chan_offset, 1 );
        SetIOP( (DSP_offset + C10) + chan_offset, num_to_copy );
        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_SLA_MDI ); /* Enable: write to extmem */
    }

    /* Prepare DMA from external memory to EPB0, via DMAC10 */
    SetIOP( DMAC10 + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
    SetIOP( II10 + chan_offset, (int) intsrc );
    SetIOP( IM10 + chan_offset, 1 );
    SetIOP( C10 + chan_offset, num_to_copy );
    SetIOP( EI10 + chan_offset, (int) (DSP_offset + EPB0 + chan_offset) );
    SetIOP( EM10 + chan_offset, 0 );
    SetIOP( EC10 + chan_offset, num_to_copy );
    SetIOP( DMAC10 + (dma_channel - 10), DMAC_MST_MDO );

    /* If we need to wait for completion */
    if ( wait )
    {
        int dma_status_bit = 1L << dma_channel;

        /* Wait until DMA completed */
        while ( GetIOP( DMASTAT ) & dma_status_bit )
            ;
        /* while ( GetIOP( DSP_offset + DMASTAT ) & dma_status_bit ) ; */

        /* Disable DMA */
        SetIOP( DMAC10 + (dma_channel - 10), DMAC_FLUSH );
        SetIOP( DSP_offset + DMAC10 + (dma_channel - 10), DMAC_FLUSH );
    }

    return( 0 );
}

int extmem_to_intmem_s2m( int *extsrc, int *intdest, int num_to_copy, int dma_channel, int wait )
{
    int chan_off
t version; (b) is the output of the FPGA implementation.

Table 5.8 The difference between the floating-point output and the FPGA output.

  GMM parameters          Error between floating-point and FPGA output
  mean vectors            0.0170
  covariance matrices     0.0019
  priors                  0.0147

We ran the same training data through the two models to see the classification results. 18 out of 720 data points were misclassified using the FPGA model when compared to the floating-point implementation in Matlab; the classification rate is 97.5%. The misclassifications are shown in the confusion matrix in Table 5.9.

Table 5.9 Confusion matrix of the fixed-point spike sorting result for 720 neural spikes from 5 neurons. The overall correct rate is 97.5% compared to the floating-point result.

               Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
  Cluster 1    252        0          0          2          0
  Cluster 2    0          182        0          0          0
  Cluster 3    0          3          133        1          0
  Cluster 4    12         0          0          105        0
  Cluster 5    0          0          0          0          30

The confusion matrix shows that most misclassifications happened between cluster 4 and cluster 1 (14 misclassifications). Figure 5.10 shows that clusters 1 and 4 are heavily overlapped and probably represent action potentials from the same neuron. The same situation exists for clusters 2 and 3, which have 3 misclassifications. Overall, the small number of misclassifications happened between very close clusters, which would not greatly affect the final spike sorting results.

(a) F
ta likelihood. Note that since Y is a hidden variable that is unknown and presumably governed by an underlying distribution, the likelihood function is also a random variable. The EM algorithm first finds the expected value of the complete-data log-likelihood with respect to the unknown data Y, given the observed data X and the current parameter estimates. That is, we define

Q(\Theta, \Theta_g) = E\bigl[ \log p(X, Y \mid \Theta) \mid X, \Theta_g \bigr]        Eq. 2.9

where Q(Θ, Θ_g) is the expected value of the log-likelihood with respect to the unknown data Y, Θ_g are the current parameter estimates, and Θ are the new parameters that we optimize to increase Q. Note that in this equation X and Θ_g are constants, and Θ is a normal variable that we wish to adjust. The evaluation of this expectation is called the E step of the EM algorithm. The second step, the M step, maximizes the expectation computed in the first step. That is, we find

\Theta^{*} = \arg\max_{\Theta} Q(\Theta, \Theta_g)        Eq. 2.10

These two steps are repeated as often as necessary. The maximum likelihood parameter estimates have some desirable properties, such as asymptotic consistency and efficiency [26]. This means that, given a large enough sample of training data, the model estimates will converge to the true model parameters with probability one. Each step of the EM iteration is guaranteed to increase the log-likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [26].
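The whole procedure can be seen in miniature in the following self-contained C example (ours, not thesis code): a one-dimensional, two-component GMM fitted by EM on a made-up toy data set, with the E step computing the posteriors and log-likelihood, and the M step re-estimating the weights, means and variances.

    #include <math.h>
    #include <stdio.h>

    #define N 8     /* toy data set size  */
    #define M 2     /* mixture components */

    static double gauss(double x, double mu, double var)
    {
        return exp(-0.5 * (x - mu) * (x - mu) / var)
               / sqrt(2.0 * 3.141592653589793 * var);
    }

    int main(void)
    {
        /* Toy 1-D data drawn near two modes. */
        double x[N] = { -2.1, -1.9, -2.4, -1.7, 3.0, 3.3, 2.8, 3.1 };
        double alpha[M] = { 0.5, 0.5 }, mu[M] = { -1.0, 1.0 }, var[M] = { 1.0, 1.0 };
        double r[N][M];                 /* posteriors p(omega_l | x_i) */
        double prev = -1e300, loglike;
        int i, l, iter;

        for (iter = 0; iter < 100; iter++) {
            /* E step: posteriors and log-likelihood. */
            loglike = 0.0;
            for (i = 0; i < N; i++) {
                double px = 0.0;
                for (l = 0; l < M; l++)
                    px += (r[i][l] = alpha[l] * gauss(x[i], mu[l], var[l]));
                for (l = 0; l < M; l++)
                    r[i][l] /= px;
                loglike += log(px);
            }
            if (fabs(loglike - prev) < 1e-6)   /* convergence test */
                break;
            prev = loglike;

            /* M step: re-estimate weights, means, variances. */
            for (l = 0; l < M; l++) {
                double nl = 0.0, s = 0.0, ss = 0.0;
                for (i = 0; i < N; i++) nl += r[i][l];
                for (i = 0; i < N; i++) s += r[i][l] * x[i];
                mu[l] = s / nl;
                for (i = 0; i < N; i++) ss += r[i][l] * (x[i] - mu[l]) * (x[i] - mu[l]);
                var[l] = ss / nl + 1e-6;       /* floor keeps the variance nonsingular */
                alpha[l] = nl / N;
            }
        }
        for (l = 0; l < M; l++)
            printf("cluster %d: alpha=%.3f mu=%.3f var=%.3f\n", l, alpha[l], mu[l], var[l]);
        return 0;
    }

The variance floor plays the same role as the singularity check on the covariance matrices in the Matlab code of Appendix B.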
termined, the precision was found by dividing the dynamic range by the LUT size. Then we ran the training process of the EM algorithm to see whether a given precision caused the EM algorithm to converge or not.

If problems occurred and the algorithm didn't converge, we had to improve the input precision by increasing the LUT size. There are two ways to do this. One approach is to increase the LUT size by shortening the output bit width. With a shorter word width, a BlockRAM can hold more than 1024 elements in a single bank, but the decreased output precision may not meet the output precision requirement. Another way is to use more than one bank of block SelectRAM to form a single LUT. Theoretically, you can use as many BlockRAMs as are available to build a LUT. However, additional resources and control circuitry will be needed to linearly address all these memory blocks. These additional circuits will use more area of the FPGA and decrease the system performance. In our application, the 1024-entry LUT
ternal memory is in SLAVE: offset equals the upper portion of intdest,
       and intdest should be the lower portion */
    if ( (int) intdest >= (int) 0x0200000 )
    {
        DSP_offset = (int) intdest & 0x0F00000;
        New_intdest = (int) intdest & 0x00FFFFF;
    }

    /* Prepare DMA from external memory to EPB0, via DMAC10 */
    SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
    SetIOP( (DSP_offset + II10) + chan_offset, (int) New_intdest );
    SetIOP( (DSP_offset + IM10) + chan_offset, 1 );
    SetIOP( (DSP_offset + C10) + chan_offset, num_to_copy );
    SetIOP( (DSP_offset + EI10) + chan_offset, (int) extsrc );
    SetIOP( (DSP_offset + EM10) + chan_offset, 1 );
    SetIOP( (DSP_offset + EC10) + chan_offset, num_to_copy );
    SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_MDI );     /* Read from extmem */

    /* If we need to wait for completion */
    if ( wait )
    {
        int dma_status_bit = 1L << dma_channel;

        /* Wait until DMA completed */
        while ( GetIOP( (DSP_offset + DMASTAT) ) & dma_status_bit )
            ;

        /* Disable DMA */
        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );
    }

    return( TRUE );
}

int intmem_to_extmem_m2s( int *intsrc, int *extdest, int num_to_copy, int dma_channel, int wait )
{
    int chan_offset;
    int DSP_offset;
    int New_intdest;

    /* Check for valid DMA channel */
    if ( (dma_channel < 10) || (dma_channel > 13) )
        return( FALSE );

    chan_offset = (dma_channel - 10) << 3;

    /* Prepare DMA for slave DSP */
    if ( (int) extdest >
to form smooth approximations to arbitrarily shaped densities. Figure 2.1 shows a one-dimensional example of the GMM modeling capabilities. An arbitrary non-Gaussian distribution (shown as a solid line) is approximated by a sum of three individual Gaussian components (shown as dashed lines).

Figure 2.1 Non-Gaussian density estimation using a GMM.

The complete Gaussian mixture density is parameterized by the mean vectors μ_i, covariance matrices Σ_i and weights α_i of all the component densities, which are represented by the notation Θ = {α_i, μ_i, Σ_i}. The model is given by Eq. 2.1:

p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x \mid \theta_i)        Eq. 2.1

where p(x|Θ) is the probability of x given the model Θ, x is a D-dimensional random vector, p_i(x|θ_i), i = 1, ..., M, are the component densities, and α_i, i = 1, ..., M, are the mixture weights. Each component density is a D-variate Gaussian function given by Eq. 2.2:

p_i(x \mid \theta_i) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_i \rvert^{1/2}} \exp\!\Bigl( -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \Bigr)        Eq. 2.2

where μ_i is the mean vector and Σ_i is the covariance matrix of the individual Gaussian model i, i = 1, ..., M. The mixture weights α_i satisfy the constraint that \sum_{i=1}^{M} \alpha_i = 1 and α_i >= 0, which ensures that the mixture is a true probability density function. The weights α_i can be viewed as the probability of each component.

In spike sorting, the action potentials generated by each neuron are represented by each compone
tput data. While the input dynamic range is quite large (roughly 0 down to -10^4), the output dynamic range is limited to (0, 1). This means that quite a large span of the input space is mapped to a small range of the output space. This can be seen more clearly in Figure 5.5: the output of the exponential function decreases as the input decreases, and as the input approaches large negative values the output barely changes even though the input keeps decreasing. Obviously, the input precision for large negative values will not affect the output, since the output is almost zero. Knowing this, we can decrease the input dynamic range by setting a threshold, so that any input smaller than a particular value maps the output to zero.

Figure 5.4 Histogram of the input and output data range for the LUT. (a) The input. (b) The output.

Figure 5.5 Input and output curve of the exponential operation.

We chose the threshold value to be -4, so that the dynamic range of the input was (-4, 0). Choosing the LUT size to be 1024 means the input precision is (0 - (-4)) / 1024 = 2^-8. We tested this precision by making sure the EM algorithm converged. Table 5.3 shows the difference between the floating-point outputs and the LUT outputs. The error was about 1% for each variable, which had a small effect on
...VisualDSP++ 3.0 to develop the program for the Hammerhead PCI board. Most of the code was written in C. Small sections, such as the DMA transfer routines, were written in assembly language. All the C functions we used were optimized for the ADSP-21160's SIMD mode.

Figure 4.1 Block diagram of the Hammerhead PCI system

Analysis of the EM Algorithm

To parallelize the EM algorithm, we needed to find out where its computationally intensive parts were. These parts should be mapped onto multiple DSPs so that each DSP can share part of the computational load. By handling these parts simultaneously, we can speed up the whole system and enhance its performance.

We used the profile command in Matlab to analyze the EM algorithm. The profile function records the execution time of each line in the code as well as of each function call in a Matlab .m file. By comparing the execution times, we can determine which parts of the algorithm Matlab spent the most time on. The most computationally intensive parts should then be parallelized across multiple DSPs.

Figure 4.2 The most computationally intensive parts of the EM algorithm and their execution percentage of the whole process (vertical axis: % of the entire execution time): (a) calculation of p(i | x_t, \Theta); (b) update of the means and covariance matrices; (c) calculation of p(x_t | \Theta) and the log-likelihood; (d) the rest of the algorithm.
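As a hedged sketch of what such a mapping can look like, the fragment below block-partitions the N data points across the four DSPs; process_point and the partitioning scheme are hypothetical and do not reproduce the thesis's actual parallel transformation.

#define NUM_DSPS 4

/* Hypothetical per-point E-step routine; stands in for the real work
   of computing p(i | x_t, Theta) for one spike waveform. */
extern void process_point(int t);

/* Each DSP processes one contiguous block of the N data points, so the
   dominant per-point cost identified above is shared roughly evenly. */
void estep_block(int dsp_id, int N)
{
    int block = (N + NUM_DSPS - 1) / NUM_DSPS;   /* ceiling division */
    int start = dsp_id * block;
    int end   = (start + block < N) ? (start + block) : N;
    int t;

    for (t = start; t < end; t++)
        process_point(t);
    /* partial statistics would then be combined by the master DSP */
}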
Numeric Considerations in Matlab

The Log EM Algorithm

To preserve all the original information in the spike waveforms, we used the fully sampled spikes as the clustering input. The implementation of the training process of the EM algorithm must therefore handle the 160-dimensional inputs. However, if we look at the EM algorithm closely, we find many places where a number of small values (between 0 and 1) are multiplied together. This easily leads to a numerical zero in Matlab, which is defined to be 2.2251e-308 or less for double-precision data. The calculation of the likelihood of the whole data set and the determination of the covariance matrices are the two places where the numerical zero problem occurs most often. To compensate, we have to transform these operations into the log domain, so that the multiplications turn into additions and the underflow problem is prevented. Here we present a log version of the EM algorithm which solves this problem.

Given the current model, we calculate the probability of each data point in each cluster. The original equations are listed as follows:

    p_i(x_t \mid \theta_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_t - \mu_i)^T \Sigma_i^{-1} (x_t - \mu_i) \right)        (Eq. 3.2)

    p(x_t \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x_t \mid \theta_i)        (Eq. 3.3)

    p(X \mid \Theta) = \prod_{t=1}^{N} p(x_t \mid \Theta)        (Eq. 3.4)

where N is the number of data points. Notice that in Eq. 3.2 and Eq. 3.4 there are product operations that will cause underflow. Changing them into the log domain, the equations become

    \ln p_i(x_t \mid \theta_i) = -\tfrac{1}{2} (x_t - \mu_i)^T \Sigma_i^{-1} (x_t - \mu_i) - \tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \ln|\Sigma_i|        (Eq. 3.5)
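The log-domain equations continue beyond this excerpt; as a sketch of how the summation in Eq. 3.3 can then be evaluated safely, the standard log-sum-exp identity can be applied to the terms ln(alpha_i) + ln p_i(x_t | theta_i), as in the illustrative C fragment below (identifiers are not taken from the thesis code).

#include <math.h>

/* Compute ln p(x_t | Theta) from log_terms[i] = ln(alpha_i) + ln p_i(x_t | theta_i)
   using log-sum-exp: ln(sum exp(a_i)) = m + ln(sum exp(a_i - m)), with
   m = max a_i. Every exponent passed to exp() is <= 0, so underflow can
   never wipe out the dominant component. */
double log_mixture_likelihood(const double *log_terms, int M)
{
    double m = log_terms[0];
    double sum = 0.0;
    int i;

    for (i = 1; i < M; i++)
        if (log_terms[i] > m) m = log_terms[i];

    for (i = 0; i < M; i++)
        sum += exp(log_terms[i] - m);   /* each term is in (0, 1] */

    return m + log(sum);
}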
...neural network based unsupervised classification schemes [21]. With some modifications, the GMM-based approaches also show promise in solving the overlap and bursting problems of spike sorting [22].

The Expectation Maximization (EM) algorithm is normally used to estimate the parameters of a Gaussian Mixture Model. EM is an iterative algorithm which updates the mean vectors and covariance matrices of each cluster at each stage. The algorithm is very computationally intensive and runs slowly on PCs or workstations.

In this thesis, we present a log version of the EM algorithm which can handle high-dimensional inputs. We also implement this revised EM algorithm on three different platforms: the PC, parallel DSP, and FPGA platforms. By comparing the different implementations, we found that, in terms of speed, the FPGA implementation gives the best performance.

Overview of the Thesis

The thesis will mainly focus on the following three points: the Gaussian Mixture Model, the Expectation Maximization algorithm, and the hardware implementation of the EM algorithm. Chapter 2 introduces the Gaussian Mixture Model and how to use the EM algorithm to estimate the mixture parameters. In Chapter 3, a log version of the EM algorithm and its performance on a PC are presented. The details of the implementation of the EM algorithm on a parallel DSP system are described in Chapter 4. Chapter 5 describes the parallel implementation of the EM algorithm on an FPGA.
...example, a drop in probability may indicate electrode drift or a rise in noise levels.

In spite of the advantages of the GMM approach, there are several inherent limitations of this statistical model for spike sorting. The Gaussian mixture model makes the assumption that the noise process is Gaussian and invariant; neither assumption is necessarily true in spike sorting. For example, in the presence of burst firing and overlapping spikes, the noise may show larger variation at the tails. Another assumption made is that all neurons fire independently and all the noise is uncorrelated, yet it is likely that neurons in a local area have some correlation with each other. We could use a more complete statistical approach which includes the correlation between neurons to model the spike-generating process [24]; however, this is outside the scope of this thesis.

Maximum Likelihood and the Expectation Maximization Algorithm

This section describes an unsupervised maximum likelihood procedure for estimating the parameters of a Gaussian mixture density from a set of unlabeled observations. A maximum likelihood parameter estimation procedure is developed for the GMM. First, a brief discussion of maximum likelihood estimation and the use of the EM algorithm for GMM parameter estimation is given. This is followed by the derivation of the mixture weight, mean, and variance estimation equations. Finally, a brief discussion of the complete iterative...
...by using the Xilinx Virtex-II XC2V3000 FPGA. By implementing the training procedure on an FPGA, the time to train models to be used in spike sorting will be shortened, which will benefit neuroscience research by allowing more time to be spent on collecting data.

Several practical issues related to the EM algorithm implementation were discussed. To prevent underflow problems caused by multiplications, we converted part of the EM algorithm into the log domain. Without prior knowledge of how many neurons were generating action potentials, we estimated the model number of the GMM using the likelihood function. The most computationally intensive parts of the EM algorithm were identified and parallelized in both the multiple-DSP and FPGA implementations.

For the parallel DSP implementation, we used four ADSP-21160 32-bit floating-point DSPs. Bus contention among the four DSPs significantly decreased the speed of the parallel DSP system. To prevent this, we set up a semaphore to make usage of the cluster bus mutually exclusive, and used DMA mode to accelerate the data transfers among the DSPs. However, because of the low DSP clock speed, the parallel DSP implementation did not outperform the PC implementation.

For the FPGA implementation, a fixed-point version of the EM algorithm was developed and simulated in Matlab. To implement the exponential function, a block-memory-based lookup table was used to approximate the function. We used a Xilinx Virtex-II...