TABLE OF CONTENTS (CONTINUED)

    Neural Signal Data from a Cricket
    Practical Problems Related to the EM Algorithm
        Model Number Selection
        Initialization of the EM Algorithm
        Evaluation of the Algorithm
    Numeric Considerations in Matlab
        The Log EM Algorithm
        Using a Diagonal Covariance Matrix
    Implementation Result in Matlab
    Matlab Performance

4. PARALLEL DSP IMPLEMENTATION
    Introduction
    Hardware
        SHARC ADSP-21160M DSP Processor
        Bittware Hammerhead PCI board and VisualDSP
    Analysis of the EM Algorithm
    Parallel Transformation of the EM Algorithm
    Performance on the Parallel DSP Platform
    /* --- Moving Data --- */
    if (DMA == 1) {
        extmem_to_intmem_s2m((int *)(PR_BLOCK_1 + LoopCount * DSP_JUMP
                                     + P_BLOCK_LENGTH + CANDC_LENGTH),  /* Source buffer */
                             candc_start,                               /* Destination buffer */
                             CANDC_LENGTH,                              /* Buffer size */
                             10,                                        /* DMA channel */
                             TRUE);                                     /* Wait for DMA to complete */
    } else {
        for (i2 = 0; i2 < CANDC_LENGTH; i2++)
            candc_start[i2] = *((int *)(PR_BLOCK_1 + LoopCount * DSP_JUMP
                                        + P_BLOCK_LENGTH + CANDC_LENGTH) + i2);
    }
    /* --- Finish Moving Data --- */

    /********** Begin updating the covars **********/
    for (i = 0; i < nclusters; i++)
        vecvadd(logcovars_master[i], candcTemp[i], logcovars_master[i], ndim);
    for (i = 0; i < nclusters; i++)
        vecsmlt(logcovars_master[i], logpriors_master[0][i], covars[i], ndim);

    /* Move the updated covars back out to the per-slave blocks */
    if (DMA == 1) {
        intmem_to_extmem_m2s((int *)covars,                                   /* Source buffer */
                             (int *)(P_BLOCK_1 + P_BLOCK_LENGTH + CANDC_LENGTH),
                             CANDC_LENGTH,                                    /* Buffer size */
                             10,                                              /* DMA channel */
                             TRUE);                                           /* Wait for DMA to complete */
        intmem_to_extmem_m2s((int *)covars,
                             (int *)(P_BLOCK_1 + P_BLOCK_LENGTH + CANDC_LENGTH + DSP_JUMP),
                             CANDC_LENGTH, 10, TRUE);
        intmem_to_extmem_m2s((int *)covars,
                             (int *)(P_BLOCK_1 + P_BLOCK_LENGTH + CANDC_LENGTH + 2 * DSP_JUMP),
                             CANDC_LENGTH, 10, TRUE);
    } else {
        for (i2 = 0; i2 < CANDC_LENGTH; i2++)
TABLE OF CONTENTS (CONTINUED)

5. FPGA IMPLEMENTATION
    Introduction
    The FPGA and Its Structure
        Field Programmable Gate Arrays
        Structure of Xilinx's Virtex II
    AccelFPGA
    The Fixed Point Representation of the EM Algorithm
    Using Lookup Tables
    Parallelism Needed for FPGA Implementation
    Synthesis and Simulation
    FPGA Result Evaluation
    FPGA Performance

6. CONCLUSION AND FUTURE WORK
    Conclusion
    Future Work

REFERENCES

APPENDICES
    APPENDIX A: SIMPLIFYING THE EXPRESSION OF Q(\Theta, \Theta^g)
    APPENDIX B: FLOATING POINT MATLAB CODE
    APPENDIX C: C CODE ON THE PARALLEL DSP PLATFORM
    APPENDIX D: FIXED POINT MATLAB CODE USING ACCELFPGA

LIST OF FIGURES

FIGURE

1.1 This extracellular recording waveform shows different action potentials from an unknown number of local neurons
The form of Q(\Theta, \Theta^g) can still be greatly simplified. We note that, for l = 1, ..., M,

    \sum_{y_1=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) = p(l \mid x_i, \Theta^g),

since \sum_{j=1}^{M} p(j \mid x_i, \Theta^g) = 1. We can therefore write Q(\Theta, \Theta^g) as

    Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big( \alpha_l \, p_l(x_i \mid \theta_l) \big) \, p(l \mid x_i, \Theta^g)
                        = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l) \, p(l \mid x_i, \Theta^g)
                          + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big( p_l(x_i \mid \theta_l) \big) \, p(l \mid x_i, \Theta^g).

This is the same equation as Eq. (2.16). As shown in Chapter 2, from this equation we can use a Lagrange multiplier and other methods to get Eqs. (2.20), (2.21) and (2.22), which together form the basis of the EM algorithm for iteratively estimating the parameters of a Gaussian Mixture Model.

APPENDIX B

FLOATING POINT MATLAB CODE FOR THE EM ALGORITHM

diaghalfLogEmGmm.m:

    function [StoreLike, LoopCount, OldMgmm] = diaghalfLogEmGmm(ncluster, data, OldMgmm)

    min_std = 0.0001;
    [ndatadim, ndata] = size(data);

    % Initialize the covariance to be diagonal
    OldMgmm.covars = ones(ncluster, ndatadim);

    PrevLike = -100;
    StopCri = 0.0001;
    MaxLoop = 100;
    LoopCount = 0;
    StopOrNot = 100;
    StoreLike = zeros(1, MaxLoop);

    range_max = 0;
    range_min = 1000;
    range_per = 1000;
    track = zeros(3, 1000);

    % The main EM Loop
    while (St...
[Table 5.2, continued]

    Device      System Gates    Slices    Multiplier Blocks    SelectRAM Blocks    DCMs    Max I/O Pads
    XC2V3000    3M              14336     96                   96                  8       720

AccelFPGA

FPGA programming can be done using a Hardware Description Language (HDL) such as VHDL or Verilog. These HDLs are low level languages which give very fine control over the actual hardware implementation. However, for complex applications the HDL coding process is tedious and error prone, and requires costly debugging iterations. Furthermore, a lot of the low level work is repetitive and can be automated. To solve this problem, many researchers have focused on how to raise the level of abstraction to a general purpose programming language such as C/C++ [32] or Java [33]. Recently, a compiler has been developed that can take Matlab code and generate optimized hardware for an FPGA. The AccelFPGA Compiler V1.6 was used for this thesis.

Figure 5.3 shows the basic steps in the process of converting an algorithm in MATLAB into an FPGA implementation using AccelFPGA. The first step is to convert the floating point model of the algorithm into a fixed point model that meets the algorithm's signal fidelity criterion. Next, the user adds compiler directives for AccelFPGA. These directives can specify the usage of the FPGA's embedded hardware resources, such as the BlockRAM and the embedded multiplier blocks. The third step is the creation of the RTL model in VHDL or Verilog, with the associated testbenches automatically created by the AccelFPGA compiler. The
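To make the second step concrete, here is a minimal sketch of fixed point Matlab annotated with AccelFPGA directives. The directive names (SHAPE, MEM_MAP, USE embedmults) are the ones that appear in the Appendix D code; the variables x, y and coef are hypothetical examples, not taken from the thesis.

    % Hedged sketch: a fixed point loop with AccelFPGA directives.
    % Directive names follow Appendix D; x, y, coef are made up.
    q1 = quantizer('floor', 'wrap', [18 13]);   % 18-bit word, 13 fraction bits
    coef = 0.5;                                 % hypothetical constant coefficient
    x = rand(1, 160);                           % hypothetical input vector
    y = zeros(1, 160);
    %!ACCEL SHAPE x(160)
    %!ACCEL MEM_MAP x to ram_s18_s18(1) at 0
    %!ACCEL USE embedmults(1)
    for k = 1:160
        y(k) = quantize(q1, x(k) * coef);       % maps to an embedded 18x18 multiplier
    end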
LIST OF FIGURES (CONTINUED)

5.1 Virtex II architecture overview [31]
5.2 Slice structure of the Virtex II FPGA [31]
5.3 Top-down design process using AccelFPGA
5.4 Histogram of the input and output data range for the LUT table: (a) the inputs, (b) the outputs
5.5 Input and output curve of the exponential operation
5.6 Diagram of implementing the LUT in block memory
5.7 Diagram of the EM algorithm implementation in the FPGA
5.8 Floorplan of the EM algorithm implementation on a Virtex II FPGA
5.9 The mean vectors of the GMMs: (a) is the floating point version, (b) is the output of the FPGA implementation
5.10 The clustering results of both floating point and fixed point implementations

LIST OF TABLES

TABLE

3.1 Performance of the spike sorting system on several ...
4.1 List of semaphores and their functions
4.2 Execution time for 8 iterations of the EM algorithm on single and 4-DSP systems
4.3 Performance of the EM algorithm on DSP system and ...
Figure 1.1 This extracellular recording waveform shows different action potentials from an unknown number of local neurons. The data were recorded from an adult female cricket's cercal sensory system. (Data courtesy of Alex [7].)

Spike sorting provides an alternative to physical isolation for multi-neuron recording. In this approach, no effort is made to isolate a single cell; rather, the spikes due to several cells are recorded simultaneously and sorted into groups according to their waveforms. Each group is presumed to represent a single cell, since waveforms change as a function of position relative to the electrode: the closer an electrode is to a particular neuron, the greater the amplitude of its signal will be compared with the other waveforms.

Spike Sorting System

A typical system that measures and analyses extracellular neural signals is shown in Figure 1.2. There are four stages between the electrode and the identification of spike trains from individual neurons.

Figure 1.2 A typical neural signal recording, detecting and spike sorting system (stages include a detector, a feature extractor and a classifier).

In the first stage, neural signals are picked up by extracellular electrodes, amplified and then filtered. Low pass filters are used to smooth the spike waveforms and provide an anti-aliasing filter before sampling. The frequency content of the waveform is typically less than 10 kHz. Oth
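A minimal sketch of this first-stage filtering, assuming a hypothetical 40 kHz sampling rate; the filter design is illustrative, not taken from the thesis, and requires Matlab's Signal Processing Toolbox:

    % Hedged sketch of anti-aliasing before sampling; fs, the filter order
    % and x_raw are assumptions, not values from the thesis.
    fs = 40000;                           % hypothetical sampling rate (Hz)
    x_raw = randn(1, fs);                 % hypothetical one-second raw trace
    [b, a] = butter(4, 10000/(fs/2));     % 4th-order low pass, 10 kHz cutoff
    x_filt = filtfilt(b, a, x_raw);       % zero-phase smoothing of the trace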
        /* Also calculate the likelihood of each data point */
        for (j = 0; j < nclusters; j++) {
            logproXiLTemp = logproXiL_master[j][i] - maxTemp;
            logproXiLTemp = expf(logproXiLTemp);
            logproXiLTemp = logproXiLTemp * priors[0][j];
            loglhXi_master[0][i] = loglhXi_master[0][i] + logproXiLTemp;
        }
        loglhXi_master[0][i] = logf(loglhXi_master[0][i]) + maxTemp;
    }

    lhdata_master = vsum_f_D(loglhXi_master[0], 1, nblockdata) / ndata;  /* Likelihood of data in this block */
    lhdatanew = lhdata_master;

    /********** Calculate the likelihood of the whole data set **********/

    /* --- End and Wait and Start --- */
    while ((GetIOP(SEM_LOCAL_1) & GetIOP(SEM_LOCAL_2) & GetIOP(SEM_LOCAL_3)) != 1) {
    }

    for (i = 0; i < nprocessors - 1; i++)
        lhdatanew = lhdatanew + *((float *)(LH_BLOCK_1 + i * DSP_JUMP));

    thres = lhdatanew - lhdataold;
    thres = fabsf(thres);
    if (thres < setThres)
        break;
    lhdataold = lhdatanew;
    loopCountAll++;

    /* --- End and Wait and Start --- */
    SetIOP(SEM_GLOBAL, 1);
    while ((GetIOP(SEM_LOCAL_1) | GetIOP(SEM_LOCAL_2) | GetIOP(SEM_LOCAL_3)) != 0) {
    }
    SetIOP(SEM_GLOBAL, 0);

    /********** Calculate probability of being in each cluster **********/
    /********** Update the priors **********/
    for (i = 0; i < nclusters; i++) {
        logpriors_block_1[0][i] = vsum_sf_D(logproLXi_block_1[i], 1, nblockdata);
        logpriors_block_1[0][i] = logpriors_block_1[0][i] / ndata;
    }

    /********** Update the centers **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ndim; j++)
            blocksums[j] = 0;
        block_sum(blocksums, blockdata, logproLXi_block_1[i], offset);
        vecvadd(logcenters_block_1[i], blocksums, logcenters_block_1[i], ndim);
    }

    /********** Transfer priors and centers; get new priors and centers back **********/

    /* --- End and Wait and Start --- */
    SetIOP(SEM_LOCAL, 1);
    while (GetIOP(SEM_GLOBAL) == STOP) {
    }
    SetIOP(SEM_LOCAL, 0);

    /********** Update the covars **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < nblockdata; j++) {
            vecvsub(blockdata[j], centers[i], blockdata_diff[j], ndim);
            vecvmlt(blockdata_diff[j], blockdata_diff[j], blockdata_diff[j], ndim);
        }
        for (j = 0; j < ndim; j++)
            blocksums[j] = 0;
        block_sum(blocksums, blockdata_diff, logproLXi_block_1[i], offset);
Expectation Maximization for the Gaussian Mixture Model

In the spike sorting case, the GMM consists of M classes; each class represents a neuron that has been recorded. Let X = {x_1, x_2, ..., x_N} be the sequence of observation vectors, i.e. the measured waveforms. In this case, we assume the following probabilistic model:

    p(x_i \mid \Theta) = \sum_{l=1}^{M} \alpha_l \, p_l(x_i \mid \theta_l)    Eq. (2.11)

where the parameters are \Theta = (\alpha_1, ..., \alpha_M, \theta_1, ..., \theta_M), such that \sum_{l=1}^{M} \alpha_l = 1, and each p_l is a single Gaussian density function. Then the log-likelihood expression for this density from the data set X = {x_1, x_2, ..., x_N} is given by:

    \log L(\Theta \mid X) = \log \prod_{i=1}^{N} p(x_i \mid \Theta) = \sum_{i=1}^{N} \log\Big( \sum_{l=1}^{M} \alpha_l \, p_l(x_i \mid \theta_l) \Big)    Eq. (2.12)

which is difficult to optimize because it contains the log of a sum. Now we introduce a new set of data Y. Y is a sequence of data Y = {y_1, y_2, ..., y_N} whose values inform us which neuron generated the observation vector x_i. This variable Y is considered to be a hidden variable. If we assume that we know the values of Y, the likelihood becomes

    \log L(\Theta \mid X, Y) = \sum_{i=1}^{N} \log\big( P(x_i \mid y_i) \, P(y_i) \big) = \sum_{i=1}^{N} \log\big( \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \big)    Eq. (2.13)

Since the values of Y are unknown to us, we don't actually know which neuron generated each spike. However, if we assume Y is a random vector which follows a known distribution, we can still proceed.

The problem now is to find the distribution of the hidden variable Y. First we guess that parameters \Theta^g are
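Before the hidden-variable machinery, a direct (numerically naive) evaluation of Eqs. (2.11)-(2.12) can be sketched in Matlab; all names here are hypothetical rather than taken from the appendix code, and the covariances are taken to be diagonal as in the rest of the thesis:

    % Hedged sketch: direct evaluation of Eqs. (2.11)-(2.12), diagonal GMM.
    % X, alpha, mu, sigma2 are hypothetical names.
    function ll = gmm_loglike(X, alpha, mu, sigma2)
    % X: D-by-N data; alpha: 1-by-M priors; mu, sigma2: D-by-M means/variances
    [D, N] = size(X);
    M = numel(alpha);
    ll = 0;
    for i = 1:N
        pxi = 0;
        for l = 1:M
            d = X(:, i) - mu(:, l);
            % p_l(x_i | theta_l): single Gaussian density, diagonal covariance
            pl = exp(-0.5 * sum(d.^2 ./ sigma2(:, l))) ...
                 / sqrt((2*pi)^D * prod(sigma2(:, l)));
            pxi = pxi + alpha(l) * pl;       % Eq. (2.11)
        end
        ll = ll + log(pxi);                  % Eq. (2.12): log of a sum
    end
    end

As written, this underflows for high dimensional inputs, which is exactly the problem the log version of the algorithm in Chapter 3 addresses.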
#define SEM_DMA_READ (int *)MSGR4   /* in EXTERNAL_BASE */

/* --- GLOBAL DECLARATIONS --- */
segment("seg_global_flag") int global_flag = 0;
segment("seg_local_flag")  int local_flag  = 0;
float priorsTemp[1][nclusters] = {0};
float candcTemp[nclusters][ndim] = {0};
segment("seg_data") float data[ndata][ndim] = {
#include "rescal_720_data_directload.dat"
};  /* Data to be transmitted to external memory */
float blockdata[nblockdata][ndim] = {0};
segment("seg_pmda") float blockdata_diff[nblockdata][ndim] = {0};
float centers[nclusters][ndim] = {
#include "rescal_720_centers_directload.dat"
};
float priors[1][nclusters] = {0};
float covars[nclusters][ndim] = {0};
float logproXiL_master[nclusters][nblockdata] = {0};
float logproLXi_master[nclusters][nblockdata] = {0};
float blocksums[ndim] = {0};
float loglhXi_master[1][nblockdata] = {0};
float logpriors_master[1][nclusters] = {0};
float logcenters_master[nclusters][ndim] = {0};
float logcovars_master[nclusters][ndim] = {0};

void main()
{
    int i = 0, j = 0, i2 = 0, j2 = 0;
    int loopCount = 0;
    int loopCountAll = 0;
    float testlog = 0;
    float testlog2;
    float cycle_count = 0;
    int ntasks = 0;
    int offset = 0;
    int flagsum = 0;
    int offsetTemp = 0;
    int data_offset = 0;
    int LoopCount = 0;
    int *data_start = 0;
    int *priors_start = 0;
    int *candc_start = 0;
    int *global = 0;
    int *blockdata_slave[3] = {0};
    \frac{\partial}{\partial \alpha_l} \Big[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l) \, p(l \mid x_i, \Theta^g) \Big] = 0    Eq. (2.17)

subject to the constraint

    \sum_{l=1}^{M} \alpha_l = 1    Eq. (2.18)

We introduce the Lagrange multiplier \lambda and solve the following equation:

    \frac{\partial}{\partial \alpha_l} \Big[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l) \, p(l \mid x_i, \Theta^g) + \lambda \Big( \sum_{l=1}^{M} \alpha_l - 1 \Big) \Big] = 0    Eq. (2.19)

and we get the expression

    \alpha_l = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)    Eq. (2.20)

To get the expressions for \mu_l and \Sigma_l, we have to go through a similar procedure, but the computation is much more complicated. To calculate these two parameters of the Gaussian distribution function, we'll need to take derivatives of a function of matrices. The detailed procedure can be found in [25], and here we just give the results:

    \mu_l = \frac{\sum_{i=1}^{N} x_i \, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}    Eq. (2.21)

    \Sigma_l = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g) \, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}    Eq. (2.22)

Collectively, Eqs. (2.20), (2.21) and (2.22) form the basis of the EM algorithm for iteratively estimating the parameters of a GMM. The detailed steps to implement these equations will be discussed in the next section.

Implementation of the EM algorithm

As shown in Figure 2.5, the procedure to implement the iterative EM algorithm consists of the following steps (a sketch of the resulting update step follows the list):

1. Initialization: Initialize the model parameters \Theta(0).
2. E-step: Calculate the expected likelihood of the given data based on the current parameters \Theta(i). Calculate the increase in likelihood; if the difference is below a certain threshold, stop.
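A compact sketch of how Eqs. (2.20)-(2.22) turn into the update step, restricted to diagonal covariances as in the thesis code; the function and variable names (em_step, X, r, alpha, mu, sigma2) are hypothetical:

    % Hedged sketch of the M-step, Eqs. (2.20)-(2.22), diagonal covariance.
    % r(l,i) = p(l | x_i, Theta^g) is the responsibility from the E-step.
    function [alpha, mu, sigma2] = em_step(X, r)
    [D, N] = size(X);                  % X: D-by-N observation vectors
    M = size(r, 1);                    % M clusters
    alpha = sum(r, 2)' / N;            % Eq. (2.20): new priors
    mu = zeros(D, M);
    sigma2 = zeros(D, M);
    for l = 1:M
        w = sum(r(l, :));                          % total responsibility of cluster l
        mu(:, l) = (X * r(l, :)') / w;             % Eq. (2.21): new mean
        d = X - repmat(mu(:, l), 1, N);            % deviations from the new mean
        sigma2(:, l) = ((d.^2) * r(l, :)') / w;    % Eq. (2.22), diagonal entries only
    end
    end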
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < nblockdata; j++) {
            logproLXi_master[i][j] = logproXiL_master[i][j] + logf(priors[0][i])
                                     - loglhXi_master[0][j];
            logproLXi_master[i][j] = expf(logproLXi_master[i][j]);  /* Use float instead of log */
        }
    }

    /********** Update the priors **********/
    for (i = 0; i < nclusters; i++) {
        logpriors_master[0][i] = vsum_sf_D(logproLXi_master[i], 1, nblockdata);
        priors[0][i] = logpriors_master[0][i] / ndata;
    }

    /********** Update the centers **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ndim; j++)
            blocksums[j] = 0;
        block_sum(blocksums, blockdata, logproLXi_master[i], offset);
        vecvadd(logcenters_master[i], blocksums, logcenters_master[i], ndim);
    }

    /********** Update priors and centers **********/
    while ((GetIOP(SEM_LOCAL_1) & GetIOP(SEM_LOCAL_2) & GetIOP(SEM_LOCAL_3)) != 1) {
    }
    if (DMA == 1) {
        extmem_to_intmem_s2m((int *)(PR_BLOCK_1 + LoopCount * DSP_JUMP),  /* Source buffer */
                             priors_start,                                /* Destination buffer */
                             P_BLOCK_LENGTH + CANDC_LENGTH,               /* Buffer size */
                             10,                                          /* DMA channel */
                             TRUE);                                       /* Wait for DMA to complete */
    } else {
        for (i2 = 0; i2 < P_BLOCK_LENGTH + CANDC_LENGTH; i2++)
Figure 5.2 Slice structure of the Virtex II FPGA [31].

The Virtex II devices incorporate large amounts of 18 Kbit block memories, called Block SelectRAM or BlockRAM. Each Virtex II BlockRAM is an 18 Kbit true dual-port RAM with two independently controlled synchronous ports that access a common storage area. The BlockRAM supports various configurations, including single- and dual-port RAM and various data/address aspect ratios. Supported memory configurations are shown in Table 5.1.

Table 5.1 Supported configurations of the block SelectRAM

    16K x 1 bit     2K x 9 bits
    8K x 2 bits     1K x 18 bits
    4K x 4 bits     512 x 36 bits

The Virtex II devices also incorporate a number of embedded multiplier blocks. These multiplier blocks are 18-bit by 18-bit 2's complement signed multipliers. Some of the interconnection to the switch matrix is shared between the BlockRAM and the embedded multiplier. This sharing of the interconnection is optimized for an 18-bit-wide BlockRAM feeding the multiplier, and allows a fast implementation of a multiplier-accumulator (MAC) function, which is commonly used in DSP applications.

The Virtex II family is comprised of 11 members, ranging from 40K to 8M system gates. The FPGA we used for the thesis was the XC2V3000, which features 3M system gates. Table 5.2 shows the specifications of the XC2V3000 FPGA.

Table 5.2 The specifications of the Virtex II XC2V3000 FPGA

    Device      System Gates    Slices    Multiplier Blocks    SelectRAM Blocks    DCMs    Max I/O Pads
MONTANA STATE UNIVERSITY

Using FPGAs to accelerate the training process of a Gaussian mixture model based spike sorting system

by Yongming Zhu

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

Montana State University

© Copyright by Yongming Zhu (2003)

Abstract

The detection and classification of neural spike activity is an indispensable step in the analysis of extracellular signal recordings. We introduce a spike sorting system based on the Gaussian Mixture Model (GMM) and show that it can be implemented in Field Programmable Gate Arrays (FPGA). The Expectation Maximization (EM) algorithm is used to estimate the parameters of the GMM. In order to handle the high dimensional inputs in our application, a log version of the EM algorithm was implemented. Since the training process of the EM algorithm is very computationally intensive and runs slowly on Personal Computers (PC) and even on parallel DSPs, we mapped the EM algorithm to a Field Programmable Gate Array (FPGA). It trained the models without a significant degradation of the model integrity when using 18-bit and 30-bit fixed point data. The FPGA implementation using a Xilinx Virtex II 3000 was 16.4 times faster than a 3.2 GHz Pentium 4. It was 42.5 times faster than a parallel floating point DSP implementation using 4 SHARC 21160 DSPs.

USING FPGAS TO ACCELERATE THE TRAINING PROCESS OF A GAUSSIAN MIXTURE MODEL BASED SPIKE SORTING SYSTEM
Xilinx Virtex II FPGA and compares the performance of the FPGA with the previous implementations. Finally, Chapter 6 summarizes the work of this thesis and suggests possible future research directions.

CHAPTER 2

THE GAUSSIAN MIXTURE MODEL AND THE EM ALGORITHM

Introduction

This chapter introduces the Gaussian Mixture Model (GMM) and a maximum likelihood parameter estimation procedure, the Expectation Maximization (EM) algorithm, which is used to estimate the GMM parameters. The discussion in this chapter will focus on the motivation, advantages and limitations of using this statistical approach. Several considerations for implementing the EM algorithm on actual hardware will be discussed at the end of this chapter.

The chapter is organized as follows. Section 2.2 serves to describe the form of the GMM and the motivation for using the GMM in clustering the action potentials from different neurons. In Section 2.2, first a description of the Gaussian Mixture Model is presented; then several assumptions and limitations of using the GMM in spike sorting are discussed. Section 2.3 introduces the clustering of the action potentials based on Bayesian decision boundaries. Finally, the EM algorithm and the derivation of the GMM parameter estimation procedure are presented.

Mixture Model for Spike Sorting

The Gaussian Mixture Model (GMM)

A Gaussian mixture density is a weighted sum of a number of component densities. The Gaussian Mixture Model has the ability
        *((int *)(GLOBAL_P + P_BLOCK_LENGTH + CANDC_LENGTH) + i2) = *((int *)covars + i2);
    }
    /* --- Finish Moving Data --- */

    stop_count = timer_off();
    cycle_count = ((float)start_count - (float)stop_count) / 80000000;  /* divide by 80 MHz */
}

For the Slave DSPs: Slave.c

#include <stdio.h>
#include <matrix.h>
#include <math.h>
#include <stdlib.h>
#include <vector.h>
#include <21160.h>
#include <def21160.h>
#include "speeddsp.h"
#include "diag_DMA_slave_2.h"

#define DMA 1               /* Use DMA channel for external memory data transactions */
#define processorID 3
#define START 1
#define STOP 0
#define MASTER_BASE 0x0100000
#define MULTI_BASE 0x0700000
#define EXTERNAL_BASE 0x0800000
#define SEM_GLOBAL MSGR0
#define SEM_LOCAL MSGR1
#define SEM_DMA_WRITE (int *)MSGR4   /* in EXTERNAL_BASE */

segment("seg_local_flag")  int local_flag  = 0;
segment("seg_global_flag") int global_flag = 0;
segment("seg_lh_block")      float lhdata_block_1 = 0;
segment("seg_prior_block")   float logpriors_block_1[1][nclusters] = {0};
segment("seg_centers_block") float logcenters_block_1[nclusters][ndim] = {0};
segment("seg_covars_block")  float logcovars_block_1[nclusters][ndim] = {0};
segment("seg_p_block") float centers[nclusters][ndim] = {0};
segment("seg_p_block") float covars[nclusters][ndim] = {0};
segment("seg_p_block") fl
multivariate Gaussian model of the spike waveforms generated from a particular neuron. The solid line shows the mean vector of this model, while the dashed line shows the three-sigma boundary according to the covariance matrix. All the waveforms in this model are shown in the background.

Furthermore, if we assume that action potentials from different neurons are independent and identically distributed [16], the observation vector distribution can be given by a Gaussian Mixture Model (Eq. 2.1). Thus, the GMM results directly from a probabilistic modeling of the underlying spike generating process.

Spike sorting using the GMM

In the last section, we described how to use the GMM to model the spike generating process. From the perspective of spike sorting, the goal is to estimate the parameters of the GMM and classify each action potential with respect to the neuron class in which the action potential was generated. To do this, the first step is to estimate all the parameters, including all the prior probabilities \alpha_l together with the means and covariances of each Gaussian component. Maximum likelihood (ML) parameter estimation is the method we will use to find these parameters from a set of training data. It is based on finding the model parameters that are most likely to produce the set of observed samples. The maximum likelihood GMM parameter estimates can be found via an iterative parameter estimation procedure, namely the EM algorithm.
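In symbols (standard ML notation rather than the thesis's own), the estimate is the parameter set that maximizes the log likelihood of Eq. (2.12):

    \hat{\Theta}_{ML} = \arg\max_{\Theta} \log L(\Theta \mid X) = \arg\max_{\Theta} \sum_{i=1}^{N} \log p(x_i \mid \Theta)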
            for m_9 = quantize(q_int, 1):ncluster
                centers(d_3, m_9) = quantize(q_data, mid_centers(d_3, m_9) * mid_priors_inv(m_9));
                mid_centers(d_3, m_9) = quantize(q_mid_centers, const_zero);
            end
        end

        %%%% Function to get the log of priors %%%%
        % >>>>>>>>>>> priors_log = fun_lut_priors_log(priors) <<<<<<<<<<<
        %!ACCEL SHAPE mid_priors_log_addr(1)
        %!ACCEL SHAPE priors_log(5)
        for m_10 = quantize(q_int, 1):ncluster
            mid_priors_log_addr = accel_bitshl(priors(m_10), 11, q1, q_addr);
            priors_log(m_10) = quantize(q_lut_log, lut_log_content(mid_priors_log_addr + 1));
        end

        loop_flag = quantize(q_int, loop_flag + 1);
        id = quantize(q_int, 1);
        LogLikeHoodX = 0;
    end

elseif loop_flag == 2

    %%%% Function to get the new covars %%%%
    % mid_covars = fun_accum_cov(indatabuf, mid_covars, centers, ProLXi, id)
    %!ACCEL SHAPE sub_cov_1(ncluster)
    %!ACCEL SHAPE sub_cov_2(ncluster)
    %!ACCEL SHAPE square_cov(ncluster)
    %!ACCEL SHAPE mult_cov(ncluster)
    %!ACCEL SHAPE mid_covars_temp(ncluster)
    %!ACCEL REGISTER sub_cov_1
    %!ACCEL REGISTER sub_cov_2
    %!ACCEL REGISTER square_cov
    %!ACCEL REGISTER mult_cov
    %!ACCEL REGISTER mid_covars_temp
    for d_4 = quantize(q_int, 1):ndim
        %!ACCEL TILE m_i_3
        %!ACCEL USE embedmults(30) m_i_3, embedmults(35) m_i_3
        for m_i_3 = quantize(q_int, 1):ncluster
            sub_cov_1(m_i_3) = quantize(q_small, indatabuf(d_4, m_i_3) - centers(d_4, m_i_3));
            sub_cov_2(m_i_3) = quantize(q_small, indatabuf(d_4, m_i_3) - centers(d_4, m_i_3));
                maxTemp = logproXiL_block_1[0][i];
            if (logproXiL_block_1[j][i] > maxTemp)
                maxTemp = logproXiL_block_1[j][i];
        }

        /* Normalize the probabilities to (0, 1];
           also calculate the likelihood of each data point */
        for (j = 0; j < nclusters; j++) {
            logproXiLTemp = logproXiL_block_1[j][i] - maxTemp;
            logproXiLTemp = expf(logproXiLTemp);
            logproXiLTemp = logproXiLTemp * priors[0][j];
            loglhXi_block_1[0][i] = loglhXi_block_1[0][i] + logproXiLTemp;
        }
        loglhXi_block_1[0][i] = logf(loglhXi_block_1[0][i]) + maxTemp;
    }

    lhdata_block_1 = vsum_f_D(loglhXi_block_1[0], 1, nblockdata) / ndata;  /* Likelihood of data in this block */

    /********** Transfer the likelihood of this block;
                if necessary, continue the calculation **********/

    /* --- End and Wait and Start --- */
    SetIOP(SEM_LOCAL, 1);
    while (GetIOP(SEM_GLOBAL) == STOP) {
    }
    SetIOP(SEM_LOCAL, 0);

    /********** Calculate probability of being in each cluster **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < nblockdata; j++) {
            logproLXi_block_1[i][j] = logproXiL_block_1[i][j] + logf(priors[0][i])
                                      - loglhXi_block_1[0][j];
            logproLXi_block_1[i][j] = expf(logproLXi_block_1[i][j]);  /* Use float instead of log */
        }
    }
        indatabuf(d_0, 4) = indatabuf_4(d_0);
        indatabuf(d_0, 5) = indatabuf_5(d_0);
    end

    %!ACCEL SYNTHESIS OFF
    indatabuf(:, 1) = indatabuf_1;
    indatabuf(:, 2) = indatabuf_2;
    indatabuf(:, 3) = indatabuf_3;
    indatabuf(:, 4) = indatabuf_4;
    indatabuf(:, 5) = indatabuf_5;
    %!ACCEL SYNTHESIS ON

    %%%% Memory map for the look-up tables %%%%
    %!ACCEL MEM_MAP lut_exp_content to ram_s18_s18(56) at 0
    %!ACCEL MEM_MAP lut_log_content to ram_s18_s18(58) at 0
    %!ACCEL MEM_MAP lut_inv_content to ram_s18_s18(57) at 0
    %!ACCEL MEM_MAP indatabuf_1 to ram_s18_s18(1) at 0
    %!ACCEL MEM_MAP indatabuf_2 to ram_s18_s18(2) at 0
    %!ACCEL MEM_MAP indatabuf_3 to ram_s18_s18(3) at 0
    %!ACCEL MEM_MAP indatabuf_4 to ram_s18_s18(4) at 0
    %!ACCEL MEM_MAP indatabuf_5 to ram_s18_s18(5) at 0

    % This part tricks AccelChip into initializing the block memories
    %!ACCEL MEM_MAP centers_1 to ram_s18_s18(16) at 0
    %!ACCEL MEM_MAP centers_2 to ram_s18_s18(17) at 0
    %!ACCEL MEM_MAP centers_3 to ram_s18_s18(18) at 0
    %!ACCEL MEM_MAP centers_4 to ram_s18_s18(19) at 0
    %!ACCEL MEM_MAP centers_5 to ram_s18_s18(20) at 0

    if loop_flag == 1

        %%%% Function to calculate the log version probability in each cluster %%%%
        % >>>> LogProXiL = fun_logproxil(indatabuf, covars_inv, covars_det, centers) <<<<
        %!ACCEL SHAPE acc_1(1)
        %!ACCEL SHAPE acc_2(1)
        %!ACCEL SHAPE acc_3(1)
Figure 5.7 Diagram of the EM algorithm implementation in the FPGA.

Synthesis and Simulation

After generating the VHDL code from AccelFPGA, the next step was to simulate and synthesize the VHDL modules. The RTL VHDL code could be simulated in simulation tools such as Modelsim, using the testbenches generated by AccelFPGA. It could also be put into synthesis tools like ISE from Xilinx, to generate the bit stream to be downloaded into the FPGAs. The toolchain used can be seen in Table 5.5.

Table 5.5 The toolchain used in the FPGA implementation

    Toolchain    Version
    Matlab       V6.5 Release 13
    AccelFPGA    V1.6
    ISE          V5.2.02i
    Modelsim     Starter 5.6e

First, an RTL level simulation was made to verify that the VHDL code was working correctly. Then a post-synthesis model was generated by the ISE synthesis tool from Xilinx. This post-synthesis model was simulated again in Modelsim, and the simulation results were the same as the output of the fixed point Matlab implementation. This showed that the FPGA implementation of the EM algorithm was working correctly.

The system performance of the FPGA implementation can be seen in the synthesis report from the ISE tool. The highest clock rate of this EM system on the Xilinx Virtex II 3000 platform is 62 MHz (Table 5.7). The hardware resources used are shown in Table 5.6, and the actual floorplan inside the FPGA is shown in Figure 5.8.

Table 5.6 Device utilization summary of the Xilinx Virtex II FPGA
A. R. Martin & B. G. Wallace. 1992. From Neuron to Brain. Sinauer Associates, Inc., Sunderland, Massachusetts.

Abeles, M. 1991. Corticonics. Cambridge University Press, Cambridge.

Evarts, E. V. 1968. A technique for recording activity of subcortical neurons in moving animals. Electroencephalogr Clin Neurophysiol.

Georgopoulos, A. P., Taira, M., & Lukashin, A. 1993. Cognitive neurophysiology of the motor cortex. Science, 260.

Alexander G. Dimitrov. 2002. Spike sorting the other way. Neurocomputing, 2003.

Carroll, Michael. 2002. Spike Sorting for Neurosurgical Localization. Learning From Data, Winter 2002.

Kyung Hwan Kim & Sung June Kim. 2000. Neural Spike Sorting Under Nearly 0-dB Signal-to-Noise Ratio Using Nonlinear Energy Operator and Artificial Neural Network Classifier. IEEE Transactions on Biomedical Engineering, 47.

Susumu Takahashi, Yuichiro Anzai, & Yoshio Sakurai. 2002. Automatic Sorting for Multi-Neuronal Activity Recorded With Tetrodes in the Presence of Overlapping Spikes. J Neurophysiol, 89.

Abeles, M. & Goldstein, M. H. 1977. Multispike train analysis. Proc IEEE, 65.

Kyung Hwan Kim & Sung June Kim. 2003. Method for Unsupervised Classification of Multiunit Neural Signal Recording Under Low Signal-to-Noise Ratio. IEEE Transactions on Biomedical Engineering, 50.

McNaughton, B. L., J. O'Keefe, & C. A. Barnes. 1983. The stereotrode: a new technique for simultaneous isolation of several single units in the central nervous
    %!ACCEL MEM_MAP ... to ram_s18_s18(15) at 0
    %!ACCEL USE embedmults(25) m_9
    for m_9 = quantize(q_int, 1):5
        centers(d_3, m_9) = quantize(q_data, mid_centers(d_3, m_9) * mid_priors_inv(m_9));
        mid_centers(d_3, m_9) = quantize(q_mid_centers, const_zero);
    end

In the FPGA, all the computations among the five clusters were parallelized and ran at the same time. A lot of BlockRAMs were used to ensure that multiple uses of a single resource were possible. The embedded dedicated 18-bit 2's complement multipliers were used wherever multiplication was needed, to give the highest system performance. Registers were added to pipeline the operations and obtain a high system clock rate.

The training data was streamed into the FPGA to implement the training process of the EM algorithm. The data has to be streamed in twice for each iteration step, since both the E-step and the M-step need the training data for their computations. The advantage of streaming the data is that it makes the algorithm independent of the data size, since there is relatively little on-chip memory compared to PCs. The final results were stored in the block memories inside the FPGA (Figure 5.7).

[Figure 5.7: the data samples (data 1 ... data 720) stream through per-cluster pipelines; when the likelihood improvement falls below the threshold, the iteration stops and the current parameters are stored to the block memories.]
discrimination during simultaneous recording from several multi-unit nerve filaments. J Neurosci Methods, 64, 181-187.

Susumu Takahashi, Yuichiro Anzai, & Yoshio Sakurai. 2003. Automatic Sorting for Multi-Neuronal Activity Recorded With Tetrodes in the Presence of Overlapping Spikes. J Neurophysiol, 89, 2245-2258.

Robin Hanson, John Stutz, & Peter Cheeseman. 1996. Bayesian Classification Theory. Technical Report FIA-90-12-7-01.

Sahani, M. 1999. Latent Variable Models for Neural Data Analysis. Ph.D. Thesis.

Richard A. Redner & Homer F. Walker. 1984. Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review, 26, 195-239.

Lei Xu & Michael I. Jordan. 1995. On Convergence Properties of the EM Algorithm for Gaussian Mixtures. C.B.C.L. Paper No. 111.

Alexander G. Dimitrov, John P. Miller, Tomas Gedeon, Zane Aldworth, & Albert E. Parker. 2000. Analysis of neural coding using quantization with an information-based distortion measure. Network: Computation in Neural Systems.

[28] Paul M. Baggenstoss. 2002. Statistical Modeling Using Gaussian Mixtures and HMMs with MATLAB.

[29] Lewicki, Michael S. 1995. Bayesian Modeling and Classification of Neural Signals. Neural Computation.

[30] Dominique Lavenier. 2000. FPGA implementation of the K-means clustering algorithm for hyperspectral images. Los Alamos National Laboratory LAUR 00-3079.

[31] 2002. Virtex II Platform FPGAs: Introduction and Overview.

[32]
/********** Initialize the covars and priors **********/
initGmm(priors, centers, covars);
local_flag = 1;

/* Start timer */
timer_set(0xFFFFFFFF, 0xFFFFFFFF);
start_count = timer_on();   /* enable timer */

while (START) {

    /* --- End and Wait and Start --- */
    while (GetIOP(SEM_GLOBAL) == 0) {
    }
    SetIOP(SEM_LOCAL, 0);

    /********** Clear all the log version matrices **********/
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ndim; j++) {
            logcovars_block_1[i][j] = 0;   /* Clear the logcovars matrix */
            logcenters_block_1[i][j] = 0;  /* Clear the logcenters matrix */
        }
    }

    /********** Calculate the probability in each cluster **********/
    for (i = 0; i < nblockdata; i++) {
        for (j = 0; j < nclusters; j++) {
            logproXiL_block_1[j][i] = getproXi(centers[j], blockdata[i], covars[j]);
        }
    }

    /********** Calculate the likelihood of each data point in this model **********/
    for (i = 0; i < nblockdata; i++) {
        loglhXi_block_1[0][i] = 0;   /* Clear the store matrix */
        /* Get the max number of each column */
        for (j = 0; j < nclusters; j++) {
            if (j == 0) {
            } else {
                maxTemp = logproXiL_block
fourth step is to synthesize the VHDL or Verilog models using a logic synthesis tool. The gate-level netlist is then simulated to ensure functional correctness against the system specification. Finally, the gate-level netlist is placed and routed by the place-and-route tools appropriate for the FPGAs used in the implementation.

One advantage of AccelFPGA is that bit-true simulations can be done in the Matlab environment. Testbenches are automatically generated along with the simulations. You can use these testbenches later in the hardware simulation and compare the results with the Matlab simulation. As a result, it is very easy to know whether your hardware implementation is correct or not.

Since Matlab is a high level language, the compiled implementation may not give as good performance as direct HDL coding. However, by using proper directives you can specify the parallelism of your algorithm and maximize the usage of the on-chip hardware resources. Hence, you can get reasonably good performance from your implementation.

Figure 5.3 Top-down design process using AccelFPGA (design flow: Matlab design in floating point representation; convert to fixed point representation; apply AccelFPGA compiler directives; AccelFPGA generates RTL models and testbench; logic synthesis to map RTL models to FPGA primitives; logic simulation for bit-true verification). The versions of all the tools used in
[Figure 4.3, continued: Phase 4 — update the priors and means for each cluster (data transferred: New_priors(L), size 5 per node; New_centers(L), size 800 per node). Phase 5 — calculate middle results for updating the covariance matrices (data transferred: Mid_covars(L), size 800 per node). Phase 6 — update the covariance matrices for each cluster (data transferred: New_covars(L), size 800 per node).]

Figure 4.3 Diagram of the multiple-DSP implementation of the EM algorithm.

Performance on the Parallel DSP Platform

Table 4.2 gives the execution time for 8 iterations of the EM algorithm, until it converged. Performance on a single DSP and on multiple DSPs are both shown, and different data transfer methods are compared. The results show that the computational speed of the 4-DSP system was almost 4 times better than the single DSP. Also, the DMA transfer sped up the whole system and gave the best performance. Notice that the bus contention auto-resolution method in the multiple-DSP system significantly slowed down the performance of the whole system; using semaphores was necessary to accelerate the data transfer process.

Table 4.2 Execution time for 8 iterations of the EM algorithm on single and 4-DSP systems

    Data Transfer Method    Single DSP (sec)    Four DSPs (sec)
    Normal Read             14.5                13.5
    DMA transfer            15.9                3.4
    Bus Contention          N/A                 > 7000

Table 4.3 shows the performance of the same EM application between the DSP and Personal Computer (PC) configurations. The perfor
/********** Use DMA to move data to internal memory and calculate **********/
data_start = (int *)data;
offset = 0;

/********** Read data from SRAM to internal memory **********/

/* --- Moving Data --- */
if (DMA == 1) {
    extmem_to_intmem(data_start,                                /* Source buffer */
                     (int *)blockdata,                          /* Destination buffer */
                     Blocksize,                                 /* Buffer size */
                     10,                                        /* DMA channel */
                     TRUE);                                     /* Wait for DMA to complete */
    extmem_to_intmem((int *)(data_start + Blocksize),           /* Source buffer */
                     (int *)(blockdata_slave[0] + SLAVE_BASE_1),
                     Blocksize, 10, TRUE);
    extmem_to_intmem((int *)(data_start + 2 * Blocksize),       /* Source buffer */
                     (int *)(blockdata_slave[1] + SLAVE_BASE_2),
                     Blocksize, 10, TRUE);
    extmem_to_intmem((int *)(data_start + 3 * Blocksize),       /* Source buffer */
                     (int *)(blockdata_slave[2] + SLAVE_BASE_3),
                     Blocksize, 10, TRUE);
} else {
    for (i2 = 0; i2 < nblockdata; i2++)
        for (j2 = 0; j2 < ndim; j2++)
            blockdata[i2][j2] = data[offset + i2][j2];
}
/* --- Finish Moving Data --- */

while (loopCountAll < maxLoop) {
    /* --- End and Wait and Start --- */
of all the clusters are set to be equal. All the EM training processes in this paper start from this same initialization, so that all the results are comparable.

Figure 3.3 Initialization of the EM algorithm.

Evaluation of the Algorithm

One hard problem of spike sorting is that it is hard to evaluate the clustering results. Signals from different cells may be very similar and difficult to distinguish, even for trained neurophysiologists. It is nearly impossible to create reliable expert-based validation data sets. Hence, simulated neural data are often used for evaluating spike sorting algorithms. Recently, simultaneous intra- and extracellular recording technology has provided the possibility to evaluate the algorithm on real data.

In this paper, we focus on how to implement the GMM based spike sorting algorithm in hardware. Since evaluations on both simulated and real data show that the GMM approach gives very good results [16, 29], we concentrated on the software and hardware implementations. When targeting FPGAs, the use of fixed point data types is important because of the increase in speed that can be attained. Since using fixed point data loses a certain amount of precision as compared to floating point data, we evaluated the FPGA implementation with respect to the floating point implementations.
    \ln p(x_i \mid \Theta) = \ln \sum_{l=1}^{M} \alpha_l \, e^{\ln p_l(x_i \mid \theta_l)}    Eq. (3.6)

    \ln L(\Theta \mid X) = \sum_{i=1}^{N} \ln p(x_i \mid \Theta)    Eq. (3.7)

After changing into the log domain, the products in Eq. (3.2) and Eq. (3.4) are turned into additions and subtractions. Exponential operations are also eliminated, except in Eq. (3.6). Since all the e^{\ln p_l(x_i \mid \theta_l)} terms are very small, this can still cause underflow problems. To handle this, we need to add another normalization process to calculate Eq. (3.5). We do a normalization across all the clusters, subtracting from every \ln p_l(x_i \mid \theta_l) term the maximum value across the clusters. We store this maximum, B_i = \max_l \ln p_l(x_i \mid \theta_l), for future use. After the subtraction, the exponential value e^{\ln p_l(x_i \mid \theta_l) - B_i} is guaranteed to lie in (0, 1]. We then perform the multiplications and additions; the log version of the middle result can be obtained by applying ln to it. The last step is to add the maximum \ln p_l(x_i \mid \theta_l) back to the middle result. This way, without changing the final result, we go around the numerical zero problems and get the same results:

    \ln p(x_i \mid \Theta) = B_i + \ln \sum_{l=1}^{M} \alpha_l \, e^{\ln p_l(x_i \mid \theta_l) - B_i}    Eq. (3.8)

    B_i = \max_{l} \ln p_l(x_i \mid \theta_l)    Eq. (3.9)

After getting the likelihood of the whole data set, and before updating the parameters, the probability of being in one cluster given input x_i has to be calculated. This probability can be calculated by Bayes' rule in Eq. (3.10):

    p(l \mid x_i, \Theta^g) = \frac{\alpha_l \, p_l(x_i \mid \theta_l^g)}{\sum_{k=1}^{M} \alpha_k \, p_k(x_i \mid \theta_k^g)}    Eq. (3.10)

However, this equatio
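The same shift, written out for one observation: a minimal sketch of Eqs. (3.8)-(3.9), where logp and alpha are hypothetical names for the vector of \ln p_l(x_i \mid \theta_l) values and the priors.

    % Hedged sketch of Eqs. (3.8)-(3.9): the log-sum-exp normalization.
    logp = [-1050; -1052; -1049];    % hypothetical ln p_l(x_i | theta_l) values
    alpha = [0.3; 0.3; 0.4];         % hypothetical priors
    B = max(logp);                   % Eq. (3.9): per-sample shift
    lnpxi = B + log(sum(alpha .* exp(logp - B)));   % Eq. (3.8); exp() stays in (0,1]

Note that the naive form log(sum(alpha .* exp(logp))) would underflow to log(0) here, while the shifted form returns the correct value of about -1049.64.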
    acc_4 = quantize(q1, acc_4 + log_const + covars_det(4));
    acc_5 = quantize(q1, acc_5 + log_const + covars_det(5));
    LogProXiL(1) = quantize(q1, acc_1);
    LogProXiL(2) = quantize(q1, acc_2);
    LogProXiL(3) = quantize(q1, acc_3);
    LogProXiL(4) = quantize(q1, acc_4);
    LogProXiL(5) = quantize(q1, acc_5);

    %%%% Function to calculate the log likelihood of one data point %%%%
    % >>>> LogLikeHoodXi = fun_loglikehoodxi(LogProXiL, priors) <<<<
    %!ACCEL REGISTER maxa
    %!ACCEL SHAPE maxa(1)
    %!ACCEL SHAPE a(5)
    maxa = quantize(q1, max(LogProXiL));
    %!ACCEL REGISTER a
    % NOTE: This loop will not ... after synthesis in ISE
    a(1) = quantize(q1, maxa - LogProXiL(1));   % positive differences, per the LUT convention
    a(2) = quantize(q1, maxa - LogProXiL(2));
    a(3) = quantize(q1, maxa - LogProXiL(3));
    a(4) = quantize(q1, maxa - LogProXiL(4));
    a(5) = quantize(q1, maxa - LogProXiL(5));

    for m_2 = quantize(q_int, 1):ncluster
        if a(m_2) > 4
            a(m_2) = quantize(q1, 3.998);
        end
    end

    for m_3 = quantize(q_int, 1):ncluster
        %!ACCEL SHAPE a_addr(1)
        a_addr = accel_bitshl(a(m_3), 8, q1, q_addr);   % Use shift left to get the LUT address
        %!ACCEL SHAPE lut_exp_content(1024)
        %!ACCEL SHAPE a_lut_out(5)
        a_lut_out(m_3) = quantize(q_lut_exp, lut_exp_content(a_addr + 1));
    end

    %!ACCEL SHAPE c(1)
    c = quantize(q1, 0);
    %!ACCEL REGISTER c_temp
    %!ACCEL REGISTER c
    %!ACCEL USE embedmults(10), embedmults(11), embedmults(12), embedmults(13), embedmults(14)
    for m_4 = qu
the convergence of the EM algorithm.

Table 5.3 The error between the original floating point implementation and the LUT simulations

                      The fixed point LUT output
    Error per data    0.0126

Figure 5.6 shows how this LUT was implemented in the FPGA hardware. The input bits, starting from bit 4, represent the least precision 2^-8 and are used as the input to the LUT; the bits below bit 4 are neglected, since they are not required for the input. The content of the LUT is determined by Eq. (5.3):

    LUT(k) = \exp(-k \times 2^{-8})    Eq. (5.3)

in which k represents the decimal address of the block memory. Notice that the inputs of the LUT are positive, i.e. the negatives of the real inputs; this negation was done in the previous calculation without additional overhead. Another thing to notice is that the last element in the LUT is 0 instead of the result from Eq. (5.3); the zero accounts for all the outputs whose real inputs are less than -4.

Figure 5.6 Diagram of implementing the LUT in block memory. (An 18-bit LUT input addresses a bank of Block SelectRAM, 18 bits x 1024 entries; 10 bits of the input connect to the LUT address line. Sample entries: address 0 -> exp(0) = 1.0000; 1 -> exp(-2^-8) = 0.9961; 2 -> 0.9922; ...; 1022 -> 0.0184; 1023 -> 0.)

All the other LUTs were determined and tested in M
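The table contents can be generated offline. A minimal sketch of Eq. (5.3) in Matlab follows; the variable name lut_exp_content echoes the Appendix D code, while the rest is illustration.

    % Hedged sketch: building the 1024-entry exponential LUT of Eq. (5.3).
    k = 0:1023;
    lut_exp_content = exp(-k * 2^-8);   % entries cover inputs in [0, 4) at 2^-8 steps
    lut_exp_content(end) = 0;           % last entry forced to 0 for inputs beyond the range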
the structure and components on the Hammerhead board will be described. The software we used to develop the system will also be introduced. In the rest of Chapter 4, we focus on how to parallelize the EM algorithm and how to effectively implement it on a parallel DSP system. At the end of this chapter, the performance of the EM implementation on this DSP system will be presented.

Hardware

SHARC ADSP-21160M DSP Processor

On the Hammerhead PCI board there are four high performance 32-bit floating point DSP processors, namely Analog Devices SHARC ADSP-21160s. This DSP features 4 Mbits of on-chip dual-ported SRAM and the Super Harvard Architecture with a Single Instruction Multiple Data (SIMD) computational engine. Running at 80 MHz, the ADSP-21160 can perform 480 million math operations (peak) per second.

The ADSP-21160 has two independent, parallel computation units. Each unit consists of an ALU, multiplier, and shifter. In SIMD mode, the parallel ALU and multiplier operations occur in both processing elements. The ADSP-21160 has two independent memories, one for data and the other for program instructions. Two independent data address generators (DAGs) and a program sequencer supply addresses for memory access. The ADSP-21160 can access both data memory and program memory while fetching an instruction. In addition, it has a zero-overhead loop facility with single-cycle setup and exit. The on-chip DMA controller allows zero-overhead data transf
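One way to read the peak figure (my arithmetic, not the datasheet's derivation): with two processing elements, each containing three units, six operations can issue per cycle, so

    80\,\text{MHz} \times 2 \times 3 = 480 \text{ million operations per second.}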
values of the covariance matrices are approximately 10^-7, so during the training process the matrices easily become singular because of limited precision.

We tried three different approaches to solve this problem. The first method was to recondition the covariance matrices whenever a singular matrix appeared. Every time there was a numeric zero in the covariance matrix, we added a very small number to prevent it from being zero. This way, the algorithm could go on to the next step until it converged. In the second approach, we reduced the data dimension by using Principal Component Analysis (PCA). As discussed in Chapter 1, PCA can find the directions that represent the most variation of the original data. We used the first 10 components, which could represent 98% of the variation of the data set. By reducing the dimension to 10, there was much less chance that the full covariance matrix could become singular. The last method was to use a diagonal matrix instead of the full covariance matrix. Using a diagonal matrix enabled log calculation of the determinant and prevented numerical zeros. By using the diagonal covariance matrix, we made the assumption that there is no correlation between the dimensions of the 160-dimensional space. Making this assumption can cause a loss of information about the original data set; however, the first two methods also lost information, by reconditioning and by reducing the vector dimensions. We tested all these methods, and the best method was to use the diagonal covariance matrix.
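A minimal sketch of the reconditioning idea for a diagonal covariance matrix; the floor value is an assumption here, echoing the min_std = 0.0001 constant in the Appendix B code:

    % Hedged sketch of reconditioning; the floor value is an assumption
    % echoing min_std = 0.0001 from Appendix B.
    min_std = 0.0001;
    covars(covars < min_std^2) = min_std^2;   % replace numeric zeros with a tiny variance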
was about 700 times slower than the other approaches because of the bus contention issue.

We then set a semaphore to exclude concurrent use of the cluster bus. By reading this semaphore, only one DSP could access the bus at any time. This gave a much better performance than the first approach.

Finally, we used the DMA transfer mode to transfer parameters between the DSPs. The master processor was in charge of the DMA controllers of all the processors. Whenever a data transfer was needed, the master first made sure that no DSP was using the bus. The master then wrote into the registers of both the transmitting and receiving DMA controllers. After the DMA transfer was initialized, no other DSP could access the cluster bus until the DMA transfer finished. This sped up the process and gave the best performance. The performance comparison will be shown in the next section.

[Figure 4.3: Phase 1 — (1) calculate p(x_i | l), (2) calculate the log likelihood of each data point (data transferred: LogLike_Xi(L), size 180 per node). Phase 2 — (1) calculate the log likelihood of the whole data set, (2) test whether to stop the iteration or not (data transferred: LogPro_L_Xi, size 180 per node). Phase 3 — calculate middle results for updating the priors and means (data transferred: Mid_priors(L), size 5 per node; Mid_centers(L), size 800 per node). Phase 4 — update the priors and]
while the average speed was around 3 seconds. This is quite slow compared to the average neural spike period. The slow speed is due to the intensive computation needed for the training of the Gaussian Mixture Model based on the EM algorithm. We needed the training process to be as fast as possible, so that minimum time is spent in the training phase. As a result, PCs are not as fast as we would like in order to do this.

In the rest of the thesis we will focus on how to speed up the training process of the EM algorithm. In the next chapter, the parallel DSP implementation will be introduced; later, an FPGA implementation will also be presented.

CHAPTER 4

PARALLEL DSP IMPLEMENTATION

Introduction

To speed up the GMM based spike sorting system, the first approach we used was to implement the EM algorithm on a parallel DSP system with 4 DSPs. Digital Signal Processors have become more powerful with the increase in speed and size of their on-chip memory. Furthermore, some modern DSPs are optimized for multiprocessing. The Analog Devices ADSP-21160M DSP we used for our project has two types of integrated multiprocessing support: the link ports and the cluster bus. We used Bittware's Hammerhead PCI board to test our implementation. The board contains four Analog Devices ADSP-21160 floating point DSPs. We wrote a parallel version of the EM algorithm to take advantage of the multiprocessing capability of the board.

In the next section,
    % quantize for pure integer numbers
    wl_int = 18; fp_int = 0;
    % quantize for pure fractional numbers
    wl_fra = 18; fp_fra = 17;

    % The word length for the LUTs; the binary point can be varied due to the
    % precision requirement. Most fixed points are the same, but in case I
    % change one later, just leave each as a different fp.
    wl_lut = 18;
    fp_lut_cov_inv = 3;
    fp_lut_cov_log = 13;
    fp_lut_exp = 13;
    fp_lut_log = 13;
    fp_lut_prior = 13;
    fp_lut_inv = 13;

    q_data = quantizer('floor', [wl_data fp_data]);
    q1 = quantizer('floor', 'wrap', [wl fp]);
    q1_int = quantizer('floor', 'wrap', [wl 0]);
    q_lut_cov_inv = quantizer('floor', 'wrap', [wl_lut fp_lut_cov_inv]);
    q_lut_cov_log = quantizer('floor', 'wrap', [wl_lut fp_lut_cov_log]);
    q_lut_exp = quantizer('floor', 'wrap', [wl_lut fp_lut_exp]);
    q_lut_log = quantizer('floor', 'wrap', [wl_lut fp_lut_log]);
    q_lut_prior = quantizer('floor', 'wrap', [wl_lut fp_lut_prior]);
    q_int = quantizer('floor', 'wrap', [wl_int fp_int]);
    q_fra = quantizer('floor', 'wrap', [wl_fra fp_fra]);
    q_lut_inv = quantizer('floor', 'wrap', [wl_lut fp_lut_inv]);
    q_mid_centers = quantizer('floor', 'wrap', [18 8]);
    q_mid_priors = quantizer('floor', 'wrap', [18 8]);
    q_small = quantizer('floor', 'wrap', [18 13]);
    q_square_cov = quantizer('floor', 'wrap', [18 15]);
    q_covars = q_square_cov;
    q_priors = q_covars;
    q_mid_covars = quantizer('floor', 'wrap', [18 12]);
    q_loglike = quantizer('floor', 'wrap', [18 7]);

    % fixed point f
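For reference, a one-line usage sketch of these quantizer objects (x is a hypothetical value; quantize and quantizer come from Matlab's fixed point tooling of that era — an assumption about the exact toolbox):

    % Hedged usage sketch; x is a hypothetical value.
    x = 0.123456;
    y = quantize(q_small, x);   % floor-round into s18.13, wrapping on overflow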
        priors_start[i2] = *(int *)(PR_BLOCK_1 + LoopCount*DSP_JUMP + i2);
    }
}
/* Finish moving data */

/* Begin updating the priors and centers */
for (i = 0; i < nclusters; i++) {
    logpriors_master[0][i] = logpriors_master[0][i] + priorsTemp[0][i];
    vecvadd(logcenters_master[i], candcTemp[i], logcenters_master[i], ndim);
}
for (i = 0; i < nclusters; i++) {
    priors[0][i] = logpriors_master[0][i] / ndata;
    logpriors_master[0][i] = 1.0 / logpriors_master[0][i];
    vecsmlt(logcenters_master[i], logpriors_master[0][i], centers[i], ndim);
}
/* End updating the centers */

/* Moving data to the slave DSPs */
if (DMA == 1) {
    intmem_to_extmem_m2s((int *)priors,                  /* Source buffer */
                         (int *)P_BLOCK_1,               /* Destination buffer */
                         P_BLOCK_LENGTH - CANDC_LENGTH,  /* Buffer size */
                         10,                             /* DMA channel */
                         TRUE);                          /* Wait for DMA to complete */
    intmem_to_extmem_m2s((int *)priors,                  /* Source buffer */
                         (int *)(P_BLOCK_1 + DSP_JUMP),  /* Destination buffer */
                         P_BLOCK_LENGTH - CANDC_LENGTH,  /* Buffer size */
                         10,                             /* DMA channel */
                         TRUE);                          /* Wait for DMA to complete */
    intmem_to_extmem_m2s((int *)priors,                    /* Source buffer */
                         (int *)(P_BLOCK_1 + 2*DSP_JUMP),  /* Destination buffer */
                         P_BLOCK_LENGTH - CANDC_LENGTH,    /* Buffer size */
                         10,                               /* DMA channel */
                         TRUE);                            /* Wait for DMA to complete */
} else {
    for (i2 = 0;
         i2 < P_BLOCK_LENGTH - CANDC_LENGTH; i2++) {
        *(int *)(GLOBAL_P + i2) = priors[i2];
    }
}

/* End and Wait and Start */
SetIOP(SEM_GLOBAL, 1);
while ((GetIOP(SEM_LOCAL_1) | GetIOP(SEM_LOCAL_2) | GetIOP(SEM_LOCAL_3)) != 0)
    ;
SetIOP(SEM_GLOBAL, 0);

/* Get middle data for the covars */
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < nblockdata; j++) {
        vecvsub(blockdata[j], centers[i], blockdata_diff[j], ndim);
        vecvmlt(blockdata_diff[j], blockdata_diff[j], blockdata_diff[j], ndim);
    }
    for (j = 0; j < ndim; j++)
        blocksums[j] = 0;
    block_sum(blocksums, blockdata_diff, logproLXi_master[i], offset);
    vecvadd(logcovars_master[i], blocksums, logcovars_master[i], ndim);
}

/* Update the covars */
/* End and Wait and Start */
while ((GetIOP(SEM_LOCAL_1) & GetIOP(SEM_LOCAL_2) & GetIOP(SEM_LOCAL_3)) != 1)
    ;

for (LoopCount = 0; LoopCount < nprocessors - 1; LoopCount++) {
    /* Get mid results of priors and centers from the slaves */
    /* Moving
4.4 .......................................................................... 49

5.1 Supported configurations of the block SelectRAM ......................... 55

5.2 The resources of the Virtex II XC2V3000 FPGA

5.3 The error between the original floating point implementation and
the LUT implementation ...................................................... 63

5.4 List of directives that can be used in AccelFPGA ........................ 65

5.5 The toolchain used in the FPGA implementation ........................... 65

5.6 Device utilization summary of the Xilinx Virtex II FPGA ................. 68

5.7 The post synthesis timing report ........................................ 68

5.8 The difference between the floating point output and the FPGA output

5.9 Confusion matrix of the fixed point spike sorting result for 720 neural
spikes from 5 neurons. The overall correct rate is 97.25% compared
to the floating point result

5.10 Performance comparison between all the platforms ....................... 72

ABSTRACT

The detection and classification of neural spike activity is an indispensable step in the analysis of extracellular signal recordings. We introduce a spike sorting system based on the Gaussian Mixture Model (GMM) and show that it can be implemented in Field Programmable Gate Arrays (FPGAs). The Expectation Maximization (EM) algorithm is used to estimate the parameters of the GMM. In order to handle the high dimensional inputs in our application, a log version
A. Ghosh, J. Kunkel, & S. Liao, 1999. Hardware Synthesis from C/C++. Proc. Design, Automation and Test in Europe Conference and Exhibition.

[33] R. Helaihel & K. Olukotun, 1997. Java as a Specification Language for Hardware-Software Systems. Proc. International Conference on Computer-Aided Design, 690-697.

[34] 2003. AccelFPGA User's Manual V1.7.

APPENDICES

APPENDIX A

SIMPLIFYING THE EXPRESSION OF $Q(\Theta, \Theta^g)$

The procedure for estimating the parameters of the Gaussian Mixture Model using the Expectation Maximization algorithm is introduced in Chapter 2. In this section, we present the mathematical details of how to simplify the expression for the expected value of the log likelihood based on the whole data set and the current model parameters. We define it as $Q(\Theta, \Theta^g)$.

To find the equations to iteratively estimate the parameters of a GMM, one critical step is to simplify the expression of $Q(\Theta, \Theta^g)$. From Eq. 2.9 we know that

$$Q(\Theta, \Theta^g) = \sum_{\mathbf{y} \in \Upsilon} \log \mathcal{L}(\Theta \mid X, \mathbf{y})\, p(\mathbf{y} \mid X, \Theta^g)$$

Combining the above equation with Eq. 2.12, Eq. 2.13, Eq. 2.14 and Eq. 2.15, we can get the following equation:

$$
\begin{aligned}
Q(\Theta, \Theta^g)
&= \sum_{\mathbf{y} \in \Upsilon} \log p(X, \mathbf{y} \mid \Theta)\, p(\mathbf{y} \mid X, \Theta^g) \\
&= \sum_{\mathbf{y} \in \Upsilon} \sum_{i=1}^{N} \log\!\big(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\big) \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) \\
&= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\!\big(\alpha_\ell\, p_\ell(x_i \mid \theta_\ell)\big) \sum_{\mathbf{y} \in \Upsilon} \delta_{\ell, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) \\
&= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\!\big(\alpha_\ell\, p_\ell(x_i \mid \theta_\ell)\big)\, p(\ell \mid x_i, \Theta^g)
\end{aligned}
$$
Array (FPGA) is an ASIC whose logic functions can be changed to suit a user's needs. The circuit elements within these devices can be configured repeatedly because the current configuration is stored in SRAM cells [31]. FPGAs can be configured to operate in a nearly unlimited number of user-controlled ways by downloading various configuration bit streams.

FPGA devices are usually organized into generic logic blocks that can perform a limited number of logic functions. The specific function being performed is determined by the configuration bits preloaded into the hardware. These logic blocks are connected by programmable routing, which is also determined by the configuration bits.

The particular structures of the logic blocks and routing differ among FPGAs from different manufacturers. The FPGA we use for our application is from the Xilinx Virtex II series. The structure of this FPGA is described in the following section.

Structure of Xilinx's Virtex II FPGA

The Xilinx Virtex II family is an FPGA developed for high-speed, high-density logic designs with low power consumption. As shown in Figure 5.1, the programmable device is comprised of input/output blocks (IOBs) and internal configurable logic blocks (CLBs). In addition to the configurable logic blocks, there are three major elements:

1. Block SelectRAM memory blocks provide 18 Kbit dual-port RAMs.
2. Embedded 18 x 18 bit dedicated multipliers.
3. Digital C
%!ACCEL SHAPE acc_4(1)
%!ACCEL SHAPE acc_5(1)
%!ACCEL SHAPE LogProXiL(ncluster)
%!ACCEL SHAPE acc(ncluster)
%!ACCEL SHAPE sub_1(ncluster)
%!ACCEL SHAPE sub_2(ncluster)
%!ACCEL SHAPE mult(ncluster)
%!ACCEL SHAPE squa(ncluster)
%!ACCEL SHAPE sub_temp(ncluster)

acc_1 = quantize(q1, 0);
acc_2 = quantize(q1, 0);
acc_3 = quantize(q1, 0);
acc_4 = quantize(q1, 0);
acc_5 = quantize(q1, 0);

%!ACCEL REGISTER sub_1
%!ACCEL REGISTER sub_2
%!ACCEL REGISTER squa
%!ACCEL REGISTER mult
%!ACCEL REGISTER acc

for d_1 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_i_0
    %!ACCEL MEM_MAP indatabuf(m_i_0) TO ram_s18_s18(m_i_0+5) AT 0
    %!ACCEL USE embedmults(1+m_i_0), embedmults(4+m_i_0)
    for m_i_0 = quantize(q_int, 1:ncluster)
        sub_1(m_i_0) = quantize(q_small, indatabuf(d_1) - centers(d_1, m_i_0));
        sub_2(m_i_0) = quantize(q_small, indatabuf(d_1) - centers(d_1, m_i_0));
        squa(m_i_0) = quantize(q_small, sub_1(m_i_0) * sub_2(m_i_0));
        mult(m_i_0) = quantize(q1, squa(m_i_0) * covars_inv(d_1, m_i_0));
        acc(m_i_0) = quantize(q1, acc(m_i_0) + mult(m_i_0));
    end
end

acc_1 = quantize(q1, acc(1)/2);
acc_2 = quantize(q1, acc(2)/2);
acc_3 = quantize(q1, acc(3)/2);
acc_4 = quantize(q1, acc(4)/2);
acc_5 = quantize(q1, acc(5)/2);

acc_1 = quantize(q1, acc_1 + log_const + covars_det(1));
acc_2 = quantize(q1, acc_2 + log_const + covars_det(2));
acc_3 = quantize(q1, acc_3 + log_const + covars_det(3));
acc_4 = quantize
Figure 3.7 Clustering result for the training data. Five clusters are labeled using different colors.

Matlab Performance

Since we want to implement a high speed spike sorting system where the models are trained as fast as possible, we have to know how fast this algorithm can run on a PC. As described in Chapter 2, the whole spike sorting process can be divided into two phases: the model training phase and the spike classification phase. We tested the algorithm using the newest version of Matlab (Version 6.5, Release 13) on different PCs. The test results are shown in Table 3.1.

Table 3.1 Performance of the spike sorting system on several PCs.

Execution Time              Pentium III   AMD     AMD        Pentium IV   Average   Best
                            1.2G          1.37G   1.2G Duo   3.3G
Training (s)                8.33          3.42    1.45       1.31         3.22      1.31
Classification (720         0.42          0.14    0.05       0.04         0.16      0.04
data points) (s)
Classification (per         0.58          0.19    0.06       0.05         0.21      0.05
data point) (ms)

1. The Pentium III 1.2G and AMD 1.37G systems both have 512 MB DDRAM.
2. The AMD 1.2G Duo system has 4 GB DDRAM.
3. The Pentium IV 3.3G system has 1 GB DDRAM.

The performance of making classifications on a PC is pretty good. With the existing models, a spike waveform can be classified in less than 0.2 ms. The classification process can therefore meet the real time requirement, since the typical neural spike period is around 2-3 ms. However, the training process ran much slower than the classification process. The best performance we got was 1.31 seconds
%!ACCEL SHAPE LogProLXi_addr(5)
    LogProLXi_addr(m_6_1) = accel_bitshl(LogProLXi(m_6_1), 8, q1, q_addr);
end

%!ACCEL SHAPE lut_exp_ProLXi(1024, ncluster)
%!ACCEL TILE m_6
%!ACCEL MEM_MAP ProLXi(m_6) TO ram_s18_s18(m_6+5) AT 0
%!ACCEL MEM_MAP lut_exp_ProLXi(m_6) TO ram_s18_s18(m_6+10) AT 0
for m_6 = quantize(q_int, 1:ncluster)
    %!ACCEL SHAPE ProLXi(ndata, ncluster)
    ProLXi(id, m_6) = quantize(q_lut_exp, lut_exp_ProLXi(LogProLXi_addr(m_6)+1, m_6));
end

%!ACCEL SHAPE mid_priors(5)
%!ACCEL TILE m_7
for m_7 = quantize(q_int, 1:ncluster)
    mid_priors(m_7) = quantize(q_mid_priors, mid_priors(m_7) + ProLXi(id, m_7));
end

%!ACCEL SHAPE mid_centers(ndim, ncluster)
%!ACCEL SHAPE mid_centers_temp_1(ncluster)
%!ACCEL SHAPE mid_centers_temp_2(ncluster)
%!ACCEL REGISTER mid_centers_temp_1
%!ACCEL REGISTER mid_centers_temp_2
for d_2 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_i_1
    %!ACCEL USE embedmults(15+m_i_1)
    for m_i_1 = quantize(q_int, 1:ncluster)
        mid_centers_temp_1(m_i_1) = quantize(q1, indatabuf(d_2) * ProLXi(id, m_i_1));
        mid_centers_temp_2(m_i_1) = quantize(q1, mid_centers_temp_1(m_i_1) + mid_centers(d_2, m_i_1));
        mid_centers(d_2, m_i_1) = quantize(q_mid_centers, mid_centers_temp_2(m_i_1));
    end
end

id = quantize(q_int, id + 1);
if id == quantize(q_int, ndata + 1)
    % Here is a 30 bit multiplier
    step_incr = quantize(q1, LogLikeHoodX - newLikeHood);
    % To calculate the increase of the likelihood
    newL
Table 5.6 Device utilization summary of the Xilinx Virtex II FPGA.

FPGA resources          Number of used units   Number of available units   Percentage of the available resources
Slices                  6378                   14336                       44%
Slice Flip Flops        4790                   28672                       16%
4 input LUTs            11473                  28672                       40%
Bonded IOBs             293                    720                         40%
Embedded Multipliers    42                     96                          43%
BlockRAMs               58                     96                          60%
GCLKs                   1                      16                          6%

Table 5.7 The post synthesis timing report.

System Parameters               Performance
Minimum Period                  15.987 ns
Maximum Frequency               62.551 MHz
Maximum input arrival time      9.901 ns
Maximum output required time    7.333 ns

Figure 5.8 Floorplan of the EM algorithm implementation on a Virtex II FPGA.

FPGA Result Evaluation

Since we used fixed point data in the FPGA implementation, the precision of the output was not as good as in the floating point implementation. The LUTs used for the math operations further decreased the precision of the output. We compared the final GMM parameters from the floating point implementation and the fixed point FPGA implementation.

Figure 5.9(b) shows the plot of the resulting means of each cluster from the FPGA implementation. The difference is quite small compared to the floating point results (Figure 5.9(a)). Table 5.8 shows the average errors between the FPGA implementation and the floating point implementation. We can see that the differences are small.

Figure 5.9 The mean vectors of the GMMs. (a) is the floating point result; (b) is the fixed point result.
    SetIOP(DMAC10 + (dma_channel - 10), DMAC_FLUSH);
    SetIOP(DSP_offset + DMAC10 + (dma_channel - 10), DMAC_FLUSH);

    return TRUE;
}

int SetSem(int *sem)
{
    while (set_semaphore((int *)sem, 1, 1) == 0)
        ;
    return 0;
}

int ClearSem(int *sem)
{
    while (set_semaphore((int *)sem, 0, 1) == 0)
        ;
    return 0;
}

/* End of lib.c */

APPENDIX D

FIXED POINT MATLAB CODE USING ACCELFPGA

AccelChip.m

warning off
% Directive to indicate the target for synthesis
%!ACCEL TARGET XC2V3000

% package quantizer constants
% q_data = [18 6]          for data input
% q1 = [30 13]             for all the accumulators and middle results
% q_lut_cov_inv = [18 3]   for cov_inv look up table
% q_lut_cov_log = [18 13]  for cov_log look up table
% q_lut_exp = [18 13]      for the first exp look up table
% q_lut_log = [18 13]      for log look up table
% q_lut_prior = [18 13]    for prior inv look up table
% q_int = [18 0]           for integers
% q_fra = [18 17]          for fractional numbers

% Define the width of the word to be 30 with the binary point at 13.
% Just simplify the whole process using one fixed point length.
% May be able to divide the algorithm into different precisions.
wl = 30;
fp = 13;

% This word length is for the input data; 18 bits can fit in the block memory.
wl_data = 18;
fp_data = 6;

% quantize for integer
TURE MODEL BASED SPIKE SORTING SYSTEM

by

Yongming Zhu

A thesis submitted in partial fulfillment
of the requirements for the degree

of

Master of Science

in

Electrical Engineering

MONTANA STATE UNIVERSITY
Bozeman, Montana

November 2003

COPYRIGHT

by

Yongming Zhu

2003

All Rights Reserved

APPROVAL

of a thesis submitted by

Yongming Zhu

This thesis has been read by each member of the thesis committee and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the College of Graduate Studies.

Dr. Ross K. Snider
Signature                                        Date

Approved for the Department of Electrical and Computer Engineering

Dr. James Peterson
Signature                                        Date

Approved for the College of Graduate Studies

Dr. Bruce
Signature                                        Date

STATEMENT OF PERMISSION TO USE

In presenting this thesis in partial fulfillment of the requirements for a master's degree at Montana State University, I agree that the Library shall make it available to borrowers under rules of the Library.

If I have indicated my intention to copyright this thesis by including a copyright notice page, copying is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for permission for extended quotation from or
        sub_cov_2(m_i_3) = quantize(q_small, indatabuf(d_4) - centers(d_4, m_i_3));
        square_cov(m_i_3) = quantize(q_square_cov, sub_cov_1(m_i_3) * sub_cov_2(m_i_3));
        mult_cov(m_i_3) = quantize(q1, square_cov(m_i_3) * ProLXi(id, m_i_3));
        mid_covars_temp(m_i_3) = quantize(q_mid_covars, mid_covars(d_4, m_i_3) + mult_cov(m_i_3));
        mid_covars(d_4, m_i_3) = mid_covars_temp(m_i_3);
    end % i
end

id = quantize(q_int, id + 1);
if id == quantize(q_int, ndata + 1)
    loop_flag = quantize(q_int, loop_flag + 1);
    id = quantize(q_int, 1);

    for d_5 = quantize(q_int, 1:ndim)
        %!ACCEL SHAPE mid_covars(ndim, ncluster)
        %!ACCEL TILE m_12
        %!ACCEL MEM_MAP mid_covars(m_12) TO ram_s18_s18(m_12+30) AT 0
        %!ACCEL MEM_MAP covars(m_12) TO ram_s18_s18(m_12+25) AT 0
        %!ACCEL USE embedmults(m_12+40)
        for m_12 = quantize(q_int, 1:ncluster)
            covars(d_5, m_12) = quantize(q_covars, mid_covars(d_5, m_12) * mid_priors_inv(m_12));
            covars_test(m_12) = quantize(q1, covars(d_5, m_12));
        end
    end

    % Look up table for inv and log of covars
    % >>>>>>>>> [covars_inv, covars_log] = fun_lut_covars(covars)
    %!ACCEL SHAPE covars_addr(1)
    %!ACCEL SHAPE covars_inv(ndim, ncluster)
    %!ACCEL SHAPE covars_log(ndim, ncluster)
    for d_6 = quantize(q_int, 1:ndim)
        if covars(d_6, 1) > upper_const
            covars(d_6, 1) = quantize(q_covars, upper_const);
        end
    end

    for d_9 = quantize(q_int, 1:ndim)
        if covars(d_9, 2) > upper_const
            covars(d_9, 2) = quantize(q_covars, upper_cons
acellular recording waveform shows different action potentials from an unknown number of local neurons. The data were recorded from an adult female cricket's cercal sensory system. Data courtesy of Alex ................ 2

1.2 A typical neural signal recording, detecting and spike sorting
system ...................................................................... 3

2.1 Non-Gaussian density estimation using a GMM ............................ 10

2.2 Illustration of the spike generating process ............................ 12

2.3 A multi-variate Gaussian model of the spike waveforms generated
from a particular neuron. The solid line shows the mean vector of
this model while the dashed lines show the three-sigma boundary
according to the covariance matrix. All the waveforms in this
model are shown in the ..................................................... 13

2.4 Block diagram of the Bayesian decision rule used for the spike
sorting system. The log probability for a spike being in a cluster is
computed for each individual cluster and the highest scoring
cluster is the identified .................................................. 15

2.5 Diagram of EM algorithm for GMM parameter estimation ................... 23

3.1 Plot of all the 720 neural spike waveforms .............................. 26

3.2 The increase in likelihood due to the increase of cluster
ally include more waveforms in this cluster where some spike shapes are noticeably different. Using the diagonal covariance matrix approach, the small variance was captured and a small cluster was generated to represent this specific spike shape. For our data set, the diagonal covariance approach lost the least amount of information in the original spike waveforms. We chose this approach for all hardware implementations.

Implementation Result in Matlab

Using the initialization condition mentioned in Section 3.3.1, the training process for the 720 160-dimensional inputs took 8 iterations to converge. The mean waveform and priors of each cluster are shown in Figure 3.6. Figure 3.7 shows the clustering results for the 720 training data using the GMM.

As you can see in Figure 3.7, all the main clusters have been recognized. Cluster 1 and Cluster 2 are the clusters with the most data. Cluster 3 and Cluster 4 can be regarded as noisy versions of Cluster 1 and Cluster 2: their mean vectors are very close to those of Cluster 1 and Cluster 2, but their waveforms are noisier, so the covariances of these two clusters are bigger. Cluster 5 has its own specific spike shape. It has the fewest spikes among the five clusters, so it has the smallest prior probability. Overall, this 5-component Gaussian Mixture Model models the whole data set quite well from visual inspection.

[Figure 3.6 The mean waveform and prior probability of each cluster. The cluster priors are 0.041667, 0.16245, 0.1903, 0.25275 and 0.35282.]
for m_4 = quantize(q_int, 1:ncluster)
    c_temp = quantize(q1, priors(m_4) * a_lut_out(m_4));
    c = quantize(q1, c + c_temp);
end

if c > 0.5
    c = quantize(q1, 0.4995);
end

%!ACCEL REGISTER c_addr
%!ACCEL REGISTER c_lut_out
%!ACCEL SHAPE c_addr(1)
c_addr = accel_bitshl(c, 11, q1, q_addr);

%!ACCEL SHAPE lut_log_content(1024)
%!ACCEL SHAPE c_lut_out(1)
c_lut_out = quantize(q_lut_log, lut_log_content(c_addr+1));

%!ACCEL REGISTER LogLikeHoodXi
%!ACCEL SHAPE LogLikeHoodXi(1)
%!ACCEL SHAPE LogLikeHoodX(1)
LogLikeHoodXi = quantize(q_loglike, c_lut_out + maxa);

%!ACCEL REGISTER LogLikeHoodX_temp
%!ACCEL SHAPE LogLikeHoodX_temp(1)
%!ACCEL USE embedmults(15)
LogLikeHoodX_temp = quantize(q1, LogLikeHoodX + LogLikeHoodXi * ndata_inv);
LogLikeHoodX = quantize(q1, LogLikeHoodX_temp);

%!ACCEL SHAPE LogProLXi(5)
LogProLXi(1) = quantize(q1, LogProXiL(1) + priors_log(1) - LogLikeHoodXi);
LogProLXi(2) = quantize(q1, LogProXiL(2) + priors_log(2) - LogLikeHoodXi);
LogProLXi(3) = quantize(q1, LogProXiL(3) + priors_log(3) - LogLikeHoodXi);
LogProLXi(4) = quantize(q1, LogProXiL(4) + priors_log(4) - LogLikeHoodXi);
LogProLXi(5) = quantize(q1, LogProXiL(5) + priors_log(5) - LogLikeHoodXi);

for m_5_2 = quantize(q_int, 1:5)
    if LogProLXi(m_5_2) > 4
        LogProLXi(m_5_2) = quantize(q1, 3.998);
    end
end
for m_5_3 = quantize(q_int, 1:5)
    if LogProLXi(m_5_3) < 0
        LogProLXi(m_5_3) = quantize(q1, 0);
    end
end

for m_6_1 = quantize(q_int, 1:ncluster)
    %!ACC
centers_4 = quantize(q_data, load('centers_4.dat'));
centers_5 = quantize(q_data, load('centers_5.dat'));
centers(:,1) = centers_1;
centers(:,2) = centers_2;
centers(:,3) = centers_3;
centers(:,4) = centers_4;
centers(:,5) = centers_5;

% Define constants
log_const = quantize(q1, 147.0302);   % log_const = log(2*pi) * ndim / 2
ndim = quantize(q_int, 160);
ndata = quantize(q_int, 720);         % 720 training spikes (ndata_inv below is 1/720)
ncluster = quantize(q_int, 5);
stop_thre = quantize(q_int, 6.25);
MaxLoop = quantize(q_int, 27);
LoopCount = quantize(q_int, 1);
loop_flag = quantize(q_int, 1);
id = quantize(q_int, 1);

centers_test = zeros(1, 5);
covars_test = zeros(1, 5);
upper_const = quantize(q_fra, 0.2498);
covars = quantize(q_covars, zeros(ndim, ncluster) + 1);
priors = quantize(q_priors, zeros(1, ncluster) + 0.2);
ndata_inv = quantize(q_fra, 0.0014);
const_zero = quantize(q_int, 0);

% The middle results we need
indatabuf = zeros(1, ndim);
LogProXiL = quantize(q1, zeros(1, ncluster));
LogLikeHoodXi = quantize(q_loglike, 0);
LogLikeHoodX = quantize(q1, 0);
ProLXi = quantize(q_lut_exp, zeros(ndata, ncluster));
mid_priors = quantize(q_mid_priors, zeros(1, ncluster));
mid_centers = quantize(q_mid_centers, zeros(ndim, ncluster));
mid_covars = quantize(q_mid_covars, zeros(ndim, ncluster));
mid_priors_inv = quantize(q_lut_inv, zeros(1, ncluster));
newLikeHood = quantize(q1, 0);
covars_inv = quantize(q_lut_cov_inv, zeros(ndim, ncluster) + 1);   % initialize the cov_inv
atlab in a similar way. A list of the LUTs is found in Appendix D.

Parallelism Needed for FPGA Implementation

To fully use the computational power of the FPGA, we needed to parallelize the EM algorithm. In Matlab, programs are executed sequentially according to the program sequence. AccelChip offers directives which tell the compiler when to parallelize the algorithm and when to use the embedded FPGA resources.

The following are some directives that have been used extensively in the Matlab program:

Table 5.4 List of directives that can be used in AccelFPGA.

REGISTER   Use hardware registers for specific variables to add pipeline stages.
MEM_MAP    Map an array variable to a specific block memory on the FPGA.
TILE       Specify an array of computations to happen concurrently.
USE        Specify the hardware resources, such as embedded multipliers, for particular operations.

The code in List 5.1 shows how to actually use these directives in the Matlab code. The operation in the for loop is parallelized using the TILE directive. Block memories are used for each array variable and an embedded multiplier is used for each multiplication. By using these directives, the multiplication operations applied to all 5 clusters can be executed in one clock cycle.

List 5.1 Example code for using directives in Matlab.

%!ACCEL TILE m_9
%!ACCEL MEM_MAP mid_centers(m_9) TO ram_s18_s18(m_9+20) AT 0
%!ACCEL MEM_MAP centers(m_9
/* End and Wait and Start */
SetIOP(SEM_GLOBAL, 1);
while ((GetIOP(SEM_LOCAL_1) | GetIOP(SEM_LOCAL_2) | GetIOP(SEM_LOCAL_3)) != 0)
    ;
SetIOP(SEM_GLOBAL, 0);

/* Clear all the log version matrices */
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ndim; j++) {
        logcovars_master[i][j] = 0;   /* Clear the logcovars matrix */
        logcenters_master[i][j] = 0;  /* Clear the logcenters matrix */
    }
}

/* Calculate the probability in each cluster */
for (i = 0; i < nblockdata; i++) {
    for (j = 0; j < nclusters; j++) {
        logproXiL_master[j][i] = getproXi(centers[j], blockdata[i], covars[j]);
    }
}

/* Calculate the likelihood of each data point in this model */
for (i = 0; i < nblockdata; i++) {
    /* Clear the store matrix */
    loglhXi_master[0][i] = 0;
    /* Get the max number of each column */
    for (j = 0; j < nclusters; j++) {
        if (j == 0) {
            maxTemp = logproXiL_master[0][i];
        } else {
            if (logproXiL_master[j][i] > maxTemp)
                maxTemp = logproXiL_master[j][i];
        }
    }
    /* Normalize the probabilities to [0, 1] */
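The max-finding loop above is the heart of the log version of the algorithm: the per-cluster log probabilities of a 160-dimensional spike are large negative numbers, so exponentiating them directly would underflow to zero. Subtracting the column maximum first keeps every exponential in (0, 1]. A self-contained sketch of the idea (illustrative names, not the DSP code itself):

    #include <math.h>

    /* Sketch: numerically stable log(sum_j exp(lp[j])) over n clusters. */
    float log_sum_exp(const float *lp, int n)
    {
        float m = lp[0];
        for (int j = 1; j < n; j++)
            if (lp[j] > m)
                m = lp[j];            /* column maximum */
        float s = 0.0f;
        for (int j = 0; j < n; j++)
            s += expf(lp[j] - m);     /* each term is now in (0, 1] */
        return m + logf(s);
    }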
d hardware will reduce the training time needed to develop sophisticated models, which will allow more time to be devoted to data collection. This is important in electrophysiology, where limited recording times are the norm.

We implement a Gaussian Mixture Model based spike sorting system on both a DSP system and a Field Programmable Gate Array (FPGA) platform. The FPGA implementation speeds up the system dramatically, which will allow an online spike sorting system to be implemented.

Gaussian Mixture Model

If we view the measured spike signal as the combination of the original action potentials with additive random background noise, we can use a statistical process to model the spike signal generating process. Clustering can then be viewed as modeling the statistical distribution of the observed data. The whole data set can then be modeled with a mixture model, with each cluster being a multivariate Gaussian distribution.

Gaussian mixture models have been found very useful in many signal processing applications, such as image processing, speech signal processing and pattern recognition [17, 18]. For the spike sorting problem, a number of studies based on the Gaussian Mixture Model have also been tried to provide statistically plausible, complete solutions. These studies have shown that better performance can be obtained by using a GMM than by other general clustering methods, such as K-means [19], fuzzy c-means [20] and ne
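For reference, the mixture density being described (Eq. 2.1 and Eq. 2.2 of the thesis) has the standard form

$$p(\mathbf{x} \mid \Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x} \mid \theta_i), \qquad \sum_{i=1}^{M} \alpha_i = 1,$$

where each component is a $d$-dimensional Gaussian

$$p_i(\mathbf{x} \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i)\right).$$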
d increase is shown in the bar plot. The dashed line shows the threshold value.

Initialization of the EM Algorithm

The selection of the initial model parameters will affect the final results. This is because the likelihood surface associated with a GMM tends to exhibit multiple local maxima; different initialization points will lead to different local maximum results. A generally used method is to apply the K-means clustering algorithm first and then use its output as the initial value of the mixture model's parameters [16]. However, the K-means method itself is also sensitive to initialization and is not guaranteed to provide the right initialization for the mixture model. Furthermore, the K-means algorithm is computationally expensive to implement.

Another, more straightforward method is to randomly select the initialization from the data set. It has been shown that this method gives the same performance as the K-means method [24]. In this paper, we initialize the means to randomly chosen data points and pick a single covariance matrix for all the clusters. We set the values of the initialized covariance matrix to be reasonably large so that each cluster can see all the data at the beginning of the training. We randomly picked 100 starting points from the data set and chose the result that provided the largest likelihood. Figure 3.3 shows the means of the 5 initialization clusters. The covariance matrix Σ and prior probabilities
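A minimal sketch of the initialization just described, with the means drawn from the data and one broad shared diagonal covariance. The uniform starting priors and the covariance value of 100 are illustrative assumptions; the text above only requires the covariance to be reasonably large:

    #include <stdlib.h>

    #define NDATA 720
    #define NDIM  160
    #define M     5    /* number of clusters */

    /* Sketch: one random restart of the GMM initialization. */
    void init_gmm(float data[NDATA][NDIM], float means[M][NDIM],
                  float covar[NDIM], float priors[M])
    {
        for (int k = 0; k < M; k++) {
            int idx = rand() % NDATA;       /* randomly chosen data point */
            for (int d = 0; d < NDIM; d++)
                means[k][d] = data[idx][d];
            priors[k] = 1.0f / M;           /* assumed uniform starting priors */
        }
        for (int d = 0; d < NDIM; d++)
            covar[d] = 100.0f;              /* large shared diagonal covariance */
    }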
d was the diagonal matrix, as shown in the next section.

[Figure 3.5 Results of the three different approaches. The diagonal covariance approach was the only method that captured the small cluster.]

As we mentioned in the previous section, it was hard to validate the clustering result since we didn't actually know the typical spike shapes that came from individual neurons. However, by looking at the mean vectors of each cluster, we had a good idea about the ability of each approach to capture the clusters in the data. Figure 3.5 shows all the cluster means for the three methods. All the spikes are in yellow and the mean vector of each cluster is shown in a different color. While all three approaches caught the main clusters, the diagonal matrix approach did the best at catching the smallest cluster (red line). The other two methods failed to catch this smallest cluster. Notice that in the reconditioned approach, the prior probabilities for the small clusters tend to be big. This is because adding a number to the covariance prevents the cluster from being small. Also notice that the biggest cluster in the reconditioned and PCA methods is different from the one in the diagonal covariance matrix method. This is due to the inability of these two methods to distinguish small variances in the low amplitude waveforms. They actu
e Expectation Maximization (EM) algorithm. EM algorithms have been widely used in estimating the parameters of a stochastic process from a set of observed samples. Details about the ML and EM algorithms will be discussed in the next section.

After estimation of the model parameters, classification is performed by the Bayesian decision rule. The probability that each observation $x_i$ belongs to a particular class $\omega_k$ is calculated by using Bayes' rule (Eq. 2.3):

$$p(\omega_k \mid x_i) = \frac{p(x_i \mid \omega_k)\, p(\omega_k)}{\sum_{j=1}^{M} p(x_i \mid \omega_j)\, p(\omega_j)} \qquad \text{(Eq. 2.3)}$$

Each individual density $p(x_i \mid \omega_k)$ is determined by the means and covariance of each component density using Eq. 2.2. The prior probability $p(\omega_k)$ is identical to the weight $\alpha_k$ in Eq. 2.1.

Since the cluster membership is probabilistic, the cluster boundaries can be computed as a function of a confidence level. The spikes that don't fall into any cluster will be considered noise, or overlapping spikes that can be dealt with later. Mathematically, it can be shown that if the model is accurate, the Bayesian decision boundaries will be optimal, resulting in the fewest number of misclassifications [23].

[Figure 2.4 Block diagram of the Bayesian decision rule used for the spike sorting system. The log probability is computed for each individual cluster (Cluster 1 through Cluster M), and a Max cluster-selection stage picks the highest scoring cluster as the result.]
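Because the denominator of Eq. 2.3 is the same for every class, the decision in Figure 2.4 reduces to an argmax over the per-cluster log scores. A sketch (names are illustrative, not the thesis code):

    /* Sketch: pick the cluster with the highest log p(x|w_k) + log p(w_k). */
    int classify_spike(const float *log_px_given_k, const float *log_prior, int M)
    {
        int best = 0;
        float best_score = log_px_given_k[0] + log_prior[0];
        for (int k = 1; k < M; k++) {
            float s = log_px_given_k[k] + log_prior[k];
            if (s > best_score) {
                best_score = s;
                best = k;
            }
        }
        return best;   /* index of the identified cluster */
    }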
#include <vector.h>
#include <21160.h>
#include <def21160.h>
#include "speeddsp.h"
#include "diag.h"
#include "master_2.h"

#define DMA 1                 /* Use DMA channel for external memory data transactions */
#define GLOBAL 0x0750000      /* Global flag */
#define DSP_JUMP 0x0100000    /* Memory address jump between DSPs */
#define LH_BLOCK_1 0x0250002  /* Base address of likelihood of 1st block */
#define LC_FLAG_1 0x0250001   /* Base address of local flag */
#define MULTI_BASE 0x0700000
#define EXTERNAL_BASE 0x0800000
#define SLAVE_BASE_1 0x0200000
#define SLAVE_BASE_2 0x0300000
#define SLAVE_BASE_3 0x0400000
#define START 1               /* Define start for global flag */
#define STOP 0                /* Define stop for global flag */
#define nprocessors 4         /* How many processors in the system */
#define PR_BLOCK_1 0x0250004  /* Base address for middle results of DSP 1 */
#define P_BLOCK_1 0x025064C   /* Base address for transfer results to DSP 1 */
#define GLOBAL_P 0x075064C    /* Global address for transfer results to all DSPs */
#define PRIOR_LENGTH 3        /* Length of priors array */
#define CANDC_LENGTH 800      /* Length of centers and covars array */
#define P_BLOCK_LENGTH 1606   /* Length of the parameters block */
#define SEM_GLOBAL ((int)MSGR0 + (int)MULTI_BASE)
#define SEM_LOCAL_1 ((int)MSGR1 + (int)SLAVE_BASE_1)
#define SEM_LOCAL_2 ((int)MSGR2 + (int)SLAVE_BASE_2)
#define SEM_LOCAL_3 ((int)MSGR3 + (int)SLAVE_BASE_3)
#define SEM_DMA_WRITE
e appropriate for the likelihood $\mathcal{L}(\Theta \mid X, \mathbf{y})$. Given $\Theta^g$, we can easily compute $p(x_i \mid \Theta^g)$ for each data vector $x_i$ and each component density $y_i$. In addition, the mixing parameters $\alpha_y$ can be thought of as prior probabilities of each mixture component, that is, $\alpha_y = p(\text{component } y)$. Therefore, using Bayes's rule, we can compute

$$p(y_i \mid x_i, \Theta^g) = \frac{\alpha_{y_i}^g\, p_{y_i}(x_i \mid \theta_{y_i}^g)}{\sum_{k=1}^{M} \alpha_k^g\, p_k(x_i \mid \theta_k^g)} \qquad \text{(Eq. 2.14)}$$

and

$$p(\mathbf{y} \mid X, \Theta^g) = \prod_{i=1}^{N} p(y_i \mid x_i, \Theta^g) \qquad \text{(Eq. 2.15)}$$

Here we assume that each neuron fires independently of the others. Now we have obtained the desired marginal density by assuming the existence of the hidden variables and making a guess at the initial parameters of their distribution. We can now start the E-step of the EM algorithm and use this marginal density function to calculate the expected value of the log likelihood of the whole data set. The expected value is found to be (the derivation can be found in Appendix A):

$$Q(\Theta, \Theta^g) = \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log(\alpha_\ell)\, p(\ell \mid x_i, \Theta^g) + \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\big(p_\ell(x_i \mid \theta_\ell)\big)\, p(\ell \mid x_i, \Theta^g) \qquad \text{(Eq. 2.16)}$$

With the expression for the expected value of the likelihood, we can go to the M-step to maximize this expected likelihood with respect to the parameters $\Theta = (\alpha, \mu, \Sigma)$. We can maximize the term containing $\alpha_\ell$ and the term containing $\theta_\ell$ independently, since they are not related.

To find the expression for $\alpha_\ell$, we have to solve Eq. 2.17 with the constraint of Eq. 2.18:

$$\frac{\partial}{\partial \alpha_\ell}\left[\sum_{\ell=1}^{M}\sum_{i=1}^{N} \log(\alpha_\ell)\, p(\ell \mid x_i, \Theta^g)\right] = 0 \qquad \text{(Eq. 2.17)}$$

$$\sum_{\ell=1}^{M} \alpha_\ell = 1 \qquad \text{(Eq. 2.18)}$$
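Introducing a Lagrange multiplier for the constraint of Eq. 2.18 and solving gives the standard mixture weight update, consistent with the derivation in Appendix A:

$$\alpha_\ell^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} p(\ell \mid x_i, \Theta^g).$$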
e status of the Master processor. Each slave processor had its own local semaphore which told the Master DSP its status.

Table 4.1 List of semaphores and their functions.

                        SET (1)                         CLEAR (0)
Master Semaphore        Master processor is running     Master task finished
Local Semaphores 1-3    Slave processor is running      Slave task finished
DMA Semaphore           DMA bus available               DMA bus occupied

The system diagram is shown in Figure 4.3. The algorithm was divided into 6 phases. The phases in shaded blocks were implemented on multiple DSPs. Notice that the master processor also acts like a slave while doing the parallel computation. Most of the computationally intensive parts, which were discussed in Section 4.2, were implemented on multiple DSPs. The parts implemented solely on the master processor were basically just additions. So, during the whole training process of the EM algorithm, most of the time all four processors were running at full speed.

Another significant amount of time was spent on transferring data between DSPs. The cluster bus ran at 50 MHz and was shared by all four DSPs as well as the on-board SDRAM. We first let the DSPs resolve the bus contention automatically. The on-chip arbitration logic of the ADSP-21160 processors determined which DSP was the current bus master based on the Bus Request (BR) pins of the DSPs. Multiple DSP implementations can be done in this mode. However, the approach
er methods, such as wavelet denoising [8], can also be used to remove the recording noise. The second stage is spike detection. This is usually achieved by a simple threshold method, in which spikes are picked up when the maximum amplitude is bigger than a manually set threshold value. However, other methods, including nonlinear energy operators [9], wavelet based detectors [10], and slope shape detectors [8], have been used to improve the detection performance, especially in low SNR situations. A new approach, in which a noise model is first generated and then the distance from this noise model is computed to identify spikes [7], is used to obtain the spikes for this paper. However, even in this approach a threshold must be chosen.

Once the data has been collected, features must be extracted to be used in classification. The features can be simple user-defined features such as peak spike amplitude, spike width, slope of initial rise, etc. Another widely used method for feature extraction is Principal Components Analysis (PCA) [11]. PCA is used to find an ordered set of orthogonal basis vectors that can capture the directions of largest variation in the data. The first K principal components describe the most variation in the data and can be used as features to classify the spikes. Wavelet decomposition has also been used to define spike features in a combined time-frequency domain. Under high background noise, another linear trans
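To illustrate the noise-model detection stage described at the start of this passage: under a Gaussian noise model, a natural distance is the variance-normalized squared deviation, and a spike is declared when it exceeds the user-set threshold. This sketch is an assumption about the shape of such a detector, not the exact method of [7]:

    /* Sketch: flag a spike when a window's distance from the Gaussian
       noise model (per-sample mean and variance) exceeds a threshold. */
    int is_spike(const float *window, const float *noise_mean,
                 const float *noise_var, int len, float threshold)
    {
        float dist = 0.0f;
        for (int i = 0; i < len; i++) {
            float d = window[i] - noise_mean[i];
            dist += d * d / noise_var[i];
        }
        return dist > threshold;
    }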
er mode, the cluster bus is still the bottleneck for speed in the multiple DSP system.

Performance on the DSP system could probably be enhanced by more manual tuning at the assembly language level. Using DSPs with a higher core speed or clustering more DSPs could also improve the performance. However, the improvement would be very limited and too much time and effort would be required. So, to gain better performance for the real time spike sorting system, we turned to another solution, namely Field Programmable Gate Arrays (FPGAs).

In the next chapter, we will implement our EM algorithm on a Xilinx Virtex II FPGA system. To implement the EM algorithm on FPGAs, a fixed point version of the algorithm was first developed. Also, specialized math operations had to be transformed to facilitate the implementation of the algorithm on an FPGA. These problems will be addressed in the next chapter and the implementation results will be presented.

CHAPTER 5

FPGA IMPLEMENTATION

Introduction

Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware that can be changed in electronic structure either statically or dynamically. Because the implementation is on the actual hardware, FPGAs can obtain a significant performance improvement over software implementations based on DSP chips or PCs [30].

One drawback of an FPGA implementation is that, for complex algorithms like the EM algorithm, it usually takes a long time to design and optimize the alg
ers without processor intervention.

The ADSP-21160 offers powerful features for implementing multiprocessing DSP systems. The external bus supports a unified address space that allows direct interprocessor access to each ADSP-21160's internal memory. Six link ports provide a second method of multiprocessing. Based on the link ports, a large multiprocessor system can be constructed in a 2D or 3D fashion.

Bittware Hammerhead PCI board and VisualDSP++

The structure of the Bittware Hammerhead PCI board is shown in Figure 4.1. This board has four Analog Devices ADSP-21160 processors, up to 512 MB of SDRAM, 2 MB of Flash memory, and two PMC mezzanine sites. Bittware's FIN ASIC is used to interface the ADSP-21160 DSPs to the 64-bit, 66 MHz PCI bus, the Flash memory, and a peripheral bus. The Hammerhead PCI's four DSPs can communicate with the host PC through the PCI bus. Four of the link ports are connected to other DSPs, which allows the DSPs to communicate through the link ports. Each processor is also connected to a common 50 MHz, 64-bit ADSP-21160 cluster bus, which gives it access to the other three processors and to up to 512 MB of SDRAM.

In our application, we used all four ADSP-21160 processors on the board as well as the 256 MB SDRAM. We used the cluster bus to communicate between the four DSPs and SDRAM. The cluster bus approach is relatively easy to implement. It was also faster than using the link ports.

We used Vis
estimation procedure is given.

Maximum Likelihood Estimation

Maximum likelihood parameter estimation is a general and powerful method for estimating the parameters of a stochastic process from a set of observed samples. We have a density function $p(\mathbf{x} \mid \Theta)$ that is governed by the set of parameters $\Theta$. We also have a data set of size $N$, supposedly drawn from this distribution, i.e., $X = \{x_1, x_2, \ldots, x_N\}$. We assume that these data vectors are independent and identically distributed (i.i.d.) with distribution $p(\mathbf{x} \mid \Theta)$. Therefore, the resulting probability for all the samples is

$$p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta) = \mathcal{L}(\Theta \mid X) \qquad \text{(Eq. 2.4)}$$

This function $\mathcal{L}(\Theta \mid X)$ is called the likelihood of the parameters given the data, or the likelihood function. The likelihood is treated as a function of the parameters $\Theta$ where the data $X$ is fixed. For maximum likelihood parameter estimation, our goal is to find the $\Theta$ that maximizes $\mathcal{L}(\Theta \mid X)$. That is, we wish to find $\Theta^*$ where

$$\Theta^* = \arg\max_{\Theta}\, \mathcal{L}(\Theta \mid X) \qquad \text{(Eq. 2.5)}$$

If $p(\mathbf{x} \mid \Theta)$ is simply a single Gaussian distribution with $\Theta = (\mu, \sigma^2)$, then we can set the derivative of the likelihood function to zero and solve Eq. 2.6 directly for $\mu$ and $\sigma^2$:

$$\frac{\partial\, \mathcal{L}(\Theta \mid X)}{\partial \Theta} = 0 \qquad \text{(Eq. 2.6)}$$

However, in our case using the GMM, $p(\mathbf{x} \mid \Theta)$ is a weighted sum of several single Gaussian distributions. Attempting to solve Eq. 2.6 directly for the GMM parameters $\Theta = (\alpha_i, \mu_i, \Sigma_i),\ i = 1, \ldots, M$, does not yield a closed form solution [25]. Thus we mu
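For the single-Gaussian case mentioned above, solving Eq. 2.6 gives the familiar closed-form estimates

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2,$$

which is exactly the kind of closed form that the mixture case lacks.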
forming method, based on entropy maximization, has been reported to generate good features for classification [12].

People use extracted features instead of the fully sampled waveform because of the "Curse of Dimensionality," which means the computational costs and the amount of training data needed grow exponentially with the dimension of the problem. In high dimensional space, more data are needed to estimate the model densities. Hence, with limited computational ability and training data, high dimensional problems are practically impossible to solve, especially in regards to statistical modeling. A person's inability to view high dimensional data is another reason to decrease the data dimension to two or three for manual clustering.

In this paper, we present a log version of the Expectation Maximization (EM) algorithm which solves the numerical problems arising from the high dimensional input based on the Gaussian Mixture Model. In this case, we can use the fully sampled spike waveforms, which preserve all the information in the neural spikes, as the classifier input.

The last stage is the clustering of spikes. In this stage, spikes with similar features are grouped into clusters. Each cluster represents the spikes that arise from a single neuron. In most laboratories, this stage is done manually with the help of visualization software. All the spikes are plotted in some feature space and the user simply draws ellipses or polygons around set
he specimens' bodies. Details about the experiments can be found in [27]. After the raw data was collected, neural spikes were detected by first identifying segments having no spikes. A noise model was generated by assuming that the noise was Gaussian. Then, spikes were detected if the distance between the recorded data and the noise model was bigger than a manually set threshold [7]. After the threshold was set, we got rid of any waveforms that were obviously not action potentials, leaving 720 spikes for training the EM algorithm. The 720 spikes are shown in Figure 3.1. Each spike lasts 3.2 ms and has 160 sampling points. This set of 720 160-dimensional vectors was used in the EM algorithm.

Figure 3.1 Plot of all the 720 neural spike waveforms.

Practical Problems Related to the EM Algorithm

Model Number Selection

The number of Gaussians for the mixture model must be selected before using the EM method. However, in spike sorting, we do not have a priori information about how many neurons are producing action potentials. So we have to estimate the model number using the whole data set.

Based on the probabilistic foundation of the GMM, we can view the choice of the optimal cluster number as the problem of fitting the data with the best model. This problem can be solved by applying the maximum likelihood estimate. However, the likelihood is a monotonically increasing function of the number of parameters. In other wo
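The selection rule implied by Figure 3.2 (grow the model until the gain in log likelihood from one more cluster drops below a threshold) can be sketched as follows; the array layout and names are illustrative, not the thesis code:

    /* Sketch: loglike[m] is the trained log likelihood using (m+1) clusters.
       Return the first model order whose improvement falls below thresh. */
    int select_model_order(const float *loglike, int max_models, float thresh)
    {
        for (int m = 1; m < max_models; m++) {
            if (loglike[m] - loglike[m - 1] < thresh)
                return m;   /* m clusters are enough */
        }
        return max_models;
    }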
newLikeHood = quantize(q1, LogLikeHoodX);

%!ACCEL SYNTHESIS OFF
newLikeHood_bin = num2bin(q1, newLikeHood);
newLikeHood_num = bin2num(q1_int, newLikeHood_bin);
fprintf('%d: LH is %d, Improve is %d, Num is %d, Bin is %s\n', ...
    LoopCount, newLikeHood, step_incr, newLikeHood_num, num2str(newLikeHood_bin));
%!ACCEL SYNTHESIS ON

if step_incr < 0
    step_incr = -step_incr;
end

% Function to get the inv of the mid_priors
% >>>>>>> mid_priors_inv = fun_lut_priors_mid_inv(mid_priors)
%!ACCEL SHAPE mid_priors_inv_addr(1)
%!ACCEL SHAPE mid_priors_inv(5)
%!ACCEL SHAPE lut_inv_content(1024)
for m_8 = quantize(q_int, 1:ncluster)
    mid_priors_inv_addr = accel_bitshl(mid_priors(m_8), 1, q1, q_addr);
    mid_priors_inv(m_8) = quantize(q_lut_inv, lut_inv_content(mid_priors_inv_addr+1));
    % Change to +2 for simulation
end

%!ACCEL TILE m_11
%!ACCEL USE embedmults(m_11+20)
for m_11 = quantize(q_int, 1:ncluster)
    priors(m_11) = quantize(q_priors, mid_priors(m_11) * ndata_inv);
    mid_priors(m_11) = quantize(q_mid_priors, const_zero);
end

q_test_mult = quantizer('floor', 'wrap', [36 21]);

for d_3 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_9
    %!ACCEL MEM_MAP mid_centers(m_9) TO ram_s18_s18(m_9+20) AT 0
    %!ACCEL MEM_MAP centers(m_9) TO ram_s18_s18(m_9+15) AT 0
    %!ACCEL USE embedmults(m_9+25)
    %!ACCEL SHAPE centers_test(5)
    for m_9
imized for high speed fixed point addition, subtraction and multiplication. However, a number of other math operations, such as division or exponentiation, have to be customized.

Using a Taylor series is one option, which can transform a complicated operation into simple additions and multiplications. However, for the EM algorithm, the dynamic ranges of these operations are quite large, so a very high order Taylor series would be required to obtain good approximations. Another way is to use a lookup table (LUT). Knowing the input range of these math operations, we can use a one-to-one lookup table to approximate them. By dividing the input dynamic range into substantially smaller intervals, this approach gains enough precision for the algorithm to converge. The BlockRAM on the Virtex II FPGA can be used to store the contents of the lookup tables, and their address buses can be used as the lookup table inputs. The lookup table operations can be done within one system clock cycle based on the BlockRAMs.

Nonlinear mapping and output interpolation can be used to improve LUT performance. Here, we just used a single linear LUT approach, in which we divided the input data space evenly into small intervals and mapped every point in the input space to the output space. Two factors, the input precision and the LUT size, have to be determined for each LUT depending on the input dynamic range. These two factors have the following relat
ing inside the FPGA. However, this would take much more work.

Another way to speed up the current system is to find the critical path of the FPGA implementation and add more registers. By increasing the pipeline depth of the critical path, we can increase the clock rate of the whole system. This could be explored in AccelFPGA by rewriting the Matlab code to produce more intermediate results and adding more compiler directives.

The current implementation on the Virtex II XC2V3000 FPGA used about half of the embedded hardware resources. More parallelism in the EM algorithm could be exploited by using more of the hardware resources in the FPGA. For example, in the training process for each cluster, the whole training data set could be divided into smaller blocks, and the computation on all the data blocks could be parallelized on the FPGA. Also, the use of multiple FPGAs could enable more complex implementations requiring more hardware resources than what is contained on a single FPGA.

BIBLIOGRAPHY

[1] A. L. Hodgkin & A. F. Huxley, 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500-544.

[2] F. Rieke, D. Warland, R. de Ruyter van Steveninck, & W. Bialek, 1997. Spikes: Exploring the Neural Code. MIT Press, Cambridge, Massachusetts.

[3] J. G. Nicholls
73. ionship     Dynamic Range    LUT size        Precision    Eq  5 1    Ideally  we want the dynamic range to be as large as possible while the precision  as small as possible  which leads to an infinitely large LUT size  However  in most of  the real world applications  the input dynamic range is limited to a certain range  The  precision requirement may vary for different applications  For the EM algorithm  all  we needed was to estimate the parameters for the GMM and be able to classify the  neural spikes  As long as the EM algorithm converges to the right solution  the  precision of each variable is not extremely important  Hence  a reasonable size of the    LUT can be used for each math operation in the EM algorithm and can be    implemented on the Virtex II FPGA platform              61    We built the LUTs using the BlockRAM in the Xilinx Virtex II FPGA  As  described in section 5 2 2  the BlockKRAM are 18 Kbit block memories  Each 18 Kbit  block memory can be configured into various address data aspect ratios as shown in  Table 5 1  We chose the 1K 18bit configuration for our LUT implementation  The  18bit output bit width gave enough precision for the LUT outputs  Furthermore  the  output width matched the embedded multipliers    input width which maximized the  usage of the embedded multiplier resources    After choosing the LUT size  which was 1024  1K  in our case  we found the  input precision by tracking the input dynamic range  Once the input dynamic range  was de
74. ize q1 priors 1     outdatabuf 7   quantize q1 priors 2     outdatabuf 8   quantize q1 priors 3     outdatabuf 9   quantize q1 priors 4     outdatabuf 10   quantize q1 priors 5     outdatabuf 11   quantize q1 newLikeHood      outdata_1 n    quantize ql outdatabuf 1    outdata 2 n    quantize q1 outdatabuf_2    outdata 3 n    quantize q1 outdatabuf_3    outdata 4 n    quantize q1 outdatabuf 4    outdata  5 n    quantize q1 outdatabuf 5    outdata_6 n    quantize ql outdatabuf 6    outdata 7 n    quantize q1 outdatabuf 7    outdata  8 n    quantize ql outdatabuf 8    outdata_9 n    quantize ql outdatabuf 9    outdata_10 n    quantize q1 outdatabuf 10   1  outdata 11 n    quantize q1 outdatabuf 11      ACCEL END HARDWARE  outdata l outdata 2 0utdata 3 outdata 4 outdata S outdata 11  end   end                             762 10392217 3          
lgorithm.

Figure 4.2 shows the most computationally intensive parts of the EM algorithm. The calculation of $p(x_i \mid \theta_\ell)$ and the update of the means and covariance matrices take about 70% of all the execution time. These two parts should be implemented on multiple DSPs. The calculation of $p(\omega_\ell \mid x_i)$ and $\mathcal{L}(\Theta \mid X)$ and the rest of the algorithm are relatively less computationally intensive. A close look at parts (a) and (b) shows that most of the computational operations are multiply and multiply-accumulate type instructions. Thus the floating point MAC inside the ADSP-21160 will be able to speed up the operations.

Parallel Transformation of the EM Algorithm

To fully use the computational power of the four DSPs on the Bittware board, the EM algorithm was transformed into a parallel version. The most straightforward way is to map the computation of each cluster onto a single DSP. However, there are several drawbacks to this approach.

First, there are five clusters in our model while there are only 4 DSPs on the Hammerhead board. Hence we would have to put the computation of two clusters on one DSP, creating an uneven computational load. This would reduce the speed of the whole system, since three DSPs would have to wait while the fourth dealt with the extra computation for the second cluster. Secondly, the update of the parameters is not independent among the clusters. Before updating the parameters of each cluster, the probability of each data point in the current
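The alternative suggested by the phase diagram in Figure 4.3 is to split the training data evenly across the processors, so each DSP evaluates all five clusters on its own 180-point block. A sketch of the bookkeeping (illustrative, not the Appendix C code):

    #define NPROCESSORS 4

    /* Sketch: assign each DSP a contiguous block of the 720 training points. */
    void partition_data(int ndata, int proc_id, int *start, int *count)
    {
        int block = ndata / NPROCESSORS;       /* 720 / 4 = 180 per DSP */
        *start = proc_id * block;
        *count = (proc_id == NPROCESSORS - 1)  /* last DSP takes any remainder */
                     ? ndata - *start
                     : block;
    }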
76. likely to generate a large number after many             Using  30 bit word length prevented overflow problems for the accumulation results    After deciding on the word length  the next step was to choose the binary point  for each variable  This step actually determined the precision we got from the FPGA    implementation  Since some parts of the algorithm have been transformed into the log                59    domain  the dynamic ranges of each variable are quite different  So with the same  word length  each variable will need its own binary point to meet the precision  requirement  This  would be quite hard to implement in a FPGA if an HDL coding  approach was used  Whenever a math operation occurs between variables with  different binary points  a shift is needed to align these variables  A lot of work has to  be done in the low level HDL programming to make sure every math operation has  the binary point aligned correctly  But in AccelFPGA  the compiler will automatically  align the binary point  This allows the binary point of each variable to be easily  changed and the algorithm tested to make sure the fixed point implementation is  sod correctly  The newest version of AccelFPGA  V2 0  can automatically find  the suitable ond length and binary point for your algorithm  However  at the time the  thesis was done  this feature was not available and all the binary points were tuned    manually in AccelFPGA V1 6     Using Lookup Tables    The structure of the FPGAs is opt
77. lloc  ndim  sizeof  float      meanTemp    float    calloc  ndim  sizeof  float         detcovTemp    float    calloc  ndim  sizeof  float    1    for  j 0 j lt ndim j         1  invcovTemplj    1 covars j                log version of the diagonal covar   vlog    DD detcovTemp covars ndim        det   prod cov   logdet   sum logcov   divider   vsum  sf  D detcovTemp 1 ndim    divider   divider   0 5     vecvsub blockdata centers meanTemp ndim        vecvmlt meanTemp meanTemp meanTemp ndim    vecvmlt meanTemp invcovTemp meanTemp ndim    proXiTemp   vsum  sf  D meanTemp 1 ndim    proXiTemp    0 5 proXiTemp   proXi   proXiTemp   divider   log_normal     free invcovTemp      free mean Temp    free detcovTemp      return  proXi           void block_sum float blocksums   float blockdata   ndim  float logproLXi    int offset        int i j 0     for G 0 i lt ndim i                      110    for       0 2  lt nblo ckdata j           blocksums i    blocksums i    blockdata j  i    logproLXi j      int extmem_to_intmem  int  extsrc  int  intdest  int num to copy  int dma  channel  int wait         int chan_offset    int DSP_offset    int New_intdest   int dma  status  bit   int testwait        if internal memory is in MASTER  no offset and intdest need not change         offset   0  5  New  intdest    int  intdest        Check for valid DMA channel    if   dma  channel  lt  10      dma  channel  gt  13     return  FALSE      chan  offset    dma  channel   10   lt  lt  3        if in
loating point result. (b) Fixed point result.

Figure 5.10 The clustering results of both floating-point and fixed-point implementations.

FPGA Performance

In the post-synthesis simulation, it took 0.6M clock cycles to finish one step of the EM algorithm in the FPGA. Running at 60 MHz, it took 0.6M x (1/60M) x 8 = 0.08 second to finish the training process. Table 5.10 shows that the FPGA implementation is 16.4 times faster than the fastest PC and 42.5 times faster than the parallel DSP implementation.

Table 5.10 Performance comparison between all the platforms.

                                    Single DSP  Four DSP  PC Average(1)  PC Best(2)  FPGA(3)
  Execution time (s)                13.5        3.4       3.22           1.31        0.08
  Speedup relative to best PC       0.1         0.4       0.4            1           16.4
  (times)

(1) Average over the following four systems: (1) AMD 1.37 GHz, 512 MB memory; (2) Pentium III, 1.0 GHz, 512 MB memory; (3) AMD 1.2 GHz Duo Processor, 4 GB memory; (4) Pentium IV, 3.2 GHz, 1 GB memory.
(2) Best performance, from the Pentium IV 3.2 GHz system with 1 GB memory.
(3) AccelFPGA V1.6, ISE V5.2.02i and Modelsim Starter 5.6e were used in the FPGA implementation.

CHAPTER 6

CONCLUSION AND FUTURE WORK

Conclusion

This thesis presents the implementation results of a Gaussian Mixture Model training process on three different platforms: PCs, multiple DSPs, and an FPGA. The Expectation Maximization algorithm was used to estimate the parameters of the GMM. The fastest implementation was obtained b
lock Manager (DCM) blocks. The functionality of these elements will be introduced later. All of these elements are interconnected by routing resources. The general routing matrix is an array of routing switches; each programmable element is tied to a switch matrix, allowing multiple connections to the general routing matrix. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. These values are loaded into the memory cells during configuration and can be reloaded to change the functions of the programmable elements.

Configurable Logic Blocks (CLBs) are the basic functional units of Xilinx FPGAs. These blocks are organized in an array and are used to build combinational and synchronous logic designs. Each CLB element comprises four similar slices. The four slices are split into two columns, with two independent carry-logic chains and one common shift chain. Figure 5.2 shows the details of a slice inside one CLB. Each slice includes two 4-input function generators, carry logic, arithmetic logic gates, wide-function multiplexers and two storage elements. Each 4-input function generator can be programmed as a 4-input LUT, 16 bits of distributed SelectRAM memory, or a 16-bit variable-tap shift register. Each CLB is wired to the switch matrix to access the general routing matrix, so it can be interconnected to other CLBs.
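The 4-input function generators are the key to this flexibility: a 4-input LUT can realize any Boolean function of its inputs simply by storing the function's 16-entry truth table. The small C sketch below is our own illustration of that idea, not code from the thesis; the lut4 helper and the XOR configuration constant are assumptions made for the example.

    #include <stdio.h>

    /* Evaluate a 4-input LUT: 'config' holds one output bit per input pattern. */
    static int lut4(unsigned short config, int a, int b, int c, int d)
    {
        int index = (d << 3) | (c << 2) | (b << 1) | a;  /* pack inputs into 0..15 */
        return (config >> index) & 1;
    }

    int main(void)
    {
        /* Example config: output 1 when an odd number of inputs are 1 (4-way XOR). */
        unsigned short xor4 = 0x6996;
        printf("%d\n", lut4(xor4, 1, 0, 1, 0));  /* prints 0, since 1^0^1^0 = 0 */
        return 0;
    }

Reprogramming the 16-bit constant changes the implemented logic function without touching the evaluation circuit, which is exactly what loading the FPGA's configuration memory does in hardware.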
mances on the PC were from Matlab implementations. The performance of the parallel DSP implementation was at the same level as the average PC configuration, but not as good as the high-end PCs.

Table 4.3 Performance of the EM algorithm on the DSP system and PCs.

                        Single DSP(3)  Four DSP(3)  PC Average(1)  PC Best(2)
  Execution time (s)    13.5           3.4          3.22           1.31

(1) Average over the following four systems: (1) AMD 1.37 GHz, 512 MB memory; (2) Pentium III, 1.0 GHz, 512 MB memory; (3) AMD 1.2 GHz Duo Processor, 4 GB memory; (4) Pentium IV, 3.2 GHz, 1 GB memory.
(2) Best performance, from the Pentium IV 3.2 GHz system with 1 GB memory.
(3) VisualDSP++ 3.0 was used in the DSP implementation.

There are two reasons why the parallel DSP method was similar to, if not slower than, the PCs. First, the core speed of the DSP is much lower than that of high-end PC CPUs. The core of the ADSP-21160 runs at 80 MHz, while the core clock of the Pentium IV runs at 3.2 GHz; the high-end PC is about 40 times faster than the DSP. In spite of the DSP core's structure being optimized for signal-processing operations, the overall floating-point performance of a high-end PC is much better than that of an individual DSP. The second reason is the bus speed. Because of the lack of on-chip memory on the DSPs, data has to be transferred between DSPs through the cluster bus. The cluster bus runs at 50 MHz, which is much slower than the front-side bus on the PCs, which runs at up to 133 MHz. Although using the DMA transf
model has to be calculated. In Eq. 4.1, the probability of each data point in each cluster has to be added together across all the clusters, so at this point all the middle results have to be sent to one DSP. Last but not least, the limited on-chip memory requires the DSPs to read data from external memory. Looking closely at the algorithm, the calculation of p(x_i | Θ_l), as well as the update of the means and covariances, needs the whole data set. In other words, during these calculations the DSPs have to fetch the value of every data point and feed it to the ALU for computing. However, with the limitation of the internal memory size, it is impossible to store the whole data set inside a single DSP. The only option is to read the data set from external RAM, i.e. the on-board SDRAM. These operations require extra overhead and slow down the process. They can also disable the SIMD capability of the DSP. Since all the DSPs and the SDRAM share the same 50 MHz cluster bus, multiple simultaneous uses of the bus will cause extra delays and even bus contention.

Due to these drawbacks, we decided to choose another approach. We divide the whole data set into small blocks, and each processor handles one block of data. This way, we eliminate the overhead of transferring data from external memory. With all the data inside the DSPs, we fully use the SIMD mode of the DSPs to accelerate the process. This also allows us to divide the computatio
n can also exhibit underflow problems if all of the terms p(x_i | ω_l, Θ) α_l are very small, which will occur if a spike doesn't belong to any cluster. So we still need to convert Eq. 3.10 into the log domain, as follows:

\ln p(\omega_l \mid x_i, \Theta) = \ln p(x_i \mid \omega_l, \Theta) + \ln \alpha_l - \ln p(x_i \mid \Theta)        Eq. 3.11

Since we have already computed \ln p(x_i \mid \omega_l, \Theta) from Eq. 3.8, it is fairly easy to get the result of Eq. 3.11. We can use exponentiation to get p(ω_l | x_i, Θ) back into the normal domain and begin the parameter updates. All of the parameter updates are done in the normal domain, since no underflow problems occur in this process. The new parameters are then used to calculate the likelihood of the whole data set, again in the log domain, to determine whether the iterative process should stop. The whole process can be seen in Figure 3.4.

Figure 3.4 Diagram of the revised EM algorithm. The E step implemented in the log domain is shown inside the dashed rectangle.

Using a Diagonal Covariance Matrix

While computing Eq. 3.2, we need the determinant of the covariance matrix |Σ_l|. However, it is difficult to get the determinant of the full covariance matrix since the data dimension is large. All the
n evenly for each DSP, regardless of the cluster number. By being independent of the cluster number, this approach also allows scaling to more DSPs. We still need to transfer data between DSPs, because the parameter updates are not independent, but the transfer is limited to the middle results of the parameters, which are much smaller in size than the whole data set. This transmission can be monitored by one master DSP. DMA mode was used to accelerate the parameter transfers.

We divided the whole data set into four blocks. Each block contained 180 data points. These blocks fitted into a single DSP's internal memory. One of the four DSPs was designated the master, while the other three were slaves. The master performed the same computational tasks as the slaves, except for two additional duties: cross-data calculation and the management of DMA transfers. Whenever a cross-data calculation was needed, all the middle results were sent from the slaves to the master through the cluster bus. When the master finished the calculation, the results were sent back to the slaves. When DMA mode was used, the master was the only DSP that had control of all the DSPs' DMA controllers. Synchronization between the DSPs was handled via semaphores. Semaphores were also used to prevent DMA bus contention. Table 4.1 shows all the semaphores that were used in the program. The master semaphore was a global semaphore that all slave processors read to know th
nt in a Gaussian Mixture Model. Details about using GMMs to model the spike sorting process will be described in the next section.

GMMs for spike sorting

Since action potentials are observed with no labeling of the neurons they arise from, spike generation can be considered a hidden process. To see how the hidden process lends itself to modeling the spike waveforms, consider the generative process in Figure 2.2.

Figure 2.2 Illustration of the spike generating process (neurons No. 1 to No. 4, each with a characteristic spike shape plus white noise, feeding the recorded spike train).

The neural spike train is considered to be generated from a hidden stochastic process. Each spike is generated by one of the neurons according to an a priori discrete probability distribution α_i for selecting neuron i. If we assume that the action potentials from each neuron have their own characteristic waveform shape (mean value), with the addition of white noise of different variance (covariance), then the observations from each neuron can be modeled by a multivariate Gaussian model, as seen in Figure 2.3. The observation sequence thus consists of feature vectors drawn from different statistical populations (one per neuron) in a hidden manner.

Figure 2.3
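To make this generative view concrete, the following C sketch (our illustration, not thesis code; the randn helper, array names and sizes are assumptions) draws a single spike from such a mixture: it selects a neuron with probability α_i and adds white Gaussian noise to that neuron's mean waveform.

    #include <math.h>
    #include <stdlib.h>

    #define NDIM 160   /* samples per spike waveform, as in the thesis data */
    #define M    5     /* number of neurons (mixture components)            */

    /* One standard-normal sample via the Box-Muller transform. */
    static double randn(void)
    {
        double u1 = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
    }

    /* Draw one spike: select neuron i with probability alpha[i], then emit
       its mean waveform plus white noise with per-sample variance var[i]. */
    static void draw_spike(const double alpha[M], const double mu[M][NDIM],
                           const double var[M][NDIM], double spike[NDIM])
    {
        double u = rand() / ((double) RAND_MAX + 1.0);
        double acc = 0.0;
        int i = 0, d;

        while (i < M - 1 && (acc += alpha[i]) < u)   /* inverse-CDF draw */
            i++;
        for (d = 0; d < NDIM; d++)
            spike[d] = mu[i][d] + sqrt(var[i][d]) * randn();
    }

The per-sample (diagonal) noise variance in this sketch mirrors the diagonal covariance matrices adopted later in the thesis.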
ntly isolate the action potentials arising from individual neurons.

Single nerve cell recording is possible using intracellular electrodes. Glass pipettes and high-impedance sharp electrodes are used to penetrate a particular nerve cell and monitor its electrical activity directly. This has been used to understand the mechanism of action potentials and many other neuronal phenomena [5, 6]. However, these electrodes are not practical for multi-neuron recording. For intact, freely moving animals, a small movement of an intracellular electrode will easily damage the nerve cell tissue. Furthermore, it is very difficult to isolate a large number of neurons in a small local region, and in awake animals the isolation sometimes lasts for only a very short period of time. Fortunately, since all that we need is the timing of the action potentials, it is possible to use extracellular electrodes to acquire this information. With larger tips than intracellular electrodes, extracellular electrodes can simultaneously record the signals of a small number (3 to 5) of neurons in a local region. Figure 1.1 shows a waveform comprised of action potentials recorded by one extracellular microelectrode. Each voltage spike in the waveform is the result of an action potential from one or more neurons near the electrode. The process of identifying and distinguishing the spikes that arise from different neurons is called "spike sorting".
numbers. The amount of the log likelihood increase is shown in the bar plot. The dashed line shows the threshold value .......... 28

3.3 Initialization of the EM algorithm .......... 29

3.4 Diagram of the revised EM algorithm. The E step implemented in the log domain is shown inside the dashed rectangle .......... 33

3.5 Results of three different approaches .......... 35

3.6 Clustering result from Matlab .......... 37

3.7 Clustering result for the training data. Five clusters are labeled using different colors ..........

LIST OF FIGURES (CONTINUED)

FIGURE                                                                                      PAGE

4.1 Block diagram of Hammerhead PCI system .......... 42

4.2 The diagram shows the most computationally intensive parts in the EM algorithm and their execution percentage in the whole process. (a) Calculation of p(x_i | Θ_l), probability of each data point in each cluster based on the current parameters. (b) Update of means and covariance matrices. (c) Calculation of p(ω_l | x_i, Θ) and L(x_i | Θ), probability of being in each cluster given an individual input data point, and likelihood of each data point in the current Gaussian Mixture Model. (d) Rest of the algorithm .......... 43

4.3 Diagram of multiple DSP implementation of EM algorithm ..........
nv, zeros(ndim, ncluster));
covars_log = quantize(q_lut_cov_log, zeros(ndim, ncluster));
covars_det = quantize(q1, zeros(1, ncluster));
priors_log = quantize(q_lut_log, zeros(1, ncluster) - 1.6094);   % ln(1/5)
result_covars = quantize(q_covars, zeros(ndim, ncluster));
result_centers = quantize(q_small, zeros(ndim, ncluster));
result_priors = quantize(q_priors, zeros(1, ncluster));

BUFSIZE = 160;              % one spike at a time
% Revise: use a frame input buffer, 160 points at once, to see how it works.
NUMSAMPS = 160 * 720;       % input data: 720 spikes, 160 samples each

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Main LOOP
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
while 1
    % data stream in
    %!ACCEL STREAM n
    for n = 1 : ndim : NUMSAMPS - ndim + 1
        % section targeted for hardware starts here
        %!ACCEL BEGIN HARDWARE indata_1 indata_2 indata_3 indata_4 indata_5
        %!ACCEL SHAPE indatabuf_1(ndim)
        %!ACCEL SHAPE indatabuf_2(ndim)
        %!ACCEL SHAPE indatabuf_3(ndim)
        %!ACCEL SHAPE indatabuf_4(ndim)
        %!ACCEL SHAPE indatabuf_5(ndim)

        indatabuf_1 = quantize(q_data, indata_1(n : n+ndim-1));
        indatabuf_2 = quantize(q_data, indata_2(n : n+ndim-1));
        indatabuf_3 = quantize(q_data, indata_3(n : n+ndim-1));
        indatabuf_4 = quantize(q_data, indata_4(n : n+ndim-1));
        indatabuf_5 = quantize(q_data, indata_5(n : n+ndim-1));

        %!ACCEL SHAPE indatabuf(ndim, ncluster)
        for d_0 = quantize(q_int, 1:ndim)
            indatabuf(d_0, 1) = indatabuf_1(d_0);
            indatabuf(d_0, 2) = indatabuf_2(d_0);
            indatabuf(d_0, 3) = indatabuf
oat priors[1][nclusters] = {0};

segment("seg_data") float data[ndata][ndim] = {
#include "rescal_720_data_directload.dat"
};

/* Data to be transmitted to external memory */
float blockdata[nblockdata][ndim] = {0};
segment("seg_pmda") float blockdata_diff[nblockdata][ndim] = {0};

float logproXiL_block_1[nclusters][nblockdata] = {0};
float logproLXi_block_1[nclusters][nblockdata] = {0};
float blocksums[ndim] = {0};
float cycle_count = 0;
float loglhXi_block_1[1][nblockdata] = {0};

/*------------------------------------------------
  Processor identification
  ------------------------------------------------*/
void main()
{
    int i = 0, j = 0;
    int i2 = 0;

    int loopCount = 0;
    int loopCountAll = 0;
    float testlog = 0;
    float testlog2;
    int ntasks = 0;
    int offset = 0;
    int flagsum = 0;
    int offsetTemp = 0;
    int data_offset = 0;
    int *data_start = 0;

    float start_count;
    float stop_count;

    int proTemp = 0;
    int prepro = 0;

    float logproXiLTemp;
    float maxTemp;

    int *global;
    float logproLXiTemp[ndata] = {0};
    float maxTempProLXi[nclusters] = {0};

    SetIOP( SEM_LOCAL, 0 );
    offset = 0;
    loopCount = 0;
    prepro = 0;
    testlog2 = logf( 2.71 );
    testlog = nclusters;

    /*------------------------------------------------
      Get the address of blockdata and transfer to Master
      ------------------------------------------------*/
    SetIOP( SEM_LOCAL, (int) blockdata );

    /*------------------------------------------------
ogLikeHoodXi);
LogLikeHoodX = LogLikeHoodX / ndata;
StoreLike(LoopCount) = LogLikeHoodX;
StopOrNot = abs(PrevLike - LogLikeHoodX);
PrevLike = LogLikeHoodX;

if StopOrNot < StopCri
    break;
end

ProLXi = exp(LogProLXi);

NewPrior2 = ones(1, ndata) * ProLXi';    % NO NEED TO DIVIDE BY NDATA
NewPrior = NewPrior2 / ndata;
NewMean = (data' * ProLXi') ./ repmat(NewPrior2, ndatadim, 1);

% Calculate the covariance
for i = 1:ncluster
    Smeandata = data' - repmat(NewMean(:, i), 1, ndata);
    Smeandata = Smeandata .* Smeandata;
    OldMgmm.covars(i, :) = (ProLXi(i, :) * Smeandata') / NewPrior2(i);
    % Prevent the covariance matrix going singular
    if OldMgmm.covars(i, :) < eps
        fprintf('&&\n');
        OldMgmm.covars(i, :) = ones(1, ndatadim);
    end
end
% End of covariance calculation

% Update the parameters
OldMgmm.centers = NewMean';
OldMgmm.priors = NewPrior;

end    % of the EM whole loop

StoreLike = StoreLike(1:LoopCount);
track = track(:, 1:LoopCount);

% Put the parameters in descending order according to the prior probabilities
[OldMgmm.priors, new_in] = sort(OldMgmm.priors);
OldMgmm.centers = OldMgmm.centers(new_in, :);
OldMgmm.covars = OldMgmm.covars(new_in, :);

APPENDIX C

C CODE FOR THE EM ALGORITHM ON THE PARALLEL DSP PLATFORM

/*------------------------------------------------
  For the Master DSP: Master.c
  ------------------------------------------------*/
#include <stdio.h>
#include <matrix.h>
#include <math.h>
#include <stdlib.h>
#includ
opOrNot > StopCri & LoopCount < MaxLoop)

    % Get old a, u and Sigma
    OldCov = OldMgmm.covars;      % old Sigma's
    OldMean = OldMgmm.centers;    % old u's
    OldPrior = OldMgmm.priors;    % old a's

    % Calculate the probability of each data point Xi in each cluster m = 1..L.
    % The ProXiL matrix stores the probability of each data point Xi in each
    % cluster m. Each column is a data point and each row is a cluster.
    % ProXiL is an L x N matrix.
    LogProXiL = zeros(ncluster, ndata);
    log_normal = log(2*pi) * ndatadim / 2;
    for m = 1:ncluster
        covTemp = OldCov(m, :) .^ (-1);
        detTemp = 0.5 * sum(log(OldCov(m, :)));
        for i = 1:ndata
            poten = (1/2) * (data(i, :) - OldMean(m, :)) .* (data(i, :) - OldMean(m, :)) .* covTemp;
            poten = sum(poten);
            LogProXiL(m, i) = -poten - detTemp - log_normal;
        end
    end

    % Calculate the likelihood of each data point.
    % The likelihood of each data point is stored in a row vector;
    % LikeHoodXi is a 1 x n matrix.
    % The likelihood can not be zero, or the priors will blow up.
    % Use the log-likelihood to avoid zeros.
    LogLikeHoodXi = Logx(LogProXiL, OldPrior);
    % End of calculation of the likelihood of each data point

    % Calculate the probability of omega given Xi (ProLXi).
    % ProLXi is an L x N matrix.
    LogProLXi = LogProXiL + log(repmat(OldPrior', 1, ndata)) - LogLikeHoodXi(ones(ncluster, 1), :);

    LoopCount = LoopCount + 1;

    % Log version of the likelihood of the whole data set, to judge convergence
    LogLikeHoodX = sum(L
or the address line: 10 bits.
% Changed to 30 bits to meet the compiler's requirement.
q_addr = quantizer('ufixed', 'floor', 'wrap', [30 0]);

% These lookup table contents are written to txt files and loaded:
lut_exp_content = quantize(q_lut_exp, load('lut_exp_content.dat'));            % range (0, -4), step 2^-8
lut_log_content = quantize(q_lut_log, load('lut_log_content.dat'));            % range (0, 0.5), step 2^-11
lut_inv_content = quantize(q_lut_inv, load('lut_inv_content.dat'));            % step 0.5

covars_inv_content = quantize(q_lut_cov_inv, load('lut_covars_inv_content.dat'));       % range (0, 0.25), step 2^-12
lut_covars_log_content = quantize(q_lut_cov_log, load('lut_covars_log_content.dat'));   % range (0, 0.25), step 2^-12
lut_exp_ProLXi = quantize(q_lut_exp, load('lut_exp_ProLXi.dat'));
lut_exp_ProLXi = quantize(q_lut_exp, lut_exp_ProLXi);
lut_covars_inv_tile = quantize(q_lut_cov_inv, lut_covars_inv_tile);
lut_covars_log_tile = quantize(q_lut_cov_log, lut_covars_log_tile);

% Load the input data
indata = load('q_indata.dat');
indata = quantize(q_data, indata);
indata_1 = indata;
indata_2 = indata;
indata_3 = indata;
indata_4 = indata;
indata_5 = indata;

break_flag = 0;

% Load the initial values of the model
centers = quantize(q_data, zeros(160, 5));
centers = quantize(q_data, load('centers.dat'));
centers_1 = quantize(q_data, load('centers_1.dat'));
centers_2 = quantize(q_data, load('centers_2.dat'));
centers_3 = quantize(q_data, load('centers_3.d
orithms for the FPGA structures. We used a recently developed tool, namely AccelFPGA [34], to shorten the design time. AccelFPGA can compile Matlab code directly into VHDL code. By using AccelFPGA, we could focus on optimizing the algorithm in Matlab without dealing with the hardware details inside the FPGA. The design cycle of an FPGA implementation using this method can be reduced from months to weeks. An FPGA implementation produced through AccelFPGA is not as optimized as direct VHDL coding. However, for a prototype design, AccelFPGA can provide a time-efficient FPGA solution with reasonably good performance. We will show that our implementation on an FPGA gives much better performance than either the PC or the parallel DSP platform.

The structure of Chapter 5 is as follows. In the next section, the internal structure of the Xilinx Virtex II 3000 FPGA is introduced. Then we introduce the AccelFPGA compiler for FPGAs. In Section 5.4, we address some practical problems of FPGA implementations. A fixed-point version of the EM algorithm is presented and compared to the floating-point version. The transformation of the mathematical operations and the parallelization of the algorithm for the FPGA implementation are also discussed. In the last section, the performance of the EM algorithm on the FPGA platform is presented and compared to the parallel DSP platform and PCs.

The FPGA and Its Structure

Field Programmable Gate Arrays

A Field Programmable Gate
our FPGA implementation can be found in Table 5.5.

The Fixed Point Representation of the EM Algorithm

To use AccelFPGA for the hardware implementation, the first step is to convert the algorithm from a floating-point into a fixed-point representation. The main difference between these two representations is the limited precision. For the EM implementation, the question becomes whether the algorithm will still converge to the right result as it does in the floating-point domain. We used the quantizer function in Matlab to convert the algorithm into a fixed-point version and tested the whole algorithm to make sure it gave the correct results.

The most important part of the fixed-point implementation is to determine the word length and binary point of each variable in the algorithm. AccelFPGA can support words up to 50 bits wide. Although a long word length can provide better precision, more bits will be needed to store and process each variable; thus more resources on the FPGA will be required and the system performance will be reduced. To maximize the system performance, we should use as short a word length as possible while still maintaining enough precision for the algorithm. We chose to use 18 bits for normal variables and 30 bits for accumulation results. Variables with an 18-bit word length can be fitted into the BlockRAM blocks and are easy to feed into the embedded multipliers in the FPGA. Since the data dimension was high for our algorithm, it was
p the iterative process.

3. M step: Estimate new model parameters Θ(i+1) using the current model parameters Θ(i) via the estimation equations. Replace the current model parameters with the new model parameters, which become the parameters for the next E step.

Figure 2.5 Diagram of the EM algorithm for GMM parameter estimation.

Since the new model parameters obtained from the estimation equations guarantee a monotonically increasing likelihood function, the algorithm is guaranteed to converge to a stationary point of the likelihood function. It has been argued that the convergence of the EM algorithm to the optimal parameters of the mixture is slow if the component distributions overlap considerably [25]. But if the clusters are reasonably well separated, which is typically the case in spike sorting, the convergence of the likelihood of the mixture model is rapid and the mixture density approximates the true density.

CHAPTER 3

IMPLEMENTATION IN MATLAB

Introduction

We have shown how to use the EM algorithm to estimate the parameters of a Gaussian Mixture Model. In this chapter, we will find that a number of practical issues have to be examined closely before we can achieve robust parameter estimates. Moreover, since we will u
rds, the likelihood will increase as the model number becomes larger. A number of methods have been proposed to determine the model number using maximum likelihood criteria by adding a penalty function that compensates for the monotonic increase of the log likelihood with the model number [24]. However, it is hard to define a suitable penalty term, and such methods are not very efficient to implement.

To determine the model number in this paper, we use the method presented in [12]. We calculate the likelihood of the data set for models ranging in size from 1 to 7 clusters. As shown in Figure 3.2, the initial increment of the likelihood is large but then becomes small. This is because as the number of clusters approaches the true number in the model, the increment of the likelihood becomes small. So we picked the point where the increase in likelihood fell below a certain threshold δ for 2 consecutive points. Eq. 3.1 shows the actual calculation. The threshold δ was set to 4:

\log L(t+2) - \log L(t+1) < \delta \;\text{ and }\; \log L(t+1) - \log L(t) < \delta        Eq. 3.1

From Figure 3.2, we can see that 5 models is a reasonable choice for the model number. Also, keep in mind that clusters can always be merged or split using more sophisticated methods [12, 28].

Figure 3.2 The increase in likelihood due to the increase of cluster numbers. The amount of the log likelihood increase is shown in the bar plot. The dashed line shows the threshold value.
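As a sketch of how this stopping rule can be coded (our illustration, not thesis code; the array layout and the fallback behavior are assumptions), the following C fragment returns the first cluster count after which two consecutive likelihood increments stay below the threshold:

    #define MAXM   7     /* largest model size tried (1..7 clusters) */
    #define DELTA  4.0   /* threshold delta from Eq. 3.1 */

    /* logL[m-1] holds the training log-likelihood of the m-cluster model. */
    static int select_model_number(const double logL[MAXM])
    {
        int t;
        for (t = 0; t + 2 < MAXM; t++) {
            /* two consecutive likelihood increments below the threshold */
            if (logL[t + 1] - logL[t] < DELTA &&
                logL[t + 2] - logL[t + 1] < DELTA)
                return t + 1;   /* cluster count at which growth levels off */
        }
        return MAXM;   /* no knee found: fall back to the largest model */
    }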
reproduction of this thesis in whole or in parts may be granted only by the copyright holder.

Signature: ______________    Date: ______________

TABLE OF CONTENTS

1. INTRODUCTION .......... 1
   What is Neural Spike Sorting? .......... 1
   Spike Sorting System .......... 3
   Gaussian Mixture Model .......... 6
   Overview of the Thesis .......... 7

2. THE GAUSSIAN MIXTURE MODEL AND THE EM ALGORITHM .......... 9
   Introduction .......... 9
   Mixture Model for Spike Sorting .......... 9
   The Gaussian Mixture Model .......... 9
   GMMs for spike sorting .......... 11
   Spike sorting using the GMM .......... 13
   The advantage and limitation of the GMM spike sorting system .......... 15
   Maximum Likelihood and the Expectation Maximization Algorithm .......... 16
   Maximum Likelihood Estimation .......... 17
   The Expectation Maximization (EM) algorithm .......... 18
   Expectation Maximization for the Gaussian Mixture Model .......... 20
   Implementation of the EM algorithm .......... 23

3. IMPLEMENTATION IN MATLAB ..........
rtex II XC2V3000 FPGA as our system target. The Virtex II XC2V3000 FPGA has 96 18-bit two's-complement multipliers and 96 block memories. By using these embedded hardware resources, we efficiently parallelized the EM algorithm in the FPGA. The FPGA implementation was 16.4 times faster than a Pentium IV 3.2 GHz PC and 42.5 times faster than the 4-DSP implementation.

The work of this thesis shows the significant promise of using FPGAs for high-speed, computationally intensive DSP applications. By using FPGAs, a significant performance improvement was achieved over the PCs and DSP processors without sacrificing system flexibility. It appears that, with intensive I/O and parallel computational ability, FPGAs will become more widely used in the DSP world.

Future work

One direction for future work could be to improve the precision of the fixed-point implementation using a larger word length and different binary point positions. However, a larger word length will decrease the system performance of the FPGA, since more hardware resources will be needed. The tradeoff between word length and system speed has to be evaluated in order to find the optimal balance, which will depend on the implementation constraints.

The FPGA implementation in this thesis was translated directly from Matlab code using the AccelFPGA compiler. At this stage, the implementation could be improved by directly hand-coding the algorithm in VHDL and even manually rout
s system from multiple unit records. Journal of Neuroscience Methods, 8, 391-397.

14. Recce, M. L., & O'Keefe, J. 1989. The tetrode: An improved technique for multi-unit extracellular recording. Society for Neuroscience Abstracts, 15, 1250.

15. Harris, K. D., Henze, D. A., Csicsvari, J., Hirase, H., & Buzsaki, G. 2000. Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. J Neurophysiol, 84.

16. Lewicki, M. S. 1998. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems.

17. Yang, M. H., & Ahuja, N. 1999. Gaussian Mixture Model for human skin color and its application in image and video databases. Proc. of the SPIE, 3635.

18. Reynolds, D. A. 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17.

19. Salganicoff, M., & Sarna, M. 1988. Unsupervised waveform classification for multi-neuron recordings: a real-time, software-based system. J Neurosci Methods, 25, 181-187.

20. Zouridakis, G., & Tam, D. C. 2000. Identification of reliable spike templates in multi-unit extracellular recording using fuzzy clustering. Comput Methods Programs Biomed, 61, 91-98.

21. Ohberg, F., & Johansson, H. 1996. A neural network approach to real-time spike
s_inv(:, m_14) TO ram_s18_s18(m_14+45) AT 0
    %!ACCEL MEM_MAP lut_covars_log_tile(:, m_14) TO ram_s18_s18(m_14+50) AT 0
    for m_14 = quantize(q_int, 1:ncluster)
        covars_addr = accel_bitshl(covars(d_7, m_14), 12);   % form the LUT address
        covars_inv(d_7, m_14) = quantize(q_lut_cov_inv, lut_covars_inv_tile(covars_addr + 1, m_14));
        covars_log(d_7, m_14) = quantize(q_lut_cov_log, lut_covars_log_tile(covars_addr + 1, m_14));
    end
end

% Sum of covars_log to get det_covars:
%   covars_det = fun_covars_det(covars_log)
%!ACCEL SHAPE covars_det(5)
%!ACCEL TILE m_16
for m_16 = quantize(q_int, 1:ncluster)
    covars_det(m_16) = quantize(q1, const_zero);
end

for d_8 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_15
    for m_15 = quantize(q_int, 1:ncluster)
        covars_det(m_15) = quantize(q1, covars_det(m_15) + covars_log(d_8, m_15) * 0.5);
        mid_covars(d_8, m_15) = quantize(q_mid_covars, const_zero);
    end
end

id = quantize(q_int, 1);
end

% This part tricks AccelChip into correctly initializing the centers block memory
id_1 = quantize(q_data, centers_1(1));
id_2 = quantize(q_data, centers_2(1));
id_3 = quantize(q_data, centers_3(1));
id_4 = quantize(q_data, centers_4(1));
id_5 = quantize(q_data, centers_5(1));

outdatabuf_1 = quantize(q1, covars_det(1));
outdatabuf_2 = quantize(q1, covars_det(2));
outdatabuf_3 = quantize(q1, covars_det(3));
outdatabuf_4 = quantize(q1, covars_det(4));
outdatabuf_5 = quantize(q1, covars_det(5));
outdatabuf_6 = quant
s of data, which assigns data to each cluster. This process is extremely time consuming and is affected by human bias and inconsistencies.

The development of microtechnology enables multi-tip electrode recording, such as stereotrode [13] or tetrode [14] recording. By adding spatial information about the action potentials, multi-electrode recording can improve the accuracy of spike sorting [15]. Obviously, the amount of data taken from multi-channel recording can be too overwhelming to deal with manually. To solve this problem, some type of automated clustering method is required. During the past three decades, various unsupervised classification methods have been applied to the spike sorting problem. The application of general unsupervised clustering methods, such as K-means, fuzzy C-means, and neural-network-based classification schemes, has achieved some success. Methods such as template matching and Independent Component Analysis have also been used. A complete review can be found in [16].

In most cases, the classification is done off-line. There are two reasons: (1) most clustering methods need user involvement; (2) most automated clustering methods are very computationally intensive, and general hardware, such as personal computers or workstations, can't meet the real-time requirement of the data streams. However, online spike sorting is needed for many applications of computational neuroscience [8]. Implementing a classification method on high-spee
se the full sampled waveforms as input, the high-dimensional training data will cause numeric problems, such as underflow, due to limitations in precision. We will present a revised version of the EM algorithm to solve these problems, which we implement in software and hardware.

The chapter is organized as follows. First, the training data used in this paper is introduced. The practical problems related to the EM algorithm and its application to the GMM spike sorting system are discussed in the next section. These problems include initialization of the GMM, cluster number selection, and algorithm evaluation. In the following section, the numeric underflow problem caused by high-dimensional input data is addressed. We compare three different approaches to handle this problem and present a log version of the EM algorithm with diagonal covariance matrices. This new EM algorithm gave the best result among the three methods. At the end of this chapter, we show the resulting model trained by this log EM algorithm and its performance on a PC.

Neural Signal Data from a Cricket

The signals we use in this thesis were recorded from the cricket cercal sensory system [27]. Data was recorded using an NPI SEC-05L amplifier and sampled at a 50 kHz rate with a digital acquisition system running on a Windows 2000 platform. The stimuli for these experiments were produced by a specially designed device which generated laminar air currents across t
set;
    int New_extsrc;
    int DSP_offset;

    /* Check for valid DMA channel */
    if ( (dma_channel < 10) || (dma_channel > 13) )
        return( FALSE );

    chan_offset = (dma_channel - 10) << 3;

    /* Prepare DMA for slave DSP */
    if ( (int) extsrc >= (int) 0x0200000 )
    {
        DSP_offset = (int) extsrc & 0x0F00000;
        New_extsrc = (int) extsrc & 0x00FFFFF;

        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
        SetIOP( (DSP_offset + II10) + chan_offset, (int) New_extsrc );
        SetIOP( (DSP_offset + IM10) + chan_offset, 1 );
        SetIOP( (DSP_offset + C10) + chan_offset, num_to_copy );
        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_SLA_MDO ); /* Enable: write to extmem */
    }

    /* Prepare DMA from external memory to EPB0, via DMAC10 */
sion of the EM algorithm was implemented. Since the training process of the EM algorithm is very computationally intensive and runs slowly on personal computers (PCs) and even on parallel DSPs, we mapped the EM algorithm to a Field Programmable Gate Array (FPGA). It trained the models without significant degradation of the model integrity when using 18-bit and 30-bit fixed-point data. The FPGA implementation, using a Xilinx Virtex II 3000, was 16.4 times faster than a 3.2 GHz Pentium 4, and 42.5 times faster than a parallel floating-point DSP implementation using 4 SHARC 21160 DSPs.

CHAPTER 1

INTRODUCTION

What is Neural Spike Sorting?

Most neurons communicate with each other by means of short local perturbations in the electrical potential across the cell membrane, called action potentials or spikes [1]. By using extracellular glass pipettes, single etched (sharp) electrodes, or multiple-site probes, scientists have been able to record the electrical activity of neurons as they transmit and process information in the nervous system. It is widely accepted that information is coded in the firing frequency or firing time of action potentials [2, 3]. It is further believed that information is also coded in the joint activities of large neural ensembles [4]. Therefore, to understand how a neural system transmits and processes information, it is critical to simultaneously record from a population of neurons as well as to efficie
st resort to more elaborate techniques.

The Expectation Maximization (EM) algorithm

The maximum likelihood GMM parameter estimates can be found via an iterative parameter estimation procedure called the Expectation-Maximization (EM) algorithm [25]. The EM algorithm is a general method for finding the maximum likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. There are two main applications of the EM algorithm. The first occurs when the data indeed has missing values. The second occurs when optimizing the likelihood function is analytically intractable. The latter application applies to the GMM case, where an analytical solution for the likelihood equation can not be found.

The idea behind the second application of the EM algorithm is to simplify the likelihood function by assuming the existence of additional but hidden parameters. We assume that the data X is observed and is generated by some distribution. We also assume that there is another, hidden data set Y, which will be defined in section 2.3.3. We call Z = (X, Y) the complete data set. The density function for the complete data set is

p(z \mid \Theta) = p(X, Y \mid \Theta) = p(Y \mid X, \Theta)\, p(X \mid \Theta)        Eq. 2.7
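For the GMM case, a worked instance of this definition (our addition, using the notation of Eqs. 2.1 and 2.2) takes the hidden data Y to be the component labels y_i of the observations, so that the complete-data likelihood factors over the data points:

    \begin{align*}
    p(X, Y \mid \Theta) &= \prod_{i=1}^{N} p(x_i, y_i \mid \Theta)
                         = \prod_{i=1}^{N} \alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i}), \\
    \log p(X, Y \mid \Theta) &= \sum_{i=1}^{N} \Bigl( \log \alpha_{y_i}
                         + \log p_{y_i}(x_i \mid \theta_{y_i}) \Bigr).
    \end{align*}

The product over mixture components that makes the ordinary likelihood intractable is replaced by a simple sum of logarithms once the labels are treated as known.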
start_count;
    int stop_count;

    float proTemp = 0;
    float prepro = 0;
    float logproXiLTemp;
    float maxTemp;
    float lhdata_master;
    float lhdatanew;
    float lhdataold;
    float thres;
    float logproLXiTemp[ndata] = {0};
    float maxTempProLXi[nclusters] = {0};

    offset = 0;
    loopCount = 0;
    prepro = 0;
    testlog2 = logf( 2.71 );
    testlog = nclusters;
    lhdataold = -100;

    global = (int *) GLOBAL;
    priors_start = (int *) priorsTemp;
    candc_start = (int *) candcTemp;

    SetIOP( SEM_GLOBAL, 0 );
    *global = STOP;

    /*------------------------------------------------
      Initialize the covars and priors
      ------------------------------------------------*/
    initGmm( priors, centers, covars );

    /* Start timer */
    timer_set( 0xFFFFFFFF, 0xFFFFFFFE );
    start_count = timer_on();    /* enable timer */

    for ( i = 0; i < 100000; i++ )
        ;

    /*------------------------------------------------
      Get the start address of blockdata on the slave DSPs
      ------------------------------------------------*/
    while ( (GetIOP( SEM_LOCAL_1 ) == 0) || (GetIOP( SEM_LOCAL_2 ) == 0) || (GetIOP( SEM_LOCAL_3 ) == 0) )
        ;

    blockdata_slave[0] = (int *) GetIOP( SEM_LOCAL_1 );
    blockdata_slave[1] = (int *) GetIOP( SEM_LOCAL_2 );
    blockdata_slave[2] = (int *) GetIOP( SEM_LOCAL_3 );

    SetIOP( SEM_LOCAL_1, 1 );
    SetIOP( SEM_LOCAL_2, 1 );
    SetIOP( SEM_LOCAL_3, 1 );

    /*------------------------------------------------
stem. The log probability of a spike being in a cluster is computed for each individual cluster, and the highest-scoring cluster identifies the neuron.

Figure 2.4 shows a block diagram of the Bayesian decision rule applied to spike sorting using a GMM. The input to the system is a vector representing an action potential. The probability that the action potential belongs to each class is calculated using Eq. 2.3 over all the classes. The vector is assigned to a class whenever it falls into that class with a high confidence level. If the waveform doesn't belong to any cluster with high probability, the vector is considered noise and thrown away. It is also possible that the waveform results from overlapping spikes, and it could be stored for further investigation.

The advantage and limitation of the GMM spike sorting system

The GMM is a statistical approach to the spike sorting problem. It defines a probabilistic model of the spike waveforms. A big advantage of this framework is that it is possible to quantify the certainty of the classification. By using Bayesian probability theory, the probability of both the spike form and the number of spike shapes can be quantified. This is useful in spike sorting because it allows experimenters to make decisions about the isolation of spikes. Also, by monitoring the classification probabilities, experimenters will be able to detect possible changes in experimental conditions. For e
sum( blocksums, blockdata_diff, logproLXi_block_1[i], offset );
        vecvadd( logcovars_block_1[i], blocksums, logcovars_block_1[i], ndim );
    }

    SetIOP( SEM_LOCAL, 1 );

    stop_count = timer_off();
    cycle_count = ((float) start_count - (float) stop_count) / 80000000;   /* divide by 80 MHz */
}

/*------------------------------------------------
  The library included in both C codes for the master
  and slave processors: lib.c
  ------------------------------------------------*/
#include <stdio.h>
#include <matrix.h>
#include <math.h>
#include <stdlib.h>
#include <vector.h>
#include <def21160.h>
#include <21060.h>
#include <signal.h>
#include "speeddsp.h"
#include "diag_DMA_master_2.h"

/* Global variables */
int dmac_ext_to_int = DMAC_MDI;
int dmac_int_to_ext = DMAC_MDO;

void initGmm( float priors[1][nclusters], float centers[][ndim], float covars[][ndim] )
{
    int i = 0;
    int j = 0;
    float a = 0;

    for ( i = 0; i < nclusters; i++ )
        for ( j = 0; j < ndim; j++ )
            covars[i][j] = 1;

    for ( i = 0; i < nclusters; i++ )
        priors[0][i] = 0;

    a = 1.0 / nclusters;
    vecsadd( priors[0], a, priors[0], nclusters );   /* just for testing */
}

float getproXi( float centers[ndim], float blockdata[ndim], float covars[ndim] )
{
    int i = 0, j = 0;
    float *meanTemp;
    float *invcovTemp = 0;
    float *detcovTemp;
    float proXiTemp = 0;
    float divider = 0;
    float log_normal = 0;
    float proXi;

    /* Normalization constant */
    log_normal = logf( 2 * 3.1416 ) * ndim / 2;

    invcovTemp = (float *) ca
t);   % 2^2 .. 2^-12
    end
end

for d_10 = quantize(q_int, 1:ndim)
    if covars(d_10, 3) > upper_const
        covars(d_10, 3) = quantize(q_covars, upper_const);   % 2^2 .. 2^-12
    end
end

for d_11 = quantize(q_int, 1:ndim)
    if covars(d_11, 4) > upper_const
        covars(d_11, 4) = quantize(q_covars, upper_const);   % 2^2 .. 2^-12
    end
end

for d_12 = quantize(q_int, 1:ndim)
    if covars(d_12, 5) > upper_const
        covars(d_12, 5) = quantize(q_covars, upper_const);   % 2^2 .. 2^-12
    end
end

for d_6_1 = quantize(q_int, 1:ndim)
    if covars(d_6_1, 1) < const_zero
        covars(d_6_1, 1) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_9_1 = quantize(q_int, 1:ndim)
    if covars(d_9_1, 2) < const_zero
        covars(d_9_1, 2) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_10_1 = quantize(q_int, 1:ndim)
    if covars(d_10_1, 3) < const_zero
        covars(d_10_1, 3) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_11_1 = quantize(q_int, 1:ndim)
    if covars(d_11_1, 4) < const_zero
        covars(d_11_1, 4) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_12_1 = quantize(q_int, 1:ndim)
    if covars(d_12_1, 5) < const_zero
        covars(d_12_1, 5) = quantize(q_covars, const_zero);   % 2^2 .. 2^-12
    end
end

for d_7 = quantize(q_int, 1:ndim)
    %!ACCEL TILE m_14
    %!ACCEL MEM_MAP covars_log(:, m_14) TO ram_s18_s18(m_14+35) AT 0
    %!ACCEL MEM_MAP covars_inv(:, m_14) TO ram_s18_s18(m_14+40) AT 0
    %!ACCEL MEM_MAP lut_covar
t >= (int) 0x0200000 )
    {
        DSP_offset = (int) extdest & 0x0F00000;
        New_intdest = (int) extdest & 0x00FFFFF;

        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
        SetIOP( (DSP_offset + II10) + chan_offset, (int) New_intdest );
        SetIOP( (DSP_offset + IM10) + chan_offset, 1 );
        SetIOP( (DSP_offset + C10) + chan_offset, num_to_copy );
        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_SLA_MDI ); /* Enable: write to extmem */
    }

    /* Prepare DMA from external memory to EPB0, via DMAC10 */
    SetIOP( DMAC10 + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
    SetIOP( II10 + chan_offset, (int) intsrc );
    SetIOP( IM10 + chan_offset, 1 );
    SetIOP( C10 + chan_offset, num_to_copy );
    SetIOP( EI10 + chan_offset, (int) (DSP_offset + EPB0 + chan_offset) );
    SetIOP( EM10 + chan_offset, 0 );
    SetIOP( EC10 + chan_offset, num_to_copy );
    SetIOP( DMAC10 + (dma_channel - 10), DMAC_MST_MDO );

    /* If we need to wait for completion */
    if ( wait )
    {
        int dma_status_bit = 1L << dma_channel;

        /* Wait until DMA completed */
        while ( GetIOP( DMASTAT ) & dma_status_bit )
            ;
        /* while ( GetIOP( DSP_offset + DMASTAT ) & dma_status_bit ) ; */

        /* Disable DMA */
        SetIOP( DMAC10 + (dma_channel - 10), DMAC_FLUSH );
        SetIOP( DSP_offset + DMAC10 + (dma_channel - 10), DMAC_FLUSH );
    }

    return( 0 );
}

int extmem_to_intmem_s2m( int *extsrc, int *intdest, int num_to_copy, int dma_channel, int wait )
{
    int chan_off
t version; (b) is the output of the FPGA implementation.

Table 5.8 The difference between the floating-point output and the FPGA output.

  GMM parameters          Error between floating-point and FPGA output
  mean vectors            0.0170
  covariance matrices     0.0019
  priors                  0.0147

We ran the same training data through the two models to see the classification results. 18 out of 720 data points were misclassified using the FPGA model when compared to the floating-point implementation in Matlab; the classification rate is 97.5%. The misclassifications are shown in the confusion matrix in Table 5.9.

Table 5.9 Confusion matrix of the fixed-point spike sorting result for 720 neural spikes from 5 neurons. The overall correct rate is 97.5% compared to the floating-point result.

               Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
  Cluster 1    252        0          0          2          0
  Cluster 2    0          182        0          0          0
  Cluster 3    0          3          133        1          0
  Cluster 4    12         0          0          105        0
  Cluster 5    0          0          0          0          30

The confusion matrix shows that most misclassifications happened between cluster 4 and cluster 1 (14 misclassifications). Figure 5.10 shows that clusters 1 and 4 are heavily overlapped and probably represent action potentials from the same neuron. The same situation exists for clusters 2 and 3, which have 3 misclassifications. Overall, the small number of misclassifications happened between very close clusters, which would not greatly affect the final spike sorting results.

(a) F
ta likelihood. Note that since Y is a hidden variable that is unknown and presumably governed by an underlying distribution, the likelihood function is also a random variable. The EM algorithm first finds the expected value of the complete-data log-likelihood with respect to the unknown data Y, given the observed data X and the current parameter estimates. That is, we define

Q(\Theta, \Theta_g) = E\bigl[ \log p(X, Y \mid \Theta) \mid X, \Theta_g \bigr]        Eq. 2.9

where Q(Θ, Θ_g) is the expected value of the log-likelihood with respect to the unknown data Y, Θ_g are the current parameter estimates, and Θ are the new parameters that we optimize to increase Q. Note that in this equation X and Θ_g are constants, and Θ is a normal variable that we wish to adjust. The evaluation of this expectation is called the E step of the EM algorithm. The second step, the M step, maximizes the expectation computed in the first step. That is, we find

\Theta^{*} = \arg\max_{\Theta} Q(\Theta, \Theta_g)        Eq. 2.10

These two steps are repeated as often as necessary. The maximum likelihood parameter estimates have some desirable properties, such as asymptotic consistency and efficiency [26]. This means that, given a large enough sample of training data, the model estimates will converge to the true model parameters with probability one. Each step of the EM iteration is guaranteed to increase the log-likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [26].
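The whole procedure can be seen in miniature in the following self-contained C example (ours, not thesis code): a one-dimensional, two-component GMM fitted by EM on a made-up toy data set, with the E step computing the posteriors and log-likelihood, and the M step re-estimating the weights, means and variances.

    #include <math.h>
    #include <stdio.h>

    #define N 8     /* toy data set size  */
    #define M 2     /* mixture components */

    static double gauss(double x, double mu, double var)
    {
        return exp(-0.5 * (x - mu) * (x - mu) / var)
               / sqrt(2.0 * 3.141592653589793 * var);
    }

    int main(void)
    {
        /* Toy 1-D data drawn near two modes. */
        double x[N] = { -2.1, -1.9, -2.4, -1.7, 3.0, 3.3, 2.8, 3.1 };
        double alpha[M] = { 0.5, 0.5 }, mu[M] = { -1.0, 1.0 }, var[M] = { 1.0, 1.0 };
        double r[N][M];                 /* posteriors p(omega_l | x_i) */
        double prev = -1e300, loglike;
        int i, l, iter;

        for (iter = 0; iter < 100; iter++) {
            /* E step: posteriors and log-likelihood. */
            loglike = 0.0;
            for (i = 0; i < N; i++) {
                double px = 0.0;
                for (l = 0; l < M; l++)
                    px += (r[i][l] = alpha[l] * gauss(x[i], mu[l], var[l]));
                for (l = 0; l < M; l++)
                    r[i][l] /= px;
                loglike += log(px);
            }
            if (fabs(loglike - prev) < 1e-6)   /* convergence test */
                break;
            prev = loglike;

            /* M step: re-estimate weights, means, variances. */
            for (l = 0; l < M; l++) {
                double nl = 0.0, s = 0.0, ss = 0.0;
                for (i = 0; i < N; i++) nl += r[i][l];
                for (i = 0; i < N; i++) s += r[i][l] * x[i];
                mu[l] = s / nl;
                for (i = 0; i < N; i++) ss += r[i][l] * (x[i] - mu[l]) * (x[i] - mu[l]);
                var[l] = ss / nl + 1e-6;       /* floor keeps the variance nonsingular */
                alpha[l] = nl / N;
            }
        }
        for (l = 0; l < M; l++)
            printf("cluster %d: alpha=%.3f mu=%.3f var=%.3f\n", l, alpha[l], mu[l], var[l]);
        return 0;
    }

The variance floor plays the same role as the singularity check on the covariance matrices in the Matlab code of Appendix B.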
termined, the precision was found by dividing the dynamic range by the LUT size. Then we ran the training process of the EM algorithm to see whether a given precision caused the EM algorithm to converge or not.

If problems occurred and the algorithm didn't converge, we had to improve the input precision by increasing the LUT size. There are two ways to do this. One approach is to increase the LUT size by shortening the output bit width. With a shorter word width, a BlockRAM can hold more than 1024 elements in a single bank, but the decreased output precision may not meet the output precision requirement. Another way is to use more than one bank of block SelectRAM to form a single LUT. Theoretically, you can use as many BlockRAMs as are available to build a LUT. However, additional resources and control circuitry will be needed to linearly address all these memory blocks. These additional circuits will use more area of the FPGA and decrease the system performance. In our application, the 1024-entry LUT
ternal memory is in SLAVE: offset equals the upper portion of intdest,
       and intdest should be the lower portion */
    if ( (int) intdest >= (int) 0x0200000 )
    {
        DSP_offset = (int) intdest & 0x0F00000;
        New_intdest = (int) intdest & 0x00FFFFF;
    }

    /* Prepare DMA from external memory to EPB0, via DMAC10 */
    SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );   /* Flush */
    SetIOP( (DSP_offset + II10) + chan_offset, (int) New_intdest );
    SetIOP( (DSP_offset + IM10) + chan_offset, 1 );
    SetIOP( (DSP_offset + C10) + chan_offset, num_to_copy );
    SetIOP( (DSP_offset + EI10) + chan_offset, (int) extsrc );
    SetIOP( (DSP_offset + EM10) + chan_offset, 1 );
    SetIOP( (DSP_offset + EC10) + chan_offset, num_to_copy );
    SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_MDI );     /* Read from extmem */

    /* If we need to wait for completion */
    if ( wait )
    {
        int dma_status_bit = 1L << dma_channel;

        /* Wait until DMA completed */
        while ( GetIOP( (DSP_offset + DMASTAT) ) & dma_status_bit )
            ;

        /* Disable DMA */
        SetIOP( (DSP_offset + DMAC10) + (dma_channel - 10), DMAC_FLUSH );
    }

    return( TRUE );
}

int intmem_to_extmem_m2s( int *intsrc, int *extdest, int num_to_copy, int dma_channel, int wait )
{
    int chan_offset;
    int DSP_offset;
    int New_intdest;

    /* Check for valid DMA channel */
    if ( (dma_channel < 10) || (dma_channel > 13) )
        return( FALSE );

    chan_offset = (dma_channel - 10) << 3;

    /* Prepare DMA for slave DSP */
    if ( (int) extdest >
to form smooth approximations to arbitrarily shaped densities. Figure 2.1 shows a one-dimensional example of the GMM modeling capabilities. An arbitrary non-Gaussian distribution (shown as a solid line) is approximated by a sum of three individual Gaussian components (shown as dashed lines).

Figure 2.1 Non-Gaussian density estimation using a GMM.

The complete Gaussian mixture density is parameterized by the mean vectors μ_i, covariance matrices Σ_i and weights α_i of all the component densities, which are represented by the notation Θ = {α_i, μ_i, Σ_i}. The model is given by Eq. 2.1:

p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x \mid \theta_i)        Eq. 2.1

where p(x|Θ) is the probability of x given the model Θ, x is a D-dimensional random vector, p_i(x|θ_i), i = 1, ..., M, are the component densities, and α_i, i = 1, ..., M, are the mixture weights. Each component density is a D-variate Gaussian function given by Eq. 2.2:

p_i(x \mid \theta_i) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_i \rvert^{1/2}} \exp\!\Bigl( -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \Bigr)        Eq. 2.2

where μ_i is the mean vector and Σ_i is the covariance matrix of the individual Gaussian model i, i = 1, ..., M. The mixture weights α_i satisfy the constraint that \sum_{i=1}^{M} \alpha_i = 1 and α_i >= 0, which ensures that the mixture is a true probability density function. The weights α_i can be viewed as the probability of each component.

In spike sorting, the action potentials generated by each neuron are represented by each compone
tput data. While the input dynamic range is quite large (roughly 0 down to -10^4), the output dynamic range is limited to (0, 1). This means that quite a large span of the input space is mapped to a small range of the output space. This can be seen more clearly in Figure 5.5: the output of the exponential function decreases as the input decreases, and as the input approaches large negative values the output barely changes even though the input keeps decreasing. Obviously, the input precision for large negative values will not affect the output, since the output is almost zero. Knowing this, we can decrease the input dynamic range by setting a threshold, so that any input smaller than a particular value maps the output to zero.

Figure 5.4 Histogram of the input and output data range for the LUT. (a) The input. (b) The output.

Figure 5.5 Input and output curve of the exponential operation.

We chose the threshold value to be -4, so that the dynamic range of the input was (-4, 0). Choosing the LUT size to be 1024 means the input precision is (0 - (-4)) / 1024 = 2^-8. We tested this precision by making sure the EM algorithm converged. Table 5.3 shows the difference between the floating-point outputs and the LUT outputs. The error was about 1% for each variable, which had a small effect on
...VisualDSP++ 3.0 to develop the program for the Hammerhead PCI board. Most of the code was written in C. Small sections, such as the DMA transfer routines, were written in assembly language. All the C functions we used were optimized for the ADSP-21160's SIMD mode.

Figure 4.1 Block diagram of the Hammerhead PCI system

Analysis of the EM Algorithm

To parallelize the EM algorithm, we needed to find out where its computationally intensive parts were. These parts should be mapped onto multiple DSPs so that each DSP can share part of the computational load. By handling these parts simultaneously, we can speed up the whole system and enhance its performance.

We used the profile command in Matlab to analyze the EM algorithm. The profile function records the execution time of each line in the code as well as of each function call in a Matlab .m file. By comparing the execution times, we can determine which parts of the algorithm Matlab spent the most time on. The most computationally intensive parts should then be parallelized across multiple DSPs.

Figure 4.2 The most computationally intensive parts of the EM algorithm and their execution percentage of the whole process (vertical axis: % of the entire execution time): (a) calculation of p(i | x_t, \Theta); (b) update of the means and covariance matrices; (c) calculation of p(x_t | \Theta) and the log-likelihood; (d) the rest of the algorithm.
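As a hedged sketch of what such a mapping can look like, the fragment below block-partitions the N data points across the four DSPs; process_point and the partitioning scheme are hypothetical and do not reproduce the thesis's actual parallel transformation.

#define NUM_DSPS 4

/* Hypothetical per-point E-step routine; stands in for the real work
   of computing p(i | x_t, Theta) for one spike waveform. */
extern void process_point(int t);

/* Each DSP processes one contiguous block of the N data points, so the
   dominant per-point cost identified above is shared roughly evenly. */
void estep_block(int dsp_id, int N)
{
    int block = (N + NUM_DSPS - 1) / NUM_DSPS;   /* ceiling division */
    int start = dsp_id * block;
    int end   = (start + block < N) ? (start + block) : N;
    int t;

    for (t = start; t < end; t++)
        process_point(t);
    /* partial statistics would then be combined by the master DSP */
}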
Numeric Considerations in Matlab

The Log EM Algorithm

To preserve all the original information in the spike waveforms, we used the fully sampled spikes as the clustering input. The implementation of the training process of the EM algorithm must therefore handle the 160-dimensional inputs. However, if we look at the EM algorithm closely, we find many places where a number of small values (between 0 and 1) are multiplied together. This easily leads to a numerical zero in Matlab, which is defined to be 2.2251e-308 or less for double-precision data. The calculation of the likelihood of the whole data set and the determination of the covariance matrices are the two places where the numerical zero problem occurs most often. To compensate, we have to transform these operations into the log domain, so that the multiplications turn into additions and the underflow problem is prevented. Here we present a log version of the EM algorithm which solves this problem.

Given the current model, we calculate the probability of each data point in each cluster. The original equations are listed as follows:

    p_i(x_t \mid \theta_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_t - \mu_i)^T \Sigma_i^{-1} (x_t - \mu_i) \right)        (Eq. 3.2)

    p(x_t \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x_t \mid \theta_i)        (Eq. 3.3)

    p(X \mid \Theta) = \prod_{t=1}^{N} p(x_t \mid \Theta)        (Eq. 3.4)

where N is the number of data points. Notice that in Eq. 3.2 and Eq. 3.4 there are product operations that will cause underflow. Changing them into the log domain, the equations become

    \ln p_i(x_t \mid \theta_i) = -\tfrac{1}{2} (x_t - \mu_i)^T \Sigma_i^{-1} (x_t - \mu_i) - \tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \ln|\Sigma_i|        (Eq. 3.5)
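The log-domain equations continue beyond this excerpt; as a sketch of how the summation in Eq. 3.3 can then be evaluated safely, the standard log-sum-exp identity can be applied to the terms ln(alpha_i) + ln p_i(x_t | theta_i), as in the illustrative C fragment below (identifiers are not taken from the thesis code).

#include <math.h>

/* Compute ln p(x_t | Theta) from log_terms[i] = ln(alpha_i) + ln p_i(x_t | theta_i)
   using log-sum-exp: ln(sum exp(a_i)) = m + ln(sum exp(a_i - m)), with
   m = max a_i. Every exponent passed to exp() is <= 0, so underflow can
   never wipe out the dominant component. */
double log_mixture_likelihood(const double *log_terms, int M)
{
    double m = log_terms[0];
    double sum = 0.0;
    int i;

    for (i = 1; i < M; i++)
        if (log_terms[i] > m) m = log_terms[i];

    for (i = 0; i < M; i++)
        sum += exp(log_terms[i] - m);   /* each term is in (0, 1] */

    return m + log(sum);
}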
...neural network based unsupervised classification schemes [21]. With some modifications, the GMM-based approaches also show promise in solving the overlap and bursting problems of spike sorting [22].

The Expectation Maximization (EM) algorithm is normally used to estimate the parameters of a Gaussian Mixture Model. EM is an iterative algorithm which updates the mean vectors and covariance matrices of each cluster at each stage. The algorithm is very computationally intensive and runs slowly on PCs or workstations.

In this thesis, we present a log version of the EM algorithm which can handle high-dimensional inputs. We also implement this revised EM algorithm on three different platforms: the PC, parallel DSP, and FPGA platforms. By comparing the different implementations, we found that, in terms of speed, the FPGA implementation gives the best performance.

Overview of the Thesis

The thesis will mainly focus on the following three points: the Gaussian Mixture Model, the Expectation Maximization algorithm, and the hardware implementation of the EM algorithm. Chapter 2 introduces the Gaussian Mixture Model and how to use the EM algorithm to estimate the mixture parameters. In Chapter 3, a log version of the EM algorithm and its performance on a PC are presented. The details of the implementation of the EM algorithm on a parallel DSP system are described in Chapter 4. Chapter 5 describes the parallel implementation of the EM algorithm on an FPGA.
...example, a drop in probability may indicate electrode drift or a rise in noise levels.

In spite of the advantages of the GMM approach, there are several inherent limitations of this statistical model for spike sorting. The Gaussian mixture model makes the assumption that the noise process is Gaussian and invariant; neither assumption is necessarily true in spike sorting. For example, in the presence of burst firing and overlapping spikes, the noise may show larger variation at the tails. Another assumption made is that all neurons fire independently and all the noise is uncorrelated, yet it is likely that neurons in a local area have some correlation with each other. We could use a more complete statistical approach which includes the correlation between neurons to model the spike-generating process [24]; however, this is outside the scope of this thesis.

Maximum Likelihood and the Expectation Maximization Algorithm

This section describes an unsupervised maximum likelihood procedure for estimating the parameters of a Gaussian mixture density from a set of unlabeled observations. A maximum likelihood parameter estimation procedure is developed for the GMM. First, a brief discussion of maximum likelihood estimation and the use of the EM algorithm for GMM parameter estimation is given. This is followed by the derivation of the mixture weight, mean, and variance estimation equations. Finally, a brief discussion of the complete iterative...
...by using the Xilinx Virtex-II XC2V3000 FPGA. By implementing the training procedure on an FPGA, the time to train models to be used in spike sorting will be shortened, which will benefit neuroscience research by allowing more time to be spent on collecting data.

Several practical issues related to the EM algorithm implementation were discussed. To prevent underflow problems caused by multiplications, we converted part of the EM algorithm into the log domain. Without prior knowledge of how many neurons were generating action potentials, we estimated the model number of the GMM using the likelihood function. The most computationally intensive parts of the EM algorithm were identified and parallelized in both the multiple-DSP and FPGA implementations.

For the parallel DSP implementation, we used four ADSP-21160 32-bit floating-point DSPs. Bus contention among the four DSPs significantly decreased the speed of the parallel DSP system. To prevent this, we set up a semaphore to make usage of the cluster bus mutually exclusive, and used DMA mode to accelerate the data transfers among the DSPs. However, because of the low DSP clock speed, the parallel DSP implementation did not outperform the PC implementation.

For the FPGA implementation, a fixed-point version of the EM algorithm was developed and simulated in Matlab. To implement the exponential function, a block-memory-based lookup table was used to approximate the function. We used a Xilinx Virtex-II...