
Image Processing in Hardware - Department of Computer Engineering

1. Acknowledgement: This project could not have been completed without our project advisor, Kurt T. Rudahl, M.Sc. Every week he spent his precious time on project consultations and gave us good advice that helped us solve hard problems. We are very pleased to thank him here.

Many organizations and people besides our advisor also supported this project:
- Asst. Prof. Surapont Toomnark lent us experimental tools from the Bellab of KMUTT.
- APEX Instrument Co., Ltd. advised us on design techniques.
- Design Gateway Co., Ltd. lent us the True PCI board for the experiment free of charge.

This project also could not have reached the end without the encouragement and support of our parents and friends. The department staff helped us coordinate with teachers and other department members. Finally, we would like to thank Asst. Prof. Tiranee Achalakul, Ph.D., who allowed us to work in the CAST Laboratory, where this project was based. We have been warmly taken care of as if we were members of the lab, and we are very happy to be in the lab's environment.
2. Figure 2.8 Write operation timing diagram (phases: Send Address, Send Data, Write Falling Edge, Write Rising Edge). Source: True PCI User Manual rev. 1.2, Design Gateway Co., Ltd., page 15.

Write operation signals are shown in Figure 2.8, which shows the protocol for writing data 0x00007777 to address 0x0003. The chip-select and write signals go low 30 ns after the address and data-out signals change. The user design core must then read the data from the bus within 60 ns. This arrangement reduces the user's task of learning and implementing the PCI communication protocol in his or her own design.

2.3.2.2 Computer Side Section. This section has three components: the PCI Interface, the Driver, and the Dynamic Link Library.
- The PCI Interface provides physical communication using the PCI protocol. It is already present in an ordinary computer system and consists of the PCI port, the internal PCI bus, the PCI bus controller, and the memory controller.
- A driver is required by the True PCI card because pciif32 is a custom-designed IP core. The driver handles the low-level operations of communicating with the core. The manufacturer provides a True PCI driver, but it works only on Microsoft Windows (Win32) systems.
- The Dynamic Link Library (DLL) is a file containing functions that control the driver to perform certain operations. The True PCI package includes a DLL working properly with the Tru
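The write-cycle timing in Figure 2.8 can be checked with a small Python sketch. It assumes that the 60 ns capture window starts when the chip-select and write signals go low. The 30 ns clock period comes from the appendix simulations and is used only as an example. The function names are hypothetical.

```python
# Write-cycle timing from Figure 2.8 (True PCI local bus), in nanoseconds.
ADDR_TO_STROBE_NS = 30  # address/data change -> chip select and write go low
LATCH_WINDOW_NS = 60    # user core must capture the data within this window


def write_latch_ok(t_addr_change_ns: float, t_latch_ns: float) -> bool:
    """Return True if the user core latches the data inside the allowed window.

    Assumes the 60 ns window starts when the chip-select and write strobes fall.
    """
    t_strobe_low = t_addr_change_ns + ADDR_TO_STROBE_NS
    return t_strobe_low <= t_latch_ns <= t_strobe_low + LATCH_WINDOW_NS


def cycles_in_window(clock_period_ns: float) -> int:
    """Number of whole clock periods that fit in the 60 ns capture window."""
    return int(LATCH_WINDOW_NS // clock_period_ns)


print(write_latch_ok(0, 45))   # latched 15 ns after the strobes fall
print(write_latch_ok(0, 100))  # too late
print(cycles_in_window(30))    # 30 ns clock, as in the appendix simulations
```

With a 30 ns clock, the user core has only two clock periods to capture the data after the strobes fall.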
3. Figure A.4 Timing diagram of Memory Clear Operation (End). The end of the operation is shown in Figure A.4. After the last address, 0x1FFFF, is cleared, the operation ends with a high done signal at cursor A.

A.2 Process Controller

A.2.1 Clear Interrupt Operation. Figure A.5 Timing diagram of Clear Interrupt Operation. Figure A.5 shows a pciif32 write request that writes the data 0x00000002 to the interrupt register. The interrupt is cleared
4. 2. Permit changes by application programmers or researchers.

Disadvantages of using FPGAs in processing:
1. Digital hardware knowledge is required to maximize the efficiency of the designed system.
2. Trying new algorithms on the FPGA is less convenient than in a computer program, because of hardware limitations and the design process.
3. A bottleneck is created in transferring data between the host computer and the FPGA board.

2.1.4 Hardware Description Language (HDL). Basic digital circuits can be designed manually. As a circuit becomes larger and more complex, however, manual design becomes impractical or very time-consuming. Languages were therefore developed to describe the behavioral model of a circuit, so that a computer can synthesize a circuit with the desired behavior. These languages are called Hardware Description Languages. Once the behavior of the system has been described in an HDL, synthesis software analyzes the HDL files and synthesizes the circuit. Once the circuit has been generated, the HDL has fulfilled its synthesis role. HDLs are also useful for verifying the designed system. They are used in behavioral simulations before the circuit is mapped to the target technology, to verify the correctness of results and basic timing diagrams. The two most widely used HDLs are Verilog HDL and VHDL. They are a little different
5. Process Controller lowers the mi_en signal and sends an interrupt to the host computer at cursor B.

A.2.10 Read 1st, 2nd, or 3rd Moment into Result Register Operation. Figure A.21 Timing diagram of Read 1st Moment into Result Register Operation. To read the 1st-moment statistic value, the Read 1st Moment into Result Register instruction is written to the instruction register at cursor A in Figure A.21. At cursor B, Process Controller selects the value by changing mi_output_sel to 0x1, and mi_data_out immediately changes to the 1st-moment value. The host computer is then interrupted at cursor C.

A.2.11 Initialize Image Operation
6. The resulting processing times and images are obtained. The processing times show that processing time increases linearly with the number of pixels. The output statistics images show that sections of the input image with the same pattern appear in the result images with the same gray level. The 2nd-moment statistics images differ from the 1st- and 3rd-moment images because they bring out the contrast of the patterns in the input image.

Chapter 4 Designs. After research, study, and experiments, the system was designed to resolve the problems faced in software. The design has two goals:
1. Speed up image processing operations. GLCM statistics image generation was selected as an example.
2. Generalize image processing operations into independent building blocks, so that the system can easily be expanded or modified later.

All designs in this chapter are based on the hardware devices listed below:
1. Prototyping board: Design Gateway True PCI
2. SRAM: AMIC LP621024D
3. Bi-directional buffer: 150 Ω resistors

4.1 Top-level Design of the System. As discussed above, the system is divided into functional blocks, so it is composed of blocks and buses. It has two buses: the 17-bit address bus and the 8-bit data bus. The width of each bus is determined by the external SRAM.

4.1.1 Block Diagram
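A short calculation relates the SRAM to the two bus widths. This Python sketch assumes the LP621024D is organized as 128K words of 8 bits, which is the organization that a 17-bit address bus and an 8-bit data bus imply.

```python
def bus_widths(words: int, bits_per_word: int) -> tuple:
    """Return (address_bits, data_bits, last_address) for a words x bits memory."""
    address_bits = (words - 1).bit_length()  # bits needed to index every word
    return address_bits, bits_per_word, words - 1


addr_bits, data_bits, last = bus_widths(128 * 1024, 8)
print(addr_bits, data_bits, hex(last))  # 17 8 0x1ffff
```

The last address, 0x1FFFF, is the final location cleared by the Memory Clear operation in the appendix.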
7. Local bus ports of pciif32:
- lbwrb (1 bit, Output): Active-low write signal. When it goes low, there is a write request from the computer, and pciif32 sends out the data through lbdataout while the signal remains low.
- lbcsb (Output): Active-low chip select signal. When it goes low, the user IP core is selected to be active.
- lbint (Input): Interrupt signal from the user IP core, which is sent to the CPU as an internal interrupt.

The designer of pciif32 also defined a communication protocol between the user IP core and pciif32 through the timing diagrams below.

Figure 2.7 Read operation timing diagram (signals: lbaddr, lbdatain, lbdataout, lbcsb, lbwrb, lbint; phases: Send Address, Read Falling Edge, Read Rising Edge). Source: True PCI User Manual rev. 1.2, Design Gateway Co., Ltd., page 14.

Read operation signals are shown in Figure 2.7. The operation reads the data 0x00007777 from the address 0x0003. The chip-select and read signals go low 30 ns after the address signal changes. The data should then be placed on the lbdatain bus within 60 ns. pciif32 reads the data and sends it to the PCI bus master.
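This Python sketch is a minimal behavioral model of how a user IP core might respond to the local bus, to make the active-low conventions concrete. The class and method names are hypothetical. The read strobe is called lbrdb after the label in the timing figure. The model samples the signals once per call and does not model the 30 ns and 60 ns timing.

```python
from typing import Dict, Optional


class LocalBusSlave:
    """Behavioral model of a user IP core on the pciif32 local bus.

    Signal levels: 1 = high (inactive), 0 = low (active).
    """

    def __init__(self) -> None:
        self.registers: Dict[int, int] = {}

    def sample(self, lbaddr: int, lbcsb: int, lbwrb: int, lbrdb: int,
               lbdataout: int = 0) -> Optional[int]:
        """Evaluate one bus sample; return the value driven on lbdatain, if any."""
        if lbcsb:            # chip select high: core not selected
            return None
        if not lbwrb:        # write strobe low: capture lbdataout
            self.registers[lbaddr] = lbdataout & 0xFFFFFFFF
            return None
        if not lbrdb:        # read strobe low: drive the addressed register
            return self.registers.get(lbaddr, 0)
        return None


core = LocalBusSlave()
core.sample(0x0003, lbcsb=0, lbwrb=0, lbrdb=1, lbdataout=0x00007777)
print(hex(core.sample(0x0003, lbcsb=0, lbwrb=1, lbrdb=0)))  # 0x7777
```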
8. Figure A.22 Timing diagram of Initialize Image Operation. The image information is initialized by writing the corresponding instruction to the instruction register. Cursor A in Figure A.22 shows the instruction with a width argument of 0x0005 and a height argument of 0x000A. At cursor B, Process Controller accepts those values and sets img_width and img_height to them. An interrupt is then issued at cursor C.

A.2.12 Read from a Register Operation
9. Figure A.31 Timing diagram of Matrix Voter Operation. Figure A.31 shows the operation of Matrix Voter:
- At cursor A, the operation is enabled by raising mv_en.
- At cursor B, the value 0xAB read from memory (mc_data) is received and increased by one, to 0xAC.
- At cursor C, the new value is written into the memory.
- At cursor D, mv_done is high and the operation is done.

A.10 Matrix Integrator
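Matrix Voter's read-increment-write sequence from Figure A.31 can be sketched in Python as below. The 8-bit wrap-around on overflow is an assumption based on the 8-bit data bus. The text does not say what happens when a count exceeds 0xFF.

```python
def matrix_vote(memory: bytearray, address: int) -> int:
    """Increment the GLCM count stored at `address` and return the new value."""
    value = memory[address]        # read the current count (cursor B)
    value = (value + 1) & 0xFF     # add one vote; 8-bit wrap is an assumption
    memory[address] = value        # write the count back (cursor C)
    return value


glcm_memory = bytearray(16)
glcm_memory[5] = 0xAB
print(hex(matrix_vote(glcm_memory, 5)))  # 0xac
```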
10. chosen by moving its center to the next pixel. These processes continue until all statistic values are stored in the output buffer. Once the buffer is fully filled, a statistics image is created. A statistics image is generated for each k-th moment statistic of each type of GLCM. Thus, 3 moments × 4 directions × (half the square size) images need to be created.

In software, these processes are performed in order by the fetch-execute cycle of the computer system. Hardware can optimize this, because it can run mutually independent processes concurrently and does not spend extra time on the instruction-fetch part of the fetch-execute cycle.

2.2.5 Problem Issues. The main step in generating the four GLCMs is to examine every pixel and its corresponding pixel to fill the GLCMs. There are two possible ways to implement this algorithm:
1. Compute each matrix one after another.
2. Compute all four matrices in one loop.

Filling all four GLCMs in a single pass over the pixels reduces the number of steps for each square area. However, it consumes more memory and may suffer from poor locality of reference, which the fetch-execute cycle of the computer system relies on. As each algorithm has different trade-offs, both algorithms will be tested
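The two strategies can be compared side by side in a short Python sketch. Both produce identical matrices. They differ in memory-access pattern: approach 2 keeps four matrices live and jumps between them for every pixel, which is the locality concern above. The sketch counts each pixel pair once (a non-symmetric GLCM) and uses offsets with y pointing down. Both are assumptions, not details from the text.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]
Matrix = List[List[int]]

# (dx, dy) offsets with x to the right and y downward (assumed convention).
DIRECTIONS: Dict[str, Tuple[int, int]] = {
    "northeast": (1, -1),
    "east": (1, 0),
    "southeast": (1, 1),
    "south": (0, 1),
}


def empty(levels: int) -> Matrix:
    return [[0] * levels for _ in range(levels)]


def glcm(img: Image, levels: int, dx: int, dy: int) -> Matrix:
    """Count (reference, target) gray-level pairs for one offset."""
    h, w = len(img), len(img[0])
    m = empty(levels)
    for y in range(h):
        for x in range(w):
            tx, ty = x + dx, y + dy
            if 0 <= tx < w and 0 <= ty < h:
                m[img[y][x]][img[ty][tx]] += 1
    return m


def one_after_another(img: Image, levels: int, d: int = 1) -> Dict[str, Matrix]:
    """Approach 1: a full pass over the image for each direction."""
    return {name: glcm(img, levels, dx * d, dy * d)
            for name, (dx, dy) in DIRECTIONS.items()}


def all_in_one_loop(img: Image, levels: int, d: int = 1) -> Dict[str, Matrix]:
    """Approach 2: one pass, updating all four matrices for each pixel."""
    h, w = len(img), len(img[0])
    out = {name: empty(levels) for name in DIRECTIONS}
    for y in range(h):
        for x in range(w):
            ref = img[y][x]
            for name, (dx, dy) in DIRECTIONS.items():
                tx, ty = x + dx * d, y + dy * d
                if 0 <= tx < w and 0 <= ty < h:
                    out[name][ref][img[ty][tx]] += 1
    return out


image = [[0, 0, 1],
         [0, 1, 1],
         [1, 1, 1]]
assert one_after_another(image, 2) == all_in_one_loop(image, 2)
print(all_in_one_loop(image, 2)["east"])  # [[1, 2], [0, 3]]
```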
11. instructions are sent to the device by calling the reg write function with proper arguments: the 32-bit instruction data and the address to send it to. The sequence is:
1. pciif32 performs a write operation through the local bus, sending the instruction data to the specified address.
2. proc_ctrl receives the instruction and directs the other modules to execute it. If the instruction reads data, proc_ctrl prepares the data for the next call to the read function.
3. When the reg read function is called, pciif32 performs a read operation on the local bus at the specified address. Following the read protocol, proc_ctrl presents the data while cs_n and rd_n are low.

A flow chart of the top-level system operation is shown in Figure 4.3. (Flow chart: Is initDevice called? If yes, wait for a read or write operation. If a write operation is performed, do the instructed operation. If a read operation is performed, send the requested data to the local bus.) Figure 4.3 The top-level flow chart of the system.

4.2 Component Designs. The system is divided into modules for generalization. This section describes how each module is designed and how it operates.

4.2.1 Memory Con
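The flow chart in Figure 4.3 can be read as a small event loop. This Python sketch is only an illustration. The event tuples and the `execute` callback, which stands in for proc_ctrl driving the other modules, are hypothetical. No True PCI driver or DLL calls are modeled.

```python
from typing import Callable, List, Optional, Tuple

# ("write" or "read", local-bus address, data for writes)
BusEvent = Tuple[str, int, Optional[int]]


def run_top_level(init_device_called: bool,
                  events: List[BusEvent],
                  execute: Callable[[int, int], int]) -> List[int]:
    """Follow Figure 4.3: wait for initDevice, then serve writes and reads.

    A write runs the instructed operation and its result is prepared for the
    next read; a read returns the prepared data to the local bus.
    """
    if not init_device_called:
        return []                      # the device keeps waiting
    prepared = 0
    sent: List[int] = []
    for kind, address, data in events:
        if kind == "write":
            prepared = execute(address, data if data is not None else 0)
        elif kind == "read":
            sent.append(prepared)      # presented while cs_n and rd_n are low
        else:
            raise ValueError(f"unknown bus event: {kind}")
    return sent


events = [("write", 0x0001, 0x00000005), ("read", 0x0001, None)]
print(run_top_level(True, events, lambda addr, data: data * 2))  # [10]
```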
12. Figure A.8 Timing diagram of Read from Memory into Data Register Operation (Begin). Cursor A in Figure A.8 shows the beginning of the operation, a pciif32 write request. The request writes a Read from Memory into Data Register instruction to the instruction register with the address parameter 0x000A2. Process Controller performs a Memory Controller access request at cursor B. A read from memory is enabled at cursor C, when mc_en is high.
13. 3.3 V one. A bi-directional level shifter circuit is therefore needed in the system. 150 Ω resistors are used to solve this problem, because obtaining a 3.3 V to 5 V converter circuit is very difficult in Thailand. The circuit is shown in Figure 2.11. Figure 2.11 The bi-directional level shifter circuit.

Chapter 3 Experiments. The experiments in this phase find the fastest algorithm for each operation when it is performed by PC software. The processing time of each operation is recorded for comparison with the hardware processing time. The outputs of each operation are also kept, to verify the correctness of the outputs produced by the hardware.

3.1 Possible Algorithms for Gray-Level Co-occurrence Matrix

3.1.1 Introduction. The GLCM can be implemented in code in two ways:
1. Compute each matrix one after another (m01).
2. Compute all four matrices in one loop (m02).

Each has different advantages and disadvantages, as mentioned in the previous chapter. The main objective of this experiment is to find the best algorithm for the GLCM statistics image generation operation.

3.1.2 Objectives
1. To obtain the fastest algorithm for calculating GLCMs in four directions, i.e., northeast, east, southeast, and south.
2. To examine the effects of varying the arguments of the GLCM operation on processing time.
3. To examine processin
14. 2. Data transfer rate of up to 1000 Mb per second (125 MB per second).
3. Uses the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol.
4. Supports full-duplex communication.

Implementation Possibility: Transferring data through the network requires packaging the data into packets, and the packet headers take up part of the transmitted data. This reduces the transfer rate of the actual data. Even if the FPGA board can operate in the physical layer of the TCP/IP model, sending data from the computer to the board still requires special commands and encapsulation according to the protocols of the model, e.g., the IP header and the data link header. If the FPGA board provides only the physical layer and has no Gigabit Ethernet controller, the TCP/IP stack must be implemented separately as part of the FPGA design. The development board at KMUTT includes a soft Ethernet core, which may run at 1000 Mb per second or only 100 Mb per second. Implementation on the host computer side is very easy using standard sockets.

2.1.6.3 Connect via Universal Serial Bus: Universal Serial Bus (USB) is a standard connection between electronic devices, including computers. Its outstanding features are that it is portable and hot-pluggable.

Specifications:
1. Serial data transfer.
2. Supports three modes of data transfer: Low Speed mode with 1.5 Mb per second (about 187.5 KB per second) required 0 0 0 3 V Full Speed mo
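The cost of packaging can be estimated from standard frame sizes. This Python sketch makes the following assumptions:
- Every frame carries a full 1500-byte Ethernet payload.
- The IPv4 and TCP headers carry no options (20 bytes each).
- Each frame adds the usual wire overhead: preamble and start delimiter (8 bytes), Ethernet header (14 bytes), frame check sequence (4 bytes), and inter-frame gap (12 bytes).

It ignores acknowledgements and retransmissions.

```python
def goodput_bytes_per_s(line_rate_bps: float, mtu: int = 1500,
                        ip_header: int = 20, tcp_header: int = 20) -> float:
    """Application payload rate for back-to-back maximum-size TCP/IPv4 frames."""
    wire_overhead = 8 + 14 + 4 + 12      # preamble+SFD, MAC header, FCS, gap
    payload = mtu - ip_header - tcp_header
    efficiency = payload / (mtu + wire_overhead)
    return line_rate_bps / 8 * efficiency


raw = 1000e6 / 8                     # 125 MB/s on the wire
useful = goodput_bytes_per_s(1000e6)
print(f"{raw / 1e6:.1f} MB/s raw, {useful / 1e6:.1f} MB/s payload")
```

Even in this best case, roughly 5% of the line rate is lost to framing, before any protocol processing on the FPGA side.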
15. The other parts are performed on the host computer. The computing platform is designed to contain an FPGA and an external memory unit. The bottleneck in this system is the communication link between the platform and the host computer, so the fastest available connection was chosen: Peripheral Component Interconnect (PCI), with a data transfer rate of 133 MB per second.

In this project, GLCM statistics image generation is implemented in hardware. The FPGA computes the computationally intensive parts of this operation, which are separated into modules. Each module is functionally independent of the others, and its function can be applied to most other image processing operations as well.

The system is further generalized by giving it an architecture like a digital signal processor:
- a control module that directs the other modules to execute each received instruction;
- an internal bus system that interconnects the modules.

This architecture makes the system easier to modify and extend later. In an experiment with this system, GLCM statistics image generation was performed correctly, and its speed satisfies the timing constraint.
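The PCI figure follows from the bus width and the clock. This quick Python check assumes the conventional 32-bit, 33.33 MHz PCI bus with one transfer per clock during a burst. That is a theoretical peak that ignores arbitration and address phases.

```python
BUS_WIDTH_BYTES = 4           # 32-bit PCI
CLOCK_HZ = 100e6 / 3          # nominal 33.33 MHz PCI clock

peak_bytes_per_s = BUS_WIDTH_BYTES * CLOCK_HZ
print(f"{peak_bytes_per_s / 1e6:.0f} MB/s = {peak_bytes_per_s * 8 / 1e6:.0f} Mb/s")
```

The result is about 133 megabytes (not megabits) per second, slightly above the raw rate of Gigabit Ethernet.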
16. address specified by sb_wr_addr.

4.2.6.3 Operations. This module has two operations, write and read.
- Write: the operation begins when a negative edge is received on sb_wr_n. The data on sb_data_in is then stored in the buffer inside the module.
- Read: the data in the buffer corresponding to sb_rd_addr1 is presented at sb_data_out, and the same applies to sb_rd_addr2.

4.2.7 GLCM Builder

4.2.7.1 Description. GLCM Builder performs the GLCM operation in the direction specified by (dx, dy), a value that the user can define. It works in cooperation with Square Buffer, Address Decoder, and Matrix Voter. This module is named _ builder and abbreviated as gb.

4.2.7.2 Ports. Figure 4.15 Block structure showing ports of GLCM Builder (ports shown include dbg_gb_st[3:0], gb_addr[16:0], sb_rd_addr1[3:0], sb_rd_addr2[3:0], gb_done, mv_done, and mv_en). Figure 4.15 shows the ports of this module, which are described in Table 4.7. Table 4.7 Ports of GLCM Builder (columns: Port Name, Direction, Description, MSB, LSB).

4.2.7.3 Operation. Normally, GLCM Builder is in the Wait State. This state gets the value of the specified direction from the user (dx, dy)
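The Square Buffer's two operations can be mirrored in a small behavioral model. This Python sketch rests on stated assumptions:
- 16 entries, inferred from the 4-bit sb_rd_addr1 and sb_rd_addr2 ports in Figure 4.15.
- 8-bit data, matching the system data bus.
- A method-per-signal interface, which is hypothetical.

```python
from typing import List, Tuple


class SquareBuffer:
    """Behavioral model: write on a falling edge of sb_wr_n, two read ports."""

    def __init__(self, depth: int = 16) -> None:
        self.mem: List[int] = [0] * depth
        self.prev_wr_n = 1                 # sb_wr_n idles high

    def drive(self, sb_wr_n: int, sb_wr_addr: int, sb_data_in: int) -> None:
        """Apply new input levels; store data only on a 1 -> 0 edge of sb_wr_n."""
        if self.prev_wr_n == 1 and sb_wr_n == 0:
            self.mem[sb_wr_addr] = sb_data_in & 0xFF
        self.prev_wr_n = sb_wr_n

    def read(self, sb_rd_addr1: int, sb_rd_addr2: int) -> Tuple[int, int]:
        """Both read ports present their data at the same time."""
        return self.mem[sb_rd_addr1], self.mem[sb_rd_addr2]


sb = SquareBuffer()
sb.drive(0, 3, 0x5A)      # falling edge: stored
sb.drive(0, 4, 0x11)      # still low: no new edge, nothing stored
sb.drive(1, 4, 0x11)      # rising edge: ignored
print(sb.read(3, 4))      # (90, 0)
```

Two read ports let GLCM Builder fetch a reference pixel and its paired pixel in the same step.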
17. after the request at cursor B.

A.2.2 Write to Memory Operation. Figure A.6 Timing diagram of Write to Memory Operation (Begin). Cursor A in Figure A.6 shows the beginning of the operation, a pciif32 write request. The request writes a Write to Memory instruction to the instruction register with the address parameter 0x000 and the data parameter 0xD1. Process Controller performs a Memory Controller access request at cursor B. A write to memory is enabled at cursor C, when mc_en is high.
18. all pixels in the window are fetched, an internal enable-done handshake occurs, followed by a global enable-done handshake and an interrupt. Thus, it can be calculated as T T WINDOW SIZE T T 2 2T 1 req ram syn syn.

8. Calculate GLCM Time: After an instruction write request, a Clear Temporary Memory operation occurs (without its own request and interrupt), and then GLCM Builder is enabled. GLCM Builder then iterates through every pixel in the window. In each iteration it:
- checks the distance between an interesting pixel and a target pixel in one clock cycle;
- enables Matrix Voter, which does a request-grant handshake, reads from memory, does a sub enable-done handshake, writes back to the memory, and does another sub enable-done handshake.

When the vote is done, a handshake is made, and moving to the next pixel takes one clock cycle. After all pixels have been processed, an enable-done handshake is made and an interrupt ends the operation. These can be represented as T T T xGLCM SCALE LEVEL req ram ram 5 2xWINDOW _ SIZE T 2 1 27.

9. Digest into Statistics Values Time (T_dig): When the digest request is received, Process Controller enables Matrix Integrator. Matrix Integrator performs a request-grant handshake and starts the digesting process for every element in the GLCM. For each element, Matrix Integrator accesses the memory, does a sub enable-done han
19. and waits until gb_en is high. It then changes to Check State.
- Check State combines the current pixel (sb_rd_addr1) with the user-specified direction (gb_dx, gb_dy). If the resulting position is inside the square window, it gets the paired pixel (sb_rd_addr2) and changes to Vote State. Otherwise, it changes to Move State immediately.
- Vote State sets mv_en high and sends sb_rd_addr1 and sb_rd_addr2 to the Square Buffer module to get the pair of data values. The pair goes to the Address Decoder module, which returns the GLCM row and column address to vote at in the Matrix Voter module. When Matrix Voter finishes, it returns mv_done to GLCM Builder, which changes to Move State.
- Move State advances the current pixel. If the column equals (window size - 1) and the row does not, it increments the row by 1 and resets the column to 0. If the column does not equal (window size - 1), it increments the column by 1. In both cases it returns to Check State, because the current pixel must visit every position in the square window. If both the column and the row equal (window size - 1), it changes to Done State.
- Done State sets gb_done for Process Controller and waits until gb_en is low, then returns to Wait State.

Fi
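The state sequence above can be simulated in Python to check which pixel pairs get voted. The sketch makes these simplifications, which are assumptions rather than details from the text:
- gb_en is assumed to be high already.
- Matrix Voter's handshake is collapsed into a single increment.
- The offset (dx, dy) is added to (column, row).
- The final gb_done/gb_en handshake is not modeled.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    WAIT = auto()
    CHECK = auto()
    VOTE = auto()
    MOVE = auto()
    DONE = auto()


def run_glcm_builder(window: List[List[int]], dx: int, dy: int,
                     levels: int) -> List[List[int]]:
    """Step the Wait/Check/Vote/Move/Done sequence over one square window."""
    size = len(window)
    glcm = [[0] * levels for _ in range(levels)]
    row = col = 0
    target = (0, 0)
    state = State.WAIT
    while state is not State.DONE:
        if state is State.WAIT:            # gb_en assumed high: start at once
            state = State.CHECK
        elif state is State.CHECK:         # is the paired pixel inside the window?
            t_row, t_col = row + dy, col + dx
            if 0 <= t_row < size and 0 <= t_col < size:
                target = (t_row, t_col)
                state = State.VOTE
            else:
                state = State.MOVE
        elif state is State.VOTE:          # Matrix Voter adds one to GLCM[ref][tgt]
            ref = window[row][col]
            tgt = window[target[0]][target[1]]
            glcm[ref][tgt] += 1
            state = State.MOVE
        elif state is State.MOVE:
            if row == size - 1 and col == size - 1:
                state = State.DONE         # every position has been visited
            elif col == size - 1:
                row, col = row + 1, 0      # next row, first column
                state = State.CHECK
            else:
                col += 1                   # next column
                state = State.CHECK
    return glcm


print(run_glcm_builder([[0, 1], [1, 0]], dx=1, dy=0, levels=2))  # [[0, 1], [1, 0]]
```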
20. are 0s. All of the following instructions use lb_addr = 0x0001; in lb_data_out, every bit not listed is 0.
- Reset Square Window Position: reset the square processing window to position (0, 0) of the input image. lb_data_out: bit 29 is 1.
- Shift Square Window Position: shift the square processing window to the next position in the input image. lb_data_out: bits 28 and 29 are 1s.
- Fetch Data into Square Window: fetch all pixels into the square processing window. lb_data_out: bit 30 is 1.
- Calculate GLCM: calculate the GLCM in the specified direction (dx, dy). dx is the horizontal difference and dy the vertical difference between an interesting pixel and a target pixel. lb_data_out: bits 28 and 29 are 1s, bits 15-8 = dx, bits 7-0 = dy.
- Digest GLCM into Statistic Values: digest a GLCM into three statistic values, the 1st-, 2nd-, and 3rd-moment statistic values. lb_data_out: bit 31 is 1.
- Read 1st Moment into Result Register: read the 1st-moment statistic value into Result Register. lb_data_out: bits 27 and 31 are 1s.
- Read 2nd Moment into Result Register: read the 2nd-moment statistic value into Result Register. lb_addr = 0x0001
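The bit layouts above can be turned into 32-bit instruction words with a few helpers. This Python sketch copies the bit positions exactly as listed. As listed, Shift Square Window Position and Calculate GLCM both set bits 28 and 29 and differ only in the dx and dy fields. Storing negative offsets (for example dy = -1 for the northeast direction) as 8-bit two's complement is an assumption, and the constant names are hypothetical.

```python
INSTRUCTION_ADDR = 0x0001  # lb_addr used by every instruction listed above


def bits(*positions: int) -> int:
    """Return a 32-bit word with the given bit positions set to 1."""
    word = 0
    for p in positions:
        word |= 1 << p
    return word


RESET_WINDOW = bits(29)
SHIFT_WINDOW = bits(28, 29)
FETCH_WINDOW = bits(30)
DIGEST_GLCM = bits(31)
READ_MOMENT_1 = bits(27, 31)


def calculate_glcm(dx: int, dy: int) -> int:
    """Bits 28 and 29 set, dx in bits 15-8, dy in bits 7-0."""
    # Assumption: negative offsets are encoded as 8-bit two's complement.
    return bits(28, 29) | ((dx & 0xFF) << 8) | (dy & 0xFF)


for name, word in [("reset", RESET_WINDOW), ("fetch", FETCH_WINDOW),
                   ("glcm east", calculate_glcm(1, 0))]:
    print(f"{name}: lb_addr=0x{INSTRUCTION_ADDR:04X} lb_data_out=0x{word:08X}")
```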
21. created by calculating on the pixels within each square area. For each GLCM created in a square area, the 1st, 2nd, and 3rd moments are calculated. The three statistic values of each GLCM are stored in separate image-sized buffers, at the position of the center of the square area. Twelve image-sized buffers are needed, because GLCMs are created for four directions and each direction has three statistic values. Once all twelve buffers are fully filled, twelve statistics images are generated. If the distance is not more than half the size of the square area, Steps 3 to 8 are repeated with the distance increased. Finish the generation.

Figure 2.3 Flow chart of the GLCM statistic image generation (inputs: input image, sRect: square section size, sReg: region size).

2.2.4 Reasons for Choosing GLCM. This operation was chosen for implementation in hardware because it performs many iterative, computationally intensive processes:
- For each square area of pixels in the input image, four GLCM types are calculated, for the northeast, east, southeast, and south directions.
- For each direction, GLCMs are calculated for all distance values.
- For each GLCM, the 1st, 2nd, and 3rd statistic moments are calculated and stored in the output buffer at the position of the center of the square area.

The next square area of interest is
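Steps 3 to 8 can be sketched in Python to show how the twelve buffers for each distance are filled. The moment formula is left as a parameter because the text does not define it here. The placeholder in the example only returns the pair count and is not one of the thesis's moments. Odd square sizes, border pixels left at zero, one-sided pair counting, and y-down offsets are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]
Matrix = List[List[int]]
MomentFn = Callable[[Matrix], Tuple[float, float, float]]

# (dx, dy) offsets with y pointing down (assumed convention).
DIRECTIONS = {"northeast": (1, -1), "east": (1, 0), "southeast": (1, 1), "south": (0, 1)}


def window_glcm(img: Image, top: int, left: int, size: int,
                dx: int, dy: int, levels: int) -> Matrix:
    """GLCM of one size x size square; both pixels of a pair must lie inside it."""
    m = [[0] * levels for _ in range(levels)]
    for r in range(top, top + size):
        for c in range(left, left + size):
            tr, tc = r + dy, c + dx
            if top <= tr < top + size and left <= tc < left + size:
                m[img[r][c]][img[tr][tc]] += 1
    return m


def statistics_images(img: Image, size: int, levels: int, moments: MomentFn
                      ) -> Dict[int, Dict[Tuple[str, int], List[List[float]]]]:
    """Return, for each distance, twelve buffers keyed by (direction, moment order)."""
    h, w = len(img), len(img[0])
    half = size // 2
    result = {}
    for distance in range(1, half + 1):      # repeat while distance <= size / 2
        buffers = {(name, k): [[0.0] * w for _ in range(h)]
                   for name in DIRECTIONS for k in (1, 2, 3)}
        for top in range(h - size + 1):
            for left in range(w - size + 1):
                cy, cx = top + half, left + half   # center of the square area
                for name, (dx, dy) in DIRECTIONS.items():
                    m = window_glcm(img, top, left, size,
                                    dx * distance, dy * distance, levels)
                    for k, value in zip((1, 2, 3), moments(m)):
                        buffers[(name, k)][cy][cx] = value
        result[distance] = buffers
    return result


def pair_count(m: Matrix) -> Tuple[float, float, float]:
    """Placeholder statistic: number of counted pairs, repeated three times."""
    total = float(sum(map(sum, m)))
    return total, total, total


imgs = statistics_images([[0, 1, 0], [1, 0, 1], [0, 1, 0]], size=3, levels=2,
                         moments=pair_count)
print(len(imgs[1]), imgs[1][("east", 1)][1][1])  # 12 6.0
```

A real moment definition plugs in through the `moments` parameter without changing the loop structure.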
22. for determining the fastest algorithm for the GLCM statistics image generation operation in the experimental phase.

2.3 Prototyping Board. Section 2.1 discussed many communication methods. PCI is the best of them, but handling the PCI protocol is not the main purpose of this project. A prototyping board called True PCI was therefore chosen to handle communication.

Figure 2.4 True PCI, the FPGA prototyping board (labeled parts: JTAG, Reset, DONE, 4 LEDs, 2 buttons, DIP switch, external port switch, oscillator, GPIO connector, PCI slot, Spartan-3 FPGA). Source: True PCI User Manual rev. 1.2, Design Gateway Co., Ltd., page 4.

2.3.1 Description. True PCI is an FPGA prototyping board developed by Design Gateway Co., Ltd. It has a built-in PCI interface that fits any type of 32-bit PCI slot. The manufacturer also provides:
- the 32-bit PCI interface intellectual property core (IP core);
- the Windows driver;
- the dynamic link library;
- an example application.

These resources are essential for an FPGA designer who is not used to the PCI protocol, because they spare the designer from implementing them himself or herself.

2.3.2 Connectivity between the Board and a Computer. The prototyping board uses a PCI interface to communicate with a computer. The components in
23. Figure A.26 Timing diagram of Square Fetcher Operation (Begin).
- Cursor A: the operation begins when sf_en is high.
- Cursor B: once enabled, Square Fetcher requests a Memory Controller access grant. After it receives the grant, it begins fetching data from the Memory Unit onto the bus.
- Cursor C: when the data is on the bus (mc_done is high), it is written to Square Buffer on a negative edge of sb_wr_n.
24. in the Wait State. It waits until mi_en is high and then changes to Request State.
- Request State sets mi_req high and waits until mi_gnt is high. It then takes the first address of the GLCM to be read and changes to Read State.
- Read State sets mc_rw high to read the data (data_in) at this address. It uses the data to calculate the three statistic values (1st, 2nd, and 3rd order) together. The user chooses which GLCM statistic value appears on mi_data_out by setting mi_output_sel. When finished, it changes to Sum State.
- Sum State adds the values calculated at this position to the running sums over the GLCM. When finished, it changes to Decode State.
- Decode State calculates the next position in the GLCM and puts it on mc_addr. It returns to Read State until all positions in the GLCM have been processed, and then changes to Done State.
- Done State keeps mi_done high and waits until mi_en is low, then changes to Wait State.

Figure 4.2 shows the finite state machine of these operations. (State diagram labels: Wait State; Done State, left when mi_en is low and mi_done is high; Sum State, entered after completing the summation of the calculated position; branches on whether all positions in the GLCM have been processed.)
25. Figure A.25 Timing diagram of Center Indexer Operation. In Figure A.25, cursor A shows this module's Reset operation and cursor B shows its Next operation.
- Reset: done at the negative edge of 4 with ci_nxt low. It resets ci_row and ci_col to 0.
- Next: done at the negative edge of 4 n with ci_nxt high. It shifts the column one position to the right. The figure shows the window position shifting from (0, 0) to (0, 1).

A.5 Square Fetcher
26. mostly depends on the time used by the hardware side of the system. The time used by the hardware side is divided into three types, measured in clock cycles:
1. Request Time ($T_{req}$): the time used by pciif32 to send a read or write request to Process Controller. According to the pciif32 timing diagram, this takes four clock cycles per request.
2. SRAM Access Time ($T_{ram}$): the time used to access data in the SRAM, measured from the moment Memory Controller receives a high mc_en until mc_done goes high, indicating that the operation is done. Assuming that the clock used by Memory Controller has the same frequency as the global clock, this takes six clock cycles per access.
3. Synchronization Time ($T_{syn}$): the time spent in enable/done or request/grant signal handshaking. This takes one clock cycle per handshake.
The Image Processing in Hardware system performs operations by instructions. Each instruction is analyzed, and the time used by each operation is shown below.
1. Clear Interrupt Time ($T_{int}$): This operation takes one $T_{req}$, and the interrupt is cleared immediately. Thus $T_{int} = T_{req}$.
2. Write to Memory Time ($T_{wr}$): After this instruction is received, a request/grant handshake occurs and the SRAM is accessed. Then an enable/done handshake occurs, followed by an interrupt. Thus the time required is
$$T_{wr} = T_{req} + T_{ram} + 2T_{syn} + 1$$
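The first two instruction costs can be evaluated with a few lines of Python. The constants come from the text; the write formula is read as $T_{wr} = T_{req} + T_{ram} + 2T_{syn} + 1$ (one request/grant and one enable/done handshake plus one interrupt cycle), which is a reconstruction of a garbled equation and should be checked against the original report. The function names are hypothetical.

```python
# Clock-cycle costs stated in the text (assumes Memory Controller runs at the global clock).
T_REQ = 4   # pciif32 read/write request
T_RAM = 6   # one SRAM access, mc_en high until mc_done high
T_SYN = 1   # one enable/done or request/grant handshake


def t_clear_interrupt():
    """T_int = T_req: the interrupt is cleared as soon as the request arrives."""
    return T_REQ


def t_write_to_memory():
    """T_wr = T_req + T_ram + 2*T_syn + 1 (two handshakes, then one interrupt cycle)."""
    return T_REQ + T_RAM + 2 * T_SYN + 1
```

With these values, a Clear Interrupt costs 4 clock cycles and a Write to Memory instruction costs 13.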
27. [Waveform residue: mi_data_out_buf1, mi_data_out_buf2, mi_data_out_buf3]
Figure A.33 Timing diagram of Matrix Integrator Operation End
Figure A.32 shows the operation of Matrix Integrator. The operation is enabled at cursor A by raising mi_en. It reads mi_data_in from the memory when mc_rw rises at cursor B. It computes mi_data_out_buf1 = 0x00000002, mi_data_out_buf2 = 0x00000004 and mi_data_out_buf3 = 0x00000008 by combining three values, row = 0x00, col = 0x01 and data_in = 0x01, at cursor C. The operation is done at cursor D when mi_done is high.
A.11 Clock Divider
[Waveform residue: cd_clk_in, cd_rst_n, cd_clk_out]
Figure A.34 Timing diagram of Clock Divider Operation
Figure A.34 shows the operation of Clock Divider. Its operation begins when cd_rst_n rises at cursor A. The simulation runs Clock Divider with CLOCK_DIVISOR_NUMBER set to 3. Thus the 30 ns input clock period is divided into a 180 ns output clock period, measured between cursor B and cursor C.
Appendix B Schematics
The schematic of the Image Processing in Hardware system is shown in Figure B.1.
[Schematic residue: HOST COMPUTER, TRUE PCI, connector pins for ir_addr]
28. 3. Read from Memory Time ($T_{rd}$): The steps of this operation are the same as those of Write to Memory, so the time used is the same:
$$T_{rd} = T_{req} + T_{ram} + 2T_{syn} + 1$$
4. Clear Temporary Memory Time ($T_{clr}$): After a request from pciif32, this operation iteratively writes a zero value to each address in the last section of the SRAM. The number of addresses to be cleared equals GLCM_SCALE_LEVEL of the system. A handshake and an interrupt then follow. Thus
$$T_{clr} = T_{req} + T_{ram} \times \mathit{GLCM\_SCALE\_LEVEL} + T_{syn} + 1$$
5. Reset Square Window Position Time ($T_{rst}$): After the request is received, the operation takes one clock cycle to change ci_nxt and another to send a negative pulse through ci_ld_n. Then an interrupt follows. Thus
$$T_{rst} = T_{req} + 3$$
6. Shift Square Window Position Time ($T_{shf}$): The shifting operation takes the same time as the Reset Square Window Position operation. Thus
$$T_{shf} = T_{req} + 3$$
7. Fetch Data into Square Window Time (Tg): After a fetching request is received, Square Fetcher is enabled and a request/grant handshake occurs. Then WINDOW_SIZE pixels are fetched into the window. For each pixel, it uses one clock cycle to validate the coordinate, followed by an SRAM access time with an enable/done handshake time, one clock cycle to write the fetched data to Square Buffer, and another clock cycle to shift to the next pixel in the window. After
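The next four costs can be tabulated the same way. This Python sketch uses the stated constants ($T_{req} = 4$, $T_{ram} = 6$, $T_{syn} = 1$) and the formulas as read from this section ($T_{rd} = T_{req} + T_{ram} + 2T_{syn} + 1$, $T_{clr} = T_{req} + T_{ram} \times$ GLCM_SCALE_LEVEL $+ T_{syn} + 1$, $T_{rst} = T_{shf} = T_{req} + 3$); those readings reconstruct garbled equations and the function names are hypothetical.

```python
T_REQ = 4   # request time from pciif32, in clock cycles
T_RAM = 6   # SRAM access time
T_SYN = 1   # one handshake


def t_read_from_memory():
    # Same sequence as Write to Memory.
    return T_REQ + T_RAM + 2 * T_SYN + 1


def t_clear_temporary_memory(glcm_scale_level):
    # One zero-write per cleared address, then one handshake and one interrupt.
    return T_REQ + T_RAM * glcm_scale_level + T_SYN + 1


def t_reset_window():
    # One cycle to set ci_nxt, one for the ci_ld_n pulse, one for the interrupt.
    return T_REQ + 3


def t_shift_window():
    # Same cost as the reset operation.
    return T_REQ + 3
```

For GLCM_SCALE_LEVEL = 256, the clear costs 4 + 6 × 256 + 1 + 1 = 1542 cycles, so it is dominated by the per-address SRAM writes.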
29. request from pciif32 occurred.
mc_data (7:0), Bi-directional: Data read from, or to be written to, Memory Unit by Memory Controller.
4.2.2.3 Operations
Process Controller controls the operations of the other components except Arbiter. It receives instructions from the host computer through lb_addr and lb_data_out during a pciif32 write operation. 32-bit instructions were designed to control the operations of the system. Note that the addresses below must be right-shifted by 2 to be correctly used in the reg_read or reg_write functions, because the MSB to LSB range of lbaddr in pciif32 is 14:2 but that of Process Controller is 12:0. The instructions are:
1. Clear Interrupt
Description: Clear the interrupt produced by Process Controller.
lb_addr: 0x0000
lb_data_out: All bits are 0s.
2. Write to Memory
Description: Write 8-bit data to the specified address in the external memory.
lb_addr: 0x0001
lb_data_out: bits 24:8 hold the 17-bit address to be written; bits 7:0 hold the 8-bit data to be written; other bits are 0s.
3. Read from Memory into Data Register
Description: Read 8-bit data from the specified address in the external memory.
lb_addr: 0x0001
lb_data_out: bits 24:8 hold the 17-bit address to be read; other bits are 0s.
4. Clear Temporary Memory
Description: Clear the temporary memory in the external memory used during image processing.
lb_addr: 0x0001
lb_data_out: bit 28 is 1; other bits
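The field layouts listed above can be illustrated with a Python helper that packs (lb_addr, lb_data_out) pairs. This is a sketch: the function and constant names are hypothetical, sending the words through the True PCI DLL is not shown, and since this excerpt does not show which bit distinguishes a Read from Memory from a Write to Memory instruction, only the listed fields are packed.

```python
INSTRUCTION_REGISTER = 0x0001   # lb_addr used by instructions 2 to 4
ADDR_MASK = 0x1FFFF             # 17-bit external memory address (bits 24:8)
DATA_MASK = 0xFF                # 8-bit data (bits 7:0)


def clear_interrupt():
    """Clear Interrupt: lb_addr 0x0000, all data bits 0."""
    return 0x0000, 0x00000000


def write_to_memory(addr, data):
    """Write to Memory: address in bits 24:8, data in bits 7:0."""
    return INSTRUCTION_REGISTER, ((addr & ADDR_MASK) << 8) | (data & DATA_MASK)


def read_from_memory(addr):
    """Read from Memory into Data Register: address in bits 24:8."""
    return INSTRUCTION_REGISTER, (addr & ADDR_MASK) << 8


def clear_temporary_memory():
    """Clear Temporary Memory: bit 28 set; remaining bits assumed 0."""
    return INSTRUCTION_REGISTER, 1 << 28
```

For instance, `write_to_memory(0x000A1, 0xD1)` yields `(0x0001, 0x0000A1D1)`, the address and data pair that appears in the Memory Write timing diagram.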
30. Block structure showing ports of GLCM Builder
State Machine of GLCM Builder
Block structure showing ports of Address Decoder
Block structure showing ports of Matrix Voter
Finite State Machine of Matrix Voter
Block structure showing ports of Matrix Integrator
Finite State Machine of Matrix Integrator
Block structure showing ports of Clock Divider
The 1st moment output statistics image generated by software with R = 64, direction is east
The 1st moment output statistics image generated by hardware with R = 64, direction is east
The 2nd moment output statistics image generated by software with R = 64, direction is east
The 2nd moment output statistics image generated by hardware with R = 64, direction is east
The 3rd moment output statistics image generated by software with R = 64, direction is east
The 3rd moment output statistics image generated by hardware with R = 64, direction is east
Timing diagram of Memory Read Operation
Timing diagram of Memory Write Operation
Figure A.3 Timing diagram of Memory Clear Operation Begin
Figure A.4 Timing diagram of Memory Clear Operation End
Figure A.5 Timing diagram of Clear In
31. [State diagram residue: transitions "all positions in the GLCM matrix are not done, mc_done is low" and "calculate the next position to mc_addr"; Decode State]
Figure 4.21 Finite State Machine of Matrix Integrator
4.2.11 Clock Divider
4.2.11.1 Description
Clock Divider is responsible for dividing the frequency of a clock signal into another frequency. This module is named clk_divider and abbreviated as cd.
4.2.11.2 Ports
[Block diagram port labels: cd_clk_in, cd_clk_out, cd_rst_n]
Figure 4.22 Block structure showing ports of Clock Divider
Figure 4.22 shows the ports of this module, and those ports are described in Table 4.11.
Table 4.11 Ports of Clock Divider
Columns: Port Name (MSB:LSB based on Clock Divider), Direction, Description
4.2.11.3 Operations
Clock Divider divides the frequency of the input clock signal by the value of a design parameter. It uses a counter circuit in the FPGA: once the counter has counted the specified number of positive edges of the input clock, the counter is reset and the output signal is toggled to the opposite logic level. Because one output period contains two toggles, the output clock frequency is
$$f_{clk\_out} = \frac{f_{clk\_in}}{2 \times \mathit{CLOCK\_DIVISOR\_NUMBER}}$$
where $f_{clk\_in}$ and $f_{clk\_out}$ are the input and output frequencies in Hz.
Chapter 5 Implementation
After the design is complete, knowledge of deployment is required to apply the system to the real configuration. This chapter presents some prerequisites and pr
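A cycle-level Python sketch of the counter-and-toggle scheme shows where the factor of 2 comes from: the output toggles once every CLOCK_DIVISOR_NUMBER input edges, so a full output period spans twice that many input periods. It is a behavioral model, not the VHDL of clk_divider; the counter starting at zero after reset and the function names are assumptions.

```python
def simulate_clock_divider(num_input_edges, divisor):
    """Return the level of clk_out after each positive edge of clk_in."""
    count = 0
    clk_out = 0
    levels = []
    for _ in range(num_input_edges):
        count += 1
        if count == divisor:
            count = 0          # reset the counter
            clk_out ^= 1       # toggle the output to the opposite logic level
        levels.append(clk_out)
    return levels


def output_period_in_input_cycles(levels):
    """Distance, in input clock cycles, between the first two rising edges of clk_out."""
    rises = [i for i in range(1, len(levels)) if levels[i - 1] == 0 and levels[i] == 1]
    return rises[1] - rises[0]


def output_frequency(f_in_hz, divisor):
    """f_out = f_in / (2 * CLOCK_DIVISOR_NUMBER)."""
    return f_in_hz / (2 * divisor)
```

With CLOCK_DIVISOR_NUMBER = 3, the output period is six input periods, which matches the 30 ns to 180 ns result in the Clock Divider timing diagram.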
32. used to divide images into regions.
3.2.2 Objectives
1. To obtain the processing time of this operation performed by software, for comparison with the processing time of the hardware in the later phase.
2. To obtain the output statistics image used to verify the correctness of the operation performed by hardware in the later phase.
3.2.3 Materials and Equipment
1. Executable file of the GLCM Statistics Image Generation Operation, which implements timer functions.
2. Two Tagged Image File Format (TIFF) images of 16 by 16 pixels and 272 by 280 pixels.
3. A computer with these specifications:
a. CPU: Intel Celeron 2.4 GHz
b. Motherboard: IBM Intel i845
c. RAM: DDR 512 MB 133 MHz
3.2.4 Procedures
1. Execute the operation with the 16 by 16 pixel TIFF image, a 3 by 3 pixel square size and a 16-line region buffer. Observe and record the processing time and the output statistics image generated.
2. Repeat step 1, but change the image to the 272 by 280 pixel TIFF image and the number of lines in the buffer to 240.
3.2.5 Results
The number of lines in the buffer (R) is varied while the number of scale levels is fixed at 256 and the size of the square area of pixels is fixed at 3.
Table 3.2 The resulting processing time from varying the number of lines in the buffer
Columns: Image Size (pixels), Region Size (lines), Processing Time (seconds) as CPU Time and Wall Time
272x280 14682 24
3.2.6 Conclusion
33. the operation of pciif32.
lb_data_out (31:0), Input: Data to be read from or written to the register addressed by lb_addr.
pc_req, Output: If high, Process Controller requests control of Memory Controller's operation from Arbiter.
ci_ld_n, Output: If low, reset Center Indexer's row and column indices to zeroes.
ci_nxt, Output: If high, shift the indices of Center Indexer to the next pixel coordinate.
mc_rw, Output: Read/Write control signal for Memory Controller.
mc_clr_n, Output: Active-low temporary memory clear.
mc_en, Output: Enable signal for Memory Controller.
mi_output_sel (1:0), Output: Selects which result of Matrix Integrator is presented at mi_data_out.
img_width (9:0): Width of the target image.
img_height (9:0): Height of the target image.
gb_dx (3:0), Output: Difference in the horizontal direction from the interesting pixel, used in building the GLCM.
gb_dy (3:0), Output: Difference in the vertical direction from the interesting pixel, used in building the GLCM.
device_id (15:0): Device ID number. Default 0xF0F0.
img_addr (16:0): Starting address of the target image.
mc_addr (16:0), Output: Address used in the operation of Memory Controller.
lb_int, Output: If high, send an interrupt to the host computer.
lb_data_in (31:0): Data to be sent to the host computer when a read
34. 32 and operates as instructed by combinations of local bus address and data signals. Moreover, this module manages interfacing with pciif32, interrupt generation and error reporting. This module is named proc_ctrl and abbreviated as pc.
4.2.2.2 Ports
[Block diagram port labels: lb_addr, dbg_pc_st, device_id, lb_data_out, gb_dx, gb_dy, mi_data_out, img_addr, img_height, img_width, lb_data_in, mc_addr, lb_rd_n, mi_output_sel, vendor_id, lb_wr_n, ci_ld_n, mc_done, mi_en, pc_rst_n, pc_req, sf_en, sf_done, mc_data]
Figure 4.6 Block structure showing ports of Process Controller
Figure 4.6 shows the ports of this module, and those ports are described in Table 4.2.
Table 4.2 Ports of Process Controller
Columns: Port Name (MSB:LSB based on Process Controller), Direction, Description
pc_clk, Input: 33 MHz clock signal.
pc_rst_n, Input: Active-low reset signal.
pc_gnt, Input: If high, gain control of Memory Controller's operation.
sf_done, gb_done, mi_done, mc_done, lb_cs_n
lb_wr_n, Input: If low, pciif32 is performing a write operation.
lb_rd_n, Input: If low, pciif32 is performing a read operation.
lb_addr (12:0), Input: Address of the register participating in
35. [State diagram residue: f_en is low / f_en is high; Validate State (address is validated); Assign State (the pixel in the window is assigned to Square Buffer); transitions "all pixels in the window are not completely fetched" and "all pixels in the window are fetched"]
Figure 4.13 Fetch Finite State Machine of Square Fetcher
4.2.6 Square Buffer
4.2.6.1 Description
Square Buffer represents the square processing window. Its address space is independent of the Memory Unit address space, so the low-locality addressing of window pixels, which lie on different image rows, is changed into one contiguous address range, which speeds up the fetching operation. This module can read and write data at one of its addresses at the same time. Moreover, it provides two-channel data access, which allows two data to be read at the same time. This two-data access matches another characteristic of the image processing operation, which works on pairs of pixels in the square processing window. This module is named sq_buf and abbreviated as sb.
4.2.6.2 Ports
[Block diagram port labels: sb_data_in (7:0), sb_data_out1 (7:0), sb_rd_addr1 (3:0), sb_rd_addr2 (3:0), sb_wr_addr (3:0), sb_rst_n, sb_wr_n, sb_data_out2 (7:0)]
Figure 4.14 Block structure showing ports of Square Buffer
Figure 4.14 shows the ports of this module, and those ports are described in Table 4.6.
Table 4.6 Ports of Square Buffer
Columns: Port Name (MSB:LSB based on Square Buffer), Direction, Description
sb_wr_n, Input: If low, write the data from sb_data_in to the
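A small Python model of the Square Buffer interface shows one write port and two independent read ports working in the same clock cycle. The 16-entry depth follows from the 4-bit addresses; read-before-write behavior when a read and a write hit the same address, and the class and method names, are assumptions rather than details from the report.

```python
class SquareBuffer:
    """Behavioral sketch of sq_buf: one write port and two read ports.

    Assumptions (not from the report): 16 entries (4-bit addresses) and
    read-before-write behavior when a read and a write hit the same address.
    """

    def __init__(self, depth=16):
        self.mem = [0] * depth

    def clock(self, wr_n, wr_addr, data_in, rd_addr1, rd_addr2):
        """One clock cycle: return (sb_data_out1, sb_data_out2), then apply the write."""
        out1 = self.mem[rd_addr1]
        out2 = self.mem[rd_addr2]
        if wr_n == 0:                          # sb_wr_n is active low
            self.mem[wr_addr] = data_in & 0xFF  # 8-bit data
        return out1, out2
```

Two window entries, for example the interesting pixel and its neighbor, can be read in the same call, which is the two-channel access described above.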
36. [Waveform residue: mi_en, mc_done, mi_output_sel, mi_data_out]
Figure A.19 Timing diagram of Digest GLCM into Statistics Values Operation Begin
Figure A.19 shows the beginning of the operation. An instruction write request at cursor A writes a Digest GLCM into Statistics Values instruction to the instruction register. The operation of Matrix Integrator is enabled when Process Controller sets mi_en high at cursor B.
[Waveform residue: Process Controller clock, reset, request/grant and local bus signals, lb_addr = 0001, lb_data_out = FFFFFFFF, lb_int, mi_en, mi_done, mi_output_sel, mi_data_out]
Figure A.20 Timing diagram of Digest GLCM into Statistics Values Operation End
After Matrix Integrator sets mi_done high at cursor A in Figure A.20
37. Image Processing in Hardware
Mr. Kittituch Manakul
Mr. Surachai Chatchalermpun
A Project Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Engineering
Department of Computer Engineering, Faculty of Engineering
King Mongkut's University of Technology Thonburi
Academic Year 2007
Project Committee: Committee and Advisor; Committee; Committee; Asst. Prof. Surapont Toomnark
Project Title: Image Processing in Hardware
Project Credit: 4 credits
Project Participants: Mr. Kittituch Manakul, Mr. Surachai Chatchalermpun
Advisor: Kurt T. Rudahl, M.Sc.
Degree of Study: Bachelor's Degree
Department: Computer Engineering
Academic Year: 2007
Abstract
This project aims to reduce the software processing time of image processing operations by integrating a computing platform into an ordinary host computer as a co-processor. The computationally intensive parts of the operations are migrated to the computing platform, which performs them at higher speed and returns the results to the host computer.
39. Names and connections between the blocks are shown in Figure 4.1 and Figure 4.2.
[Block diagram residue: ARBITER, MEMORY UNIT, MEMORY CONTROLLER, CENTER INDEX, PCI INTERFACE, PROCESS CONTROLLER]
Figure 4.1 Block diagram of Image Processing in Hardware system (Top)
[Block diagram residue: CLOCK DIVIDER, MATRIX VOTER, mc_en, mc_addr]
Figure 4.2 Block diagram of Image Processing in Hardware system (Bottom)
4.1.2 System Components
There are 13 modules in the system:
1. Memory Unit
2. Process Controller
3. Memory Controller
4. Arbiter
5. Center Indexer
6. Square Fetcher
7. Square Buffer
8. GLCM Builder
9. Address Decoder
10. Matrix Voter
11. Matrix Integrator
12. Clock Divider
13. pciif32
4.1.3 Top-level System Operation
All components work together by exchanging digital signals with one another. The system operation starts from the host computer. By calling the initDevice function, pciif32 takes care of initializing the device. Once the device has been initialized
40. [Schematic residue: connector pin assignments for the ir_addr and ir_data buses, mc_en, debug signals and GND]
Figure B.1 Schematic of Image Processing in Hardware system
41. are divided into 2 sections as follows:
1. Prototyping board side section
2. Computer side section
[Figure labels: PCI Bus, Local Bus, User Application, Dynamic Link Library, PCI Interface, PCI Interface IP Core, User IP Core, FPGA, Computer, Prototyping Board]
Figure 2.5 Model of the connectivity between the prototyping board and a computer
2.3.2.1 Prototyping board side Section
This section contains an IP core called pciif32, which can send and receive PCI protocol commands and data to or from a computer via the PCI bus. The core also provides ports that interface with user-designed IP cores inside the FPGA via the local bus.
[Figure labels: 13-bit Address Bus, 32-bit Output Data Bus, 32-bit Inout Data Bus, Chip select signal, Read signal, PCI Core, User's Design]
Figure 2.6 Local bus signals of pciif32 interfacing with user core (Source: True PCI User Manual rev 1.2, Design Gateway Co., Ltd., page 13)
Table 2.1 Ports in the local bus of pciif32
Columns: Port Name (MSB:LSB based on pciif32), Direction, Description
lbaddr (14:2): Address signal.
lbdatain (31:0), Input: Data signal.
lbdataout (31:0), Output: Data signal.
lbrdb, Output: Active-low read signal. When this signal goes low, there is a read request from the computer, and pciif32 reads data from lbdatain and sends it to the computer.
42. are generated by hardware in the same experimental phase, to verify that the outputs of the operation in software and in hardware agree.
6.1.2 Objectives
1. To obtain the processing time of this operation performed by hardware.
2. To obtain the output statistics image generated by hardware.
3. To compare the correctness of the operation performed by software and by hardware.
6.1.3 Materials and Equipment
1. Executable file of the GLCM Statistics Image Generation Operation, which implements timer functions.
2. A Tagged Image File Format (TIFF) image of 272 by 280 pixels.
3. A computer with these specifications:
a. CPU: Intel Celeron 2.4 GHz
b. Motherboard: IBM Intel i845
c. RAM: DDR 512 MB 133 MHz
4. A True PCI card with these specifications:
a. Spartan-3 device, 200,000 system gates (XC3S200)
b. 3.3 V, 32-bit, 33 MHz PCI interface for PC-slot-based development
6.1.4 Procedures
1. Execute the operation with the TIFF image, a 3 by 3 pixel square size and a 256-line region buffer. Observe and record the processing time and the output statistics image generated.
2. Verify the correctness of the output statistics images generated by software and hardware as follows:
2.1 Find the percent error from
$$\text{Percent Error} = \frac{\sum \left| GLCM_{software} - GLCM_{hardware} \right|}{\mathit{GLCM\_SCALE\_LEVEL} \times \mathit{PIX\_NUM}} \times 100$$
where PIX_NUM is the total number of pixels in the output statistics image.
2.2 Repeat step 2.1 until all
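As a sketch, the percent-error check can be computed in Python, assuming the formula reads Percent Error = Σ|software − hardware| / (GLCM_SCALE_LEVEL × PIX_NUM) × 100, summed over all pixels of one output statistics image. That reading is a reconstruction of a garbled equation, and representing each image as a flat list of pixel values is an assumption for illustration.

```python
def percent_error(software, hardware, glcm_scale_level):
    """Percent error between two statistics images given as equal-length lists."""
    if len(software) != len(hardware):
        raise ValueError("images must have the same number of pixels")
    pix_num = len(software)
    total = sum(abs(s - h) for s, h in zip(software, hardware))
    return total / (glcm_scale_level * pix_num) * 100
```

Normalizing by GLCM_SCALE_LEVEL makes each pixel's difference a fraction of the full value range, so identical images give 0 and small mismatches give a small percentage.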
43. at cursor A when a write request of pciif32 occurs. The request writes the instruction to the instruction register. At cursor B, Process Controller requests an access grant to Memory Controller. After it is granted, the clear operation of Memory Controller is enabled by lowering mc_clr_n at cursor C.
[Waveform residue: Process Controller clock, reset, request/grant and local bus signals, lb_addr = 0001, lb_data_out = FFFFFFFF, lb_int, mc_addr, mc_data, mc_en, mc_rw, mc_clr_n, mc_done]
Figure A.11 Timing diagram of Clear Temporary Memory Operation End
After the clear operation completes, mc_done goes high at cursor A in Figure A.11, followed by the cancellation of the Memory Controller access at cursor B and an interrupt at cursor C.
A.2.5 Reset Square Window Position Operation
44. 4.2.9 Matrix Voter
4.2.10 Matrix Integrator
4.2.11 Clock Divider
Chapter 5 Implementation
5.1 Prerequisites
5.2 System Deployment Process
5.2.1 Hardware Side
5.2.2 Software Side
5.3 Theoretical Calculation
Chapter 6 Verification
6.1 Correctness Verification
6.1.1 Introduction
6.1.2 Objectives
6.1.3 Materials and Equipment
6.1.4 Procedures
6.1.5 Results
6.1.6 Conclusion
6.2 Execution Time Verification
6.2.1 Introduction
6.2.2 Objectives
6.2.3 Results
6.2.4 Conclusion
Chapter 7 Conclusion
References
Appendix A Timing Diagrams
A.1 Memory Controller
A.2 Process Controller
A.3 Arbiter
A.4 Center Indexer
A.5 Square Fetcher
A.6 Square Buffer
A.7 GLCM Builder
A.8 Address Decoder
A.9 Matrix Voter
A.10 Matrix Integrator
A.11 Clock Divider
Appendix B Schematics
List of Figures
Figure 2.1 Directions from an interesting pixel in an image
Figure 2.2 GLCM with 8 scale levels and 1 pixel distance in east
45. [Waveform residue: lb_int = 1, lb_data_in = 00000003]
Figure A.23 Timing diagram of Read from Interrupt Register Operation
Figure A.23 shows how a Read from Interrupt Register instruction works. A read request to the interrupt register (address 0x0000) of pciif32 occurs at cursor A and causes lb_data_in to change to 0x00000003, the value stored in the interrupt register.
A.3 Arbiter
[Waveform residue: ar_clk, ar_rst_n, pc_req, sf_req, mi_req, pc_gnt, sf_gnt, mv_gnt, mi_gnt]
Figure A.24 Timing diagram of Arbiter Operation
The timing diagram in Figure A.24 simulates the operation of Arbiter. Matrix Integrator requests an access grant at cursor A, and Arbiter accepts the request and grants access to it. During this grant, a request from Square Fetcher occurs at cursor B; Arbiter ignores this request until the grant of Matrix Integrator is cancelled. Cursor C shows contention, with requests arriving from both Square Fetcher and Matrix Voter at the same time; Arbiter decides that Square Fetcher wins because of its higher priority and grants access to it.
A.4 Center Indexer
[Waveform residue: ci_rst_n, ci_ld_n, ci_nxt]
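The Arbiter behavior in Figure A.24 can be sketched in Python as a non-preemptive, fixed-priority arbiter: a grant is held until its owner lowers its request (so Square Fetcher's request at cursor B waits), and simultaneous requests are resolved by priority. Only "Square Fetcher before Matrix Voter" is stated in the text; the positions of Matrix Integrator and Process Controller in the priority list, and the class name, are assumptions.

```python
class Arbiter:
    """Sketch of a non-preemptive fixed-priority arbiter for Memory Controller access."""

    # Only "sf before mv" is stated in the report; the rest of the order is assumed.
    DEFAULT_PRIORITY = ("sf", "mv", "mi", "pc")

    def __init__(self, priority=DEFAULT_PRIORITY):
        self.priority = priority
        self.owner = None              # requester currently holding the grant

    def step(self, requests):
        """Advance one cycle; `requests` is the set of modules with req high."""
        if self.owner is not None and self.owner not in requests:
            self.owner = None          # owner lowered its request: grant is cancelled
        if self.owner is None:
            for name in self.priority:
                if name in requests:
                    self.owner = name  # highest-priority requester wins
                    break
        return self.owner
```

Replaying the three cursors, `step({"mi"})`, `step({"mi", "sf"})` and `step({"sf", "mv"})` grant "mi", "mi" and "sf" respectively.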
46. be the same as the mc_addr received, read the data, put the data on mc_data, and return to the Wait State. If mc_rw is low, it changes its state to Write State, performs the write signaling and returns to the Wait State. Otherwise, if mc_clr_n is low while Memory Controller is in the Wait State, it changes its state to Clear State. In Clear State, a clear flag is set, a counter is started and the state changes to Write State. During an SRAM write operation, if the flag is set, Memory Controller writes a zero to the address equal to the starting address of the temporary memory plus the counter value, and returns to Clear State to increment the counter. If the counter reaches the size of the temporary memory, the clear process is done and the controller returns to the Wait State; otherwise it goes to Write State again. Figure 4.5 shows the finite state machine of these operations.
[State diagram residue: Wait, Clear and Write states; transitions "mc_en is high, mc_rw is low", "write operation is done, clear flag is clear", "write operation is done, clear flag is set", "counter reaches size of temporary memory", "counter doesn't reach size of temporary memory"]
Figure 4.5 Finite State Machine of Memory Controller
4.2.2 Process Controller
4.2.2.1 Description
Process Controller was designed to control the other modules and provide relevant information for their processing. It is controlled by pciif
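The clear sequence can be sketched in Python as a loop over the Clear and Write states, starting as if mc_clr_n had been seen low in the Wait State. The temporary-memory start address and size are parameters because this excerpt does not give them (the timing analysis places the temporary memory in the last section of the SRAM, sized by GLCM_SCALE_LEVEL); the read path is omitted, and the function and enum names are hypothetical.

```python
from enum import Enum, auto


class MCState(Enum):
    WAIT = auto()
    CLEAR = auto()
    WRITE = auto()


def run_clear(sram, temp_start, temp_size):
    """Zero sram[temp_start : temp_start + temp_size] via the Clear/Write loop.

    Returns the number of SRAM writes performed.
    """
    if temp_size <= 0:
        return 0
    state = MCState.CLEAR
    clear_flag = False
    counter = 0
    writes = 0
    while state is not MCState.WAIT:
        if state is MCState.CLEAR:
            if not clear_flag:
                clear_flag = True          # first entry: set flag, start counter
                counter = 0
                state = MCState.WRITE
            else:
                counter += 1               # back from a write: advance the counter
                if counter == temp_size:
                    clear_flag = False     # whole temporary memory cleared
                    state = MCState.WAIT
                else:
                    state = MCState.WRITE
        else:                              # WRITE with the clear flag set
            sram[temp_start + counter] = 0
            writes += 1
            state = MCState.CLEAR
    return writes
```

Each write is followed by a return to Clear State, so the number of writes equals the temporary-memory size, which is why the clear cost scales with GLCM_SCALE_LEVEL.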
47. but give the same synthesis result. Nowadays, high-level languages such as C and Java are used to describe the behavior of a circuit abstractly. This makes the design process easier than using HDLs, but such languages need special compiler programs to generate the netlist or to convert the description to an HDL. One popular example is ImpulseC. With ImpulseC, you can use the C language to describe circuits and debug them as C programs. The compiler simplifies the design process by bypassing many manual tasks, such as writing HDL code and synthesizing it. Moreover, the compiler makes the description of the circuit more abstract, because designers do not need hardware knowledge before using it. With such high-level language compilers, the design process can be finished faster than with HDLs alone, and it is easier for C programmers.
2.1.5 Finite State Machine (FSM)
A Finite State Machine is a behavioral model that consists of a finite number of states and transitions between the states. The model acts differently in different states: in each state it takes inputs and produces outputs in its own way. When a sufficient condition occurs, the model transits to the state that suits the condition. States, transitions and actions can be illustrated in a diagram called a State Diagram. The model is very suitable for designing control devices and processing devices such as an elevator controller or a calculator, because
48. [Waveform residue: local bus signals, lb_addr = 0001, lb_data_out = FFFFFFFF, lb_data_in = 00000000, pc_req, pc_gnt, mc_addr, mc_data, mc_en, mc_rw, mc_done]
Figure A.7 Timing diagram of Write to Memory Operation End
After a write to memory is complete, Process Controller receives a high mc_done at cursor A in Figure A.7. The request signal of Process Controller is lowered at cursor B to give up the request for the access grant. After that, an interrupt is raised at cursor C to synchronize with the host computer.
A.2.3 Read from Memory into Data Register Operation
[Waveform residue: pc_clk, lb_wr_n, lb_cs_n, lb_rd_n, lb_addr = 0001, lb_data_out = FFFFFFFF]
49. [Waveform residue: sr_addr, bf_dir, dbg_mc_st, mc_data, sr_data]
Figure A.1 Timing diagram of Memory Read Operation
Figure A.1 shows a memory read at address 0x000A2. The operation is enabled at cursor A by raising mc_en. The data 0x5D is received at cursor B, and the operation is done at cursor C when mc_done is high.
A.1.2 Memory Write Operation
[Waveform residue: mc_clk, mc_rst_n, mc_en, mc_rw, mc_addr, mc_clr_n, mc_done, SRAM control signals (sr_*), bf_dir, mc_data, sr_data]
Figure A.2 Timing diagram of Memory Write Operation
Figure A.2 shows a write operation to address 0x000A1 with the data 0xD1. The operation is enabled at cursor A by raising
50. mode with 1.5 MB per second (12 Mb per second), and High Speed mode with 60 MB per second (480 Mb per second).
2. Required voltage: 2.8 to 3.6 V.
3. Supports half-duplex communication.
4. Uses differential signaling.
Implementation Possibility
With this connection, the host computer requires a USB Host Controller connected to its PCI bus and a USB Host Controller Driver to control the data transfer between devices and the computer. Device drivers are necessary for communication between a host and its devices. Thus the FPGA board must have a USB controller to handle the USB protocol between the board and the host, and a driver for the board is necessary as well. The protocol of this connection uses packets: not only the real data but also the packet headers are transmitted. The protocol also specifies the transactions of packets to be sent. For example, if a host wants to send data to a device, handshake packets must be exchanged to ensure that communication is possible. Because non-data bits are sent along with the real data in the packets, the real transfer rate of this connection is lower than the 60 MB per second of High Speed mode, which creates a bottleneck in the system. Implementing a high-speed driver on the host side is also reported to be difficult.
2.1.6.4 Connect to the PCI bus via an I/O Board
By using the PIO-24 PCI I/O board as an intermediate be
51. dshake with the Memory Controller, takes one more clock cycle for the summing operation and another clock cycle for shifting to the next element. Then an enable/done handshake is made between the Process Controller and the Matrix Integrator, followed by an interrupt. Thus
$$T_{dig} = T_{req} + \mathrm{GLCM\_SCALE\_LEVEL}^{2}\,(T_{syn} + T_{ram} + 2) + T_{syn} + 1$$
10. Read 1st, 2nd or 3rd Moment Time ($T_{rd}$)
After receiving the instruction, the Process Controller takes one clock cycle to change mi_output_sel to the selected value and another one for the interrupt, so the time is
$$T_{rd} = T_{req} + 2$$
11. Initialize Image Time ($T_{img}$)
To initialize the image information, once the instruction is received, it takes one clock cycle to store the information and one more to send out an interrupt. Thus
$$T_{img} = T_{req} + 2$$
GLCM statistics image generation begins by dividing the whole image into sub-images called regions. Each region is initialized and every pixel in it is loaded into the SRAM one by one. For each pixel in the input image, except those that make a hole in the window, the window is moved to that pixel, the nearby pixels are fetched into the window, the GLCM is calculated for the window and digested into values, and all three statistics values are sent to the host computer. Once all three values are received, the operations are performed on every region, and the calculation for the next direction begins by iterating the
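The timing formulas can be turned into a small cycle-count model for estimating instruction latencies. This sketch assumes T_dig = T_req + GLCM_SCALE_LEVEL^2 (T_syn + T_ram + 2) + T_syn + 1 and T_rd = T_img = T_req + 2, all in clock cycles; T_req, T_syn and T_ram are treated as caller-supplied inputs because their concrete values are not given in this excerpt.

```c
#include <assert.h>

/* Digest GLCM into statistics values: every GLCM element costs a memory
 * handshake (t_syn), a RAM access (t_ram), one summing cycle and one shift
 * cycle; then one enable/done handshake and one interrupt cycle follow. */
static unsigned long t_digest(unsigned long t_req, unsigned long t_syn,
                              unsigned long t_ram, unsigned long scale_levels)
{
    unsigned long per_element = t_syn + t_ram + 2;
    return t_req + scale_levels * scale_levels * per_element + t_syn + 1;
}

/* Read 1st/2nd/3rd moment: one cycle to set mi_output_sel, one for the interrupt. */
static unsigned long t_read_moment(unsigned long t_req) { return t_req + 2; }

/* Initialize Image: one cycle to store the information, one for the interrupt. */
static unsigned long t_init_image(unsigned long t_req) { return t_req + 2; }

/* Convert a cycle count to microseconds for a clock given in MHz (e.g. 33.0). */
static double cycles_to_us(unsigned long cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles / clock_mhz;
}
```

With 8 scale levels the digest loop touches 64 elements, so its cost grows with the square of GLCM_SCALE_LEVEL while the other two instructions stay constant.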
52. e Fetcher is done with its operation
sb_wr_n | 1 | Output | Active-low Write Signal. If low, commands the Square Buffer to store the data from sb_data_in at the address specified by sb_wr_addr
mc_rw | 1 | Output | Read/Write Control Signal for Memory Controller
dbg_sf_st (3:0) | 4 | Output | Square Fetcher State LEDs (debug)
sb_wr_addr (3:0) | 4 | Output | Address in the Square Buffer address space to fetch data into
mc_addr (16:0) | 17 | Output | Address used in the operation of the Memory Controller
4.2.5.3 Operations
The Square Fetcher starts fetching from the top-left pixel of the square processing window and fetches from left to right and from top to bottom. There are two finite state machines in this module: the Main FSM and the Fetch FSM.
The Square Fetcher's Main FSM starts in the Wait State. If sf_en is high, it changes to the Request State. In this state the Square Fetcher requests control of the Memory Controller from the Arbiter by raising sf_req. Once sf_gnt is received, the request is complete and the Square Fetcher moves to the Fetch State. The Fetch FSM is enabled in this state by setting the f_en flag to logic 1, and the Main FSM waits for the f_done flag to become logic 1. Once the flag is set, the Main FSM resets the f_en flag and moves to the Done State, sending a high sf_done to the Process Controller. The done signal is held until sf_en is reset to low. Once sf_en is reset, the s
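The Main FSM described above can be modeled as a one-function state machine that is stepped once per clock edge. This is a behavioral sketch in C, not the VHDL design: outputs are derived from the current state (Moore style), sf_req is assumed to be asserted only in the Request State, and the return from Done to Wait when sf_en drops is assumed because that sentence is truncated in the source.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SF_WAIT, SF_REQUEST, SF_FETCH, SF_DONE } sf_state;

typedef struct {
    bool sf_en;   /* enable from the Process Controller */
    bool sf_gnt;  /* grant from the Arbiter */
    bool f_done;  /* done flag from the Fetch FSM */
} sf_inputs;

typedef struct {
    bool sf_req;  /* request to the Arbiter */
    bool f_en;    /* enables the Fetch FSM */
    bool sf_done; /* done signal to the Process Controller */
} sf_outputs;

/* One clock edge: drive the outputs of the current state, return the next state. */
static sf_state sf_step(sf_state s, sf_inputs in, sf_outputs *out)
{
    assert(out != 0);
    out->sf_req = false;
    out->f_en = false;
    out->sf_done = false;
    switch (s) {
    case SF_WAIT:
        return in.sf_en ? SF_REQUEST : SF_WAIT;
    case SF_REQUEST:
        out->sf_req = true;
        return in.sf_gnt ? SF_FETCH : SF_REQUEST;
    case SF_FETCH:
        out->f_en = true;
        return in.f_done ? SF_DONE : SF_FETCH;
    case SF_DONE:
        out->sf_done = true;                 /* held until sf_en goes low */
        return in.sf_en ? SF_DONE : SF_WAIT;
    }
    return SF_WAIT;
}
```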
53. e PCI driver. The list of functions and their descriptions is provided in Table 2.2.
Table 2.2 Functions and their descriptions provided in True PCI DLL
Function | Description
(initialization function) | Initialize the device and retrieve the device handle
ChkDeviceCnt | Count the number of devices in the system
reg_read | Read data from an address. The address sets the baddr signal and the data is read from the bdatain signal.
reg_write | Write data to an address. The address sets the baddr signal and the data sets the bdataout signal.
(finalization function) | Finalize the device
2.4 Static Random Access Memory (SRAM)
According to the FPGA specification, there are some distributed RAMs inside the FPGA on the prototyping board, but image processing operations consume a lot of memory, so those RAMs are not enough. The solution is an external SRAM.
2.4.1 Description
SRAM is an electronic memory that can store data as long as the device is powered. "Random access" means that the time needed to access any data item in it is always the same.
2.4.2 Operation
The SRAM chosen for this project is the AMIC LP621024D. Its access time is 75 ns. It was chosen because it provides 128K x 8 bits of memory and fits on the prototyping board. Because of this long access time, the processing is slowed down despite the relatively fast 33 MHz clock signal; the SRAM is a bottleneck. The operation of this SRAM is controlled by four sig
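The bottleneck can be quantified by counting how many 33 MHz clock cycles one 75 ns SRAM access spans. The sketch works in integer picoseconds and assumes every byte costs one full, non-pipelined access, so the byte rate it reports is an upper bound under that assumption rather than a measured figure.

```c
#include <assert.h>

/* Whole clock cycles needed to cover one SRAM access (ceiling division). */
static unsigned cycles_per_access(unsigned access_ns, unsigned clock_mhz)
{
    assert(clock_mhz > 0);
    unsigned long period_ps = 1000000UL / clock_mhz;   /* 33 MHz -> 30303 ps */
    unsigned long access_ps = access_ns * 1000UL;
    return (unsigned)((access_ps + period_ps - 1) / period_ps);
}

/* Bytes per second if each byte needs one full access and nothing overlaps. */
static double bytes_per_second(unsigned access_ns, unsigned clock_mhz)
{
    unsigned c = cycles_per_access(access_ns, clock_mhz);
    return clock_mhz * 1e6 / c;
}
```

For the LP621024D at 33 MHz this gives 3 cycles per access, i.e. at most about 11 million bytes per second.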
54. e for calculating the GLCM (dx and dy); NUMBER_OF_DISTANCE_BITS = log2(MAX_DIFFERENCE)
NUMBER_OF_IMAGE_ADDRESS_BITS | | 10 | natural | Number of bits to contain the maximum index of the image, both vertical and horizontal; NUMBER_OF_IMAGE_ADDRESS_BITS = log2(MAX_IMAGE_INDEX)
NUMBER_OF_WINDOW_ADDRESS_BITS | 0 to 6 | 4 | natural | NUMBER_OF_WINDOW_ADDRESS_BITS = log2(WINDOW_SIZE), where WINDOW_SIZE is the size of the square processing window
5.2 System Deployment Process
Anyone who wants to modify or extend this project needs to follow a deployment process. Deploying the system in a properly working condition can be separated into two sides: hardware and software.
5.2.1 Hardware Side
1. Study the image processing operation of interest.
2. Define which parts of the operation can use the designed components (e.g., Square Buffer and Matrix Integrator) and which parts must be done in the host computer (e.g., floating-point calculation).
3. Define how much memory the operation needs and look for a suitable SRAM.
4. Verify that the operation timing diagrams of the chosen SRAM are the same as those of the LP621024D, or that the SRAM can use the existing signals from the Memory Controller.
5. Modify the old components or design additional components.
6. Integrate those components into the system using VHDL.
7. Set the design parameter values.
8. Map ports of
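The "log" entries in the parameter table describe how many bits a counter or index needs. The sketch below interprets them as the number of bits required to hold every value from 0 up to a maximum (that is, ceil(log2(max + 1))); this interpretation is an assumption, but it matches the default of 10 bits for image indices up to 1023.

```c
#include <assert.h>

/* Number of bits needed to represent every value in 0..max_value. */
static unsigned bits_for(unsigned long max_value)
{
    unsigned n = 1;
    while (n < 8 * sizeof max_value && (max_value >> n) != 0)
        n++;
    return n;
}
```

A host-side script could use such a helper to check that NUMBER_OF_IMAGE_ADDRESS_BITS and NUMBER_OF_WINDOW_ADDRESS_BITS agree with the image size and window size chosen before synthesis.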
55. operation in Memory Unit
mv_req | 1 | Output | If high, requests control of the Memory Controller's operation
mc_en | 1 | Output | Enable Signal for Memory Controller
mc_rw | 1 | Output | Active-high Read signal: if high, read the data in the Memory Unit
mc_addr (16:0) | 17 | Output | Address in Memory Unit
mc_data (7:0) | 8 | Bidirectional | Data from or into the data bus
4.2.9.3 Operation
Normally the Matrix Voter is in the Wait State. When mv_en goes high, it changes to the Request State. This state sets mv_req high to request reading the data in the Memory Unit and waits until mv_gnt is high, then changes to the Read State. This state sets mc_addr equal to mv_addr, obtained from the Address Decoder, and sets mc_rw high to read the data at this address. When mc_done is high, it changes to the Vote State. This state increments the data by 1, sets mc_rw low, waits until en is high and then changes to the Write State. This state writes the data to the Memory Unit at the same address that was read and waits until mc_done is high, then changes to the Done State. This state waits until mv_en is low, sets mv_req low and sends mv_done to the GLCM Builder to report that its task is done; then it returns to the Wait State. Figure 4.19 shows the finite state machine of these operations.
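The Read, Vote and Write states together form a read-modify-write on one GLCM cell. The sketch below models that sequence in C with a plain byte array standing in for the Memory Unit; the array and function names are hypothetical. Because mc_data is 8 bits wide, a cell in this sketch wraps from 255 back to 0; how the actual design handles such overflow is not stated in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* One vote: read the cell at mv_addr, increment it, write it back to the
 * same address (Read -> Vote -> Write). memory[] stands in for the Memory Unit. */
static void matrix_vote(uint8_t *memory, unsigned long size, unsigned long mv_addr)
{
    assert(memory != 0 && mv_addr < size);
    uint8_t data = memory[mv_addr];   /* Read State: mc_rw high */
    data = (uint8_t)(data + 1);       /* Vote State: increment by 1 (8-bit wrap) */
    memory[mv_addr] = data;           /* Write State: mc_rw low, same address */
}
```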
56. Figure A.12 Timing diagram of Reset Square Window Position Operation
The operation is initiated at cursor A in Figure A.12 when an instruction write request from pciif32 occurs. The request writes a Reset Square Window Position instruction to the instruction register through lb_data_out. The position is reset by the negative edge of ci_ld_n while ci_nxt is low, at cursor B. After some delay, an interrupt is raised at cursor C.
A.2.6 Shift Square Window Position Operation
57. g characteristics of a computer system
3.1.3 Materials and Equipment
1. Executable files of each algorithm, with timer functions implemented in them
2. An Open Dragon image of 3822 by 2560 pixels
3. A computer with these specifications: (a) CPU: Intel Pentium 4, 1.6 GHz; (b) Motherboard: IBM, Intel i845 chipset; (c) RAM: DDR, 640 MB, 133 MHz
3.1.4 Procedures
1. Execute the m01 algorithm with the image and a 16-pixel distance. Observe and record the processing time and the GLCMs generated.
2. Repeat step 1 using the m02 algorithm instead of the m01 algorithm.
3. Repeat steps 1 and 2, changing the distance to 1, 128, 512 and 1024, respectively.
3.1.5 Results
The distance D was varied while the number of scale levels was fixed at 8.
Table 3.1 The resulting processing time from varying the distance
[Table 3.1: processing time in seconds of algorithms m01 and m02, three runs and their mean, for D = 1, 16, 128, 512 and 1024]
3.1.6 Conclusion
According to the results above, algorithm m01, which computes each matrix one after another, is the fastest. As the distance between interesting pixels increases, the processing time decreases. This is caused by the reduction
58. gure 4.16 shows the finite state machine of these operations.
[Figure 4.16 state diagram: Move, Vote and Done states; transitions on gb_en, on whether the interesting pixel combined with gb_dx and gb_dy lies inside the square window, on mv_done being high, and on whether all pixels in the square window have been processed]
Figure 4.16 Finite State Machine of GLCM Builder
4.2.8 Address Decoder
4.2.8.1 Description
The Address Decoder calculates, from a pair of row and column indices, the linear address used to address data in the Memory Unit's address space. This module is named dec and abbreviated as ad.
4.2.8.2 Ports
Figure 4.17 Block structure showing ports of Address Decoder
Figure 4.17 shows the ports of this module, which are described in Table 4.8.
Table 4.8 Ports of Address Decoder
Port Name (MSB:LSB) | Direction | Description
ad_rst_n | Input | Active-low Reset Signal
ad_row (7:0) | Input | Row Index
ad_col (7:0) | Input | Column Index
ad_start (16:0) | Input | Starting Address
ad_addr (16:0) | Output | Decoded Address
4.2.8.3 Operation
Normally the Address Decoder operates after the Square Buffer sends two data values: the first is defined as the column (ad_col) of the GLCM matrix used for voting, and the second da
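The Address Decoder's job, turning a (row, column) pair plus a starting address into one linear address, can be sketched as a row-major calculation. The layout is an assumption: this excerpt does not give the formula, so the sketch takes `levels` cells per GLCM row, adds ad_start, and masks the result to the 17-bit address bus.

```c
#include <assert.h>
#include <stdint.h>

/* Linear address of GLCM cell (ad_row, ad_col), assuming row-major storage
 * with `levels` cells per row, starting at ad_start, on a 17-bit bus. */
static uint32_t decode_address(uint32_t ad_start, unsigned ad_row,
                               unsigned ad_col, unsigned levels)
{
    assert(ad_row < levels && ad_col < levels);
    return (ad_start + (uint32_t)ad_row * levels + ad_col) & 0x1FFFFu;
}
```

With 256 levels the row index simply becomes the upper byte of the offset, which in hardware needs no multiplier.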
59. Figure 4.19 Finite State Machine of Matrix Voter
4.2.10 Matrix Integrator
4.2.10.1 Description
The Matrix Integrator calculates the three GLCM statistic values by summing over all calculated positions of the GLCM matrix stored in the Memory Unit. It does this in cooperation with the Memory Controller. This module is named mat_int and abbreviated as mi.
4.2.10.2 Ports
Figure 4.20 Block structure showing ports of Matrix Integrator
Figure 4.20 shows the ports of this module, which are described in Table 4.10.
Table 4.10 Ports of Matrix Integrator
Port Name (MSB:LSB) | Direction | Description
mi_clk | Input | 33 MHz Clock Signal
mi_rst_n | Input | Active-low Reset Signal
mi_en | Input | Enable Signal for Matrix Integrator
mi_gnt | Input | If high, the Matrix Integrator gains control of the Memory Controller's operation
mi_output_sel (1:0) | Output | Selects the GLCM Statistic Moment: 1 = 1st order, 2 = 2nd order, 3 = 3rd order
mi_req | Output | If high, requests control of the Memory Controller's operation
mc_addr (16:0) | Output | Address in Memory Unit
mi_data_out (31:0) | Output | GLCM Statistic Moment
4.2.10.3 Operation
Normally the Matrix Integrator is
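The excerpt does not spell out which moment formula the Matrix Integrator sums, so the following sketch assumes the k-th order difference moment, the sum over all cells of |i - j|^k * G(i, j), with one byte per GLCM cell (matching the 8-bit mc_data bus). It accumulates in 64 bits for safety; the hardware's mi_data_out is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed k-th order difference moment of a levels x levels GLCM stored
 * row-major, one byte per cell: sum of |i - j|^order * G[i][j]. */
static uint64_t glcm_moment(const uint8_t *glcm, unsigned levels, unsigned order)
{
    assert(glcm != 0 && order >= 1 && order <= 3);
    uint64_t sum = 0;
    for (unsigned i = 0; i < levels; i++)
        for (unsigned j = 0; j < levels; j++) {
            uint64_t d = i > j ? i - j : j - i;   /* distance from the diagonal */
            uint64_t w = 1;
            for (unsigned k = 0; k < order; k++)
                w *= d;
            sum += w * glcm[i * levels + j];
        }
    return sum;
}
```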
60. if32. Figure 4.7 shows the finite state machine of this module.
[Figure 4.7 state diagram; transition labels include: Memory Controller is done; Fetch Data into Square instruction is received; Square Fetcher is done; a window-position-related instruction is received; GLCM Builder is done; a statistic-related instruction is received]
Figure 4.7 Finite State Machine of Process Controller
4.2.3 Arbiter
4.2.3.1 Description
The Arbiter grants access to the Memory Controller. Because four modules connect to the same ports of the Memory Controller, it would otherwise be ambiguous which module is controlling it. The Arbiter resolves this by granting access to the highest-priority request at that moment. This module is named arbiter and abbreviated as ar.
4.2.3.2 Ports
Figure 4.8 Block structure showing ports of Arbiter
Figure 4.8 shows the ports of this module, which are described in Table 4.3.
Table 4.3 Ports of Arbiter
Port Name | Direction | Description
pc_req | Input | If high, the Process Controller is requesting control of the Memory Controller
sf_req | Input | If high, the Square Fetcher is requesting control of the Memory Controller
mv_req | Input | If high, the Matrix Voter is requesting control of the Memory Controller
mi_req | Input | If high, the Matrix Integrator is requesting control of the Memory Controller
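Granting the highest-priority request is a fixed-priority arbiter. The sketch below assumes the priority order Process Controller, Square Fetcher, Matrix Voter, Matrix Integrator (the order of the table rows; the source does not state the actual order) and also assumes that a grant is held until its owner drops the request, so a memory operation is not interrupted midway.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { GNT_NONE = -1, GNT_PC, GNT_SF, GNT_MV, GNT_MI } grant_t;

/* req[] is indexed by GNT_PC..GNT_MI. Assumed priority: lower index wins. */
static grant_t arbitrate(grant_t current, const bool req[4])
{
    assert(req != 0);
    if (current != GNT_NONE && req[current])
        return current;                /* keep the grant while the owner still requests */
    for (int i = 0; i < 4; i++)
        if (req[i])
            return (grant_t)i;         /* otherwise grant the highest-priority requester */
    return GNT_NONE;
}
```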
61. ified so that the data transfer and the device-specific functions are correctly substituted or inserted into the source code, and the program needs to be rebuilt. In addition, this project tries to generalize the designs to widen the user base of the reconfigurable system (e.g., application developers and researchers) by dividing the system's function into functionally independent modules, so that these modules can be modified or rerouted into a new system. Moreover, the designs use the design parameter technique to help deploy the system onto different platforms and to reduce the effort of studying each module in detail.
References
[1] David Pellerin and Scott Thibault, Practical FPGA Programming in C, Pearson Education Inc., USA, 2005.
[2] R. C. Cofer and Ben Harding, Rapid System Prototyping with FPGAs, Elsevier Inc., USA, 2006.
[3] Digital Image Processing [online]. Available: http://en.wikipedia.org/wiki/Digital_image_processing (2007, June 22).
[4] Finite State Machine [online]. Available: http://en.wikipedia.org/wiki/Finite_state_machine (2007, June 18).
[5] Peripheral Component Interconnect [online]. Available: http://en.wikipedia.org/wiki/Peripheral_Component_Interconnect (2007, July 21).
[6] Universal Serial Bus [online]. Available: http://en.wikipedia.org/wiki/USB#USB_mass_storage (2007, July 26).
[7] Using a Gray Level Co-occurrence Matrix [online]. Available: http://matlab.izmiran.ru/help/toolbo
62. in the number of pixels taken into the calculation of a GLCM. The reduction cannot be avoided because the GLCM requires both pixels to be valid. For example, if the interesting pixel is 5 pixels from the east side of the image, the pixel 10 pixels to its east does not exist, so the interesting pixel must be ignored for that GLCM.
Because algorithm m01 is faster than m02, the results show that the computer's fetch-execute cycle works well when the processed data has locality of reference, i.e., the next data item is near the data being processed. In the m01 algorithm the GLCMs are processed one by one, which means that each pair of interesting pixels is always in the same direction and at the same distance. In contrast, m02 fills all four GLCMs at the same time, which causes the CPU to fetch a next pixel that is not in the same direction as the pixel being processed.
3.2 Gray-level Co-occurrence Matrix Statistics Image Generation
3.2.1 Introduction
This operation generates many statistics images from many GLCMs computed over square segments of a specified size of the input image. Two arguments are required for generating the statistics images from an input image:
1. The size of the square areas of pixels of the input image used in calculating the GLCMs for each direction, i.e., northeast, east, southeast and south
2. The number of rows of pixels that can be in a buffer, which is
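The difference between m01 and m02 is only the loop order, which the sketch below makes explicit. It is an illustration, not the experiment's code: it assumes pixels are already scaled to 8 levels, uses offsets (dx, dy) for east, southeast, south and northeast with y growing downward, and skips partners that fall outside the image, as the conclusion describes. Both orders produce identical GLCMs; m01 simply walks memory in a more cache-friendly way.

```c
#include <assert.h>
#include <string.h>

#define LEVELS 8
#define NDIR 4

/* Offsets for east, southeast, south, northeast (y grows downward). */
static const int DX[NDIR] = {1, 1, 0, 1};
static const int DY[NDIR] = {0, 1, 1, -1};

static void count_pair(const unsigned char *img, int w, int h, int x, int y,
                       int dir, int d, unsigned glcm[NDIR][LEVELS][LEVELS])
{
    int nx = x + DX[dir] * d, ny = y + DY[dir] * d;
    if (nx < 0 || nx >= w || ny < 0 || ny >= h)
        return;                               /* partner outside the image: skip */
    assert(img[y * w + x] < LEVELS && img[ny * w + nx] < LEVELS);
    glcm[dir][img[y * w + x]][img[ny * w + nx]]++;
}

/* m01 style: finish one GLCM completely before starting the next. */
static void glcm_m01(const unsigned char *img, int w, int h, int d,
                     unsigned glcm[NDIR][LEVELS][LEVELS])
{
    memset(glcm, 0, sizeof(unsigned) * NDIR * LEVELS * LEVELS);
    for (int dir = 0; dir < NDIR; dir++)
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                count_pair(img, w, h, x, y, dir, d, glcm);
}

/* m02 style: fill all four GLCMs during a single pass over the pixels. */
static void glcm_m02(const unsigned char *img, int w, int h, int d,
                     unsigned glcm[NDIR][LEVELS][LEVELS])
{
    memset(glcm, 0, sizeof(unsigned) * NDIR * LEVELS * LEVELS);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            for (int dir = 0; dir < NDIR; dir++)
                count_pair(img, w, h, x, y, dir, d, glcm);
}
```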
63. is register holds information about the input image. The input image's width and height are declared to the other modules. This register can be set with the Initialize Image instruction.
4. Result Register (Address 0x0004)
This register contains the result (digested value) after a Read 1st, 2nd or 3rd Moment into Result Register instruction.
5. Message Register (Address 0x0005)
This register is set after the completion of every instruction to provide error-report information to the host computer. Its value can be:
MSG_OK: The operation was done properly.
MSG_IMAGE_NOT_INIT: Cannot perform the requested instruction because an Initialize Image instruction has never been received.
MSG_IMAGE_SIZE_INVALID: The specified input image size is larger than the unused external memory size of the system.
Normally the Process Controller is in the Wait State. Instructions are received by monitoring when the lb_cs_n, lb_wr_n and lb_rd_n signals go low. According to pciif32 operations, the Process Controller responds to two types of requests from pciif32: a read request or a write request. When a write request is received (lb_cs_n and lb_wr_n are low) and a flag named is_image_described is clear, an Initialize Image instruction should be sent to the Process Controller; is_image_described is then set and an interrupt is raised at the host computer with MSG_OK in the Message Register. Otherwise an inter
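The write-request check against is_image_described can be sketched as a small decision function. Everything numeric here is hypothetical: the message codes are not given in the source, the image is assumed to need one byte per pixel, and the behavior after the truncated "Otherwise" is assumed to be dispatching the instruction.

```c
#include <assert.h>
#include <stdbool.h>

/* Message Register values; the numeric codes are hypothetical. */
typedef enum { MSG_OK = 0, MSG_IMAGE_NOT_INIT = 1, MSG_IMAGE_SIZE_INVALID = 2 } msg_t;

typedef struct {
    bool is_image_described;
    unsigned long free_memory_bytes;   /* unused external memory */
} pc_status;

/* Decide the Message Register value for one write request. */
static msg_t handle_write(pc_status *pc, bool is_init_image,
                          unsigned width, unsigned height)
{
    assert(pc != 0);
    if (is_init_image) {
        /* assumed: one byte per pixel must fit into the unused memory */
        if ((unsigned long)width * height > pc->free_memory_bytes)
            return MSG_IMAGE_SIZE_INVALID;
        pc->is_image_described = true;
        return MSG_OK;
    }
    if (!pc->is_image_described)
        return MSG_IMAGE_NOT_INIT;
    return MSG_OK;                     /* the instruction would be dispatched here */
}
```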
64. iter Operation
Timing diagram of Center Indexer Operation
Timing diagram of Square Fetcher Operation Begin
Timing diagram of Square Fetcher Operation End
Timing diagram of Square Buffer Operation
Timing diagram of GLCM Builder Operation
Timing diagram of Address Decoder Operation
Timing diagram of Matrix Voter Operation
Timing diagram of Matrix Integrator Operation Begin
Timing diagram of Matrix Integrator Operation End
Timing diagram of Clock Divider Operation
Schematic of Image Processing in Hardware system
List of Tables
Table 2.1 Ports in the local bus of pciif32
Table 2.2 Functions and their descriptions provided in True PCI DLL
Table 3.1 The resulting processing time from varying the distance
Table 3.2 The resulting processing time from varying the number of lines in the buffer
Table 4.1 Ports of Memory Controller
Table 4.2 Ports of Process Controller
Table 4.3 Ports of Arbiter
Table 4.4 Ports of Center Indexer
Table 4.5 Ports of Square Fetcher
Table 4.6 Ports of Square Buffer
Table 4.7 Ports of GLCM Builder
Table 4.8 Ports of Address Decoder
Table 4.9 Ports of Matrix Voter
Table 4.10 Ports of Matrix Integrator
Table 4.11 Ports of Clock Divider
Table 5.1 Design Parameters of Image Processing in Hardware System
Table 6.1 The result processing time fr
65. lb_data_out: bits 28 and 31 are 1s; other bits are 0s.
12. Read 3rd Moment into Result Register
Description: Read the 3rd moment statistic value into the Result Register.
lb_addr: 0x0001
lb_data_out: bits 27, 28 and 31 are 1s; other bits are 0s.
13. Initialize Image
Description: Specify the width and height of an input image.
lb_addr: 0x0002
lb_data_out: bits 31-16: width of the image; bits 15-0: height of the image.
There are five 32-bit registers involved in the operations of the Process Controller. Each can be addressed by setting lb_addr to a defined address.
1. Interrupt Register (Address 0x0000)
This register controls sending an interrupt to the host computer. If its bit 0 is set, an interrupt occurs and lb_int goes high. Once set, it can be cleared by the Clear Interrupt instruction.
2. Instruction/Data Register (Address 0x0001)
This register operates in two modes, input and output. It operates in input mode when there is a write request with lb_addr equal to 0x0001; the register then acts as the Instruction Register, storing an instruction to be performed by the Process Controller. Conversely, it operates in output mode when there is a read request with the same lb_addr; the register then acts as the Data Register, storing the data fetched from the specified address in the external memory by the Read from Memory into Data Register instruction.
3. Image Register (Address 0x0003)
Th
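On the host side, the instruction words above are plain bit patterns. The sketch below packs them as described (width in bits 31-16 and height in bits 15-0 for Initialize Image; bits 27, 28 and 31 for Read 3rd Moment). The function names are hypothetical, and the words would still have to be sent with the DLL's register-write function at the listed lb_addr values.

```c
#include <assert.h>
#include <stdint.h>

/* lb_data_out for Initialize Image: width in bits 31..16, height in 15..0. */
static uint32_t init_image_word(uint16_t width, uint16_t height)
{
    return ((uint32_t)width << 16) | height;
}

/* lb_data_out for "Read 3rd Moment into Result Register": bits 27, 28, 31 set. */
static uint32_t read_moment3_word(void)
{
    return (1u << 31) | (1u << 28) | (1u << 27);
}

/* Inverse of init_image_word, as the Process Controller would split it. */
static void decode_image_word(uint32_t word, uint16_t *width, uint16_t *height)
{
    assert(width != 0 && height != 0);
    *width = (uint16_t)(word >> 16);
    *height = (uint16_t)(word & 0xFFFFu);
}
```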
66. lder by setting gb_en high at cursor B.
Figure A.18 Timing diagram of Calculate GLCM Operation End
The end of the operation is shown in Figure A.18. At cursor A, the GLCM Builder declares the completion of its operation by raising gb_done. The Process Controller then resets gb_dx and gb_dy at cursor B, and an interrupt occurs at cursor C.
A.2.9 Digest GLCM into Statistics Values Operation
67. mmable Gate Array (FPGA)
A Field Programmable Gate Array is a programmable large-scale integrated circuit (LSI). Unlike other ICs, which cannot be reprogrammed after they are manufactured, an FPGA is programmable because it contains a large number of programmable logic cells capable of performing small logic functions. These cells are connected to one another through programmable interconnections inside the FPGA. By programming the device, a more complex logic function is formed to suit the user's needs.
Currently, designing an FPGA circuit configuration begins with gathering requirements, e.g., the inputs and outputs of the circuit, timing constraints or area constraints. Then files in a hardware description language (HDL) are written to describe the behavior of the system. The HDL descriptions are behaviorally simulated on the computer to make sure that they work properly. After simulation, circuit diagrams are generated by circuit synthesis software. Finally, the diagrams are mapped to the technology of the target FPGA by place-and-route software. The result of the mapping is the configuration of the FPGA (commonly loaded as a bitstream). Once the configuration is loaded into the FPGA board and the switch is turned on, it is applied to the FPGA, which can then perform the behavior described by the HDL.
Advantages of using FPGAs in processing:
1. Perform computationally intensive operations much faster than computers
68. n an image. Many statistics images are generated from one input image, each different from the others, because each is generated from a different gray-level co-occurrence matrix that is unique in distance and direction. These statistics images describe texture, so they can be used in texture detection and texture segmentation by comparing the statistics images generated from a pattern image with those generated from a segment of the image in which texture detection or segmentation is needed.
2.2.2 Theories
2.2.2.1 Gray-level Co-occurrence Matrix (GLCM)
A GLCM is a square matrix that contains the number of times each pattern of two scaled values is found while examining pairs of pixels across an image. The row and column indices of the GLCM represent the possible scaled values of an interesting pixel and of the pixel at the specified direction and distance from it. Thus the size of the matrix equals the number of possible scaled values (e.g., 256 values give a 256 x 256 matrix). Each position in the matrix represents a pattern of a pair of scaled values, and the number of times that pattern is found in the image is stored in that position.
Figure 2.1 Directions from an interesting pixel in an image
An example of a GLCM with 8 scale levels and a 1-pixel distance in the east direction is shown below to describe how a GLCM is generated.
Figure 2.2 GLCM with 8 scale levels and 1 pixel di
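The GLCM definition above translates directly into a counting loop. This sketch assumes linear scaling of 8-bit gray values (v * levels / 256) and a row-major levels x levels count array; the direction is given as an offset (dx, dy) with y growing downward, so the east direction at 1-pixel distance is (1, 0). Pairs whose partner pixel falls outside the image are skipped.

```c
#include <assert.h>

/* Scale an 8-bit gray value to 0..levels-1 (assumed linear scaling). */
static unsigned scale_gray(unsigned char v, unsigned levels)
{
    assert(levels > 0 && levels <= 256);
    return (unsigned)v * levels / 256u;
}

/* glcm[i * levels + j] counts interesting pixels with scaled value i whose
 * partner at offset (dx, dy) has scaled value j. */
static void build_glcm(const unsigned char *img, int w, int h, unsigned levels,
                       int dx, int dy, unsigned *glcm)
{
    for (unsigned k = 0; k < levels * levels; k++)
        glcm[k] = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= w || ny < 0 || ny >= h)
                continue;                       /* partner does not exist */
            unsigned i = scale_gray(img[y * w + x], levels);
            unsigned j = scale_gray(img[ny * w + nx], levels);
            glcm[i * levels + j]++;
        }
}
```

For example, a 2 x 2 image whose left column is 0 and right column is 255 yields, for 8 levels and the east direction, a count of 2 at row 0, column 7 and zeros elsewhere.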
69. n the digital image processing system, except for the processing related to compute-intensive operations.
2.1.2 Computer Architecture
Present-day computer architecture is developed from the Von Neumann architecture. A computer in the Von Neumann architecture consists of two units: a processing unit and a storage unit. Both data and instructions are stored in the same storage, and the processing unit processes them. The architecture has a bottleneck: when the processor needs to process a large amount of data, it has to wait a long time because of the limited throughput of transfers between the storage and the processing unit, which reduces the efficiency of this architecture.
To process, the processor fetches an instruction from the storage. After the instruction is fetched, the processor executes it. If the instruction needs input data, the processor requests the data from the storage, and then the execution completes. This process is repeated continuously, so it is called the fetch-execute cycle. Because of this cycle, the architecture performs compute-intensive processes more slowly than an FPGA, which executes the operation without the instruction-fetch part of the cycle. This speed difference matters most for processes that apply the same compute-intensive operation to a large amount of data, such as digital image processing operations.
2.1.3 Field Progra
70. nals: Active-low Chip Enable 1 (ce1_n), Chip Enable 2 (ce2), Active-low Write Enable (we_n) and Active-low Output Enable (oe_n). Additional signals are a 17-bit address signal (Address[16:0]) and an 8-bit data signal (Din[7:0]/Dout[7:0]).
A read operation can be performed by continuously holding both ce1_n and oe_n at low voltage and both we_n and ce2 at high voltage. Changing Address to the required address then reads the data from the SRAM and drives it onto the data lines after 75 ns. This operation is shown in Figure 2.9.
Figure 2.9 Timing Diagram of the SRAM Read Operation (Source: AMIC LP621024D Data Sheet, AMIC Technology Corp., page 6)
A write operation is shown in Figure 2.10. oe_n is held constantly at low voltage. Begin by changing the Address signal to the address to be written and setting ce1_n to low voltage and both ce2 and we_n to high voltage; then, after 15 ns, toggle we_n to the other state. The write is performed during the overlap of high ce2, low ce1_n and low we_n, so the data to be written must be present during this period. After about 60 ns, those signals are toggled again and the operation is finished.
Figure 2.10 Timing Diagram of the SRAM Write Operation (Source: AMIC LP621024D Data Sheet, AMIC Technology Corp., page 6)
Note that the SRAM is a 5 V device but the FPGA prototyping board is
71. ng mc_en; then a memory write is performed at cursor B, and the operation is done at cursor C when mc_done is high.
A.1.3 Memory Clear Operation
Figure A.3 Timing diagram of Memory Clear Operation Begin
Figure A.3 shows the start of a clear operation. The operation begins at cursor A by lowering mc_clr_n. The signals between cursors B and C show a write operation with zero data. The address being zeroed increases from 0x10000, which is the starting address of the temporary memory.
72. ntroduction
1.1 Project Background
1.2 Project Objectives
Chapter 2 Research and Study
2.1 Related Theories
2.1.1 Digital Image Processing
2.1.2 Computer Architecture
2.1.3 Field Programmable Gate Arrays
2.1.4 Hardware Description Language
2.1.5 Finite State Machine
2.1.6 Communication Connectivity between an FPGA Board and a Computer
2.2 Gray-level Co-occurrence Matrix Statistic Image Generation
2.2.1 Description
2.2.2 Theories
2.2.3 Processes
2.2.4 Reasons of Choosing
2.2.5 Problem Issues
2.3 Prototyping Board
2.3.1 Description
2.3.2 Connectivity between the Board and a Computer
2.4 Static Random Access Memory
2.4.1 Description
2.4.2 Operation
Chapter 3 Experiments
3.1 Possible Algorithms for Gray-level Co-occurrence Matrix
3.1.1 Introduction
3.1.2 Objectives
3.1.3 Materials and Equipment
3.1.4 Procedures
3.1.5 Results
3.1.6 Conclusion
3.2 Gray-level Co-occurrence Matrix Statistic Image
3.2.1 Introduction
3.2.2 Objectives
3.2.3 Materials and Equipment
3.2.4 Procedures
3.2.5 Results
3.2.6 Conclusion
Chapter 4 Designs
4.1 Top-level Design of the System
4.1.1 Block Diagram
4.1.2 System Components
4.1.3 Top-level System Operation
4.2 Components Design
4.2.1 Memory Controller
4.2.2 Process Controller
4.2.3 Arbiter
4.2.4 Center Indexer
4.2.5 Square Fetcher
4.2.6 Square Buffer
4.2.7 GLCM Builder
4.2.8 Address Decoder
4.2.9 M
73. ocess of how to deploy the Image Processing in Hardware system.
5.1 Prerequisites
On the software side, the prerequisites are programming in the C language, using DLLs and installing drivers. On the hardware side, the designs are implemented in VHDL, so basic knowledge of VHDL programming and of VHDL synthesis tools is required. VHDL also provides a way of using parameters, called design parameters or generic values, to deploy the system onto heterogeneous systems, e.g., two systems with a different data bus width or a different memory size. The system is generalized with this technique. The Image Processing in Hardware System's design parameters are described in Table 5.1 with their default values.
Table 5.1 Design Parameters of Image Processing in Hardware System
Parameter Name | Allowable Values | Default Value | VHDL Type | Description
CLOCK_DIVISOR_NUMBER | 1 to 2,147,483,647 | | natural | Divides the frequency of the global clock signal to match the SRAM access time
ADDRESS_BUS_WIDTH | | | natural | SRAM address bus width
DATA_BUS_WIDTH | | | natural | SRAM data bus width
GLCM_SCALE_LEVEL | 2^n | | natural | Number of color scale levels used in calculating the GLCM, where n is a natural number from 3 to 8
NUMBER_OF_DISTANCE_BITS | | | natural | Number of bits to contain the maximum value of both the vertical and horizontal differenc
74. of the image (0, 0). The index can be shifted to the next position by setting ci_nxt high and sending a negative edge to ci_ld_n. If ci_nxt is low when a negative edge is received at ci_ld_n, the index is reset to the (0, 0) position.
4.2.5 Square Fetcher
4.2.5.1 Description
The Square Fetcher fetches the corresponding pixels from the Memory Unit into the Square Buffer, which represents the square processing window. This module has an advantage over the CPU: it reduces the time used to fetch the pixels of the window, which are near one another in the image but physically far from one another in the memory address space. This module is named sq_fetch and abbreviated as sf.
4.2.5.2 Ports
Figure 4.11 Block structure showing ports of Square Fetcher
Figure 4.11 shows the ports of this module, which are described in Table 4.5.
Table 4.5 Ports of Square Fetcher
Port Name (MSB:LSB) | Width | Direction | Description
(clock) | 1 | Input | 33 MHz Clock Signal
(reset) | 1 | Input | Active-low Reset Signal
sf_gnt | 1 | Input | If high, the Square Fetcher gains control of the Memory Controller
img_width (9:0) | 10 | Input | Width of the input image
img_height (9:0) | 10 | Input | Height of the input image
sf_req | 1 | Output | If high, the Square Fetcher is requesting control of the Memory Controller
sf_done | 1 | Output | If high, Squar
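The Center Indexer's update rule (on each falling edge of ci_ld_n, advance when ci_nxt is high, reset to (0, 0) when it is low) can be sketched as a pure function. Two details are assumptions not stated in the excerpt: the scan order is left to right and then top to bottom, and the index stays on the last pixel if asked to advance past it.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { unsigned row, col; } ci_index;

/* Next index after one falling edge of ci_ld_n. */
static ci_index ci_on_ld_falling(ci_index cur, bool ci_nxt,
                                 unsigned img_width, unsigned img_height)
{
    assert(img_width > 0 && img_height > 0);
    ci_index next = {0, 0};
    if (!ci_nxt)
        return next;                          /* ci_nxt low: reset to (0, 0) */
    if (cur.col + 1 < img_width) {
        next.row = cur.row;                   /* assumed order: along the row first */
        next.col = cur.col + 1;
    } else if (cur.row + 1 < img_height) {
        next.row = cur.row + 1;               /* then wrap to the next row */
        next.col = 0;
    } else {
        next = cur;                           /* assumed: stay on the last pixel */
    }
    return next;
}
```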
75. om varying the number of lines in the buffer
Table 6.2 Speed comparison of GLCM statistics image generation performed by software and by hardware (Processing Time)
Chapter 1 Introduction
The Image Processing in Hardware project aims to accelerate image processing operations by using an FPGA as a co-processor of the computer system. This co-processor performs the compute-intensive part of those operations.
1.1 Project Background
Digital image processing is among the most compute-intensive kinds of processing, because it repeats the same operations on a large amount of data. Performing digital image operations on a general-purpose computer is slow because the computer executes these highly repetitive operations through the fetch-execute cycle: it keeps fetching data and instructions from storage and executes them one by one, unaware that it is executing the same operations again and again. An FPGA does not use the fetch-execute cycle. It can be programmed to function as a parallel computing unit that takes a large amount of data and applies the same operation to every segment of the data at the same time. The Image Processing in Hardware project therefore takes advantage of the FPGA to speed up digital image processing. The computationally intensive operations of a digital image process are implemented in the FPGA. For the other operations, users have to im
76. Figure A.29 Timing diagram of GLCM Builder Operation
Figure A.29 shows the operation of GLCM Builder. The operation is enabled at cursor A by raising gb_en. sb_rd_addr1 is 1, and the matching sb_rd_addr2 is 3, following the direction given by gb_dx and gb_dy. The pair is then voted, and this vote is done when mv_done is low at cursor B. All operations are done at cursor C, when gb_done is high.
A.8 Address Decoder
Figure A.30 Timing diagram of Address Decoder Operation
Figure A.30 shows the operation of Address Decoder. This operation takes the values of ad_row (0x01) and ad_col (0x05), computes the result, and combines it with ad_start to obtain the value of ad_addr (0x00204).
A.9 Matrix Voter
77. output statistics image generated
6.1.5 Results
Verifying the correctness of the output statistics images generated by software and by hardware.
Figure 6.1 The 1st-moment output statistics image generated by software, with R = 240 and direction east
Figure 6.2 The 1st-moment output statistics image generated by hardware, with R = 240 and direction east
Figure 6.3 The 2nd-moment output statistics image generated by software, with R = 240 and direction east
Figure 6.4 The 2nd-moment output statistics image generated by hardware, with R = 240 and direction east
Figure 6.5 The 3rd-moment output statistics image generated by software, with R = 240 and direction east
Figure 6.6 The 3rd-moment output statistics image generated by hardware, with R = 240 and direction east
Table 6.1 Average error from verifying the correctness of the output statistics images generated by software and by hardware (columns: Statistic Moment, Direction (e.g. Northeast, Southeast), Summation of Difference)
6.1.6 Conclusion
According to the results above, the average error is zero. Therefore, GLCM statistics image generation performed by hardware produces the same results as the software implementation.
6.2 Execution Time Verification
6.2.1 Introduction
From the statistics image outputs generated by software in the experimental phase of the previous chapter, and from the correctness verification in this chapte
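The comparison behind Table 6.1 can be sketched as a per-pixel difference between the software and hardware images. The text does not define "Summation of Difference" and "Average error" precisely, so this sketch assumes 8-bit images and defines them as the sum of absolute differences and that sum divided by the pixel count:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed definitions: sum = sum over pixels of |sw - hw|,
   average error = sum / number of pixels. Images are 8-bit here. */
double average_error(const uint8_t *sw, const uint8_t *hw,
                     size_t n_pixels, long long *sum_out) {
    long long sum = 0;
    for (size_t i = 0; i < n_pixels; ++i)
        sum += llabs((long long)sw[i] - (long long)hw[i]);
    if (sum_out)
        *sum_out = sum;  /* "Summation of Difference" */
    return n_pixels ? (double)sum / (double)n_pixels : 0.0;
}
```

An average error of zero under this definition means every pixel of the hardware output equals the software output, which is the conclusion drawn in 6.1.6.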
78. plement them in the host computer.
1.2 Project Objectives
1. Use the FPGA as a co-processor of the computer system to compute the computationally intensive part of digital image processing operations.
2. Study a methodology for using the FPGA to build a computing platform that co-operates with the CPU of the computer.
3. Make it possible for an application developer or researcher to program specified algorithms in the FPGA.
Chapter 2 Research and Study
In this phase, many image processing operations are studied. Operations that involve intensive computation are chosen and studied in detail.
2.1 Related Theories
2.1.1 Digital Image Processing
Digital image processing is used to process digital images in order to recover information that is not visible in the original images. It has an advantage over analog image processing: it allows algorithms that can only be implemented in digital systems to be applied to the input data. Moreover, digital processing introduces less noise and distortion than analog processing. In the past, the cost of digital image processing was very high, which limited it to a small number of uses. As computers and dedicated hardware became cheaper, the processing became more popular. Today computers are much faster than in the past, and they have taken over the role of most dedicated hardware i
79. ption [MSB:LSB], Based on Arbiter
pc_gnt, 1, Output: if high, Process Controller gains control of the Memory Controller
sf_gnt, 1, Output: if high, Square Fetcher gains control of the Memory Controller
mv_gnt, 1, Output: if high, Matrix Voter gains control of the Memory Controller
mi_gnt, 1, Output: if high, Matrix Integrator gains control of the Memory Controller
dbg_ar_st[3:0]: Arbiter state LEDs (debug)
4.2.3.3 Operations
Arbiter starts in the Wait State. When a request signal (pc_req, sf_req, mv_req or mi_req) is set, it moves to the Decide State. In this state, the grant signals (pc_gnt, sf_gnt, mv_gnt and mi_gnt) are updated; when two or more modules request at the same time, the request of the highest-priority module is granted first. The grant signal connected to the granted module stays high during granting. The granting priority, in descending order, is Process Controller, Square Fetcher, Matrix Voter and Matrix Integrator. After granting, the state of this module changes to the Grant State. While in the Grant State, all non-granted request signals are ignored until the granted request signal goes low. Once it goes low, Arbiter returns to the Wait State and continues its operation. Figure 4.9 shows the finite state machine of this module.
Figure 4.9 Finite State Machine of Arbiter (transition labels: "There is a request" into the Decide State; "The granted request becomes low" back to the Wait State)
4.2.4 Center Inde
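The Wait/Decide/Grant behavior described above is a fixed-priority arbiter. The following cycle-level sketch in C is a software model under stated assumptions, not the VHDL design: each call to arbiter_step stands for one clock, index 0 (Process Controller) has the highest priority, and the state and function names are hypothetical.

```c
#include <stdbool.h>

/* Request indices in descending priority (index 0 = highest). */
enum { REQ_PC = 0, REQ_SF, REQ_MV, REQ_MI, NUM_REQ };
typedef enum { AR_WAIT, AR_DECIDE, AR_GRANT } ar_state;

typedef struct {
    ar_state state;
    int granted;  /* index of the granted module, -1 if none */
} arbiter;

void arbiter_reset(arbiter *a) { a->state = AR_WAIT; a->granted = -1; }

/* One clock step; req[] holds pc_req, sf_req, mv_req, mi_req.
   Returns the index whose *_gnt is high, or -1. */
int arbiter_step(arbiter *a, const bool req[NUM_REQ]) {
    switch (a->state) {
    case AR_WAIT:                         /* leave Wait on any request */
        for (int i = 0; i < NUM_REQ; ++i)
            if (req[i]) { a->state = AR_DECIDE; break; }
        break;
    case AR_DECIDE:                       /* grant highest priority first */
        for (int i = 0; i < NUM_REQ; ++i)
            if (req[i]) { a->granted = i; a->state = AR_GRANT; break; }
        if (a->granted < 0)
            a->state = AR_WAIT;           /* request withdrawn meanwhile */
        break;
    case AR_GRANT:                        /* other requests are ignored */
        if (!req[a->granted]) { a->granted = -1; a->state = AR_WAIT; }
        break;
    }
    return a->granted;
}
```

If Square Fetcher and Matrix Voter request together, Square Fetcher wins. A later request from Process Controller is ignored until sf_req drops, and then Process Controller is granted after a fresh Wait, Decide pass.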
80. r we get the processing times of both software and hardware, so we can compare the speed of GLCM statistics image generation performed by software and by hardware.
6.2.2 Objectives
To compare the processing time of GLCM statistics image generation performed by software and by hardware.
6.2.3 Results
Table 6.2 Speed comparison of GLCM statistics image generation performed by software and by hardware (Processing Time)
Columns: Image Size (pixels); Processing Time by software, seconds (Wall Time, Time); Processing Time by hardware, seconds (Wall Time, Time); Speed Up
16x16 16 38 13 13 X0 11 15 79 86
6.2.4 Conclusion
According to the results above, the hardware processing time is shorter than the software processing time. The speed-up tends to increase as the size of the input image increases. The speed-up is not as good as expected because the memory clock had to be divided by eight, by setting CLOCK_DIVISOR_NUMBER to four, to keep the experimental system correct. If the default value of CLOCK_DIVISOR_NUMBER in Table 5.1 is used, the data read from the Memory Unit becomes inconsistent and the result images are incorrect. The inconsistency is caused by the wiring configuration of the prototype and by the delay of signal transitions through resistors.
Chapter 7 Conclusion
Image Processing in Hardware is a project concentrating on designing a co-processor for the compu
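The conclusion states that CLOCK_DIVISOR_NUMBER = 4 divides the memory clock by eight. One divider that behaves this way is a toggle divider whose output flips every N input cycles, giving f_out = f_in / (2N). That relation is inferred from the "four gives eight" statement and is an assumption about the real VHDL. The model below (hypothetical names) shows what that costs at the 33 MHz PCI clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed relation: f_out = f_in / (2 * N). */
double divided_frequency_hz(double f_in_hz, uint32_t n) {
    return f_in_hz / (2.0 * n);
}

/* Cycle-level model of a toggle divider: the output flips every n input cycles. */
typedef struct {
    uint32_t n;      /* CLOCK_DIVISOR_NUMBER */
    uint32_t count;  /* input cycles since the last toggle */
    bool out;        /* divided clock level */
} clock_divider;

bool divider_tick(clock_divider *d) {
    if (++d->count == d->n) {
        d->count = 0;
        d->out = !d->out;
    }
    return d->out;
}
```

With N = 4, a 33 MHz input yields 4.125 MHz, so every SRAM access takes eight PCI clocks. This is consistent with the observed speed-up falling well short of the theoretical value.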
81. rupt occurs and the Message Register is set to MSG_IMAGE_NOT_INIT. In the case of MSG_IMAGE_SIZE_INVALID, an interrupt occurs but the flag is not set. If the flag is set, any instruction can be received. Process Controller moves to the Decide State to perform a different task for each instruction, except for the Clear Interrupt and Initialize Image instructions. For the Clear Interrupt instruction, the instruction register is cleared and the current interrupt is removed immediately, without a state change of Process Controller. For the Initialize Image instruction, the Image Register is set as soon as the instruction is received, without a state change, if and only if the size is valid; otherwise, an error is reported. For the other instructions, Process Controller performs the operations described below.
For the Write to Memory, Read from Memory into Data Register, or Clear Temporary Memory instruction, Process Controller moves into the mc State. It begins this state by sending a request for control of Memory Controller to Arbiter, setting pc_req high. After receiving a grant (i.e. pc_gnt is high), it performs whichever of the three Memory Controller operations corresponds to the received instruction. Once it receives a Done Signal from Memory Controller (i.e. mc_done is high), it returns mc_en and n to their default values, cancelling the operation, and goes back to the Wait State
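The mc State is a request, grant, operate, done handshake. Here is a C model of it, using a hypothetical signal bundle and state names and one call per clock. It is an assumption-laden sketch, not the VHDL. PC_MC_ENTER is the point at which a memory instruction has just been decoded. PC_MC_EXIT marks the return to the Wait State, and at that point the sketch also drops pc_req so that Arbiter can release the grant.

```c
#include <stdbool.h>

/* Hypothetical signals between Process Controller, Arbiter and Memory Controller. */
typedef struct {
    bool pc_req;   /* Process Controller -> Arbiter */
    bool pc_gnt;   /* Arbiter -> Process Controller */
    bool mc_en;    /* Process Controller -> Memory Controller */
    bool mc_done;  /* Memory Controller -> Process Controller */
} bus_signals;

typedef enum { PC_MC_ENTER, PC_MC_REQUEST, PC_MC_ACTIVE, PC_MC_EXIT } pc_mc_state;

/* One clock step of the mc State handshake. */
pc_mc_state pc_mc_step(pc_mc_state s, bus_signals *b) {
    switch (s) {
    case PC_MC_ENTER:
        b->pc_req = true;                  /* ask Arbiter for Memory Controller */
        return PC_MC_REQUEST;
    case PC_MC_REQUEST:
        if (b->pc_gnt) {
            b->mc_en = true;               /* start the selected memory operation */
            return PC_MC_ACTIVE;
        }
        return PC_MC_REQUEST;
    case PC_MC_ACTIVE:
        if (b->mc_done) {
            b->mc_en = false;              /* back to default, cancelling the operation */
            b->pc_req = false;             /* release the grant */
            return PC_MC_EXIT;             /* i.e. go back to the Wait State */
        }
        return PC_MC_ACTIVE;
    case PC_MC_EXIT:
        break;
    }
    return s;
}
```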
82. same operations again. There are four directions to be calculated: northeast, east, southeast and south. Note that for every instruction there must be an interrupt read request followed by an interrupt clear.
Thus the time used for GLCM statistics image generation, T_GSG, is
I Lj 4 eT p SP dir ST cal dig req int Fr
where I is the size of the input image and R is the size of each region. For the typical values of the time types, and for the default values of GLCM_SCALE_LEVEL and WINDOW_SIZE specified in Table 5.1 (256 and 3 respectively), each instruction time is
T 4 12 1 12 393 222 Toc Tu x py 88 393 384 Lie 97
Thus
1 Tose 1 393 682 R 16 1 574 728 1
Chapter 6 Verification
Verification in this phase compares both the correctness and the speed of each operation performed by software and by hardware. The processing time of each operation performed by software is recorded and compared with the processing time of the same operation performed by hardware. Moreover, the outputs of each operation obtained in the experimental phase are used to verify the correctness of the outputs of the same operation performed by hardware.
6.1 Correctness Verification
6.1.1 Introduction
From the statistics image outputs generated by software in the experimental phase of the previous chapter, statistics images
83. sets sf_en high to enable Square Fetcher at cursor B. After enabling, a fetch is shown at cursor C.
Figure A.15 Timing diagram of Fetch Data into Square Window Operation End
Figure A.15 shows the end of the operation. After sf_done goes high at cursor A, Process Controller initializes an interrupt at cursor B.
A.2.8 Calculate GLCM Operation
84. stance in east direction. Source: http://matlab.izmiran.ru/help/toolbox/images/enhanc15.html
In Figure 2.2, position (1, 1) of this GLCM contains the value 1 because, in the scaled image, the pair formed by a pixel of interest with value 1 and its corresponding neighbor with value 1 occurs only once. Position (1, 2) contains the value 2 because the pair (1, 2) is found twice. The other values in the GLCM can be derived in the same way.
2.2.2.2 Calculating Statistics Value for a GLCM
A statistics value is a single value that represents all the values in a matrix. For a GLCM, the k-th moment method is used. It is calculated as follows:
$$\text{Statistic} = \sum_{i}\sum_{j} (i - j)^{k} \times GLCM(i, j)$$
where k is a natural number. Only the 1st, 2nd and 3rd moments are calculated for GLCM statistics image generation.
2.2.3 Processes
The GLCM statistics image generation process is as follows. Open an input image. Set the distance value used in the GLCM calculation to 1. Process the image using a moving window of pixel lines; the number of lines is specified by the user but is limited by the available memory. For each location of the window, a square area of user-specified size is set up, centered at each pixel in the window. Then the northeast, east, southeast and south direction GLCMs are
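The GLCM construction and the k-th moment can be checked on a small example. The sketch below uses an assumed 4×5 image with gray levels 1 to 8 (the image data are illustrative, not copied from Figure 2.2), counts east-direction pairs at distance 1, and computes the moment under the assumption that it is the difference moment Σ_i Σ_j (i − j)^k · GLCM(i, j). Index 0 of the matrix is left unused so that glcm[1][2] reads like the 1-based notation in the text.

```c
#include <stdint.h>

#define LEVELS 8  /* assumed gray levels 1..LEVELS after scaling */
#define ROWS 4
#define COLS 5

/* glcm[i][j] counts pairs (pixel value i, east neighbor value j) at distance 1.
   img is only read. Values outside 1..LEVELS are skipped. */
void glcm_east(int img[ROWS][COLS], long glcm[LEVELS + 1][LEVELS + 1]) {
    for (int i = 0; i <= LEVELS; ++i)
        for (int j = 0; j <= LEVELS; ++j)
            glcm[i][j] = 0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c + 1 < COLS; ++c) {
            int a = img[r][c], b = img[r][c + 1];
            if (a >= 1 && a <= LEVELS && b >= 1 && b <= LEVELS)
                glcm[a][b]++;
        }
}

/* k-th moment, assumed to be sum over i, j of (i - j)^k * GLCM(i, j). */
long long glcm_moment(long glcm[LEVELS + 1][LEVELS + 1], int k) {
    long long sum = 0;
    for (int i = 1; i <= LEVELS; ++i)
        for (int j = 1; j <= LEVELS; ++j) {
            long long d = 1;
            for (int p = 0; p < k; ++p)
                d *= (i - j);
            sum += d * glcm[i][j];
        }
    return sum;
}
```

For the image {{1,1,5,6,8},{2,3,5,7,1},{4,5,7,1,2},{8,5,1,2,5}}, there are 16 east pairs. GLCM(1,1) is 1 and GLCM(1,2) is 2, matching the kind of counts described above. The 1st and 2nd moments come out as −1 and 143.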
85. t direction
Figure 2.3 Flow chart of the GLCM statistic image generation
Figure 2.4 True PCI, the FPGA prototyping board
Figure 2.5 Model of the connectivity between the prototyping board and a computer
Figure 2.6 Local bus signals of pciif32 interfacing with a user IP core
Figure 2.7 Read operation timing diagram
Figure 2.8 Write operation timing diagram
Figure 2.9 Timing Diagram of the SRAM Read Operation
Figure 2.10 Timing Diagram of the SRAM Write Operation
The bi-directional level shifter circuit
Figure 4.1 Block diagram of Image Processing in Hardware system (Top)
Figure 4.2 Block diagram of Image Processing in Hardware system (Bottom)
Figure 4.3 The top level flow chart of the system
Figure 4.4 Block structure showing ports of Memory Controller
Figure 4.5 State Machine of Memory Controller
Figure 4.6 Block structure showing ports of Process Controller
Figure 4.7 Finite State Machine of Process Controller
Figure 4.8 Block structure showing ports of Arbiter
Figure 4.9 Finite State Machine of Arbiter
Figure 4.10 Block structure showing ports of Center Indexer
Figure 4.11 Block structure showing ports of Square Fetcher
Figure 4.12 Main Finite State Machine of Square Fetcher
Figure 4.13 Fetch Finite State Machine of Square Fetcher
Figure 4.14 Block structure showing ports of Square Buffer
Figure 4.15 Block
86. ta is defined as the row (ad_row) of the GLCM matrix. Address Decoder calculates the resulting address ad_addr. The starting address ad_start is calculated from the size of the Memory Unit and the GLCM Scale Level:
$$ad\_start = \text{size of Memory Unit} - \text{GLCM\_Scale\_Level}^{2}$$
The resulting address ad_addr is calculated with the following formula:
$$ad\_addr = ad\_start + \text{GLCM\_Scale\_Level} \times ad\_row + ad\_col$$
When the calculation is complete, Address Decoder sends ad_addr to Matrix Voter, which votes for this address.
4.2.9 Matrix Voter
4.2.9.1 Description
Matrix Voter performs the voting: it reads the value stored in the Memory Unit at the address obtained from Address Decoder, increments it by 1, and writes it back to the Memory Unit at the same address. Matrix Voter performs this operation in cooperation with Address Decoder and GLCM Builder. This module is named mat_voter and abbreviated as mv.
4.2.9.2 Ports
Figure 4.18 Block structure showing ports of Matrix Voter (ports shown include dbg_mv_st[3:0], mc_addr[16:0], mc_en, mc_rw, mv_done, mv_req, mc_data[7:0])
Figure 4.18 shows the ports of this module, and those ports are described in Table 4.9.
Table 4.9 Ports of Matrix Voter (columns: Port Name [MSB:LSB], Width, Direction, Description)
mv_clk, 1, Input: 33 MHz clock signal
mv_en
mv_gnt, 1, Input: if high, Matrix Voter gains control of Memory Controller's operation
mc_done, 1, Input: Done Signal from Memory Controller
mv_addr[16:0], 17, Input: address for voting of GLCM
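Address Decoder and Matrix Voter can be modeled together in a few lines of C. This sketch assumes a 2^17-byte Memory Unit (from mc_addr[16:0]) and one byte per GLCM counter (from mc_data[7:0]), and it uses the two address formulas given above. The function names are hypothetical, and the 8-bit counter simply wraps modulo 256 in this model.

```c
#include <stdint.h>

#define MEM_ADDR_BITS 17u                 /* assumed from mc_addr[16:0] */
#define MEM_SIZE (1u << MEM_ADDR_BITS)    /* 131072 bytes */

/* ad_start = size of Memory Unit - GLCM_Scale_Level^2 (GLCM at the top of memory). */
uint32_t glcm_start(uint32_t levels) {
    return MEM_SIZE - levels * levels;
}

/* ad_addr = ad_start + GLCM_Scale_Level * ad_row + ad_col */
uint32_t glcm_addr(uint32_t levels, uint32_t row, uint32_t col) {
    return glcm_start(levels) + levels * row + col;
}

/* Matrix Voter: read, increment by 1, write back to the same address. */
void vote(uint8_t *mem, uint32_t addr) {
    uint8_t v = mem[addr];               /* read from the Memory Unit */
    mem[addr] = (uint8_t)(v + 1u);       /* write the incremented value back */
}
```

With 256 levels, the GLCM occupies the upper half of the SRAM: ad_start is 0x10000, the element at row 1, column 5 is at 0x10105, and element (255, 255) is the last byte, 0x1FFFF.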
87. Figure A.13 Timing diagram of Shift Square Window Position Operation
The operation is shown in Figure A.13. It begins when an instruction write request from pciif32 occurs at cursor A. Process Controller forces ci_nxt high and generates a negative edge on ci_ld_n at cursor B. After the operation is done, an interrupt is initialized at cursor C.
A.2.7 Fetch Data into Square Window Operation
Figure A.14 Timing diagram of Fetch Data into Square Window Operation Begin
The beginning of the operation is at cursor A in Figure A.14, where a write request of the instruction occurs. Process Controller
88. tate is looped back to the Wait State, waiting for sf_en again. Figure 4.12 shows the Main FSM of the Square Fetcher.
Figure 4.12 Main Finite State Machine of Square Fetcher (transition labels: sf_en is low; sf_en is high; Request State; f_done is high; sf_gnt is high)
The Fetch FSM starts in its Wait State. Once f_en is high, its state changes to the Validate State. In this state, the row and column indices of the pixel to be fetched are validated. If the coordinate is out of bounds, the row and column indices are replaced by the coordinate of the nearest border pixel. Once validation is complete, the FSM moves to the Assign State. As the name suggests, this state assigns a value to the next location in the Square Buffer address space. It performs a Memory Controller read operation at the address derived from the validated indices. When a high mc_done is received, a negative-edge signal is generated at sb_wr_n to write the data on the data bus to the Square Buffer location specified by sb_wr_addr. The state then changes to the Next State, which moves the coordinate to the next position and increments sb_wr_addr by 1. If the coordinate has not reached the end of the square processing window, the FSM returns to the Validate State to continue fetching. Otherwise, if the end of the window is reached, it moves to the Done State, sets f_done to logic 1, and waits for f_en to be reset before returning to the Wait State. This FSM is shown in Figure
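The Fetch FSM's Validate, Assign and Next steps amount to a border-clamped copy of a square window into the Square Buffer. The C sketch below collapses the FSM into loops. It assumes a row-major image with one byte per pixel, a window centered on (crow, ccol), and row-by-row fetch order. fetch_window and clamp are hypothetical names, and the Memory Controller handshake is omitted.

```c
#include <stdint.h>

/* Validate State: replace an out-of-bounds index by the nearest border index. */
static int clamp(int v, int limit) {
    return v < 0 ? 0 : (v >= limit ? limit - 1 : v);
}

/* Copy a window_size x window_size window centered on (crow, ccol)
   into square_buffer, in row-by-row order. */
void fetch_window(const uint8_t *img, int width, int height,
                  int crow, int ccol, int window_size, uint8_t *square_buffer) {
    int half = window_size / 2;
    int sb_wr_addr = 0;                               /* mirrors sb_wr_addr */
    for (int dr = -half; dr <= half; ++dr)
        for (int dc = -half; dc <= half; ++dc) {
            int r = clamp(crow + dr, height);         /* Validate State */
            int c = clamp(ccol + dc, width);
            square_buffer[sb_wr_addr] = img[r * width + c]; /* Assign State */
            ++sb_wr_addr;                             /* Next State */
        }
}
```

At the top-left pixel of a 3×3 image holding 0 to 8, the clamped window becomes {0,0,1, 0,0,1, 3,3,4}: missing neighbors are replaced by the nearest border pixels, as the Validate State describes.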
89. ter system to compute the computationally intensive part of digital image processing operations. This project integrates knowledge from both the software and the hardware sides of computer engineering into one application. The designed system speeds up a digital image processing operation by moving the selected computationally intensive part of GLCM statistics image generation, a process that takes a week to complete, into a Spartan-3 FPGA with an external SRAM. In theory the system speeds up image generation about four times, but in practice the speed-up is about 25 percent, due to parasitic capacitance in the prototype of the system and a bottleneck in transferring data between the host computer and the FPGA.
Building a computing platform that co-operates with the CPU of the computer requires many tools and much knowledge to be integrated. First, a device that is capable of computing and contains a fast memory unit is required; an FPGA prototyping board was selected to play this role. Second, the communication protocol between the CPU and the device must be chosen or created. The communication should be as fast as possible, because data transfer consumes precious computing time. In this project PCI was chosen because it is the fastest way to communicate directly with the computer's local bus. Lastly, a program that needs to utilize the device must be mod
90. terrupt Operation
Figure A.6 Timing diagram of Write to Memory Operation Begin
Figure A.7 Timing diagram of Write to Memory Operation End
Figure A.8 Timing diagram of Read from Memory into Data Register Operation Begin
Figure A.9 Timing diagram of Read from Memory into Data Register Operation End
Figure A.10 Timing diagram of Clear Temporary Memory Operation Begin
Figure A.11 Timing diagram of Clear Temporary Memory Operation End
Figure A.12 Timing diagram of Reset Square Window Position Operation
Figure A.13 Timing diagram of Shift Square Window Position Operation
Figure A.14 Timing diagram of Fetch Data into Square Window Operation Begin
Figure A.15 Timing diagram of Fetch Data into Square Window Operation End
Figure A.16 Timing diagram of Calculate GLCM Operation Begin
Figure A.17 Timing diagram of Calculate GLCM Operation Finish clearing
Figure A.18 Timing diagram of Calculate GLCM Operation End
Figure A.19 Timing diagram of Digest GLCM into Statistics Values Operation Begin
Figure A.20 Timing diagram of Digest GLCM into Statistics Values Operation End
Figure A.21 Timing diagram of Read 1 Moment into Result Register Operation
Figure A.22 Timing diagram of Initial Image Operation
Figure A.23 Timing diagram of Read from Interrupt Register Operation
Figure A.24 Timing diagram of Arb
91. the integrated system to pins of the FPGA.
Assemble the electronic components. It is better to apply noise elimination and analog filtering when building the circuitry, because the system becomes more stable and the speed of the operation comes closer to the theoretical value.
Check with measurement tools whether each device is working properly.
If the software side is done, the deployment is finished.
5.2.2 Software Side
1. Study the image processing operation of interest.
2. Define which parts of the operation can use the designed components (e.g. Square Buffer and Matrix Integrator) and which parts must be done in the host computer (e.g. floating-point calculation).
3. Write a program that implements the operation without the Image Processing in Hardware functions.
4. Modify the source code in the sections selected for hardware implementation, replacing them with hardware function calls.
5. Modify the code so that the device interrupt is monitored after each call to a hardware function. This synchronizes the host computer and the prototyping board.
6. Install the device driver.
7. Compile and link the program.
8. If the hardware side is done, the deployment is finished.
5.3 Theoretical calculation
The processing time can be calculated by assuming that the basic operations performed by a CPU are much faster than the 33 MHz PCI clock frequency. Thus the processing time
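Step 5 pairs every hardware call with an interrupt wait. Elsewhere the text notes that each instruction must be followed by an interrupt read request and an interrupt clear, so the wait ends with those two actions. The real True PCI DLL functions are not named in this section. Everything below is therefore a hypothetical mock: mock_device, the hw_* helpers and the MSG_OK value are illustrations of the calling pattern only, not the actual API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock standing in for the driver/DLL; not the real True PCI API. */
typedef struct {
    uint32_t instruction;    /* last instruction written */
    bool interrupt_pending;  /* set when the "device" finishes */
    uint32_t message;        /* simulated Message Register */
    int polls_until_done;    /* simulated latency */
} mock_device;

#define MSG_OK 0u  /* hypothetical message encoding */

static void hw_write_instruction(mock_device *d, uint32_t instr) {
    d->instruction = instr;
    d->interrupt_pending = false;
    d->polls_until_done = 3;
}

static bool hw_poll_interrupt(mock_device *d) {
    if (!d->interrupt_pending && --d->polls_until_done <= 0) {
        d->interrupt_pending = true;
        d->message = MSG_OK;
    }
    return d->interrupt_pending;
}

/* Interrupt read request followed by an interrupt clear. */
static uint32_t hw_read_and_clear_interrupt(mock_device *d) {
    d->interrupt_pending = false;
    return d->message;
}

/* Step 5: after every hardware call, wait for the device interrupt.
   Returns the message, or UINT32_MAX on timeout. */
uint32_t hw_call_and_wait(mock_device *d, uint32_t instr, int max_polls) {
    hw_write_instruction(d, instr);
    for (int i = 0; i < max_polls; ++i)
        if (hw_poll_interrupt(d))
            return hw_read_and_clear_interrupt(d);
    return UINT32_MAX;
}
```

A bounded poll count is used so that a host program cannot hang forever if the board stops responding. A real driver might block on the interrupt instead of polling.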
92. troller
4.2.1.1 Description
Memory Controller takes care of reading from and writing to the SRAM. Read and write requests come from the other modules that need to access data in the SRAM. Once a request is received, Memory Controller drives the SRAM signals as in Figure 2.9 or Figure 2.10, for a read or a write request respectively, and notifies the requester when the access is complete. This module is named mem_ctrl and abbreviated as mc.
4.2.1.2 Ports
Figure 4.4 Block structure showing ports of Memory Controller (ports shown include mc_addr[16:0], dbg_mc_st[3:0], sr_addr[16:0], bf_dir, mc_done, sr_ce1_n, sr_ce2, sr_oe_n, sr_we_n, mc_data[7:0], sr_data[7:0])
Figure 4.4 shows the ports of this module, and those ports are described in Table 4.1.
Table 4.1 Ports of Memory Controller (columns: Port Name [MSB:LSB], Direction, Description)
Clock: Controller clock signal from Clock Divider
mc_rw, Input: if low, activate the read operation; if high, activate the write operation
sr_ce1_n, Output: active-low Chip Enable 1 of SRAM
4.2.1.3 Operation
Normally, Memory Controller is in the Wait State and drives all SRAM signals for a read operation until mc_en changes to high or mc_clr_n changes to low. If mc_en is high, it checks whether mc_rw is low or high. If mc_rw is high, it changes its state to the Read State, changes sr_addr to
93. tween the FPGA board and the computer; it allows the FPGA board to be connected to the PCI bus for transferring data between the FPGA and the computer.
Specifications
1. 24-bit bus width to the development board via a 50-pin IDC connector, with half of the pins connected to ground
2. 5 V signaling
Implementation Possibility
The I/O board uses 8 of the 32 bits of the PCI bus as its instruction port to control its operation. The other 24 bits of the bus are connected to the FPGA board. The speed of the 24-bit connection depends on the circuit of the I/O board. If it used the same clock frequency as the PCI bus, the transfer rate would be 99.99 MB per second (3 bytes × 33.33 MHz). However, the boards available at low cost are much slower, probably only 3 MB per second. With this connection, FPGA development boards that support 24 or more I/O pins can be used in this project. Moreover, they must support 5 V signaling; if they do not, a driver circuit must be used to convert the lower voltage to 5 V.
2.2 Gray-level Co-occurrence Matrix Statistics Image Generation
This image processing operation is chosen as a representative of other image processing operations. It generates statistics images from a large, fine image, e.g. a satellite image containing about 1000 million pixels.
2.2.1 Description
Gray-level Co-occurrence Matrix statistics images are images that represent the statistical uniqueness of textures i
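The transfer-rate figures above follow from bytes per clock × clock frequency, with 1 MB taken as 10^6 bytes. The two helpers below are illustrative (their names are not from the source) and let the bus options be compared. The 1000-million-pixel example also assumes one byte per pixel.

```c
/* Peak rate in MB/s (1 MB = 10^6 bytes) for a bus moving
   bytes_per_clock bytes on every clock cycle. */
double bus_rate_mb_per_s(double bytes_per_clock, double clock_mhz) {
    return bytes_per_clock * clock_mhz;
}

/* Seconds needed to move total_bytes at rate_mb_per_s. */
double transfer_seconds(double total_bytes, double rate_mb_per_s) {
    return total_bytes / (rate_mb_per_s * 1.0e6);
}
```

For example, the 24-bit I/O board gives 3 × 33.33 = 99.99 MB/s and the 32-bit PCI bus gives 4 × 33.33 ≈ 133 MB/s. Moving a 1000-million-pixel image at one byte per pixel would take about 10 s at 99.99 MB/s but about 333 s at the 3 MB/s of low-cost boards.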
94. Figure A.16 Timing diagram of Calculate GLCM Operation Begin
Figure A.16 shows the beginning of the operation. In the figure, the instruction is written at cursor A along with dx and dy. dx is 0x01 and dy is 0x81, which indicates negative 0x01 (−1). Thus the GLCM is calculated for the (1, −1) direction. Process Controller changes gb_dx and gb_dy to the given values at cursor B and performs a memory clear by lowering mc_clr_n at cursor C.
Figure A.17 Timing diagram of Calculate GLCM Operation Finish clearing
After the temporary memory is cleared, as shown at cursor A in Figure A.17, Process Controller enables GLCM Bui
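The dy value 0x81 standing for −1 suggests that the direction offsets are sent as 8-bit sign-magnitude numbers: bit 7 is the sign and bits 6 to 0 are the magnitude. That encoding is inferred from this single example and is an assumption. A decoder for it (hypothetical name) is:

```c
#include <stdint.h>

/* Assumed 8-bit sign-magnitude encoding: bit 7 = sign, bits 6..0 = magnitude. */
int decode_offset(uint8_t raw) {
    int magnitude = raw & 0x7F;
    return (raw & 0x80) ? -magnitude : magnitude;
}
```

Under this reading, 0x01 decodes to 1 and 0x81 to −1, giving the (1, −1) direction above. Unlike two's complement, this encoding has a "negative zero" (0x80), which simply decodes to 0.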
95. Figure A.9 Timing diagram of Read from Memory into Data Register Operation End
After Memory Controller performs its read operation, mc_done goes high, as shown at cursor A in Figure A.9. At cursor B, Process Controller gives up its access grant, and at cursor C it initializes an interrupt.
A.2.4 Clear Temporary Memory Operation
Figure A.10 Timing diagram of Clear Temporary Memory Operation Begin
In Figure A.10, the operation begins
96. use they need to perform different actions under different input conditions. They are also appropriate for programming image processing on the FPGA.
2.1.6 Communication Connectivity between an FPGA Board and a Computer
The connection between the FPGA board and the computer is the bottleneck when data is processed outside the CPU. The possible connections are described below.
2.1.6.1 Connect Directly to the PCI Bus of the Computer
Peripheral Component Interconnect (PCI) is a local bus that connects directly with the processor bus or system bus of the computer.
Specifications
1. 33.33 MHz clock with synchronous transfers
2. Data transfer rate of 133 MB per second for a 32-bit bus width
3. 3.3 V or 5 V signaling
Implementation Possibility
This connection provides the fastest transfer rate of all the connections considered, so it greatly reduces the effect of the bottleneck on the system. Despite this speed advantage, the FPGA board must provide a connector for the PCI slot of the computer, and there must be a PCI controller on the FPGA board. This requires a specialized development board, which is not available at KMUTT.
2.1.6.2 Connect via 1000BASE-T Gigabit Ethernet
Gigabit Ethernet is a computer network connection standardized by IEEE 802.3z; 1000BASE-T, the variant that uses Category 5 unshielded twisted pair (UTP) cables to connect devices, is specified in IEEE 802.3ab.
Specifications
1. serial data transfer
97. Figure A.27 Timing diagram of Square Fetcher Operation End
The fetch operation repeats until all elements in the processing window have been fetched, as shown in Figure A.27. At cursor B, Square Fetcher sets sf_done high to indicate the completion of the operation and, at the same time, cancels its request for control of Memory Controller.
A.6 Square Buffer
Figure A.28 Timing diagram of Square Buffer Operation
The operation of this module is simulated as shown in Figure A.28. At cursor A, sb_rst_n is set high and the operation begins. At the same time, sb_rd_addr2 is changed to 0x1, and the output data at sb_rd_data2 is 0x00. A write occurs at cursor B, writing the data 0xD1 via sb_data_in to the address 0x1, which causes sb_data_out2 to change to 0xD1. Another write occurs at cursor C, writing 0xD2 to the address 0x2.
A.7 GLCM Builder
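The Square Buffer behavior in Figure A.28 matches a small register file with one write port (triggered by a falling edge of sb_wr_n) and two combinational read ports. The C model below is a sketch under assumptions. Its 16 entries come from the width of sb_wr_addr[3:0]. The names are hypothetical, and the edge detection stands in for the hardware's negative-edge write.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SB_DEPTH 16  /* assumed from sb_wr_addr[3:0] */

typedef struct {
    uint8_t cell[SB_DEPTH];
    bool prev_wr_n;   /* last sampled level of sb_wr_n */
} square_buffer;

/* sb_rst_n low: clear all cells. */
void sb_reset(square_buffer *sb) {
    memset(sb->cell, 0, sizeof sb->cell);
    sb->prev_wr_n = true;
}

/* Sample sb_wr_n; a high-to-low transition writes data_in at wr_addr. */
void sb_drive_write(square_buffer *sb, bool wr_n, uint8_t wr_addr, uint8_t data_in) {
    if (sb->prev_wr_n && !wr_n)
        sb->cell[wr_addr % SB_DEPTH] = data_in;
    sb->prev_wr_n = wr_n;
}

/* Either read port (sb_rd_addr1 or sb_rd_addr2): output follows the stored value. */
uint8_t sb_read(const square_buffer *sb, uint8_t rd_addr) {
    return sb->cell[rd_addr % SB_DEPTH];
}
```

This reproduces the simulated sequence. Read port 2 at address 0x1 shows 0x00 after reset and changes to 0xD1 once 0xD1 is written there. A later write of 0xD2 to 0x2 takes effect only on a fresh falling edge of sb_wr_n.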
98. with a MSG_OK interrupt.
For the Reset Square Window Position or Shift Square Window Position instruction, Process Controller moves into the ci State and sends a negative square pulse on ci_ld_n or a positive pulse on ci_nxt, respectively. Then it returns to the Wait State.
For the Fetch Data into Square Window, Calculate GLCM, or Digest GLCM into Statistic Values instruction, it moves into the corresponding state (sf State, gb State or mi State) to enable Square Fetcher through a high sf_en, GLCM Builder through a high gb_en, or Matrix Integrator through a high mi_en, respectively. Once the enabled module completes its operation and its Done Signal (sf_done, gb_done or mi_done) is received, Process Controller disables the module and returns to the Wait State with a MSG interrupt.
For the Read 1st, 2nd or 3rd Moment into Result Register instruction, Process Controller changes into the mi State and sets mi_output_sel to bits 28 down to 27 of the instruction register, which orders Matrix Integrator to send the selected data through mi_data_out. After setting the signal, Process Controller returns to the Wait State and sends a MSG interrupt to the host computer.
When receiving a read request (lb_cs_n and lb_rd_n are low), Process Controller selects the register corresponding to lb_addr and puts the data of that register on lb_data_in for the read operation of pci
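Taking "bits 28 down to 27" of a 32-bit instruction register is an ordinary shift-and-mask. The helper below is a small sketch with hypothetical names. It extracts an inclusive bit range [msb:lsb], using the VHDL "downto" ordering, and applies it to the mi_output_sel field. How each 2-bit value maps to a particular moment is not stated here, so the sketch does not assume a mapping.

```c
#include <stdint.h>

/* Extract bits [msb:lsb] (inclusive, msb >= lsb) of a 32-bit register. */
static inline uint32_t bit_field(uint32_t reg, unsigned msb, unsigned lsb) {
    uint32_t width = msb - lsb + 1u;
    uint32_t mask = (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (reg >> lsb) & mask;
}

/* mi_output_sel = instruction register bits 28 downto 27. */
uint32_t mi_output_sel(uint32_t instruction_register) {
    return bit_field(instruction_register, 28u, 27u);
}
```

For instance, an instruction word of 0x10000000 (only bit 28 set) yields mi_output_sel = 2, and 0x08000000 yields 1.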
99. Appendix A Timing Diagrams
Timing diagrams of the Image Processing in Hardware system and its modules are shown in this section.
A.1 Memory Controller
A.1.1 Memory Read Operation
[Timing diagram of the Memory Read Operation; signal traces not reproduced]
100. xer
4.2.4.1 Description
Center Indexer is one of the modules that represent the square processing window. This window is a distinctive characteristic of image processing operations: operations are performed on every pixel in the window, and the results are placed at the center pixel of the result image. Center Indexer keeps track of the row and column indices of this center pixel and is responsible for moving the index through all pixels of the input image. This module is named cen_index and abbreviated as ci.
4.2.4.2 Ports
[Figure 4.10: Block structure showing ports of Center Indexer. Recoverable port labels: img_addr[16:0], img_width[9:0], img_height[9:0], ci_ld_n, ci_nxt, ci_row, ci_col[8:0], ci_end]
Figure 4.10 shows the ports of this module, which are described in Table 4.4.
Table 4.4 Ports of Center Indexer (bit ranges are given as MSB:LSB; directions are relative to Center Indexer)
ci_ld_n (1 bit, input): If low, calculate and change the index to the next position.
ci_nxt (input): If high, shift the index to the next position. If low, reset the index to (0, 0).
img_width[9:0]: Width of the image.
img_height[9:0]: Height of the image.
img_addr[16:0]: Starting address of the image.
ci_end (1 bit, output): If high, indicates that the index has reached the end of the image.
4.2.4.3 Operations
While the rst_n signal is low, Center Indexer resets itself to the origin position
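The movement of the center index can be illustrated with a behavioural sketch in Python. It follows the Process Controller description, in which a pulse on ci_ld_n resets the window position and a pulse on ci_nxt shifts it. The following are assumptions, not taken from the report: the raster (row-major) scan order, holding at the last pixel, and the rule that ci_end is high while the index sits on the last pixel. The class and method names are hypothetical, and img_addr is not modelled.

```python
class CenterIndexer:
    """Behavioural sketch of Center Indexer (cen_index, "ci").

    Assumptions: row-major scan, reset/load to (0, 0), ci_end high while
    the index is on the last pixel, and the index holds once it gets there.
    """

    def __init__(self, img_width: int, img_height: int) -> None:
        if img_width <= 0 or img_height <= 0:
            raise ValueError("image dimensions must be positive")
        self.img_width = img_width
        self.img_height = img_height
        self.ci_row = 0
        self.ci_col = 0

    def load(self) -> None:
        # Pulse on ci_ld_n (or rst_n held low): move back to the origin.
        self.ci_row, self.ci_col = 0, 0

    @property
    def ci_end(self) -> bool:
        # High once the center index sits on the last pixel of the image.
        return (self.ci_row == self.img_height - 1
                and self.ci_col == self.img_width - 1)

    def shift(self) -> None:
        # Pulse on ci_nxt: step one pixel right, wrapping to the next row.
        if self.ci_end:
            return  # hold at the last pixel (assumed)
        self.ci_col += 1
        if self.ci_col == self.img_width:
            self.ci_col = 0
            self.ci_row += 1
```

For a 3-pixel-wide, 2-pixel-high image, repeated `shift()` calls visit (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), at which point `ci_end` becomes true. `load()` returns the index to the origin.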
