
1.          `STKPC_OPCODES_JMP: begin
                next_pc_mux_sel = `STKPC_SEL_PC_INSTR;
                alu_a_mux_sel   = `STKPC_SEL_ALU_A_ST0;
                alu_b_mux_sel   = `STKPC_SEL_ALU_B_ZERO;
                alu_op_mux_sel  = `STKPC_SEL_ALUOP_OR;
                dsp_mux_sel     = `STKPC_SEL_DSP;
                rsp_mux_sel     = `STKPC_SEL_RSP;
                rstack_mux_sel  = `STKPC_SEL_RSTACK_PC;
                st0_mux_sel     = `STKPC_SEL_ST0_NEW_ALU;
                halt_mux_sel    = `STKPC_FLOW_HALT_NO;
                rstkW           = 0;
                dstkW           = 0;
                ioWE            = 0;
                r_io_access     = 0;
            end
            // All instructions were decoded as in the previous 2 examples. For practical
            // purposes, the other instructions are omitted from the written document;
            // please refer to the digital appendix for the full list.
        endcase
    end

    // Data Stack
    data_stack DATA_STACK (
        // Outputs
        .data_out (st1),
        // Inputs
        .data_in  (st0),
        .addr_rd  (dsp),
        .addr_wr  (dsp),
        .wr_en    (dstkW),
        .clk      (clk)
    );

    // Return Stack
    return_stack RETURN_STACK (
        // Outputs
        .data_out (rst0),
        // Inputs
        .data_in  (rstkD),
        .addr_rd  (rsp),
        .addr_wr  (rsp),
        .wr_en    (rstkW),
        .clk      (clk)
    );

    // ALU
    stkpc_alu STKPC_ALU (
        // Outputs
        .c  (alu_c[`STKPC_WORD_MSB:0]),
        // Inputs
        .a  (alu_a[`STKPC_WORD_MSB:0]),
        .b  (alu_b[`STKPC_WORD_MSB:0]),
        .op (alu_op_mux_sel[`STKPC_SEL_ALUOP_MSB:0])
    );

    // Update PC, pointers
    always @(posedge clk or posedge reset) begin
        if (reset == 1'b1) begin
            pc  <= 0;
            dsp <= 0;
            st0 <
2.  // Data in and out
    always @(posedge clk_i or posedge rst_i) begin
        if (rst_i == 1'b1) begin
            dat_o   <= 0;
            r_ack_o <= 0;
            rty_o   <= 0;
            err_o   <= 0;
        end
        else if (cs) begin // If WB is the target
            if (wcet == 0) begin
                if (we_i == 0 && stb_i == 1) begin // Master is going to read
                    dat_o   <= reg_file[adr_i & 255];
                    r_ack_o <= 1;
                end
                else if (we_i == 1 && stb_i == 1) begin // Write
                    reg_file[adr_i & 255] <= dat_i;
                    r_ack_o <= 1;
                end
                else begin // if we are reading and slave is ready
                    r_ack_o <= 0;
                end
            end
            else begin
                if (we_i == 0 && stb_i == 1 && ready == 1) begin // Read
                    dat_o   <= reg_file[adr_i & 255];
                    r_ack_o <= 1;
                end
                else if (we_i == 1 && stb_i == 1 && ready == 1) begin // Master is going to write
                    reg_file[adr_i & 255] <= dat_i;
                    r_ack_o <= 1;
                end
                else begin // if we are reading and slave is ready
                    r_ack_o <= 0;
                end
            end
        end
    end
    endmodule // wb_s_16

7.2 ArchC Files

7.2.1 Stkpc.ac

    AC_ARCH(stkpc) {
        ac_mem     DM:5M;
        ac_regbank RB:15;
        ac_reg     new_st0;
        ac_reg     st0;
        ac_reg     st1;
        ac_reg     rst0;
        ac_wordsize 16;

        ARCH_CTOR(stkpc) {
            ac_isa("stkpc_isa.ac");
            set_endian("little");
        };
    };

7.2.2 Stkpc_isa.ac

    AC_ISA(stkpc) {
        ac_format Type_Lit      = "%opcode1:1 %uk15:15";
        ac_format Type_Jmp      = "%opcode1:1 %opcode2:2 %uk13:13";
        ac_format Type_Cond_Jmp = "%opcode1:1
3.      end
    end

    // MUX for ST0
    always @* begin
        if (start_signal == 0) begin
            _st0 <= 0;
        end
        else begin
            case (st0_mux_sel)
                `STKPC_SEL_ST0_NEW_ALU: begin
                    _st0 <= alu_c;
                end
                `STKPC_SEL_ST0_NEW_RAM_IO: begin
                    _st0 <= |st0[15:14] ? io_din : ramrd;
                end
                `STKPC_SEL_ST0_NEW_IMM: begin
                    _st0 <= {1'b0, instr[`STKPC_IMM_MSB:0]};
                end
                default: _st0 <= st0;
            endcase
        end
    end

    // Halt instruction
    always @* begin
        case (halt_mux_sel)
            `STKPC_FLOW_HALT_YES: begin
                halt = 1'b1;
            end
            `STKPC_FLOW_HALT_NO: begin
                halt = 1'b0;
            end
            default: halt = 1'b0;
        endcase
    end

    // Instruction Decoder
    always @* begin
        next_pc_mux_sel = 0;
        alu_a_mux_sel   = 0;
        alu_b_mux_sel   = 0;
        alu_op_mux_sel  = 0;
        dsp_mux_sel     = 0;
        rsp_mux_sel     = 0;
        rstack_mux_sel  = 0;
        st0_mux_sel     = 0;
        halt_mux_sel    = 0;
        rstkW           = 0;
        dstkW           = 0;
        ioWE            = 0;
        r_io_access     = 0;
        casez (instr)
            `STKPC_OPCODES_LIT: begin
                next_pc_mux_sel = `STKPC_SEL_PC_INC;
                alu_a_mux_sel   = `STKPC_SEL_ALU_A_ST0;
                alu_b_mux_sel   = `STKPC_SEL_ALU_B_ZERO;
                alu_op_mux_sel  = `STKPC_SEL_ALUOP_OR;
                dsp_mux_sel     = `STKPC_SEL_DSP_INC_1;
                rsp_mux_sel     = `STKPC_SEL_RSP;
                rstack_mux_sel  = `STKPC_SEL_RSTACK_PC;
                st0_mux_sel     = `STKPC_SEL_ST0_NEW_IMM;
                halt_mux_sel    = `STKPC_FLOW_HALT_NO;
                rstkW           = 0;
                dstkW           = 1;
                ioWE            = 0;
                r_io_access     = 0;
            end
4.  Signal declarations (each width is given by the named MSB macro; the wire/reg keywords are not legible per signal):

    Data stack
    _st0          [`STKPC_DSTACK_MSB:0]   New top of stack data
    dstkW                                 Data stack write enable (reg)
    _dstkW                                Data stack write enable
    st1           [`STKPC_DSTACK_MSB:0]   Next of top of stack

    Return stack
    rsp           [`STKPC_RSTACK_MSB:0]   Return stack pointer
    _rsp          [`STKPC_RSTACK_MSB:0]   New return stack pointer
    rstkW                                 Return stack write enable
    _rstkD        [`STKPC_RSTACK_MSB:0]   Newest return stack data
    rst0          [`STKPC_RSTACK_MSB:0]   Top of return stack (wire)

    General
    pc            [`STKPC_PC_MSB:0]       Program counter
    _pc           [`STKPC_PC_MSB:0]       New program counter
    start_signal  [`STKPC_PC_MSB:0]       Start signal
    stall                                 Stall signal for IO/RAM access
    instr_addr    [`STKPC_PC_MSB:0]       Next instruction address
    r_io_access                           IO access (reg)
    halt                                  Halt signal
    halt_r                                Halt (reg)
    ioWE                                  IO/RAM write enable (reg)
    io_access                             IO access signal
    _ioWE                                 IO/RAM write enable
    pc_plus_1     [`STKPC_PC_MSB:0]       PC + 1, used to fetch the next instruction
    ramrd         [`STKPC_WORD_MSB:0]     RAM read input (wire)

    Continuous assignments (right-hand sides are only partly legible):
    instr_addr    Passes the address of the next instruction
    io_access     Built from the IO access signal and stall; high on an IO access
    _ioWE         Assigned from ioWE
    _dstkW        Assigned from dstkW
5.  Problem Description
    [two front-matter entries illegible]
    Table of Contents
    Table of Figures

    1 Introduction
        1.1 Internet of Things
        1.2 Motivation: Energy Harvesting
        1.3 Assignment Interpretation
        1.4 Report Organization
    2 Background
        2.1 Stack Processors
            2.1.1 [title illegible]
            2.1.2 Why Use a Stack Processor
        2.2 The J1 Processor
        2.3 Wishbone Bus
            2.3.1 Wishbone Signals
            2.3.2 Wishbone Operation
                2.3.2.1 Single Read Cycle
                2.3.2.2 Single Write Cycle
        2.4 Design Flow
            2.4.1 ArchC
    3 [title illegible]
        3.1 [title illegible]
            3.1.1 Development Basis and Organization
            3.1.2 Choice of Tools
        3.2 Design Process
            3.2.1 Implementation of the J1 Processor
            3.2.2 Design of the Stack Processor
            3.2.3 Instruction Set Description
            3.2.4 Initial Architecture Design
            3.2.5 The Wishbone Bus
            3.2.6 System [remainder illegible]
        3.3 Testing Process
        3.4 Synthesis Process
        3.5 Place and Route Process
    4 [title illegible]
        4.1 [title illegible]
        4.2 [title illegible]
        4.3 Area Distribution and Layout
    5 Discussion & Future Work
        5.1 Power Analysis
6.  [Figure 5.3: Pipeline Modification (two-stage pipeline, Stage 1 and Stage 2, before and after the change)]

6 References

[1] Dave Evans, "The Internet of Things: How the Next Evolution of the Internet Is Changing Everything," Cisco IBSG, Apr. 2011. [Online]. http://www.cisco.com/web/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf
[2] Jim Drew, "Energy Harvesting Produces Power from Local Environment, Eliminating Batteries in Wireless Sensors," Linear Technology Design Notes, 2010. [Online]. http://cds.linear.com/docs/en/design-note/DN483.pdf
[3] Charles Eric LaForest, "Second-Generation Stack Computer Architecture," University of Waterloo, Waterloo, 2007. [Online]. Available: http://www.eecg.utoronto.ca/~laforest/Second-Generation_Stack_Computer_Architecture.pdf
[4] Philip J. Koopman Jr., Stack Computers: The New Wave. Pittsburgh, USA: Ellis Horwood, 1989.
[5] Dr. Klaus Samelson and Dr. Friedrich Ludwig Bauer, "Verfahren zur automatischen Verarbeitung von kodierten Daten und Rechenmaschine zur Ausübung des Verfahrens," Munich, 1957. [Online]. http://worldwide.espacenet.com/publicationDetails/originalDocument?CC=DE&NR=1094019&KC=&FT=E
[6] James Bowman. (2009) The J1 Forth CPU. [Online]. http://www.excamera.com/sphinx/fpga-j1.html
[7] James Bowman. (2010) www.excamera.com. [Online]. http://www.excamera.com/files/j1.pdf
[8] OpenCores. (2010) www.opencores.org. [Online]. http://cdn.opencores.org/downloads/wbspec_b4
7.  =0, opcode2=3, opcode3=129);

        rs_push.set_asm("rs_push");
        rs_push.set_decoder(opcode1=0, opcode2=3, opcode3=327);

        swap.set_asm("swap");
        swap.set_decoder(opcode1=0, opcode2=3, opcode3=384);

        over.set_asm("over");
        over.set_decoder(opcode1=0, opcode2=3, opcode3=385);

        drop.set_asm("drop");
        drop.set_decoder(opcode1=0, opcode2=3, opcode3=259);

        mem_wr.set_asm("mem_wr");
        mem_wr.set_decoder(opcode1=0, opcode2=3, opcode3=291);

        add.set_asm("add");
        add.set_decoder(opcode1=0, opcode2=3, opcode3=515);

        and.set_asm("and");
        and.set_decoder(opcode1=0, opcode2=3, opcode3=771);

        or.set_asm("or");
        or.set_decoder(opcode1=0, opcode2=3, opcode3=1027);

        xor.set_asm("xor");
        xor.set_decoder(opcode1=0, opcode2=3, opcode3=1283);

        invert.set_asm("invert");
        invert.set_decoder(opcode1=0, opcode2=3, opcode3=1539);

        n_eq_t.set_asm("n_eq_t");
        n_eq_t.set_decoder(opcode1=0, opcode2=3, opcode3=1795);

        n_lt_t.set_asm("n_lt_t");
        n_lt_t.set_decoder(opcode1=0, opcode2=3, opcode3=2051);

        n_sr_t.set_asm("n_sr_t");
        n_sr_t.set_decoder(opcode1=0, opcode2=3, opcode3=2307);

        sub1.set_asm("sub1");
        sub1.set_decoder(opcode1=0, opcode2=3, opcode3=2563);

        rs_pop.set_asm("rs_pop");
        rs_pop.set_decoder(opcode1=0, opcode2=3, opcode3=2957);

        rs_cp.set_asm("rs_cp");
        rs_cp.set_decoder(opcode1=0, opcode2=3, opcode3=2945);

        mem_rd.set_asm("mem_rd");
        mem_rd.set_decoder(opcode1=0, opcode2=3, opcode3=3073);

        n_sl_t.set_
8.  err_i, rty_i, dat_o_cpu, adr_o_cpu, we_i_cpu, cs, io_access);

    `include "stkpc_params.v"

    // Outputs: WB signals
    output [`WB_ADR_O_MSB:0] adr_o;      // Address Out
    output [`WB_DAT_O_MSB:0] dat_o;      // Data Out
    output                   we_o;       // Write/Read Enable
    output [`WB_SEL_O_MSB:0] sel_o;      // Select Out
    output                   stb_o;      // Strobe Signal
    output                   cyc_o;      // Cycle Signal
    // Output: WB module
    output [`WB_DAT_O_MSB:0] dat_i_cpu;  // Data sent to CPU, received from WB

    // Inputs: WB signals
    input                    rst_i;      // Reset
    input                    clk_i;      // Clock
    input  [`WB_DAT_I_MSB:0] dat_i;      // Data In
    input                    ack_i;      // Acknowledgement
    input                    err_i;      // Error Signal
    input                    rty_i;      // Retry Signal
    // Inputs: WB module
    input  [`WB_DAT_O_MSB:0] dat_o_cpu;  // Data from CPU to WB Module
    input  [`WB_ADR_O_MSB:0] adr_o_cpu;  // Address from CPU to WB
    input                    we_i_cpu;   // Enable signal from CPU to WB
    input                    cs;         // Chip Select
    input                    io_access;  // IO Access Signal

    reg [`WB_ADR_O_MSB:0] adr_o;
    reg [`WB_DAT_O_MSB:0] dat_o;
    reg                   we_o;
    reg [`WB_SEL_O_MSB:0] sel_o;
    reg                   stb_o;
    reg                   cyc_o;
    reg [`WB_DAT_O_MSB:0] dat_i_cpu;

    wire _io_access;
    assign _io_access = io_access [operator illegible] stb_o;

    // Every IO access we update the data out, the address and write enable
    always @* begin
        if (_io_access == 1) begin
            adr_o <
9.  [Figure 5.2: Stack Merging; remaining signal labels: (Return Stack) Data In, Return Stack Data Out, Return Stack Enable]

5.3 Wishbone Bus Extension

Although the current implementation supports single 16-bit read and write operations, the Wishbone bus could be extended to support advanced pipelined communication and burst communication. The current Wishbone bus implementation uses only the signals required for simple communication; the signals that carry information about the transferred data are not used, and these could also be implemented in future work. The design had a specific problem with back-to-back I/O accesses in which the first access read from, and the second wrote to, the same location in an external I/O module. This corner case could be resolved by modifying the stall module in cpu.v or the signals from the Wishbone bus modules.

5.4 Pipeline Optimization

The final design has two pipeline stages: the first consists exclusively of the fetch stage, and the second comprises the decode, execute and write-back stages of the common pipeline. One change that could improve the behavior of the stack processor system is to move a small part of the decoding process to the first pipeline stage. As shown in Figure 5.3, adding a small portion of the decoding to the fetch stage could result in a faster processor and prevent errors from corner cases.
10. Design Compiler from Synopsys [16].
• Place and Route: The design flow from Atmel was again used, along with Encounter from Cadence [17].

3.2 Design Process

The design process was divided into four tasks:

a) Implement an existing processor: A stack processor with characteristics similar to those needed for the assignment was chosen and implemented. The implementation was simulated to view and verify the behavior of the processor. Correct operation of the processor was required before continuing to the next step.
b) Design of the stack processor: Using the initial processor as a reference, a new design was created and implemented. The new design was compared with the initial one to ensure proper behavior.
c) Design of Wishbone bus modules: Once the processor was implemented, the communication channel needed to be established. The Master and Slave modules of the Wishbone bus were implemented.
d) Integration of the system: The final step took all the elements and integrated them into one complete system.

As previously mentioned, the design process was carried out in parallel with the testing process to ensure a correct design throughout. The implementation of the final stack processor followed the methodology described in the previous sections: it started with the implementation of the J1 processor, continued with the implementation of every element, and finished with the integration of the final system.

3.2.1 Implem
11. as basis or start a new design from scratch. Due to time constraints, the decision was made to start the project using an existing processor as a foundation. The implementation was divided into four sections, bearing in mind that implementation was not a linear process and that multiple iterations were needed to correct the design. The four sections were:

1. Design Process: A proper design was the basis for the implementation. The design used an existing processor as a base and reference. The goal of this step was to obtain a functional RTL design.
2. Testing Process: Testing was done throughout the complete implementation process. This section explains the evolution of the testing techniques and the scripts used to simplify testing. It describes the simulation part of the assignment.
3. Synthesis Process: The synthesis process is explained together with the scripts used. The alternatives available when doing the design are also shown.
4. Place and Route Process: This was the last step of the implementation. This section explains the steps taken to obtain the final architecture.

It is important to mention that every step and element of the implementation was the result of several iterations of the process shown in Figure 3.1. Simulation, test benching and synthesis were used for testing every element of the design. The processor instructions used
12. asm("n_sl_t");
        n_sl_t.set_decoder(opcode1=0, opcode2=3, opcode3=3331);

        stk_dep.set_asm("stk_dep");
        stk_dep.set_decoder(opcode1=0, opcode2=3, opcode3=3584);

        n_ult_t.set_asm("n_ult_t");
        n_ult_t.set_decoder(opcode1=0, opcode2=3, opcode3=3843);
    };
};

7.3 Instruction Set Table

For a complete description of each instruction and instruction type, please refer to the digital Appendix.

Name                Assembler   Instruction Type    Description
Literal Operation   lit         Literal             Load 15-bit value to top of stack
Jump                jmp         Jump                Jump to given PC value
Conditional Jump    cond_jmp    Conditional Jump    Jump to given PC value if top of stack equals zero
Call                call        Call                Jump to given PC value; old PC value is saved in return stack
Addition            add         ALU                 16-bit addition of the 2 top values of data stack
Logical And         and         ALU                 16-bit AND of the 2 top values of data stack
Logical Or          or          ALU                 16-bit OR of the 2 top values of data stack
Logical Xor         xor         ALU                 16-bit XOR of the 2 top values of data stack
Drop                drop        ALU                 Drop top value of the data stack
Duplicate           dup         ALU                 Duplicate top value of the data stack
Exit                exit        ALU                 Jump to PC taken from the top of the return stack
Invert              invert      ALU                 Bitwise logic invert of the top of stack
Memory Read         mem_rd      ALU                 Read to e
Memory Write        mem_wr      ALU
Equal comparator    n_eq_t      ALU
No operation        nop         ALU
13. because no value has gone through the top of stack, and therefore the first two locations on the data stack have a value of 0. The first locations of the data stack that are filled with zeroes due to this behavior still work properly and are used by the architecture. This does not affect the execution of instructions or the behavior of the processor. If starting from position 0 of the data stack is strictly necessary, a hard-coded solution is to use the drop instruction at the beginning of a program.

All ALU instructions behaved similarly and satisfactorily, so an example of each one is not shown. Instead, an example of a CALL instruction is shown next, demonstrating the processor's ability to modify the program address. Figure 4.3 shows the waveform of the simulation of the CALL instruction test. The instructions of the test are:

    lit 4     // 0x8004
    call 5    // 0x4005
    halt      // 0x6010
    lit 6     // 0x8006
    lit 7     // 0x8007
    lit 8     // 0x8008
    lit 9     // 0x8009
    exit      // 0x700C
    lit 10    // 0x800A
    halt      // 0x6010

[Waveform sample: clk = 0, reset = 0, start_signal[12:0] = 0001, pc[12:0] = 0006, instr[15:0] = 8009, _st0[15:0] = 0009, st0[15:0] = 0008, st1[15:0] = 0004, dsp[4:0] = 02, alu_a[15:0] = 0008, alu_b[15:0] = 0000, alu_c[15:0] = 0008, data_stack_0 = 0000, data_stack_1 = 0000, data_stack_2 = 0004, data_stack_3 = zzzz, data_stack_5 = zzzz, rsp[15:0] = 0001, return_stack_0 ...]
14.  %opcode2:2 %uk13:13";
        ac_format Type_Call     = "%opcode1:1 %opcode2:2 %uk13:13";
        ac_format Type_Alu      = "%opcode1:1 %opcode2:2 %opcode3:13";

        ac_instr<Type_Lit>      lit;
        ac_instr<Type_Jmp>      jmp;
        ac_instr<Type_Cond_Jmp> cond_jmp;
        ac_instr<Type_Call>     call;
        ac_instr<Type_Alu>      nop, halt, nip, exit, dup, rs_push, swap, over, drop,
                                mem_wr, add, and, or, xor, invert, n_eq_t, n_lt_t,
                                n_sr_t, sub1, rs_pop, rs_cp, mem_rd, n_sl_t, stk_dep,
                                n_ult_t;

        // gas MIPS-specific register names
        ac_asm_map reg {
            "r"[0..14] = [0..14];
        }

        ac_asm_map sreg {
            "new_st0" = 8;
            "st0"     = 9;
            "st1"     = 10;
            "rst0"    = 0;
        }

        ISA_CTOR(stkpc) {
            lit.set_asm("lit %exp", uk15);
            lit.set_decoder(opcode1=1);

            jmp.set_asm("jmp %exp", uk13);
            jmp.set_decoder(opcode1=0, opcode2=0);

            cond_jmp.set_asm("cond_jmp %exp", uk13);
            cond_jmp.set_decoder(opcode1=0, opcode2=1);

            call.set_asm("call %exp", uk13);
            call.set_decoder(opcode1=0, opcode2=2);

            nop.set_asm("nop");
            nop.set_decoder(opcode1=0, opcode2=3, opcode3=0);

            halt.set_asm("halt");
            halt.set_decoder(opcode1=0, opcode2=3, opcode3=16);

            nip.set_asm("nip");
            nip.set_decoder(opcode1=0, opcode2=3, opcode3=3);

            exit.set_asm("exit");
            exit.set_decoder(opcode1=0, opcode2=3, opcode3=4108);

            dup.set_asm("dup");
            dup.set_decoder(opcode1
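The instruction formats above pack into a single 16-bit word. The following is a minimal Python sketch of that packing, assuming the fields are laid out most-significant-first in the order the formats list them (opcode1 in bit 15, opcode2 in bits 14:13, and the operand or opcode3 in the low bits). The helper names are hypothetical and are not part of the ArchC description; the opcode values are the ones given in the set_decoder lines.

```python
# Hypothetical helpers; field layout assumed MSB-first as listed in the ac_format lines.

def encode_lit(value):
    """Type_Lit: opcode1 = 1, followed by a 15-bit literal."""
    if not 0 <= value < (1 << 15):
        raise ValueError("literal must fit in 15 bits")
    return (1 << 15) | value


def encode_flow(opcode2, target):
    """Type_Jmp / Type_Cond_Jmp / Type_Call: opcode1 = 0, 2-bit opcode2, 13-bit target."""
    if not 0 <= target < (1 << 13):
        raise ValueError("target must fit in 13 bits")
    return (opcode2 << 13) | target


def encode_alu(opcode3):
    """Type_Alu: opcode1 = 0, opcode2 = 3, 13-bit opcode3."""
    if not 0 <= opcode3 < (1 << 13):
        raise ValueError("opcode3 must fit in 13 bits")
    return (3 << 13) | opcode3


# opcode2 values taken from the set_decoder lines for jmp, cond_jmp and call
JMP, COND_JMP, CALL = 0, 1, 2

# lit 4, call 5, halt (halt uses opcode3 = 16)
program = [encode_lit(4), encode_flow(CALL, 5), encode_alu(16)]
print([f"0x{word:04X}" for word in program])
```

Under this assumed layout the three words come out as 0x8004, 0x4005 and 0x6010, which agrees with the encodings listed next to the instructions in the CALL test program.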
15. the assembler language used by the design, represented by section B in Figure 2.10. The design flow is also in charge of installing the compilation tools for Verilog and ArchC if needed. Finally, the design flow runs the simulation of the previously generated Verilog file. The design flow is not the main focus of this assignment and is still a work in progress; changes were therefore applied continuously during the development of the assignment. The assignment focused on section B of the design flow: creating the RTL files and the ArchC files.

2.4.1 ArchC

A brief introduction, taken from the ArchC User Manual [9], is given here to provide an overall understanding of the process that takes place when generating the stack processor assembler language. In recent years, the solution for simplifying the development and testing of a design has been the use of Architecture Description Languages (ADLs). Due to the increasing complexity of modern designs and time-to-market constraints, designers are moving from hardware description languages to system-level designs, where automatic generation of a software toolkit (composed of assemblers, linkers, compilers and simulators) is mandatory [10] [11] [9]. ArchC is a language that follows the SystemC syntax style and is capable of describing a processor's architecture and a memory hierarchy. The goal of ArchC is to provide information at the right abstraction level to provide designers with the to
16. toggling activity of all the signals in the system. Generating clock trees and clock gating configurations will play an important role in power analysis. This opens many possibilities for testing and comparison. For these reasons, power analysis is considered suitable for a separate assignment and was not covered by this one.

[Figure 5.1: Post Place and Route Flow. After PnR, three flows are shown: back-annotated simulation, timing analysis and power analysis. Inputs shown include the netlist, SPEF and SDF files, timing constraints, stimuli and SAIF activity.]

5.2 Stack Merging

At the end of the assignment, the possibility of merging the Data Stack and the Return Stack was discussed. This would enable a more compact architecture, but it would also bring some new challenges:

• Arbiter: A module in charge of arbitration should be implemented to handle cases in which multiple stack accesses are made.
• Pointers: The pointers for both stacks would need to be monitored or modified so that they do not use illegal stack locations.
• Delay: This modification could add a delay in cases where consecutive accesses to the stack are needed.

As shown in Figure 5.2, the arbiter would need to determine which access is made to the stack and generate a stall if needed.

[Figure 5.2 signal labels: Data Stack Pointer, Data Stack Data In, Data Stack Data Out, Data Stack Enable, Return Stack Pointer, Return Stack ...]
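The three challenges listed for stack merging (arbitration, pointer checking and the added delay) can be illustrated with a small behavioral sketch. Everything in it is an assumption made for illustration: the single shared memory, the data stack growing upward while the return stack grows downward, the fixed priority of the data stack, and all class and function names. It is not a design taken from the thesis.

```python
class MergedStack:
    """Hypothetical shared memory: data stack grows up from 0, return stack down from the top."""

    def __init__(self, size=32):
        self.mem = [0] * size
        self.dsp = 0           # next free data-stack slot
        self.rsp = size - 1    # next free return-stack slot

    def _check_free(self):
        # Pointer monitoring: the two stacks must never cross.
        if self.dsp > self.rsp:
            raise OverflowError("data and return stacks would collide")

    def push_data(self, value):
        self._check_free()
        self.mem[self.dsp] = value
        self.dsp += 1

    def push_return(self, value):
        self._check_free()
        self.mem[self.rsp] = value
        self.rsp -= 1


def arbitrate(data_req, return_req):
    """Single-port arbiter: returns (grant, stall). The data stack has priority (assumption)."""
    if data_req and return_req:
        return "data", True      # the return access must wait one cycle
    if data_req:
        return "data", False
    if return_req:
        return "return", False
    return None, False


def step(stack, data_push=None, return_push=None):
    """One cycle: perform the granted access and report whether the other one stalled."""
    grant, stall = arbitrate(data_push is not None, return_push is not None)
    if grant == "data":
        stack.push_data(data_push)
    elif grant == "return":
        stack.push_return(return_push)
    return grant, stall


stack = MergedStack(size=4)
print(step(stack, data_push=1, return_push=9))  # both requested: data granted, stall raised
print(step(stack, return_push=9))               # retried return access now goes through
print(stack.mem)
```

The stall in the first cycle is the extra delay the text mentions: an instruction that touches both stacks in one cycle would take two cycles on a merged, single-port memory.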
17. used for the stack processor was a simple dual-port RAM. The code can also be found in Appendix 7.1. The behavior of this first architecture design was tested using the working J1 processor as a reference. The initial architecture design was able to correctly execute all the instructions listed in Appendix 7.3. The behavior was tested using simulation and the corresponding waveforms (see Section 3.3 for more details).

3.2.5 The Wishbone Bus

The implementation of the Wishbone bus needed two parts: the master interface and the slave interface. The goal for the communication bus implemented in this assignment was to perform standard single reads and standard single writes. Note also that the processor was only able to send or read data 16 bits in length, and that the design works under the assumption that the user addresses valid memory locations and supplies valid data. The design was to be as compact as possible. The tag signals used to provide extra information were not required for this assignment and were therefore omitted. All other signals were added to the implementation to allow for changes in future implementations. The complete code, in the files wb_m_16.v and wb_s_16.v, can be found in Appendix 7.1.

[Figure: Wishbone connection between the CPU, the Wishbone Master Interface and the Wishbone Slave Interface. Signals shown include Data In, Data Out, Address, Ack, Write Enable, Error and Retry ...]
18. [Figure 4.4: Wishbone Bus Communication Example. Waveform signals: dsp[4:0]; master side clk_i, adr_o[15:0], dat_i[15:0], we_o, stb_o, ack_i, cyc_o; slave side clk_i, adr_i[15:0], dat_o[15:0], dat_i[15:0], we_i, stb_i, ack_o, cyc_i. Sample values: adr_o = adr_i = 4000, slave dat_i = 0004.]

4.3 Area Distribution and Layout

For reasons of confidentiality with Atmel, only the area distribution of the final design is shared, as seen in Figure 4.5. The area used for the CPU logic is minimal in comparison to the rest of the system. An image of the final place and route can be seen in Figure 4.6.

[Figure 4.5: Area Distribution (chart; the legible slices include CPU, Peripheral and Return Stack)]

It is important to remember that the present assignment needed to create a solid base for future projects and to enable the next user to continue the work as easily as possible. Not only was an architecture implementation made; the test flow, the custom assembler and the documentation needed to follow and repeat the process from step one all the way to Place and Route were also created.

[Figure 4.6: Place and Route]

5 Discussion & Future Work

This section discusses the results of the assignment as well as some topics for future work and optimization. The main tasks of the assignment were completed successfully. Due to time limitations, the power analysis was not covered in this assignment. Working on this a
19. 8004
    lit 1     // 0x8001
    lit 10    // 0x800A
    add       // 0x6203
    halt      // 0x6010

In short, the values 4, 1 and 10 are pushed onto the stack, and then the last two values are added. It is important to remember that whenever a value is pushed onto the stack, it must pass through the Top of Stack (st0) and Next After Top of Stack (st1) registers, because only after going through these two registers is the value stored on the stack. As soon as the add instruction (0x6203) is read, when the program counter (pc) has a value of 3, the new top of stack (_st0) is calculated, and in the next clock cycle the top of stack (st0) value is updated correctly. Note that the first two positions of the data stack (data_stack_0 and data_stack_1) are filled with the value 0. This is because whenever an instruction that accesses the stack is executed, the data stack pointer (dsp) is increased and the value of next of stack (st1) is stored in the data stack location pointed to by the data stack pointer.

[Figure 4.2: Add Simulation. Waveform sample: clk = 1, reset = 0, start_signal[12:0] = 0001, pc[12:0] = 0003, instr[15:0] = 6203, _st0[15:0] = 000B, st0[15:0] = 000A, st1[15:0] = 0001, dsp[4:0] = 03, alu_a[15:0] = 0001, alu_b[15:0] = 000A, alu_c[15:0] = 000B, data_stack_0 = 0000, data_stack_1 = 0000, data_stack_2 = 0004, data_stack_3 = 0001, data_stack_4 = zzzz.]

During the initialization the value of st1 is 0
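The stack contents described above can be reproduced with a small behavioral model. This is a simplified sketch and not the thesis RTL: it assumes a J1-style arrangement in which st0 is a register and st1 is whatever the data-stack RAM holds at dsp, and it assumes the RAM reads as 0 before its first write. All names other than st0, st1, dsp and data_stack are hypothetical. The arrangement was chosen only because it yields the same values as the Figure 4.2 sample.

```python
class DataPath:
    """Simplified model: an st0 register plus a data-stack RAM indexed by dsp."""

    def __init__(self, depth=32):
        self.st0 = 0
        self.dsp = 0
        self.data_stack = [0] * depth   # assumption: locations read as 0 until written

    @property
    def st1(self):
        # Next-of-stack is modelled as the RAM output at the current pointer.
        return self.data_stack[self.dsp]

    def lit(self, value):
        """Push a 15-bit literal: old st0 goes to the RAM, the literal becomes st0."""
        self.dsp += 1
        self.data_stack[self.dsp] = self.st0
        self.st0 = value & 0x7FFF

    def add(self):
        """16-bit add of the two top values; the stack shrinks by one."""
        result = (self.st0 + self.st1) & 0xFFFF
        self.dsp -= 1
        self.st0 = result


dp = DataPath()
for literal in (4, 1, 10):       # lit 4, lit 1, lit 10
    dp.lit(literal)

# State just before the add executes (compare with the pc = 3 waveform sample)
print(dp.dsp, f"{dp.st0:04X}", f"{dp.st1:04X}", dp.data_stack[:4])

dp.add()
print(f"{dp.st0:04X}")
```

In this model the two leading zeroes in the data stack appear because the initial st0 value of 0 is written out on the first push and location 0 is never written at all, which mirrors the behavior the text attributes to the initial register values.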
20.     5.2 Stack Merging
        5.3 Wishbone Bus Extension
        5.4 Pipeline Optimization
    6 References
    7 Appendix
        7.1 [title illegible] Code
            7.1.1 [title illegible]
            7.1.2 Dual Port RAM
            7.1.3 Data Stack
            7.1.4 [title illegible]
            7.1.5 Wishbone Master Module
            7.1.6 Wishbone Slave Module
        7.2 ArchC Files
            7.2.1 [title illegible]
            7.2.2 [title illegible]
        7.3 Instruction Set Table

Table of Figures

    Figure 1.1  The Internet of Things Evolution
    Figure 2.1  LIFO PUSH and POP Operations
    Figure 2.2  Generic Stack Processor Architecture
    Figure 2.3  Add Operation on Stack Processor
    Figure 2.4  Return Stack Example
    Figure 2.5  J1 Architecture Diagram
    Figure 2.6  J1 ALU Instruction Decoding
    Figure 2.7  Master and Slave Wishbone Interface [8]
    Figure 2.8  Single Read Cycle [8]
    Figure 2.9  Single Write Cycle [8]
    Figure 2.10 [title illegible]
    Figure 3.1  [title illegible]
    Figure 3.2  Instruction Decoding
    Figure 3.3  Initial Architecture Diagram
    Figure 3.4  Wishbone Bus Connection Diagram
    Figure 3.5  Stack
21. [Figure 3.8: Different RAM Connections]

3.5 Place and Route Process

To perform the last step of the implementation, the Encounter tool from Cadence was used. As with previous steps, a design flow and script were used to automate and simplify the process.

[Figure 3.9: Place and Route Flow. Blocks shown: netlist, LIB file, LEF file, clock tree definition, script, timing constraints (.sdc), Encounter, and the resulting netlist, SPEF file and DEF file.]

The place and route flow used at Atmel is explained in the steps below, using Figure 3.9 as a reference.

1. Setup: The LIB file and LEF file are in charge of the setup. The LIB file provides the timing information of the cells. The Library Exchange Format (LEF) file contains the physical view, pin layout, metal layers and abstract information of the cells.
2. Read Netlist: The netlist generated by the synthesis process is used as the input.
3. Floor Planning: The first distribution of the chip is made, and the die size and core area are determined. Other blocks, such as RAM or I/O buffers, are also placed in this step.
4. Power Supply Definition: Depending on the configuration, the characteristics of the power supply are determined. For example, the decision between using rings or stripes for the power supply is made in this step.
5. Timing Constraint Reading: The SDC file generated by the synthesis is used to determine timing limitations and
22. DAT_I
• Master negates STB_O and CYC_O to indicate the end of the cycle
• Slave negates ACK_I in response to the negated STB_O

[Figure 2.8: Single Read Cycle [8]. Timing diagram of ADR_O, DAT_I, DAT_O, WE_O, SEL_O, STB_O, ACK_I and CYC_O.]

2.3.2.2 Single Write Cycle

The following explains how the write cycle works on a Wishbone interface. For practical purposes, the explanation is once again separated by clock cycles. Figure 2.9 also represents this bus transaction.

Clock Edge 0:
• Master presents a valid address on ADR_O
• Master presents valid data on DAT_O
• Master asserts WE_O to indicate a write cycle
• Master asserts CYC_O to indicate the start of the cycle
• Master asserts STB_O to indicate the start of the phase

Clock Edge 1:
• Slave prepares to latch the data on DAT_O
• Slave asserts ACK_I in response to STB_O to indicate latched data
• Master monitors ACK_I and prepares to terminate the cycle

Clock Edge 2:
• Slave latches data on DAT_O
• Master negates STB_O and CYC_O to indicate the end of the cycle
• Slave negates ACK_I in response to the negated STB_O

[Figure 2.9: Single Write Cycle [8]. Timing diagram of ADR_O, DAT_I, DAT_O, WE_O, SEL_O, STB_O, ACK_I and CYC_O.]

Both of the previous descriptions assume the slave needs no waiting time to respond to the master. The actual project implementation has a stall that allows the slave to have a waiting time more
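The three clock edges of a single cycle can be walked through with a short transaction-level sketch. It assumes a zero-wait-state slave, as the descriptions above do, and it simplifies the timing so that the slave latches write data on the same edge on which it acknowledges. The class, function and variable names are hypothetical, and the address and data values are only illustrative.

```python
class Slave:
    """Zero-wait-state slave with a small register file (hypothetical model)."""

    def __init__(self):
        self.regs = {}

    def clock(self, cyc, stb, we, adr, dat_w):
        """One rising clock edge as seen by the slave; returns (ack, dat_r)."""
        if cyc and stb:
            if we:
                self.regs[adr] = dat_w          # latch write data
            return 1, self.regs.get(adr, 0)     # acknowledge and drive read data
        return 0, 0                             # STB negated, so ACK is negated


def single_cycle(slave, adr, we, dat_w=0):
    """Run one single read (we=0) or single write (we=1).

    Returns (read_data, ack_trace); read_data is None for a write.
    """
    ack_trace = []
    # Clock edge 0: master presents the address (and data for a write), drives WE,
    # and asserts CYC and STB. The slave has not responded yet.
    cyc = stb = 1
    ack_trace.append(0)
    # Clock edge 1: slave sees STB and asserts ACK; master monitors ACK.
    ack, dat_r = slave.clock(cyc, stb, we, adr, dat_w)
    ack_trace.append(ack)
    # Clock edge 2: master negates STB and CYC; slave negates ACK in response.
    cyc = stb = 0
    ack, _ = slave.clock(cyc, stb, we, adr, dat_w)
    ack_trace.append(ack)
    return (None if we else dat_r), ack_trace


slave = Slave()
single_cycle(slave, adr=0x4000, we=1, dat_w=0x0004)   # single write
data, acks = single_cycle(slave, adr=0x4000, we=0)    # single read of the same address
print(f"{data:#06x}", acks)
```

A slave that needs waiting time would simply keep ACK low for additional edges while STB stays asserted, which is the case the stall mechanism mentioned above has to cover.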
23. [Figure 2.3: Add Operation on Stack Processor]

What benefits can be seen from the previous example? The PUSH and POP instructions have only one argument. Basic arithmetic and logical operations always use the values on the top of the stack, so the operation instructions do not require an argument. In a typical RISC processor a similar operation could be done in one instruction, but it would need at least three arguments.

Stack processors can call subroutines and exit them using the Return Stack, which follows the same principle as any stack. The only difference is that it stores the subroutine return addresses instead of operands. Figure 2.4 shows an example in which the return stack is used. Assume the program counter (PC) increments by one after executing each operation. When the CALL instruction is executed, the address of the next instruction is stored in the Return Stack and the PC is updated to the subroutine address. Once execution of the subroutine finishes with the EXIT instruction, the subroutine return address is taken from the Return Stack and used to update the PC. Finally, the HALT instruction stops the program.

    Instruction    Address    Return Stack
    PUSH 0x02      0x0001     0x0004
    PUSH 0x05      0x0002
    CALL 0x0E      0x0003
    HALT           0x0004
    ADD            0x000E
    EXIT           0x000F

    Figure 2.4 Return Stack Example

The previous example shows that with a simple instruction set sta
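The Figure 2.4 program can be traced with a few lines of Python. This is an illustrative interpreter only, assuming the instruction semantics described in the paragraph above (CALL saves the address of the next instruction, EXIT restores it). The dictionary layout and the function name are hypothetical and do not model the thesis processor's encoding or timing.

```python
def run(program, start_pc):
    """Execute a {address: (opcode, argument)} program; returns (data_stack, trace)."""
    data_stack, return_stack = [], []
    pc = start_pc
    trace = []
    while True:
        op, arg = program[pc]
        next_pc = pc + 1                       # PC increments by one per instruction
        if op == "PUSH":
            data_stack.append(arg)
        elif op == "ADD":
            b = data_stack.pop()
            a = data_stack.pop()
            data_stack.append(a + b)
        elif op == "CALL":
            return_stack.append(next_pc)       # save the return address
            next_pc = arg                      # jump to the subroutine
        elif op == "EXIT":
            next_pc = return_stack.pop()       # resume after the CALL
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown instruction {op!r}")
        trace.append((pc, op, list(return_stack)))
        pc = next_pc
    return data_stack, trace


program = {
    0x0001: ("PUSH", 0x02),
    0x0002: ("PUSH", 0x05),
    0x0003: ("CALL", 0x0E),
    0x0004: ("HALT", None),
    0x000E: ("ADD", None),
    0x000F: ("EXIT", None),
}

data_stack, trace = run(program, start_pc=0x0001)
for pc, op, rstack in trace:
    print(f"0x{pc:04X} {op:5} return stack = {[f'0x{a:04X}' for a in rstack]}")
print("data stack:", data_stack)
```

While the subroutine runs, the return stack holds 0x0004, the address of the HALT that follows the CALL, which matches the single return-stack entry shown in the figure.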
24. NTNU Trondheim
Norwegian University of Science and Technology

Ultra-low Power Stack-based Processor for Energy Harvesting Systems

Allan Green

Embedded Computing Systems
Submission date: June 2014
Supervisor: Snorre Aunet, IET

Norwegian University of Science and Technology
Department of Electronics and Telecommunications

Problem Description

Ultra-low Power Stack-based Processor for Energy Harvesting Systems

One of the most recent trends in electronics is the Internet of Things, and the transition to these systems is already happening. They will be built on many low-energy nodes that connect different devices to one another. A promising alternative to batteries in such systems is energy harvesting. Energy harvesting systems are battery-less systems powered by energy sources in the environment, such as heat gradients, light or vibration. Due to the limited energy available, these systems often need a small programmable subsystem for basic control tasks as well as for processing and interpreting sensor data. This subsystem should be as small as possible to accomplish the required task while consuming as little of the available energy as possible.

The goal of the assignment is to implement an ultra-low power stack-based CPU to be used in an energy harvesting system. The CPU should be integrated in a complete subsystem with a RAM and low-power peripherals. Basic example code typical for the applicatio
25. Processor Hierarchy
    Figure 3.6  Testing Flow Diagram
    Figure 3.7  Synthesis [remainder illegible]
    Figure 3.8  Different RAM Connections
    Figure 3.9  Place and Route Flow
    Figure 4.1  Final Design Architecture
    Figure 4.2  Add Simulation
    Figure 4.3  [title illegible]
    Figure 4.4  Wishbone Bus Communication Example
    Figure 4.5  Area Distribution
    Figure 4.6  Place and Route
    Figure 5.1  Post Place and Route Flow
    Figure 5.2  Stack Merging
    Figure 5.3  Pipeline Modification

1 Introduction

The electronic revolution is a reality. Every day, more gadgets and appliances are given the capability to interconnect and communicate over the internet. To better understand the work done in this assignment, insight is needed into the current trends and the problems that internet-connected devices face.

1.1 Internet of Things

The Internet of Things is a fairly new concept, yet it has rapidly become popular in recent years. Although an official definition of the term does not exist, for this assignment it is defined as the attempt to equip all gadgets, objects and appliances in the world with a way to con
26. SB:0] reg_f5;

    // Inputs: WB signals
    input                    rst_i;  // Reset
    input                    clk_i;  // Clock
    input  [`WB_DAT_O_MSB:0] dat_i;  // Data in
    input  [`WB_ADR_O_MSB:0] adr_i;  // Address in
    input                    cyc_i;  // Cycle input; high while a bus cycle is in progress
    input                    we_i;   // Write enable
    input                    stb_i;  // Strobe in
    input  [`WB_SEL_O_MSB:0] sel_i;  // Select in
    input                    cs;     // Chip select

    reg  [`WB_DAT_O_MSB:0] reg_file [0:`GPIO_REG_FILE_MSB]; // Register file to store data
    reg  [`WB_DAT_O_MSB:0] dat_o;
    reg                    rty_o;
    reg                    err_o;
    reg  [`WB_DELAY_MSB:0] temp_count; // Temporary counter for delay cycles
    reg                    ready;      // Set when the slave has finished processing and can transmit back
    reg                    r_ack_o;
    wire                   _ack_o;

    // Used to monitor the register file in simulation
    assign reg_f0 = reg_file[0];
    assign reg_f1 = reg_file[1];
    assign reg_f2 = reg_file[2];
    assign reg_f3 = reg_file[3];
    assign reg_f4 = reg_file[4];
    assign reg_f5 = reg_file[5];

    assign _ack_o = r_ack_o;
    assign ack_o  = _ack_o & stb_i;

    wire [`WB_DELAY_MSB:0] wcet; // Worst-case execution time: max cycles of delay needed for slave operation
    assign wcet = 'd0;

    // Delay generator
    always @(posedge clk_i or posedge rst_i) begin
        if (rst_i == 1'b1) begin
            ready      <= 0;
            temp_count <= 0;
        end
        else begin
            if (stb_i == 1) begin
                temp_count <= temp_count + 1;
                if (temp_count == wcet) begin
                    ready      <= 1;
                    temp_count <= 0;
                end
            end
            else begin
                temp_count <= 0;
                ready      <= 0;
            end
        end
    end
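The behavior of the delay generator above can be checked with a small software model. The following is only a sketch under stated assumptions: the class name `DelayGenerator` and its `clock` method are hypothetical, the comparison between the counter and `wcet` is assumed to be an equality test, and the non-blocking assignments are modelled by comparing against the counter value from before the clock edge.

```python
class DelayGenerator:
    """Software model of the slave's delay generator (illustrative only)."""

    def __init__(self, wcet):
        self.wcet = wcet      # worst-case delay in cycles (assumed equality compare)
        self.count = 0        # models temp_count
        self.ready = 0        # models ready

    def clock(self, stb, rst=0):
        """Advance one rising clock edge and return the new value of ready."""
        if rst:
            self.ready = 0
            self.count = 0
        elif stb:
            old = self.count              # value seen by the comparison (pre-edge)
            self.count = old + 1
            if old == self.wcet:
                self.ready = 1
                self.count = 0            # the later assignment wins
        else:
            self.count = 0
            self.ready = 0
        return self.ready


gen = DelayGenerator(wcet=2)
trace = [gen.clock(stb=1) for _ in range(3)]
print(trace)            # ready goes high on the third strobed cycle
print(gen.clock(stb=0)) # dropping the strobe clears ready
```

With `wcet` set to 0, as in the listing, the model raises `ready` on the first strobed cycle, which corresponds to the no-delay case.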
27. _SEL_ALU_B_DSP: begin // Data stack pointer
                alu_b <= dsp;
            end
            default: alu_b <= alu_b;
        endcase
    end
end

// MUX for DSP
always @* begin
    if (start_signal == 0) begin // We wait for start signal
        _dsp <= 0;
    end
    else begin
        case (dsp_mux_sel)
            `STKPC_SEL_DSP: begin
                _dsp <= dsp;
            end
            `STKPC_SEL_DSP_INC_1: begin
                _dsp <= dsp + 1;
            end
            `STKPC_SEL_DSP_INC_2: begin
                _dsp <= dsp + 2;
            end
            `STKPC_SEL_DSP_DEC_1: begin
                _dsp <= dsp - 1;
            end
            default: _dsp <= dsp;
        endcase
    end
end

// MUX for RSP
always @* begin
    if (start_signal == 0) begin // We wait for start signal
        _rsp <= 0;
    end
    else begin
        case (rsp_mux_sel)
            `STKPC_SEL_RSP: begin
                _rsp <= rsp;
            end
            `STKPC_SEL_RSP_INC_1: begin
                _rsp <= rsp + 1;
            end
            `STKPC_SEL_RSP_INC_2: begin
                _rsp <= rsp + 2;
            end
            `STKPC_SEL_RSP_DEC_1: begin
                _rsp <= rsp - 1;
            end
            default: _rsp <= rsp;
        endcase
    end
end

// MUX for Return Stack
always @* begin
    if (start_signal == 0) begin // We wait for start signal
        _rstkD <= 0;
    end
    else begin
        case (rstack_mux_sel)
            `STKPC_SEL_RSTACK_PC: begin
                _rstkD <= pc;
            end
            `STKPC_SEL_RSTACK_ST0: begin
                _rstkD <= st0;
            end
            `STKPC_SEL_RSTACK_PC_PLUS_1: begin
                _rstkD <= {pc_plus_1[`STKPC_PC_MSB:0], 1'b0};
            end
            default: _rstkD <= pc;
        endcase
28. a_latched <= addr_a;
        end
        if (cs_b) begin
            addr_b_latched <= addr_b;
        end
    end

    // Update of data out of port a
    always @(addr_a_latched) begin
        // dbg addr_a
        data_out_a <= mem[addr_a_latched[`ADDR_MSB:0]];
    end

    // Update of data out of port b
    always @(addr_b_latched) begin
        data_out_b <= mem[addr_b_latched[`ADDR_MSB:0]];
    end

    // Write
    always @(posedge clock) begin
        if (cs_a && we_a) begin
            $display("Memory write A");
            mem[addr_a] <= data_in_a;
        end
        if (cs_b && we_b) begin
            $display("Memory write B");
            mem[addr_b] <= data_in_b;
        end
    end
endmodule // ram2ports

7.1.3 Data Stack

module data_stack (
    // Outputs
    data_out, data_stack_0, data_stack_1, data_stack_2, data_stack_3, data_stack_4,
    data_stack_5, data_stack_6, data_stack_7, data_stack_8,
    // Inputs
    data_in, addr_rd, addr_wr, rd_en, wr_en, clk
);
    `include "stkpc_params.v"

    // Outputs
    output [`STKPC_WORD_MSB:0]   data_out;
    output [`STKPC_DSTACK_MSB:0] data_stack_0; // Signals to monitor content of data stack
    // 7 more registers are used but omitted from the printed version for practical purposes

    // Inputs
    input [`STKPC_WORD_MSB:0]           data_in;
    input [`STKPC_DSTACK_POINTER_MSB:0] addr_rd;
    input [`STKPC_DSTACK_POINTER_MSB:0] addr_wr;
    input                               rd_en;
    input                               wr_en;
    input                               clk;

    assign data_stack_0 = dstack[0];

    reg [`STKPC_WORD_MSB:0]   data_out;
    reg [`STKPC_DSTACK_MSB:0] dsta
29. ays updated.

2.1.2 Why Use a Stack Processor

A stack processor design was chosen for the assignment because of certain benefits it offers. Before explaining these benefits in depth, an overview of a simple stack processor architecture is needed. Figure 2.2 shows an example of a generic stack processor architecture, taken from [4].

[Figure 2.2: Generic Stack Processor Architecture [4]. Labelled blocks include DS (data stack), RS (return stack), address and program memory]

The figure shows some of the key elements of a stack processor:

• Data stack (DS): memory in charge of managing all the operands for the arithmetic and logic operations.
• Return stack (RS): memory in charge of storing subroutine return addresses.
• Top of stack (TOS): last element pushed onto the data stack.

The other elements are common to most processors: the program memory, program counter (PC), memory address register (MA), arithmetic logic unit (ALU) and data bus.

Stacks bring benefits in two areas: basic operand operations and subroutine calling. An example can explain basic operand operations. A simple addition is shown in Figure 2.3. The first step pushes the values that need to be added, 12 and 24. Once the values are stored in the data stack, the operation is issued. The processor takes the two values on the top of the data stack, adds them and places the result on the top of the data stack.

[Figure 2.3 residue: the data stack holds 12, then 24 above 12, then 36] PUS
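The addition example described above can be traced with a few lines of Python. This is only an illustrative sketch, not the processor's instruction set: the `run` function and the `("PUSH", value)` / `("ADD", None)` tuple format are hypothetical names chosen for this example.

```python
def run(program):
    """Execute a list of (op, arg) tuples on a data stack and return the stack."""
    data_stack = []
    for op, arg in program:
        if op == "PUSH":
            data_stack.append(arg)        # operand goes on top of the data stack
        elif op == "ADD":
            b = data_stack.pop()          # top of stack
            a = data_stack.pop()          # next of stack
            data_stack.append(a + b)      # result replaces both operands
        else:
            raise ValueError(f"unknown operation {op!r}")
    return data_stack


result = run([("PUSH", 12), ("PUSH", 24), ("ADD", None)])
print(result)  # [36]
```

Note that the add operation carries no operand fields of its own: both inputs and the destination are implied by the top of the stack, which is what keeps the instructions short.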
30. ble solution for the previously mentioned problem. Energy harvesting uses ambient energy sources, which are free most of the time; some examples are light, heat differentials, vibrating beams or transmitted RF signals. As promising as it may sound, energy harvesting devices generate only small amounts of energy, and they need a system to control their operation. To give better insight, some examples are shown [2]:

• Small solar panels can produce 100s of mW/cm² in direct sunlight and 100s of µW/cm²
• Piezoelectric devices using compression or deflection can produce 100s of µW/cm², depending on size and construction.
• RF energy harvesting collecting antennas can produce 100s of pW/cm².
• Seebeck devices using temperature gradients can generate 10s of µW/cm² working with body temperatures, or 10s of mW/cm² working with furnace exhaust stack temperatures.

Therefore, to offer a working solution, the CPU controlling the energy harvesting needs to work at ultra-low power levels. Otherwise, all of the energy generated by the system would be used by the CPU controlling the system.

The solution proposed is to use a stack processor to achieve an efficient, ultra-low-power system. Stack processors are not new. They were developed in the 1950s and are still being used, mainly due to their simplicity. Yet using them as a solution for energy harvesting systems is something that has so far not received any deep research. Previous stack
31. cessor used as a reference, the J1 Processor, is discussed. An overview of the Wishbone communication protocol and the design flow used at Atmel are also covered.

2.1 Stack Processors

Stacks have been used in computing for more than 50 years. The first proposal for using a stack was made in 1946, in the computer design of Alan M. Turing, as a tool for calling and returning from subroutines. Later, a formal proposal was made and a patent was obtained in 1957 by Klaus Samelson and Friedrich L. Bauer of Germany [8] [4] [5]. It is important to understand that a stack simplifies recursion and loops.

The popularity of stack processors has gone through ups and downs over time. The introduction of Very Large Scale Integration (VLSI) and Complex Instruction Set Computers (CISC) caused processor design to drift away from the stack processor, due to the long and comprehensive instructions. However, with the growth in popularity of Reduced Instruction Set Computers (RISC), which proposed a simple instruction set to achieve higher performance, stack processors became strong candidates for processor designs once again.

2.1.1 What is a Stack

A stack is one of the simplest ways of storing temporary information. It is an area of computer memory with a fixed origin but variable size. The data structure follows the concept of Last In, First Out (LIFO). In a LIFO data structure, the last data that comes into the stack (the top of stack) is the f
32. ck processors are able to manage subroutines in a very efficient manner. This opens the possibility of more complex code structures such as loops, conditional statements, etc.

From the previous examples, some points about stack processors can be summarized:

• Compact code: even though operand loads need to be done separately and the total number of instructions needed for a program is higher than for a normal RISC program, the reduced size of the instructions means that the total code size in bytes is smaller for a stack processor than for a register-file processor.
• Simple instruction set: the simplicity of the instructions allows a compiler to be built quickly, making the simulation and testing process faster and simpler.
• Simple return stack: enables recursion and subroutine execution.
• Simple data stacks: replace a complex cache system.

Everything has a downside, and stack processors also have some disadvantages. The stacks cannot be accessed randomly, so planning ahead is needed to obtain efficient code. Also, the instruction set used by stack processors cannot reference multiple registers the way RISC instruction sets can.

2.2 The J1 Processor

The J1 Processor is a very simple 16-bit, stack-based, single-cycle processor created by James Bowman. The J1 is not a general-purpose CPU; it was originally intended for FPGAs and to run the six Ethernet cameras in the Willow Garage PR2 robot [6]. The J1 uses a very compres
33. ck [0:`STKPC_DSTACK_SIZE];

    // Data update of data and return stack, pushing the top of stack
    always @(posedge clk) begin
        if (wr_en) begin
            dstack[addr_wr] = data_in;
        end
    end

    always @* begin
        data_out <= dstack[addr_rd];
    end
endmodule

7.1.4 Return Stack

module return_stack (
    // Outputs
    data_out, return_stack_0, return_stack_1, return_stack_2,
    // Inputs
    data_in, addr_rd, addr_wr, rd_en, wr_en, clk
);
    `include "stkpc_params.v"

    // Outputs
    output [`STKPC_WORD_MSB:0]   data_out;
    output [`STKPC_RSTACK_MSB:0] return_stack_0; // Used to monitor the stack
    output [`STKPC_RSTACK_MSB:0] return_stack_1;
    output [`STKPC_RSTACK_MSB:0] return_stack_2;

    // Inputs
    input [`STKPC_WORD_MSB:0] data_in;
    input [`STKPC_WORD_MSB:0] addr_rd;
    input [`STKPC_WORD_MSB:0] addr_wr;
    input                     rd_en;
    input                     wr_en;
    input                     clk;

    assign return_stack_0 = rstack[0];
    assign return_stack_1 = rstack[1];
    assign return_stack_2 = rstack[2];

    reg [`STKPC_WORD_MSB:0]   data_out;
    reg [`STKPC_RSTACK_MSB:0] rstack [0:`STKPC_RSTACK_SIZE];

    // Return update of data and return stack, pushing the top of stack
    always @(posedge clk) begin
        if (wr_en) begin
            rstack[addr_wr] = data_in;
        end
    end

    always @* begin
        data_out <= rstack[addr_rd];
    end
endmodule

7.1.5 Wishbone Master Module

module j1_wb_m_16 (
    // Outputs
    adr_o, dat_o, we_o, sel_o, stb_o, cyc_o, dat_i_cpu,
    // Inputs
    vat dyelk 1 dat ipetk
34. detail of this will be discussed in Section 3.2.

2.4 Design Flow

Implementation, simulation and testing of a design can become a very time-consuming process. Therefore a design flow is used to simplify the process. The goal of the design flow is to use high-level language scripts to build all the tools needed to simulate, test and document the design. The design flow was a work in progress that took place in parallel with the assignment implementation. The ideal design flow is shown in Figure 2.10.

[Figure 2.10: Ideal Design Flow. Blocks: Xweb description, Xweb tangle, Xweb weave, PDF generator (XSLT), Python spec snippets, Python template file, Verilog generator, ArchC generator, Verilog snippets, ArchC snippets, Verilog main files, ArchC main files, Verilog compiler, GNU assembler (gas), GNU linker, Verilog simulator]

The first step of the design flow, represented by section A in Figure 2.10, is to use DocBook documents with the xweb file extension. These files contain the description of the architecture and the instructions for the processor. The xweb files are used to generate the documentation PDF and, more importantly, multiple snippets of Python code. These small pieces of code are then combined to form the design template. Ideally, the design template is the base for generating two sets of files: the Verilog files for the RTL design and the ArchC files for implementing
35. e Figure 4.4 shows a case in which the communication between master and slaves has no delay. The test writes the value 4 to the first register of the I/O module; the value is then read back from the same register to verify that the content is correct. Recall that the design decodes the top 2 bits of the address to determine which module is addressed, which is why the address used is 49152 (1100 0000 0000 0000 in binary). The test instructions are the following:

lit 4         0x8004
lit 49152     0xC000
mem_wr        0x6123
lit 49152     0xC000
mem_rd        0x6C01
halt 15       0x6010

The signals in Figure 4.4 can be divided into 3 categories, from top to bottom: signals from the CPU, signals from the Wishbone bus master module, and signals from the Wishbone bus slave module. Every segment starts with the module's clock. The simulation shows the same signal activity as seen in Figure 2.8 and Figure 2.9 when doing a single write and read operation. The strobe signal stb_o and the cycle signal cyc_o are set high at the start of an external access and wait for the acknowledgement signal ack_i. After the first acknowledgment is received, the value 4 is written to the register reg_f0. The value is then successfully read and written to the top of stack, st0.

[Figure 4.4 waveform residue: traces for clk, reset, start_signal, pc, instr, _st0, st0 and st1 between 1100 ns and 1200 ns; the legible values include instr = 8004, 6123 and 6C01, st0 = 4000 and st1 = 0004]
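The literal encodings in the test listing can be reproduced with a short sketch. It assumes that a literal instruction is formed by setting bit 15 (opcode 1) and placing the value in the remaining 15 bits, and that values wider than 15 bits are simply truncated; the truncation rule and the names `encode_lit` and `decode_lit` are assumptions made for this example, not taken from the assembler.

```python
LIT_FLAG = 0x8000   # bit 15 set marks a literal instruction (opcode 1 = 1)
LIT_MASK = 0x7FFF   # the remaining 15 bits carry the value


def encode_lit(value):
    """Return the 16-bit instruction word for 'lit value' (value truncated to 15 bits)."""
    return LIT_FLAG | (value & LIT_MASK)


def decode_lit(word):
    """Return the value carried by a literal instruction word."""
    if not word & LIT_FLAG:
        raise ValueError("not a literal instruction")
    return word & LIT_MASK


print(hex(encode_lit(4)))       # 0x8004
print(hex(encode_lit(49152)))   # 0xc000
print(hex(decode_lit(0xC000)))  # 0x4000
```

Under these assumptions the word 0xC000 carries the value 0x4000, which appears consistent with the value 4000 seen on st0 in the waveform residue. The top two address bits are then 01 rather than 11, which still selects the I/O module because at least one of them is set.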
36. entation of the J1 Processor. The J1 architecture is documented, and the actual implementation was not time-consuming. This step of the assignment had four main contributions:

• Obtaining a more complete understanding of a stack processor architecture.
• Familiarization with the tools used in the assignment.
• Understanding of stack processor behavior.
• Building a reference processor against which the behavior of the new design could be compared.

The J1 documentation includes a Verilog file, which was used as the base for the reference implementation. The original J1 Verilog code was used to create a project that could be simulated within the design flow used in the project. The next step was to test the implementation using the five types of instructions available for the J1: literal, jump, conditional jump, call and ALU operations.

The complete description of the test benching and debugging process used in the assignment is given in Section 3.3.

3.2.2 Design of the Stack Processor

Once a working implementation of the J1 was completed, the J1 architecture was analyzed to identify any characteristics that could be used or removed in the new design. The characteristics derived from this process and used as guidelines for the new design were:

• Two pipeline stages on a 16-bit processor: the new processor would have the same instruction length and two pipeline stages instead of the single cycle of the J1.
• Instruction set: The
37. esign flow helps the assignment to be reusable and scalable. These two characteristics make compatibility with the design flow a priority. Detailed theory and insight on both the Internet of Things and energy harvesting are not within the scope of the assignment.

1.4 Report Organization

The report is divided into individual chapters, which are briefly described here for the reader's convenience:

• Chapter 1, Introduction: gives an overview of the Internet of Things, the motivation for this assignment, and the assignment tasks and limitations.
• Chapter 2, Background: describes the basic knowledge needed to fully understand the assignment report, including stack processors, the J1 processor, Wishbone communication and the Atmel design flow.
• Chapter 3, Implementation: covers the methodology, tools and implementation, starting from each individual element's point of view up to the complete unification of the system.
• Chapter 4, Results: shows the final design and the simulations used to verify the correct behavior of the system.
• Chapter 5, Discussion & Future Work: covers final thoughts on the assignment as well as possible optimizations of the design.

2 Background

The goal of this section is to give a brief summary of the basic theory necessary for accomplishing the present assignment. The first and foremost point covered is the stack processor, as that is the base for the assignment. Next, the pro
38. ess
assign pc_plus_1 = pc + 1;
assign io_wr     = ...;   // I/O and RAM write enable (right-hand side illegible)
assign io_addr   = st0;   // Top of stack as address
assign io_dout   = st1;   // Next of stack as data
assign io_access = ...;   // (right-hand side illegible)
assign read_data = ...;   // instr / read data is updated (right-hand side illegible)

// MUX SELECTS
reg [`STKPC_SEL_PC_MSB:0]      next_pc_mux_sel;
reg [`STKPC_SEL_ALU_A_MSB:0]   alu_a_mux_sel;
reg [`STKPC_SEL_ALU_B_MSB:0]   alu_b_mux_sel;
reg [`STKPC_SEL_DSP_MSB:0]     dsp_mux_sel;
reg [`STKPC_SEL_RSP_MSB:0]     rsp_mux_sel;
reg [`STKPC_SEL_RSTACK_MSB:0]  rstack_mux_sel;
reg [`STKPC_SEL_ST0_NEW_MSB:0] st0_mux_sel;
reg [`STKPC_SEL_ALUOP_MSB:0]   alu_op_mux_sel;
reg [`STKPC_FLOW_HALT_MSB:0]   halt_mux_sel;

// Stall generator
always @* begin
    if (reset == 1) begin
        stall <= 0;
    end
    if (ack_i || io_access == 0) begin
        stall <= 0;
    end
    else if ((stb_i == 1 || io_access == 1) && ack_i == 0) begin
        stall <= 1;
    end
end

// If stall, we stay in the current instruction
always @* begin
    if (stall == 1) begin
        _instr_addr <= pc;
    end
    else begin
        _instr_addr <= _pc;
    end
end

// MUX for PC
always @* begin
    if (start_signal == 0) begin // We wait for start signal
        _pc <= 0;
    end
    else begin
        case (next_pc_mux_sel)
            `STKPC_SEL_PC_COND_JMP: begin // Conditional jump
                if (st0 == 0) begin // If top of stack is 0, jump
                    _pc <= instr[`STKPC_ALU_TO_PC_MSB:0];
                end
                else begin
                    _pc
39. [Figure 3.4: Wishbone Bus Connection Diagram. Signals shown between master and slave: Address, Strobe, Select, Write Enable and Cycle, plus System Clock and System Reset]

Figure 3.4 shows a diagram of the connection between the Wishbone bus and the CPU. The CPU needed to pass three values: a valid address, the write enable signal and valid data. The write enable signal was used to determine whether the operation was a write (write enable = 1) or a read (write enable = 0). The CPU only needed to pass valid data in the case of a write operation. The Wishbone master received the information from the CPU and output the required signals depending on the operation, as shown in Figure 2.8 and Figure 2.9. The master then requested the required information from, or transferred it to, the slave.

Because both Wishbone interfaces had sequential logic, a minimum of one clock cycle of delay was added to wait for the operation to finish. Therefore a stall was needed. A stall generator was added to the CPU and works as follows. Whenever an instruction accesses an external module (I/O, RAM, etc.), a signal called io_access goes high to flag the external access. This notification sets the stall signal high. Because the decoding of the instruction was done using combinational logic, the stall was set high before the update of the registers, stopping the execution and preventing advancement t
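The stall mechanism described above can be sketched as a small cycle-by-cycle model. It is a simplification under stated assumptions: the stall is taken to be high exactly while an external access is pending and no acknowledgement has arrived, and the function names `stall_needed` and `run_cycles` are hypothetical.

```python
def stall_needed(io_access, ack):
    """Stall while an external access is in progress and not yet acknowledged."""
    return bool(io_access) and not ack


def run_cycles(trace):
    """trace: list of (io_access, ack) per clock cycle.

    Returns the program counter after each cycle; the PC only advances
    when the processor is not stalled.
    """
    pc = 0
    history = []
    for io_access, ack in trace:
        if not stall_needed(io_access, ack):
            pc += 1
        history.append(pc)
    return history


# One ordinary instruction, an external access acknowledged on its third
# cycle, then another ordinary instruction.
print(run_cycles([(0, 0), (1, 0), (1, 0), (1, 1), (0, 0)]))  # [1, 1, 1, 2, 3]
```

The PC holds its value for the two cycles in which the access is pending, which corresponds to the processor remaining on the same instruction until the bus cycle ends.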
40. for testing consisted of the instructions a basic stack processor needs. The instruction list can be found in Appendix 7.3.

[Figure 3.1: Design Process. Flowchart with the blocks Initial Constraints and Instruction Testing and a Correct Behavior decision with Yes/No branches]

3.1.2 Choice of Tools

Once the implementation steps were defined, the tools for carrying them out needed to be chosen. Considering that the assignment requires an RTL design, the first step was to choose a hardware description language. For this assignment Verilog was used. The tools can be grouped by the step of the assignment in which they were used:

• Design process: due to the simplicity of the processor, a simple text editor could be used to write all of the Verilog code needed for the design; for this project GNU Emacs [12] and Notepad [13] were used. For the simulation of the design, the decision was made to work with Icarus Verilog [14], a free, open-source Verilog simulator and synthesis tool for Linux. Icarus Verilog has all the capabilities needed to implement and test the designs done for this assignment. It relies on the command line and has no graphical interface, making it a very lightweight tool.
• Testing process: to view and analyze the waveforms generated by Icarus Verilog, GTKWave [15] was used. GTKWave is a free wave viewer available for Linux.
• Synthesis process: all the synthesis was done using the design flow of Atmel and
41. irst to be removed. Once a memory area is defined for the stack, a stack pointer is needed to point to the most recently referenced location on the stack. The stack pointer is normally implemented in the form of a hardware register [4]. One of the advantages of working with a stack is that only two operations can be used to modify it:

• Push: data is written to the location pointed to by the stack pointer, also known as the top of stack, and the stack pointer is updated according to the size of the data added to the stack.
• Pop: data at the location currently pointed to by the stack pointer is removed from the stack, and the stack pointer is updated according to the size of the data removed from the stack.

It is important that the stack pointer always references the top of stack and is always updated properly. Failure to do so will lead to loss of information and faulty execution of the processor. For this assignment, the data added to or removed from the stack will have the same size as a memory word. Therefore the update of the stack pointers will always consist of an addition or subtraction of one.

[Figure 2.1: LIFO PUSH and POP Operations]

Figure 2.1 depicts an example of using the push and pop instructions. In this instance the value 12 is first pushed onto the stack and then later popped off the stack. Notice how the stack pointer (Top) is alw
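The push and pop operations and the unit update of the stack pointer can be illustrated with a minimal sketch. The class name `Stack`, the default depth of 8 and the convention that the pointer holds the index of the next free slot are assumptions made for this example; a hardware implementation may instead point directly at the top element.

```python
class Stack:
    """Fixed-size LIFO memory with an explicit stack pointer (illustrative)."""

    def __init__(self, depth=8):
        self.mem = [0] * depth   # memory area reserved for the stack
        self.sp = 0              # assumed convention: index of the next free slot

    def push(self, value):
        if self.sp == len(self.mem):
            raise OverflowError("stack overflow")
        self.mem[self.sp] = value
        self.sp += 1             # one word per element, so a unit increment

    def pop(self):
        if self.sp == 0:
            raise IndexError("stack underflow")
        self.sp -= 1             # unit decrement
        return self.mem[self.sp]

    def top(self):
        """Return the most recently pushed value, or None if the stack is empty."""
        return self.mem[self.sp - 1] if self.sp else None


s = Stack()
s.push(12)
print(s.top(), s.sp)   # 12 1
print(s.pop(), s.sp)   # 12 0
```

If the pointer were not updated on every push and pop, a later push would overwrite live data or a pop would return a stale word, which is the loss of information the text warns about.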
42. is allowed easier reading of the code and simplified the testing overall. The resulting assembler instructions can be found in Appendix 7.3.

Two types of files were needed to run the testing flow, as shown in Figure 3.6: the RTL files, which contained the design of the processor in Verilog code, and the assembly file, which contained the instructions to be executed along with the expected result in the processor registers. The assembly file was written in the custom assembly language implemented using ArchC. Next, the RTL files were compiled using Icarus Verilog [14], creating an executable file. The assembly file was assembled to obtain a Verilog memory file (VMEM). Both the executable and the VMEM file were used to run the simulation.

[Figure 3.6: Testing Flow Diagram]

The simulation created two files: a log file and a Value Change Dump (VCD) file. The log file contained the register values at the end of the simulation. This file was compared with the initial assembly file to determine whether the instruction execution finished as expected. If not, an error was indicated. The VCD file was used for debugging. It could be opened in GTKWave, and the waveforms could be analyzed to find any discrepancies or bugs.

Every instruction was tested individually with a specific test. To make testing even faster, a script was made to run every instruction test with a single command. The resulting testing flow had the advantage of being automa
43. ishbone bus was selected for this assignment to provide communication between the stack processor and any external I/O. The following information was taken from the OpenCores Wishbone User Manual [8].

[Figure 2.7: Master and Slave Wishbone Interfaces [8]. Signals shown include RST_I, ADR_O/ADR_I, DAT_I/DAT_O, WE_O/WE_I, SEL_O/SEL_I, STB_O/STB_I, CYC_O/CYC_I and the user-defined TAGN_O/TAGN_I, with the SysCon module providing reset]

The Wishbone bus is a popular open-source hardware computer bus, which makes it convenient to work with because many examples and much documentation exist. The aim of the Wishbone bus is to allow the connection of different components inside a chip, which suits the assignment well. The Wishbone bus is a parallel bus that can work with different bus widths, including 8, 16, 32 and 64 bits, and follows a master/slave topology, as shown in Figure 2.7. This project uses the 16-bit width because the J1 is a 16-bit processor.

2.3.1 Wishbone Signals

The communication is based on a clock and multiple signals. The signal names are standardized and can be divided into three categories: signals common to both the slave and master interfaces, signals exclusive to the master interface, and signals exclusive to the slave interface. Descri
44. <= pc_plus_1;
                end
            end
            `STKPC_SEL_PC_INC: begin // PC increment
                _pc <= pc_plus_1;
            end
            `STKPC_SEL_PC_INSTR: begin // JMP or CALL
                _pc <= instr[`STKPC_ALU_TO_PC_MSB:0];
            end
            `STKPC_SEL_PC_RSTACK: begin // EXIT of a routine
                _pc <= rst0[`STKPC_RSTACK_PC_MSB:`STKPC_RSTACK_PC_LSB];
            end
            default: _pc <= pc;
        endcase
    end
end

// MUX for ALU A
always @* begin
    if (start_signal == 0) begin // We wait for start signal
        alu_a <= 0;
    end
    else begin
        case (alu_a_mux_sel)
            `STKPC_SEL_ALU_A_ST0: begin // Top of data stack
                alu_a <= st0;
            end
            `STKPC_SEL_ALU_A_ST1: begin // Next of data stack
                alu_a <= st1;
            end
            `STKPC_SEL_ALU_A_RST0: begin // Top of return stack
                alu_a <= rst0;
            end
            `STKPC_SEL_ALU_A_RSP: begin // Pointer to return stack, for checking stack depth
                alu_a <= rsp << 8; // Shift left to place rsp in the high bits and dsp in the low bits
            end
            default: alu_a <= alu_a;
        endcase
    end
end

// MUX for ALU B
always @* begin
    if (start_signal == 0) begin // We wait for start signal
        alu_b <= 0;
    end
    else begin
        case (alu_b_mux_sel)
            `STKPC_SEL_ALU_B_ZERO: begin // Zeros
                alu_b <= `STKPC_ZERO_16BITS;
            end
            `STKPC_SEL_ALU_B_ONES: begin // Ones
                alu_b <= `STKPC_ONES_16BITS;
            end
            `STKPC_SEL_ALU_B_ST0: begin // Top of data stack
                alu_b <= st0;
            end
            `STKPC
45. mp instruction does if the top of the stack is equal to 0; otherwise the jump is not performed.
• Call: saves the program counter to the return stack and then modifies the program counter to point to a subroutine.
• ALU: covers all stack operations (duplicate, over, swap, etc.) and basic ALU operations (addition, logical OR, negation, etc.).

[Figure 3.2: Instruction Decoding. Bit layouts (bits 15 to 0) of the instruction types, showing the opcode 1, opcode 2 and opcode 3 fields together with the literal value or jump target]

All instructions were divided into three opcodes, as shown in Figure 3.2. The first opcode was used specifically to identify literal instructions. If opcode 1 had a value of 1, the remaining part of the instruction was taken as the value to be used by the literal instruction, and opcode 2 and opcode 3 did not need to be decoded. If opcode 1 was equal to 0, opcode 2 was used to determine the instruction type. Finally, opcode 3 was only used in the case of an ALU instruction type. Opcode 3 determined which specific ALU operation (addition, logical OR, etc.) or Forth instruction (swap, duplicate, etc.) was to be used.

3.2.4 Initial Architecture Design

Once the design parameters and the instruction set description were finished, the architectural design could take place. The stack processor architecture is g
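The first levels of the decoding described above can be sketched as a field splitter. The sketch assumes that opcode 1 is bit 15 and that opcode 2 occupies the next two bits (14 to 13), as the ArchC format definitions in the appendix appear to suggest; the width and position of opcode 3 are not assumed, so the remaining 13 bits are returned undivided. The function name `split_fields` and the dictionary keys are hypothetical.

```python
def split_fields(instr):
    """Split a 16-bit instruction word into its top-level fields (illustrative)."""
    instr &= 0xFFFF
    opcode1 = (instr >> 15) & 0x1
    if opcode1 == 1:
        # Literal: everything below bit 15 is the value; no further decoding needed.
        return {"opcode1": 1, "value": instr & 0x7FFF}
    # Non-literal: opcode 2 (assumed to be bits 14:13) selects the instruction type.
    opcode2 = (instr >> 13) & 0x3
    return {"opcode1": 0, "opcode2": opcode2, "rest": instr & 0x1FFF}


print(split_fields(0x8004))  # {'opcode1': 1, 'value': 4}
print(split_fields(0x6123))  # {'opcode1': 0, 'opcode2': 3, 'rest': 291}
```

Checking opcode 1 first means a literal can use 15 of the 16 bits for its value, while the other instruction types give up two more bits to opcode 2.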
46. n should be written. The stack processor will be used as part of a bigger project; therefore it should be designed and implemented to be compatible with previous and future work. Compatibility will have a high priority in the project. The architecture should be simulated and tested. If time allows, the performance and power consumption of the system should be compared when implemented with a regular standard cell library as well as with an ultra-low-voltage library.

Abstract

The fast evolution of the Internet of Things suggests an unavoidable transition to this infrastructure in the near future. To achieve this, multiple nodes need to interconnect and communicate efficiently. All nodes will need a power source to operate, and most of them will have very low power consumption requirements. A possible solution is therefore to equip the nodes with an energy harvesting system. The energy harvesting systems will need a CPU to control all operations and to manage the power consumption. The goal of this assignment is to create a base processor capable of controlling the system using ultra-low levels of power. The proposed approach for the assignment is to use a stack processor. Using the J1 processor as a reference, a new architecture was designed. The design process followed the design flow tools used by Atmel and covered the simulation, testing, synthesis and place and route processes. The end result of the assignment was a functi
47. nect and communicate between them and the internet. To give a bit of perspective, refer to Figure 1.1, provided in a study by Cisco [1]. The number of connected devices has already surpassed the world population, and bear in mind that 30 years ago an internet connection was not commercially available to the general public.

                               2003          2010           2015          2020
World population               6.3 billion   6.8 billion    7.2 billion   7.6 billion
Connected devices              500 million   12.5 billion   25 billion    50 billion
Connected devices per person   0.08          1.84           3.47          6.58

Source: Cisco IBSG, April 2011

Figure 1.1: The Internet of Things Evolution [1]

1.2 Motivation: Energy Harvesting

The previous figure raises the following question: what are the consequences and problems, in terms of energy, that emerge when there are so many connected devices? All the connected devices need an energy source. Some can have a wired connection and others can be outfitted with a battery. Yet in many cases a wired connection is not possible, and having a battery brings up the problem of maintenance. Changing batteries in some devices can be extremely hard or impossible. An example of this would be sensors used in industry. These normally do not need to be active at all times. Usually a very short duty cycle is used, and therefore small amounts of energy should be enough to keep them operational. Energy harvesting is a possi
48. new processor should be able to execute the same five instruction types as the J1: literal, jump, conditional jump, call and ALU instructions. However, the instruction format would need to be modified because of the next point.
• Removal of flag bits from instruction decoding: the J1 processor uses bits from the instruction code to determine certain behaviors for each instruction (Figure 2.6). The new design would not use any of these flag bits in the instruction decoding. All behavior would be determined by the opcode of each instruction. This would make it possible to have a greater number of instructions, or even to modify the instruction size in the future.
• Redesign of instruction decoding logic: the new design would be modified to simplify the addition of new instructions. The instruction decoding logic of the new design would use multiplexers to obtain a more organized architecture.
• Hierarchy rearrangement: the new design would need a modular hierarchy to simplify the debugging process and integration into the design flow.

3.2.3 Instruction Set Description

Once the guidelines for the processor were set, the next step was to define the new instruction set structure. There were five possible instruction types:

• Literal: pushes a value directly onto the top of the stack; equivalent to a PUSH instruction.
• Jump: modifies the program counter to a given value, moving to a specific part of the program.
• Conditional Jump: Performs as Ju
49. nformation about the synthesis and was used to verify that a proper synthesis process was carried out.

[Figure 3.7: Synthesis Flow. Inputs shown: standard cell libraries (.lib, .db), RTL code, I/O, RAM and timing constraints, and the constraints file (.sdc)]

The second part of this section focuses on the different ways in which the RAM could be connected, depending on the target of the synthesis. The original design had a single dual-port RAM that was used for both program and data memory, as shown in Figure 3.8. For the ASIC implementation, the dual-port RAM had to be replaced by two single-port RAMs, one for the program memory and the other for the data memory. The needed modifications were made to the RTL code to make it possible to toggle between the original design and the ASIC design. A third possibility was to implement an FPGA design that used two dual-port RAMs, separating data and program memory but using the second port of each RAM for debugging. This last possibility has not been implemented and will be discussed in Section 5.

[Figure 3.8 residue: three RAM configurations. Original design: one dual-port program/data memory (ports A and B). ASIC design: separate single-port program and data memories. FPGA design: separate dual-port program and data memories, with port B of each used for debugging]
50. ntil the start signal is set. As long as the initial reset signal is set to 1, the start signal will be set to 0. Only when the reset signal is set to 0 will the start signal be set to 1, and it remains set to 1 as long as there is no reset.
2. Instruction Fetch: Using the PC as the instruction address, the next instruction is read from the program memory in the RAM and passed to the processor. At this point all possible values for the muxes are ready.
3. Pipeline Second Stage: This stage is in charge of instruction decoding, execution, and write back.
   a. Instruction Decoding: The instruction is passed to the decoder, which decodes it and outputs the corresponding select signals to each mux.
   b. Execute: All of the muxes now output a valid value. The ALU uses these values to calculate the needed result.
   c. Write Back: All updated values are ready to be passed to their respective registers. A clocked update takes place and all registers that need to be modified are updated.
It is important to note which part of the behavior was sequential (clocked) and which was combinational (not clocked). The only sequential part of the process is the final update of the registers. All of the remaining logic is combinational, and this allowed the architecture to be single cycled. The updating of the registers depended on the instruction being executed. A brief description of the behavior of every instruction can be found in Appendix 7.3. The RAM
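The split between combinational logic and the single clocked register update can be illustrated with a small behavioral model. This is only a sketch in Python, not the thesis RTL: the names State, next_state, and run are hypothetical, and the "PC + 1" next-PC value stands in for whatever the next-PC mux would actually select.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Registers updated on the clock edge (hypothetical subset)."""
    start: int = 0
    pc: int = 0


def next_state(state, reset):
    """Combinational part: compute the next register values.

    Assumption: the PC simply increments; the real design picks the
    next PC through a mux driven by the instruction decoder.
    """
    if reset:
        # While reset is high the start signal is held at 0.
        return State(start=0, pc=0)
    # Registers may only advance once the start signal is already set.
    new_pc = state.pc + 1 if state.start else state.pc
    return State(start=1, pc=new_pc)


def run(resets):
    """Sequential part: one clocked update per entry in `resets`."""
    state = State()
    trace = []
    for reset in resets:
        state = next_state(state, reset)
        trace.append((state.start, state.pc))
    return trace


if __name__ == "__main__":
    # Two cycles of reset, then three cycles of normal operation.
    print(run([1, 1, 0, 0, 0]))
```

In this model the PC stays at zero on the first cycle after reset is released, because the start signal only becomes visible to the logic one clock later.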
51. o the next instruction. The stall generation continuously monitored the Wishbone acknowledgement and strobe signals, waiting for the end of the bus cycle, which indicated that the communication exchange was finished.

The Wishbone bus can be used to access both the I/O module and the RAM module. The decoding that decides which module is addressed checks the 2 most significant bits of the address: if either bit is 1, the address is meant for the I/O module; otherwise it is meant for the RAM. The remaining 14 bits are used to address the respective module. The use of 2 bits allows the future addition of different I/O modules.

3.2.6 System Integration

The last step of the implementation was connecting the stack processor to a peripheral to test correct communication over the Wishbone bus. Before this point the design was not organized with a proper hierarchy; therefore the following hierarchy was chosen. The purpose of this restructuring was to simplify testing, debugging, and integration into the design flow.

Figure 3.5: Stack Processor Hierarchy

Figure 3.5 shows the hierarchy organization used for the assignment. The goal of the hierarchy was to provide modularity and flexibility, giving the project the opportunity to use a different RAM module or connect to a different peripheral in the future using the same code. The CPU had both the Data Stack and the Return Stack, each in an individual module separated from the logic of the processor. The
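The address decoding rule described above (2 most significant bits select the module, 14 bits address it) can be written down as a short Python sketch. The function name decode_address and the returned tuple layout are illustrative assumptions, not part of the thesis code.

```python
ADDR_BITS = 16      # width of the address, per the 16-bit word size
SELECT_BITS = 2     # most significant bits used to pick the module
OFFSET_BITS = ADDR_BITS - SELECT_BITS  # 14 bits left for the module


def decode_address(addr):
    """Split a 16-bit address into (target, select, offset).

    target is "io" if either of the two top bits is 1, else "ram".
    select keeps the raw 2-bit value, which could later distinguish
    several I/O modules (a hypothetical extension).
    """
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address must fit in 16 bits")
    select = addr >> OFFSET_BITS
    offset = addr & ((1 << OFFSET_BITS) - 1)
    target = "io" if select != 0 else "ram"
    return target, select, offset


if __name__ == "__main__":
    for a in (0x0010, 0x4001, 0xC000):
        print(hex(a), decode_address(a))
```

Keeping the raw select value next to the target mirrors the remark that two bits leave room for more than one I/O module.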
52. ols needed to explore and verify a new architecture automatically, like assemblers, simulators, linkers, and debuggers [9]. An architecture description using ArchC is divided into two parts:
• Instruction Set Architecture (AC_ISA): Includes the instruction formats, sizes, and names, the information needed to decode instructions, and their respective behavior.
• Architecture Resources (AC_ARCH): Contains information about storage devices, pipeline structure, and all the structure of the architecture.
For the present assignment both the AC_ISA and AC_ARCH files mentioned above were created to generate an assembly language for the stack processor architecture. The resulting assembly language simplified the testing phase of the project. For further details consult the ArchC User Manual [9].

3 Implementation

The implementation of a stack processor system was the main task of this assignment. This chapter provides information on the steps taken to reach the final design, starting from the methodology, through the implementation of each individual element, and finally the integration of the final system.

3.1 Methodology

This section explains the steps taken to implement the design and mentions and justifies the tools used for the assignment.

3.1.1 Development Basis and Organization

The first step in the development of a stack processor was to decide whether to use an existing processor
53. onal stack processor system with the capability to communicate with I/O modules using a Wishbone bus. A custom assembler was created using ArchC to simplify the testing of the architecture. The design was simulated, synthesized, and routed using specific libraries from Atmel. The assignment completed a working design flow that will allow a proper power analysis to be carried out in the next phase of development. The stack processor architecture shows high potential for ultra-low-power operation. Further timing and power analysis is needed for a complete comparison with other processors.

Preface

This thesis is submitted to the Norwegian University of Science and Technology, in cooperation with Atmel Corporation, as a requirement for the fulfillment of the European Master in Embedded Computer Systems (EMECS) degree. This work has been performed at the Atmel office in Trondheim, Norway, under the supervision of Ronan Barzic, and in association with the Department of Electronics and Telecommunications at NTNU, with Prof. Snorre Aunet as the university supervisor.

Acknowledgements

I would like to extend my gratitude to my supervisors, Ronan Barzic and Prof. Snorre Aunet, for their guidance and support through the entire process; to all of the staff, instructors, and my friends in the Erasmus Mundus Embedded Computing Systems program; and finally to my family and girlfriend for all their love and unconditional support.

Table of Contents
54. pdf
[9] The ArchC Team, The ArchC Architecture Description Language v2.0, Reference Manual. Campinas, Brazil: The ArchC Team, 2007. [Online]. http://archc.sourceforge.net/doc.html
[10] The ArchC Team, The ArchC Language Support & Tools for Automatic Generation of Binary Utilities v2.1, User Manual. The ArchC Team, 2011. [Online]. http://archc.sourceforge.net/doc.html
[11] The ArchC Team. (2008) ArchC. [Online]. http://archc.sourceforge.net
[12] Free Software Foundation. (1998) GNU Operating System. [Online]. http://www.gnu.org/software/emacs
[13] Don Ho. (2011) Notepad++. [Online]. http://notepad-plus-plus.org
[14] Stephen Williams. (2005) Icarus Verilog. [Online]. http://iverilog.icarus.com
[15] sourceforge.net. (2014) Welcome to GTKWave. [Online]. http://gtkwave.sourceforge.net
[16] Synopsys. (2014) Design Compiler 2010. [Online]. http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages/default.aspx
[17] Cadence Design Systems. (2014) Encounter RTL Compiler. [Online]. http://www.cadence.com/products/ld/rtl_compiler/pages/default.aspx
[18] Chris Bailey and Mark Shannon, Global Stack Allocation. University of York, 2006. [Online]. http://www.complang.tuwien.ac.at/anton/euroforth2006/papers/shannon.pdf

7 Appendix

7.1 Final RTL Code

The most relevant parts of the code are shown in this part of the appendix; all other code is included in the digital appendix for
55. pected on DAT_I.
• STB_O: Strobe output; indicates a valid data transfer cycle.
• TGA_O: Address tag type; provides extra information associated with ADR_O.
• TGC_O: Cycle tag type; provides extra information associated with the bus cycle.
• WE_O: Write enable output; indicates whether the current cycle is a read or a write.

2.3.2 Wishbone Operation

The Wishbone bus has multiple operating modes. The present assignment focuses on the standard single read cycle (Figure 2.8) and single write cycle (Figure 2.9). It is also important to note that the data sent or received in each transfer has the same width as the bus itself (16 bits). Both read and write are explained using the signals relevant to the project [8].

2.3.2.1 Single Read Cycle

The following description is a summary of the information found in the OpenCores Wishbone manual. It explains how the read cycle works on a Wishbone interface. The explanation is separated by clock edges for practical purposes. Figure 2.8 represents this bus transaction as well.

Clock Edge 0
• Master presents a valid address on ADR_O.
• Master negates WE_O to indicate a read cycle.
• Master asserts CYC_O to indicate the start of the cycle.
• Master asserts STB_O to indicate the start of the phase.
Clock Edge 1
• Slave presents valid data on DAT_O.
• Slave asserts ACK_I in response to STB_O to indicate valid data.
• Master monitors ACK_I and prepares to latch data on DAT_I.
Clock Edge 2
• Master latches data on
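The read handshake listed above can be traced step by step with a small behavioral model in Python. This is a sketch under stated assumptions, not the thesis RTL: the slave is modeled as a plain dictionary that always answers after one edge, the function name single_read is hypothetical, and the final step (master negating STB and CYC, slave negating ACK) follows the classic single read cycle of the Wishbone specification rather than text visible in this excerpt.

```python
def single_read(slave_memory, address):
    """Trace a Wishbone classic single read as three bus snapshots.

    Returns (latched_data, trace), where trace holds the bus signal
    values seen at clock edges 0, 1 and 2.
    """
    trace = []

    # Clock edge 0: master drives the address, WE low (read),
    # and asserts CYC and STB to open the cycle.
    bus = {"ADR": address, "WE": 0, "CYC": 1, "STB": 1,
           "ACK": 0, "DAT": None}
    trace.append(dict(bus))

    # Clock edge 1: slave presents data and asserts ACK
    # (assumption: a slave with no wait states).
    bus["DAT"] = slave_memory[address]
    bus["ACK"] = 1
    trace.append(dict(bus))

    # Clock edge 2: master latches the data and closes the cycle;
    # the slave drops ACK in response.
    latched = bus["DAT"]
    bus.update(STB=0, CYC=0, ACK=0)
    trace.append(dict(bus))

    return latched, trace


if __name__ == "__main__":
    value, steps = single_read({0x10: 0xBEEF}, 0x10)
    print(hex(value))
    for edge, snapshot in enumerate(steps):
        print("edge", edge, snapshot)
```

A slave that needs extra time would simply hold ACK low for more edges, which is the situation the stall logic of the CPU has to wait out.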
56. peripheral shown in Figure 3.5 was implemented by adding a register file to the Wishbone slave interface. A delay generator was also added to the slave interface to create scenarios in which the slave needed to stall the CPU for multiple clock cycles. Finally, all elements of the design were connected together, creating the final design for this assignment. All of the final code can be found in Appendix 7.1.

3.3 Testing Process

This section explains the testing used throughout the design process and the testing done on the final design. A simple overview can be seen in the Methodology section of Chapter 3. The basis of all testing done in the assignment was the set of instructions shown in Appendix 7.3. If the expected behavior of an instruction was known, then the waveform of a design simulation could be compared to the expected behavior. If any discrepancies existed, the needed modifications were made to the design. This process was repeated for every instruction, giving the basis for the final design obtained in this assignment.

The initial testing was done by simply hardcoding the Verilog memory files and simulating them with the RTL design. This process was slow and repetitive, so the decision was made to create a test flow and a script to make the process more practical and less time consuming. The first step was to develop custom assembly code for the processor using ArchC. Th
57. practical purposes.

7.1.1 CPU

module cpu (/*AUTOARG*/
   // Outputs
   read_data, io_access, instr_addr, io_wr, io_addr, io_dout,
   // Inputs
   ram_addr, clk, cs, reset, io_din, instr, ramrd, ack_i, stb_i
   );
`include "stkpc_params.v"
   input [4:0]                ram_addr;   // Ram address
   input                      clk;        // Clock
   input                      reset;      // Reset
   input                      cs;         // Chip select
   input [STKPC_WORD_MSB:0]   io_din;     // I/O Data in
   input [STKPC_WORD_MSB:0]   instr;      // Instruction from Ram
   input [STKPC_WORD_MSB:0]   ramrd;      // Ram read input
   input                      ack_i;      // Wishbone acknowledge signal
   input                      stb_i;      // Wishbone strobe to signal valid cycle
   output [STKPC_INST_MSB:0]  read_data;  // For test bench
   output [STKPC_PC_MSB:0]    instr_addr; // Next instruction address
   output                     io_wr;      // 1 = IO Write, 0 = IO Read
   output                     io_access;  // Signal access to IO for WB
   output [STKPC_WORD_MSB:0]  io_addr;    // Output address to IO/RAM
   output [STKPC_WORD_MSB:0]  io_dout;    // Output data to IO/RAM
   // Signals for monitoring data and return stack

   // ALU Regs
   reg [STKPC_WORD_MSB:0]     alu_a;
   reg [STKPC_WORD_MSB:0]     alu_b;
   wire [STKPC_WORD_MSB:0]    alu_c;

   // Data Stack Regs
   reg [STKPC_DSTACK_POINTER_MSB:0] dsp;  // Data stack pointer
   reg [STKPC_DSTACK_POINTER_MSB:0] _dsp; // New data stack pointer
   reg [STKPC_DSTACK_MSB:0]         st0;  // Top of stack data (T)
RAM reg reg
58. processor implementations give the advantage of having proper documentation that can help in the implementation process.

1.3 Assignment Interpretation

Following the assignment description and the guidelines given by the tutors, the following main tasks were identified (all mandatory tasks were completed):
Task 1 (mandatory): Design a stack processor system including RAM, communication bus, and I/O module.
Task 2 (mandatory): Test and simulate the system.
Task 3 (mandatory): Successfully synthesize the design.
Task 4 (mandatory): Perform place and route of the design.
Task 5 (optional): Perform power analysis.
Task 6 (optional): Load the design to an FPGA board.
Task 7 (optional): Perform energy consumption measurements and compare the results with other processors.

The above task list was drawn up by both the supervisors and the student after the initial contract; some differences may exist with respect to the initial problem description, but these were the final tasks approved by the supervisors. This assignment sets the foundations for the energy harvesting system; therefore, proper documentation and implementation of the stack processor are the highest priority. It is important to mention that all the work done in the assignment needs to be compatible with the design flow used at Atmel. Considerable time was needed to learn and become familiar with the design flow, though this was not part of the tasks listed for the assignment. The d
59. ptions of the master's signals and the common signals follow. Slave signals are omitted due to their similarity to the master signals. For a more complete description of the signals, please refer to the User Manual [8].

Common Signals for Slave and Master
• CLK_I: Clock input from the system clock; used for synchronizing all activities within the Wishbone bus.
• RST_I: Reset input from the system; causes the Wishbone interface to restart.
• DAT_I: Data input array; used to pass binary data.
• DAT_O: Data output array; used to pass binary data.
• TGD_I: Data tag type; used to provide more information associated with DAT_I.
• TGD_O: Data tag type; used to provide more information associated with DAT_O.

Master's Signals
• ACK_I: Acknowledgement input; assertion of this signal indicates termination of the bus cycle.
• ADR_O: Address output array; used to pass a binary address.
• CYC_O: Cycle bus output; assertion of this signal indicates that a valid bus cycle is in progress.
• STALL_I: Pipeline stall signal; indicates that the current slave cannot accept the transfer. Only used in pipeline mode.
• ERR_I: Error input; used to identify an abnormal cycle termination.
• LOCK_O: Lock output; when asserted, makes the current bus cycle uninterruptible.
• RTY_I: Retry input; indicates that the interface is not ready to operate and the cycle must be retried.
• SEL_O: Select output array; used to indicate where valid data is ex
60. raphically represented in Figure 3.3 using a block diagram.

[Figure 3.3: Initial Architecture Diagram. Labels: Instruction Decoder, Interconnections]

Before explaining the behavior of the architecture, some quick notes regarding Figure 3.3 are needed:
• The RAM, Data Stack, Return Stack, and Regs modules are drawn twice for practical purposes, though each pair represents the same module.
• The Regs module represents all registers used in the architecture. For a complete list of registers refer to cpu.v. Some of the most relevant registers are:
  o Program Counter (PC)
  o Top of Data Stack
  o Next in Data Stack
  o Data Stack Pointer
  o Top of Return Stack
  o Return Stack Pointer
• The Muxes module consists of several muxes:
  o Next PC Mux
  o Next Data Stack Pointer Mux
  o Next Top of Stack Mux
  o Next ALU A Operand Mux
  o Next ALU B Operand Mux
  o Next ALU Operation Mux
  o Next Return Stack Pointer Mux
  o Next Top of Return Stack Mux
The previous notes were given to simplify the block diagram. Even though this is a two-cycle architecture, its behavior is explained below as a simplified sequence of steps:
1. Reset and Start Signal: The system should start with a reset, setting all initial values to zero. To ensure that the system cannot advance or update any variable before the reset is over, a start signal was implemented. The system will not start operating u
61. rules.
6. Placement: The first placement is timing driven.
7. Clock Tree Building: This step uses the clock tree definition file as a reference.
8. Optimization: An optional optimization step exists before the routing.
9. Routing: The routing process causes changes such as buffer insertion or timing modification.
10. Optimization: This step tries to fix any timing problems generated by the routing.
11. Generate: The last step generates three files:
   a. Netlist
   b. Standard Parasitic Exchange Format (SPEF) file: contains timing information of the design.
   c. Design Exchange Format (DEF) file: represents the layout of the design.
The last step generates the files needed for timing and power analysis; however, due to time constraints, these analyses were not performed for this assignment. The process is discussed in Section 5.

4 Results

The goal of this section is to show the final design of the assignment, the modifications to the original design, and the results obtained with the final design.

4.1 Final Design

The final stack processor system was synthesized to target an ASIC module. The final system architecture diagram can be seen in Figure 4.1.

Figure 4.1: Final Design Architecture

The main change from the original design was the replacement of the dual-port RAM with two single-port RAMs that separate the data memory and the program memory. The separation of the Data Stack and Return Stack into
62. sed instruction set that makes it ideal for applications that need high throughput, such as uncompressed video streaming. The J1's simple design, lightweight code, and capability for high throughput make it a great candidate to use as a starting point for the design of an ultra-low-power stack processor. Figure 2.5 shows the basic architecture diagram of the J1, which consists of a data stack (D), return stack (R), random access memory unit (RAM), decoder unit, and arithmetic unit [7]. The first three are shown twice to simplify the graphical representation. The J1 was designed for programs written in Forth, and the implementation of Forth instructions like duplicate, swap, drop, etc. is simplified. To achieve this, the instruction decoding uses specific fields as flags for certain events, as Figure 2.6 shows. Specific flags include modifying the data or return stack pointer, pushing or popping the top of the data or return stack, and more. For further information, please refer to the J1 paper [7].

Figure 2.5: J1 Architecture Diagram [7] (R: top of return stack, T: top of data stack, N: next of data stack)

field | width | action
T→N | 1 | copy T to N
R→PC | 1 | copy R to the PC
T→R | 1 | copy T to R
dstack± | 2 | signed increment data stack
rstack± | 2 | signed increment return stack
N→[T] | 1 | RAM write

Figure 2.6: J1 ALU Instruction Decoding [7] (bits 15 to 0)

2.3 Wishbone Bus

The W
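The flag fields of the J1 ALU instruction can be unpacked with a few shifts and masks. The Python sketch below is illustrative only: the bit positions (bits 15 to 13 equal to 011 for an ALU instruction, bit 12 R→PC, bits 11 to 8 the ALU operation, bit 7 T→N, bit 6 T→R, bit 5 N→[T], bits 3 to 2 rstack±, bits 1 to 0 dstack±) are taken from the published J1 design rather than from the garbled figure above, and the function and key names are hypothetical.

```python
def signed2(field):
    """Interpret a 2-bit field as a signed increment (-2 .. +1)."""
    return field - 4 if field & 0b10 else field


def decode_j1_alu(insn):
    """Split a 16-bit J1 ALU instruction into its flag fields."""
    if not 0 <= insn <= 0xFFFF:
        raise ValueError("instruction must fit in 16 bits")
    if (insn >> 13) != 0b011:
        raise ValueError("not a J1 ALU instruction")
    return {
        "r_to_pc": (insn >> 12) & 1,    # copy R to the PC
        "alu_op": (insn >> 8) & 0xF,    # selects the new T
        "t_to_n": (insn >> 7) & 1,      # copy T to N
        "t_to_r": (insn >> 6) & 1,      # copy T to R
        "n_to_mem": (insn >> 5) & 1,    # RAM write
        "rstack": signed2((insn >> 2) & 0b11),  # return stack delta
        "dstack": signed2(insn & 0b11),         # data stack delta
    }


if __name__ == "__main__":
    # 0x6081 and 0x700C are the J1 encodings commonly given for
    # "dup" and ";" (return) respectively.
    print(decode_j1_alu(0x6081))
    print(decode_j1_alu(0x700C))
```

This is the flag-based decoding that the new design replaces with opcode-only decoding, so the sketch serves as a reference point for that comparison.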
63. separate modules was done in the design process. Originally the stacks would be part of the CPU logic. Tasks 1 through 4 from Section 1.3 were completed successfully. The stack processor system was implemented, designed, tested, simulated, and synthesized, and place and route was performed. Even though it was not among the assignment tasks, a practical test flow was created for the system. The implemented stack processor is able to successfully execute all the instruction types mentioned in Section 3.2.3 and in Appendix 7.3. The system is capable of communicating with the implemented I/O modules over the Wishbone bus using single read and write 16-bit operations; this is demonstrated in Section 4.2. The system was designed to simplify the addition of new instructions.

4.2 Simulations

Some basic instruction execution is shown to demonstrate the proper behavior of the system. The first case consisted of pushing values to the stack using Literal type instructions and then executing an ALU addition instruction. Figure 4.2 shows the waveform resulting from the simulation of the addition test. Most signals are self-explanatory, with the exception of _st0 (New Top of Stack), st0 (Top of Stack), st1 (Next After Top of Stack), and dsp (Data Stack Pointer). The first part of the simulation to notice is the reset and start signals. Execution will not start until start_signal has been asserted. The instructions of the test are: lit 4 0x
64. ssignment included a learning curve in adapting to the methodology used by the company. The design, testing, synthesis, and place and route processes had to be automated using scripts. This assignment required understanding an elaborate design flow used by Atmel; learning to use the design flow and incorporating the final design into it took time. The original design evolved through the assignment and changed in response to the testing process. Overall, a successful implementation of the stack processor system was achieved. Some suggestions covering possibilities for future work and optimization of the final design follow.

5.1 Power Analysis

For a proper power analysis in the future, several steps after the place and route process are required in order to get useful information. Figure 5.1 shows some of the steps that need to take place after the place and route. A back-annotated simulation flow checks whether any design constraints are violated by changes made in the place and route and synthesis processes. The back-annotated simulation would use the netlist from the place and route and the SPEF file with the timing constraints generated by the place and route. A power analysis should be done in parallel with a timing analysis, using the same netlist and timing constraints. Finally, for a successful power analysis, correct stimulus is needed. This can be given using a Switching Activity Interchange Format (SAIF) file, which contains the
65. t 0
         rsp <= 0;
         halt_r <= 0;
      end else if (stall == 0) begin
         dsp <= _dsp;
         pc <= _pc;
         st0 <= _st0;
         rsp <= _rsp;
         halt_r <= halt;
      end
   end

   // start signal to initiate after initial reset
   always @(posedge clk or posedge reset) begin
      if (reset == 1'b1) begin
         start_signal <= 0;
      end else begin
         start_signal <= 1;
      end
   end
endmodule // cpu

7.1.2 Dual Port RAM

module ram2ports (/*AUTOARG*/
   // Outputs
   data_out_a, data_out_b,
   // Inputs
   clock, addr_a, addr_b, data_in_a, data_in_b, cs_a, we_a, cs_b, we_b
   );
   parameter WORD_WIDTH = 16;
   parameter ADDR_WIDTH = 8;
   parameter RAM_SIZE = 1 << ADDR_WIDTH;
   localparam WORD_MSB = WORD_WIDTH - 1;
   localparam ADDR_MSB = ADDR_WIDTH - 1;

   output [WORD_MSB:0] data_out_a;
   output [WORD_MSB:0] data_out_b;
   input               clock;
   // Port A
   input [ADDR_MSB:0]  addr_a;
   input [WORD_MSB:0]  data_in_a;
   input               cs_a;
   input               we_a;
   // Port B
   input [ADDR_MSB:0]  addr_b;
   input [WORD_MSB:0]  data_in_b;
   input               cs_b;
   input               we_b;

   reg [WORD_MSB:0]    data_out_a;
   reg [WORD_MSB:0]    data_out_b;
   reg [WORD_MSB:0]    mem0 [0:RAM_SIZE-1];
   reg [ADDR_MSB:0]    addr_a_latched;
   reg [ADDR_MSB:0]    addr_b_latched;
   event               dbg_addr_a;

   // Update of address latches
   always @(posedge clock) begin
      if (cs_a) begin
         addr
66. t adr_o_cpu
         dat_o <= dat_o_cpu;
      end
   end

   // The data from the WB bus is passed to the CPU
   always @(*) begin
      dat_i_cpu <= dat_i;
   end

   always @(*) begin
      if (io_access == 1 && we_i_cpu == 1) begin
         we_o <= we_i_cpu;
      end else if (io_access && we_i_cpu == 0) begin
         we_o <= we_i_cpu;
      end
   end

   always @(posedge clk_i) begin
      if (rst_i == 1'b1) begin
         stb_o <= 0;
         cyc_o <= 0;
      end else begin
         if (io_access == 1 && ack_i == 0) begin
            // If slave is free, cycle start
            sel_o[0] <= 1;
            cyc_o <= 1;
            stb_o <= 1;
         end else if (ack_i == 1) begin
            // If slave is done, finish cycle
            stb_o <= 0;
            cyc_o <= 0;
         end
      end
   end
endmodule // wb_m_16

7.1.6 Wishbone Slave Module

module wb_s_16 (/*AUTOARG*/
   // Outputs
   dat_o, ack_o, err_o, rty_o, reg_f0, reg_f1, reg_f2, reg_f3, reg_f4, reg_f5,
   // Inputs
   rst_i, clk_i, dat_i, adr_i, cyc_i, we_i, stb_i, sel_i, cs
   );
`include "stkpc_params.v"
   // Outputs WB Signals
   output [WB_DAT_O_MSB:0] dat_o;  // Data out
   output                  ack_o;  // Acknowledgement out
   output                  rty_o;  // Ready out
   output                  err_o;  // Error out, not used for simple design
   output [WB_DAT_O_MSB:0] reg_f0; // Outputs used to monitor the register file
   output [WB_DAT_O_MSB:0] reg_f1;
   output [WB_DAT_O_MSB:0] reg_f2;
   output [WB_DAT_O_MSB:0] reg_f3;
   output [WB_DAT_O_MSB:0] reg_f4;
   output [WB_DAT_O_M
67. ted in a simple flow. On the downside, it did not cover the corner cases, and it was neither exhaustive nor randomized.

3.4 Synthesis Process

The synthesis allows moving from the RTL code to a netlist of gates. This section first covers the synthesis flow used in the assignment. The RAM connection to the system can vary for different targets, such as an FPGA or an ASIC. The differences between these two synthesis possibilities and the original design are also discussed.

The synthesis process used the Design Compiler tool by Synopsys. A TCL script was used to automate and simplify the process. As shown in Figure 3.7, Design Compiler needed at least three inputs to be able to generate the required outputs:
• RTL Code: Verilog files (.v) that describe the processor architecture and the overall design.
• Libraries: A .lib file and a .db file; together they describe all the standard cells to be used for the synthesis and the information needed for the RAM and I/O modules, if used.
• Timing Constraints File: Defines the timing requirements the system needs to fulfill. It also includes the definition of the clock and the clock period.
These inputs were given to Design Compiler, which then performed the synthesis. The results of the synthesis were three files: a report, the netlist, and a constraints file (.sdc). The last two were used as inputs to the place and route process. The report had i
68. xternal module
Write to an external module
Returns a 0 if 2 top values of data stack are different, else returns 0xFFFF
no operation

Name | Mnemonic | Type | Description
Less Than | n_lt_t | ALU | Returns 0xFFFF if second of data stack is smaller than top of stack, else returns 0.
Logical Left Shift | n_sl_t | ALU | Second of data stack is shifted left by the value given by top of stack.
Logical Right Shift | n_sr_t | ALU | Second of data stack is shifted right by the value given by top of stack.
Unsigned Less Than | n_ult_t | ALU | Returns 0xFFFF if second of data stack is smaller than top of stack, else returns 0. Uses unsigned values.
Over operation | over | ALU | Value in second of data stack is copied to the top of stack.
Return Stack Copy | rs_cp | ALU | Copies the top of return stack to top of data stack.
Return Stack Push | rs_push | ALU | Pushes the value of top of data stack to the return stack.
Return Stack Pop | rs_pop | ALU | Drops the top of return stack.
Stack Depth | stk_dep | ALU | Top of stack is given the value of the data stack pointer in the top 8 bits and the value of the return stack pointer in the lower 8 bits.
Swap Operation | swap | ALU | Swaps the second of data stack with the top of data stack.
Subtract One | sub1 | ALU | Subtracts 1 from top of data stack.
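The result values of several of these ALU operations can be checked with a small 16-bit reference model. The Python below is a sketch only: the operation keys ("eq", "lt", and so on) are hypothetical labels rather than the assembler mnemonics, stack push and pop side effects are left out, and the handling of shift amounts of 16 or more is an assumption (the result is simply truncated to 16 bits).

```python
MASK = 0xFFFF  # 16-bit word


def to_signed(value):
    """Interpret a 16-bit word as a two's complement number."""
    value &= MASK
    return value - 0x10000 if value & 0x8000 else value


def flag(condition):
    """Comparison results are all ones (0xFFFF) or zero."""
    return MASK if condition else 0


def alu(op, t, n=0, dsp=0, rsp=0):
    """Return the new top of stack for one ALU operation.

    t is the top of the data stack, n the second element.
    """
    t &= MASK
    n &= MASK
    if op == "eq":
        return flag(n == t)
    if op == "lt":
        return flag(to_signed(n) < to_signed(t))
    if op == "ult":
        return flag(n < t)
    if op == "sl":
        return (n << t) & MASK
    if op == "sr":
        return n >> t
    if op == "over":
        return n
    if op == "sub1":
        return (t - 1) & MASK
    if op == "stk_dep":
        # Data stack pointer in the top 8 bits, return stack
        # pointer in the lower 8 bits.
        return ((dsp & 0xFF) << 8) | (rsp & 0xFF)
    raise ValueError("unknown operation: " + op)


if __name__ == "__main__":
    print(hex(alu("lt", t=1, n=0xFFFF)))   # signed compare
    print(hex(alu("ult", t=1, n=0xFFFF)))  # unsigned compare
    print(hex(alu("stk_dep", t=0, dsp=3, rsp=2)))
```

The signed and unsigned comparisons differ only in how the operands are interpreted, which is why 0xFFFF counts as smaller than 1 in one case and larger in the other.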
69. zzz
[Waveform residue: return_stack_1[15:0] = 0004, return_stack_2[15:0] = zzzz]

Figure 4.3: Call Simulation

The goal of the test is to use the CALL instruction to jump to a specific part of the code and then return with the EXIT instruction. The CALL instruction stores the return address in the return registers. A behavior similar to that of the data stack takes place: the first location to be used in the return stack is not location zero, due to the updating of the Return Stack Pointer (rsp). By monitoring the Top of Stack (st0) and Program Counter (pc) signals, it is possible to see how the CALL instruction modifies the pc to jump to the respective program address while the previous address is stored in the return stack. The value is logically shifted left once before being stored; therefore the value stored in return_stack_1 is 4, which is 2 logically shifted left. The program returns to the original address after executing the EXIT instruction. The instructions that push the values 6 and 7 are jumped over and the instruction that pushes the value 10 is not executed, showing the program returning to the correct location in time. The JUMP and CONDITIONAL JUMP instruction types showed similarly successful behavior and their waveforms are not shown.

The communication between the CPU and the peripheral using single read and write operations over the Wishbone bus was completed successfully in two cases: with delay from the slave and without delay from the slav
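The return stack behavior in this test can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the names call and exit_ are hypothetical, the return stack pointer is assumed to be incremented before the write (which is why slot 0 stays unused), and EXIT is assumed to undo the left shift with a right shift, as the J1 does; the excerpt itself only states that the stored value is shifted left once.

```python
WORD_MASK = 0xFFFF


def call(return_addr, target, rstack, rsp):
    """Model CALL: store the shifted return address, jump to target.

    Returns the new (pc, rsp) pair.
    """
    rsp += 1                                      # pointer moves first
    rstack[rsp] = (return_addr << 1) & WORD_MASK  # stored shifted left
    return target, rsp


def exit_(rstack, rsp):
    """Model EXIT: recover the return address and pop the stack.

    Assumption: the stored value is shifted right to undo CALL.
    """
    pc = rstack[rsp] >> 1
    return pc, rsp - 1


if __name__ == "__main__":
    rstack = [None] * 4   # None marks a slot that was never written
    pc, rsp = call(return_addr=2, target=8, rstack=rstack, rsp=0)
    print(pc, rsp, rstack)
    pc, rsp = exit_(rstack, rsp)
    print(pc, rsp)
```

With a return address of 2, slot 1 of the return stack holds 4 after the call, matching the value reported for return_stack_1 in the simulation.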
