PIDs are mapped to ports on the network switch as follows:

Port   PID
PORTA  0
PORTB  1
PORTC  2
PORTD  3

#define BTx   0b10000000
#define BAck  0b01000000
#define PTx   0b00100000
#define PAck  0b00010000
#define Dmask 0b11111100
#define UBYTE unsigned char

// receive buffer size; actual data stored is 1 less, since the ring buffer needs a 1-byte sentinel
#define RBUFSIZE 5

// this pragma ensures these globals won't go into registers
#pragma regalloc-
aos_semaphore sem_net_tx;    // Semaphore for writing to the network
aos_semaphore sem_net_rx;    // Semaphore for reading from the network
aos_semaphore sem_net_rbuf;  // Semaphore for reading from the receive buffer
UBYTE rbuf[16][RBUFSIZE];    // allocate ringbuffer storage per tag/address pair
ringbuffer rb[16];           // allocate ringbuffer per tag/address pair
#pragma regalloc+

// Initializes the network interface; must be called at startup, before the switch starts reading
void net_init(void)
{
    int i;
    PORTC = 0x00;
    DDRC = 0x33;
    sem_net_tx = (aos_semaphore) aos_sem_create(1);
    sem_net_rx = (aos_semaphore) aos_sem_create(1);
    sem_net_rbuf = (aos_semaphore) aos_sem_create(1);
    for (i = 0; i < 16; i++)
        rbinit(&rb[i], rbuf[i], RBUFSIZE);   // init ring buffers
    // start listener thread here
}

#pragma regalloc-
static UBYTE txstate = 0, TAck = BAck;
#pragma regalloc+

// Performs a blocking send.
// Parameters: Taddr - tag/address of recipient (2 bits of tag and
2. cE SE txstate 6 if ready to transmit something txready 0 cound robin arbitration to select from which dest tag queue to transmit For iC It aK dr EI txq txq 1 amp 0x3 select next dest tag for this processor see if it has anything in queue with selected tag if rbisempty rb txq lt lt 2 procnum txready 1 txstate 0 Taddrdata rbget rb txq lt lt 2 procnum get addr data value Taddr UBYTE Taddrdata gt gt 8 Tdata UBYTE Taddrdata break txready if COMPIN amp PAck TAck 38 TAck TAck PAck if txstate 0 COMPORT COMPORT amp Dmask Taddr lt lt 2 put source addr crumb txstatett else if txstate 1 COMPORT COMPORT amp Dmask txq lt lt 2 put dest tag crumb txstatett else if txstate lt 4 put data crumb COMPORT COMPORT amp Dmask Tdata gt gt 8 txstate lt lt 1 amp Oxc txstatett else COMPORT COMPORT amp Dmask Tdata lt lt 2 amp Oxc put data crumb txstate 6 COMPORT COMPORT BTx toggle the transmit signal undef COMPORT undef COMPIN This is the ringbuffer h header file that defines the circular buffer data structure This data structure is used in both the library code and the network switch code with one minor change the data type in the library is one byte while the
// puts an element into the buffer
// returns 0 if buffer is full, otherwise returns 1
inline unsigned char rbput(ringbuffer* rb, rbdatatype c)
{
    unsigned char newtail;
    newtail = (rb->tail + 1) % rb->length;   // calculate where the new tail would be
    // if the new tail would make the buffer look empty, the buffer is full
    if (newtail == rb->head)
        return 0;
    else
    {
        rb->buf[rb->tail] = c;   // store the data
        rb->tail = newtail;      // move the tail pointer forward with wraparound
        return 1;
    }
}

Appendix D: Test Programs

The protocol test code compiles under WinAVR with avr-gcc. This code was used to test the first version of the protocol. It is interrupt based and allows timing to be adjusted easily.

// Bus controller tester
// Protocol version 1.0: 4-wire bidirectional bus
#include <inttypes.h>
#include <avr/io.h>
#include <avr/interrupt.h>
#include <avr/sleep.h>

int time = 500;
int count = 0;
unsigned char data = 0;
unsigned char buf;
enum {ready, receive, display};
char state = ready;
const unsigned char addr = 1;

ISR(TIMER0_COMP_vect)
{
    if (--time <= 0)
    {
        time = 1;   // every "time" milliseconds
        /********** receive **********/
        if (bit_is_set(PINC, PINC7))            // got BTx
        {
            state = receive;
            buf = (PINC << 4) & 0xf0;           // read upper nibble from data lines
            PORTC = 0b00010000;                 // raise PAck
        }
        if (state == receive && bit_is_clear(PINC, PINC7))   // got BTx low
        {
            buf |= PINC &
1. Microcontroller-Based Systems ..... 5
2. Multiprocessor Systems ..... 6
3. Communication Networks ..... 7
II. Project Goals ..... 9
III. Hardware Design ..... 10
1. Components ..... 10
2. Evaluation of Communication Options ..... 11
3. Network Architecture Version 1 ..... 12
4. Network Architecture Version 2 ..... 14
5. Network Architecture Version 3 ..... 16
V. Software Design ..... 17
1. Library Design ..... 17
2. Integration into aOS ..... 19
3. Network Switch ..... 21
4. Considerations for Client Applications ..... 21
VI. Testing ..... 22
VII. Results ..... 23
VIII. Conclusion ..... 24
Appendix A: User Manual ..... 25
1. Getting Started ..... 25
2. API Communication Routines ..... 26
3. Memory Management ..... 28
Appendix B: Demo Programs ..... 29
Appendix C: Network Code ..... 32
Appendix D: Test Programs ..... 41
Appendix E: System Photographs ..... 46
References ..... 47
//                  2 bits of addr: 0bTTAA
//          data  - byte of data to be sent
void send_wait(UBYTE Taddr, UBYTE data)
{
    UBYTE notdone = 1;
    PORTA = 0x02;
    aos_wait(sem_net_tx);
    PORTA = 0x03;
    while (notdone)
    {
        if ((UBYTE)((UBYTE)PINC & (UBYTE)BAck) == (UBYTE)TAck)
        {
            TAck = TAck ^ BAck;
            if (txstate == 0)
            {
                PORTA = 0x04;
                PORTC = (UBYTE)(((UBYTE)PORTC & (UBYTE)Dmask) | ((UBYTE)Taddr & (UBYTE)0x3));          // put addr crumb
                txstate++;
            }
            else if (txstate == 1)
            {
                PORTA = 0x05;
                PORTC = (UBYTE)(((UBYTE)PORTC & (UBYTE)Dmask) | ((UBYTE)(Taddr >> 2) & (UBYTE)0x3));   // put tag crumb
                txstate++;
            }
            else
            {
                // put data crumb; MSBs go first
                PORTC = (UBYTE)(((UBYTE)PORTC & (UBYTE)Dmask) | ((UBYTE)(data >> ((5 - txstate) << 1)) & (UBYTE)0x3));
                txstate = (txstate >= 5) ? 0 : txstate + 1;   // do 4 iterations of data
            }
            if (txstate == 0)
                notdone = 0;
            PORTC = (UBYTE)(PORTC ^ (UBYTE)PTx);   // toggle the transmit signal
        }
    }
    aos_Signal(sem_net_tx);
}

#pragma regalloc-
static UBYTE rxstate = 0, RAck = 0;
#pragma regalloc+

// Performs a blocking receive.
// Modifies global rb[(tag << 2) | addr], which will contain the received data.
void receive_wait(void)
{
    UBYTE notdone = 1;
    UBYTE Raddr, Rtag, Rdata;
    aos_wait(sem_net_rx);
    while (notdone)
    {
        if ((UBYTE)((UBYTE)PINC & (UBYTE)BTx) != (UBYTE)RAck)   // if signal was toggled
        {
            RAck = (UBYTE)(RAck ^ (UBYTE)BTx);   // store the last Ack st
6. Oxff 16 MHz CLK 64 prescaler 250 counter 1000 ticks sec TIMSK 2 turn on timer 0 compare match ISR OCRO 250 Compare match interrupt every 1 ms TCCRO 0b00001011 prescaler 64 clear on match sei while wait 0 while 1 PORTD PINC RRR KKK KKK KK KKK receive RRR kkk kkk kkk kk if UCHAR UCHAR PINC amp UCHAR BTx UCHAR RAck if signal was toggled RAck UCHAR RAck UCHAR BTx store the last Ack state toggle if rxstate 0 Raddr UCHAR PINC gt gt 2 amp UCHAR 0x3 read address crumb rxstatett else read data crumb buf UCHAR buf lt lt 2 UCHAR UCHAR PINC gt gt 2 amp UCHAR 0x3 rxstate rxstate gt 4 0 rxstate 1 do 4 iterations of data LAKE display echo BRK KKK KK KK KK RK if rxstate 0 dispcount if UCHAR buf UCHAR data 1 PORTD UCHAR 0xaa else PORTD UCHAR 0x66 PORTD buf PORTD dispcount gt gt 8 amp Oxff PORTC UCHAR PORTC UCHAR PAck toggle the Ack 44 tt transmit RRR KKK KKK KK KKK if UCHAR UCHAR PINC amp UCHAR BAck UCHAR TAck TAck TAck BAck if txstate 0 PORTC UCHAR UCHAR PORTC amp UCHAR Dmask UCHAR UCHAR Taddr amp UCHAR 0x3 put addr crumb txstatett else PORTC UCHAR UCHAR PORTC amp UCHAR Dmask UCHAR UCHAR b
is the ringbuffer.c source code for the circular buffer implementation.

// Ring (circular) buffer implementation
// Author: Kalim Moghul (kfm9@cornell.edu)
//
// Invariants:
//   head == tail iff buffer is empty
//   If buffer is not empty, head points at next valid element to be consumed
//   tail always points at the next empty element
//     => There is always one unused element in a full buffer
//     => length must be greater than 1

// initializes the given ringbuffer with the supplied array and its length
inline void rbinit(ringbuffer* rb, rbdatatype* array, unsigned char length)
{
    rb->buf = array;
    rb->length = length;
    rb->head = rb->tail = 0;
}

// returns boolean true if the ringbuffer is empty, false otherwise
inline unsigned char rbisempty(ringbuffer* rb)
{
    return rb->head == rb->tail;
}

// returns boolean true if the ringbuffer is full, false otherwise
inline unsigned char rbisfull(ringbuffer* rb)
{
    return (rb->tail + 1) % rb->length == rb->head;
}

// consumes an element from the buffer
// returns NULL if buffer is empty, or a pointer to the array element otherwise
inline rbdatatype* rbget(ringbuffer* rb)
{
    rbdatatype* ret;
    if (rb->head == rb->tail)
        return NULL;
    else
    {
        ret = &rb->buf[rb->head];               // index into the array at the head pointer
        // move head pointer forward one element, with wraparound
        rb->head = (rb->head + 1) % rb->length;
        return ret;
    }
}

// puts an element into the buffer
8. 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ret loop on non blocking receive ret net_recv 1 1 1 buf stop ret top ret HH aK top ret PORTA buf 0 printf Result 1 d r n buf 0 printf Result 2 d r n buf 1 printf Result 3 d r n buf 2 while 1 et net_recv 2 2 1 buf et net_recv 3 3 1 buf This program demonstrated transmission of large data types over a byte oriented network and static load balancing using the PID A partial approximation to pi is computed in parallel in each node and aggregated in the master This program is based on an MPI version provided at 3 Cluster Demo Approximates define PID on AOS with network support Pi include lt mega32 h gt include lt delay h gt include lt stdio h gt include aos h include user_tasks h include aos_core c include aos_task c include aos_semaphore c include aos_mbox c include aos_uart c include aos_network c UBYTE 0 define NUMPROCS UBYTE 4 void task1 void omit if no semaphores or mail used omit if no mailboxes omit if you don t need UART 30 PO DD RA For status LEDs RTA Oxff Oxff USART Init UCSRB 0x18 UBRRL 103 net_init PORTA 0x00 delay_ms 100 program if PID UBYTE 0 Node 0 MASTER float PI 3 141592653589793238462643 float mypi pi h sum x int n
9. 20 i j k n number of approximation intervals UBYTE c buf sizeof float 3 printf Approximation to Pi f r n PI send n to all workers c UBYTE amp n get a byte pointer to the int PORTA Oxff putsf Sending to Pl r net_send 1 0 sizeof int c putsf Sending to P2 r net_send 2 0 sizeof int c putsf Sending to P3 r net_send 3 0 sizeof int c PORTA Ox0f do the partial sum to the approximation h 1 0 float n sum 0 0 for i PID 1 i lt n i NUMPROCS x h double i 0 5 sum 4 0 1 0 x x mypi h sum printf my partial sum is f r n mypi wait for workers to submit partial results PORTA Oxf0 for i 0 i lt sizeof float 3 itt receive_wait should receive 3 floats byte by byte putsf Received a byte r PORTA Oxaa retrieve the partial results from the buffers putsf Pulling from buffers r i is k 0 while i j k sizeof float 3 should receive 3 floats addr 1 starts at buf i net_recv 1 255 sizeof float i buf i addr 2 offset by 1 float j net_recv 2 255 sizeof float j buf sizeof float j addr 3 offset by 2 floats k net_recv 3 255 sizeof float k buf sizeof float sizeof float k 31 sum up results pi mypi float buf float buf sizeof float float buf sizeof float
DESIGN OF A MESSAGE PASSING INTERFACE FOR MULTIPROCESSING WITH ATMEL MICROCONTROLLERS

A Design Project Report Presented to the Engineering Division of the Graduate School of Cornell University in Partial Fulfillment of the Requirements of the Degree of Master of Engineering (Electrical)

by Kalim Moghul

Project Advisor: Dr. Bruce R. Land
Degree Date: May 2006

Abstract

Master of Electrical Engineering Program
Cornell University
Design Project Report

Project Title: Design of a Message Passing Interface for Multiprocessing with Atmel Microcontrollers
Author: Kalim Moghul

Abstract: Microcontrollers are versatile integrated circuits typically incorporating a microprocessor, memory, and I/O ports on a single chip. These self-contained units are central to embedded systems design, where low-cost, dedicated processors are often preferred over general-purpose processors. Embedded designs may require several microcontrollers, each with a specific task. Efficient communication is central to such systems. The focus of this project is the design of a communication layer and application programming interface for exchanging messages among microcontrollers. In order to demonstrate the library, a small-scale cluster computer is constructed using Atmel ATmega32 microcontrollers as processing nodes and an ATmega16 microcontroller for message routing. The communication library is integrated into aOS, a preemptive multitasking real-time operating system for Atmel microcontrollers.
11. NOUT PORTC 0x00 DDRC 0x30 DDRD Oxff PORTD Oxff 16 MHz CLK 64 prescaler 250 counter 1000 ticks sec TIMSK 2 turn on timer 0 compare match ISR OCRO 250 Compare match interrupt every 1 ms TCCRO 0b00001011 prescaler 64 clear on match sei while 1 PORTD PINC Version 2 of the protocol was tested with this code Since Version 3 is a simple extension of Version 2 no additional tester code was required This code runs in a while loop and allows the throughput to be measured Network switch tester Protocol version 2 0 2x2 wire unidirectional buses fast version include lt inttypes h gt include lt avr io h gt include lt avr interrupt h gt define BTx 0b10000000 define BAck 0b01000000 define PTx 0b00100000 define P ck 0b00010000 define Dmask 0b11111100 define UCHAR unsigned char int time unsigned c unsigned c unsigned c unsigned c unsigned c const unsi unsigned i unsigned c ISR TIMERO if 500 har txstate har rxstate har data 0 har buf har Raddr gned char Taddr nt dispcount 0 har wait 1 1 0 0 TAck RAck BAck 0 1 _COMP_vect time lt 0 43 int main using PORTC for bus communication Pin Signal Direction 7 ER IN 6 BAck IN TES PTx OUT 4 PAck OUT 3 DIn1 IN 2 DInO IN 1 DOut1 OUT 0 DOutO OUT PORTC 0x00 DDRC 0x33 DDRD Oxff PORTD
12. Ox0Of read lower nibble from data lines PORTC 0x00 drop all signals count 0 state display RRR KKK KKK KK KK display echo 41 RRR kkk kkk kkk kk if state display PORTD buf count time 30 if count 3 count 0 state ready RRR KKK KKK KK KKK transmit RRR KKK KKK KK KKK if state ready if count 0 PORTC 0600100000 datat count else if count 1 if bit_is_set PINC DDRC 0x3f vraise PReq data to be sent PINC6 got BAck set data lines as outputs drop PReq raise PTx and send addr PORTC 0b00010000 count else if count 2 if bit_is_clear PINC PORTC 0600000110 PORTC data gt gt 4 County else if count 3 if bit_is_set PINC addr amp 0x0f PINC6 got BAck low drop PTx and send data 0110 amp Ox0f PINC6 got BAck PORTC 0600011010 raise PTx and send data 1010 PORTC 0b00010000 count else if count 4 if bit_is_clear PINC DDRC 0x30 PORTC 0x00 count 0 int main data amp Ox0f PINC6 got BAck low set data lines as inputs drop all signals 42 using PORTC for bus communication Pin Signal Direction 7 BTX IN 6 BAck IN 5 PReq OUT 1 4 PTx PAck OUT 3 Data3 INOUT 2 Data2 INOUT 1 Datal INOUT 0 Data0d I
This is a high-level function to transmit variable-length messages. Sending is a blocking operation, meaning that the calling task cannot continue until all bytes are transmitted. The address parameter specifies the target node according to the hardware address assignments, and tag specifies which tag at the target node will receive the data. The length parameter specifies how many units of data are transmitted from buffer, which holds the message. The implementation uses the unsigned char type for these parameters, thus allowing 255 tags on 255 addresses and messages up to 255 bytes in length.

net_recv(UBYTE address, UBYTE tag, UBYTE length, UBYTE* buffer)

This is a high-level function to receive data. The address and tag parameters correspond to the node that sent the data and the local tag that is expected to contain the data, respectively. The length parameter is the maximum number of bytes to receive. Data is copied into the memory location specified by buffer, and the function returns a number indicating how many bytes were actually received. A value of 255 can be specified for address, which will retrieve data from any address. The same applies to tag. This functionality is useful for collecting data from multiple senders, tags, or both, when the ordering of messages does not matter. The high-level receive function is non-blocking and will return 0 if no data is ready.

send_wait(UBYTE Taddr, UBYTE data)

In order to send data, net_send calls send_wait in a loop until all bytes are sent. The Taddr parameter contains the tag and the address combined, and data is the byte to be sent. The lower 2 bits of Taddr specify the address; the next 2 bits are the tag. send_wait is protected by a mutex, which limits access to the transmission control and data lines to one thread at a time. The function waits until it can enter the critical section before running the network protocol state machine. The address, tag, and data are shifted out 2 bits at a time, from most significant bits to least significant. After transmission of a byte is complete, the mutex lock is released.

receive_wait()

The receive_wait function is not called directly by net_recv, but must be called either by another thread or by the calling thread before net_recv can be expected to return any data. The low-level receive function simply receives one byte of data and buffers it internally, irrespective of address or tag. The function and the buffers are also protected by mutexes, which guarantees that competing threads will not violate the protocol. When data is expected in a well-defined manner, an application would only need to call the low-level receive function enough times to ensure that all data was received. The high-level receive takes care of sorting by tag and address. However, in some systems the quantity and frequency of data may not be known.
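As a brief illustration of how these routines fit together, the sketch below sends a 16-bit value from node 1 to tag 0 on node 0, following the same pattern as the demo programs in Appendix B. The variable names are only for illustration, and the receiver drives the low-level receive itself rather than relying on the optional listener thread.

// On node 1 (sender): transmit an int, one byte at a time, to address 0, tag 0
int n = 1234;
net_send(0, 0, sizeof(int), (UBYTE*) &n);

// On node 0 (receiver): run the low-level receive, then pull the buffered bytes out by sender and tag
int m;
UBYTE got = 0;
while (got < sizeof(int))
{
    receive_wait();                                            // buffers one byte internally
    got += net_recv(1, 0, sizeof(int) - got, ((UBYTE*) &m) + got);
}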
The protocol can be modified to support any fixed address, tag, and data lengths (odd lengths are zero-padded) as a characteristic of the data link layer of a specific implementation. Support for variable-length messages can be added by implementing a higher-level protocol in software.

V. Software Design

1. Library Design

The main software component of this project is the message-passing library that provides client applications with an interface to the network. The API is heavily influenced by the widely used Message Passing Interface (MPI) standard [3][4]. Developers familiar with MPI will be able to understand the interface conventions easily. Also, existing applications can be ported to the API with minimal effort. At the highest level, the library provides the C API functions for initialization and for sending and receiving messages. The first function to be called by a client application is void net_init(), which contains implementation-specific initialization as well as the default library initialization. This procedure first sets the proper I/O pin directions and output levels. In this design project, the library was used to extend the aOS operating system, which was written by Anssi Ylatalo [2] and extended by Bruce Land [7]. Since the OS is preemptive, mutual exclusion (mutex) semaphores must be used to make the library thread-safe. Routines specific to aOS are used to initialize these semaphores.
programs should be written to avoid scenarios where deadlock may occur. In particular, since send functions do not return until their matching receives are executed, nodes can become stuck in the send function. It is up to the programmer to ensure that this scenario does not occur, by matching sends with receives.

3. Memory Management

The aOS kernel occupies a large portion of the SRAM, which limits the stack space available for running threads. In order to keep the network add-on lightweight, the implementation is conservative in its memory usage. Three semaphores and four global variables are needed at a minimum. All addresses, tags, and data arguments are unsigned bytes. The network design allows for larger types, but these limitations are reasonable for small systems. The circular buffers that are used to sort incoming data are variable in size and add as little as 96 bytes to global memory consumption. The data stack size can be decreased in CodeVisionAVR to allow room for tasks. A data stack of approximately 100 bytes is recommended.

Appendix B: Demo Programs

This program sends the same array of numbers from the master node to three workers, which sum the array. The result is sent back to the master and displayed over the serial port. Array transmission and PID-specific code are demonstrated.

// Cluster Demo on aOS with network support
// Sums an array
#include <mega32.h>
#include <delay.h>
#include <stdio.h>
#in
ary wires and 10-conductor cable.

2. Evaluation of Communication Options

The Atmel microcontrollers come with a small selection of built-in communication options. In order to select the best interconnect, each of these options was evaluated against implementing an alternative communication protocol. The criteria for evaluation were scalability, number of pins required, and throughput.

The first option available on the Atmel chips is the serial peripheral interface (SPI). This interface requires at least four conductors, provides full-duplex operation, and is synchronous. SPI defines one node as a master and the other nodes as slaves. Two wires provide the send and receive lines. The master is responsible for generating the clock signal on a dedicated line and for sending a signal to a slave on its select line, thus initiating communication. The clock frequency is limited to the main oscillator frequency divided by 4, or 16 MHz / 4 = 4 MHz. One bit can be sampled per SPI clock period, thus allowing a maximum throughput of 4 Mbits/sec / 8 bits/byte = 500 KB/sec. For connecting two microcontrollers, SPI seems to be a good choice: it provides high throughput and uses few port pins. Since this is a byte-oriented protocol, though, there would need to be at least one byte of overhead for logical addressing. This would effectively halve the transfer rate. Also, the scheme does not scale well. Since there can be only one master, microcontrollers must take turns initiating communication.
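For reference, a minimal SPI master configuration on the ATmega32 running at this maximum rate (fosc/4) might look like the sketch below. It is illustrative only, is not part of the project code (SPI was ultimately not chosen), and the function names are made up for the example.

// Illustrative only: ATmega32 SPI master at fosc/4, the 4 MHz limit discussed above
void spi_master_init(void)
{
    DDRB |= (1 << PB5) | (1 << PB7) | (1 << PB4);   // MOSI, SCK, SS as outputs
    SPCR = (1 << SPE) | (1 << MSTR);                // enable SPI, master mode, SPR1:0 = 00 -> fosc/4
}

unsigned char spi_transfer(unsigned char byte)
{
    SPDR = byte;                     // start shifting the byte out
    while (!(SPSR & (1 << SPIF)));   // wait for the transfer to complete
    return SPDR;                     // byte clocked in from the slave
}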
18. ate toggle if rxstate 0 Raddr UBYTE PINC gt gt 2 amp UBYTE 0x3 read address crumb PORTC UBYTE PORTC UBYTE PAck toggle the Ack rxstatett else if rxstate 1 Rtag UBYTE PINC gt gt 2 amp UBYTE 0x3 read tag crumb PORTC UBYTE PORTC UBYTE PAck toggle the Ack rxstatett else read data crumb Rdata UBYTE Rdata lt lt 2 UBYTE UBYTE PINC gt gt 2 amp UBYTE 0x3 rxstate rxstate gt 5 0 rxstate 1 do 4 iterations of data PORTC UBYTE PORTC UBYTE PAck toggle the Ack 34 if rxstate 0 notdone 0 while 1 aos_wait sem_net_rbuf if rbput rb aos_Signal sem_net_rbuf break aos_Signal sem_net_rbuf aos_Signal sem_net_rx Performs a blocking send of a buffer Rtag lt lt 2 Raddr finished receiving stall here while trying to find space in the ring buffer Rdata Parameters addr address of target tag tag to receive data length number of bytes to transmit from buffer buf pointer to buffer containing message void net_send UBYTE addr UBYTE tag UBYTE length UBYTE buf UBYTE i Taddr Taddr tag lt lt 2 addr amp 0x3 for i 0 i lt length i Send wait Taddr buf i Performs a blocking receive from a specific source Parameters addr address of source to receive from 255 any dfs
ation, as well as the default library initialization. This procedure first sets the proper I/O pin directions and output levels. Semaphores are allocated for sending, receiving, and modifying the internal buffers.

void task1(void)
{
    net_init();
    ...
}

Following initialization, the program may begin to use the communication interface. In typical MPI master/slave programs, two versions of the program code are needed: one to be executed on the master and another for the slaves. An effective method of writing code for master/slave programs is to include both versions in a single file and allow the compiler to select the desired version based on the PID. In the example below, node 0 is the master and will run the program in the else block, while all other nodes will execute the code inside of the if block.

void task1(void)
{
    net_init();
    // program
    if (PID != (UBYTE)0)
    {
        // Nodes 1, 2, 3: worker thread code
    }
    else
    {
        // Node 0: master thread code
    }
    while (1);   // loop, since threads should never return
}

Finally, since tasks should never return, an infinite loop is included as the last line, which is the while (1) statement above. The program can be compiled for each node by simply modifying the PID.

2. API Communication Routines

In order to send and receive data over the network, the API provides several functions. Examples can be found in Appendix B.

net_send(UBYTE address, UBYTE tag, UBYTE length, UBYTE* buffer)
the network must provide a physical addressing scheme to distinguish between nodes, and fixed-size frames of data. Addresses can be hard-coded in each node at this level. Additional logical addresses can be assigned at the network layer, which is responsible for allowing variable-length messages to be communicated. The client application should be able to make use of the network to send and receive such messages via an abstraction known as the application programming interface (API).

II. Project Goals

The purpose of the design project is to develop an API and hardware communication layer for message passing with microcontrollers. The hardware consists of Atmel ATmega32 and ATmega16 microcontrollers. An STK500 development board is used as the main development and test platform. Program code is developed in the C programming language, using the WinAVR 20060125 release and CodeVisionAVR version 1.24.7f. Since this project encompasses both a hardware and a software solution, there are requirements and constraints on both sides.

On the hardware side:
- The system must support at least two microcontrollers.
- The communication network should have reasonable latency and speed.
- The system should be relatively low cost.

For the software side of the project:
- The interface must allow variable-length messages to be transmitted between microcontrollers.
- The system must provide a logical addressing scheme.
- The interface should be easy to use.
Once memory becomes available, e.g. by transmitting something from a full buffer, that particular state machine can resume. The data is stored as a byte containing both the tag and source address, concatenated with a byte of data. Each node consumes only 17 bytes of memory for state information, so a large portion of the SRAM can be devoted to buffering. During testing it was found that at least 384 bytes of buffered data, with tags and addresses, could be stored.

4. Considerations for Client Applications

Client applications that make use of the library for multiprocessing can follow the typical MPI master/slave programming style [3]. A C program using this style might look something like this:

#include <mega32.h>
#include <stdio.h>
#include "aos.h"
#include "user_tasks.h"
#include "aos_core.c"
#include "aos_task.c"
#include "aos_semaphore.c"
#include "aos_network.c"   // network communication library

#define PID (UBYTE)0       // defines this node's address

void task1(void)
{
    net_init();
    // program
    if (PID != (UBYTE)0)
    {
        // Nodes 1, 2, 3: worker thread code
    }
    else
    {
        // Node 0: master thread code
    }
    while (1);   // loop, since threads should never return
}

Each microcontroller is programmed with this program, altering only the PID to match the address of that node. After the network initialization is performed, the program will diverge depending on whether the microcontroller will be the master of the multiprocessor
22. clude aos h include user_tasks h include aos_core c include aos_task c include aos_semaphore c include aos_mbox c include aos_uart c include aos_network c define PID UBYTE 0 void task1 void For status LEDs PORTA Oxff DDRA Oxff USART Init UCSRB 0x18 UBRRL 103 net_init PORTA 0x00 delay_ms 100 omit if no semaphores or mail used omit if no mailboxes omit if you don t need UART program if PID UBYTE 0 Nodes 1 2 3 unsigned char i 0 sum D ret unsigned char buf 1 putsf Receiving r while i lt 16 receive_wait ret net_recv 0 0 1 buf if ret 0 i ret sum buf 0 printf bytes rx d Got d sum now d r n ret buf 0 sum printf Sending result d to addr d tag d r buf 0 0 net_send 0 else unsigned char i PID Le but j stop 0 29 Tag specified by PID Node 0 buf 0 PID sum unsigned char buf PORTA Oxff pu tsf Sending data r for i 0 i lt 16 E pu ne ne ne printf sd buf il ESE TAEI t_send 1 0 16 buf t_send 2 0 16 buf t_send 3 0 16 buf PORTA Ox0f receive_wait pu tsf Got first result r receive_wait pu tsf Got second result r receive_wait pu PORTA O0xf0 tsf Got third result r while stop lt 3 0 1
Once it receives the acknowledgement, it drives the address onto the bus. This nibble is read by the arbiter, which then selects the target node to receive requests and acknowledgements. The arbiter will toggle BAck at the end of each state to signal a change to the next state, and the transmitting node toggles PTx, which is forwarded to the receiving node as BTx. When all data is transferred, the nodes return to the ready state. The arbiter ensures that only one request for the bus is acknowledged at a time.

The main benefit of this scheme is that data is transferred in parallel, so a significant speed improvement over the serial schemes can be expected. However, there are some drawbacks. The bus is bidirectional, which in implementation comes at the cost of extra processor cycles, since an instruction must be issued to change the I/O direction of the pins. Additionally, the arbiter must time bus ownership carefully to ensure that no two microcontrollers try to drive the bus lines at the same time, which means introducing extra delays. It was found that even with round-robin arbitration and the increased bandwidth of a parallel bus, the shared medium remains a bottleneck. The problem is particularly pronounced, though seemingly unnecessary, in many scenarios. Consider an application with 4 nodes A, B, C, and D sharing a bus. Data flows primarily between nodes A and B and between nodes C and D, with less frequent communication between other
The state transition diagrams detail the basic flow of the protocol. The state machines are highly symmetric with respect to receiving and transmitting. Also, the state machines for the network switch are nearly identical to the node state machines and are thus omitted.

Figure 9: Version 2 Topology

The state transition diagrams for version 2 of the protocol are:

Figure 10: State Diagrams for node Receive and Transmit (receive: read Addr from bus, toggle PAck; read Data from bus, toggle PAck; transitions on BTx level/toggle. Transmit: drive Addr on bus, toggle PTx; drive Data on bus, toggle PTx; transitions on BAck level/toggle.)

The pins on each node were mapped as follows:

Pin  Signal  Direction
7    BTx     IN
6    BAck    IN
5    PTx     OUT
4    PAck    OUT
3    DIn1    IN
2    DIn0    IN
1    DOut1   OUT
0    DOut0   OUT

There are several benefits of this topology over that of version 1. First, overall throughput is increased, even though individual links are half as wide as the bus. This is because the shared medium has been eliminated, so transmitters can initiate data transfer at any time. Second, addressing is more flexible, since data headers can be multiples of 2 bits instead of 4. This reduces the amount of excess overhead due to zero-padding the unused portion of an address packet sent over the wider link. One drawback of the star network is that more port pins are
The system meets the design goals. Specifically, a four-node configuration was tested, which meets the minimum goal of two nodes, and communication is reasonably fast, with only a two-hop delay. The system is low cost, since no extra hardware is required aside from the inexpensive microcontrollers. The software interface successfully allows variable-length messages to be communicated, using tags and addresses as targets. The interface provides an abstraction of the underlying network, allowing ease of use. The memory footprint can be adjusted to allow enough room for client applications. Finally, the constraint of using at most 1 port per microcontroller was met.

The largest drawback to the architecture is that the star network depends heavily on the central node for data routing. The network is also limited in size because of the I/O pin count. In order to improve throughput, a specialized piece of hardware, such as an FPGA programmed for this purpose, could be used instead of a microcontroller. A hardware router might be able to route messages at a higher frequency, thus reducing latency, and would also provide truly independent communication with each node. A high-throughput backbone for connecting multiple routers might solve the pin-count problem. Other improvements could be made as well. The data packet length could be increased to handle longer messages with less overhead. Messages sent from a node to itself must travel through the router in this implementation.
As the number of nodes on a shared bus grows, collisions become more frequent and throughput decreases. Dedicated lines are more costly in terms of hardware, but do not exhibit such performance degradation.

Data can be transmitted in different ways over the link. Synchronous protocols require that nodes on the same link have clocks running at the same frequency. The clock signal may be sent on a dedicated wire or embedded in the data transmission, in which case a clock recovery mechanism must be employed. Asynchronous protocols allow nodes on the same link to transmit data without the need for clocks.

Figure 6: Network Topologies (ring, mesh, fully connected, line)

The network topology defines the subsets of nodes that are connected by common links. Topologies can range from a simple line to a fully connected graph. The choice of topology determines the latency, in number of hops from one node to another, and the cost of the network, in number of links. For a network in which each of N nodes can communicate with each other node, a ring with unidirectional links has the lowest cost (N links) and a fully connected network with bidirectional links has the lowest latency (1 hop between any two nodes). Other designs provide a tradeoff between low latency and low cost. Tree networks reflect the structure of many divide-and-conquer algorithms, while hub-and-spoke or star networks embody master/worker algorithms [5]. At the data link layer,
symmetric multiprocessing.

Figure 2: UMA Architecture
Figure 3: NUMA Architecture
Figure 4: Cluster Architecture

Memory may be shared equally among all processors, resulting in a uniform memory access (UMA) machine, or it may be divided among the processors, as in non-uniform memory access (NUMA) machines. At the other end of the spectrum, clusters combine several individual computers with a high-speed network to give the illusion of a single computer with many processors. Grid computers are even more loosely coupled than clusters and may have high-latency, low-bandwidth interconnect. Clusters and grids have become popular because they can be constructed from low-cost, off-the-shelf computers and local area networks, thus making affordable supercomputers possible.

Clusters and other tightly integrated multiprocessor systems evolved from the same concept, but are dissimilar in many ways. One of the key differences between clusters and specially designed multiprocessor systems, such as CMP and SMP computers, is that clusters run an operating system on each node, while CMP and SMP computers typically run a single multiprocessor-aware operating system. Also, many small-scale CMP and SMP computers are UMA machines, but clusters resemble loosely coupled NUMA
#define COMPORT PORTB
#define COMPIN  PINB

void proc1(void)
{
    UBYTE procnum = 1;
    ...

The code for node 1, for example, would be changed as above.

/*****************************************************
Project:         networkswitch
Version:         3.0
Author:          Kalim Moghul
Chip type:       ATmega16
Clock frequency: 16.000000 MHz
*****************************************************/

#include <mega16.h>
#include <delay.h>
#include <stdio.h>
#include "ringbuffer.h"

#define UBYTE unsigned char
#define UINT  unsigned int

#define BTx   0b10000000
#define BAck  0b01000000
#define PTx   0b00100000
#define PAck  0b00010000
#define Dmask 0b11110011

#define RBUFSIZE 24

#pragma regalloc-
UINT rbuf[16][RBUFSIZE];   // allocate ringbuffer storage per tag/address pair
ringbuffer rb[16];         // allocate ringbuffer per dest tag/address pair
#pragma regalloc+

void proc0(void);
void proc1(void);
void proc2(void);
void proc3(void);

void main(void)
{
    int i;
    /* PORTn pinouts:
       Pin  Signal  Direction
       7    BTx     OUT
       6    BAck    OUT
       5    PReq    IN
       4    PAck    IN
       3    Data3   OUT
       2    Data2   OUT
       1    Data1   IN
       0    Data0   IN
    */
    DDRA = 0xcc;    // interface for node 0
    PORTA = 0x00;
    DDRB = 0xcc;    // interface for node 1
    PORTB = 0x00;
    DDRC = 0xcc;    // interface for node 2
    PORTC = 0x00;
    DDRD = 0xcc;    // interface for node 3
    PORTD = 0x00;

    // USART init (for debugging)
    UCSRB = 0x18;
    UBRRL = 103;

    delay_ms(1000);

    // Initialize ring buffers
    for (i = 0; i < 16; i++)
communicate transmission requests and acknowledgements between each node and the bus controller. The protocol is asynchronous, to reduce the load associated with maintaining a transmission clock. Addressing can be more efficient than the previously discussed scheme, since it can be accomplished using a single nibble, as opposed to a full byte, when the number of nodes is 16 or fewer. In order to reduce the overhead associated with having each node actively snoop the bus to determine if it is being addressed, target addresses are read by the bus controller, which then informs the recipient that data will be transmitted. The hardware requirements are reasonable, since each node microcontroller must give up only one port (8 pins) and the arbiter needs 4 + 4N pins to support N nodes.

The pins on each node were mapped as follows:

Pin  Signal  Direction
7    BTx     IN
6    BAck    IN
5    PTx     OUT
4    PAck    OUT
3    Data3   INOUT
2    Data2   INOUT
1    Data1   INOUT
0    Data0   INOUT

The state transition diagrams for the protocol are:

Figure 8: State Diagrams for Transmit and Receive (transmit: drive Addr, toggle PTx; drive Data 1/2, toggle PTx; drive Data 2/2, toggle PTx, with transitions on BAck level/toggle. Receive: read Data 1/2, toggle PAck; read Data 2/2, toggle PAck, with transitions on BTx level/toggle.)

In order to send a byte of data, the processor toggles its PTx line and waits for an acknowledgement from the arbiter.
Internal buffering is initialized next, followed by an optional background thread, which will be explained in more detail later. After initialization, the client application is free to utilize the high-level data functions. The first, net_send(address, tag, length, buffer), is used to transmit variable-length messages. This is a blocking operation, meaning that the calling task cannot continue until all bytes are transmitted. The address parameter specifies the target node according to the hardware address assignments, and tag specifies which tag at the target node will receive the data. The length parameter specifies how many units of data are transmitted from buffer, which holds the message. The implementation uses the unsigned char type for these parameters, thus allowing 255 tags on 255 addresses and messages up to 255 bytes in length.

The corresponding function to receive data is net_recv(address, tag, length, buffer). The address and tag parameters correspond to the node that sent the data and the local tag that is expected to contain the data, respectively. As in net_send, length is the maximum number of bytes to receive. Data is copied into the memory location specified by buffer, and the function returns a number indicating how many bytes were actually received. A value of 255 can be specified for address, which will retrieve data from any address. The same applies to tag. This functionality is useful for collecting data from multiple senders, tags, or both, when the ordering of messages does not matter. The high-level receive function is non-blocking and will return 0 if no data is ready.

The high-level functions described above depend on the low-level routines that actually manipulate the signals at the physical level. These low-level functions are send_wait(Taddr, data) and receive_wait(). In order to send data, net_send calls send_wait in a loop until all bytes are sent. The Taddr parameter contains the tag and the address combined, and data is the byte to be sent. send_wait is protected by a mutex, which limits access to the transmission control and data lines to one thread at a time. The function waits until it can enter the critical section before running the network protocol state machine described in section IV.5. The address, tag, and data are shifted out 2 bits at a time, from most significant bits to least significant. After transmission of a byte is complete, the mutex lock is released.

The receive_wait function is not called directly by net_recv, but must be called either by another thread or by the calling thread before net_recv can be expected to return any data. The low-level receive function simply receives one byte of data and buffers it internally, irrespective of address or tag. The function and the buffers are also protected by mutexes, which guarantees that competing threads will not violate the protocol.
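The 2-bit-at-a-time shifting can be pictured with the following sketch. It mirrors the structure of the Appendix C code, but simplifies away the handshaking, state variables, and mutexes; put_crumb is a hypothetical stand-in, not a function of the library.

// Simplified sketch of how one byte is serialized, most significant bits first (handshaking omitted).
// put_crumb() is a hypothetical helper that drives 2 data bits onto the link and toggles PTx.
void send_byte_sketch(UBYTE Taddr, UBYTE data)
{
    UBYTE i;
    put_crumb(Taddr & 0x3);          // address crumb (lower 2 bits of Taddr)
    put_crumb((Taddr >> 2) & 0x3);   // tag crumb (next 2 bits of Taddr)
    for (i = 0; i < 4; i++)          // four data crumbs, MSB pair first
        put_crumb((data >> (6 - 2 * i)) & 0x3);
}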
machines with longer-latency interconnect. NUMA architectures are more commonly seen in large multiprocessor systems, in which a shared memory bus becomes a bottleneck. As in clusters, each processing node has its own memory, but there are usually hardware mechanisms to transfer shared data between processors. Clusters lack this hardware sharing ability, and unless a unified memory is emulated in software, each node in a cluster maintains its own memory address space. The implications of this are twofold. First, distributed memory machines exhibit high performance when a task is divided so that each processor can fit its working set into the local subset of main memory. Shared memory machines do not suffer from this, since all main memory accesses are equal. However, since distributed memory machines do not need to share the bus when working with local memory, the potential performance improvement over SMP systems is proportional to the number of separate memories. Second, without a shared memory address space, nodes in a cluster must use the interconnect to explicitly send and receive messages, instead of reading and writing a shared variable. This message-passing latency can be relatively long compared to the communication latency in tightly integrated systems. Thus, it is important to optimize both hardware communication efficiency and software communication overhead.

3. Communication Networks

The design of the cluster interconnect is central to performance. The parameters of the network include shared vs. dedicated medium, bidirectional vs. unidirectional links, synchronous vs. asynchronous transmission, link width, and choice of topology. Each parameter must be selected depending on the application requirements and cost limitations.

Figure 5: The OSI Model (host layers and media layers)

In accordance with the OSI seven-layer model, the physical layer of the network definition specifies how many wires carry signals between nodes [6]. Communication links that transmit data serially typically have fewer wires than those that transmit data in parallel, and thus are easier and cheaper to implement. The data wires may be bidirectional, or half duplex, meaning that the same wire switches directions to transmit and receive, or unidirectional, with each wire carrying data in one direction only. Unidirectional links require twice as many wires to transmit in both directions, but offer twice the bandwidth of bidirectional links. Additionally, the medium may be shared among several nodes, rather than just having nodes at each end. In this case, nodes attempting to transmit may suffer a collision, and the data will be garbled. Thus, some sort of arbitration must be built into the network. This can be accomplished by using an arbiter that receives requests and grants permission to send, by passing a token from node to node, or by collision detection with backoff and retry.
A simple program would be run on another microcontroller to receive from and transmit to the node, and to check that the library was functioning properly. Serial output over the USART was enabled on the microcontrollers, and print statements were inserted as a means of verifying correct dataflow internally. The final verification was performed on the system as a whole, by running simple multiprocessor applications. The system limits for data throughput and number of simultaneous tasks were then measured.

In this design project, the implementation of the network architecture supports four ATmega32 nodes connected to an ATmega16 switch. The address and tag fields were each set to two bits, and the fixed data length was 1 byte.

VII. Results

Testing of the first version of the protocol was mostly successful, except that the protocol test loops would randomly lock up because an invalid transition had taken place in the state machine. Debugging output showed no software problems, so the electrical signals were analyzed using an oscilloscope. The scope showed the presence of high-speed switching noise, with characteristic spiking at transitions and ringing after reaching the high or low voltage level. Some transitions were noisy enough that incorrect values may have been read. One solution was to use LSI inverters to build soft buffers to clean up the signals. This proved to be effective, but the protocol was ultimately not used. The glitches did not appear in versions 2 and 3 of the protocol.
This can be implemented using a token-passing scheme, but this adds an extra line to inform the next processor that it can be the master. Also, as the number of microcontrollers grows, the number of slave select lines increases per chip. An arbiter might be an alternative, but the bandwidth is still divided among all microcontrollers, making it a poor choice for large networks.

The next option is the universal synchronous and asynchronous serial receiver and transmitter (USART). This is a full-duplex device which allows both asynchronous and synchronous communication. Data can be sized from 5 to 9 bits in length. The transmission rate can be adjusted from 2400 bits/sec up to 2 Mbits/sec. Since frames require a start and a stop bit, the maximum throughput is 200 KB/sec. Additionally, in multiprocessor communication mode, an address frame must be sent at the start of a transmission. As was the case with SPI, this requirement halves the transmission rate. If variable-length messages were allowed at the link layer, this would not have as much of a negative impact, however. Still, the scalability problems that preclude the use of SPI apply to the USART as well.

The final hardware option provided by the microcontrollers is the two-wire serial interface (TWI), which is compatible with the industry-standard I2C interface. TWI provides built-in addressing and arbitration for systems with multiple masters. The clock signal is generated by the master.
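Although the USART was not chosen as the interconnect, the demo and switch programs in the appendices do configure it for debugging output at 9600 baud. The register values they use follow from the standard AVR baud-rate formula UBRR = fosc / (16 * baud) - 1; the wrapper name below is hypothetical.

// 16 MHz clock, 9600 baud: UBRR = 16000000 / (16 * 9600) - 1 = 103 (the value used in the appendix listings)
void usart_debug_init(void)
{
    UCSRB = 0x18;   // enable receiver and transmitter (RXEN | TXEN)
    UBRRL = 103;    // 9600 baud at 16 MHz
}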
Thus, care must be taken when calling the low-level function, as it is blocking. An optional listener thread can be activated to call the low-level receive function. In the global scope, the following would have to be added:

#define AOS_NL_TASK_PRIO (AOS_FIRST_TASK_PRIO + 1)
aos_tcb* nl_tcb;                                                         // task control block pointer for the listener thread
UBYTE nl_data_stack[AOS_TASK_DSTK_SIZE], nl_hw_stack[AOS_TASK_HSTK_SIZE];   // stacks for the listener thread
void net_listener(void);                                                 // function prototype

The listener could be activated in the initialization function with the aOS task-creation function:

nl_tcb = aos_task_create(net_listener, nl_hw_stack, nl_data_stack, AOS_NL_TASK_PRIO);

The task priority should be adjusted to be higher than the program tasks for scheduling. Note that in the aOS scheduler, higher priorities are represented by lower integer values. The listener task continuously runs the state machine to receive data. This can be as simple as calling the low-level function in an infinite loop:

void net_listener(void)
{
    while (1)
    {
        receive_wait();
        aos_sleep(10);   // allows another task to execute
    }
}

Copying the state machine code into the body of the listener task can further optimize the listener thread. In this case, care must be taken to avoid deadlock and to ensure mutual exclusion if other threads will call the low-level function. Proper buffering can reduce the chance of deadlock in the system, but
Also, it was necessary to introduce short startup delays of approximately 100 ms to prevent the microcontrollers from reading incorrect values stemming from unstable voltage levels being present on the control lines when the microcontrollers were switched on.

After verification of the individual components, the entire 4-processor cluster was assembled and programmed to run a multiprocessor application. This application distributes an array to the worker nodes in the cluster, which then sum the array and return the result to the master node. The system was successfully demonstrated to Dr. Land. Using an oscilloscope, the maximum transmission frequency per link was measured to be approximately 10 kHz. With 8 links in the system and 2 bits transferred per cycle, this gives an aggregate bandwidth of 160 kbps. Experimentation with the OS parameters and the data stack size showed that two tasks could easily fit into the ATmega32 SRAM. Adding a third task required scaling back the network buffers.

VIII. Conclusion

A network architecture and message-passing API were developed for Atmel microcontrollers. The library was implemented on a system comprising multiple processing nodes and a centralized switch. This system was utilized as a cluster computer running multiprocessor applications to demonstrate the functionality of the communication layer. Verification showed that the design was successful.

Figure 13: Four-Node Cluster
Report Approved by Project Advisor: _______________    Date: _______________

Executive Summary

Microcontrollers are versatile integrated circuits typically incorporating a microprocessor, memory, and I/O ports on a single chip. These self-contained units are central to embedded systems design, where low-cost, dedicated processors are often preferred over general-purpose processors. Some embedded designs may incorporate several microcontrollers, each with a specific task. Efficient communication is central to such systems. The focus of this project is the design of a communication layer and application programming interface for exchanging data in a system comprising multiple microcontrollers. Since each microcontroller contains a separate memory, a message-passing system is designed. The hardware consists of low-cost microcontrollers serving as computing nodes and a network controller to direct messages among them. Various network topologies and bus designs are considered. The resulting design consists of two dedicated unidirectional links between each node and a centralized message router. In order to demonstrate the library, a small-scale cluster computer is constructed using Atmel ATmega32 microcontrollers as processing nodes and an ATmega16 microcontroller for message routing. The communication library is integrated into aOS, a preemptive multitasking real-time operating system for Atmel microcontrollers.

Table of Contents

I. Introduction ..... 5
algorithm or one of the worker threads. Masters may perform the same computation as workers, but are also responsible for distributing workloads and collecting results. The compiler will typically perform dead-code elimination to compile only the correct subprogram, based on the PID.

VI. Testing

Individually, microcontrollers are difficult to test and debug because of the lack of a human interface. In a complex system with multiple microcontrollers, verification becomes even more challenging. In order to deal with this complexity, a unit-testing methodology was applied. The state machine implementations of the communication protocols were evaluated with small programs that would transfer predictable, infinite streams of data that could be verified at the receiver, e.g. looping over all values from 0 to 255. The circular buffers, being a generic data structure, were tested on a PC using sequences of put and get requests that covered both general cases and corner cases over a range of buffer lengths. The network switch was first tested for correct reception and retransmission against a test program, using just one state machine for one node, since all are identical. The echo provided by the switch would indicate whether or not the data was properly processed. Then a state machine was added to communicate with a second node, to facilitate testing of address handling. For the processor nodes, testing was performed similarly.
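A PC-side check of the circular buffer of the kind described above might look like the following sketch. It is standard C compiled on the host (with ringbuffer.c built into the test program) rather than on the AVR, and the buffer length and test values are arbitrary.

// Host-side sanity test for the ring buffer: fill to capacity, verify overflow is refused,
// then drain and verify FIFO order and the empty condition.
#include <assert.h>
#include <stdio.h>
#include "ringbuffer.h"

int main(void)
{
    rbdatatype storage[5];              // usable capacity is length - 1 = 4 (one sentinel element)
    ringbuffer rb;
    unsigned char i;

    rbinit(&rb, storage, 5);
    assert(rbisempty(&rb));
    for (i = 0; i < 4; i++)
        assert(rbput(&rb, i) == 1);     // puts succeed until the buffer is full
    assert(rbisfull(&rb));
    assert(rbput(&rb, 99) == 0);        // a full buffer must refuse new data
    for (i = 0; i < 4; i++)
        assert(*rbget(&rb) == i);       // elements come back out in FIFO order
    assert(rbget(&rb) == NULL);         // and the buffer reports empty again
    printf("ring buffer tests passed\n");
    return 0;
}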
When data is expected in a well-defined manner, an application would only need to call the low-level receive function enough times to ensure that all data was received. The high-level receive would take care of sorting by tag and address. However, in some systems the quantity and frequency of data may not be known. Thus, care must be taken when calling the low-level function, as it is blocking. One solution to this problem is to use interrupts to monitor the receive control line and provide no access to the low-level receive. Another option is to use the operating system's multitasking ability to run a listener thread in the background. In this implementation, using interrupts would require modifying the core of the OS, and this violates the notion of the network library as a simple add-on. Thus, an optional listener is provided.

Another problem, which is also present in MPI, is that the message-passing API does not guarantee buffering. Sends and receives will block until matching receives and sends are performed at the other end. This can significantly reduce performance in some cases, or even lead to deadlocks. For example, consider a scenario in which two nodes start sending data to each other at the same time. The function calls may appear in the program code as follows:

       Node A     Node B
    1  Send(B)    Send(A)
    2  Recv(B)    Recv(A)

Since the send functions do not return until the receives are executed, both nodes will become stuck in sending.
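One conventional way out of this exchange deadlock is simply to order the calls so that one side receives before it sends. The sketch below is illustrative only; A_ADDR, B_ADDR, len, out_buf, and in_buf are placeholder names, and each receive loop drives the low-level receive as the demo programs do.

// Node A: send first, then receive.
net_send(B_ADDR, 0, len, out_buf);
while (net_recv(B_ADDR, 0, len, in_buf) == 0)
    receive_wait();                    // keep running the low-level receive until B's message arrives

// Node B: receive first, then send, so the two sides never block on simultaneous sends.
while (net_recv(A_ADDR, 0, len, in_buf) == 0)
    receive_wait();
net_send(A_ADDR, 0, len, out_buf);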
I. Introduction

1. Microcontroller-Based Systems

Microcontrollers are highly integrated, low-cost, programmable computers that are used to design embedded systems. A typical microcontroller comprises a central processing unit (CPU), random access memory (RAM), and input/output (I/O) interfaces in a single package. Additional features, such as on-chip analog-to-digital converters (ADC), timers, and low power consumption, make them adaptable enough to be applied to a wide variety of applications. The benefit of such tight integration is a small footprint as compared to a more general processor, which would require several additional modules and greater wiring complexity to perform the same function. The small footprint, low cost, and high flexibility of microcontrollers make them ideal for integration in everything from automobiles to household appliances.

These benefits, however, come at a cost. In order to maintain low power consumption, clock speeds are typically on the order of tens of MHz. Unlike general-purpose computers, most have firmware to store static programs in place of RAM for dynamically loadable programs. Thus the microcontroller must be reprogrammed to run new software. Modifiable data, such as variables, must be stored in volatile memory, so on-chip RAM may be present. In order to maintain the cost effectiveness of the chip, RAM is typically only a few
pairings. Intuitively, the high-traffic communication paths are independent, but since the bus is shared, contention slows down both paths to approximately half the expected throughput. This communication pattern occurs frequently enough to warrant a change to the architecture.

4. Network Architecture Version 2

Using the lessons learned from the first version, a few observations can be made about a protocol that would be optimal for connecting many microcontrollers that generate a lot of traffic on a small subset of the possible paths. First, dedicated links are preferred over a shared bus. Unfortunately, dedicated links require more port pins, so a tradeoff must be made in both the type of topology employed and the width of the link. Second, communication should be unidirectional per link, to allow the full bandwidth to be used without contention. This will effectively double the number of pins per link, but will keep throughput at a maximum.

The second version of the architecture is based on the network switch model, in which all nodes interface with a central microcontroller, essentially creating a star topology. Each node attaches to the switch using two transmit lines and two receive lines, as well as two transmit control and two receive control lines, thus staying within the 1-port goal. This implies two unidirectional links per node, with full bandwidth on each 2-bit-wide link. The communication protocol is asynchronous, as in the first version.
43. rbinit rb i rbuf i RBUFSIZE init ring buffers while 1 profil procl proc2 proc3 end main loop define COMPORT PORTA define COMPIN PINA void proc0 this function s owner should be const but CodeVisionAVR doesn t allow it UBYTE procnum 0 concatenated address and data as stored in ringbuffer static UINT Raddrdata Taddrdata static UBYTE Raddr Rtag Rdata Tdata Taddr static UBYTE rxstate 0 txstate 6 txq 0 txready 0 static UBYTE RAck 0 TAck PAck rxpause 0 UBYTE i 37 LRRRKKKK KKK veceive RRR KK KK KKK ce COMPIN amp PTx RAck rxpause BACK rxpause RAck RAck PTx if rxstate 0 Raddr COMPIN amp 0x3 read dest address crumb COMPORT COMPORT BAck toggle the Ack rxstatett else if rxstate 1 Rtag COMPIN amp 0x3 read dest tag crumb COMPORT COMPORT BAck rxstate else if rxstate lt 5 Rdata Rdata lt lt 2 COMPIN amp 0x3 read data crumb COMPORT COMPORT BAck toggle the Ack if rxstate 5 last data received so construct value for ringbuffer Raddrdata UINT procnum lt lt 8 UINT Rdata rxstatett if rxstate 6 try to put addr data in queue stall if full rxpause rbput rb Rtag lt lt 2 Raddr Raddrdata if rxpause rxstate 0 RRR KKKK KKK transmit RRRKKKEK KKK
required on the switch, specifically 8N versus 4 + 4N for any useful number of nodes (i.e., N >= 2). This limits the number of nodes connected to the ATmega16 to a maximum of 4. The number of hops for data to go from source to destination is also increased to 2, but since control signals were mapped through the arbiter in version 1 anyway, there is no significant increase in latency.

5. Network Architecture Version 3

The final version of the network architecture is largely based on version 2, with some protocol improvements. In version 2, each node would prepend a destination address header, which was then translated by the switch into a header containing a source address. In applications that send a single data stream to each address this is sufficient, because each stream comes from a unique source. However, for many applications it is useful to have the network layer support logical addresses, or tags, which provide multiple logical reception points on each node. Several data streams can then be sent to a single node, with disambiguation built into the protocol. The final packet format carries an address, a tag, and one byte of data, as sketched below. The state transition diagram is modified minimally to include tag sends and receives. The receive state machine is similar and is omitted.

[Figure 11: State Diagram. Transmit side: drive the address, tag, and data crumbs on the bus in turn, toggling PTx after each crumb and waiting for a BAck toggle before continuing.]
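The original packet-format figure did not survive extraction, so the sketch below is a reconstruction inferred from the switch's receive state machine in Appendix C rather than a copy of that figure; the struct, its field names, and the local UBYTE typedef are illustrative only. Each crumb is one 2-bit symbol on the data lines.

/* Reconstruction (not the original figure): the crumb sequence a node
   sends to the switch.  On the switch-to-node link the address crumb
   carries the source node instead of the destination. */
typedef unsigned char UBYTE;   /* as in the library code */

typedef struct
{
    UBYTE addr;   /* 1 crumb:  node address (0-3)                   */
    UBYTE tag;    /* 1 crumb:  logical tag (0-3)                    */
    UBYTE data;   /* 4 crumbs: one data byte, MSB crumb sent first  */
} net_packet;     /* 6 crumbs = 12 bits on the wire per data byte   */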
running the background thread is to make the listener the highest-priority thread and have it sleep, so that lower-priority threads have a chance to execute. A balance between computation and I/O must be struck in order for this to be efficient.

3. Network Switch

The network switch lies at the core of the network. Its responsibility is to direct data to the appropriate destinations and to notify the recipient of the identity of the sender. In this implementation, an ATmega16 microcontroller runs a program that implements the network communication protocol. An instance of the protocol state machine is instantiated for each client node. Each state machine has a receive phase, in which it accepts data from the node if it requests to transmit, and a corresponding transmit phase, in which the switch sends data that is targeted at that node. Though any transmit or receive state machine may block for any given node, the program design ensures that no other state machines suffer as a result. In this particular implementation, a circular buffer is allocated per tag/address pair to reduce the occurrence of deadlock, as explained above; a sketch of this buffer organization follows below.

The switch uses a store-and-forward policy in which each byte of data is fully received before it is transmitted. The receive state machine will read the target address, tag, and data before checking whether the data can be buffered. This is done to allow the nodes to continue processing in case the switch must stall. Once memory
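To make the buffering scheme concrete, the sketch below restates it in code. The ringbuffer type, the rb array, rbput(), and the packed value layout mirror the switch source in Appendix C; the RB_INDEX macro and the enqueue_for_node() helper are illustrative additions that do not appear in the original program.

/* One circular buffer per (tag, destination node) pair: with 4 tags and
   4 nodes this gives 16 independent queues, so a full queue stalls only
   the one stream that feeds it. */
#define RB_INDEX(tag, dest) (((tag) << 2) | (dest))   /* the switch inlines this expression */

ringbuffer rb[16];          /* one queue per (tag, destination node) pair */

/* Illustrative helper: returns 1 on success, 0 if that stream's queue is
   full (the switch then stalls only this stream by withholding further
   acknowledgments until space frees up). */
UBYTE enqueue_for_node(UBYTE tag, UBYTE dest, UBYTE source, UBYTE data)
{
    UINT value = ((UINT)source << 8) | data;          /* source in high byte, data in low byte */
    return rbput(&rb[RB_INDEX(tag, dest)], value);
}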
            sizeof(float)
        printf("pi is approximately %f, Error is %f\r\n", pi, pi - PI);
    } /* end PID 0 */
    else
    {   /* Nodes 1-3: WORKERS */
        float mypi, h, sum, x;
        int i, n;
        UBYTE* c;

        putsf("Waiting for n\r");
        /* receive n value from master */
        for (i = 0; i < sizeof(int); i++)            /* should receive 1 int, byte by byte */
            receive_wait();
        putsf("Got a byte\r");
        i = 0;
        c = (UBYTE*)&n;                              /* byte pointer to the int */
        putsf("Pulling from buffer\r");
        /* retrieve n from the receive buffer */
        while (i < sizeof(int))
            i += net_recv(0, 255, sizeof(int) - i, c + i);

        /* do the partial sum approximation */
        h = 1.0 / (float)n;
        sum = 0.0;
        for (i = PID; i <= n; i += NUMPROCS)
        {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* send the partial result */
        c = (UBYTE*)&mypi;                           /* byte pointer to the float */
        putsf("Sending the result\r");
        net_send(0, 0, sizeof(float), c);
    }
    /* end program; done, never return */
    while (1);
}

Appendix C: Network Code

The aOS code may be obtained from [7]. This is the main library code, contained in aos_network.c:

/* Network communication code
   Protocol version 3.0: 2x2-wire unidirectional buses, tags and addresses supported
   Written by Kalim Moghul (kfm9) */

#include "ringbuffer.h"

/* Pinout for bus communication
   Pin   Signal   Direction
    7    BTx      IN
    6    BAck     IN
    5    PTx      OUT
    4    PAck     OUT
    3    DIn1     IN
    2    DIn0     IN
    1    DOut1    OUT
    0    DOut0    OUT  */
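Given this pinout, the port's data-direction register follows directly: PTx, PAck, DOut1, and DOut0 are outputs and the remaining pins are inputs. The sketch below assumes the bus is wired to port C and uses an illustrative helper name; only the bit positions are taken from the table above.

/* Sketch: port configuration derived from the pinout table.
   Port C is an assumption for illustration; bus_pins_init() is not part
   of the library API. */
#include <mega32.h>

void bus_pins_init(void)
{
    PORTC = 0x00;                                        /* outputs low, no pull-ups on inputs */
    DDRC  = (1 << 5) | (1 << 4) | (1 << 1) | (1 << 0);   /* PTx, PAck, DOut1, DOut0 = 0x33 */
}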
switch requires two bytes. The type definition of rbdatatype below would be changed to unsigned char for the library.

/* Ring (circular) buffer implementation
   Author: Kalim Moghul (kfm9@cornell.edu)

   Invariants:
     head == tail iff the buffer is empty
     If the buffer is not empty, head points at the next valid element to be consumed
     tail always points at the next empty element
       => There is always one unused element in a full buffer
       => length must be greater than 1
*/

typedef unsigned int rbdatatype;       /* defines storage type of ringbuffer */

typedef struct ringbuf
{
    rbdatatype* buf;                   /* points to data array */
    unsigned char length;              /* length of data array */
    unsigned char head, tail;          /* producer and consumer indices */
} ringbuffer;

/* initializes the given ringbuffer with the supplied array and its length */
inline void rbinit(ringbuffer* rb, rbdatatype* array, unsigned char length);

/* returns boolean: true if the ringbuffer is empty, false otherwise */
inline unsigned char rbisempty(ringbuffer* rb);

/* returns boolean: true if the ringbuffer is full, false otherwise */
inline unsigned char rbisfull(ringbuffer* rb);

/* consumes an element from the buffer;
   returns NULL if the buffer is empty, or a pointer to the array element otherwise */
inline rbdatatype* rbget(ringbuffer* rb);

/* puts an element into the buffer; returns 0 if the buffer is full, otherwise returns 1 */
inline unsigned char rbput(ringbuffer* rb, rbdatatype c);

#include "ringbuffer.c"

This
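A brief usage sketch of this API is given below. The buffer size, variable names, and the example() wrapper are illustrative; the library and the switch each allocate sixteen such buffers, one per tag/address pair.

#include "ringbuffer.h"

#define QSIZE 5                       /* holds QSIZE-1 elements; one slot is the unused sentinel */

rbdatatype storage[QSIZE];            /* caller-supplied backing array */
ringbuffer q;

void example(void)
{
    rbdatatype* p;

    rbinit(&q, storage, QSIZE);       /* bind the array to the ring buffer */

    if (!rbisfull(&q))
        rbput(&q, 0x1234);            /* enqueue one element; returns 0 if full */

    p = rbget(&q);                    /* dequeue the oldest element */
    if (p != NULL)
    {
        /* *p is valid until the slot is overwritten by a later rbput() */
    }
}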
It is up to the programmer to ensure that this scenario does not occur, but many such situations can be resolved with good buffering.

2. Integration into aOS

The aOS kernel occupies a large portion of the SRAM, which limits the stack space available for running threads. In order to keep the network add-on lightweight, the implementation is conservative in its memory usage. Three semaphores and four global variables are needed at a minimum. All addresses, tags, and data arguments are unsigned bytes; the network design allows for larger types, but these limitations are reasonable for small systems. The circular buffers that are used to sort incoming data are variable in size and add as little as 96 bytes to global memory consumption.

[Figure 12: SRAM Memory Map, showing the working registers, I/O registers, data stack (Y initial value, DSTACKEND), global variables, hardware stack (SP initial value, HSTACKEND), and the heap (_HEAP_START_) laid out from address 60h or 100h to the end of SRAM.]

The preemptive multitasking abilities of aOS allow a background listener thread to be created for the purpose of receiving data from the network and maintaining the illusion of non-blocking receives for the task threads. The aOS thread model runs a scheduler that will only swap a new thread in if it is put into a runnable state and its priority is higher than that of the current thread. One approach to successfully
   tag:      local tag to receive data from (255 = any)
   length:   maximum number of bytes to receive
   Returns:  the number of bytes actually received
   Modifies: buf contains the received data */
UBYTE net_recv(UBYTE addr, UBYTE tag, UBYTE length, UBYTE* buf)
{
    UBYTE count = 0, st, et, sa, ea, i, j;
    UBYTE* temp;

    aos_wait(sem_net_rbuf);

    /* determine tag and source address ranges to get data from */
    if (tag == 255 && addr == 255)        /* any tag, any source */
    {
        st = 0; et = 3;
        sa = 0; ea = 3;
    }
    else if (tag == 255)                  /* any tag, one source */
    {
        st = 0; et = 3;
        sa = ea = addr;
    }
    else if (addr == 255)                 /* one tag, any source */
    {
        st = et = tag;
        sa = 0; ea = 3;
    }
    else                                  /* one tag, one source */
    {
        st = et = tag;
        sa = ea = addr;
    }

    for (i = st; i <= et && count < length; i++)
        for (j = sa; j <= ea && count < length; j++)
            do
            {
                temp = rbget(&rb[(i << 2) | j]);
                if (temp != NULL)
                {
                    buf[count] = *temp;
                    count++;
                }
            } while (count < length && temp);

    aos_Signal(sem_net_rbuf);
    return count;
}

This is the source code for the network switch. Note that each processor gets its own version of void procN(void), where N is the node number. The code for proc0 is provided below; it can be copied for proc1, proc2, and proc3 with slight modifications: each should receive its own unique PORT and PIN definitions, function name, and procnum assignment.
tation. In order to reduce latency, local messages should be routed internally; this would also reduce memory usage. Finally, broadcast support could be added. Some applications would benefit from sending one command to the switch to transmit the data to all buffers, rather than having to transmit to all nodes sequentially as in the current implementation.

Appendix A: User Manual

1. Program Structure

The network code is designed to be included in a C program targeted at an Atmel ATmega32 microcontroller. This code compiles under CodeVisionAVR as an extension to aOS. Minor changes, consisting mainly of removing aOS-dependent code such as semaphores, would allow use in standalone applications.

A C program written for aOS would start with the following include directives:

#include <mega32.h>
#include <stdio.h>
#include "aos.h"
#include "user_tasks.h"
#include "aos_core.c"
#include "aos_task.c"
#include "aos_semaphore.c"
#include "aos_network.c"     /* network communication library */

The file aos_network.c supplies the network API. The next step is to assign a processor ID (PID) number to the node. The following line would indicate that this processor is node zero:

#define PID (UBYTE)0         /* defines this node's address */

This is followed by the first task in the program code, denoted by the task1 function. This task is responsible for initializing the network with void net_init(void), which contains implementation-specific initialization.
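A minimal program skeleton tying these steps together is sketched below, assuming the include directives and PID definition shown above. Only net_init(), net_send(), and net_recv() belong to the library API; the task body, the payload, and the choice of destination node and tag are invented for the example, and the argument order (address, tag, length, buffer) follows the calls shown in Appendices C and D.

/* Sketch: a node program skeleton using the network API. */
void task1(void)
{
    UBYTE status = 0x01;     /* example payload byte */
    UBYTE reply;

    net_init();              /* initialize the network interface first */

    while (1)
    {
        /* send one byte to node 0 on logical tag 0 */
        net_send(0, 0, sizeof(status), &status);

        /* poll for one byte from any source on any tag (255 = wildcard) */
        if (net_recv(255, 255, sizeof(reply), &reply) == 1)
        {
            /* handle the reply here */
        }
    }
}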
tiating communication, and can reach frequencies up to 400 kHz. Address packets are part of the TWI protocol, as are variable-length messages. However, in order to support arbitration, all masters must use packets of the same length. In a byte-oriented network, the overhead of 9-bit address and data packets effectively gives a maximum throughput of 400 kbits/sec / (18 bits/byte) = 22.2 KB/sec. Due to the arbitration built into the protocol, a separate arbiter is unnecessary. However, with many masters operating on the same bus, contention would become prohibitive and the throughput would be reduced considerably.

Since the above methods all have significant drawbacks, alternative network designs were implemented and evaluated.

3. Network Architecture Version 1

In all three shared-bus interfaces discussed above, serial transmission reduces the throughput considerably. Thus, the first attempt at a new network architecture uses a parallel bus to address this problem. Also, since cooperative arbitration can be slow and may indefinitely prevent a processor from gaining the medium, a hardware arbitration unit with a fairness guarantee is used when transmission privileges over a link are contested.

[Figure 7: Version 1 Topology, showing the nodes on a shared bus with a hardware arbiter (ARB).]

The first version of the network architecture utilizes a 4-wire shared bus to transfer a nibble (4-bit chunk) of data at a time. Four additional control wires are used per node to communicate
to utilize in applications. The implementation should fit into the limited amount of on-chip memory and allow enough free memory to implement client applications. The implementation must provide a good balance between I/O and computation. Additionally, the number of wires needed for all communication links is constrained by the number of pins on a single microcontroller. Ideally, port pin usage should be as low as possible, preferably 1 I/O port, to allow the microcontrollers to have enough I/O resources to interface with external system components. These constraints will in part determine the network topology that will be implemented using the microcontrollers.

III. Hardware Design

1. Components

This design project makes use of the Atmel ATmega32 and ATmega16 microcontrollers. These microcontrollers have an 8-bit RISC architecture running at a maximum of 16 MHz, with most instructions executing within 1-2 cycles, and active power consumption on the order of tens of milliamps. Instructions are stored in 32KB or 16KB of flash memory, and variables are manipulated in 2KB or 1KB of SRAM, for the ATmega32 and ATmega16 respectively. Both offer various peripherals such as on-chip timers, USART, SPI, I2C, and ADC. The chips are packaged in 40-pin PDIPs and have 4 I/O ports with 8 pins per port [1]. The system was developed on the STK500 development board, and the final design was assembled on prototype boards. The chips were connected using ordinary
                  (buf >> ((4 - txstate) << 1)) & (UCHAR)0x3;     /* put data crumb */
    txstate = (txstate >= 4) ? 0 : txstate + 1;                   /* do 4 iterations of data */
    if (txstate == 0)
      data++;
    PORTC = (UCHAR)(PORTC ^ (UCHAR)PTx);                          /* toggle the transmit signal */
  } /* end main loop */
}

Appendix E: System Photographs

[Photographs: the ATmega16 network switch with power and data cables; Node 3; the 4-node cluster; a processing node with its 16 MHz crystal and 22 pF capacitors.]

References

[1] Atmel ATmega32 and ATmega16 Datasheets, Rev. 2503H-03/05. Accessed March 2006. <http://www.atmel.com/products/avr>
[2] aOS, a RTOS for AVR. Accessed March 2006. <http://www.tietomyrsky.fi/projektit/aos>
[3] The Message Passing Interface (MPI) Standard. Accessed March 2006. <http://www-unix.mcs.anl.gov/mpi>
[4] MPI-2: Extensions to the Message-Passing Interface. Accessed March 2006. <http://www.mpi-forum.org>
[5] Hennessy, John L. and Patterson, David A. Computer Architecture: A Quantitative Approach, 3rd ed. Morgan Kaufmann, 2003.
[6] Peterson, Larry L. and Davie, Bruce S. Computer Networks: A Systems Approach, 3rd ed. Morgan Kaufmann, 2003.
[7] A preemptive multitasking OS for Atmel Mega32 microcontrollers. Accessed March 2006. <http://instruct1.cit.cornell.edu/courses/ee476/RTOS/index.html>
kilobytes in size, and the data path may be only 8 or 16 bits wide.

[Figure 1: Harvard and von Neumann architectures. The Harvard architecture has separate program and data memories; the von Neumann architecture has a single unified memory.]

The disjoint program and data memories in a typical microcontroller imply a Harvard architecture, as opposed to a von Neumann architecture [5]. In the von Neumann model, instructions and data are stored in a unified memory. Memory bandwidth is divided among program and data accesses, since they share the same bus. The design complexity is reduced, however, since memory is not specialized. A Harvard architecture, on the other hand, has separate buses to each memory, which can improve performance. Most general-purpose computers have a Harvard-style on-chip L1 cache and a von Neumann main memory. As will become clear, the choice of architecture will have implications for the type of multiprocessing model that can be implemented.

2. Multiprocessor Systems

The simplest definition of a multiprocessor system is a computer that has more than one CPU. However, this belies the breadth of technologies and configurations that consist of multiple CPUs. At the highest level of integration, multiple CPU cores can be placed on a single silicon die to allow chip multiprocessing (CMP). Identical CPUs may also be packaged separately but share a common memory interface in symmetric multiprocessing (SMP) computers. CPUs need not be identical, as in the case of asymmetric