
Microprocessor fault-tolerance via on-the-fly partial

1. IV. CONCLUSIONS AND FUTURE WORK

[...] systems. By using the freeze [...] we can tolerate transient [...] partial reconfiguration [...] using the appropriate [...] only causes performance degradation when executing some specific tasks. The prototype presented in this paper can be extended with a more complex architecture in order to cover faults happening in any pipeline stage, and could be the starting point to evaluate more complex recovery techniques for FPGA-based designs.

REFERENCES

[1] M. Pflanz and H. T. Vierhaus, "Online check and recovery techniques for dependable embedded processors," IEEE Micro, vol. 21, pp. 24-40, Sept. 2001.
[2] A. Doumar and H. Ito, "Detecting, diagnosing, and tolerating faults in SRAM-based field programmable gate arrays: a survey," IEEE Trans. VLSI Syst., vol. 11, pp. 386-405, June 2003.
[3] P. Sedcole, P. Y. K. Cheung, G. A. Constantinides, and W. Luk, "Run-time integration of reconfigurable video processing systems," IEEE Trans. VLSI Syst., vol. 15, pp. 1003-1016, Sept. 2007.
[4] A. Tumeo, S. Borgio, D. Bosisio, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A multiprocessor self-reconfigurable JPEG2000 encoder," in Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pp. 1-8, May 23-29, 2009.
[5] J. A. Clark and D. K. Pradhan, "Fault injection: a method for validating computer-system dependability," Computer, vol.
2. 28, pp. 47-56, June 1995.
[6] S. B. Ko and J. C. Lo, "Efficient realization of parity prediction functions in FPGAs," J. Electron. Test., vol. 20, no. 5, pp. 489-499, 2004.
[7] M. Portolan and R. Leveugle, "A highly flexible hardened RTL processor core based on LEON," in Proc. 8th European Conference on Radiation and Its Effects on Components and Systems (RADECS 2005), pp. J7-1 to J7-6, Sept. 19-23, 2005.
[8] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, "Run-time partial reconfiguration speed investigation and architectural design space exploration," in Proc. International Conference on Field Programmable Logic and Applications (FPL 2009), pp. 498-502, Aug. 2009.
[9] A. S[...], Stefano Di Carlo, Paolo Prinetto, "A FPGA-based reconfigurable software architecture for highly dependable sys[...]," [...] ATS [...]
[...] Conference [...] 09-415, June 23-26, 2002.
[...] J. Karlsson, "On latching proba[...] in combinational networks," in Proc. [...]
[15] [...] FPGA Configuration User Guide, v1.11 ed., 2009.
[16] Xilinx [...], Early Access Partial Reconfiguration User Guide, v1.2 ed., 2008.
[17] Xilinx Inc., Virtex-4 Family Overview, v3.0 ed., 2007.
3. Its pipeline is com[...]ured with a mini[...] no cache. Due to the particular description style used in Gaisler libraries [14], the whole processor pipeline is described in a behavioral way with just 2 processes: one implementing the full functionality of the pipeline stages, and the other implementing the sequential logic.

This particular style showed both advantages and disadvantages for the purposes of our design. On one side, it allowed the easy implementation of the pipeline freezing logic with a simple modification to the sequential process; on the other, it required additional reverse-engineering work on the combinational process in order to isolate the ALU/shifter logic from the other logic implemented in the Execute stage of the pipeline.

Once the functionality had been clearly identified, a new entity (the hardware SFU), with the minimal interface needed to provide the same functionality, was developed. It implements the following operations:

• Logic: AND, NAND, OR, NOR, XOR, XNOR
• Arithmetic: ADD, SUB
• Shift

This unit has been set to be mapped on the reconfigurable area in case of fault detection. Moreover, following the approach suggested in [7], another instance of this unit has been inserted, together with parity calculators on the input and on the output, in order to provide error detection, leaving to the synthesis tools the task of optimizing the resulting logic. This simple scheme has been used because of i
4. and at the end of the communication with the device.

III. CASE STUDY AND EXPERIMENTAL RESULTS

As a proof of concept, we implemented the architecture described in the previous section using a SoC based on the LEON3 CPU [10], [11], in order to tolerate ALU faults by replacing a previously allocated DES (Data Encryption Standard) crypto core. According to the taxonomy previously introduced, we identify the ALU as a critical RFU and the DES crypto core as a non-critical RFU. Both have corresponding SFUs: hardware for the ALU, software for the DES core. We chose this example because both components are self-contained and general, and they occupy a comparable area. The resulting design has been synthesized and tested on a Xilinx Virtex-4 FPGA [17].

A. Processor architecture adaptation

LEON3 is a highly configurable 32-bit processor core conforming to the SPARC V8 architecture. It is designed for embedded applications, combining high performance with low complexity and low power consumption. The LEON CPU is widely used in the aerospace [...] and its fault-tolerant version has been validated th[...] radiation tests [12]. However, the implemented techniques address typical radiation-induc[...] [...]volving sequential logic only. This cho[...] the fact that the probability of a logic [...] propagating into regist[...] reported in [13]. This [...] mode [...] using [...] overhead, thus its use is demanding in terms o
5. system memory or an external slower memory such as a flash, depends on many factors [...] the system availability [...] some master interfaces able to read the main [...] and a totally SRAM-based device in which [...]. The results presented in [8] show that [...]M-based solution is several orders of [...] slave interface reaching the physical [...]P (Internal Configuration Access [...]); the results also show that a well [...].

Moreover, the Reconfiguration Manager is responsible for monitoring the CPU error signal, starting a reconfiguration once a fault has been detected as permanent, and de-asserting the freeze signal once the reconfiguration has completed.

D. Software Support

In general, the proposed solution keeps the processor totally unaware of the reconfiguration. However, the software needs to be informed about which core is currently placed inside the reconfigurable area, in order to avoid communicating with a peripheral that is no longer present in the system. This can be accomplished by reading the Reconfigurable Area Status Register, which holds the signature of the currently available core. If a core is not available, its function is performed by a software Spare Functional Unit. The switching concept is introduced in [9] and has been adapted to take into account the fact that the system cannot rely on the CPU during the reconfiguration. To avoid any ambiguity, the presence check should be done by the driver at the beginning
6. Figure 5. Simulation of the pipeline freezing logic: detail of ALU operands.

To simulate a fault, the error signal has been connected to one of the board switches instead of the error detection logic. This allowed us to trigger a full reconfiguration cycle during the execution of the program. As the simulation results in Figure 5 show, the freezing logic keeps the inputs of the combinational circuits constant during the reconfiguration. The figure shows the detail of the clock and freeze signals, together with the ALU operands. Once the freeze signal is de-asserted, the computation resumes without any further operation.

Using the ICAP interface, reprogramming the reconfigurable area with the External ALU partial bitstream required on average 0.5 ms, which in terms of bandwidth is equivalent to about 150 MB/s. This figure is acceptable for most systems and scales almost linearly with clock frequency, as long as the memory throughput does not become a bottleneck.

Table III. BITSTREAM SIZES: Full Initial Static DES 581.8 KB; External ALU 67.2 KB; Blanking bitstream [...]

Obviously, the actual reconfiguration time depends on the size of the partial bitstream (Table III), which becomes another important parameter from a system point of view: trying to minimize the area occupied by a component not only improves device utilization, but could lead to a shorter reconfiguration time as well.
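As a rough cross-check of the timing figures above, the following Python sketch relates bitstream size, throughput, and reconfiguration time. It is only an illustration under stated assumptions: the function names are hypothetical, the 67.2 KB size is taken to be the External ALU partial bitstream as read from the partially legible table, 1 KB is assumed to be 1024 bytes, and the ICAP is assumed to run at the 66 MHz system clock, writing one 32-bit word per cycle.

```python
# Back-of-the-envelope model of partial reconfiguration timing.
# All names are illustrative; none come from the original design.


def icap_peak_throughput(clock_hz: float, port_width_bits: int = 32) -> float:
    """Peak bytes per second if one full-width word is written every cycle."""
    return clock_hz * port_width_bits / 8


def effective_throughput(bitstream_bytes: float, measured_time_s: float) -> float:
    """Bytes per second actually achieved for a measured reconfiguration."""
    return bitstream_bytes / measured_time_s


def reconfiguration_time_s(bitstream_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time needed to write a bitstream at a given sustained throughput."""
    return bitstream_bytes / throughput_bytes_per_s


def frozen_cycles(reconfig_time_s: float, clock_hz: float) -> int:
    """Number of CPU clock cycles the pipeline stays frozen."""
    return round(reconfig_time_s * clock_hz)


if __name__ == "__main__":
    CLOCK_HZ = 66e6                    # system clock reported for the board
    ALU_BITSTREAM_BYTES = 67.2 * 1024  # assumption: KB means 1024 bytes
    MEASURED_TIME_S = 0.5e-3           # average time reported in the text

    peak = icap_peak_throughput(CLOCK_HZ)
    achieved = effective_throughput(ALU_BITSTREAM_BYTES, MEASURED_TIME_S)
    best_case = reconfiguration_time_s(ALU_BITSTREAM_BYTES, peak)

    print(f"peak ICAP throughput : {peak / 1e6:.1f} MB/s")
    print(f"achieved throughput  : {achieved / 1e6:.1f} MB/s")
    print(f"best-case reconfig   : {best_case * 1e3:.3f} ms")
    print(f"frozen cycles        : {frozen_cycles(MEASURED_TIME_S, CLOCK_HZ)}")
```

Under these assumptions the measured transfer works out to roughly 138 MB/s, the same order as the quoted "about 150 MB/s", against a theoretical peak of 264 MB/s, and the pipeline stays frozen for about 33,000 clock cycles per reconfiguration.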
7. Politecnico di Torino

Microprocessor fault-tolerance via on-the-fly partial reconfiguration

Published in the Proceedings of the IEEE 15th European Test Symposium (ETS), 24-28 May 2010, Praga (CZ).

N.B. This is a copy of the ACCEPTED version of the manuscript. The final PUBLISHED manuscript is available on IEEE Xplore:
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5512759
DOI: 10.1109/ETSYM.2010.5512759

2000 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Microprocessor fault-tolerance via on-the-fly partial reconfiguration

Stefano Di Carlo, Andrea Miele, Paolo Prinetto, Antonio Trapanese
Politecnico di Torino, Dipartimento di Automatica e Informatica, I-10129 Torino, Italy
e-mail: firstname.lastname@polito.it

Abstract. This paper presents a novel approach to exploit FPGA dynamic partial reconfiguration to improve the fault tolerance of complex microprocessor-based systems, with no need to statically reserve area to host redundant components. The proposed method not only improves the survivability of the system by allowing the online replacement of defective key parts of the pr
8. circuits while the reconfiguration takes place. The freezing technique can be used by the Reconfiguration Manager to discriminate transient faults from permanent ones, thus avoiding unnecessary reconfigurations and transient error propagation at the same time.

The system's awareness of its current configuration is provided by the Reconfigurable Area Status Register, which can be read by other modules in order to perform operations coherently with the available hardware resources.

[...] applied to a real-world [...]

[Figure: conceptual block diagram of the proposed architecture. Block labels: AMBA AHB, LEON3, Reconfigurable Area, DES, AMBA APB.]

Such architecture addresses the error detection and recovery of the combinational logic of a processor's pipeline stage. It requires a reconfigurable area, which is initially preallocated to a device of the SoC, a properly adapted processor, and a Reconfiguration Manager in charge of managing the resource allocation of the reconfigurable area. Once an error is detected, the CPU pipeline is frozen in order to keep all the combinational logic inputs constant, and the recovery process starts. This process consists of waiting a few clock cycles before starting the reconfiguration, thus ensuring that the error is caused by a permanent fault. In case of a transient fault, once the logic starts to behave normally, the error signal is immediately de-asserted and the execution resume
9. logic is responsible for both blocking the CPU operation during the reconfiguration and restoring it once it is completed. During normal operation, the error signal is connected both to the Reconfiguration Manager and to the pipeline freeze signal: as soon [...] error signal is asserted, the pipeline is blocked [...] window the error is still present, the Reconfiguration Manager overrides the freeze signal to ensure t[...] occur during reconfiguration doesn't [...] and starts the reconfiguration.

[...] reconfigurable black box. During place & [...] been set and the bus [...] by using bus macros to conn[...] part, which is instantiated [...]figurable area need almost no modification compared to a normal design. The only requirement is that they must share the same entity interface declaration: the APB DES interface must contain the ALU interface signals, and the external ALU interface must contain the APB signals, even if these are left unconnected inside the entity. In order to reduce the number of signals crossing the reconfigurable area boundary, the APB interface has been split into a static and a dynamic part: the static part contains fixed configuration data used for PnP (Plug and Play) device detection, while the dynamic part is used for actual data transfers, interrupt routing, and device addressing, and is the only one that actually crosses the reconfigurable area boundary through bus macros.

D. Experimental Results

The system has been synthesized
10. [...] FPGA SoCs is the glitch-free [...]uration feature of modern FPGAs, a feature [...] allows time sharing of a [...]figurable area on the FPGA fabric between [...] found to be defective, it is on-the-fly [...] one mapped into a reconfigurable area.

Figure 1. Error detection timing (© IEEE 2001, from [1]). [Diagram labels: fault occurrence, fault latency, effective fault, error latency, error detection, failure activation, maximum latency for fault tolerance mechanism.]

Using as reference the taxonomy introduced in [5] and depicted in Figure 1, the proposed architecture aims at detecting an error and correcting the fault that generated it by means of reconfiguration, before the error propagates and causes a failure. Typically, the time between the Detection Activation and the Failure Activation is too short to accommodate a device reconfiguration; a freezing technique is therefore used to delay the failure activation for as long as needed to perform the required operations. In this way, resuming the execution requires no complex rollback or restoration operations.

The paper is organized as follows: in Section 2 we present the general concepts behind our proposed architecture; Section 3 presents a case study with related experimental results; in Section 4 we draw some conclusions and outline some possible extensions and future work.

II. PROPOSED ARCHITECTURE

The fa
11. ocessor, but also provides graceful performance degradation by executing in software the tasks that were executed in hardware before a fault and the subsequent reconfi[...] happened. The advantage of the proposed approa[...] thanks to a hardware hypervisor, the CPU is totall[...]

Embedded systems are nowadays wid[...] safety-critical applications that [...] impos[...] very [...] often con[...]. Several solutions do [...]erance depending on [...] often requires compro[...]ance, fault detection delay, etc. Moreover, [...] developed to be applied on ASICs (Application Specific Integrated Circuits), while nowadays several custom embedded SoCs (Systems on Chip) are realized on FPGAs (Field Programmable Gate Arrays). Even if the proposed techniques still maintain their effectiveness, they are not optimized to take advantage of the most recent FPGA technologies, and they do not consider the nature of FPGA-specific faults.

Although the symptoms of transient or permanent faults on FPGAs can be categorized into component and control faults, as proposed in [1], their source can be different and more complex to discover than on traditional ASICs. In an FPGA design, a permanent fault does not necessarily identify a damaged circuit: it can be generated by a bit flip in the FPGA configuration memory. Even if a more general approach could be used, traditional fault detection, diagnosis, and recovery methods for SRAM-based FPGAs tend to be very [...] new and generalized fau
12. on the Xilinx ML403 board, equipped with the XC4VFX12 FPGA [17], and several experiments have been executed in order to validate the concepts introduced in previous sections. The final design occupied almost all of the logic resources available on the FPGA, as reported in Table I, and the operating frequency remained unchanged (66 MHz) compared to the static reference design provided with Gaisler libraries for that board.

Table I. FPGA RESOURCES: [...] Slices 5472, 5426; Reconfigurable Area Capacity [...]; DES [...]; External ALU [...]

[...]tem has been tes[...] C prototype program [...] which executed a fu[...]yption/decryption cycle; the [...]gram included a [...]ities to measure the execution times and t[...] use the software version if the hardware de[...]. [...]surements showed that the system is able to keep working normally after a reconfiguration, since only the DES encryption/decryption time is affected: Table II shows one order of magnitude performance degradation due to the lack of hardware support. It is interesting to note that the resulting layout kept working at the original clock frequency, which had been set with a wide safety margin over the critical path length.

[Simulation waveform; signal labels: in_data op1, in_data op2]
13. s normally on the next clock cycle. The conceptual recovery process is described in Algorithm 1.

A. Reconfigurable Area

Dimensioning, positioning, and connecting the reconfigurable area to the rest of the SoC are left to the designer and depend on the particular modules (normal and spare) that should be allocated on it. Once the reconfigurable area position is locked, the fixed interfaces need to be properly placed, taking into account the maximum path length optimization both inside and outside the reconfigurable module. Since the reconfigurable area size imposes an upper limit on the number of boundary-crossing nets (for example, bus macros on Xilinx FPGAs [16]), it may be necessary to use multiplexers and demultiplexers to use the same nets for multiple purposes.

Algorithm 1. General description of the recovery process

    if error detected
        freeze the pipeline
        wait for N clock cycles
        if error still present
            reconfigure with spare units
        unfreeze the pipeline

B. Processor

In general, to allow the online replacement of a p[...] the combinatorial logic of a CPU's pipeline stag[...] permanent faults on the ALU, which is the Execute stage of the pipeline. Online concurrent error detectio[...] can be used as [...] in [7]. [...] added with outputs directly connecte[...] which will have a significant [...] second module, as reported i[...]. The error signal m[...] inform the reconfiguratio[...] to be connected to the [...] detected [...] tain the pre
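The recovery flow of Algorithm 1 can be sketched as a small cycle-level state machine. This is a toy Python model, not the authors' hardware: the class name, the state names, the `run` helper, and the `reconfig_cycles` parameter are assumptions introduced for illustration, and the error signal is assumed to be sampled once per clock cycle.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RecoveryController:
    """Toy model of Algorithm 1 (hypothetical names throughout)."""

    wait_cycles: int        # N: cycles to wait before treating the fault as permanent
    reconfig_cycles: int    # assumed duration of the partial reconfiguration, in cycles
    state: str = "RUN"      # one of "RUN", "WAIT", "RECONFIG"
    counter: int = 0
    reconfigurations: int = 0

    def step(self, error: bool) -> bool:
        """Advance one clock cycle; return True while the pipeline is frozen."""
        if self.state == "RUN":
            if error:
                # Error detected: freeze immediately and start the wait window.
                self.state, self.counter = "WAIT", 0
        elif self.state == "WAIT":
            if not error:
                # Error disappeared within the window: transient fault, resume.
                self.state = "RUN"
            else:
                self.counter += 1
                if self.counter >= self.wait_cycles:
                    # Error persisted for N cycles: treat it as permanent.
                    self.state, self.counter = "RECONFIG", 0
                    self.reconfigurations += 1
        elif self.state == "RECONFIG":
            # The freeze is held regardless of the error signal until the
            # spare unit has been loaded.
            self.counter += 1
            if self.counter >= self.reconfig_cycles:
                self.state = "RUN"
        return self.state != "RUN"


def run(controller: RecoveryController, error_signal: Iterable[bool]) -> List[bool]:
    """Feed a sequence of error samples and collect the freeze signal."""
    return [controller.step(e) for e in error_signal]


if __name__ == "__main__":
    transient = RecoveryController(wait_cycles=3, reconfig_cycles=3)
    print(run(transient, [False, True, True, False, False]), transient.reconfigurations)

    permanent = RecoveryController(wait_cycles=2, reconfig_cycles=3)
    print(run(permanent, [True] * 6 + [False]), permanent.reconfigurations)
```

In this model a transient error only stalls the pipeline for as long as it is visible, while a persistent one triggers exactly one reconfiguration before execution resumes; no rollback state is kept in either case.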
14. ts straightforward implementatio[...]d acceptable area [...] sophisticated architecture a Berger [...] In a more [...]ggested in [...]onnect the external ALU [...]riven by the original signals [...]o take into account that if the freeze signal is high no action [...]. Practically, the sequential process has been modified [...]

Figure 4. Detail of the modified LEON3 processor pipeline. [Diagram labels: Reconfigurable [...], Reconfiguration Manager, Write Back.]

B. Reconfiguration Manager Architecture

The Reconfiguration Manager is conceptually split into 3 different subsystems:

• storage memory read and configuration memory write systems
• processor pipeline freezing control
• software support facilities

The component is equipped with 2 different interfaces to the AMBA bus. It is at the same time an AHB master and an APB slave: the master interface is used to fetch the configuration bitstream from the storage memory (the main system memory or another external memory), while the slave interface is used to provide the software support facilities.

Access to the FPGA configuration memory is provided through the 32-bit Virtex-4 ICAP port [17], which is instantiated as a black box and mapped on the target device during place & route. A buffer is used to store temporary bitstream data in order to achieve high performance, since this port can write up to 32 bits per clock cycle [15]. Finally, the pipeline freezing control
15. ult-tolerant architecture presented in this paper comprises a set of Replaceable Functional Units (RFU), a set of Spare Functional Units (SFU), a Reconfiguration Manager, and a Reconfigurable Area. Based on health status monitoring and other parameters, the Reconfiguration Manager decides whether to use an SFU instead of the corresponding Replaceable Functional Unit.

Replaceable Functional Units can be critical or non-critical, and the corresponding Spare Functional Units can be hardware or software: critical RFUs are equipped with a concurrent error detection system and can be replaced by a hardware sp[...]ni[...], while non-critical RFUs may or may not have a BIST faci[...] can be replaced by a software SFU.

Critical RFUs are initially hardwired on the de[...] are placed in a non-reconfigurable area and [...] Functional Unit (FU2) [...] when FU1 [...] SFU is placed inside the Recon[...] previously allocated FU2, whi[...] corresponding [...] erasing the [...]

Figure 2. General Functional Units swapping architecture. [Diagram labels: Software, Reconfigurable Area.]

The Reconfiguration Manager holds a repository of Spare Functional Units in the main system memory or a mass storage device, corresponding to an archive of device configuration bitstreams for hardware SFUs and an archive of compiled modules for software SFUs.

The process of swapping an RFU with an SFU happens with the system being totally unaware of it, since the Reconfiguration Manager has the ability to freeze the involved
16. viously illustrated above: this allows automatic tolerance of transient faults. If the error signal is de-asserted within the few clock cycles counted between the error detection and the start of the reconfiguration, the computation resumes without any loss or costly rollback operation. In the same way, if the fault is permanent, the computation resumes after the time needed by the reconfiguration, still without any extra rollback operation.

In addition, the top-level CPU interface must contain all ALU inputs and outputs, which need to be properly connected to the Execute stage signals and to the reconfigurable area interface: a multiplexer selects between the output coming from the internal ALU and the one coming from the external spare ALU. The ALU input signals are simply connected to the reconfigurable module interface.

C. Reconfiguration Manager

The Reconfiguration Manager is the component responsible for the management of the reconfigurable area. In particular, it is responsible for the following tasks:

• access the storage memory containing all reconfigurable modules' bitstreams
• access the configuration port of the reconfigurable area
• monitor the CPU health status and manage the sequence of reconfiguration steps
• maintain and export the information about the currently available device in the reconfigurable area

In order to read the storage memory, which could be the main
