erek
There’s just something to like about RISC-V
Source: https://semiengineering.com/fault-a...ents-in-a-fault-tolerant-risc-v-soc-harv-soc/
“Recent research has shown interest in adopting the RISC-V processors for high-reliability electronics, such as aerospace applications. The openness of this architecture enables the implementation and customization of the processor features to increase their reliability. Studies on hardened RISC-V processors facing harsh radiation environments apply fault tolerance techniques in the processor core and peripherals, exploiting system redundancies. In prior work, we presented a hardened RISC-V System-on-Chip (SoC), which could detect and correct radiation-induced faults with limited fault awareness. Therefore, in this work, we propose solutions to extend the fault observability of the SoC implementation by providing error detection and monitoring. For this purpose, we introduce observation features in the redundant structures of the system, enabling the report of valuable information that supports enhanced radiation testing and helps the application perform actions to recover from critical failures. Thus, the main contribution of this work is a solution to improve fault awareness and the analysis of the fault models in the system. In order to validate this solution, we performed complementary experiments in two irradiation facilities, covering atmospheric neutrons and a mixed-field environment, in which the system proved to be valuable for analyzing the radiation effects on the processor core and its peripherals. In these experiments, we were able to obtain a range of error reports that allowed us to gain a deeper understanding of the fault mechanisms, as well as improve the characterization of the SoC.
Keywords:
RISC-V; System-on-Chip; dependability; radiation effects; radiation testing; neutrons; mixed-field
1. Introduction
The increasing dependence on electronic systems performing complex and critical tasks in modern technologies creates several challenges related to reliability requirements. Many systems are based on powerful processing units that must be fail-safe and guarantee continuous service delivery. Each application sector must comply with standards and guidelines to meet such requirements, as seen in the automotive [1], aerospace [2,3], and military [4] industries. With the introduction of the RISC-V architecture, several projects and research efforts were initiated to adopt these novel RISC-based processors in many application domains. In particular, the open and modular nature of the architecture provided traction for its adoption in high reliability and critical systems [5].
Within these application domains, several reliability requirements are derived from the environmental conditions to which these systems are exposed: temperature variations, pressure profile, mechanical stress, and ionizing radiation. For avionics flying at high altitudes or in orbit, ionizing radiation seriously threatens the dependable operation of such systems [6,7]. The interaction of ionizing particles with electronic devices generates a plethora of effects. Single-Event Effects (SEEs) are an important class of these effects, inducing transient, intermittent, and permanent faulty behaviors [8]. For processors, these effects can be observed as corrupted bits, wrong calculations, and transients, which may result in application output error, data corruption, unexpected termination, and hangs [9].
In order to ensure the dependability of these processing systems, several fault tolerance techniques are applied to the processors’ architecture and the surrounding peripherals. These techniques exploit temporal, spatial, and informational redundancies [10]. Recent work explored and analyzed the effectiveness of these techniques applied to the RISC-V architecture with fault injection campaigns, implemented as soft-core modules inside Field-Programmable Gate Array (FPGA) devices. In [11,12], the authors explore different approaches for improving the architectural elements of the processor. In [13,14], RISC-V cores with Triple Modular Redundancy (TMR) were implemented and validated, achieving significant reliability improvements but with a high resource utilization penalty. As seen in [15], the design of a lockstep RISC-V was proposed to address safety-critical applications. Other works presented hybrid solutions with similar strategies to find an optimal trade-off between performance, resource utilization, and reliability, such as [16,17,18]. In prior work, we proposed a fault-tolerant implementation of a RISC-V system [19,20] designed for FPGAs, known as HARV-SoC, where hybrid architectural redundancy techniques were applied, compared, and evaluated.
These fault-tolerant designs are usually validated through fault injection campaigns, either with simulation environments, emulation strategies, software frameworks, or real stimuli (e.g., exposure to radiation in particle accelerators). All those strategies offer valuable data to enable the reliability assessment of complex systems [9,21]. For application domains that demand dependable systems (e.g., aerospace, military), performing fault injection campaigns with real stimuli is a mandatory step to meet standards criteria [3]. These campaigns are mostly adopted during the development and validation phases of new systems. For this, designers instrument their systems with observation points [20] and prepare meaningful benchmarks [22] to enable observation and measurement of the system’s fault sensitivity. However, most fault-tolerant RISC-V implementations explored in the literature focus mainly on tolerating faults, making in-depth reliability analysis complicated due to the lack of information. As a result, this approach leads to a limited understanding of these complex systems in harsh radiation environments. At the same time, implementing observation structures is a challenging task, given that hard-core processors do not have customization capabilities, and soft-core processors in FPGAs require additional configuration structures that are also susceptible to failures.
In this work, we propose a strategy targeting enhanced error tracking in the HARV-SoC by extending the fault observability of the SoC implementation through runtime error detection. For this purpose, we monitor critical structures of the SoC architecture to report relevant information about the errors triggered by radiation-induced events. This solution allows a better understanding of the underlying impacts of SEEs in the design compared to alternative strategies. Furthermore, it enables more efficient use of the hardening countermeasures and provides the means for the application to perform actions to recover in case of critical failures. To implement and validate the concept, we instrument the HARV-SoC with this solution and evaluate the observability effectiveness through neutron and mixed-field irradiation campaigns. It is worth mentioning that this solution presented many technical challenges since many internal structures had to be prepared for this purpose. Furthermore, prior field expertise was important in guiding the definition of the information to be monitored and reported in an effective manner.
The remainder of the paper is structured as follows: Section 2 presents the related work; Section 3 describes key aspects of our RISC-V implementation and its fault tolerance and awareness features; Section 4 presents the proposed experimental strategy; Section 5 presents the results and analysis; Section 6 discusses these results and outcomes; and Section 7 concludes the work.”
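For anyone wondering what "TMR with fault awareness" looks like in practice, here is a rough Python sketch of the general idea: a bitwise majority voter over three redundant copies of a word that also reports which bits disagreed and which copy looks faulty. This is not the HARV-SoC implementation (that lives in FPGA logic); the names `VoteResult` and `tmr_vote` are made up for illustration, and the reporting fields are only an assumption about the kind of information such a monitor might expose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteResult:
    value: int                   # majority-voted word
    mismatch_mask: int           # bits where the three copies did not all agree
    faulty_copy: Optional[int]   # index of the single disagreeing copy, if exactly one


def tmr_vote(a: int, b: int, c: int) -> VoteResult:
    """Bitwise 2-of-3 majority vote with a simple error report (hypothetical helper)."""
    # Each output bit takes the value held by at least two of the three copies.
    voted = (a & b) | (a & c) | (b & c)
    # A bit is flagged when any pair of copies differs at that position.
    mismatch = (a ^ b) | (a ^ c) | (b ^ c)
    # Copies that differ from the voted word are the suspected faulty ones.
    bad = [i for i, copy in enumerate((a, b, c)) if copy != voted]
    faulty = bad[0] if len(bad) == 1 else None
    return VoteResult(voted, mismatch, faulty)


# Usage: flip bit 3 in the second copy to mimic a single-event upset.
golden = 0b10110010
report = tmr_vote(golden, golden ^ (1 << 3), golden)
assert report.value == golden       # the upset is masked by the vote
assert report.mismatch_mask == 0b1000
assert report.faulty_copy == 1
```

A plain voter would return only the voted value and silently hide the upset. Keeping the mismatch mask and the suspected copy is what lets a test campaign count and locate errors instead of just surviving them.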
