Cite this as
Takahashi M, Anang Y, Watanabe Y (2020) A Hazard Analysis Method for Embedded Control Software with STPA. Trends Comput Sci Inf Technol 5(1): 082-096. DOI: 10.17352/tcsit.000029
This paper proposes a method for analyzing hazards that arise from interactions between hardware and software in an apparatus that runs Embedded Control Software (ECSW). A hazard is a state that negatively affects the apparatus when certain adverse conditions are satisfied. In particular, the method aims to identify the ECSW portions that cause the hazards. The proposed method is outlined as follows: (1) develop ECSW specifications written in Unified Modeling Language (UML) and collect accident information; (2) conduct a safety analysis (System-Theoretic Process Analysis: STPA) with the ECSW specifications and accident information as input, and generate the list of hazards and hazard scenarios; (3) develop sequence diagrams corresponding to the hazard scenarios, and identify the program portions (Hazard Causal Factors: HCFs) that cause the hazards; and (4) conduct Failure Mode and Effects Analysis (FMEA), and apply countermeasures to prevent the hazards from occurring. Applying this method to a sample ECSW confirmed that a safe ECSW can be developed.
First, the technical terms used in this paper are explained. An accident is an event that causes a loss for the target system, and a loss is a negative effect on the users, the environment, the mission, or the target system. A hazard is a system state that negatively affects the target system when certain adverse conditions are satisfied.
Recently, industrial products such as cars, medical apparatuses, and aerospace apparatuses have been developed as systems that combine hardware and software, and their configurations and controls have become complex. As a result, unintended accidents occur when these products are used. Such accidents occur when a hazard arising from interactions between hardware and software during use coincides with adverse conditions that turn the hazard into an accident. This accident model is called the Systems-Theoretic Accident Model and Processes (STAMP) model. The safety analysis method that is based on the STAMP model and identifies hazards and hazard scenarios is called System-Theoretic Process Analysis (STPA) [1].
This paper proposes a method that identifies hazards and proposes safety countermeasures once the functional specifications for Embedded Control Software (ECSW) have been developed. In the proposed method, STPA is conducted with the ECSW system specifications as input; these consist of use-case diagrams and class diagrams written in Unified Modeling Language (UML). STPA identifies the hazards and yields the hazard scenarios. Sequence diagrams corresponding to the hazard scenarios are then developed, and the Hazard Causal Factors (HCFs) are identified. Here, the HCFs are attributed to the execution and/or non-execution of methods in the classes.
This paper is organized as follows. Section 2 describes related work. Section 3 outlines the proposed method. Section 4 describes the application and evaluation of the proposed method. Section 5 describes future work.
This section describes the previous studies and STAMP/STPA.
Previous studies: Previous studies fall into two groups: development standards for safe ECSW in various industrial products, and safety analysis methods. First, the standards for developing safe ECSW in various industrial products are explained. Accidents involving industrial products that require a high level of safety have harmed human lives and the environment. Regulatory authorities therefore require manufacturers to follow development processes that conform to the relevant development standards, and they also require sufficient safety analysis of the products. Examples of such standards are JIS T2304 [2], IEC62394 [3], and IEC82304-1 [4] in the medical device domain, Good Automated Manufacturing Practice [5] in the pharmaceutical production system domain, ISO26262 [6] in the automobile domain, and DO-178C [7] and JAXA JMR-001 [8] in the aerospace domain. Because those standards do not describe concrete safety analysis procedures in detail, misunderstandings of a standard have often led to additional work.
Second, various safety analysis methods are explained. Takahashi et al. proposed a method that identifies all accidents that might occur and decides countermeasures for them using Failure Mode and Effects Analysis (FMEA) [9]. Weber et al. proposed a fault detection method, based on Fault Tree Analysis (FTA), for avionics software written in assembler [10]. Leveson et al. showed that a Fault Tree (FT) can be developed by preparing FT templates corresponding to the essential instructions of the ECSW and combining those templates [11,12]. Takahashi et al. proposed rules for developing an FT automatically by tracing the process that caused the accident and combining the FT templates [13]. Pai et al. proposed a method that calculates system reliability from design specifications written in UML [14]. Although those methods identify the causes of failures at the component level of an industrial product, they cannot deal with complex failures that arise from interactions between components. To address this problem, Leveson et al. proposed a method that can deal with complex failures (accidents) arising from interactions between components. The details of this method are explained in the next section.
This section describes the STAMP model and STPA [1].
Figure 1 shows the STAMP model. In the STAMP model, the system consists of a controller, a process model, and a controlled process. The process model represents the state of the controlled process as the controller assumes it to be. The controller sends Control Actions (CAs) to the controlled process based on the state of the process model, and it updates the process model according to the CAs it has sent. The controlled process changes its internal state according to the received CA and returns the result to the controller as Feedback Data (FBD). When the state of the process model matches the state of the controlled process, the system is in a safe state. When the two states do not match, the system is in an unsafe state, and hazards occur.
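The relation between the process model and the controlled process can be illustrated with a minimal Python sketch. It is not taken from the paper: the class names, the string-valued states, and the `delivered` flag (used here to model a CA that never reaches the process) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControlledProcess:
    """The controlled process; holds the real state."""
    state: str = "open"

    def apply(self, control_action: str) -> str:
        # Change the internal state according to the CA and return it as FBD.
        self.state = control_action
        return self.state


@dataclass
class Controller:
    """Holds the process model: the state the controller assumes."""
    process_model: str = "open"

    def send(self, control_action: str, process: ControlledProcess,
             delivered: bool = True) -> None:
        # The controller updates its process model from the CA it sent.
        self.process_model = control_action
        # Assumption: `delivered=False` models a CA lost on the way.
        if delivered:
            process.apply(control_action)


def is_safe(controller: Controller, process: ControlledProcess) -> bool:
    # Safe state: the process model matches the real process state.
    return controller.process_model == process.state
```

In this sketch, sending "closed" normally keeps the two states aligned, while a later "open" that is not delivered leaves the controller assuming "open" while the process is still "closed", which is the mismatch the model treats as unsafe.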
The STPA procedure is as follows. First, the accidents and hazards of the target system are defined, together with the Safety Constraints (SCs). Second, the Control Structure Diagrams (CSDs) are developed. Figure 2 shows an example of a CSD. The CSD defines the components (subsystems and apparatuses) needed to realize the SCs and the interactions (CAs and FBD) between them. Third, Unsafe CAs (UCAs) are defined. The CAs needed to enforce the SCs are identified in the CSD, and UCAs are derived by applying “the 4 keywords to identify the UCAs that cause the hazards” (not providing; providing; too early/too late or in an inappropriate sequence; stopped too soon/applied too long) to the identified CAs. Fourth, the conditions under which each UCA causes a hazard are identified. The controllers and controlled processes related to each UCA are extracted from the CSD, and the control loop related to the UCA is identified. The guide words are applied one by one to the UCA in the control loop, and each combination is examined to see whether it causes the hazard. Figure 3 shows the 11 guide words for identifying HCFs in the control loop. When a hazard occurs, the conditions that cause it are identified; those conditions are the HCFs. In addition, scenarios that describe the process from the occurrence of the HCF to the hazard are developed. Finally, countermeasures that prevent the hazard are developed by considering the hazard scenarios.
This section outlines the proposed method. Subsection A describes the overall outline, and subsection B describes each task that makes up the proposed method.
Figure 4 shows the outline of the proposed method. The proposed method can be applied once requirement definition and functional design are complete, that is, once the use-case diagrams and class diagrams have been developed. The proposed method consists of four tasks. First, the “development of the UML system specification” task describes the information related to the system’s elements, configuration, and control. Second, the “development of the hazard scenario using STPA” task decides the accidents, hazards, SCs, and hazard scenarios for the target system. Third, the “development of the sequence diagrams corresponding to the hazard scenario and the assignment of the HCF to the classes” task develops sequence diagrams for the hazard scenarios based on the use-case diagrams and class diagrams of the ECSW. As a result, the portions that cause the hazards (HCFs) are identified. Finally, the “conduction of the FMEA for each HCF” task conducts FMEA on each HCF, evaluates the negative impact of the accident, and, if necessary, applies countermeasures that prevent the HCF (and thus the hazard) from occurring.
Development of the UML system specifications: The “development of the UML system specifications” task develops the use-case diagrams and class diagrams for the target system. Hereinafter, those diagrams are called the UML system specifications. The use-case diagrams describe the target ECSW and the apparatuses (hardware) that interact with the ECSW. The apparatuses are used when developing the sequence diagrams in the “development of the sequence diagrams corresponding to the hazard scenario and assignment of the HCF to the classes” task. The class diagrams describe the classes and methods in the ECSW; they are likewise used when developing the sequence diagrams.
Development of the hazard scenario: The “development of the hazard scenario using STPA” task decides the HCFs by considering the UML system specifications, accidents, hazards, and SCs, and develops the hazard scenarios.
First, the target accident is decided by considering how the target system is used. The hazard that causes the accident and the conditions under which the hazard leads to the accident are decided. The SCs are then defined based on those conditions.
Second, the CSD is developed from the use-case diagrams and class diagrams in the UML system specifications. The components in the CSD are the actors in the use-case diagrams and the classes in the class diagrams. A CA between components represents a method invocation between related classes, and the direction of the CA corresponds to the direction of navigability of the association. The data between components represent the return values of the invoked methods. Figure 5 shows the correspondence between the UML system specifications and the CSD.
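The mapping from the UML system specifications to the CSD can be sketched in Python as follows. This is an illustrative sketch rather than the authors' tooling: the `Invocation` record, the `build_csd` function, the dictionary layout, and the `status` return value in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invocation:
    """One method invocation along an association in the class diagram."""
    caller: str                    # class that invokes the method
    callee: str                    # class that owns the method
    method: str
    returns: Optional[str] = None  # name of the return value, if any


def build_csd(actors, classes, invocations):
    """Derive CSD components, CAs, and feedback data from UML elements."""
    # Components: actors of the use-case diagrams plus classes.
    components = list(actors) + list(classes)
    # A CA is a method invocation, directed from caller to callee.
    control_actions = [(i.caller, i.callee, i.method) for i in invocations]
    # Feedback is the return value, flowing back from callee to caller.
    feedback = [(i.callee, i.caller, i.returns)
                for i in invocations if i.returns is not None]
    return {"components": components,
            "control_actions": control_actions,
            "feedback": feedback}
```

For instance, a single invocation of closeBar&startAlarm from a control class on a device class, with a hypothetical `status` return value, yields one CA from the control class to the device class and one feedback entry in the opposite direction.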
Third, UCAs are derived from all combinations of the CAs in the CSD and “the 4 keywords to identify the UCAs that cause the hazards”. Table 1 shows the diagnostic table used for identifying UCAs. The SCs that each UCA conflicts with are written in the cells of the table.
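A diagnostic table of this kind can be sketched as a mapping from (CA, keyword) pairs to the violated SCs. The sketch below is an assumption-laden illustration: the keyword labels follow the commonly used STPA wording and may differ from the labels in Table 1, the function name is hypothetical, and the two CA names are borrowed from the railroad crossing example later in the paper.

```python
from itertools import product

# Assumption: keyword labels in the common STPA wording, not copied from Table 1.
KEYWORDS = (
    "not providing",
    "providing",
    "too early / too late / wrong order",
    "stopped too soon / applied too long",
)


def empty_uca_table(control_actions):
    """Return one empty cell per (control action, keyword) combination."""
    return {(ca, kw): [] for ca, kw in product(control_actions, KEYWORDS)}


table = empty_uca_table(["closeBar&startAlarm", "openBar&stopAlarm"])
# The analyst records each violated safety constraint in the matching cell.
table[("closeBar&startAlarm", "not providing")].append("SC1")
```

Cells that stay empty after the review correspond to combinations the analyst judged not to conflict with any SC.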
Fourth, the control loop in which the UCA causes the hazard is identified from the UCA and the CSD. The 11 guide words that may indicate an HCF are applied to the UCA in the control loop, and each combination of UCA and guide word is evaluated to see whether it leads to a hazard. If it does, the conditions (HCFs) are investigated and identified. Furthermore, the process leading to the hazard is defined as the hazard scenario.
The “development of the sequence diagrams corresponding to the hazard scenario and assignment of the HCF to the classes” task develops the sequence diagrams corresponding to the hazard scenarios. The lifelines in the sequence diagrams are the actors in the use-case diagrams and the classes in the ECSW. The messages exchanged between the lifelines are the methods of the classes in the ECSW, and the direction of each message corresponds to the direction of navigability in the class diagrams. The hazard occurs as a result of exchanging messages according to the developed sequence diagram. Therefore, the execution and/or non-execution of the methods in the sequence diagram are regarded as HCFs, and those HCFs are assigned to the methods of the class that receives the message. By assigning HCFs to methods in this way for all hazard scenarios, the HCFs (methods) in each class are identified. Figure 6 shows an example of assigning HCFs to classes.
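The assignment rule, in which each message of a hazard scenario marks the invoked method of the receiving class as an HCF, can be sketched as follows. The `Message` record and the `assign_hcfs` function are hypothetical names, and the sketch assumes that every message in a scenario contributes to the hazard.

```python
from collections import defaultdict
from typing import NamedTuple


class Message(NamedTuple):
    """One message in a sequence diagram of a hazard scenario."""
    sender: str
    receiver: str
    method: str


def assign_hcfs(scenarios):
    """Collect, per receiving class, the methods that appear as HCFs."""
    hcfs = defaultdict(set)
    for messages in scenarios:
        for message in messages:
            # The HCF is assigned to the method of the class that receives it.
            hcfs[message.receiver].add(message.method)
    return dict(hcfs)
```

Given a scenario in which a start sensor sends existTrain and a stop sensor sends existNoTrain to the same control class, the result maps that control class to the set of both methods.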
The “conduction of the FMEA for each HCF” task conducts function-level FMEA on the HCFs assigned to the methods of each class and evaluates the negative impact on the ECSW when an HCF occurs. When the negative impact is large, the causes of the HCF are identified, and countermeasures that reduce the negative impact are planned and applied.
The function-level FMEA for the ECSW is explained below [9]. A failure mode of the ECSW is a case in which a method of the ECSW does not perform its intended function. Because the ECSW is software, it never fails to perform its functions because of aging (deviation of the function). Instead, the ECSW fails to perform its functions when a function is used incorrectly (deviation of the execution conditions) and/or when out-of-range data are input (deviation of the use conditions). These cases are regarded as the failure modes of the ECSW. Accordingly, the standard failure modes and standard safety countermeasures are decided by analyzing FMEA results for existing systems. Table 2 lists the standard failure modes and standard safety countermeasures.
The FMEA procedure for the ECSW is as follows. Each method that is an HCF in a class is examined to see whether each standard failure mode applies to it. If applicable standard failure modes exist, the standard safety countermeasures corresponding to them are selected and applied to the method. Finally, the severity, the incidence, and the discovery rate for the method are decided. When the resulting risk priority is acceptable, the selection and application of safety countermeasures are finished. The risk evaluation matrix shown in Figure 7 is used to decide the risk priority.
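The paper decides the risk priority with the risk evaluation matrix of Figure 7, which is not reproduced here. As a stand-in, the sketch below uses the conventional FMEA Risk Priority Number, the product of severity, occurrence (incidence), and detection (discovery rate) ratings. The 1 to 10 scales and the acceptance threshold of 100 are assumptions for illustration, not values from the paper.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Conventional FMEA RPN; each rating is assumed to lie in 1..10."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


def is_acceptable(rpn: int, threshold: int = 100) -> bool:
    # Assumption: the threshold is a hypothetical project-specific limit.
    return rpn <= threshold
```

Under these assumptions, a method rated 8, 5, and 4 has an RPN of 160 and needs a countermeasure; if the countermeasure lowers the occurrence rating to 2, the RPN falls to 64 and the selection of countermeasures can stop.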
A safety analysis of a railroad crossing control system is conducted to evaluate the proposed method. Subsection A outlines the application case, and subsection B describes the application results and the evaluation.
The safety analysis targets a railroad crossing control system. The system is the same as the one that the Information-technology Promotion Agency (IPA) uses as an analytical example for conducting STPA [15]. Because the IPA example does not describe the ECSW that opens and closes the railroad crossing and starts and stops the alarm device, the authors assume the configuration of the ECSW. Figure 8 shows the outline of the railroad crossing. The railroad crossing consists of the control apparatus, the railroad crossing & alarm device, and the sensors: two alarm start sensors (A and B) and one alarm stop sensor (C). The sensors cannot detect the direction of the train. The requirements for the railroad crossing control system are as follows.
• When the ECSW detects a train with alarm start sensor A or B, the ECSW starts the alarm after a certain period of time.
• When the ECSW detects a train with alarm stop sensor C, the ECSW stops the alarm after a certain period of time.
• When the train moves from A to C, alarm start sensor B is masked (so that it does not detect the train).
• When the train moves from B to C, alarm start sensor A is masked (so that it does not detect the train).
Figure 9 shows the outline of the railroad crossing control system that the authors assume. The use-case diagram shows that the train actor and the sensor actors take part in the “control railroad crossing” use case. The class diagram shows that the railroad crossing control system consists of the railroad crossing control class, the sensor class, and the railroad crossing & alarm device class. The sensor class has two subclasses: the alarm start sensor and the alarm stop sensor. The railroad crossing control class decides the CAs to send to the railroad crossing & alarm device based on the FBD from the sensors. The sensor classes send FBD when they detect a train. The railroad crossing & alarm device class controls the railroad crossing and the alarm device when it receives a CA.
Case 1: The train collides with a pedestrian or a car (accident 1: A1); the railroad crossing does not close when a train is on the railroad (hazard: H1).
The train from A passes alarm start sensor A, passes alarm stop sensor C, and stops. The train is then split into two parts, a front part and a rear part. The front part goes on to B, and the rear part returns to A.
Case 2: The accident and hazard are the same as in Case 1.
The train from A passes alarm start sensor A, but the execution of the closeBar&startAlarm method is delayed for some reason. The closeBar&startAlarm method starts executing only after the train has passed alarm stop sensor C.
Case 3: The accident and hazard are the same as in Case 1.
The first train from A passes alarm start sensor A and then alarm stop sensor C, after which alarm start sensors A and B are masked. The second train passes alarm start sensor A immediately after the first train passes the alarm stop sensor.
The results of the evaluation are described below.
First, the UML system specifications shown in Figure 9 are developed. STPA is then conducted with the UML system specifications as input. Tasks two to five below follow the procedure in Section 3.B.2).
Second, the accidents, hazards, and SCs are identified. In this case, the accident is “the train collides with a pedestrian or a car”, the hazard is “the railroad crossing does not close when a train is on the railroad”, and the SC is “the railroad crossing must close when a train is on the railroad (SC1)”.
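SC1 can be read as an implication over system states: whenever a train is on the railroad, the crossing must be closed. A minimal Python sketch of checking a recorded sequence of states against SC1 follows; the `Snapshot` record and its two Boolean fields are hypothetical simplifications introduced here.

```python
from typing import Iterable, NamedTuple, Optional


class Snapshot(NamedTuple):
    """Assumed two-variable abstraction of the crossing at one instant."""
    train_on_crossing: bool
    bar_closed: bool


def sc1_holds(s: Snapshot) -> bool:
    # SC1: train on the railroad implies the crossing is closed.
    return (not s.train_on_crossing) or s.bar_closed


def first_violation(trace: Iterable[Snapshot]) -> Optional[int]:
    """Return the index of the first snapshot violating SC1, or None."""
    for index, snapshot in enumerate(trace):
        if not sc1_holds(snapshot):
            return index
    return None
```

A snapshot with a train present and the bar open is exactly hazard H1, so the first index returned by `first_violation` marks the point at which a scenario enters the hazardous state.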
Third, the CSD is developed. The components of the railroad crossing control system are the railroad crossing control, the railroad crossing & alarm device, alarm start sensors A and B, alarm stop sensor C, and the train. All CAs, FBD, and input/output information between those components are described in the CSD. Figure 10 shows the CSD.
Fourth, the UCAs are derived. The keywords for identifying UCAs are applied to the CAs in the CSD of Figure 10, and the UCAs are identified. Table 3 shows the extracted UCAs. Hereinafter, the case “The train passes the railroad crossing while the warning is not sounding (the bar of the railroad crossing does not close). [UCA1], [SC1 violation]” is analyzed.
Fifth, each UCA is investigated to see whether it causes the hazard (whether it violates the SC). The 11 guide words for identifying HCFs are applied one by one to the UCAs in the CSD. Figure 11 shows the guide words for identifying HCFs assigned to the CSD. As a result, six guide words are found to be applicable to the railroad crossing control system. Those six guide words are applied to all UCAs to investigate whether the UCAs cause the hazards. Table 4 shows the result. UCA1 causes the hazard under guide word (2) “inappropriate, inefficient or missing control action” and guide word (4) “process input missing or wrong”. Concretely, for guide word (2), the following situations are considered: “the control action for the railroad crossing control when the train turns back after passing the railroad crossing is inappropriate and causes the hazard” or “a conflict between continuing to stop the alarm and the instruction to start the alarm causes the hazard”. For guide word (4), the following situation is considered: “because of a defect in alarm start sensor A, the message from alarm start sensor A to the control apparatus is lost, which causes the hazard”. The hazard scenarios corresponding to those cases are developed. Figure 12 shows the hazard scenario for the case “the control algorithm for the railroad crossing control when the train turns back after passing the railroad crossing is inappropriate and causes the hazard”.
Sixth, the sequence diagrams corresponding to the hazard scenario are developed, and the HCFs are assigned to the classes. This task is the same as the task described in Section 3.B.3). Here, the sequence diagram is developed for the case in which the train from A turns back toward A after passing alarm stop sensor C. Figure 13 shows the details of the hazard scenario, and Figure 14 shows the sequence diagram for it. In Figure 14, after the train passes alarm stop sensor C, alarm start sensors A and B are masked, and the rear part of the train turns back and passes alarm start sensor A. At this time, the bar of the railroad crossing is open and the alarm device has stopped sounding, so even if the railroad crossing control issues a new openBar&stopAlarm CA to the railroad crossing apparatus, the railroad crossing does not operate. Consequently, the train enters the railroad crossing while the bar is open, which is the hazard. Here, it is assumed that the railroad crossing & alarm device class, the alarm start sensor class, and the alarm stop sensor class simply pass CAs to the apparatuses through the input/output interface and that the hardware has no faults. As a result, the events of the hazard scenario are assigned only to the railroad crossing control class (the HCFs are assigned to the methods in the railroad crossing control class). According to the sequence diagram, the methods to which HCFs are assigned in this class are the existTrain method and the existNoTrain method.
Seventh, the FMEA is conducted with reference to the sequence diagrams. The existTrain method invokes the closeBar&startAlarm method of the railroad crossing & alarm device class. Even if this method is invoked, the bar of the railroad crossing remains closed and the alarm device merely keeps sounding. Because this is unlikely to lead to the hazard, no countermeasure is applied for this event. On the other hand, the existNoTrain method invokes the openBar&stopAlarm method of the railroad crossing & alarm device class. In general, the existTrain method and the existNoTrain method should be executed in pairs and invoked alternately. The standard failure mode “The startup conditions for functions are not prepared” in Table 2 therefore applies, and setting startup conditions and non-startup conditions are applicable as standard countermeasures. For example, a state transition diagram is added to the railroad crossing control class (Figure 15). If the existNoTrain message is received while the state is “waiting for the train to pass”, the following countermeasures are taken: issue an emergency message to the safety supervisor (a method that issues the alert is added to the railroad crossing control class), close the bar of the railroad crossing, and sound the alarm. Those countermeasures reduce the incidence of the events that cause the hazard.
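The countermeasure can be illustrated with a small state machine in Python. Figure 15 is not reproduced here, so the states, the transitions, the `train_left` event, and the `notifySupervisor` action name are assumptions; only the rule that an out-of-order existNoTrain triggers the emergency message, closes the bar, and sounds the alarm is taken from the text.

```python
from enum import Enum, auto


class State(Enum):
    WAITING_FOR_ARRIVAL = auto()  # no train announced
    ALARMING = auto()             # existTrain handled, bar closed
    WAITING_FOR_PASS = auto()     # existNoTrain handled, train leaving
    EMERGENCY = auto()            # out-of-order message detected


class RailroadCrossingControl:
    def __init__(self):
        self.state = State.WAITING_FOR_ARRIVAL
        self.actions = []  # CAs issued so far, for inspection

    def exist_train(self):
        # Closing the bar is always safe, so it is issued in every state.
        self.actions.append("closeBar&startAlarm")
        if self.state is not State.EMERGENCY:
            self.state = State.ALARMING

    def exist_no_train(self):
        if self.state is State.ALARMING:
            self.actions.append("openBar&stopAlarm")
            self.state = State.WAITING_FOR_PASS
            return
        # Startup condition not met: fail safe instead of opening the bar.
        self.actions.append("notifySupervisor")  # hypothetical method name
        self.actions.append("closeBar&startAlarm")
        self.state = State.EMERGENCY

    def train_left(self):
        # Hypothetical event: the train has left the section.
        if self.state is State.WAITING_FOR_PASS:
            self.state = State.WAITING_FOR_ARRIVAL
```

In this sketch, a second existNoTrain arriving while the control is waiting for the train to pass, as when the rear part of a split train passes the stop sensor again, moves the control to the emergency state with the bar closed. The emergency state is deliberately left without an exit, on the assumption that a supervisor resets the system manually.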
The UML system specifications, the accidents, the hazards, the SCs, and the CSD are the same as in Case 1. Case 2 corresponds to “The train arrives at the railroad crossing before the warning sounds (closing the bar is too late). [UCA2], [SC1 violation]” in Table 3. It is investigated whether UCA2 causes the hazard (whether UCA2 violates SC1).
Applying the 11 guide words for identifying HCFs to the CAs in the CSD shows that the hazard occurs under guide word (3) “delayed operation”. Figure 16 shows the details of the hazard scenario, and Figure 17 shows the sequence diagram.
In Figure 17, after the train passes alarm start sensor A, the closeBar&startAlarm method is invoked. Because the invocation of the closeBar&startAlarm method is delayed for some reason, the train enters the railroad crossing while the bar is open and the alarm is stopped. The hazard occurs when the existTrain method in the railroad crossing control class does not invoke the openBar&stopAlarm method in the railroad crossing & alarm device class. FMEA is conducted for this case. When the existTrain method invokes the closeBar&startAlarm method, the invocation must take top priority. The standard failure mode “The startup conditions for functions are not prepared” in Table 2 applies, so setting startup conditions and non-startup conditions are applicable as standard countermeasures. For example, the closeBar&startAlarm method is invoked at the beginning of the existTrain method, or no other methods are invoked while the existTrain method is running. Those countermeasures reduce the incidence of the events that cause the hazard.
The UML system specifications, the accidents, the hazards, the SCs, and the CSD are the same as in Case 1. Case 3 corresponds to “When the train has not arrived, the startMask instruction is invoked and the warning does not sound. [UCA4], [SC1 violation]” in Table 3. It is investigated whether UCA4 causes the hazard (whether UCA4 violates SC1). Applying the 11 guide words for identifying HCFs to the CAs in the CSD shows that the hazard occurs under guide word (2) “inappropriate, inefficient or missing control action”. Figure 18 shows the details of the hazard scenario, and Figure 19 shows the sequence diagram.
In Figure 19, after the first train passes alarm stop sensor C, alarm start sensors A and B are masked. When the first train passes alarm start sensor B, the mask on alarm start sensor B is released. After that, although the second train enters and passes alarm start sensor A, the closeBar&startAlarm method is not invoked because alarm start sensor A is still masked. As a result, the second train enters the railroad crossing while the bar is open and the alarm is not sounding, which is the hazard.
According to the sequence diagram, when the first train passes alarm stop sensor C, alarm start sensors A and B are masked. Because alarm start sensor A is still masked afterward, the closeBar&startAlarm method is not invoked when the second train passes alarm start sensor A. This case corresponds to a failure of the execution conditions. Therefore, setting startup conditions and non-startup conditions are applicable as standard countermeasures. For example, alarm start sensors A and B are masked and released at the same time, and sensors that are not involved are not masked. This countermeasure reduces the incidence of the events that cause the hazard.
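The Case 3 scenario and the first of the two countermeasures can be replayed with a small Python simulation. It is one possible reading of the scenario rather than the authors' design: the `MaskingControl` class, the `release_together` switch, and the rule that a detection at a masked sensor only releases masks are assumptions. The alternative countermeasure of not masking uninvolved sensors is not modeled.

```python
class MaskingControl:
    """Toy crossing control with masked alarm start sensors A and B."""

    def __init__(self, release_together: bool):
        # True models the countermeasure: A and B are released together.
        self.release_together = release_together
        self.masked = set()
        self.bar_closed = False

    def on_start_sensor(self, sensor: str) -> None:
        if sensor in self.masked:
            # Assumption: a masked sensor is attributed to the departing
            # train, so it releases masks and does not start the alarm.
            if self.release_together:
                self.masked.clear()
            else:
                self.masked.discard(sensor)
            return
        self.bar_closed = True  # closeBar&startAlarm

    def on_stop_sensor(self) -> None:
        self.bar_closed = False  # openBar&stopAlarm
        self.masked = {"A", "B"}


def second_train_protected(release_together: bool) -> bool:
    control = MaskingControl(release_together)
    control.on_start_sensor("A")  # first train arrives from A
    control.on_stop_sensor()      # first train passes C; A and B masked
    control.on_start_sensor("B")  # first train leaves past B
    control.on_start_sensor("A")  # second train arrives from A
    return control.bar_closed
```

With separate release, sensor A is still masked when the second train arrives, so the bar stays open and SC1 is violated; with simultaneous release, the same event sequence closes the bar.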
Adequate countermeasures were found to be applicable to the other hazard scenarios in the same way. By applying the proposed method, the hazards of the railroad crossing control system can be identified, and appropriate countermeasures to prevent them can be found. Consequently, the risk that the hazards occur is reduced, and the safety of the target system is improved. On the other hand, because there are many hazard scenarios, an efficient method for investigating countermeasures is required. In addition, because the proposed method decides countermeasures separately for each hazard scenario, the countermeasures may conflict with one another. A method for checking conflicts between countermeasures is therefore required.
In these case studies, we modified the design of the ECSW to prevent the hazard. The same problem could also be solved by establishing a standard operating procedure (rule) that prohibits splitting and/or turning back a train between sensors A and B. In practice, safety, cost, and development time must all be considered when deciding countermeasures, and adequate countermeasures should be selected from options such as modifying the standard operating procedures, modifying the hardware design, or modifying the software design. That is, safer mechanisms need to be developed efficiently.
This paper proposes a safety analysis method that combines UML, STPA, and FMEA. The proposed method analyzes the causes of hazards that arise from interactions between system components and proposes countermeasures that prevent the hazards. Applying the proposed method makes it possible to develop a safer system. On the other hand, the proposed method requires a long time for analyzing hazards and planning countermeasures. In particular, when analyzing a complex system that includes many hardware and software elements and has many hazards and hazard scenarios, the countermeasures decided with the proposed method may conflict with one another. In the future, we will propose a method that describes the SCs as logical expressions and analyzes them automatically by logical calculation, with the aim of a mechanism for adequate and efficient safety analysis. In addition, we will apply the proposed method to larger systems, identify its weak points, devise countermeasures, and feed them back into the method to improve it.
This research was supported by Grant-in-Aid for Scientific Research (C) of the Japan Society for the Promotion of Science “Integrated Analysis Method for hazard caused by software interaction cooperating with multiple safety analysis methods.”