Kahneman is a Nobel Prize winner in economics. His focus is on personal decision making, especially the biases and heuristics used by the unconscious mind as it forms intuitive opinions. Biases lead to regular (systematic) errors in decision making. Kahneman and Amos Tversky developed prospect theory, a model of choice that helps explain why real people make decisions different from those of the rational man of economics.
Kahneman is a psychologist so his work focuses on the individual; many of his observations are not immediately linkable to safety culture (a group characteristic). But even in a nominal group setting, individuals are often very important. Think about the lawyers, inspectors, consultants and corporate types who show up after a plant incident. What kind of biases do they bring to the table when they are evaluating your organization's performance leading up to the incident?
The book* has five parts, described below. Kahneman reports on his own research and then adds the work of many other scholars. Many of the experiments appear quite simple but provide insights into unconscious and conscious decision making. There is a lot of content so this is a high-level summary, punctuated by explanatory or simply humorous quotes.
Part 1 describes two methods we use to make decisions: System 1 and System 2. System 1 is impulsive, intuitive, fast and often unconscious; System 2 is more analytic, cautious, slow and controlled. (p. 48) We often defer to System 1 because of its ease of use; we simply don't have the time, energy or desire to pore over every decision facing us. Lack of desire is another term for lazy.
System 1 often operates below consciousness, utilizing associative memory to link a current stimulus to ideas or concepts stored in memory. (p. 51) System 1's impressions become beliefs when accepted by System 2 and a mental model of the world takes shape. System 1 forms impressions of familiarity and rapid, precise intuitions then passes them on to System 2 to accept/reject. (pp. 58-62)
System 2 activities take effort and require attention, which is a finite resource. If we exceed the attention budget or become distracted then System 2 will fail to obtain correct answers. System 2 is also responsible for self-control of thoughts and behaviors, another drain on mental resources. (pp. 41-42)
Biases include a readiness to infer causality, even where none exists; a willingness to believe and confirm in the absence of solid evidence; succumbing to the halo effect, where we project a coherent whole based on an initial impression; and problems caused by WYSIATI,** including basing conclusions on limited evidence, overconfidence, framing effects (where decisions differ depending on how information and questions are presented) and base-rate neglect (where we ignore widely known data about a decision situation). (pp. 76-88)
Heuristics include substituting easier questions for the more difficult ones that have been asked, letting current mood affect answers on general happiness and allowing emotions to trump facts. (pp. 97-103)
Part 2 explores decision heuristics in greater detail, with research and examples of how we think associatively, metaphorically and causally. A major topic throughout this section is the errors people tend to make when handling questions that have a statistical dimension. Such errors occur because statistics requires us to think of many things at once, which System 1 is not designed to do, and a lazy or busy System 2, which could handle this analysis, is prone to accept System 1's proposed answer. Other errors occur because:
We make incorrect inferences from small samples and are prone to ascribe causality to chance events. “We are far too willing to reject the belief that much of what we see in life is random.” (p. 117) We are prone to attach “a causal interpretation to the inevitable fluctuations of a random process.” (p. 176) “There is more luck in the outcomes of small samples.” (p. 194) (A brief simulation of this point appears below, following these examples.)
We fall for the anchoring effect, where we see a particular value for an unknown quantity (e.g., the asking price for a used car) before we develop our own value. Even random anchors, which provide no relevant information, can influence decision making.
People search for relevant information when asked questions. Information availability and ease of retrieval is a System 1 heuristic but only System 2 can judge the quality and relevance of retrieved content. People are more strongly affected by ease of retrieval and go with their intuition when they are, for example, mentally busy or in a good mood. (p. 135) However, “intuitive predictions tend to be overconfident and overly extreme.” (p. 192)
Unless we know the subject matter well, and have some statistical training, we have difficulty dealing with situations that require statistical reasoning. One research finding “illustrates a basic limitation in the ability of our mind to deal with small risks: we either ignore them altogether or give them far too much weight—nothing in between.” (p. 143) “There is one thing you can do when you have doubts about the quality of the evidence: let your judgments of probability stay close to the base rate.” (p. 153) “. . . whenever the correlation between two scores is imperfect, there will be regression to the mean. . . . [a process that] has an explanation but does not have a cause.” (pp. 181-82)
Finally, and the PC folks may not appreciate this, but “neglecting valid stereotypes inevitably results in suboptimal judgments.” (p. 169)
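The point about small samples lends itself to a quick Monte Carlo sketch. This is our illustration, not an experiment from the book; the 65% threshold and the sample sizes are arbitrary, but the pattern, more extreme outcomes in smaller samples, is the general result Kahneman describes.

```python
import random

random.seed(1)

def extreme_share(sample_size, trials=10_000, threshold=0.65):
    """Fraction of samples from a fair coin whose observed share of heads
    reaches the threshold."""
    count = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(sample_size))
        if heads / sample_size >= threshold:
            count += 1
    return count / trials

# The same underlying process looks far "streakier" in small samples, which
# is why small samples dominate both the best and worst ends of rankings.
for n in (10, 50, 250):
    print(f"n = {n:3d}: share of samples with >= 65% heads = {extreme_share(n):.3f}")
```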
Part 3 focuses on specific shortcomings of our thought processes: overconfidence in what we think we know, fed by the illusory certainty of hindsight, and underappreciation of the role of chance in events.
“Subjective confidence in a judgment is not a reasoned evaluation of the probability that this judgment is correct. Confidence is a feeling.” (p. 212) Hindsight bias “leads observers to assess the quality of a decision not by whether the process was sound but by whether its outcome was good or bad. . . . a clear outcome bias.” (p. 203) “. . . the optimistic bias may well be the most significant of the cognitive biases.” (p. 255) “The optimistic style involves taking credit for success but little blame for failure.” (p. 263)
“The sense-making machinery of System 1 makes us see the world as more tidy, predictable, and coherent than it really is.” (p. 204) “. . . reality emerges from the interactions of many different agents and forces, including blind luck, often producing large and unpredictable results.” (p. 220) “An unbiased appreciation of uncertainty is a cornerstone of rationality—but it is not what people and organizations want. . . . Acting on pretended knowledge is often the preferred solution.” (p. 263)
And the best quote in the book: “Professional controversies bring out the worst in academics.” (p. 234)
Part 4 contrasts the rational people of economics with the more complex people of psychology, in other words, the Econs vs. the Humans. Kahneman shows how prospect theory opened a door between the two disciplines and contributed to the start of the field of behavioral economics.
Economists adopted expected utility theory to prescribe how decisions should be made and to describe how Econs make choices. In contrast, prospect theory has three cognitive features: evaluation of choices relative to a reference point (outcomes above that point are gains, outcomes below it are losses); diminishing sensitivity to changes; and loss aversion, where losses loom larger than gains. (p. 282) In practice, loss aversion leads to risk-averse choices when both gains and losses are possible, and diminishing sensitivity leads to risk taking when a sure loss is compared to a possible larger loss. “Decision makers tend to prefer the sure thing over the gamble (they are risk averse) when the outcomes are good. They tend to reject the sure thing and accept the gamble (they are risk seeking) when both outcomes are negative.” (p. 368)
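These three features can be made concrete with the standard Kahneman-Tversky value function. A caution: the functional form and the parameter values below come from Tversky and Kahneman's 1992 cumulative prospect theory paper, not from this book, so treat the sketch as illustrative.

```python
def prospect_value(outcome, reference=0.0, alpha=0.88, lam=2.25):
    """Subjective value of an outcome relative to a reference point.

    Gains are valued as x**alpha (diminishing sensitivity); losses as
    -lam * (-x)**alpha, so a loss looms larger than an equal-sized gain.
    Parameter values are the commonly cited Tversky-Kahneman (1992)
    estimates, used here purely for illustration.
    """
    x = outcome - reference
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** alpha

# Loss aversion in action: a $100 loss hurts roughly 2.25 times as much as
# a $100 gain pleases (about -129.5 vs. +57.5 in these units).
print(prospect_value(100), prospect_value(-100))
```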
“The fundamental ideas of prospect theory are that reference points exist, and that losses loom larger than corresponding gains.” (p. 297) “A reference point is sometimes the status quo, but it can also be a goal in the future; not achieving the goal is a loss, exceeding the goal is a gain.” (p. 303) “Loss aversion is a powerful conservative force.” (p. 305)
When people do consider very rare events, e.g., a nuclear accident, they will almost certainly overweight the probability in their decision making. “. . . people are almost completely insensitive to variations of risk among small probabilities.” (p. 316) “. . . low-probability events are much more heavily weighted when described in terms of relative frequencies (how many) than when stated in more abstract terms of . . . 'probability' (how likely).” (p. 329) Framing of questions evokes emotions, e.g., “losses evoke stronger negative feelings than costs.” (p. 364) But “[r]eframing is effortful and System 2 is normally lazy.” (p. 367) As an exercise, think about how anti-nuclear activists and NEI would frame the same question about the probability and consequences of a major nuclear accident.
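The over- and under-weighting of probabilities is usually modeled with a probability weighting function. The one-parameter form below is the Tversky-Kahneman (1992) specification with its commonly cited gamma, again an assumption for illustration rather than anything taken from this book's pages; it shows both the overweighting of rare events and the compression of differences among very small probabilities.

```python
def decision_weight(p, gamma=0.61):
    """Probability weighting function from Tversky-Kahneman (1992).

    Small probabilities receive decision weights larger than their true
    values, and differences among tiny probabilities are compressed, so
    variations of risk among small probabilities barely register.
    """
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

for p in (0.0001, 0.001, 0.01, 0.1, 0.5):
    print(f"p = {p:6.4f}  ->  decision weight ~ {decision_weight(p):.4f}")
```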
There are some things an organization can do to improve its decision making. It can use local centers of overoptimism (the Sales dept.) and loss aversion (the Finance dept.) to offset each other. In addition, an organization's decision making practices can require the use of an outside view (i.e., a look at the probabilities of similar events in the larger world) and a formal risk policy to mitigate known decision biases. (p. 340)
Part 5 covers two different selves that exist in every human, the experiencing self and the remembering self. The former lives through an experience and the latter creates a memory of it (for possible later recovery) using specific heuristics. Our tendency to remember events as a sample or summary of actual experience is a factor that biases current and future decisions. We end up favoring (fearing) a short period of intense joy (pain) over a long period of moderate happiness (pain). (p. 409)
Our memory has evolved to represent past events in terms of peak pain/pleasure during the events and our feelings when the event is over. Event duration does not impact our ultimate memory of an event. For example, we choose future vacations based on our final evaluations of past vacations even if many of our experiences during the past vacations were poor. (p. 389)
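A toy calculation makes the remembering self's arithmetic concrete. This is our simplification of the peak-end idea (remembered value approximated by the average of the most intense moment and the final moment, with duration ignored), using made-up pain scores.

```python
def remembered_value(moment_ratings):
    """Peak-end approximation of how an episode is remembered.

    moment_ratings: per-moment pleasure/pain scores (negative = pain).
    Memory averages the most intense moment and the last moment; total
    duration never enters the calculation (duration neglect).
    """
    peak = max(moment_ratings, key=abs)
    end = moment_ratings[-1]
    return (peak + end) / 2

short_intense = [-8, -9]              # brief but very painful, ends badly
long_moderate = [-4] * 10 + [-2]      # much longer, milder, ends less badly

print(remembered_value(short_intense))   # -9.0
print(remembered_value(long_moderate))   # -3.0, remembered as less bad despite more total pain
```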
In a possibly more significant area, the life satisfaction score you assign to yourself is based on a small sample of highly available ideas or memories. (p. 400) Ponder that the next time you take or review responses from a safety culture survey.
Our Perspective
This is an important book. Although not explicitly stated, the great explanatory themes of cause (mechanical), choice (intentional) and chance (statistical) run through it. It is filled with nuggets that apply to the individual (psychological) and also the aggregate if the group shares similar beliefs. Many System 1 characteristics, if unchecked and shared by a group, have cultural implications.***
We have discussed Kahneman's work before on this blog, e.g., his view that an organization is a factory for producing decisions and his suggestion to use a “premortem” as a partial antidote for overconfidence. (A premortem is an exercise the group undertakes before committing to an important decision: Imagine being a year into the future, the decision's outcome is a disaster. What happened?) For more on these points, see our Nov. 4, 2011 post.
We have also discussed some of the topics he raises, e.g., the hindsight bias. Hindsight is 20/20 and it supposedly shows what decision makers could (and should) have known and done instead of their actual decisions that led to an unfavorable outcome, incident, accident or worse. We now know that when the past was the present, things may not have been so clear-cut.
Kahneman's observation that the ability to control attention predicts on-the-job performance (p. 37) is certainly consistent with our reports on the characteristics of high reliability organizations (HROs).
“The premise of this book is that it is easier to recognize other people's mistakes than our own.” (p. 28) Having observers at important, stressful decision making meetings is useful; they are less cognitively involved than the main actors and more likely to see any problems in the answers being proposed.
Critics' major knock on Kahneman's research is that it doesn't reflect real world conditions. His model is “overly concerned with failures and driven by artificial experiments [rather] than by the study of real people doing things that matter.” (p. 235) He takes this on by collaborating with a critic in an investigation of intuitive decision making, specifically seeking to answer: “When can you trust a self-confident professional who claims to have an intuition?” (p. 239) The answer: when the expert acquired skill in a predictable environment and had sufficient practice with immediate, high-quality feedback. For example, anesthesiologists are in a good position to develop predictive expertise; psychotherapists, on the other hand, are not, primarily because a lot of time and many external events can pass between their prognosis for a patient and the ultimate results. However, “System 1 takes over in emergencies . . .” (p. 35) Because people tend to do what they've been trained to do in emergencies, training that leads to correct responses is vital.
Another problem is that most of Kahneman's research uses university students, both undergraduate and graduate, as subjects. It's fair to say professionals have more training and life experience, and have probably made some hasty decisions they later regretted and (maybe) learned from. On the other hand, we often see people who make sub-optimal, or just plain bad decisions even though they should know better.
There are lessons here for managers and other would-be culture shapers. System 1's search for answers is mostly constrained to information consistent with existing beliefs (p. 103) which is an entry point for culture. We have seen how group members can have their internal biases influenced by the dominant culture. But to the extent System 1 dominates employees' decision making, decision quality may suffer.
Not all appeals can be made to the rational man in System 2. A customary, if tacit, assumption of managers is that they and their employees are rational and always operating consciously, so new experiences will lead to the expected new values and beliefs, new decisions and improved safety culture. It may not be that straightforward: System 1 may intervene, so managers should be alert to evidence of System 1 type thinking and adjust their interventions accordingly. Kahneman suggests encouraging “a culture in which people look out for one another as they approach minefields.” (p. 418)
We should note Systems 1 and 2 are constructs and “do not really exist in the brain or anywhere else.” (p. 415) System 1 is not Dr. Morbius' Id monster.**** System 1 can be trained to behave differently, but it is always ready to provide convenient answers for a lazy System 2.
The book is long, with small print, but the chapters are short so it's easy to invest 15-20 min. at a time. One has to be on constant alert for useful nuggets that can pop up anywhere—which I guess promotes reader mindfulness. It is better than Blink, which simply overwhelmed this reader with a cloudburst of data showing the informational value of thin slices and unintentionally over-promoted the value of intuition. (see pp. 235-36) And it is much deeper than The Power of Habit, which we reviewed last February.
(Common sense is nothing more than a deposit of prejudices laid down by the mind before you reach eighteen. Attributed to Albert Einstein)
* D. Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2011).
** WYSIATI – What You See Is All There Is. Information that is not retrieved from memory, or otherwise ignored, may as well not exist. (pp. 85-88) WYSIATI means we base decisions on the limited information that we are able or willing to retrieve before a decision is due.
*** A few of these characteristics are mentioned in this report, e.g., impressions morphing into beliefs, a bias to believe and confirm, and WYSIATI errors. Others include links of cognitive ease to illusions of truth and reduced vigilance (complacency), and narrow framing where decision problems are isolated from one another. (p. 105)
**** Dr. Edward Morbius is a character in the 1956 sci-fi movie Forbidden Planet.
Monday, November 11, 2013
Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson
In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory. The book presents the criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics. It should be noted the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.
Part I
Part I sets the stage for a new safety paradigm. Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6) Traditional accident analysis techniques are no longer sufficient. They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10) Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).** We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.
Event-based models are simply inadequate. There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it. The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24) Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19) In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)
Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors. “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination. When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent. This dependence may be related to common systemic factors that do not appear in an event chain. Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35) “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36)
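Leveson's point about multiplying small probabilities can be shown with a few lines of arithmetic. The numbers below are invented for illustration, not taken from the book: when two nominally independent failures share a common systemic factor, the true joint probability can be orders of magnitude higher than the naive product.

```python
# Illustrative numbers only.
p_a = 1e-3          # failure probability of component A
p_b = 1e-3          # failure probability of component B

# Naive assumption: A and B fail independently.
p_joint_independent = p_a * p_b

# Suppose a shared systemic factor (say, a maintenance or management flaw)
# is present 1% of the time, and when present each component fails with
# probability 0.1. The common cause dominates the joint probability.
p_common = 0.01
p_fail_given_common = 0.1
p_joint_dependent = (p_common * p_fail_given_common ** 2
                     + (1 - p_common) * p_a * p_b)

print(f"independent assumption: {p_joint_independent:.1e}")   # 1.0e-06
print(f"with common cause:      {p_joint_dependent:.1e}")     # ~1.0e-04, about 100x larger
```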
The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38) In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46)
Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51) Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52) This is reminiscent of Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.
When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56) In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64) Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60)
Part II
Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes). For our purposes we don't need to spend much space on this material. “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”*** It attempts to achieve the goals listed at the end of Part I.
STAMP treats safety in a system as a control problem, not a reliability one. Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76) Controls may be physical or social, including cultural. There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87) “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)
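The migration-toward-higher-risk dynamic can be caricatured in a few lines of code. This is our toy illustration of the kind of feedback loop Leveson describes, not her STAMP formalism; the erosion and recovery rates are arbitrary.

```python
def simulate_margin(periods=20, margin=1.0, pressure=0.08,
                    audit_every=5, audit_recovery=0.25):
    """Toy feedback loop: production pressure erodes the safety margin a
    little each period; a periodic control action (audit, constraint check)
    restores part of it. Not Leveson's model, just an illustration of slow
    migration toward higher risk when controls lag."""
    history = []
    for t in range(1, periods + 1):
        margin -= pressure * margin                          # pressure erodes margin
        if t % audit_every == 0:
            margin = min(1.0, margin + audit_recovery)       # control pushes back
        history.append(round(margin, 3))
    return history

print(simulate_margin())
```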
Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system. There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.
Part III
Part III describes in detail how STAMP can be applied. There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section. Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of STAMP over traditional accident analysis techniques.
We will focus on chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.
Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421) Two new components need a bit of explanation: a safety control structure and controls on system migration toward higher risk. The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40) Among the control structure's responsibilities, “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26) Such an approach should be based on the organization's comprehensive hazards analysis.****
The safety culture discussion is also familiar. (pp. 426-33) Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust. In such a culture, Leveson says an early warning system for migration toward states of high risk can be established. A section on Just Culture is taken directly from Dekker's work. The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.
Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.
Our Perspective
Overall, we like this book. It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial. The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog. The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.
However, there are only a few direct references to nuclear. The examples in the book are mostly from aerospace, aviation, maritime activities and the military. Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory), shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs. Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428) She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7) In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective. TMI is used to illustrate specific system design/operation problems.
We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists. First, she accuses HRO of confusing reliability with safety; in other words, an unsafe system can function very reliably. (pp. 7, 12) But I'm not aware of any HRO work that has been done in an organization that is patently unsafe. HRO asserts that reliability follows from practices that recognize and contain emerging problems. She takes another swipe at HRO when she says HRO suggests that, during crises, decision making migrates to frontline workers. Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44) Her position may be correct in some cases but, as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could. Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.” In her view, “Telling managers and operators to be 'mindful of weak signals' simply creates a pretext for blame after a loss event occurs.” (p. 410) I don't think it's pretext, but it is challenging to maintain mindfulness and sense faint signals. Overall, this appears to be academic posturing and feather fluffing.
We offer no opinion on the efficacy of using Leveson's STAMP approach. She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)
In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii) You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.
* N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (Cambridge, MA: The MIT Press, 2011). The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed. All quotes in this post were retyped from the original text.
** We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier. These authors are all aware of others' publications and contributions. Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text.
*** Nancy Leveson informal bio page.
**** “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157) The hazards analysis identifies all major hazards the system may confront. Baseline safety requirements follow from the hazards analysis. Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk. The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)
Tuesday, October 29, 2013
NRC Outreach on the Safety Culture Policy Statement
(Image: An NRC public meeting)
One NRC presentation covered the SC Common Language Initiative.** The presenter remarked that an additional SC trait, Decision Making (DM), was added during the development of the common language. In our Feb. 28, 2013 review of the final common language document, we praised the treatment of DM; it is a principal creator of artifacts that reflect an organization's SC.
The INPO presentation noted that “After the common language effort was completed in January, 2013, INPO published Revision 1 of 12-012, which includes all of the examples developed during the common language workshop.” (p. 6) We reviewed that INPO document here.
But here's the item that got our attention. During a presentation on NRC outreach, an industry participant cautioned the NRC to not put policy statements into regulatory documents because policy statements are an expectation, not a regulation. The senior NRC person at the meeting agreed with the comment “and the importance of the NRC not overstepping the Commission’s direction that implementing the SCPS is not a regulatory requirement, but rather the Commission’s expectations.” (p. 4)
We find the last comment disingenuous. We have previously posted on how the NRC has created de facto regulation of SC.*** In the absence of clear de jure regulation, licensees and the NRC end up playing “bring me another rock” until the NRC accepts a licensee's pronouncements, as verified by NRC inspectors. For an example of this convoluted kabuki, read Bob Cudlin's Jan. 30, 2013 post on how Palisades' efforts to address a plant incident finally gained NRC acceptance, or at least an NRC opinion that Palisades' SC was “adequate and improving.”
We'll keep you posted on SCPS-related activities.
* D.J. Sieracki to R.P. Zimmerman, “Summary of the August 7, 2013, Public Meeting between the U.S. Nuclear Regulatory Commission Staff and Stakeholders to Exchange Information and Discuss Ongoing Education and Outreach Associated with the Safety Culture Policy Statement” (Oct. 1, 2013). ADAMS ML13267A385. We continue to find it ironic that the SCPS is administered by the NRC's Office of Enforcement. Isn't OE's primary focus on people and companies who violate the NRC's regulations?
** “The common language initiative uses the traits from the SCPS as a basic foundation, and contains definitions and examples to describe each trait more fully.” (p. 3)
*** For related posts, please click the "Regulation of Safety Culture" label.
Friday, October 18, 2013
When Apples Decay
In our experience education is perceived as a continual process, accumulating knowledge progressively over time. A shiny apple exemplifies the learning student or an inspiring insight (see Newton, Sir Isaac). Less consideration is given to the fact that the educational process can work in reverse leading to a loss of capability over time. In other words, the apple decays. As Martin Weller notes on his blog The Ed Techie, “education is about selling apples...we need to recognise and facilitate learning that takes ten minutes or involves extended participation in a community over a number of years.”*
This leads us to a recent Wall Street Journal piece, “Americans Need a Simple Retirement System”.** The article is about the failure of educational efforts to improve financial literacy. We admit that this is a bit out of context for nuclear safety culture; nevertheless it provides a useful perspective that seems to be overlooked within the nuclear industry. The article notes:
“The problem is that, like all educational efforts, financial education decays over time and has negligible effects on behavior after 20 months. The authors suggest that, given this decay, “just in time” financial education...might be a more effective way to proceed.”
We tend to view the safety culture training provided at nuclear plants as being of the 10-minute variety, selling apples that may vary in size and color but are just apples. Additional training is routinely prescribed in response to findings of inadequate safety culture. Yet we cannot recall a single reported instance where safety culture issues were associated with inadequate or ineffective training in the first place. Nor do we see explicit recognition that such training efforts have very limited half-lives, creating cycles of future problems. We have blogged about the decay of training-based reinforcement (see our March 22, 2010 post) and the contribution of decay and training saturation to complacency (see our Dec. 8, 2011 post).
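The half-life framing can be made explicit with a simple exponential-decay sketch. The 20-month figure comes from the article quoted above; the 10-month half-life below is an arbitrary assumption for illustration.

```python
def retained(months, half_life_months=10.0):
    """Fraction of a training effect retained after a given number of
    months, assuming simple exponential decay. The half-life is an assumed
    value for illustration, not a measured one."""
    return 0.5 ** (months / half_life_months)

# With annual refresher training and a 10-month half-life, less than half
# of the effect remains by the time the next session arrives.
for month in (0, 6, 12, 20):
    print(f"month {month:2d}: retained ~ {retained(month):.2f}")
```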
The fact that safety culture knowledge and “strength” decays over time is just one example of the dynamics associated with safety management. Arguably one could assert that an effective learning process itself is a (the?) key to building and maintaining strong safety culture. And further that it is consistently missing in current nuclear industry programs that emphasize indoctrination in traits and values. It’s time for better and more innovative approaches - not just more apples.
* M. Weller, "The long-awaited 'education as fruit' metaphor," The Ed Techie blog (Sept. 10, 2009). Retrieved Oct. 18, 2013.
** A.H. Munnell, "Americans need a simple retirement system," MarketWatch blog (Oct. 16, 2013). Retrieved Oct. 18, 2013.
Monday, October 14, 2013
High Reliability Management by Roe and Schulman
This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market. CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world. In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.
The High Reliability Management (HRM) Model
The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3) System Volatility refers to the magnitude and rate of change of CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall). Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs.
System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each. All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid. Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes. Operating conditions are challenging: supply and demand are inherently unstable (p. 34), inadequate supply means some load cannot be served and too much generation can damage the grid. (pp. 27, 142)
High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance. Some strategies can be substituted for others. It is a dynamic but manageable environment.
High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance. They run from pillar to post; it is highly stressful. Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trial and error. Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills. It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)
Low Volatility and Low Options mean generation and demand are not changing quickly. The critical feature here is demand has been reduced by load shedding. The operators have exhausted all other strategies for maintaining balance. It is a command-and-control approach, effected by declaring a Stage 3 grid situation and run using formal rules and procedures. It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished.
Low Volatility and High Options is an HRM's preferred mode. Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and there is no congestion on transmission lines or backup routes are available. Procedures based on analyzed conditions exist and are used. There are few, if any, surprises. Learning can occur but it is incremental, the result of new methods or analysis. Performance is important and system behavior operates within a narrow bandwidth. Loss of attention (complacency) is a risk. Is this starting to sound familiar? This is the domain of High Reliability Organization (HRO) theory and practice. Nuclear power operations is an example of an HRO. (pp. 60-62)
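For reference, the four modes can be collapsed into a small lookup table (our paraphrase of the authors' 2x2 matrix, not their wording):

```python
# (system_volatility, network_options_variety) -> operating mode, paraphrased
# from Roe and Schulman's 2x2 High Reliability Management matrix (ch. 3).
HRM_MODES = {
    ("high", "high"): "Dynamic but manageable: multiple substitutable strategies",
    ("high", "low"):  "Pillar-to-post firefighting: ad hoc, unstable, untenable",
    ("low",  "low"):  "Command-and-control: load shedding, formal rules (Stage 3)",
    ("low",  "high"): "Preferred HRO-like mode: procedures, few surprises, complacency risk",
}

def operating_mode(volatility, options):
    """Return the paraphrased mode description for a volatility/options pair."""
    return HRM_MODES[(volatility.lower(), options.lower())]

print(operating_mode("Low", "High"))
```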
Lessons for Nuclear Operations
Nuclear plants work hard to stay in the Low Volatility/High Options mode. If they stray into the Low Options column, they run the risk of facing unanalyzed situations and regulatory non-compliance. (p. 62) In their effort to optimize performance in the desired mode, plants examine their performance risks at ever finer granularity through new methods and analyses. Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.** Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must stay alert to performance drift and nascent problems.
In an HRO, trial and error is not an acceptable method for trying out new options. No one wants cowboy operators in the control room. But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233) In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)
The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear. The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system. In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance. In a similar vein, CAISO has to worry about complacency in just one mode, for nuclear it's effectively the only mode and complacency is an omnipresent threat. (p. 126)
The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is that overload might occur suddenly and with little or no warning.*** Anticipation and resilience are more obvious needs at CAISO but also necessary in nuclear operations. (pp. 5, 124)
Implications for Safety Culture
Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2) Both organizational types need a focus on operations as the central activity. Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.
The willingness to report errors should be encouraged, but we have seen that this is a thin spot in the SC at some plants. Errors can be a gateway into learning how to create more reliable performance, and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220)
The simultaneous need to operate within a prescribed envelope while considering how the envelope might be breached has implications for SC. We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to groupthink and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?”
Conclusion
This review gives short shrift to the authors' broad and deep description and analysis of CAISO.**** The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.
The book describes a rare breed, the socio-technical-political start-up, and covers too much ground for the average nuclear practitioner to plow through in search of nuggets that can be applied to nuclear management. But it's a good read and full of insightful observations, e.g., the description of CAISO's early days (ca. 2001-2004), when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve: operators reported less confidence in their ability to manage the grid and accomplish the mission in 2004 than in 2001. (Ch. 5)
(High Reliability Management was recommended by a Safetymatters reader. If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)
* E. Roe and P. Schulman, High Reliability Management (Stanford Univ. Press, Stanford, CA: 2008). The book reports the authors' study of CAISO from 2001 through 2006.
** By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment. Their political and social environment is similarly buffered: the NRC shields them from direct interference by politicians, although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.
The importance of considering low-probability, major consequence events is argued by Taleb (see here) and Dédale (see here).
*** Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits.
The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone. For more on Cynefin, see here and here.
**** For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO. As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)
Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits. The analysis illustrates how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as a precursor of the need to go to Stage 3 command-and-control load shedding.
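The authors' ch. 11 analysis is far more detailed than this, but a toy regression sketch conveys the shape of the exercise; every variable name and number below is invented for illustration and is not taken from the book.
# Toy sketch, not the authors' model: regress a measure of output volatility on
# input volatilities, in the spirit of the ch. 11 analysis. All data are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical number of observation periods

# Hypothetical input volatilities (e.g., rolling variability of demand, prices, outages).
demand_vol = rng.gamma(2.0, 1.0, n)
price_vol = rng.gamma(2.0, 1.5, n)
outage_vol = rng.gamma(1.5, 1.0, n)

# Hypothetical output volatility: time per period outside the regulatory band.
output_vol = 3.0 * demand_vol + 1.5 * price_vol + 0.5 * outage_vol + rng.normal(0.0, 2.0, n)

# Ordinary least squares with numpy only.
X = np.column_stack([np.ones(n), demand_vol, price_vol, outage_vol])
coef, *_ = np.linalg.lstsq(X, output_vol, rcond=None)
print(dict(zip(["intercept", "demand_vol", "price_vol", "outage_vol"], coef.round(2))))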
Posted by Lewis Conner | 1 comment
Labels: Complacency, Cynefin, HRO, Management, Mental Model, Normalization of Deviance, References, Simulation, Taleb
Friday, September 27, 2013
Four Years of Safetymatters
[Image: Aztec Calendar]
Systems View
We have consistently considered safety culture (SC) in the nuclear industry to be one component of a complicated socio-technical system. A systems view provides a powerful mental model for analyzing and understanding organizational behavior.
Our design and explicative efforts began with system dynamics as described by authors such as Peter Senge, focusing on characteristics such as feedback loops and time delays that can affect system behavior and lead to unexpected, non-linear changes in system performance. Later, we expanded our discussion to incorporate the ways systems adapt and evolve over time in response to internal and external pressures. Because they evolve, socio-technical organizations are learning organizations but continuous improvement is not guaranteed; in fact, evolution in response to pressure can lead to poorer performance.
The systems view, system dynamics and their application through computer simulation techniques are incorporated in the NuclearSafetySim management training tool.
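As a purely illustrative aside (this is not NuclearSafetySim), a few lines of simulation show how a single corrective feedback loop with a reporting delay, of the kind Senge describes, can produce the oscillation and drift discussed above; all of the constants are assumptions chosen for the example.
# Illustration only: one corrective feedback loop with a reporting delay.
# Safety margin erodes under production pressure; management reacts to an
# out-of-date reading of the margin, so the correction lags and the system
# oscillates and settles below the target. All constants are assumed.
DELAY = 6        # periods between a change in margin and management seeing it
EROSION = 0.8    # margin lost per period to production pressure
GAIN = 0.2       # strength of the corrective response
TARGET = 100.0

margin = 100.0
history = [margin] * DELAY   # what management "sees" is DELAY periods old

for period in range(40):
    observed = history[-DELAY]               # stale information
    correction = GAIN * (TARGET - observed)  # corrective action based on stale margin
    margin = margin - EROSION + correction
    history.append(margin)
    if period % 5 == 0:
        print(f"period {period:2d}: actual margin {margin:6.1f}, observed {observed:6.1f}")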
Decision Making
A critical, defining activity of any organization is decision making. Decision making determines what will (or will not) be done, by whom, and with what priority and resources. Decision making is directed and constrained by factors including laws, regulations, policies, goals, procedures and resource availability. In addition, decision making is imbued with and reflective of the organization's values, mental models and aspirations, i.e., its culture, including safety culture.
Decision making is intimately related to an organization's financial compensation and incentive program. We've commented on these programs in nuclear and non-nuclear organizations and identified the performance goals for which executives received the largest rewards; often, these were not safety goals.
Decision making is part of the behavior exhibited by senior managers. We expect leaders to model desired behavior and are disappointed when they don't. We have provided examples of good and bad decisions and leader behavior.
Safety Culture Assessment
We have cited NRC Commissioner Apostolakis' observation that “we really care about what people do and maybe not why they do it . . .” We sympathize with that view. If organizations are making correct decisions and getting acceptable performance, the “why” is not immediately important. However, in the longer run, trying to identify the why is essential, both to preserve organizational effectiveness and to provide a management (and mental) model that can be transported elsewhere in a fleet or industry.
What is not useful, and possibly even a disservice, is a feckless organizational SC “analysis” that focuses on a laundry list of attributes or limits remedial actions to retraining, closer oversight and selective punishment. Such approaches ignore systemic factors and cannot provide long-term successful solutions.
We have always been skeptical of the value of SC surveys. Over time, we saw that others shared our view. Currently, broad-scope, in-depth interviews and focus groups are recognized as preferred ways to attempt to gauge an organization's SC and we generally support such approaches.
On a related topic, we were skeptical of the NRC's SC initiatives, which culminated in the SC Policy Statement. As we have seen, this “policy” has led to back door de facto regulation of SC.
References and Examples
We've identified a library of references related to SC. We review the work of leading organizational thinkers, social scientists and management writers, attempt to accurately summarize their work and add value by relating it to our views on SC. We've reported on the contributions of Dekker, Dörner, Hollnagel, Kahneman, Perin, Perrow, Reason, Schein, Taleb, Vaughan, Weick and others.
We've also posted on the travails of organizations that dug themselves into holes that brought their SC into question. Some of these were relatively small potatoes, e.g., Vermont Yankee and EdF, but others were actual disasters, e.g., Massey Energy and BP. We've also covered DOE, especially the Hanford Waste Treatment and Immobilization Plant (aka the Vit plant).
Conclusion
We believe the nuclear industry is generally well-managed by well-intentioned personnel but can be affected by the natural organizational ailments of complacency, normalization of deviance, drift, hubris, incompetence and occasional criminality. Our perspective has evolved as we have learned more about organizations in general and SC in particular. Channeling John Maynard Keynes, we adapt our models when we become aware of new facts or better ways of looking at the data. We hope you continue to follow Safetymatters.
Tuesday, September 24, 2013
Safety Paradigm Shift
We came across a provocative and persuasive presentation by Jean Pariès of Dédale, "Why a Paradigm Shift Is Needed," from the IAEA Experts Meeting in May of this year.* Many of the points resonate with our views on nuclear safety management, in particular complexity; the fallacy of the "predetermination envelope" (making a system more reliable within its design envelope but more susceptible outside that envelope); deterministic and probabilistic rationalization that avoids dealing with the complexity of the system; and unknown-unknowns. We also believe it will take a paradigm shift, however unlikely that may be, at least in the U.S. nuclear industry. Interestingly, Pariès does not appear to have a nuclear power background and develops his paradigm argument across multiple events and industries.
Pariès poses a very fundamental question: since the current safety construct has shown vulnerabilities to actual off-normal events, should the response be to do more of the same, but better and with more rigor? Or should the safety paradigm itself be challenged? The key issue underlying the challenge to this construct is how to cope with complexity. He means complexity in the same terms we have posted about numerous times.
Pariès notes “The uncertainty generated by the complexity of the system itself and by its environment is skirted through deterministic or probabilistic rationality.” (p. 8) We agree. Any review of condition reports and Tech Spec variances indicates a wholesale reliance on risk-based rationales for deviations from nominal requirements. And the risk-based argument almost always rests on an estimated small probability of an event that would challenge safety, often made smaller still by a relatively short exposure time frame. As we highlighted in a prior post, Nick Taleb has long cautioned against making decisions based on assessments of probabilities, which he asserts we cannot know, versus consequences, which are (sometimes uncomfortably) knowable.
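A deliberately simplified arithmetic sketch, using invented numbers, shows why this style of argument almost always yields a comfortingly small figure even when the consequence term is enormous; the point is the structure of the calculation, not the particular values.
# Invented numbers; the point is the structure of the argument, not the values.
event_frequency_per_year = 1.0e-4   # assumed estimate of a safety-challenging event
exposure_days = 14                  # assumed duration of the deviation
probability_during_exposure = event_frequency_per_year * exposure_days / 365.0

consequence_cost = 1.0e10           # an essentially unbounded consequence, in dollars

print(f"probability during exposure: {probability_during_exposure:.2e}")
print(f"nominal risk (p x c):        {probability_during_exposure * consequence_cost:,.0f}")
# Taleb's caution: the probability estimate is the least knowable number in this
# calculation, while the consequence is the most knowable one.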
How does this relate to safety management issues including culture?
We see a parallel between the constructs for nuclear safety and safety culture. The nuclear safety construct is constrained both in focus and evolution, heavily reliant on the design basis philosophy (what Pariès labels the “predetermination fallacy”) dating back to the 1960s. Little has changed over the succeeding 50 years; even the advent of PRA has been limited to “informing” the implementation of this approach. Safety culture has emerged over the last 10+ years as an added regulatory emphasis, though highly constrained in its manifestation as a policy statement. (It is in fact still quite difficult to square the NRC’s characterization of safety culture as critical to safety** with its stopping well short of any regulation or requirements.) The definitional scope of safety culture is expressed in a set of traits and related values and behaviors. As with nuclear safety, it has a limited scope and relies on abstractions emphasizing, in essence, individual morality. It does not look beyond people to the larger environment and “system” within which people function. This environment can bring to bear significant influences that can challenge the desired traits and values of safety culture policy and muddle their application to decisions and actions. The limitations can be seen in the assessments of safety culture (surveys and similar) as well as in the investigations of specific events, violations or non-conformances by licensees and the NRC. We’ve read many of these and rarely have we encountered any probing of the “why” associated with perceived breakdowns in safety culture.
One exception, and a very powerful case in point, is contained in our post dated July 29, 2010.*** The cited reference is an internal root cause analysis performed by FPL to address employee concerns and identified weaknesses in their corrective action program. The analysis cites production pressures as negatively impacting employee trust and recognition, and perceptions of management and operational decisions. FPL took steps to change the origin and impact of production pressures, relieving some of the burden on the organization to contain those influences within the boundaries of safe operation.
Perhaps the NRC believes that it does not have the jurisdiction to probe these types of issues or even require licensees to assess their influence. Yet the NRC routinely refers to “licensee burden” - cost, schedule, production impacts - in accepting deviations from nominal safety standards.**** We wonder if a broader view of safety culture in the context of the socio-technical system might better “inform” both regulatory policy and decisions and enhance safety management.
* J. Pariès (Dédale), "Why a Paradigm Shift Is Needed," IAEA International Experts’ Meeting on Human and Organizational Factors in Nuclear Safety in the Light of the Accident at the Fukushima Daiichi Nuclear Power Plant, Vienna, May 21-24, 2013.
** The NRC’s Information Notice 2013-15 states that safety culture is “essential to nuclear safety in all phases…”
*** "NRC Decision on FPL (Part 2)," Safetymatters (July 29, 2010). See slide 18, Root Cause 2 and Contributing Causes 2.2 and 2.4.
**** 10 CFR 50.55a(g)(6)(i) states that the Commission may grant such relief and may impose such alternative requirements as it determines is authorized by law and will not endanger life or property or the common defense and security and is otherwise in the public interest, given the consideration of the burden upon the licensee (emphasis added).
Posted by Bob Cudlin | 0 comments
Labels: IAEA, Mental Model, Safety Culture, Taleb
Tuesday, September 17, 2013
Even Macy’s Does It
We have long been proponents of looking for innovative ways to improve safety management training for nuclear professionals. We’ve taken on the burden of developing a prototype management simulator, NuclearSafetySim, and made it available to our readers to experience for themselves (see our July 30, 2013 post). In the past we have also noted other industries and organizations that have embraced simulation as an effective management training tool.
An August article in the Wall Street Journal* cites several examples of new approaches to manager training. Most notable in our view is Macy’s use of simulations to have managers gain decision making experience. As the article states:
“The simulation programs aim to teach managers how their daily decisions can affect the business as a whole.”
We won’t revisit all the arguments that we’ve made for taking a systems view of safety management, focusing on decisions as the essence of safety culture and using simulation to allow personnel to actualize safety values and priorities. All of these could only enrich, challenge and stimulate training activities.
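To make the idea concrete, here is a minimal sketch (again, not NuclearSafetySim, and the scenario values are invented) of the kind of decision loop such a simulator presents: each choice nudges both production and safety margin, so the cumulative effect of “small” decisions becomes visible to the trainee.
# Illustration only -- not NuclearSafetySim. Each choice moves both production
# and safety margin, so the cumulative effect of small decisions becomes visible.
scenario = [
    # (description, production_gain, safety_margin_change) -- invented values
    ("Defer valve maintenance to protect the outage schedule", +3.0, -3.0),
    ("Add a second reviewer to a risky work package", -0.5, +1.5),
    ("Accept a degraded instrument with a compensatory action", +1.0, -1.5),
]

production, safety_margin = 0.0, 10.0
for description, gain, margin_change in scenario:
    choice = input(f"{description}? (y/n) ").strip().lower()
    if choice == "y":
        production += gain
        safety_margin += margin_change
    print(f"  production index: {production:+.1f}, safety margin: {safety_margin:.1f}")

print("Margin eroded below threshold!" if safety_margin < 8.0 else "Margin maintained.")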
A Clockwork Magenta
On the other hand, what is the value of training approaches that reiterate INPO slide shows, regulatory policy statements and good practices in seemingly endless iterations? It brings to mind Alex, the incorrigible sociopath with an unusual passion for classical music in A Clockwork Orange.** He is the subject of a “reclamation treatment”: head clamped in a brace, eyes pinned wide open, forced to watch repetitive screenings of anti-social behavior set to the music of Beethoven’s Fifth. We are led to believe this results in a “cure,” but does it, and at what cost?
Nuclear managers may not be treated exactly like Alex but there are some similarities. After plant problems occur and are diagnosed, managers are also declared “cured” after each forced feeding of traits, values, and the need for increased procedure adherence and oversight. Results still not satisfactory? Repeat.
* R. Feintzeig, "Building Middle-Manager Morale," Wall Street Journal (Aug. 7, 2013). Retrieved Sept. 24, 2013.
** M. Amis, "The Shock of the New:‘A Clockwork Orange’ at 50," New York Times Sunday Book Review (Aug. 31, 2013). Retrieved Sept. 24, 2013.
Posted by Bob Cudlin | 0 comments
Labels: Management, Nuclearsafetysim, Simulation