Showing posts with label HRO. Show all posts
Showing posts with label HRO. Show all posts

Monday, April 13, 2015

Safety-I and Safety-II: The Past and Future of Safety Management by Erik Hollnagel

This book* discusses two different ways of conceptualizing safety performance problems (e.g., near-misses, incidents and accidents) and safety management in socio-technical systems.  This post describes each approach and provides our perspective on Hollnagel’s efforts.  As usual, our interest lies in the potential value new ways of thinking can offer to the nuclear industry.

Safety-I

This is the common way of looking at safety performance problems.  It is reactive, i.e., it waits for problems to arise** and analytic, e.g., it uses specific methods to work back from the problem to its root causes.  The key assumption is that something in the system has failed or malfunctioned and the purpose of an investigation is to identify the causes and correct them so the problem will not recur.  A second assumption is that chains of causes and effects are linear, i.e., it is actually feasible to start with a problem and work back to its causes.  A third assumption is that a single solution (the “first story”) can be found. (pp. 86, 175-76)***  Underlying biases include the hindsight bias (p. 176) and the belief that the human is usually the weak link. (pp. 78-79)  The focus of safety management is minimizing the number of things that go wrong.

Our treatment of Safety-I is brief because we have reported on criticism of linear thinking/models elsewhere, primarily in the work of Dekker, Woods et al, and Leveson.  See our posts of Dec. 5, 2012; July 6, 2013; and Nov. 11, 2013 for details.

Safety-II

Safety-II is proposed as a different way to look at safety performance.  It is proactive, i.e., it looks at the ways work is actually performed on a day-to-day basis and tries to identify causes of performance variability and then manage them.  A key cause of variability is the regular adjustments people make in performing their jobs in order to keep the system running.  In Hollnagel’s view, “Finding out what these [performance] adjustments are and trying to learn from them can be more important than finding the causes of infrequent adverse outcomes!” (p. 149)  The focus of safety management is on increasing the likelihood that things will go right and developing “the ability to succeed under varying conditions, . . .” (p. 137).

Performance is variable because, among other reasons, people are always making trade-offs between thoroughness and efficiency.  They may use heuristics or have to compensate for something that is missing or take some steps today to avoid future problems.  The underlying assumption of Safety-II is that the same behaviors that almost always lead to successful outcomes can occasionally lead to problems because of performance variability that goes beyond the boundary of the control space.  A second assumption is that chains of causes and effects may be non-linear, i.e., a small variance may lead to a large problem, and may have an emergent aspect where a specific performance variability may occur then disappear or the Swiss cheese holes may momentarily line up exposing the system to latent hazards. (pp. 66, 131-32)  There may be multiple explanations (“second stories”) for why a particular problem occurred.  Finally, Safety-II accepts that there are often differences between Work-as-Imagined (esp. as imagined by folks at the blunt end) and Work-as-Done (by people at the sharp end). (pp. 40-41)***

The Two Approaches

Safety-I and Safety-II are not in some winner-take-all competitive struggle.  Hollnagel notes there are plenty of problems for which a Safety-I investigation is appropriate and adequate. (pp. 141, 146)

Safety-I expenditures are viewed as a cost (to reduce errors). (p. 57)  In contrast, Safety-II expenditures are viewed as bona fide investments to create more correct outcomes. (p. 166)

In all cases, organizational factors, such as safety culture, can impact safety performance and organizational learning. (p. 31)

Our Perspective

The more complex a socio-technical entity is, the more it exhibits emergent properties and the more appropriate Safety-II thinking is.  And nuclear has some elements of complexity.****  In addition, Hollnagel notes that a common explanation for failures that occur in a System-I world is “it was never imagined something like that could happen.” (p. 172)  To avoid being the one in front of the cameras saying that, it might be helpful for you to spend a little time reflecting on how System-II thinking might apply in your world.

Why do most things go right?  Is it due to strict compliance with procedures?  Does personal creativity or insight contribute to successful plant performance?  Do you talk with your colleagues about possible efficiency-thoroughness trade-offs (short cuts) that you or others make?  Can thinking about why things go right make one more alert to situations where things are heading south?  Does more automation (intended to reduce reliance on fallible humans) actually move performance closer to the control boundary because it removes the human’s ability to make useful adjustments?  Has any of your root cause evaluations appeared to miss other plausible explanations for why a problem occurred?

Some of the Safety-II material is not new.  Performance variability in Safety-II builds on Hollnagel’s earlier work on the efficiency-thoroughness trade-off (ETTO) principle.  (See our Jan. 3, 2013 post.)   His call for mindfulness and constant alertness to problems is straight out of the High Reliability Organization playbook. (pp. 36, 163-64)  (See our May 3, 2013 post.)

A definite shortcoming is the lack of concrete examples in the Safety-II discussion.  If someone has tried to do this, it would be nice to hear about it.

Bottom line, Hollnagel has some interesting observations although his Safety-II model is probably not the Next Big Thing for nuclear safety management.

 

*  E. Hollnagel, Safety-I and Safety-II: The Past and Future of Safety Management  (Burlington, VT: Ashgate , 2014)

**  In the author’s view, forward-looking risk analysis is not proactive because it is infrequently performed. (p. 57) 

***  There are other assumptions in the Safety-I approach (see pp. 97-104) but for the sake of efficiency, they are omitted from this post.

****  Nuclear power plants have some aspects of a complex socio-technical system but other aspects are merely complicated.   On the operations side, activities are tightly coupled (one attribute of complexity) but most of the internal organizational workings are complicated.  The lack of sudden environmental disrupters (excepting natural disasters) means they have time to adapt to changes in their financial or regulatory environment, reducing complexity.

Thursday, September 4, 2014

DNFSB Hearings on Safety Culture, Round Two

DNFSB Headquarters
On August 27, 2014 the Defense Nuclear Facilities Safety Board (DNFSB) convened the second of three hearings “to address safety culture at Department of Energy defense nuclear facilities and the Board’s Recommendation 2011–1, Safety Culture at the Waste Treatment and Immobilization Plant.”*  The first hearing was held on May 28, 2014 and heard from industry and federal government safety culture (SC) experts; we reviewed that hearing on June 9, 2014.  The second hearing received SC expert testimony from the U.S. Navy, the U.S. Chemical Safety and Hazard Investigation Board and academia.  The following discussion reviews the presentations in the order they were made to the board. 


Adm. Norton's (Naval Safety Center) presentation** on the Navy’s SC programs was certainly comprehensive with 32 slides for a half-hour talk (plus 22 backup slides).  It appears the major safety focus has been on aviation but the Center’s programs also address the afloat communities (surface, submarine and diving) and Marines.  The programs make heavy use of surveys and unit visits in addition to developing and presenting training and workshops.  Not surprisingly, the Navy stresses the importance of leadership, especially personal involvement and commitment, in creating a strong SC.  They recognize that implementing a strong SC faces a direct challenge from other organizational values such as the warfighter mentality*** and softer challenges in areas such as IT (where there are issues with multiple systems and data problems).

Program strengths include the focus on leadership (leadership drives climate, climate drives cultural change) and the importance of determining why mishaps occurred.  The positive influence of a strong SC on decision making is implied.

Program weaknesses can be inferred from what was not mentioned.  For example, there was no discussion of the importance of fixing problems or identifying hard-to-see technical problems.  More significantly, there was no mention of High Reliability Organization (HRO) attributes, a real head-scratcher given that some of the seminal work on HROs was conducted on aircraft carriers. 

Adm. Eccles' (Navy ret.) presentation**** basically reviews the Navy’s SUBSAFE program and its focus on compliance with program requirements from design through operations.  Eccles notes that ignorance, arrogance and complacency are challenges to maintaining an effective program.


Mr. Griffon's (Chemical Safety Board Member) presentation***** illustrates the CSB’s straightforward approach to investigating incidents, as reflected in the following quotes:

“Intent of CSB investigations are to get to root cause(s) and make recommendations toward prevention.” (p. 3)

While searching for root causes the CSB asks: “Why conditions or decisions leading to accident were seen as normal, rational, or acceptable prior to the accident.” (p. 4)


CSB review of incident-related artifacts includes two of our hot button issues, Process Safety Management action item closure (akin to a CAP) and the repair backlog. (p. 5)  Griffon reviews major incidents, e.g., Texas City and Deepwater Horizon.  For Deepwater, he notes how certain decisions were (deliberately) incompletely informed, i.e., did not utilize readily available relevant information, and thus are indicative of an inadequate SC. (p. 16)  Toward the end Griffon observes that “Safety culture study/change must consider inequalities of power and authority.” (p. 19)  That seems obvious but it doesn’t often get said so clearly.

We like the CSB’s approach.  There is no new information here but it’s a quick read of what basic SC should and shouldn’t be.


Prof. Meshkati's (Univ. of S. Cal.) presentation^ compares the cultures at TEPCO’s Fukushima Daiichi plant and Tohoku Electric’s Onagawa plant.  It is mainly a rehash of the op-ed Meshkati co-authored back in March 2014 (and we reviewed on March 19, 2014.)  The presentation adds something we pointed out as an omission in that op-ed, viz., that TEPCO’s Fukushima Daini plant eventually managed to shut down safely after the earthquake and tsunami.  Meshkati notes approvingly that Daini personnel exhibited impromptu, but prudent, decision-making and improvisation, e.g., by flexibly applying emergency operation procedures. (p. 37)

Prof. Sutcliffe (John Hopkins Univ.) co-authored an important book on High Reliability Organizations (which we reviewed on May 3, 2013) and this academically-oriented presentation^^ draws on her earlier work.  It begins with a familiar description of culture and how its evolution can be influenced.  Importantly it shows rewards (including money) as a key input affecting the link between leaders’ philosophy and employees’ behavior. (p. 6) 

Sutcliffe discusses how failure to redirect action (in a situation where a change is needed) can result from failure of foresight or sensemaking, or being overcome by dysfunctional momentum.  She includes a lengthy example featuring wildland firefighters that illustrates the linkages between cues, voiced concerns, search for disparate perspectives, situational reevaluation and redirected actions.  It’s worth a few minutes of your time to flip through these slides.

Our Perspective

For starters, the Naval Safety Center's
activities may be too bureaucratic, with too many initiatives and programs, and focused mainly on compliance with procedures, rules, designs, etc.  It’s not clear what SC lessons can be learned from the Navy experience beyond the vital role of leadership in creating a cultural vision and attempting to influence behavior toward that vision.

The other presenters added nothing that was not already available to you, either through Safetymatters or from observing SC tidbits in the information soup that flows by everyone these days.

Subsequent to the first hearing we reported that Safety Conscious Work Environment (SCWE) issues exist at multiple DOE sites (see our July 8, 2014 post).  This should increase the sense of urgency associated with strengthening SC throughout DOE.  However, our bottom line remains the same as after the first hearing: “The DNFSB is still trying to figure out the correct balance between prescription and flexibility in its effort to bring DOE to heel on the SC issue.  SC is a vital part of the puzzle of how to increase DOE line management effectiveness in ensuring adequate safety performance at DOE facilities.” 


*  DNFSB Aug. 27, 2014 Public Hearing on Safety Culture and Board Recommendation 2011-1.  There is a video of the hearing available.

**  K.J. Norton (U.S. Navy), “The Naval Safety Center and Naval Safety Culture,“ presentation to DNFSB (Aug. 27, 2014).

***  “Anything, anywhere, anytime…at any cost”—desirable warfighter mentality perceived to conflict with safety.” (p. 11)

****  T. J. Eccles (U.S. Navy ret.), “A Culture of Safety: Submarine Safety in the U. S. Navy,” presentation to DNFSB (Aug. 27, 2014).

*****  M.A. Griffon (Chem. Safety Bd.), “CSB Investigations and Safety Culture,” presentation to DNFSB (Aug. 27, 2014).

^  Najm Meshkati, “Leadership and Safety Culture: Personal Reflections on Lessons Learned,” presentation to DNFSB (Aug. 27, 2014).  Prof. Meshkati was also the technical advisor to the National Research Council’s safety culture lessons learned from Fukushima report which we reviewed on July 30, 2014.

^^  K.M. Sutcliffe, “Leadership and Safety Culture,” presentation to DNFSB (Aug. 27, 2014).

Monday, November 11, 2013

Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson

In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory.  The book presents the criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics.  It should be noted the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.

Part I
     
Part I sets the stage for a new safety paradigm.  Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6)   Traditional accident analysis techniques are no longer sufficient.  They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10)  Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).**   We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.

Event-based models are simply inadequate.  There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it.  The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24)  Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19)  In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)

Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors.  “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination.  When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent.  This dependence may be related to common systemic factors that do not appear in an event chain.  Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35)  “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36) 

The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38)  In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46) 

Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51)  Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52)  This is reminiscent of  Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.

When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56)  In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64)  Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60) 

Part II

Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes).  For our purposes we don't need to spend much space on this material.  “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”***   It attempts to achieve the goals listed at the end of Part I.

STAMP treats safety in a system as a control problem, not a reliability one.  Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76)  Controls may be physical or social, including cultural.  There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87)  “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)

Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system.  There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.

Part III

Part III describes in detail how STAMP can be applied.  There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section.  Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of  STAMP over traditional accident analysis techniques. 

We will focus on a chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.

Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421)  Two new components need a bit of explanation, a safety control structure and controls on system migration toward higher risk.  The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40)  One of the control structure's responsibilities is to identify “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26)  Such an approach should be based on the organization's comprehensive hazards analysis.****

The safety culture discussion is also familiar. (pp. 426-33)  Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust.  In such a culture, Leveson says an early warning system for migration toward states of high risk can be established.  A section on Just Culture is taken directly from Dekker's work.  The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.

Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.

Our Perspective
 

Overall, we like this book.  It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial.  The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog.  The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.
 

However, there are only a few direct references to nuclear.  The examples in the book are mostly from aerospace, aviation, maritime activities and the military.  Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory),  shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs.  Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428)  She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7)  In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective.  TMI is used to illustrate specific system design/operation problems.
 

We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists.  First, she accuses HRO of confusing reliability with safety, in other words, an unsafe system can function very reliably. (pp. 7, 12)  But I'm not aware of any HRO work that has been done in an organization that is patently unsafe.  HRO asserts that reliability follows from practices that recognize and contain emerging problems.  She takes another swipe at HRO when she says HRO suggests that, during crises, decision making migrates to frontline workers.  Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44)  Her position may be correct in some cases but as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could.  Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.”  In her view, “Telling managers and operators to be “mindful of weak signals” simply creates a pretext for blame after a loss event occurs.” (p. 410)  I don't think it's pretext but it is challenging to maintain mindfulness and sense faint signals.  Overall, this appears to be academic posturing and feather fluffing.
 

We offer no opinion on the efficacy of using Leveson's STAMP approach.  She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)
 

In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii)  You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.


*  N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (The MIT Press, Cambridge, MA: 2011)  The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed.  All quotes in this post were retyped from the original text.


**  We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier.  These authors are all aware of others' publications and contributions.  Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text. 

***  Nancy Leveson informal bio page.


****  “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157)  The hazards analysis identifies all major hazards the system may confront.  Baseline safety requirements follow from the hazards analysis.  Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk.  The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)

Monday, October 14, 2013

High Reliability Management by Roe and Schulman

This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market.  CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world.  In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.

The High Reliability Management (HRM) Model

The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3)  System Volatility refers to the magnitude and rate of change of  CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall).  Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs. 

System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each.  All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid.  Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes.  Operating conditions are challenging: supply and demand are inherently unstable (p. 34), inadequate supply means some load cannot be served and too much generation can damage the grid. (pp. 27, 142)

High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance.  Some strategies can be substituted for others.  It is a dynamic but manageable environment.

High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance.  They run from pillar to post; it is highly stressful.  Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trail and error.  Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills.  It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)

Low Volatility and Low Options mean generation and demand are not changing quickly.  The critical feature here is demand has been reduced by load shedding.  The operators have exhausted all other strategies for maintaining balance.  It is a command-and-control approach, effected by declaring a  Stage 3 grid situation and run using formal rules and procedures.  It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished. 

Low Volatility and High Options is an HRM's preferred mode.  Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and there is no congestion on transmission lines or backup routes are available.  Procedures based on analyzed conditions exist and are used.  There are few, if any, surprises.  Learning can occur but it is incremental, the result of new methods or analysis.  Performance is important and system behavior operates within a narrow bandwidth.  Loss of attention (complacency) is a risk.  Is this starting to sound familiar?  This is the domain of High Reliability Organization (HRO) theory and practice.  Nuclear power operations is an example of an HRO. (pp. 60-62)          

Lessons for Nuclear Operations 


Nuclear plants work hard to stay in the Low Volatility/High Options mode.  If they stray into the Low Options column, they run the risks of facing unanalyzed situations and regulatory non-compliance. (p. 62)  In their effort to optimize performance in the desired mode, plants examine their performance risks to ever finer granularity through new methods and analyses.  Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.**  Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must be mindful to performance drift and nascent problems.   

In an HRO, trial and error is not an acceptable method for trying out new options.  No one wants cowboy operators in the control room.  But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233)  In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)  

The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear.  The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system.  In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance.  In a similar vein, CAISO has to worry about complacency in just one mode, for nuclear it's effectively the only mode and complacency is an omnipresent threat. (p. 126) 

The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is overload might occur suddenly and with little or no warning.*** Anticipation and resilience are more obvious needs at CAISO but also necessary in nuclear operations. (pp. 5, 124)

Implications for Safety Culture

Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2)  Both organizational types need a focus on operations as the central activity.  Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.

The willingness to report errors should be encouraged but we have seen that is a thin spot in the SC at some plants.  Errors can be a gateway into learning how to create more reliable performance and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220) 

The simultaneous needs to operate within a prescribed envelope while considering how the envelope might be breached has implications for SC.  We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to group think and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?” 

Conclusion

This review gives short shrift to the authors' broad and deep description and analysis of CAISO.****  The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.

The book describes a rare breed, the socio-technical-political start-up, and has too much scope for the average nuclear practitioner to plow through searching for newfound nuggets that can be applied to nuclear management.  But it's a good read and full of insightful observations, e.g., the description of  CAISO's early days (ca. 2001-2004) when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve with operators reporting less confidence in their ability to manage the grid and accomplish the mission in 2004 vs. 2001. (Ch. 5)

(High Reliability Management was recommended by a Safetymatters reader.  If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)

*  E. Roe and P. Schulman, High Reliability Management (Stanford Univ. Press, Stanford, CA: 2008)  This book reports the authors' study of CAISO from 2001 through 2006. 

**  By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment.  Their political and social environment is similar: The NRC buffers them from direct interference by politicians although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.

The importance of considering low-probability, major consequence events is argued by Taleb (see here) and D├ędale (see here).

***  Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits. 

The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone.  For more on Cynefin, see here and here.

****  For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO.  As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)

Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits.  This analysis illustrated how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as precursor to the need to go to Stage 3 command-and-control load shedding.

Thursday, August 29, 2013

Normal Accidents by Charles Perrow

This book*, originally published in 1984, is a regular reference for authors writing about complex socio-technical systems.**  Perrow's model for classifying such systems is intuitively appealing; it appears to reflect the reality of complexity without forcing the reader to digest a deliberately abstruse academic construct.  We will briefly describe the model then spend most of our space discussing our problems with Perrow's inferences and assertions, focusing on nuclear power.  

The Model

The model is a 2x2 matrix with axes of coupling and interactions.  Not surprisingly, it is called the Interaction/Coupling (IC) chart.

“Coupling” refers to the amount of slack, buffer or give between two items in a system.  Loosely coupled systems can accommodate shocks, failures and pressures without destabilizing.  Tightly coupled systems have a higher risk of disastrous failure because their processes are more time-dependent, with invariant sequences and a single way of achieving the production goal, and have little slack. (pp. 89-94)

“Interactions” may be linear or complex.  Linear interactions are between a system component and one or more other components that immediately precede or follow it in the production sequence.  These interactions are familiar and, if something unplanned occurs, the results are easily visible.  Complex interactions are between a system component and one or more other components outside the normal production sequence.  If unfamiliar, unplanned or unexpected sequences occur, the results may not be visible or immediately comprehensible. (pp. 77-78)

Nuclear plants have the tightest coupling and most complex interactions of the two dozen systems Perrow shows on the I/C chart, a population that included chemical plants, space missions and nuclear weapons accidents. (p. 97)

Perrow on Nuclear Power

Let's get one thing out of the way immediately: Normal Accidents is an anti-nuke screed.  Perrow started the book in 1979 and it was published in 1984.  He was motivated to write the book by the TMI accident and it obviously colored his forecast for the industry.  He reviews the TMI accident in detail, then describes nuclear industry characteristics and incidents at other plants, all of which paint an unfavorable portrait of the industry.  He concludes: “We have not had more serious accidents of the scope of Three Mile Island simply because we have not given them enough time to appear.” (p. 60, emphasis added)  While he is concerned with design, construction and operating problems, his primary fear is “the potential for unexpected interactions of small failures in that system that makes it prone to the system accident.” (p. 61)   

Why has his prediction of such serious accidents not come to pass, at least in the U.S.?

Our Perspective on Normal Accidents

We have several issues with this book and the author's “analysis.”

Nuclear is not as complex as Perrow asserts 


There is no question that the U.S. nuclear industry grew quickly, with upsized plants and utilities specifying custom design combinations (in other words, limited standardization).  The utilities were focused on meeting significant load growth forecasts and saw nuclear baseload capacity as an efficient way to produce electric power.  However, actually operating a large nuclear plant was probably more complex than the utilities realized.  But not any more.  Learning curve effects, more detailed procedures and improved analytic methods are a few of the factors that led to a greater knowledge basis for plant decision making.  The serious operational issues at the “problem plants” (circa 1997) forced operators to confront the reality that identifying and permanently resolving plant problems was necessary for survival.  This era also saw the beginning of industry consolidation, with major operators applying best methods throughout their fleets.  All of these changes have led to our view that nuclear plants are certainly complicated but no longer complex and haven't been for some time.    

This is a good place to point out that Perrow's designation of nuclear plants as the most complex and tightest coupled systems he evaluated has no basis in any real science.  In his own words, “The placement of systems [on the interaction/coupling chart] is based entirely on subjective judgments on my part; at present there is no reliable way to measure these two variables, interaction and coupling.” (p. 96)

System failures with incomprehensible consequences are not the primary problem in the nuclear industry

The 1986 Chernobyl disaster was arguably a system failure: poor plant design, personnel non-compliance with rules and a deficient safety culture.  It was a serious accident but not a catastrophe.*** 

But other significant industry events have not arisen from interactions deep within the system; they have come from negligence, hubris, incompetence or selective ignorance.  For example, Fukushima was overwhelmed by a tsunami that was known to be possible but was ignored by the owners.  At Davis-Besse, personnel ignored increasingly stronger signals of a nascent problem but managers argued that in-depth investigation could wait until the next outage (production trumps safety) and the NRC agreed (with no solid justification).  

Important system dynamics are ignored 


Perrow has some recognition of what a system is and how threats can arise within it: “. . . it is the way the parts fit together, interact, that is important.  The dangerous accidents lie in the system, not in the components.” (p. 351)  However, he is/was focused on interactions and couplings as they currently exist.  But a socio-technical system is constantly changing (evolving, learning) in response to internal and external stimuli.  Internal stimuli include management decisions and the reactions to performance feedback signals; external stimuli include environmental demands, constraints, threats and opportunities.  Complacency and normalization of deviance can seep in but systems can also bolster their defenses and become more robust and resilient.****  It would be a stretch to say that nuclear power has always learned from its mistakes (especially if they occur at someone else's plant) but steps have been taken to make operations less complex. 

My own bias is Perrow doesn't really appreciate the technical side of a socio-technical system.  He recounts incidents in great detail, but not at great depth and is often recounting the work of others.  Although he claims the book is about technology (the socio side, aka culture, is never mentioned), the fact remains that he is not an engineer or physicist; he is a sociologist.

Conclusion

Notwithstanding all my carping, this is a significant book.  It is highly readable.  Perrow's discussion of accidents, incidents and issues in various contexts, including petrochemical plants, air transport, marine shipping and space exploration, is fascinating reading.  His interaction/coupling chart is a useful mental model to help grasp relative system complexity although one must be careful about over-inferring from such a simple representation.

There are some useful suggestions, e.g., establishing an anonymous reporting system, similar to the one used in the air transport industry, for nuclear near-misses. (p. 169)  There is a good discussion of decentralization vs centralization in nuclear plant organizations. (pp. 334-5)  But he says that neither is best all the time, which he considers a contradiction.  The possibility of contingency management, i.e., using a decentralized approach for normal times and tightening up during challenging conditions, is regarded as infeasible.

Ultimately, he includes nuclear power with “systems that are hopeless and should be abandoned because the inevitable risks outweigh any reasonable benefits . . .” (p. 304)*****  As further support for this conclusion, he reviews three different ways of evaluating the world: absolute, bounded and social rationality.  Absolute rationality is the province of experts; bounded rationality recognizes resource and cognitive limitations in the search for solutions.  But Perrow favors social rationality (which we might unkindly call crowdsourced opinions) because it is the most democratic and, not coincidentally, he can cite a study that shows an industry's “dread risk” is highly correlated with its position on the I/C chart. (p. 326)  In other words, if lots of people are fearful of nuclear power, no matter how unreasonable those fears are, that is further evidence to shut it down.

The 1999 edition of Normal Accidents has an Afterword that updates the original version.  Perrow continues to condemn nuclear power but without much new data.  Much of his disapprobation is directed at the petrochemical industry.  He highlights writers who have advanced his ideas and also presents his (dis)agreements with high reliability theory and Vaughn's interpretation of the Challenger accident.

You don't need this book in your library but you do need to be aware that it is a foundation stone for the work of many other authors.

 

*  C. Perrow, Normal Accidents: Living with High-Risk Technologies (Princeton Univ. Press, Princeton, NJ: 1999).

**  For example, see Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off (reviewed here); Woods, Dekker et al, Behind Human Error (reviewed here); and Weick and Sutcliffe, Managing the Unexpected: Resilient Performance in an Age of Uncertainty (reviewed here).  It's ironic that Perrow set out to write a readable book without references to the “sacred texts” (p. 11) but it appears Normal Accidents has become one.

***  Perrow's criteria for catastrophe appear to be: “kill many people, irradiate others, and poison some acres of land.” (p. 348)  While any death is a tragedy, reputable Chernobyl studies report fewer than 100 deaths from radiation and project 4,000 radiation-induced cancers in a population of 600,000 people who were exposed.  The same population is expected to suffer 100,000 cancer deaths from all other causes.  Approximately 40,000 square miles of land was significantly contaminated.  Data from Chernobyl Forum, "Chernobyl's Legacy: Health, Environmental and Socio-Economic Impacts" 2nd rev. ed.  Retrieved Aug. 27, 2013.  Wikipedia, “Chernobyl disaster.”  Retrieved Aug. 27, 2013.

In his 1999 Afterword to Normal Accidents, Perrow mentions Chernobyl in passing and his comments suggest he does not consider it a catastrophe but could have been had the wind blown the radioactive materials over the city of Kiev.

****  A truly complex system can drift into failure (Dekker) or experience incidents from performance excursions outside the safety boundaries (Hollnagel).

*****  It's not just nuclear power, Perrow also supports unilateral nuclear disarmament. (p. 347)

Saturday, July 6, 2013

Behind Human Error by Woods, Dekker, Cook, Johannesen and Sarter

This book* examines how errors occur in complex socio-technical systems.  The authors' thesis is that behind every ascribed “human error” there is a “second story” of the context (conditions, demands, constraints, etc.) created by the system itself.  “That which we label “human error” after the fact is never the cause of an accident.  Rather, it is the cumulative effect of multiple cognitive, collaborative, and organizational factors.” (p. 35)  In other words, “Error is a symptom indicating the need to investigate the larger operational systems and the organizational context in which it functions.” (p. 28)  This post presents a summary of the book followed by our perspective on its value.  (The book has a lot of content so this will not be a short post.)

The Second Story

This section establishes the authors' view of error and how socio-technical systems function.  They describe two mutually exclusive world views: (1) “erratic people degrade an otherwise safe system” vs. (2) “people create safety at all levels of the socio-technical system by learning and adapting . . .” (p. 6)  It should be obvious that the authors favor option 2.

In such a world “Failure, then, represents breakdowns in adaptations directed at coping with complexity.  Indeed, the enemy of safety is not the human: it is complexity.” (p. 1)  “. . . accidents emerge from the coupling and interdependence of modern systems.” (p. 31) 

Adaptation occurs in response to pressures or environmental changes.  For example, systems are under stakeholder pressure to become faster, better, cheaper; multiple goals and goal conflict are regular complex system characteristics.  But adaptation is not always successful.  There may be too little (rules and procedures are followed even though conditions have changed) or too much (adaptation is attempted with insufficient information to achieve goals).  Because of pressure, adaptations evolve toward performance boundaries, in particular, safety boundaries.  There is a drift toward failure. (see Dekker, reviewed here)

The authors present 15 premises for analyzing errors in complex socio-technical systems. (pp. 19-30)  Most are familiar but some are worth highlighting and remembering when thinking about system errors:

  • “There is a loose coupling between process and outcome.”  A “bad” process does not always produce bad outcomes and a “good” process does not always produce good outcomes.
  • “Knowledge of outcome (hindsight) biases judgments about process.”  More about that later.
  • “Lawful factors govern the types of erroneous actions or assessments to be expected.”   In other words, “errors are regular and predictable consequences of a variety of factors.”
  • “The design of artifacts affects the potential for erroneous actions and paths towards disaster.”  This is Human Factors 101 but problems still arise.  “Increased coupling increases the cognitive demands on practitioners.”  Increased coupling plus weak feedback can create a latent failure.

Complex Systems Failure


This section covers traditional mental models used for assessing failures and points out the putative inadequacies of each.  The sequence-of-events (or domino) model is familiar Newtonian causal analysis.  Man-made disaster theory puts company culture and institutional design at the heart of the safety question.  Vulnerability develops over time but is hidden by the organization’s belief that it has risk under control.  A system or component is driven into failure.  The latent failure (or Swiss cheese) model proposes that “disasters are characterized by a concatenation of several small failures and contributing events. . .” (p. 50)  While a practitioner may be closest to an accident, the associated latent failures were created by system managers, designers, maintainers or regulators.  All these models reinforce the search for human error (someone untrained, inattentive or a “bad apple) and the customary fixes (more training, procedure adherence and personal attention, or targeted discipline).  They represent a failure to adopt systems thinking and concepts of dynamics, learning, adaptation and the notion that a system can produce accidents as a natural consequence of its normal functioning.

A more sophisticated set of models is then discussed.  Perrow's normal accident theory says that “accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled.” (p. 61)  Such systems structurally confuse operators and prevent them from recovering when incipient failure is discovered.  People are part of the Perrowian system and can exhibit inadequate expertise.  Control theory sees systems as composed of components that must be kept in dynamic equilibrium based on feedback and continual control inputs—basically a system dynamics view.  Accidents are a result of normal system behavior and occur when components interact to violate safety constraints and the feedback (and control inputs) do not reflect the developing problems.  Small changes in the system can lead to huge consequences elsewhere.  Accident avoidance is based on making system performance boundaries explicit and known although the goal of efficiency will tend to push operations toward the boundaries.  In contrast, the authors would argue for a different focus: making the system more resilient, i.e., error-tolerant.**  High reliability theory describes how how-hazard activities can achieve safe performance through leadership, closed systems, functional decentralization, safety culture, redundancy and systematic learning.  High reliability means minimal variations in performance, which in the short-term, means safe performance but HROs are subject to incidents indicative of residual system noise and unseen changes from social forces, information management or new technologies. (See Weick, reviewed here)

Standing on the shoulders of the above sophisticated models, resilience engineering (RE) is proposed as a better way to think about safety.  According to this model, accidents “represent the breakdowns in the adaptations necessary to cope with the real world complexity. (p. 83)  The authors use the Columbia space shuttle disaster to illustrate patterns of failure evident in complex systems: drift toward failure, past success as reason for continued confidence, fragmented problem-solving, ignoring new evidence and intra-organizational communication breakdowns.  To oppose or compensate for these patterns, RE proposes monitoring or enhancing other system properties including: buffering capacity, flexibility, margin and tolerance (which means replacing quick collapse with graceful degradation).  RE “focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment.” (p. 93)  In practice, that means detecting signs of increasing risk, having resources for safety available, and recognizing when and where to invest to offset risk.  It also requires focusing on organizational decision making, e.g., cross checks for risky decisions, the safety-production-efficiency balance and the reporting and disposition of safety concerns.  “Enhancing error tolerance, detection and recovery together produce safety.” (p. 26)

Operating at the Sharp End

An organization's sharp end is where practitioners apply their expertise in an effort to achieve the organization's goals.  The blunt end is where support functions, from administration to engineering, work.  The blunt end designs the system, the sharp end operates it.  Practitioner performance is affected by cognitive activities in three areas: activation of knowledge, the flow of attention and interactions among multiple goals.

The knowledge available to practitioners arrives as organized content.  Challenges include: organization may be poor, the content may be incomplete or simply wrong.  Practitioner mental models may be inaccurate or incomplete without the practitioners realizing it, i.e., they may be poorly calibrated.  Knowledge may be inert, i.e., not accessed when it is needed.  Oversimplifications (heuristics) may work in some situations but produce errors in others and limit the practitioner's ability to account for uncertainties or conflicts that arise in individual cases.  The discussion of heuristics suggests Hollnagel, reviewed here.

Mindset is about attention and its control.” (p. 114)  Attention is a limited resource.  Problems with maintaining effective attention include loss of situational awareness, in which the practitioner's mental model of events doesn't match the real world, and fixation, where the practitioner's initial assessment of  a situation creates a going-forward bias against accepting discrepant data and a failure to trigger relevant inert knowledge.  Mindset seems similar to HRO mindfulness. (see Weick)

Goal conflict can arise from many sources including management policies, regulatory requirements, economic (cost) factors and risk of legal liability.  Decision making must consider goals (which may be implicit), values, costs and risks—which may be uncertain.  Normalization of deviance is a constant threat.  Decision makers may be held responsible for achieving a goal but lack the authority to do so.  The conflict between cost and safety may be subtle or unrecognized.  “Safety is not a concrete entity and the argument that one should always choose the safest path misrepresents the dilemmas that confront the practitioner.” (p. 139)  “[I]t is difficult for many organizations (particularly in regulated industries) to admit that goal conflicts and tradeoff decisions arise.” (p. 139)  Overall, the authors present a good discussion of goal conflict.

How Design Can Induce Error


The design of computerized devices intended to help practitioners can instead lead to greater risks of errors and incidents.  Specific causes of problems include clumsy automation, limited information visibility and mode errors. 

Automation is supposed to increase user effectiveness and efficiency.  However, clumsy automation creates situations where the user loses track of what the computer is set up to do, what it's doing and what it will do next.  If support systems are so flexible that users can't know all their possible configurations, they adopt simplifying strategies which may be inappropriate in some cases.  Clumsy automation leads to more (instead of less) cognitive work, user attention is diverted to the machine instead of the task, increased potential for new kinds of errors and the need for new user knowledge and judgments.  The machine effectively has its own model of the world, based on user inputs, data sensors and internal functioning, and passes that back to the user.

Machines often hide a mass of data behind a narrow keyhole of visibility into the system.  Successful design creates “a visible conceptual space meaningfully related to activities and constraints in a field of practice.” (p. 162)  In addition, “Effective representations highlight  'operationally interesting' changes for sequences of behavior . . .” (p. 167)  However, default displays typically do not make interesting events directly visible.

Mode errors occurs when an operator initiates an action that would be appropriate if the machine were in mode A but, in fact, it's in mode B.  (This may be a man-machine problem but it's not the machine's fault.)  A machine can change modes based on situational and system factors in addition to operator input.  Operators have to maintain mode awareness, not an easy task when viewing a small, cluttered display that may not highlight current mode or mode changes.

To cope with bad design “practitioners adapt information technology provided for them to the immediate tasks at hand in a locally pragmatic way, . . .” (p. 191)  They use system tailoring where they adapt the device, often by focusing on a feature set they consider useful and ignoring other machine capabilities.  They use task tailoring where they adapt strategies to accommodate constraints imposed by the new technology.  Both types of adaptation can lead to success or eventual failures. 

The authors suggest various countermeasures and design changes to address these problems. 

Reactions to Failure

Different approaches for analyzing accidents lead to different perspectives on human error. 

Hindsight bias is “the tendency for people to 'consistently exaggerate what could have been anticipated in foresight.'” (p. 15)  It reinforces the tendency to look for the human in the human error.  Operators are blamed for bad outcomes because they are available, tracking back to multiple contributing causes is difficult, most system performance is good and investigators tend to judge process quality by its outcome.  Outsiders tend to think operators knew more about their situation than they actually did.  Evaluating process instead of outcome is also problematic.  Process and outcome are loosely coupled and what standards should be used for process evaluation?  Formal work descriptions “underestimate the dilemmas, interactions between constraints, goal conflicts, and tradeoffs present in the actual workplace.” (p. 208)  A suggested alternative approach is to ask what other practitioners would have done in the same situation and build a set of contrast cases.  “What we should not do, . . . is rely on putatively objective external evaluations . . . such as . . . court cases or other formal hearings.  Such processes in fact institutionalize and legitimate the hindsight bias . . . leading to blame and a focus on individual actors at the expense of a system view.” (pp. 213-214)

Distancing through differencing is another risk.  In this practice, reviewers focus on differences between the context surrounding an accident and their own circumstance.  Blaming individuals reinforces belief that there are no lessons to be learned for other organizations.  If human error is local and individual (as opposed to systemic) then sanctions, exhortations to follow the procedures and remedial training are sufficient fixes.  There is a decent discussion of TMI here, where, in the authors' opinion, the initial sense of fundamental surprise and need for socio-technical fixes was soon replaced by a search for local, technologically-focused solutions.
      
There is often pressure to hold people accountable after incidents or accidents.  One answer is a “just culture” which views incidents as system learning opportunities but also draws a line between acceptable and unacceptable behavior.  Since the “line” is an attribution the key question for any organization is who gets to draw it.  Another challenge is defining the discretionary space where individuals alone have the authority to decide how to proceed.  There is more on just culture but this is all (or mostly) Dekker. (see our Just Culture commentary here)

The authors' recommendations for analyzing errors and improving safety can be summed up as follows: recognize that human error is an attribution; pursue second stories that reveal the multiple, systemic contributors to failure; avoid hindsight bias; understand how work really gets done; search for systemic vulnerabilities; study how practice creates safety; search for underlying patterns; examine how change will produce new vulnerabilities; use technology to enhance human expertise; and tame complexity. (p. 239)  “Safety is created at the sharp end as practitioners interact with hazardous processes . . . using the available tools and resources.” (p. 243)

Our Perspective

This is a book about organizational characteristics and socio-technical systems.  Recommendations and advice are aimed at organizational policy makers and incident investigators.  The discussion of a “just culture” is the only time culture is discussed in detail although safety culture is mentioned in passing in the HRO write-up.

Our first problem with the book is repeatedly referring to medicine, aviation, aircraft carrier operations and nuclear power plants as complex systems.***  Although medicine is definitely complex and aviation (including air traffic control) possibly is, carrier operations and nuclear power plants are simply complicated.  While carrier and nuclear personnel have to make some adaptations on the fly, they do not face sudden, disruptive changes in their technologies or operating environments and they are not exposed to cutthroat competition.  Their operations are tightly coordinated but, where possible, by design more loosely coupled to facilitate recovery if operations start to go sour.  In addition, calling nuclear power operations complex perpetuates the myth that nuclear is “unique and special” and thus merits some special place in the pantheon of industry.  It isn't and it doesn't.

Our second problem relates to the authors' recasting of the nature of human error.  We decry the rush to judgment after negative events, particularly a search limited to identifying culpable humans.  The search for bad apples or outright criminals satisfies society's perceived need to bring someone to justice and the corporate system's desire to appear to fix things through management exhortations and training without really admitting systemic problems or changing anything substantive, e.g., the management incentive plan.  The authors' plea for more systemic analysis is thus welcome.

But they push the pendulum too far in the opposite direction.  They appear to advocate replacing all human errors (except for gross negligence, willful violations or sabotage) with systemic explanations, aka rationalizations.  What is never mentioned is that medical errors lead to tens of thousands of preventable deaths per year.****  In contrast, U.S. commercial aviation has not experienced over a hundred fatalities (excluding 9/11) since 1996; carriers and nuclear power plants experience accidents, but there are few fatalities.  At worst, this book is a denial that real human errors (including bad decisions, slip ups, impairments, coverups) occur and a rationalization of medical mistakes caused by arrogance, incompetence, class structure and lack of accountability.

This is a dense book, 250 pages of small print, with an index that is nearly useless.  Pressures (most likely cost and schedule) have apparently pushed publishing to the system boundary for copy editing—there are extra, missing and wrong words throughout the text.

This 2010 second edition updates the original 1994 monograph.  Many of the original ideas have been fleshed out elsewhere by the authors (primarily Dekker) and others.  Some references, e.g., Hollnagel, Perrow and the HRO school, should be read in their original form. 


*  D.D. Woods, S. Dekker, R. Cook, L. Johannesen and N. Sarter, Behind Human Error, 2d ed.  (Ashgate, Burlington, VT: 2010).  Thanks to Bill Mullins for bringing this book to our attention.

**  There is considerable overlap of the perspectives of the authors and the control theorists (Leveson and Rasmussen are cited in the book).  As an aside, Dekker was a dissertation advisor for one of Leveson's MIT students.

***  The authors' different backgrounds contribute to this mash-up.  Cook is a physician, Dekker is a pilot and some of Woods' cited publications refer to nuclear power (and aviation).

****  M. Makary, “How to Stop Hospitals From Killing Us,” Wall Street Journal online (Sept. 21, 2012).  Retrieved July 4, 2013.

Friday, May 3, 2013

High Reliability Organizations and Safety Culture

On February 10th, we posted about a report covering lessons for safety culture (SC) that can be gleaned from the social science literature. The report's authors judged that high reliability organization (HRO) literature provided a solid basis for linking individual and organizational assumptions with traits and practices that can affect safety performance. This post explores HRO characteristics and how they can influence SC.

Our source is Managing the Unexpected: Resilient Performance in an Age of Uncertainty* by Karl Weick and Kathleen Sutcliffe. Weick is a leading contemporary HRO scholar. This book is clearly written, with many pithy comments, so lots of quotations are included below to present the authors' views in their own words.

What makes an HRO different?

Many organizations work with risky technologies where the consequences of problems or errors can be catastrophic, use complex management systems and exist in demanding environments. But successful HROs approach their work with a different attitude and practices, an “ongoing mindfulness embedded in practices that enact alertness, broaden attention, reduce distractions, and forestall misleading simplifications.” (p. 3)

Mindfulness

An underlying assumption of HROs is “that gradual . . . development of unexpected events sends weak signals . . . along the way” (p. 63) so constant attention is required. Mindfulness means that “when people act, they are aware of context, of ways in which details differ . . . and of deviations from their expectations.” (p. 32) HROs “maintain continuing alertness to the unexpected in the face of pressure to take cognitive shortcuts.” (p. 19) Mindful organizations “notice the unexpected in the making, halt it or contain it, and restore system functioning.” (p. 21)

It takes a lot of energy to maintain mindfulness. As the authors warn us, “mindful processes unravel pretty fast.” (p. 106) Complacency and hubris are two omnipresent dangers. “Success narrows perceptions, . . . breeds overconfidence . . . and reduces acceptance of opposing points of view. . . . [If] people assume that success demonstrates competence, they are more likely to drift into complacency, . . .” (p. 52) Pressure in the task environment is another potential problem. “As pressure increases, people are more likely to search for confirming information and to ignore information that is inconsistent with their expectations.” (p. 26) The opposite of mindfulness is mindlessness. “Instances of mindlessness occur when people confront weak stimuli, powerful expectations, and strong desires to see what they expect to see.” (p. 88)

Mindfulness can lead to insight and knowledge. “In that brief interval between surprise and successful normalizing lies one of your few opportunities to discover what you don't know.” (p. 31)**

Five principles

HROs follow five principles. The first three cover anticipation of problems and the remaining two cover containment of problems that do arise.

Preoccupation with failure

HROs “treat any lapse as a symptom that something may be wrong with the system, something that could have severe consequences if several separate small errors happened to coincide. . . . they are wary of the potential liabilities of success, including complacency, the temptation to reduce margins of safety, and the drift into automatic processing.” (p. 9)

Managers usually think surprises are bad, evidence of bad planning. However, “Feelings of surprise are diagnostic because they are a solid cue that one's model of the world is flawed.” (p. 104) HROs “Interpret a near miss as danger in the guise of safety rather than safety in the guise of danger. . . . No news is bad news. All news is good news, because it means that the system is responding.” (p. 152)

People in HROs “have a good sense of what needs to go right and a clearer understanding of the factors that might signal that things are unraveling.” (p. 86)

Reluctance to simplify

HROs “welcome diverse experience, skepticism toward received wisdom, and negotiating tactics that reconcile differences of opinion without destroying the nuances that diverse people detect. . . . [They worry that] superficial similarities between the present and the past mask deeper differences that could prove fatal.” (p. 10) “Skepticism thus counteracts complacency . . . .” (p. 155) “Unfortunately, diverse views tend to be disproportionately distributed toward the bottom of the organization, . . .” (p. 95)

The language people use at work can be a catalyst for simplification. A person may initially perceive something different in the environment but using familiar or standard terms to communicate the experience can raise the risk of losing the early warnings the person perceived.

Sensitivity to operations

HROs “are attentive to the front line, . . . Anomalies are noticed while they are still tractable and can still be isolated . . . . People who refuse to speak up out of fear undermine the system, which knows less than it needs to know to work effectively.” (pp. 12-13) “Being sensitive to operations is a unique way to correct failures of foresight.” (p. 97)

In our experience, nuclear plants are generally good in this regard; most include a focus on operations among their critical success factors.

Commitment to resilience

“HROs develop capabilities to detect, contain, and bounce back from those inevitable errors that are part of an indeterminate world.” (p. 14) “. . . environments that HROs face are typically more complex than the HRO systems themselves. Reliability and resilience lie in practices that reduce . . . environmental complexity or increase system complexity.” (p. 113) Because it's difficult or impossible to reduce environmental complexity, the organization needs to makes its systems more complex.*** This requires clear thinking and insightful analysis. Unfortunately, actual organizational response to disturbances can fall short. “. . . systems often respond to a disturbance with new rules and new prohibitions designed to present the same disruption from happening in the future. This response reduces flexibility to deal with subsequent unpredictable changes.” (p. 72)

Deference to expertise.

“Decisions are made on the front line, and authority migrates to the people with the most expertise, regardless of their rank.” (p. 15) Application of expertise “emerges from a collective, cultural belief that the necessary capabilities lie somewhere in the system and that migrating problems [down or up] will find them.” (p. 80) “When tasks are highly interdependent and time is compressed, decisions migrate down . . . Decisions migrate up when events are unique, have potential for very serious consequences, or have political or career ramifications . . .” (p. 100)

This is another ideal that can fail in practice. We've all seen decisions made by the highest ranking person rather than the most qualified one. In other words, “who is right” can trump “what is right.”

Relationship to safety culture

Much of the chapter on culture is based on the ideas of Schein and Reason so we'll focus on key points emphasized by Weick and Sutcliffe. In their view, “culture is something an organization has [practices and controls] that eventually becomes something an organization is [beliefs, attitudes, values].” (p. 114, emphasis added)

“Culture consists of characteristic ways of knowing and sensemaking. . . . Culture is about practices—practices of expecting, managing disconfirmations, sensemaking, learning, and recovering.” (pp. 119-120) A single organization can have different types of culture: an integrative culture that everyone shares, differentiated cultures that are particular to sub-groups and fragmented cultures that describe individuals who don't fit into the first two types. Multiple cultures support the development of more varied responses to nascent problems.

A complete culture strives to be mindful, safe and informed with an emphasis on wariness. As HRO principles are ingrained in an organization, they become part of the culture. The goal is a strong SC that reinforces concern about the unexpected, is open to questions and reporting of failures, views close calls as a failure, is fearful of complacency, resists simplifications, values diversity of opinions and focuses on imperfections in operations.

What else is in the book?

One chapter contains a series of audits (presented as survey questions) to assess an organization's mindfulness and appreciation of the five principles. The audits can show an organization's attitudes and capabilities relative to HROs and relative to its own self-image and goals.

The final chapter describes possible “small wins” a change agent (often an individual) can attempt to achieve in an effort to move his organization more in line with HRO practices, viz., mindfulness and the five principles. For example, “take your team to the actual site where an unexpected event was handled either well or poorly, walk everyone through the decision making that was involved, and reflect on how to handle that event more mindfully.” (p. 144)

The book's case studies include an aircraft carrier, a nuclear power plant,**** a pediatric surgery center and wildland firefighting.

Our perspective

Weick and Sutcliffe draw on the work of many other scholars, including Constance Perin, Charles Perrow, James Reason and Diane Vaughan, all of whom we have discussed in this blog. The book makes many good points. For example, the prescription for mindfulness and the five principles can contribute to an effective context for decision making although it does not comprise a complete management system. The authors' recognize that reliability does not mean a complete lack of performance variation, instead reliability follows from practices that recognize and contain emerging problems. Finally, there is evidence of a systems view, which we espouse, when the authors say “It is this network of relationships taken together—not necessarily any one individual or organization in the group—that can also maintain the big picture of operations . . .” (p. 142)

The authors would have us focus on nascent problems in operations, which is obviously necessary. But another important question is what are the faint signals that the SC is developing problems? What are the precursors to the obvious signs, like increasing backlogs of safety-related work? Could that “human error” that recently occurred be a sign of a SC that is more forgiving of growing organizational mindlessness?

Bottom line: Safetymatters says check out Managing the Unexpected and consider adding it to your library.


* K.E. Weick and K.M. Sutcliffe, Managing the Unexpected: Resilient Performance in an Age of Uncertainty, 2d ed. (San Francisco, CA: Jossey-Bass, 2007). Also, Wikipedia has a very readable summary of HRO history and characteristics.

** More on normalization and rationalization: “On the actual day of battle naked truths may be picked up for the asking. But by the following morning they have already begun to get into their uniforms.” E.A. Cohen and J. Gooch, Military Misfortunes: The Anatomy of Failure in War (New York: Vintage Books, 1990), p. 44, quoted in Managing the Unexpected, p. 31.

*** The prescription to increase system complexity to match the environment is based on the system design principle of requisite variety which means “if you want to cope successfully with a wide variety of inputs, you need a wide variety of responses.” (p. 113)

**** I don't think the authors performed any original research on nuclear plants. But the studies they reviewed led them to conclude that “The primary threat to operations in nuclear plants is the engineering culture, which places a higher value on knowledge that is quantitative, measurable, hard, objective, and formal . . . HROs refuse to draw a hard line between knowledge that is quantitative and knowledge that is qualitative.” (p. 60)