Showing posts with label Just Culture. Show all posts

Monday, November 11, 2013

Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson

In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory.  The book presents the criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics.  It should be noted the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.

Part I
     
Part I sets the stage for a new safety paradigm.  Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6)   Traditional accident analysis techniques are no longer sufficient.  They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10)  Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).**   We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.

Event-based models are simply inadequate.  There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it.  The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24)  Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19)  In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)

Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors.  “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination.  When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent.  This dependence may be related to common systemic factors that do not appear in an event chain.  Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35)  “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36) 
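
The independence point is easy to see with a little arithmetic.  The sketch below is ours, not Leveson's: the probabilities are made up, and the "shared systemic factor" (think schedule pressure or a weak safety culture) is a hypothetical stand-in for the common causes that do not appear in an event chain.  It compares the naive product of two barrier-failure probabilities with the joint probability when both barriers are degraded by the same factor.

```python
# Illustrative numbers only; none of these values come from Leveson's book.
P_STRESS = 0.01          # chance the shared systemic factor is present
P_FAIL_STRESSED = 0.5    # chance each barrier fails when the factor is present
P_FAIL_NORMAL = 0.001    # chance each barrier fails when the factor is absent


def marginal_failure() -> float:
    """Probability that a single barrier fails, averaged over both conditions."""
    return P_STRESS * P_FAIL_STRESSED + (1 - P_STRESS) * P_FAIL_NORMAL


def joint_assuming_independence() -> float:
    """Naive estimate: multiply the two marginal failure probabilities."""
    return marginal_failure() ** 2


def joint_with_common_cause() -> float:
    """Barriers are independent only once the shared factor is held fixed."""
    return (P_STRESS * P_FAIL_STRESSED ** 2
            + (1 - P_STRESS) * P_FAIL_NORMAL ** 2)


naive = joint_assuming_independence()
actual = joint_with_common_cause()
print(f"naive: {naive:.2e}  with common cause: {actual:.2e}  ratio: {actual / naive:.0f}x")
```

With these assumed inputs the naive product understates the joint failure probability by well over an order of magnitude, even though each barrier looks highly reliable on its own.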

The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38)  In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46) 

Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51)  Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52)  This is reminiscent of  Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.

When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56)  In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64)  Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60) 

Part II

Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes).  For our purposes we don't need to spend much space on this material.  “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”***   It attempts to achieve the goals listed at the end of Part I.

STAMP treats safety in a system as a control problem, not a reliability one.  Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76)  Controls may be physical or social, including cultural.  There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87)  “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)
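
To make the control framing concrete, here is a toy sketch of our own; it is not STAMP and nothing in it comes from the book.  It assumes a hypothetical process whose load creeps up by one unit per step under production pressure, a hypothetical safety constraint (load must not exceed a limit), and a controller that can only act on feedback that may be several steps old.

```python
LIMIT = 10  # hypothetical safety constraint: load must stay at or below this


def count_violations(lag: int, steps: int = 40) -> int:
    """Count the steps on which the constraint is violated for a given feedback lag."""
    load = 0
    history = [load]  # end-of-step loads; the controller's only window on the process
    violations = 0
    for _ in range(steps):
        load += 1  # production pressure nudges the process toward the boundary
        if lag == 0:
            observed = load
        else:
            # The controller sees the load as it stood `lag` steps ago.
            observed = history[-lag] if lag <= len(history) else history[0]
        if observed >= LIMIT:
            load -= 2  # control action that enforces the constraint
        if load > LIMIT:
            violations += 1
        history.append(load)
    return violations


for lag in (0, 3):
    print(f"feedback lag {lag}: {count_violations(lag)} violating steps")
```

No component "fails" in this sketch.  With current feedback the controller holds the constraint; with stale feedback the same controller lets the process cross the boundary before it reacts, which is the kind of inadequate control the model is concerned with.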

Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system.  There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.

Part III

Part III describes in detail how STAMP can be applied.  There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section.  Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of  STAMP over traditional accident analysis techniques. 

We will focus on chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.

Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421)  Two new components need a bit of explanation: a safety control structure and controls on system migration toward higher risk.  The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40)  One of the control structure's responsibilities is to guard against that migration: “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26)  Such an approach should be based on the organization's comprehensive hazards analysis.****

The safety culture discussion is also familiar. (pp. 426-33)  Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust.  In such a culture, Leveson says an early warning system for migration toward states of high risk can be established.  A section on Just Culture is taken directly from Dekker's work.  The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.

Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.

Our Perspective
 

Overall, we like this book.  It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial.  The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog.  The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.
 

However, there are only a few direct references to nuclear.  The examples in the book are mostly from aerospace, aviation, maritime activities and the military.  Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory),  shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs.  Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428)  She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7)  In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective.  TMI is used to illustrate specific system design/operation problems.
 

We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists.  First, she accuses HRO of confusing reliability with safety; in her view, an unsafe system can function very reliably. (pp. 7, 12)  But I'm not aware of any HRO work that has been done in an organization that is patently unsafe.  HRO asserts that reliability follows from practices that recognize and contain emerging problems.  She takes another swipe at HRO when she says HRO suggests that, during crises, decision making migrates to frontline workers.  Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44)  Her position may be correct in some cases but as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could.  Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.”  In her view, “Telling managers and operators to be 'mindful of weak signals' simply creates a pretext for blame after a loss event occurs.” (p. 410)  I don't think it's a pretext but it is challenging to maintain mindfulness and sense faint signals.  Overall, this appears to be academic posturing and feather fluffing.
 

We offer no opinion on the efficacy of using Leveson's STAMP approach.  She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)
 

In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii)  You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.


*  N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (The MIT Press, Cambridge, MA: 2011)  The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed.  All quotes in this post were retyped from the original text.


**  We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier.  These authors are all aware of others' publications and contributions.  Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text. 

***  Nancy Leveson informal bio page.


****  “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157)  The hazards analysis identifies all major hazards the system may confront.  Baseline safety requirements follow from the hazards analysis.  Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk.  The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)

Saturday, July 6, 2013

Behind Human Error by Woods, Dekker, Cook, Johannesen and Sarter

This book* examines how errors occur in complex socio-technical systems.  The authors' thesis is that behind every ascribed “human error” there is a “second story” of the context (conditions, demands, constraints, etc.) created by the system itself.  “That which we label “human error” after the fact is never the cause of an accident.  Rather, it is the cumulative effect of multiple cognitive, collaborative, and organizational factors.” (p. 35)  In other words, “Error is a symptom indicating the need to investigate the larger operational systems and the organizational context in which it functions.” (p. 28)  This post presents a summary of the book followed by our perspective on its value.  (The book has a lot of content so this will not be a short post.)

The Second Story

This section establishes the authors' view of error and how socio-technical systems function.  They describe two mutually exclusive world views: (1) “erratic people degrade an otherwise safe system” vs. (2) “people create safety at all levels of the socio-technical system by learning and adapting . . .” (p. 6)  It should be obvious that the authors favor option 2.

In such a world “Failure, then, represents breakdowns in adaptations directed at coping with complexity.  Indeed, the enemy of safety is not the human: it is complexity.” (p. 1)  “. . . accidents emerge from the coupling and interdependence of modern systems.” (p. 31) 

Adaptation occurs in response to pressures or environmental changes.  For example, systems are under stakeholder pressure to become faster, better, cheaper; multiple goals and goal conflict are regular complex system characteristics.  But adaptation is not always successful.  There may be too little (rules and procedures are followed even though conditions have changed) or too much (adaptation is attempted with insufficient information to achieve goals).  Because of pressure, adaptations evolve toward performance boundaries, in particular, safety boundaries.  There is a drift toward failure. (see Dekker, reviewed here)

The authors present 15 premises for analyzing errors in complex socio-technical systems. (pp. 19-30)  Most are familiar but some are worth highlighting and remembering when thinking about system errors:

  • “There is a loose coupling between process and outcome.”  A “bad” process does not always produce bad outcomes and a “good” process does not always produce good outcomes.
  • “Knowledge of outcome (hindsight) biases judgments about process.”  More about that later.
  • “Lawful factors govern the types of erroneous actions or assessments to be expected.”   In other words, “errors are regular and predictable consequences of a variety of factors.”
  • “The design of artifacts affects the potential for erroneous actions and paths towards disaster.”  This is Human Factors 101 but problems still arise.  “Increased coupling increases the cognitive demands on practitioners.”  Increased coupling plus weak feedback can create a latent failure.

Complex Systems Failure


This section covers traditional mental models used for assessing failures and points out the putative inadequacies of each.  The sequence-of-events (or domino) model is familiar Newtonian causal analysis.  Man-made disaster theory puts company culture and institutional design at the heart of the safety question.  Vulnerability develops over time but is hidden by the organization’s belief that it has risk under control.  A system or component is driven into failure.  The latent failure (or Swiss cheese) model proposes that “disasters are characterized by a concatenation of several small failures and contributing events. . .” (p. 50)  While a practitioner may be closest to an accident, the associated latent failures were created by system managers, designers, maintainers or regulators.  All these models reinforce the search for human error (someone untrained, inattentive or a “bad apple”) and the customary fixes (more training, procedure adherence and personal attention, or targeted discipline).  They represent a failure to adopt systems thinking and concepts of dynamics, learning, adaptation and the notion that a system can produce accidents as a natural consequence of its normal functioning.

A more sophisticated set of models is then discussed.  Perrow's normal accident theory says that “accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled.” (p. 61)  Such systems structurally confuse operators and prevent them from recovering when incipient failure is discovered.  People are part of the Perrowian system and can exhibit inadequate expertise.  Control theory sees systems as composed of components that must be kept in dynamic equilibrium based on feedback and continual control inputs, basically a system dynamics view.  Accidents are a result of normal system behavior and occur when components interact to violate safety constraints and the feedback (and control inputs) do not reflect the developing problems.  Small changes in the system can lead to huge consequences elsewhere.  Accident avoidance is based on making system performance boundaries explicit and known although the goal of efficiency will tend to push operations toward the boundaries.  In contrast, the authors would argue for a different focus: making the system more resilient, i.e., error-tolerant.**  High reliability theory describes how high-hazard activities can achieve safe performance through leadership, closed systems, functional decentralization, safety culture, redundancy and systematic learning.  High reliability means minimal variations in performance, which, in the short term, means safe performance but HROs are subject to incidents indicative of residual system noise and unseen changes from social forces, information management or new technologies. (See Weick, reviewed here)

Standing on the shoulders of the above sophisticated models, resilience engineering (RE) is proposed as a better way to think about safety.  According to this model, accidents “represent the breakdowns in the adaptations necessary to cope with the real world complexity.” (p. 83)  The authors use the Columbia space shuttle disaster to illustrate patterns of failure evident in complex systems: drift toward failure, past success as reason for continued confidence, fragmented problem-solving, ignoring new evidence and intra-organizational communication breakdowns.  To oppose or compensate for these patterns, RE proposes monitoring or enhancing other system properties including: buffering capacity, flexibility, margin and tolerance (which means replacing quick collapse with graceful degradation).  RE “focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment.” (p. 93)  In practice, that means detecting signs of increasing risk, having resources for safety available, and recognizing when and where to invest to offset risk.  It also requires focusing on organizational decision making, e.g., cross checks for risky decisions, the safety-production-efficiency balance and the reporting and disposition of safety concerns.  “Enhancing error tolerance, detection and recovery together produce safety.” (p. 26)

Operating at the Sharp End

An organization's sharp end is where practitioners apply their expertise in an effort to achieve the organization's goals.  The blunt end is where support functions, from administration to engineering, work.  The blunt end designs the system, the sharp end operates it.  Practitioner performance is affected by cognitive activities in three areas: activation of knowledge, the flow of attention and interactions among multiple goals.

The knowledge available to practitioners arrives as organized content.  Challenges include: organization may be poor, the content may be incomplete or simply wrong.  Practitioner mental models may be inaccurate or incomplete without the practitioners realizing it, i.e., they may be poorly calibrated.  Knowledge may be inert, i.e., not accessed when it is needed.  Oversimplifications (heuristics) may work in some situations but produce errors in others and limit the practitioner's ability to account for uncertainties or conflicts that arise in individual cases.  The discussion of heuristics suggests Hollnagel, reviewed here.

“Mindset is about attention and its control.” (p. 114)  Attention is a limited resource.  Problems with maintaining effective attention include loss of situational awareness, in which the practitioner's mental model of events doesn't match the real world, and fixation, where the practitioner's initial assessment of a situation creates a going-forward bias against accepting discrepant data and a failure to trigger relevant inert knowledge.  Mindset seems similar to HRO mindfulness. (see Weick)

Goal conflict can arise from many sources including management policies, regulatory requirements, economic (cost) factors and risk of legal liability.  Decision making must consider goals (which may be implicit), values, costs and risks—which may be uncertain.  Normalization of deviance is a constant threat.  Decision makers may be held responsible for achieving a goal but lack the authority to do so.  The conflict between cost and safety may be subtle or unrecognized.  “Safety is not a concrete entity and the argument that one should always choose the safest path misrepresents the dilemmas that confront the practitioner.” (p. 139)  “[I]t is difficult for many organizations (particularly in regulated industries) to admit that goal conflicts and tradeoff decisions arise.” (p. 139)  Overall, the authors present a good discussion of goal conflict.

How Design Can Induce Error


The design of computerized devices intended to help practitioners can instead lead to greater risks of errors and incidents.  Specific causes of problems include clumsy automation, limited information visibility and mode errors. 

Automation is supposed to increase user effectiveness and efficiency.  However, clumsy automation creates situations where the user loses track of what the computer is set up to do, what it's doing and what it will do next.  If support systems are so flexible that users can't know all their possible configurations, they adopt simplifying strategies which may be inappropriate in some cases.  Clumsy automation leads to more (instead of less) cognitive work, diversion of user attention to the machine instead of the task, increased potential for new kinds of errors and the need for new user knowledge and judgments.  The machine effectively has its own model of the world, based on user inputs, data sensors and internal functioning, and passes that back to the user.

Machines often hide a mass of data behind a narrow keyhole of visibility into the system.  Successful design creates “a visible conceptual space meaningfully related to activities and constraints in a field of practice.” (p. 162)  In addition, “Effective representations highlight  'operationally interesting' changes for sequences of behavior . . .” (p. 167)  However, default displays typically do not make interesting events directly visible.

Mode errors occur when an operator initiates an action that would be appropriate if the machine were in mode A but, in fact, it's in mode B.  (This may be a man-machine problem but it's not the machine's fault.)  A machine can change modes based on situational and system factors in addition to operator input.  Operators have to maintain mode awareness, not an easy task when viewing a small, cluttered display that may not highlight current mode or mode changes.
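
A minimal sketch may help readers who have not met the idea before.  The device, its two modes, the button and the sensor threshold below are all hypothetical; they are our illustration, not an example from the book.  The point is only that the same operator action has different effects depending on a mode the machine changed on its own.

```python
from enum import Enum


class Mode(Enum):
    MANUAL = "manual"
    AUTO = "auto"


class Device:
    """Hypothetical device whose single button means different things per mode."""

    def __init__(self) -> None:
        self.mode = Mode.MANUAL
        self.setpoint = 50

    def sensor_update(self, reading: int) -> None:
        # The machine switches mode on its own, without any operator input.
        if reading > 100:
            self.mode = Mode.AUTO

    def press_up(self) -> str:
        # Same operator action, different effect depending on the current mode.
        if self.mode is Mode.MANUAL:
            self.setpoint += 1
            return "setpoint raised"
        self.mode = Mode.MANUAL
        return "automation disengaged"


device = Device()
believed_mode = device.mode      # the operator's mental model: MANUAL
device.sensor_update(120)        # silent mode change to AUTO
result = device.press_up()       # the operator intends to raise the setpoint
mode_error = result != "setpoint raised"
print(f"believed {believed_mode.value}, got: {result} (mode error: {mode_error})")
```

Nothing in the sketch tells the operator that the mode changed, which is the display problem described above; a design that announced the change would give the operator a chance to update the mental model before acting.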

To cope with bad design “practitioners adapt information technology provided for them to the immediate tasks at hand in a locally pragmatic way, . . .” (p. 191)  They use system tailoring where they adapt the device, often by focusing on a feature set they consider useful and ignoring other machine capabilities.  They use task tailoring where they adapt strategies to accommodate constraints imposed by the new technology.  Both types of adaptation can lead to success or eventual failures. 

The authors suggest various countermeasures and design changes to address these problems. 

Reactions to Failure

Different approaches for analyzing accidents lead to different perspectives on human error. 

Hindsight bias is “the tendency for people to 'consistently exaggerate what could have been anticipated in foresight.'” (p. 15)  It reinforces the tendency to look for the human in the human error.  Operators are blamed for bad outcomes because they are available, tracking back to multiple contributing causes is difficult, most system performance is good and investigators tend to judge process quality by its outcome.  Outsiders tend to think operators knew more about their situation than they actually did.  Evaluating process instead of outcome is also problematic.  Process and outcome are loosely coupled and what standards should be used for process evaluation?  Formal work descriptions “underestimate the dilemmas, interactions between constraints, goal conflicts, and tradeoffs present in the actual workplace.” (p. 208)  A suggested alternative approach is to ask what other practitioners would have done in the same situation and build a set of contrast cases.  “What we should not do, . . . is rely on putatively objective external evaluations . . . such as . . . court cases or other formal hearings.  Such processes in fact institutionalize and legitimate the hindsight bias . . . leading to blame and a focus on individual actors at the expense of a system view.” (pp. 213-214)

Distancing through differencing is another risk.  In this practice, reviewers focus on differences between the context surrounding an accident and their own circumstance.  Blaming individuals reinforces belief that there are no lessons to be learned for other organizations.  If human error is local and individual (as opposed to systemic) then sanctions, exhortations to follow the procedures and remedial training are sufficient fixes.  There is a decent discussion of TMI here, where, in the authors' opinion, the initial sense of fundamental surprise and need for socio-technical fixes was soon replaced by a search for local, technologically-focused solutions.
      
There is often pressure to hold people accountable after incidents or accidents.  One answer is a “just culture” which views incidents as system learning opportunities but also draws a line between acceptable and unacceptable behavior.  Since the “line” is an attribution the key question for any organization is who gets to draw it.  Another challenge is defining the discretionary space where individuals alone have the authority to decide how to proceed.  There is more on just culture but this is all (or mostly) Dekker. (see our Just Culture commentary here)

The authors' recommendations for analyzing errors and improving safety can be summed up as follows: recognize that human error is an attribution; pursue second stories that reveal the multiple, systemic contributors to failure; avoid hindsight bias; understand how work really gets done; search for systemic vulnerabilities; study how practice creates safety; search for underlying patterns; examine how change will produce new vulnerabilities; use technology to enhance human expertise; and tame complexity. (p. 239)  “Safety is created at the sharp end as practitioners interact with hazardous processes . . . using the available tools and resources.” (p. 243)

Our Perspective

This is a book about organizational characteristics and socio-technical systems.  Recommendations and advice are aimed at organizational policy makers and incident investigators.  The discussion of a “just culture” is the only time culture is discussed in detail although safety culture is mentioned in passing in the HRO write-up.

Our first problem with the book is repeatedly referring to medicine, aviation, aircraft carrier operations and nuclear power plants as complex systems.***  Although medicine is definitely complex and aviation (including air traffic control) possibly is, carrier operations and nuclear power plants are simply complicated.  While carrier and nuclear personnel have to make some adaptations on the fly, they do not face sudden, disruptive changes in their technologies or operating environments and they are not exposed to cutthroat competition.  Their operations are tightly coordinated but, where possible, by design more loosely coupled to facilitate recovery if operations start to go sour.  In addition, calling nuclear power operations complex perpetuates the myth that nuclear is “unique and special” and thus merits some special place in the pantheon of industry.  It isn't and it doesn't.

Our second problem relates to the authors' recasting of the nature of human error.  We decry the rush to judgment after negative events, particularly a search limited to identifying culpable humans.  The search for bad apples or outright criminals satisfies society's perceived need to bring someone to justice and the corporate system's desire to appear to fix things through management exhortations and training without really admitting systemic problems or changing anything substantive, e.g., the management incentive plan.  The authors' plea for more systemic analysis is thus welcome.

But they push the pendulum too far in the opposite direction.  They appear to advocate replacing all human errors (except for gross negligence, willful violations or sabotage) with systemic explanations, aka rationalizations.  What is never mentioned is that medical errors lead to tens of thousands of preventable deaths per year.****  In contrast, U.S. commercial aviation has not experienced over a hundred fatalities (excluding 9/11) since 1996; carriers and nuclear power plants experience accidents, but there are few fatalities.  At worst, this book is a denial that real human errors (including bad decisions, slip ups, impairments, coverups) occur and a rationalization of medical mistakes caused by arrogance, incompetence, class structure and lack of accountability.

This is a dense book, 250 pages of small print, with an index that is nearly useless.  Pressures (most likely cost and schedule) have apparently pushed publishing to the system boundary for copy editing—there are extra, missing and wrong words throughout the text.

This 2010 second edition updates the original 1994 monograph.  Many of the original ideas have been fleshed out elsewhere by the authors (primarily Dekker) and others.  Some references, e.g., Hollnagel, Perrow and the HRO school, should be read in their original form. 


*  D.D. Woods, S. Dekker, R. Cook, L. Johannesen and N. Sarter, Behind Human Error, 2d ed.  (Ashgate, Burlington, VT: 2010).  Thanks to Bill Mullins for bringing this book to our attention.

**  There is considerable overlap of the perspectives of the authors and the control theorists (Leveson and Rasmussen are cited in the book).  As an aside, Dekker was a dissertation advisor for one of Leveson's MIT students.

***  The authors' different backgrounds contribute to this mash-up.  Cook is a physician, Dekker is a pilot and some of Woods' cited publications refer to nuclear power (and aviation).

****  M. Makary, “How to Stop Hospitals From Killing Us,” Wall Street Journal online (Sept. 21, 2012).  Retrieved July 4, 2013.

Wednesday, December 5, 2012

Drift Into Failure by Sidney Dekker

Sidney Dekker's Drift Into Failure* is a noteworthy effort to provide new insights into how accidents and other bad outcomes occur in large organizations. He begins by describing two competing world views, the essentially mechanical view of the world spawned by Newton and Descartes (among others), and a view based on complexity in socio-technical organizations and a systems approach. He shows how each world view biases the search for the “truth” behind how accidents and incidents occur.

Newtonian-Cartesian (N-C) Vision

Isaac Newton and René Descartes were leading thinkers during the dawn of the Age of Reason. Newton used the language of mathematics to describe the world while Descartes relied on the inner process of reason. Both believed there was a single reality that could be investigated, understood and explained through careful analysis and thought; complete knowledge was possible if investigators looked long and hard enough. The assumptions and rules that started with them, and were extended by others over time, have been passed on and most of us accept them, uncritically, as common sense, the most effective way to look at the world.

The N-C world is ruled by invariant cause-and-effect; it is, in fact, a machine. If something bad happens, then there was a unique cause or set of causes. Investigators search for these broken components, which could be physical or human. It is assumed that a clear line exists between the broken part(s) and the overall behavior of the system. The explicit assumption of determinism leads to an implicit assumption of time reversibility—because system performance can be predicted from time A if we know the starting conditions and the functional relationships of all components, then we can start from a later time B (the bad outcome) and work back to the true causes. (p. 84) Root cause analysis and criminal investigations are steeped in this world view.

In this view, decision makers are expected to be rational people who “make decisions by systematically and consciously weighing all possible outcomes along all relevant criteria.” (p. 3) Bad outcomes are caused by incompetent or, worse, corrupt decision makers. Fixes include more communications, training, procedures, supervision, exhortations to try harder and criminal charges.

Dekker credits Newton et al. for giving man the wherewithal to probe Nature's secrets and build amazing machines. However, the Newtonian-Cartesian vision is not the only way to view the world, especially the world of complex, socio-technical systems. For that a new model, with different concepts and operating principles, is required.

The Complex System

Characteristics

The sheer number of parts does not make a system complex, only complicated. A truly complex system is open (it interacts with its environment), has components that act locally and don't know the full effects of their actions, is constantly making decisions to maintain performance and adapt to changing circumstances, and has non-linear interactions (small events can cause large results) because of multipliers and feedback loops. Complexity is a result of the ever-changing relationships between components. (pp. 138-144)

Adding to the myriad information confronting a manager or observer, system performance is often optimized at the edge of chaos, where competitors are perpetually vying for relative advantage at an affordable cost.** The system is constantly balancing its efforts between exploration (which will definitely incur costs but may lead to new advantages) and exploitation (which reaps benefits of current advantages but will likely dissipate over time). (pp. 164-165)

The most important feature of a complex system is that it adapts to its environment over time in order to survive. And its environment is characterized by resource scarcity and competition. There is continuous pressure to maintain production and increase efficiency (and their visible artifacts: output, costs, profits, market share, etc.), and less visible outputs, e.g., safety, will receive less attention. After all, “Though safety is a (stated) priority, operational systems do not exist to be safe. They exist to provide a service or product . . . .” (p. 99) And the cumulative effect of multiple adaptive decisions can be an erosion of safety margins and a changed response of the entire system. Such responses may be beneficial or harmful; the harmful case is a drift into failure.

Drift by a complex system exhibits several characteristics. First, as mentioned above, it is driven by environmental factors. Second, drift occurs in small steps so changes can hardly be noticed, and are even applauded if they result in local performance improvement; “. . . successful outcomes keep giving the impression that risk is under control” (p. 106) as a series of small decisions whittles away at safety margins. Third, these complex systems contain unruly technology (think deepwater drilling) where uncertainties exist about how the technology may be ultimately deployed and how it may fail. Fourth, there is significant interaction with a key environmental player, the regulator, and regulatory capture can occur, resulting in toothless oversight.

“Drifting into failure is not so much about breakdowns or malfunctioning of components, as it is about an organization not adapting effectively to cope with the complexity of its own structure and environment.” (p. 121) Drift and occasionally accidents occur because of ordinary system functioning, normal people going about their regular activities making ordinary decisions “against a background of uncertain technology and imperfect information.” Accidents, like safety, can be viewed as an emergent system property, i.e., they are the result of system relationships but cannot be predicted by examining any particular system component.

Managers' roles

Managers should not try to transform complex organizations into merely complicated ones, even if it's possible. Complexity is necessary for long-term survival as it maximizes organizational adaptability. The question is how to manage in a complex system. One key is increasing the diversity of personnel in the organization. More diversity means less group think and more creativity and greater capacity for adaptation. In practice, this means validation of minority opinions and encouragement of dissent, reflecting on the small decisions as they are made, stopping to ponder why some technical feature or process is not working exactly as expected and creating slack to reduce the chances of small events snowballing into large failures. With proper guidance, organizations can drift their way to success.

Accountability

Amoral and criminal behavior certainly exist in large organizations but bad outcomes can also result from normal system functioning. That's why the search for culprits (bad actors or broken parts) may not always be appropriate or adequate. This is a point Dekker has explored before, in Just Culture (briefly reviewed here) where he suggests using accountability as a means to understand the system-based contributors to failure and resolve those contributors in a manner that will avoid recurrence.

Application to Nuclear Safety Culture

A commercial nuclear power plant or fleet is probably not a complete complex system. It interacts with environmental factors but in limited ways; it's certainly not directly exposed to the Wild West competition of, say, the cell phone industry. Group think and normalization of deviance*** are constant threats. The technology is reasonably well-understood but changes, e.g., uprates based on more software-intensive instrumentation and control, may be invisibly sanding away safety margin. Both the industry and the regulator would deny regulatory capture has occurred but an outside observer may think the relationship is a little too cozy. Overall, the fit is sufficiently good that students of safety culture should pay close attention to Dekker's observations.

In contrast, the Hanford Waste Treatment Plant (Vit Plant) is almost certainly a complex system and this book should be required reading for all managers in that program.

Conclusion

Drift Into Failure is not a quick read. Dekker spends a lot of time developing his theory, then circling back to further explain it or emphasize individual pieces. He reviews incidents (airplane crashes, a medical error resulting in patient death, software problems, public water supply contamination) and descriptions of organization evolution (NASA, international drug smuggling, “conflict minerals” in Africa, drilling for oil, terrorist tactics, Enron) to illustrate how his approach results in broader and arguably more meaningful insights than the reports of official investigations. Standing on the shoulders of others, especially Diane Vaughan, Dekker gives us a rich model for what might be called the “banality of normalization of deviance.” 


* S. Dekker, Drift Into Failure: From Hunting Broken Components to Understanding Complex Systems (Burlington VT: Ashgate 2011).

** See our Sept. 4, 2012 post on Cynefin for another description of how the decisions an organization faces can suddenly slip from the Simple space to the Chaotic space.

*** We have posted many times about normalization of deviance, the corrosive organizational process by which yesterday's “unacceptable” becomes today's “good enough.”

Monday, August 3, 2009

Reading List: Just Culture by Sidney Dekker

Thought I would share with you a relatively recent addition to the safety management system bookshelf, Just Culture by Sidney Dekker, Professor of Human Factors and System Safety at Lund University in Sweden.  In Dekker’s view a “just culture” is critical for the creation of safety culture.  A just culture will not simply assign blame in response to a failure or problem, it will seek to use accountability as a means to understand the system-based contributors to failure and resolve those in a manner that will avoid recurrence.  One of the reasons we believe so strongly in safety simulation is the emphasis on system-based understanding, including a shared organizational mental model of how safety management happens.  One reviewer (D. Sillars) of this book on the amazon.com website summarizes, “’Just culture’ is an abstract phrase, which in practice, means . . . getting to an account of failure that can both satisfy demands for accountability while contributing to learning and improvement.” 


Question for nuclear professionals:  Does your organization maintain a library of resources such as Just Culture or Diane Vaughan’s book, The Challenger Launch Decision, that provide deep insights into organizational performance and culture?  Are materials like this routinely the subject of discussions in training sessions and topical meetings?