Showing posts with label Leveson. Show all posts
Showing posts with label Leveson. Show all posts

Wednesday, October 9, 2019

More on Mental Models in Healthcare

Source: Clipart Panda
Our August 6, 2019 post discussed the appalling incidence of preventable harm in healthcare settings.  We suggested that a better mental model of healthcare delivery could contribute to reducing the incidence of preventable harm.  It will come as no surprise to Safetymatters readers that we are referring to a systems-oriented model.

We’ll use a 2014 article* by Nancy Leveson and Sidney Dekker to describe how a systems approach can lead to better understanding of why accidents and other negative outcomes occur.  The authors begin by noting that 70-90% of industrial accidents are blamed on individual workers.**  As a consequence, proposed fixes focus on disciplining, firing, or retraining individuals or, for groups, specifying their work practices in ever greater detail (the authors call this “rigidifying” work).  This is the Safety I mental model in a nutshell, limiting its view to the “what” and “who” of incidents.   

In contrast, systems thinking posits the behavior of individuals can only be understood by examining the context in which their behavior occurs.  The context includes management decision-making and priorities, regulatory requirements and deficiencies, and of course, organizational culture, especially safety culture.  Fixes that don’t consider the overall process almost guarantee that similar problems will arise in the future.  “. . . human error is a symptom of a system that needs to be redesigned.”  Systems thinking adds the “why” to incident analysis.

Every system has a designer, although they may not be identified as such and may not even be aware they’re “designing” when they specify work steps or flows, or define support processes, e.g., procurement or quality control.  Importantly, designers deal with an ideal system, not with the actual constructed system.  The actual system may differ from the designer's original specification because of inherent process variances, the need to address unforeseen conditions, or evolution over time.  Official procedures may be incomplete, e.g., missing unlikely but possible conditions or assume that certain conditions cannot occur.  However, the people doing the work must deal with the  constructed system, however imperfect, and the conditions that actually occur.

The official procedures present a doubled-edged threat to employees.  If they adapt procedures in the face of unanticipated conditions, and the adaptation turns out to be ineffective or leads to negative outcomes, employees can be blamed for not following the procedures.  On the other hand, if they stick to the procedures when conditions suggest they should be adapted and negative outcomes occur, the employees can be blamed for too rigidly following them.

Personal blame is a major problem in System I.  “Blame is the enemy of safety . . . it creates a culture where people are afraid to report mistakes . . . A safety culture that focuses on blame will never be very effective in preventing accidents.”

Our Perspective

How does the above relate to reducing preventable harm in healthcare?  We believe that structural and cultural factors impede the application of systems thinking in the healthcare field.  It keeps them stuck in a Safety I worldview no matter how much they pretend otherwise. 

The hospital as formal bureaucracy

When we say “healthcare” we are referring to a large organization that provides medical care, a hospital is the smallest unit of analysis.  A hospital is literally a textbook example of what organizational theorists call a formal bureaucracy.  It has specialized departments with an official division of authority among them—silos are deliberately created and maintained.  An administrative hierarchy mediates among the silos and attempts to guide them toward overall goals. The organization is deliberately impersonal to avoid favoritism and behavior is prescribed, proscribed and guided by formal rules and procedures.  It appears hospitals were deliberately designed to promote System I thinking and its inherent bias for blaming the individual for negative outcomes.

Employees have two major strategies for avoiding blame: strong occupational associations and plausible deniability. 

Powerful guilds and unions 

Medical personnel are protected by their silo and tribe.  Department heads defend their employees (and their turf) from outsiders.  The doctors effectively belong to a guild that jealously guards their professional authority; the nurses and other technical fields have their unions.  These unofficial and official organizations exist to protect their members and promote their interests.  They do not exist to protect patients although they certainly tout such interest when they are pushing for increased employee headcounts.  A key cultural value is members do not rat on other members of their tribe so problems may be observed but go unreported.

Hiding behind the procedures

In this environment, the actual primary goal is to conform to the rules, not to serve clients.  The safest course for the individual employee is to follow the rules and procedures, independent of the effect this may have on a patient.  The culture espouses a value of patient safety but what gets a higher value is plausible deniability, the ability to avoid personal responsibility, i.e., blame, by hiding behind the established practices and rules when negative outcomes occur.

An enabling environment 

The environment surrounding healthcare allows them to continue providing a level of service that literally kills patients.  Data opacity means it’s very difficult to get reliable information on patient outcomes.  Hospitals with high failure rates simply claim they are stuck with or choose to serve the sickest patients.  Weak malpractice laws are promoted by the doctors’ guild and maintained by the politicians they support.  Society in general is overly tolerant of bad medical outcomes.  Some families may make a fuss when a relative dies from inadequate care but settlements are paid, non-disclosure agreements are signed, and the enterprise moves on.

Bottom line: It will take powerful forces to get the healthcare industry to adopt true systems-oriented thinking and identify the real reasons why preventive harm occurs and what corrective actions could be effective.  Healthcare claims to promote evidence-based medicine; they need to add evidence-based harm reduction strategies.  Industry-wide adoption of the aviation industry’s confidential reporting system for errors would be a big step forward.    

*  N. Leveson and S. Dekker, “Get To The Root Of Accidents,” (Feb 27, 2014).  Retrieved Oct. 7, 2019.  Leveson is an MIT professor and long-standing champion of systems thinking; Dekker has written extensively on Just Culture and Safety II concepts.  Click on their respective labels to pull up our other posts on their work.

**  The article is tailored for the process industry but the same thinking can be applied to service industries.

Monday, October 16, 2017

Nuclear Safety Culture: A Suggestion for Integrating “Just Culture” Concepts

All of you have heard of “Just Culture” (JC).  At heart, it is an attitude toward investigating and explaining errors that occur in organizations in terms of “why” an error occurred, including systemic reasons, rather than focusing on identifying someone to blame.  How might JC be applied in practice?  A paper* by Shem Malmquist describes how JC concepts could be used in the early phases of an investigation to mitigate cognitive bias on the part of the investigators.

The author asserts that “cognitive bias has a high probability of occurring, and becoming integrated into the investigators subconscious during the early stages of an accident investigation.” 

He recommends that, from the get-go, investigators categorize all pertinent actions that preceded the error as an error (unintentional act), at-risk behavior (intentional but for a good reason) or reckless (conscious disregard of a substantial risk or intentional rule violation). (p. 5)  For errors or at-risk actions, the investigator should analyze the system, e.g., policies, procedures, training or equipment, for deficiencies; for reckless behavior, the investigator should determine what system components, if any, broke down and allowed the behavior to occur. (p. 12).  Individuals should still be held responsible for deliberate actions that resulted in negative consequences.

Adding this step to a traditional event chain model will enrich the investigation and help keep investigators from going down the rabbit hole of following chains suggested by their own initial biases.

Because JC is added to traditional investigation techniques, Malmquist believes it might be more readily accepted than other approaches for conducting more systemic investigations, e.g., Leveson’s System Theoretic Accident Model and Processes (STAMP).  Such approaches are complex, require lots of data and implementing them can be daunting for even experienced investigators.  In our opinion, these models usually necessitate hiring model experts who may be the only ones who can interpret the ultimate findings—sort of like an ancient priest reading the entrails of a sacrificial animal.  Snide comment aside, we admire Leveson’s work and reviewed it in our Nov. 11, 2013 post.

Our Perspective

This paper is not some great new insight into accident investigation but it does describe an incremental step that could make traditional investigation methods more expansive in outlook and robust in their findings.

The paper also provides a simple introduction to the works of authors who cover JC or decision-making biases.  The former category includes Reason and Dekker and the latter one Kahneman, all of whom we have reviewed here at Safetymatters.  For Reason, see our Nov. 3, 2014 post; for Dekker, see our Aug. 3, 2009 and Dec. 5, 2012 posts; for Kahneman, see our Nov. 4, 2011 and Dec. 18, 2013 posts.

Bottom line: The parts describing and justifying the author’s proposed approach are worth reading.  You are already familiar with much of the contextual material he includes.  

*  S. Malmquist, “Just Culture Accident Model – JCAM” (June 2017).

Friday, January 27, 2017

Leadership, Decisions, Systems Thinking and Nuclear Safety Culture

AcciMap Excerpt
We recently read a paper* that echoes some of the themes we emphasize on Safetymatters, viz., leadership, decisions and a systems view.  Following is an excerpt from the abstract:

Leadership is progressively being recognized as a key** factor in supporting successful performance across a range of domains. . . . the decisions and actions that characterize safety leadership thus become important emergent properties in the prevention of incidents, which should be considered within the context of the broader organizational system and not merely constrained to understanding events or conditions that shape performance at the ‘sharp end’.”  [emphasis added]

The authors go on to analyze decisions and actions after a mining incident (landslide) using a combination of three different schemes: Rasmussen’s Risk Management Framework (RMF) and corresponding AcciMap, and the Critical Decision Method (CDM).

The RMF describes work systems as comprised of various levels and argues that safety performance is affected by decisions and actions at all levels from politicians in the external environment down through company executives and managers and finally to individual workers.  Rasmussen’s AcciMap is an expansive causal diagram for an accident or incident that displays the contributions (or omissions) at each level in the RMF and their connections.

CDM uses semi-structured interviews to obtain information about how individuals formulate their decisions, including context such as background knowledge and immediate influencing factors.  Consistent with the RMF, case study interviews were conducted with individuals at different organizational levels.  CDM data were used to construct the AcciMap.

We won’t go into the details of the analysis but it identified over a dozen key decisions made at different organizational levels before and during the incident; most were connected to at least one other key decision.  The AcciMap illustrates decisions and communications across multiple levels and thus provides a useful picture of how an organization anticipates and responds to an unusual situation.

Our Perspective

The authors argue, and we agree, that this type of analysis provides greater detail and insight into the performance of an organization’s safety management system than traditional accident investigations (especially those focused on finding someone to blame).

This article does not specifically discuss culture.  But the body of decisions an organization produces is the strongest evidence and most visible artifact of its culture.  Organizational decisions are far more important than responses to surveys or interviews where people can report what they believe (or hope) the culture is, or what they think their audience wants to hear.

We like that RMF and AcciMap are agnostic: they can be used to analyze either “what went wrong” or “what went right” scenarios.  (The case study was in the latter category because no one was hurt in the incident.)  If an assessor is looking at a sample of decisions to infer a nuclear organization’s culture, most of those decisions will have had positive (or at least no negative) consequences.

The authors are Australian academics but this short (8 pages total) paper is quite readable and a good introduction to CDM and Rasmussen’s constructs.  The references include people whose work we have positively reviewed on Safetymatters, including Dekker, Hollnagel, Leveson and Reason.

Bottom line: There is nothing about culture or nuclear here, but the overall message reinforces our beliefs about how to think about Nuclear Safety Culture.

*  S-L Donovana, P.M. Salmonb and M.G. Lennéa, “The leading edge: A systems thinking methodology for assessing safety leadership,” Procedia Manufacturing 3 (2015), pp. 6644–6651.  Available at; retrieved Jan. 19, 2017.

**  Note they do not say “one and only” or even “most important.”

Thursday, March 17, 2016

IAEA Nuclear Safety Culture Conference

The International Atomic Energy Agency (IAEA) recently sponsored a week-long conference* to celebrate 30 years of interest and work in safety culture (SC).  By our reckoning, there were about 75 individual presentations in plenary sessions and smaller groups; dialog sessions with presenters and subject matter experts; speeches and panels; and over 30 posters.  It must have been quite a circus.

We cannot justly summarize the entire conference in this space but we can highlight material related to SC factors we’ve emphasized or people we’ve discussed on Safetymatters, or interesting items that merit your consideration.

Topics We Care About

A Systems Viewpoint

Given that the IAEA has promoted a systemic approach to safety and it was a major conference topic it’s no surprise that many participants addressed it.  But we were still pleased to see over 30 presentations, posters and dialogues that included mention of systems, system dynamics, and systemic and/or holistic viewpoints or analyses.  Specific topics covered a broad range including complexity, coupling, Fukushima, the Interaction between Human, Technical and Organizational Factors (HTOF), error/incident analysis, regulator-licensee relationships, SC assessment, situational adaptability and system dynamics.

Role of Leadership

Leadership and Management for Safety was another major conference topic.  Leadership in a substantive context was mentioned in about 20 presentations and posters, usually as one of multiple success factors in creating and maintaining a strong SC.  Topics included leader/leadership commitment, skills, specific competences, attributes, obligations and responsibilities; leadership’s general importance, relationship to performance and role in accidents; and the importance of leadership in nuclear regulatory agencies. 

Decision Making

This was mentioned about 10 times, with multiple discussions of decisions made during the early stages of the Fukushima disaster.  Other presenters described how specific techniques, such as Probabilistic Risk Assessment and Human Reliability Analysis, or general approaches, such risk control and risk informed, can contribute to decision making, which was seen as an important component of SC.

Compensation and Rewards

We’ve always been clear: If SC and safety performance are important then people from top executives to individual workers should be rewarded (by which we mean paid money) for doing it well.  But, as usual, there was zero mention of compensation in the conference materials.  Rewards were mentioned a few times, mostly by regulators, but with no hint they were referring to monetary rewards.  Overall, a continuing disappointment.   

Participants Who Have Been Featured in Safetymatters

Over the years we have presented the work of many conference participants to Safetymatters readers.  Following are some familiar names that caught our eye.
  Page numbers refer to the conference “Programme and Abstracts” document.
We have to begin with Edgar Schein, the architect of the cultural construct used by almost everyone in the SC space.  His discussion paper (p. 47) argued that the SC components in a nuclear plant depend on whether the executives actually create the climate of trust and openness that the other attributes hinge on.  We’ve referred to Schein so often he has his own label on Safetymatters.

Mats Alvesson’s presentation
(p. 46) discussed “hyper culture,” the vague and idealistic terms executives often promote that look good in policy documents but seldom work well in practice.  This presentation is consistent with his article on Functional Stupidity which we reviewed on Feb. 23, 2016.

Sonja Haber’s paper (p. 55) outlined a road map for the nuclear community to move forward in the way it thinks about SC.  Dr. Haber has conducted many SC assessments for the Department of Energy that we have reviewed on Safetymatters. 

Ken Koves of INPO led or participated in three dialogue sessions.  He was a principal researcher in a project that correlated SC survey data with safety performance measures which we reviewed on Oct. 22, 2010 and Oct. 5, 2014.

Najmedin Meshkati discussed (p. 60) how organizations react when their control systems start to run behind environmental demands using Fukushima as an illustrative case.  His presentation draws on an article he coauthored comparing the cultures at TEPCO’s Fukushima Daiichi plant and Tohoku Electric’s Onagawa plant which we reviewed on Mar. 19, 2014.

Jean-Marie Rousseau co-authored a paper (p. 139) on the transfer of lesson learned from accidents in one industry to another industry.  We reviewed his paper on the effects of competitive pressures on nuclear safety management issues on May 8, 2013.

Carlo Rusconi discussed (p. 167) how the over-specialization of knowledge required by decision makers can result in pools of knowledge rather than a stream accessible to all members of an organization.  A systemic approach to training can address this issue.  We reviewed Rusconi’s earlier papers on training on June 26, 2013 and Jan. 9, 2014.

Richard Taylor’s presentation (p. 68) covered major event precursors and organizations’ failure to learn from previous events.  We reviewed his keynote address at a previous IAEA conference where he discussed using system dynamics to model organizational archetypes on July 31, 2012.

Madalina Tronea talked about (p. 114) the active oversight of nuclear plant SC by the National Commission for Nuclear Activities Control (CNCAN), the Romanian regulatory authority.  CNCAN has developed its own model of organizational culture and uses multiple methods to collect information for SC assessment.  We reviewed her initial evaluation guidelines on Mar. 23, 2012

Our Perspective

Many of the presentations were program descriptions or status reports related to the presenter’s employer, usually a utility or regulatory agency.  Fukushima was analyzed or mentioned in 40 different papers or posters.  Overall, there were relatively few efforts to promote new ideas, insights or information.  Having said that, following are some materials you should consider reviewing.

From the conference participants mentioned above, Haber’s abstract (p. 55) and Rusconi’s abstract (p. 167) are worth reading.  Taylor’s abstract (p. 68) and slides are also worth reviewing.  He advocates using system dynamics to analyze complicated issues like the effectiveness of organizational learning and how events can percolate through a supply chain.

Benoît Bernard described the Belgian regulator’s five years of experience assessing nuclear plant SC.  Note that lessons learned are described in his abstract (p. 113) but are somewhat buried in his presentation slides.

If you’re interested in a systems view of SC, check out Francisco de Lemos’ presentation
(p. 63) which gives a concise depiction of a complex system plus a Systems Theoretic Accident Models and Processes (STAMP) analysis.  His paper is based on Nancy Leveson’s work which we reviewed on Nov. 11, 2013.

Diana Engström argued that nuclear personnel can put more faith in reported numbers than justified by the underlying information, e.g., CAP trending data, and thus actually add risk to the overall system.  We’d call this practice an example of functional stupidity although she doesn’t use that term in her provocative paper.  Both her abstract (p. 126) and slides are worth reviewing.

Jean Paries gave a talk on the need for resilience in the management of nuclear operations.  The abstract (p. 228) is clear and concise; there is additional information in his slides but they are a bit messy.

And that’s it for this installment.  Be safe.  Please don’t drink and text.

*  International Atomic Energy Agency, International Conference on Human and Organizational Aspects of Assuring Nuclear Safety: Exploring 30 years of Safety Culture (Feb. 22–26, 2016).  This page shows the published conference materials.  Thanks to Madalina Tronea for publicizing them.  Dr. Tronea is the founder/moderator of the LinkedIn Nuclear Safety Culture discussion group. 

Monday, April 13, 2015

Safety-I and Safety-II: The Past and Future of Safety Management by Erik Hollnagel

This book* discusses two different ways of conceptualizing safety performance problems (e.g., near-misses, incidents and accidents) and safety management in socio-technical systems.  This post describes each approach and provides our perspective on Hollnagel’s efforts.  As usual, our interest lies in the potential value new ways of thinking can offer to the nuclear industry.


This is the common way of looking at safety performance problems.  It is reactive, i.e., it waits for problems to arise** and analytic, e.g., it uses specific methods to work back from the problem to its root causes.  The key assumption is that something in the system has failed or malfunctioned and the purpose of an investigation is to identify the causes and correct them so the problem will not recur.  A second assumption is that chains of causes and effects are linear, i.e., it is actually feasible to start with a problem and work back to its causes.  A third assumption is that a single solution (the “first story”) can be found. (pp. 86, 175-76)***  Underlying biases include the hindsight bias (p. 176) and the belief that the human is usually the weak link. (pp. 78-79)  The focus of safety management is minimizing the number of things that go wrong.

Our treatment of Safety-I is brief because we have reported on criticism of linear thinking/models elsewhere, primarily in the work of Dekker, Woods et al, and Leveson.  See our posts of Dec. 5, 2012; July 6, 2013; and Nov. 11, 2013 for details.


Safety-II is proposed as a different way to look at safety performance.  It is proactive, i.e., it looks at the ways work is actually performed on a day-to-day basis and tries to identify causes of performance variability and then manage them.  A key cause of variability is the regular adjustments people make in performing their jobs in order to keep the system running.  In Hollnagel’s view, “Finding out what these [performance] adjustments are and trying to learn from them can be more important than finding the causes of infrequent adverse outcomes!” (p. 149)  The focus of safety management is on increasing the likelihood that things will go right and developing “the ability to succeed under varying conditions, . . .” (p. 137).

Performance is variable because, among other reasons, people are always making trade-offs between thoroughness and efficiency.  They may use heuristics or have to compensate for something that is missing or take some steps today to avoid future problems.  The underlying assumption of Safety-II is that the same behaviors that almost always lead to successful outcomes can occasionally lead to problems because of performance variability that goes beyond the boundary of the control space.  A second assumption is that chains of causes and effects may be non-linear, i.e., a small variance may lead to a large problem, and may have an emergent aspect where a specific performance variability may occur then disappear or the Swiss cheese holes may momentarily line up exposing the system to latent hazards. (pp. 66, 131-32)  There may be multiple explanations (“second stories”) for why a particular problem occurred.  Finally, Safety-II accepts that there are often differences between Work-as-Imagined (esp. as imagined by folks at the blunt end) and Work-as-Done (by people at the sharp end). (pp. 40-41)***

The Two Approaches

Safety-I and Safety-II are not in some winner-take-all competitive struggle.  Hollnagel notes there are plenty of problems for which a Safety-I investigation is appropriate and adequate. (pp. 141, 146)

Safety-I expenditures are viewed as a cost (to reduce errors). (p. 57)  In contrast, Safety-II expenditures are viewed as bona fide investments to create more correct outcomes. (p. 166)

In all cases, organizational factors, such as safety culture, can impact safety performance and organizational learning. (p. 31)

Our Perspective

The more complex a socio-technical entity is, the more it exhibits emergent properties and the more appropriate Safety-II thinking is.  And nuclear has some elements of complexity.****  In addition, Hollnagel notes that a common explanation for failures that occur in a System-I world is “it was never imagined something like that could happen.” (p. 172)  To avoid being the one in front of the cameras saying that, it might be helpful for you to spend a little time reflecting on how System-II thinking might apply in your world.

Why do most things go right?  Is it due to strict compliance with procedures?  Does personal creativity or insight contribute to successful plant performance?  Do you talk with your colleagues about possible efficiency-thoroughness trade-offs (short cuts) that you or others make?  Can thinking about why things go right make one more alert to situations where things are heading south?  Does more automation (intended to reduce reliance on fallible humans) actually move performance closer to the control boundary because it removes the human’s ability to make useful adjustments?  Has any of your root cause evaluations appeared to miss other plausible explanations for why a problem occurred?

Some of the Safety-II material is not new.  Performance variability in Safety-II builds on Hollnagel’s earlier work on the efficiency-thoroughness trade-off (ETTO) principle.  (See our Jan. 3, 2013 post.)   His call for mindfulness and constant alertness to problems is straight out of the High Reliability Organization playbook. (pp. 36, 163-64)  (See our May 3, 2013 post.)

A definite shortcoming is the lack of concrete examples in the Safety-II discussion.  If someone has tried to do this, it would be nice to hear about it.

Bottom line, Hollnagel has some interesting observations although his Safety-II model is probably not the Next Big Thing for nuclear safety management.


*  E. Hollnagel, Safety-I and Safety-II: The Past and Future of Safety Management  (Burlington, VT: Ashgate , 2014)

**  In the author’s view, forward-looking risk analysis is not proactive because it is infrequently performed. (p. 57) 

***  There are other assumptions in the Safety-I approach (see pp. 97-104) but for the sake of efficiency, they are omitted from this post.

****  Nuclear power plants have some aspects of a complex socio-technical system but other aspects are merely complicated.   On the operations side, activities are tightly coupled (one attribute of complexity) but most of the internal organizational workings are complicated.  The lack of sudden environmental disrupters (excepting natural disasters) means they have time to adapt to changes in their financial or regulatory environment, reducing complexity.

Monday, October 13, 2014

Systems Thinking in Air Traffic Management

A recent white paper* presents ten principles to consider when thinking about a complex socio-technical system, specifically European Air Traffic Management (ATM).  We review the principles below, highlighting aspects that might provide some insights for nuclear power plant operations and safety culture (SC).

Before we start, we should note that ATM is truly a complex** system.  Decisions involving safety and efficiency occur on a continuous basis.  There is always some difference between work-as-imagined and work-as-done.

In contrast, we have argued that a nuclear plant is a complicated system but it has some elements of complexity.  To the extent complexity exists, treating nuclear like a complicated machine via “analysing components using reductionist methods; identifying ‘root causes’ of problems or events; thinking in a linear and short-term way; . . . [or] making changes at the component level” is inadequate. (p. 5)  In other words, systemic factors may contribute to observed performance variability and frustrate efforts to achieve the goal in nuclear of eliminating all differences between work-as-planned and work-as-done.

Principles 1-3 relate to the view of people within systems – our view from the outside and their view from the inside.

1. Field Expert Involvement
“To understand work-as-done and improve how things really work, involve those who do the work.” (p. 8)
2. Local Rationality
“People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time.” (p. 10)
3. Just Culture
“Adopt a mindset of openness, trust and fairness. Understand actions in context, and adopt systems language that is non-judgmental and non-blaming.” (p. 12)

Nuclear is pretty good at getting line personnel involved.  Adages such as “Operations owns the plant” are useful to the extent they are true.  Cross-functional teams can include operators or maintenance personnel.  An effective CAP that allows workers to identify and report problems with equipment, procedures, etc. is good; an evaluation and resolution process that involves members from the same class of workers is even better.  Having someone involved in an incident or near-miss go around to the tailgates and classes to share the lessons learned can be convincing.

But when something unexpected or bad happens, nuclear tends to spend too much time looking for the malfunctioning component (usually human).   “The assumption is that if the person would try harder, pay closer attention, do exactly what was prescribed, then things would go well. . . . [But a] focus on components becomes less effective with increasing system complexity and interactivity.” (p. 4)  An outside-in approach ignores the context in which the human performed, the information and time available, the competition for focus of attention, the physical conditions of the work, fatigue, etc.  Instead of insight into system nuances, the result is often limited to more training, supervision or discipline.

The notion of a “just culture” comes from James Reason.  It’s a culture where employees are not punished for their actions, omissions or decisions that are commensurate with their experience and training, but where gross negligence, willful violations and destructive acts are not tolerated.

Principles 4 and 5 relate to the system conditions and context that affect work.

4. Demand and Pressure
“Demands and pressures relating to efficiency and capacity have a fundamental effect on performance.” (p. 14)
5. Resources & Constraints

“Success depends on adequate resources and appropriate constraints.” (p. 16)

Fluctuating demand creates far more varied and unpredictable problems for ATM than it does in nuclear.  However, in nuclear the potential for goal conflicts between production, cost and safety is always present.  The problem arises from acting as if these conflicts don’t exist.

ATM has to “cope with variable demand and variable resources,” a situation that is also different from nuclear with its base load plants and established resource budgets.  The authors opine that for ATM, “a rigid regulatory environment destroys the capacity to adapt constantly to the environment.” (p. 2) Most of us think of nuclear as quite constrained by procedures, rules, policies, regulations, etc., but an important lesson from Fukushima was that under unforeseen conditions, the organization must be able to adapt according to local, knowledge-based decisions  Even the NRC recognizes that “flexibility may be necessary when responding to off-normal conditions.”***

Principles 6 through 10 concern the nature of system behavior, with 9 and 10 more concerned with system outcomes.  These do not have specific implications for SC other than keeping an open mind and being alert to systemic issues, e.g., complacency, drift or emergent behavior.

6. Interactions and Flows
“Understand system performance in the context of the flows of activities and functions, as well as the interactions that comprise these flows.” (p. 18)
7. Trade-Offs
“People have to apply trade-offs in order to resolve goal conflicts and to cope with the complexity of the system and the uncertainty of the environment.” (p. 20)
8. Performance variability
“Understand the variability of system conditions and behaviour.  Identify wanted and unwanted variability in light of the system’s need and tolerance for variability.” (p. 22)
9. Emergence
“System behaviour in complex systems is often emergent; it cannot be reduced to the behaviour of components and is often not as expected.” (p. 24)
10. Equivalence
“Success and failure come from the same source – ordinary work.” (p. 26)

Work flow certainly varies in ATM but is relatively well-understood in nuclear.  There’s really not much more to say on that topic.

Trade-offs occur in decision making in any context where more than one goal exists.  One useful mental model for conceptualizing trade-offs is Hollnagel’s efficiency-thoroughness construct, basically doing things quickly (to meet the production and cost goals) vs. doing things well (to meet the quality and possibly safety goals).  We reviewed his work on Jan. 3, 2013.

Performance variability occurs in all systems, including nuclear, but the outcomes are usually successful because a system has a certain range of tolerance and a certain capacity for resilience.  Performance drift happens slowly, and can be difficult to identify from the inside.  Dekker’s work speaks to this and we reviewed it on Dec. 5, 2012.

Nuclear is not fully complex but surprises do happen, some of them not caused by component failure.  Emergence (problems that arise from new or unforeseen system interactions) is more likely to occur following the implementation of new technical systems.  We discussed this possibility in a July 6, 2013 post on a book by Woods, Dekker et al.

Equivalence means that work that results in both good and bad outcomes starts out the same way, with people (saboteurs excepted) trying to be successful.  When bad things happen, we should cast a wide net in looking for different factors, including systemic ones, that aligned (like Swiss cheese slices) in the subject case.

The white paper also includes several real and hypothetical case studies illustrating the application of the principles to understanding safety performance challenges 

Our Perspective 

The authors draw on a familiar cast of characters, including Dekker, Hollnagel, Leveson and Reason.  We have posted about all these folks, just click on their label in the right hand column.

The principles are intended to help us form a more insightful mental model of a system under consideration, one that includes non-linear cause and effect relationships, and the possibility of emergent behavior.  The white paper is not a “must read” but may stimulate useful thinking about the nature of the nuclear operating organization.

*  European Organisation for the Safety of Air Navigation(EUROCONTROL), “Systems Thinking for Safety: Ten Principles” (Aug. 2014).  Thanks to Bill Mullins for bringing this white paper to our attention.

**  “[C]omplex systems involve large numbers of interacting elements and are typically highly dynamic and constantly changing with changes in conditions. Their cause-effect relations are non-linear; small changes can produce disproportionately large effects. Effects usually have multiple causes, though causes may not be traceable and are socially constructed.” (pp. 4-5)

Also see our Oct. 14, 2013 discussion of the California Independent System Operator for another example of a complex system.

***  “Work Processes,” NRC Safety Culture Trait Talk, no. 2 (July 2014), p. 1.  ADAMS ML14203A391.  Retrieved Oct. 8, 2014

Monday, November 11, 2013

Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson

In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory.  The book presents the criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics.  It should be noted the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.

Part I
Part I sets the stage for a new safety paradigm.  Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6)   Traditional accident analysis techniques are no longer sufficient.  They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10)  Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).**   We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.

Event-based models are simply inadequate.  There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it.  The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24)  Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19)  In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)

Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors.  “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination.  When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent.  This dependence may be related to common systemic factors that do not appear in an event chain.  Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35)  “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36) 

The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38)  In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46) 

Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51)  Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52)  This is reminiscent of  Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.

When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56)  In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64)  Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60) 

Part II

Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes).  For our purposes we don't need to spend much space on this material.  “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”***   It attempts to achieve the goals listed at the end of Part I.

STAMP treats safety in a system as a control problem, not a reliability one.  Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76)  Controls may be physical or social, including cultural.  There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87)  “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)

Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system.  There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.

Part III

Part III describes in detail how STAMP can be applied.  There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section.  Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of  STAMP over traditional accident analysis techniques. 

We will focus on a chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.

Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421)  Two new components need a bit of explanation, a safety control structure and controls on system migration toward higher risk.  The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40)  One of the control structure's responsibilities is to identify “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26)  Such an approach should be based on the organization's comprehensive hazards analysis.****

The safety culture discussion is also familiar. (pp. 426-33)  Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust.  In such a culture, Leveson says an early warning system for migration toward states of high risk can be established.  A section on Just Culture is taken directly from Dekker's work.  The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.

Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.

Our Perspective

Overall, we like this book.  It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial.  The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog.  The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.

However, there are only a few direct references to nuclear.  The examples in the book are mostly from aerospace, aviation, maritime activities and the military.  Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory),  shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs.  Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428)  She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7)  In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective.  TMI is used to illustrate specific system design/operation problems.

We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists.  First, she accuses HRO of confusing reliability with safety, in other words, an unsafe system can function very reliably. (pp. 7, 12)  But I'm not aware of any HRO work that has been done in an organization that is patently unsafe.  HRO asserts that reliability follows from practices that recognize and contain emerging problems.  She takes another swipe at HRO when she says HRO suggests that, during crises, decision making migrates to frontline workers.  Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44)  Her position may be correct in some cases but as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could.  Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.”  In her view, “Telling managers and operators to be “mindful of weak signals” simply creates a pretext for blame after a loss event occurs.” (p. 410)  I don't think it's pretext but it is challenging to maintain mindfulness and sense faint signals.  Overall, this appears to be academic posturing and feather fluffing.

We offer no opinion on the efficacy of using Leveson's STAMP approach.  She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)

In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii)  You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.

*  N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (The MIT Press, Cambridge, MA: 2011)  The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed.  All quotes in this post were retyped from the original text.

**  We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier.  These authors are all aware of others' publications and contributions.  Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text. 

***  Nancy Leveson informal bio page.

****  “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157)  The hazards analysis identifies all major hazards the system may confront.  Baseline safety requirements follow from the hazards analysis.  Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk.  The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)

Saturday, July 6, 2013

Behind Human Error by Woods, Dekker, Cook, Johannesen and Sarter

This book* examines how errors occur in complex socio-technical systems.  The authors' thesis is that behind every ascribed “human error” there is a “second story” of the context (conditions, demands, constraints, etc.) created by the system itself.  “That which we label “human error” after the fact is never the cause of an accident.  Rather, it is the cumulative effect of multiple cognitive, collaborative, and organizational factors.” (p. 35)  In other words, “Error is a symptom indicating the need to investigate the larger operational systems and the organizational context in which it functions.” (p. 28)  This post presents a summary of the book followed by our perspective on its value.  (The book has a lot of content so this will not be a short post.)

The Second Story

This section establishes the authors' view of error and how socio-technical systems function.  They describe two mutually exclusive world views: (1) “erratic people degrade an otherwise safe system” vs. (2) “people create safety at all levels of the socio-technical system by learning and adapting . . .” (p. 6)  It should be obvious that the authors favor option 2.

In such a world “Failure, then, represents breakdowns in adaptations directed at coping with complexity.  Indeed, the enemy of safety is not the human: it is complexity.” (p. 1)  “. . . accidents emerge from the coupling and interdependence of modern systems.” (p. 31) 

Adaptation occurs in response to pressures or environmental changes.  For example, systems are under stakeholder pressure to become faster, better, cheaper; multiple goals and goal conflict are regular complex system characteristics.  But adaptation is not always successful.  There may be too little (rules and procedures are followed even though conditions have changed) or too much (adaptation is attempted with insufficient information to achieve goals).  Because of pressure, adaptations evolve toward performance boundaries, in particular, safety boundaries.  There is a drift toward failure. (see Dekker, reviewed here)

The authors present 15 premises for analyzing errors in complex socio-technical systems. (pp. 19-30)  Most are familiar but some are worth highlighting and remembering when thinking about system errors:

  • “There is a loose coupling between process and outcome.”  A “bad” process does not always produce bad outcomes and a “good” process does not always produce good outcomes.
  • “Knowledge of outcome (hindsight) biases judgments about process.”  More about that later.
  • “Lawful factors govern the types of erroneous actions or assessments to be expected.”   In other words, “errors are regular and predictable consequences of a variety of factors.”
  • “The design of artifacts affects the potential for erroneous actions and paths towards disaster.”  This is Human Factors 101 but problems still arise.  “Increased coupling increases the cognitive demands on practitioners.”  Increased coupling plus weak feedback can create a latent failure.

Complex Systems Failure

This section covers traditional mental models used for assessing failures and points out the putative inadequacies of each.  The sequence-of-events (or domino) model is familiar Newtonian causal analysis.  Man-made disaster theory puts company culture and institutional design at the heart of the safety question.  Vulnerability develops over time but is hidden by the organization’s belief that it has risk under control.  A system or component is driven into failure.  The latent failure (or Swiss cheese) model proposes that “disasters are characterized by a concatenation of several small failures and contributing events. . .” (p. 50)  While a practitioner may be closest to an accident, the associated latent failures were created by system managers, designers, maintainers or regulators.  All these models reinforce the search for human error (someone untrained, inattentive or a “bad apple) and the customary fixes (more training, procedure adherence and personal attention, or targeted discipline).  They represent a failure to adopt systems thinking and concepts of dynamics, learning, adaptation and the notion that a system can produce accidents as a natural consequence of its normal functioning.

A more sophisticated set of models is then discussed.  Perrow's normal accident theory says that “accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled.” (p. 61)  Such systems structurally confuse operators and prevent them from recovering when incipient failure is discovered.  People are part of the Perrowian system and can exhibit inadequate expertise.  Control theory sees systems as composed of components that must be kept in dynamic equilibrium based on feedback and continual control inputs—basically a system dynamics view.  Accidents are a result of normal system behavior and occur when components interact to violate safety constraints and the feedback (and control inputs) do not reflect the developing problems.  Small changes in the system can lead to huge consequences elsewhere.  Accident avoidance is based on making system performance boundaries explicit and known although the goal of efficiency will tend to push operations toward the boundaries.  In contrast, the authors would argue for a different focus: making the system more resilient, i.e., error-tolerant.**  High reliability theory describes how how-hazard activities can achieve safe performance through leadership, closed systems, functional decentralization, safety culture, redundancy and systematic learning.  High reliability means minimal variations in performance, which in the short-term, means safe performance but HROs are subject to incidents indicative of residual system noise and unseen changes from social forces, information management or new technologies. (See Weick, reviewed here)

Standing on the shoulders of the above sophisticated models, resilience engineering (RE) is proposed as a better way to think about safety.  According to this model, accidents “represent the breakdowns in the adaptations necessary to cope with the real world complexity. (p. 83)  The authors use the Columbia space shuttle disaster to illustrate patterns of failure evident in complex systems: drift toward failure, past success as reason for continued confidence, fragmented problem-solving, ignoring new evidence and intra-organizational communication breakdowns.  To oppose or compensate for these patterns, RE proposes monitoring or enhancing other system properties including: buffering capacity, flexibility, margin and tolerance (which means replacing quick collapse with graceful degradation).  RE “focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment.” (p. 93)  In practice, that means detecting signs of increasing risk, having resources for safety available, and recognizing when and where to invest to offset risk.  It also requires focusing on organizational decision making, e.g., cross checks for risky decisions, the safety-production-efficiency balance and the reporting and disposition of safety concerns.  “Enhancing error tolerance, detection and recovery together produce safety.” (p. 26)

Operating at the Sharp End

An organization's sharp end is where practitioners apply their expertise in an effort to achieve the organization's goals.  The blunt end is where support functions, from administration to engineering, work.  The blunt end designs the system, the sharp end operates it.  Practitioner performance is affected by cognitive activities in three areas: activation of knowledge, the flow of attention and interactions among multiple goals.

The knowledge available to practitioners arrives as organized content.  Challenges include: organization may be poor, the content may be incomplete or simply wrong.  Practitioner mental models may be inaccurate or incomplete without the practitioners realizing it, i.e., they may be poorly calibrated.  Knowledge may be inert, i.e., not accessed when it is needed.  Oversimplifications (heuristics) may work in some situations but produce errors in others and limit the practitioner's ability to account for uncertainties or conflicts that arise in individual cases.  The discussion of heuristics suggests Hollnagel, reviewed here.

Mindset is about attention and its control.” (p. 114)  Attention is a limited resource.  Problems with maintaining effective attention include loss of situational awareness, in which the practitioner's mental model of events doesn't match the real world, and fixation, where the practitioner's initial assessment of  a situation creates a going-forward bias against accepting discrepant data and a failure to trigger relevant inert knowledge.  Mindset seems similar to HRO mindfulness. (see Weick)

Goal conflict can arise from many sources including management policies, regulatory requirements, economic (cost) factors and risk of legal liability.  Decision making must consider goals (which may be implicit), values, costs and risks—which may be uncertain.  Normalization of deviance is a constant threat.  Decision makers may be held responsible for achieving a goal but lack the authority to do so.  The conflict between cost and safety may be subtle or unrecognized.  “Safety is not a concrete entity and the argument that one should always choose the safest path misrepresents the dilemmas that confront the practitioner.” (p. 139)  “[I]t is difficult for many organizations (particularly in regulated industries) to admit that goal conflicts and tradeoff decisions arise.” (p. 139)  Overall, the authors present a good discussion of goal conflict.

How Design Can Induce Error

The design of computerized devices intended to help practitioners can instead lead to greater risks of errors and incidents.  Specific causes of problems include clumsy automation, limited information visibility and mode errors. 

Automation is supposed to increase user effectiveness and efficiency.  However, clumsy automation creates situations where the user loses track of what the computer is set up to do, what it's doing and what it will do next.  If support systems are so flexible that users can't know all their possible configurations, they adopt simplifying strategies which may be inappropriate in some cases.  Clumsy automation leads to more (instead of less) cognitive work, user attention is diverted to the machine instead of the task, increased potential for new kinds of errors and the need for new user knowledge and judgments.  The machine effectively has its own model of the world, based on user inputs, data sensors and internal functioning, and passes that back to the user.

Machines often hide a mass of data behind a narrow keyhole of visibility into the system.  Successful design creates “a visible conceptual space meaningfully related to activities and constraints in a field of practice.” (p. 162)  In addition, “Effective representations highlight  'operationally interesting' changes for sequences of behavior . . .” (p. 167)  However, default displays typically do not make interesting events directly visible.

Mode errors occurs when an operator initiates an action that would be appropriate if the machine were in mode A but, in fact, it's in mode B.  (This may be a man-machine problem but it's not the machine's fault.)  A machine can change modes based on situational and system factors in addition to operator input.  Operators have to maintain mode awareness, not an easy task when viewing a small, cluttered display that may not highlight current mode or mode changes.

To cope with bad design “practitioners adapt information technology provided for them to the immediate tasks at hand in a locally pragmatic way, . . .” (p. 191)  They use system tailoring where they adapt the device, often by focusing on a feature set they consider useful and ignoring other machine capabilities.  They use task tailoring where they adapt strategies to accommodate constraints imposed by the new technology.  Both types of adaptation can lead to success or eventual failures. 

The authors suggest various countermeasures and design changes to address these problems. 

Reactions to Failure

Different approaches for analyzing accidents lead to different perspectives on human error. 

Hindsight bias is “the tendency for people to 'consistently exaggerate what could have been anticipated in foresight.'” (p. 15)  It reinforces the tendency to look for the human in the human error.  Operators are blamed for bad outcomes because they are available, tracking back to multiple contributing causes is difficult, most system performance is good and investigators tend to judge process quality by its outcome.  Outsiders tend to think operators knew more about their situation than they actually did.  Evaluating process instead of outcome is also problematic.  Process and outcome are loosely coupled and what standards should be used for process evaluation?  Formal work descriptions “underestimate the dilemmas, interactions between constraints, goal conflicts, and tradeoffs present in the actual workplace.” (p. 208)  A suggested alternative approach is to ask what other practitioners would have done in the same situation and build a set of contrast cases.  “What we should not do, . . . is rely on putatively objective external evaluations . . . such as . . . court cases or other formal hearings.  Such processes in fact institutionalize and legitimate the hindsight bias . . . leading to blame and a focus on individual actors at the expense of a system view.” (pp. 213-214)

Distancing through differencing is another risk.  In this practice, reviewers focus on differences between the context surrounding an accident and their own circumstance.  Blaming individuals reinforces belief that there are no lessons to be learned for other organizations.  If human error is local and individual (as opposed to systemic) then sanctions, exhortations to follow the procedures and remedial training are sufficient fixes.  There is a decent discussion of TMI here, where, in the authors' opinion, the initial sense of fundamental surprise and need for socio-technical fixes was soon replaced by a search for local, technologically-focused solutions.
There is often pressure to hold people accountable after incidents or accidents.  One answer is a “just culture” which views incidents as system learning opportunities but also draws a line between acceptable and unacceptable behavior.  Since the “line” is an attribution the key question for any organization is who gets to draw it.  Another challenge is defining the discretionary space where individuals alone have the authority to decide how to proceed.  There is more on just culture but this is all (or mostly) Dekker. (see our Just Culture commentary here)

The authors' recommendations for analyzing errors and improving safety can be summed up as follows: recognize that human error is an attribution; pursue second stories that reveal the multiple, systemic contributors to failure; avoid hindsight bias; understand how work really gets done; search for systemic vulnerabilities; study how practice creates safety; search for underlying patterns; examine how change will produce new vulnerabilities; use technology to enhance human expertise; and tame complexity. (p. 239)  “Safety is created at the sharp end as practitioners interact with hazardous processes . . . using the available tools and resources.” (p. 243)

Our Perspective

This is a book about organizational characteristics and socio-technical systems.  Recommendations and advice are aimed at organizational policy makers and incident investigators.  The discussion of a “just culture” is the only time culture is discussed in detail although safety culture is mentioned in passing in the HRO write-up.

Our first problem with the book is repeatedly referring to medicine, aviation, aircraft carrier operations and nuclear power plants as complex systems.***  Although medicine is definitely complex and aviation (including air traffic control) possibly is, carrier operations and nuclear power plants are simply complicated.  While carrier and nuclear personnel have to make some adaptations on the fly, they do not face sudden, disruptive changes in their technologies or operating environments and they are not exposed to cutthroat competition.  Their operations are tightly coordinated but, where possible, by design more loosely coupled to facilitate recovery if operations start to go sour.  In addition, calling nuclear power operations complex perpetuates the myth that nuclear is “unique and special” and thus merits some special place in the pantheon of industry.  It isn't and it doesn't.

Our second problem relates to the authors' recasting of the nature of human error.  We decry the rush to judgment after negative events, particularly a search limited to identifying culpable humans.  The search for bad apples or outright criminals satisfies society's perceived need to bring someone to justice and the corporate system's desire to appear to fix things through management exhortations and training without really admitting systemic problems or changing anything substantive, e.g., the management incentive plan.  The authors' plea for more systemic analysis is thus welcome.

But they push the pendulum too far in the opposite direction.  They appear to advocate replacing all human errors (except for gross negligence, willful violations or sabotage) with systemic explanations, aka rationalizations.  What is never mentioned is that medical errors lead to tens of thousands of preventable deaths per year.****  In contrast, U.S. commercial aviation has not experienced over a hundred fatalities (excluding 9/11) since 1996; carriers and nuclear power plants experience accidents, but there are few fatalities.  At worst, this book is a denial that real human errors (including bad decisions, slip ups, impairments, coverups) occur and a rationalization of medical mistakes caused by arrogance, incompetence, class structure and lack of accountability.

This is a dense book, 250 pages of small print, with an index that is nearly useless.  Pressures (most likely cost and schedule) have apparently pushed publishing to the system boundary for copy editing—there are extra, missing and wrong words throughout the text.

This 2010 second edition updates the original 1994 monograph.  Many of the original ideas have been fleshed out elsewhere by the authors (primarily Dekker) and others.  Some references, e.g., Hollnagel, Perrow and the HRO school, should be read in their original form. 

*  D.D. Woods, S. Dekker, R. Cook, L. Johannesen and N. Sarter, Behind Human Error, 2d ed.  (Ashgate, Burlington, VT: 2010).  Thanks to Bill Mullins for bringing this book to our attention.

**  There is considerable overlap of the perspectives of the authors and the control theorists (Leveson and Rasmussen are cited in the book).  As an aside, Dekker was a dissertation advisor for one of Leveson's MIT students.

***  The authors' different backgrounds contribute to this mash-up.  Cook is a physician, Dekker is a pilot and some of Woods' cited publications refer to nuclear power (and aviation).

****  M. Makary, “How to Stop Hospitals From Killing Us,” Wall Street Journal online (Sept. 21, 2012).  Retrieved July 4, 2013.