A brief article* in the December 2015 issue of The Atlantic asks “What was VW thinking?” then reviews a few classic business cases to show how top management decisions, often the CEO’s, can percolate down through an organization, sometimes with appalling results. The author also describes a couple of mechanisms by which bad decision making can be institutionalized. We’ll start with the cases.
Johnson & Johnson had a long-standing credo that outlined its responsibilities to those who used its products. In 1979, the CEO reinforced the credo’s relevance to J&J’s operations. When poisoned Tylenol showed up in stores, J&J did not hesitate to recall the product, warn people against taking Tylenol and absorb a $100 million hit. This is often cited as an example of a corporation doing the right thing.
B. F. Goodrich promised an Air Force contractor an aircraft brake that was ultralight and ultracheap. The only problem was that it didn’t work; in fact, it melted. Only by massively finagling the test procedures and falsifying test results did the company get the brake qualified. The Air Force discovered the truth when it reviewed the raw test data. A Goodrich whistleblower announced his resignation over the incident but was quickly fired by the company.
Ford President Lee Iacocca wanted the Pinto to be light, inexpensive and available in 25 months. The gas tank’s position made the vehicle susceptible to fire when the car was rear-ended but repositioning the gas tank would have delayed the roll-out schedule. Ford delayed addressing the problem, resulting in at least one costly lawsuit and bad publicity for the company.
With respect to institutional mechanisms, the author reviews Diane Vaughan’s normalization of deviance and how it led to the space shuttle Challenger disaster. To promote efficiency, organizations adopt scripts that tell members how to handle various situations. Scripts provide a rationale for decisions, which can sometimes be the wrong decisions. In Vaughan’s view, scripts can “expand like an elastic waistband” to accommodate more and more deviation from standards or norms. Scripts are important organizational culture artifacts. We have often referred to Vaughan’s work on Safetymatters.
The author closes with a quote: “Culture starts at the top, . . . Employees will see through empty rhetoric and will emulate the nature of top-management decision making . . . ” The speaker? Andrew Fastow, Enron’s former CFO and former federal prison inmate.
Our Perspective
I used to use these cases when I was teaching ethics to business majors at a local university. Students would say they would never do any of the bad stuff. I said they probably would, especially once they had mortgages (today it’s student debt), families and career aspirations. It’s hard to put up a fight when the organization has so thoroughly accepted the script that its members actually believe they are doing the right thing. And don’t even think about being a whistleblower unless you’ve got money set aside and a good lawyer lined up.
Bottom line: This is worth a quick read. It illustrates the importance of senior management’s decisions as opposed to its sloganeering or other empty leadership behavior.
* J. Useem, “What Was Volkswagen Thinking? On the origins of corporate evil—and idiocy,” The Atlantic (Dec. 2015), pp.26-28.
Thursday, January 29, 2015
Safety Culture at Chevron’s Richmond, CA Refinery
The U.S. Chemical Safety and Hazard Investigation Board (CSB) released its final report* on the August 2012 fire at the Chevron refinery in Richmond, CA, caused by a leaking pipe. In the discussion around the CSB’s interim incident report (see our April 16, 2013 post), the agency’s chairman said Chevron’s safety culture (SC) appeared to be a factor in the incident. This post focuses on the final report findings related to the refinery’s SC.
During their investigation, the CSB learned that some personnel were uncomfortable working around the leaking pipe because of potential exposure to the flammable fluid. “Some individuals even recommended that the Crude Unit be shut down, but they left the final decision to the management personnel present. No one formally invoked their Stop Work Authority. In addition, Chevron safety culture surveys indicate that between 2008 and 2010, personnel had become less willing to use their Stop Work Authority. . . . there are a number of reasons why such a program may fail related to the ‘human factors’ issue of decision-making; these reasons include belief that the Stop Work decision should be made by someone else higher in the organizational hierarchy, reluctance to speak up and delay work progress, and fear of reprisal for stopping the job.” (pp. 12-13)
The report also mentioned decision making that favored continued production over safety. (p. 13) In the report’s details, the CSB described the refinery organization’s decisions to keep operating under questionable safety conditions as “normalization of deviance,” a term popularized by Diane Vaughan and familiar to Safetymatters readers. (p. 105)
The report included a detailed comparison of the refinery’s 2008 and 2010 SC surveys. In addition to the decrease in employees’ willingness to use their Stop Work Authority, surveyed operators and mechanics reported an increased belief that using such authority could get them into trouble (p. 108) and that equipment was not properly cared for. (p. 109)
Our Perspective
We like the CSB. They’re straight shooters and don’t mince words. While we are not big fans of SC surveys, the CSB’s analysis of Chevron’s SC surveys appears to show a deteriorating SC between 2008 and 2010.
Chevron says it agrees with some CSB findings; however, the company believes “the CSB has presented an inaccurate depiction of the Richmond Refinery’s current process safety culture.” Chevron says “In a third-party survey commissioned by Contra Costa County, when asked whether they feel free to use Stop Work Authority during any work activity, 93 percent of Chevron refinery workers responded favorably. The overall results for the process safety survey exceeded the survey taker’s benchmark for North American refineries.”** Who owns the truth here? The CSB? Chevron? Both?
In 2013, the city of Richmond adopted an Industrial Safety Ordinance (RISO) that requires Chevron to conduct SC assessments, preserve records and develop corrective actions. The CSB’s recommendations include beefing up the RISO to evaluate the quality of Chevron’s action items and their actual impact on SC. (p. 116)
Chevron continues to receive blowback from the incident. The refinery is the largest employer and taxpayer in Richmond. It’s not a company town but Chevron has historically had a lot of political sway in the city. That’s changed, at least for now. In the recent city council election, none of the candidates backed by Chevron was elected.***
As an aside, the CSB report referenced a 2010 study**** that found a sample of oil and gas workers directly intervened in only about 2 out of 5 of the unsafe acts they observed on the job. How diligent are you and your colleagues about calling out safety problems?
* CSB, “Final Investigation Report Chevron Richmond Refinery Pipe Rupture and Fire,” Report No. 2012-03-I-CA (Jan. 2015).
** M. Aldax, “Survey finds Richmond Refinery safety culture strong,” Richmond Standard (Jan. 29, 2015). Retrieved Jan. 29, 2015. The Richmond Standard is a website published by Chevron Richmond.
*** C. Jones, “Chevron’s $3 million backfires in Richmond election,” SFGate (Nov. 5, 2014). Retrieved Jan. 29, 2015.
**** R.D. Ragain, P. Ragain, Mike Allen and Michael Allen, “Study: Employees intervene in only 2 of 5 observed unsafe acts,” Drilling Contractor (Jan./Feb. 2011). Retrieved Jan. 29, 2015.
Monday, November 3, 2014
A Life In Error by James Reason
Most of us associate psychologist James Reason with the “Swiss Cheese Model” of defense in depth or possibly the notion of a “just culture.” But his career has encompassed more than two ideas; he has spent his professional life studying errors, their causes and contexts. A Life In Error* is an academic memoir, recounting his study of errors starting with the individual and ending up with the organization (the “system”), including its safety culture (SC). This post summarizes relevant portions of the book and provides our perspective. It is going to read like a sub-titled movie on fast-forward but there are a lot of particulars packed into this short (124 pgs.) book.
Slips and Mistakes
People make plans and take action; consequences follow. Errors occur when the intended goals are not achieved. The plan may be adequate but the execution faulty because of slips (absent-mindedness) or trips (clumsy actions). A plan that was inadequate to begin with is a mistake, which is usually more subtle than a slip and may go undetected for long periods of time if no obviously bad consequences occur. (pp. 10-12) A mistake is a creation of higher-level mental activity than a slip. Both slips and mistakes can take “strong but wrong” forms, where schema** that were effective in prior situations are selected even though they are not appropriate in the current situation.
Absent-minded slips can occur from misapplied competence where a planned routine is sidetracked into an unplanned one. Such diversions can occur, for instance, when one’s attention is unexpectedly diverted just as one reaches a decision point and multiple schema are both available and actively vying to be chosen. (pp. 21-25) Reason’s recipe for absent-minded errors is one part cognitive under-specification, e.g., insufficient knowledge, and one part the existence of an inappropriate response primed by prior, recent use and the situational conditions. (p. 49)
Planning Biases
The planning activity is subject to multiple biases. An individual planner’s database may be incomplete or shaped by past experiences rather than future uncertainties, with greater emphasis on past successes than failures. Planners can underestimate the influence of chance, overweight data that is emotionally charged, be overly influenced by their theories, misinterpret sample data or miss covariations, suffer hindsight bias or be overconfident.*** Once a plan is prepared, planners may focus only on confirmatory data and are usually resistant to changing the plan. Planning in a group is subject to “groupthink” problems including overconfidence, rationalization, self-censorship and an illusion of unanimity. (pp. 56-62)
Errors and Violations
Violations are deliberate acts to break rules or procedures, although bad outcomes are not generally intended. Violations arise from various motivational factors including the organizational culture. Types of violations include corner-cutting to avoid clumsy procedures, necessary violations to get the job done because the procedures are unworkable, adjustments to satisfy conflicting goals and one-off actions (such as turning off a safety system) when faced with exceptional circumstances. Violators perform a type of cost-benefit analysis biased by the fact that benefits are likely immediate while costs, if they occur, are usually somewhere in the future. In Reason’s view, the proper course for the organization is to increase the perceived benefits of compliance, not to increase the costs (penalties) for violations. (There is a hint of the “just culture” here.)
Organizational Accidents
Major accidents (TMI, Chernobyl, Challenger) have three common characteristics: contributing factors that were latent in the system, multiple levels of defense, and an unforeseen combination of latent factors and active failures (errors and/or violations) that defeated the defenses. This is the well-known Swiss Cheese Model with the active failures opening short-lived holes and latent failures creating longer-lasting but unperceived holes.
Organizational accidents are low frequency, high severity events with causes that may date back years. In contrast, individual accidents are more frequent but have limited consequences; they arise from slips, trips and lapses. This is why organizations can have a good industrial accident record while they are on the road to a large-scale disaster, e.g., BP at Texas City.
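To make the hole-alignment idea concrete, here is a toy Monte Carlo sketch of the Swiss Cheese Model. It is our own illustration, not from the book; the layer count and probabilities are made-up assumptions chosen only to show how several imperfect layers can each look reliable while still occasionally letting a trajectory through when latent and active holes line up.

```python
import random

# Hypothetical illustration of the Swiss Cheese Model (assumed numbers).
# Each defensive layer may have a "hole" on a given day: latent holes
# (design, maintenance, policy) are relatively common and persistent,
# active-failure holes (slips, violations) are rarer and short-lived.
# An organizational accident requires a hole in every layer at once.

LATENT_HOLE_PROB = 0.20   # assumed chance a latent weakness exists in a layer
ACTIVE_HOLE_PROB = 0.02   # assumed chance of an active failure in a layer
LAYERS = 4                # assumed number of defensive layers
DAYS = 100_000

def accident_today() -> bool:
    """True only if every layer happens to have a hole simultaneously."""
    for _ in range(LAYERS):
        hole = (random.random() < LATENT_HOLE_PROB
                or random.random() < ACTIVE_HOLE_PROB)
        if not hole:
            return False          # this layer blocked the accident trajectory
    return True                   # all holes aligned

accidents = sum(accident_today() for _ in range(DAYS))
print(f"{accidents} accidents in {DAYS} simulated days "
      f"({accidents / DAYS:.4%} per day)")
```

With these assumed numbers the aligned-hole rate works out to roughly a quarter of a percent of days, which is one way to see how an organization can post a good day-to-day record while latent conditions quietly accumulate.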
Organizational Culture
Certain industries, including nuclear power, have defenses-in-depth distributed throughout the system but are vulnerable to something that is equally widespread. According to Reason, “The most likely candidate is safety culture. It can affect all elements in a system for good or ill.” (p. 81) An inadequate SC can undermine the Swiss Cheese Model: there will be more active failures at the “sharp end”; more latent conditions created and sustained by management actions and policies, e.g., poor maintenance, inadequate equipment or downgrading training; and the organization will be reluctant to deal proactively with known problems. (pp. 82-83)
Reason describes a “cluster of organizational pathologies” that make an adverse system event more likely: blaming sharp-end operators, denying the existence of systemic inadequacies, and a narrow pursuit of production and financial goals. He goes on to list some of the drivers of blame and denial. The list includes: accepting human error as the root cause of an event; the hindsight bias; evaluating prior decisions based on their outcomes; shooting the messenger; belief in a bad apple but not a bad barrel (the system); failure to learn; a climate of silence; workarounds that compensate for systemic inadequacies; and normalization of deviance. (pp. 86-92) Whew!
Our Perspective
Reason teaches us that the essence of understanding errors is nuance. At one end of the spectrum, some errors are totally under the purview of the individual, at the other end they reside in the realm of the system. The biases and issues described by Reason are familiar to Safetymatters readers and echo in the work of Dekker, Hollnagel, Kahneman and others. We have been pounding the drum for a long time on the inadequacies of safety analyses that ignore systemic issues and corrective actions that are limited to individuals (e.g., more training and oversight, improved procedures and clearer expectations).
The book is densely packed with the work of a career. One could easily use the contents to develop a SC assessment or self-assessment. We did not report on the chapters covering research into absent-mindedness, Freud and medical errors (Reason’s current interest) but they are certainly worth reading.
Reason says this book is his valedictory: “I have nothing new to say and I’m well past my prime.” (p. 122) We hope not.
* J. Reason, A Life In Error: From Little Slips to Big Disasters (Burlington, VT: Ashgate, 2013).
** Knowledge structures in long-term memory. (p. 24)
*** This will ring familiar to readers of Daniel Kahneman. See our Dec. 18, 2013 post on Kahneman’s Thinking, Fast and Slow.
Monday, October 14, 2013
High Reliability Management by Roe and Schulman
This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market. CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world. In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.
The High Reliability Management (HRM) Model
The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3) System Volatility refers to the magnitude and rate of change of CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall). Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs.
System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each. All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid. Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes. Operating conditions are challenging: supply and demand are inherently unstable (p. 34); inadequate supply means some load cannot be served; and too much generation can damage the grid. (pp. 27, 142)
High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance. Some strategies can be substituted for others. It is a dynamic but manageable environment.
High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance. They run from pillar to post; it is highly stressful. Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trial and error. Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills. It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)
Low Volatility and Low Options mean generation and demand are not changing quickly. The critical feature here is demand has been reduced by load shedding. The operators have exhausted all other strategies for maintaining balance. It is a command-and-control approach, effected by declaring a Stage 3 grid situation and run using formal rules and procedures. It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished.
Low Volatility and High Options is an HRM's preferred mode. Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and there is no congestion on transmission lines, or backup routes are available. Procedures based on analyzed conditions exist and are used. There are few, if any, surprises. Learning can occur but it is incremental, the result of new methods or analysis. Performance is important and system behavior operates within a narrow bandwidth. Loss of attention (complacency) is a risk. Is this starting to sound familiar? This is the domain of High Reliability Organization (HRO) theory and practice. Nuclear power operations are an example of an HRO. (pp. 60-62)
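For readers who prefer a concrete rendering, here is a minimal sketch of the 2x2 matrix as a lookup table. The one-line mode summaries are our paraphrases of the descriptions above, not the authors' wording or code.

```python
from enum import Enum

class Level(Enum):
    LOW = "Low"
    HIGH = "High"

# (System Volatility, Network Options Variety) -> our shorthand for each mode.
HRM_MODES = {
    (Level.HIGH, Level.HIGH): "Dynamic but manageable: multiple substitutable strategies",
    (Level.HIGH, Level.LOW):  "Pillar to post: ad hoc workarounds, most unstable, untenable",
    (Level.LOW,  Level.LOW):  "Command and control (Stage 3): load shedding, formal rules",
    (Level.LOW,  Level.HIGH): "Preferred HRO-like mode: analyzed procedures, few surprises, complacency risk",
}

def management_mode(system_volatility: Level, network_options_variety: Level) -> str:
    """Return the operating management approach for a given cell of the matrix."""
    return HRM_MODES[(system_volatility, network_options_variety)]

# Example: nuclear plants work hard to stay in the low-volatility / high-options cell.
print(management_mode(Level.LOW, Level.HIGH))
```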
Lessons for Nuclear Operations
Nuclear plants work hard to stay in the Low Volatility/High Options mode. If they stray into the Low Options column, they run the risks of facing unanalyzed situations and regulatory non-compliance. (p. 62) In their effort to optimize performance in the desired mode, plants examine their performance risks at ever finer granularity through new methods and analyses. Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.** Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must be mindful of performance drift and nascent problems.
In an HRO, trial and error is not an acceptable method for trying out new options. No one wants cowboy operators in the control room. But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233) In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)
The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear. The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system. In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance. In a similar vein, CAISO has to worry about complacency in just one mode; for nuclear, it's effectively the only mode and complacency is an omnipresent threat. (p. 126)
The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is that overload might occur suddenly and with little or no warning.*** Anticipation and resilience are more obvious needs at CAISO but also necessary in nuclear operations. (pp. 5, 124)
Implications for Safety Culture
Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2) Both organizational types need a focus on operations as the central activity. Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.
The willingness to report errors should be encouraged but we have seen that is a thin spot in the SC at some plants. Errors can be a gateway into learning how to create more reliable performance and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220)
The simultaneous need to operate within a prescribed envelope while considering how the envelope might be breached has implications for SC. We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to groupthink and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?”
Conclusion
This review gives short shrift to the authors' broad and deep description and analysis of CAISO.**** The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.
The book describes a rare breed, the socio-technical-political start-up, and has too much scope for the average nuclear practitioner to plow through searching for newfound nuggets that can be applied to nuclear management. But it's a good read and full of insightful observations, e.g., the description of CAISO's early days (ca. 2001-2004), when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve, with operators reporting less confidence in their ability to manage the grid and accomplish the mission in 2004 than in 2001. (Ch. 5)
(High Reliability Management was recommended by a Safetymatters reader. If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)
* E. Roe and P. Schulman, High Reliability Management (Stanford, CA: Stanford Univ. Press, 2008). The book reports the authors' study of CAISO from 2001 through 2006.
** By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment. Their political and social environment is similarly stable: The NRC buffers them from direct interference by politicians although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.
The importance of considering low-probability, major consequence events is argued by Taleb (see here) and Dédale (see here).
*** Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits.
The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone. For more on Cynefin, see here and here.
**** For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO. As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)
Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits. This analysis illustrated how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as precursor to the need to go to Stage 3 command-and-control load shedding.
Saturday, July 6, 2013
Behind Human Error by Woods, Dekker, Cook, Johannesen and Sarter
This book* examines how errors occur in complex socio-technical systems. The authors' thesis is that behind every ascribed “human error” there is a “second story” of the context (conditions, demands, constraints, etc.) created by the system itself. “That which we label 'human error' after the fact is never the cause of an accident. Rather, it is the cumulative effect of multiple cognitive, collaborative, and organizational factors.” (p. 35) In other words, “Error is a symptom indicating the need to investigate the larger operational systems and the organizational context in which it functions.” (p. 28) This post presents a summary of the book followed by our perspective on its value. (The book has a lot of content so this will not be a short post.)
The Second Story
This section establishes the authors' view of error and how socio-technical systems function. They describe two mutually exclusive world views: (1) “erratic people degrade an otherwise safe system” vs. (2) “people create safety at all levels of the socio-technical system by learning and adapting . . .” (p. 6) It should be obvious that the authors favor option 2.
In such a world “Failure, then, represents breakdowns in adaptations directed at coping with complexity. Indeed, the enemy of safety is not the human: it is complexity.” (p. 1) “. . . accidents emerge from the coupling and interdependence of modern systems.” (p. 31)
Adaptation occurs in response to pressures or environmental changes. For example, systems are under stakeholder pressure to become faster, better, cheaper; multiple goals and goal conflict are regular complex system characteristics. But adaptation is not always successful. There may be too little (rules and procedures are followed even though conditions have changed) or too much (adaptation is attempted with insufficient information to achieve goals). Because of pressure, adaptations evolve toward performance boundaries, in particular, safety boundaries. There is a drift toward failure. (see Dekker, reviewed here)
The authors present 15 premises for analyzing errors in complex socio-technical systems. (pp. 19-30) Most are familiar but some are worth highlighting and remembering when thinking about system errors.
Complex Systems Failure
This section covers traditional mental models used for assessing failures and points out the putative inadequacies of each. The sequence-of-events (or domino) model is familiar Newtonian causal analysis. Man-made disaster theory puts company culture and institutional design at the heart of the safety question. Vulnerability develops over time but is hidden by the organization’s belief that it has risk under control. A system or component is driven into failure. The latent failure (or Swiss cheese) model proposes that “disasters are characterized by a concatenation of several small failures and contributing events. . .” (p. 50) While a practitioner may be closest to an accident, the associated latent failures were created by system managers, designers, maintainers or regulators. All these models reinforce the search for human error (someone untrained, inattentive or a “bad apple”) and the customary fixes (more training, procedure adherence and personal attention, or targeted discipline). They represent a failure to adopt systems thinking and concepts of dynamics, learning, adaptation and the notion that a system can produce accidents as a natural consequence of its normal functioning.
A more sophisticated set of models is then discussed. Perrow's normal accident theory says that “accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled.” (p. 61) Such systems structurally confuse operators and prevent them from recovering when incipient failure is discovered. People are part of the Perrowian system and can exhibit inadequate expertise. Control theory sees systems as composed of components that must be kept in dynamic equilibrium based on feedback and continual control inputs—basically a system dynamics view. Accidents are a result of normal system behavior and occur when components interact to violate safety constraints and the feedback (and control inputs) do not reflect the developing problems. Small changes in the system can lead to huge consequences elsewhere. Accident avoidance is based on making system performance boundaries explicit and known although the goal of efficiency will tend to push operations toward the boundaries. In contrast, the authors would argue for a different focus: making the system more resilient, i.e., error-tolerant.** High reliability theory describes how high-hazard activities can achieve safe performance through leadership, closed systems, functional decentralization, safety culture, redundancy and systematic learning. High reliability means minimal variations in performance, which, in the short term, means safe performance, but HROs are subject to incidents indicative of residual system noise and unseen changes from social forces, information management or new technologies. (See Weick, reviewed here)
Standing on the shoulders of the above sophisticated models, resilience engineering (RE) is proposed as a better way to think about safety. According to this model, accidents “represent the breakdowns in the adaptations necessary to cope with the real world complexity.” (p. 83) The authors use the Columbia space shuttle disaster to illustrate patterns of failure evident in complex systems: drift toward failure, past success as reason for continued confidence, fragmented problem-solving, ignoring new evidence and intra-organizational communication breakdowns. To oppose or compensate for these patterns, RE proposes monitoring or enhancing other system properties including: buffering capacity, flexibility, margin and tolerance (which means replacing quick collapse with graceful degradation). RE “focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment.” (p. 93) In practice, that means detecting signs of increasing risk, having resources for safety available, and recognizing when and where to invest to offset risk. It also requires focusing on organizational decision making, e.g., cross checks for risky decisions, the safety-production-efficiency balance and the reporting and disposition of safety concerns. “Enhancing error tolerance, detection and recovery together produce safety.” (p. 26)
Operating at the Sharp End
An organization's sharp end is where practitioners apply their expertise in an effort to achieve the organization's goals. The blunt end is where support functions, from administration to engineering, work. The blunt end designs the system, the sharp end operates it. Practitioner performance is affected by cognitive activities in three areas: activation of knowledge, the flow of attention and interactions among multiple goals.
The knowledge available to practitioners arrives as organized content. Challenges include: the organization of that content may be poor, the content may be incomplete, or it may be simply wrong. Practitioner mental models may be inaccurate or incomplete without the practitioners realizing it, i.e., they may be poorly calibrated. Knowledge may be inert, i.e., not accessed when it is needed. Oversimplifications (heuristics) may work in some situations but produce errors in others and limit the practitioner's ability to account for uncertainties or conflicts that arise in individual cases. The discussion of heuristics suggests Hollnagel, reviewed here.
“Mindset is about attention and its control.” (p. 114) Attention is a limited resource. Problems with maintaining effective attention include loss of situational awareness, in which the practitioner's mental model of events doesn't match the real world, and fixation, where the practitioner's initial assessment of a situation creates a going-forward bias against accepting discrepant data and a failure to trigger relevant inert knowledge. Mindset seems similar to HRO mindfulness. (see Weick)
Goal conflict can arise from many sources including management policies, regulatory requirements, economic (cost) factors and risk of legal liability. Decision making must consider goals (which may be implicit), values, costs and risks—which may be uncertain. Normalization of deviance is a constant threat. Decision makers may be held responsible for achieving a goal but lack the authority to do so. The conflict between cost and safety may be subtle or unrecognized. “Safety is not a concrete entity and the argument that one should always choose the safest path misrepresents the dilemmas that confront the practitioner.” (p. 139) “[I]t is difficult for many organizations (particularly in regulated industries) to admit that goal conflicts and tradeoff decisions arise.” (p. 139) Overall, the authors present a good discussion of goal conflict.
How Design Can Induce Error
The design of computerized devices intended to help practitioners can instead lead to greater risks of errors and incidents. Specific causes of problems include clumsy automation, limited information visibility and mode errors.
Automation is supposed to increase user effectiveness and efficiency. However, clumsy automation creates situations where the user loses track of what the computer is set up to do, what it's doing and what it will do next. If support systems are so flexible that users can't know all their possible configurations, they adopt simplifying strategies which may be inappropriate in some cases. Clumsy automation leads to more (instead of less) cognitive work, user attention is diverted to the machine instead of the task, increased potential for new kinds of errors and the need for new user knowledge and judgments. The machine effectively has its own model of the world, based on user inputs, data sensors and internal functioning, and passes that back to the user.
Machines often hide a mass of data behind a narrow keyhole of visibility into the system. Successful design creates “a visible conceptual space meaningfully related to activities and constraints in a field of practice.” (p. 162) In addition, “Effective representations highlight 'operationally interesting' changes for sequences of behavior . . .” (p. 167) However, default displays typically do not make interesting events directly visible.
Mode errors occur when an operator initiates an action that would be appropriate if the machine were in mode A but, in fact, it's in mode B. (This may be a man-machine problem but it's not the machine's fault.) A machine can change modes based on situational and system factors in addition to operator input. Operators have to maintain mode awareness, not an easy task when viewing a small, cluttered display that may not highlight current mode or mode changes.
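To make the mode-error pattern concrete, here is a toy sketch of one common countermeasure: the interface refuses and flags a command that is not meaningful in the machine's current mode instead of silently reinterpreting it. The controller, modes and command names are hypothetical, invented for illustration; they are not drawn from the book.

```python
from enum import Enum, auto

class Mode(Enum):
    MANUAL = auto()
    AUTO = auto()

class ModeError(Exception):
    """Raised when a command is issued in a mode where it is not meaningful."""

# Hypothetical device: each command is only accepted in the listed modes.
VALID_MODES = {
    "set_target": {Mode.AUTO},       # a setpoint only makes sense under automation
    "nudge_output": {Mode.MANUAL},   # direct manipulation only makes sense in manual
}

class Controller:
    def __init__(self) -> None:
        self.mode = Mode.MANUAL      # the machine may later change mode on its own

    def execute(self, command: str) -> None:
        allowed = VALID_MODES.get(command, set())
        if self.mode not in allowed:
            # Surface the mismatch to the operator rather than guessing intent.
            raise ModeError(f"'{command}' issued in {self.mode.name}, "
                            f"valid only in {[m.name for m in allowed]}")
        print(f"Executing {command} in {self.mode.name}")

c = Controller()
c.execute("nudge_output")            # fine: controller is in MANUAL
try:
    c.execute("set_target")          # classic mode error: operator assumes AUTO
except ModeError as e:
    print("Mode error caught:", e)
```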
To cope with bad design “practitioners adapt information technology provided for them to the immediate tasks at hand in a locally pragmatic way, . . .” (p. 191) They use system tailoring where they adapt the device, often by focusing on a feature set they consider useful and ignoring other machine capabilities. They use task tailoring where they adapt strategies to accommodate constraints imposed by the new technology. Both types of adaptation can lead to success or eventual failures.
The authors suggest various countermeasures and design changes to address these problems.
Reactions to Failure
Different approaches for analyzing accidents lead to different perspectives on human error.
Hindsight bias is “the tendency for people to 'consistently exaggerate what could have been anticipated in foresight.'” (p. 15) It reinforces the tendency to look for the human in the human error. Operators are blamed for bad outcomes because they are available, tracking back to multiple contributing causes is difficult, most system performance is good and investigators tend to judge process quality by its outcome. Outsiders tend to think operators knew more about their situation than they actually did. Evaluating process instead of outcome is also problematic. Process and outcome are loosely coupled and what standards should be used for process evaluation? Formal work descriptions “underestimate the dilemmas, interactions between constraints, goal conflicts, and tradeoffs present in the actual workplace.” (p. 208) A suggested alternative approach is to ask what other practitioners would have done in the same situation and build a set of contrast cases. “What we should not do, . . . is rely on putatively objective external evaluations . . . such as . . . court cases or other formal hearings. Such processes in fact institutionalize and legitimate the hindsight bias . . . leading to blame and a focus on individual actors at the expense of a system view.” (pp. 213-214)
Distancing through differencing is another risk. In this practice, reviewers focus on differences between the context surrounding an accident and their own circumstance. Blaming individuals reinforces belief that there are no lessons to be learned for other organizations. If human error is local and individual (as opposed to systemic) then sanctions, exhortations to follow the procedures and remedial training are sufficient fixes. There is a decent discussion of TMI here, where, in the authors' opinion, the initial sense of fundamental surprise and need for socio-technical fixes was soon replaced by a search for local, technologically-focused solutions.
There is often pressure to hold people accountable after incidents or accidents. One answer is a “just culture” which views incidents as system learning opportunities but also draws a line between acceptable and unacceptable behavior. Since the “line” is an attribution the key question for any organization is who gets to draw it. Another challenge is defining the discretionary space where individuals alone have the authority to decide how to proceed. There is more on just culture but this is all (or mostly) Dekker. (see our Just Culture commentary here)
The authors' recommendations for analyzing errors and improving safety can be summed up as follows: recognize that human error is an attribution; pursue second stories that reveal the multiple, systemic contributors to failure; avoid hindsight bias; understand how work really gets done; search for systemic vulnerabilities; study how practice creates safety; search for underlying patterns; examine how change will produce new vulnerabilities; use technology to enhance human expertise; and tame complexity. (p. 239) “Safety is created at the sharp end as practitioners interact with hazardous processes . . . using the available tools and resources.” (p. 243)
Our Perspective
This is a book about organizational characteristics and socio-technical systems. Recommendations and advice are aimed at organizational policy makers and incident investigators. The discussion of a “just culture” is the only time culture is discussed in detail although safety culture is mentioned in passing in the HRO write-up.
Our first problem with the book is repeatedly referring to medicine, aviation, aircraft carrier operations and nuclear power plants as complex systems.*** Although medicine is definitely complex and aviation (including air traffic control) possibly is, carrier operations and nuclear power plants are simply complicated. While carrier and nuclear personnel have to make some adaptations on the fly, they do not face sudden, disruptive changes in their technologies or operating environments and they are not exposed to cutthroat competition. Their operations are tightly coordinated but, where possible, by design more loosely coupled to facilitate recovery if operations start to go sour. In addition, calling nuclear power operations complex perpetuates the myth that nuclear is “unique and special” and thus merits some special place in the pantheon of industry. It isn't and it doesn't.
Our second problem relates to the authors' recasting of the nature of human error. We decry the rush to judgment after negative events, particularly a search limited to identifying culpable humans. The search for bad apples or outright criminals satisfies society's perceived need to bring someone to justice and the corporate system's desire to appear to fix things through management exhortations and training without really admitting systemic problems or changing anything substantive, e.g., the management incentive plan. The authors' plea for more systemic analysis is thus welcome.
But they push the pendulum too far in the opposite direction. They appear to advocate replacing all human errors (except for gross negligence, willful violations or sabotage) with systemic explanations, aka rationalizations. What is never mentioned is that medical errors lead to tens of thousands of preventable deaths per year.**** In contrast, U.S. commercial aviation has not experienced over a hundred fatalities (excluding 9/11) since 1996; carriers and nuclear power plants experience accidents, but there are few fatalities. At worst, this book is a denial that real human errors (including bad decisions, slip ups, impairments, coverups) occur and a rationalization of medical mistakes caused by arrogance, incompetence, class structure and lack of accountability.
This is a dense book, 250 pages of small print, with an index that is nearly useless. Pressures (most likely cost and schedule) have apparently pushed publishing to the system boundary for copy editing—there are extra, missing and wrong words throughout the text.
This 2010 second edition updates the original 1994 monograph. Many of the original ideas have been fleshed out elsewhere by the authors (primarily Dekker) and others. Some references, e.g., Hollnagel, Perrow and the HRO school, should be read in their original form.
* D.D. Woods, S. Dekker, R. Cook, L. Johannesen and N. Sarter, Behind Human Error, 2d ed. (Ashgate, Burlington, VT: 2010). Thanks to Bill Mullins for bringing this book to our attention.
** There is considerable overlap of the perspectives of the authors and the control theorists (Leveson and Rasmussen are cited in the book). As an aside, Dekker was a dissertation advisor for one of Leveson's MIT students.
*** The authors' different backgrounds contribute to this mash-up. Cook is a physician, Dekker is a pilot and some of Woods' cited publications refer to nuclear power (and aviation).
**** M. Makary, “How to Stop Hospitals From Killing Us,” Wall Street Journal online (Sept. 21, 2012). Retrieved July 4, 2013.
The Second Story
This section establishes the authors' view of error and how socio-technical systems function. They describe two mutually exclusive world views: (1) “erratic people degrade an otherwise safe system” vs. (2) “people create safety at all levels of the socio-technical system by learning and adapting . . .” (p. 6) It should be obvious that the authors favor option 2.
In such a world “Failure, then, represents breakdowns in adaptations directed at coping with complexity. Indeed, the enemy of safety is not the human: it is complexity.” (p. 1) “. . . accidents emerge from the coupling and interdependence of modern systems.” (p. 31)
Adaptation occurs in response to pressures or environmental changes. For example, systems are under stakeholder pressure to become faster, better, cheaper; multiple goals and goal conflict are regular complex system characteristics. But adaptation is not always successful. There may be too little (rules and procedures are followed even though conditions have changed) or too much (adaptation is attempted with insufficient information to achieve goals). Because of pressure, adaptations evolve toward performance boundaries, in particular, safety boundaries. There is a drift toward failure. (see Dekker, reviewed here)
The authors present 15 premises for analyzing errors in complex socio-technical systems. (pp. 19-30) Most are familiar but some are worth highlighting and remembering when thinking about system errors:
- “There is a loose coupling between process and outcome.” A “bad” process does not always produce bad outcomes and a “good” process does not always produce good outcomes.
- “Knowledge of outcome (hindsight) biases judgments about process.” More about that later.
- “Lawful factors govern the types of erroneous actions or assessments to be expected.” In other words, “errors are regular and predictable consequences of a variety of factors.”
- “The design of artifacts affects the potential for erroneous actions and paths towards disaster.” This is Human Factors 101 but problems still arise. “Increased coupling increases the cognitive demands on practitioners.” Increased coupling plus weak feedback can create a latent failure.
Complex Systems Failure
This section covers traditional mental models used for assessing failures and points out the putative inadequacies of each. The sequence-of-events (or domino) model is familiar Newtonian causal analysis. Man-made disaster theory puts company culture and institutional design at the heart of the safety question. Vulnerability develops over time but is hidden by the organization’s belief that it has risk under control. A system or component is driven into failure. The latent failure (or Swiss cheese) model proposes that “disasters are characterized by a concatenation of several small failures and contributing events. . .” (p. 50) While a practitioner may be closest to an accident, the associated latent failures were created by system managers, designers, maintainers or regulators. All these models reinforce the search for human error (someone untrained, inattentive or a “bad apple”) and the customary fixes (more training, procedure adherence and personal attention, or targeted discipline). They represent a failure to adopt systems thinking and concepts of dynamics, learning, adaptation and the notion that a system can produce accidents as a natural consequence of its normal functioning.
A more sophisticated set of models is then discussed. Perrow's normal accident theory says that “accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled.” (p. 61) Such systems structurally confuse operators and prevent them from recovering when incipient failure is discovered. People are part of the Perrowian system and can exhibit inadequate expertise. Control theory sees systems as composed of components that must be kept in dynamic equilibrium based on feedback and continual control inputs—basically a system dynamics view. Accidents are a result of normal system behavior and occur when components interact to violate safety constraints and the feedback (and control inputs) do not reflect the developing problems. Small changes in the system can lead to huge consequences elsewhere. Accident avoidance is based on making system performance boundaries explicit and known, although the goal of efficiency will tend to push operations toward the boundaries. In contrast, the authors would argue for a different focus: making the system more resilient, i.e., error-tolerant.** High reliability theory describes how high-hazard activities can achieve safe performance through leadership, closed systems, functional decentralization, safety culture, redundancy and systematic learning. High reliability means minimal variations in performance, which, in the short term, means safe performance, but HROs are subject to incidents indicative of residual system noise and unseen changes from social forces, information management or new technologies. (See Weick, reviewed here)
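To make the control theory view concrete, here is a minimal sketch in Python (our illustration, not the authors'; the thresholds and the "report" functions are invented). Production pressure pushes a process variable up each step and the controller backs off only when the reported value looks too high, so a feedback channel that understates the developing problem lets the system drift across its safety constraint even though every component behaves normally.

```python
def simulate(report):
    """Toy control loop: production pressure pushes 'load' up each step; the
    controller backs off only when the *reported* load looks too high.  If the
    feedback understates the developing problem, the safety constraint is
    violated even though each component behaves normally."""
    load = 50.0
    for step in range(1, 300):
        load += 1.0                           # steady push for more throughput
        if report(load) > 80.0:               # control input driven by feedback
            load -= 3.0                       # corrective action
        if load >= 100.0:                     # safety constraint
            return f"constraint violated at step {step}"
    return "stayed within bounds"

print(simulate(report=lambda x: x))           # faithful feedback: stays in bounds
print(simulate(report=lambda x: min(x, 75)))  # feedback blind above 75: drifts into failure
```

The failure emerges from the interaction of normal behavior and weak feedback, not from any broken part—which is the control theorists' point.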
Standing on the shoulders of the above sophisticated models, resilience engineering (RE) is proposed as a better way to think about safety. According to this model, accidents “represent the breakdowns in the adaptations necessary to cope with the real world complexity.” (p. 83) The authors use the Columbia space shuttle disaster to illustrate patterns of failure evident in complex systems: drift toward failure, past success as reason for continued confidence, fragmented problem-solving, ignoring new evidence and intra-organizational communication breakdowns. To oppose or compensate for these patterns, RE proposes monitoring or enhancing other system properties including: buffering capacity, flexibility, margin and tolerance (which means replacing quick collapse with graceful degradation). RE “focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment.” (p. 93) In practice, that means detecting signs of increasing risk, having resources for safety available, and recognizing when and where to invest to offset risk. It also requires focusing on organizational decision making, e.g., cross checks for risky decisions, the safety-production-efficiency balance and the reporting and disposition of safety concerns. “Enhancing error tolerance, detection and recovery together produce safety.” (p. 26)
Operating at the Sharp End
An organization's sharp end is where practitioners apply their expertise in an effort to achieve the organization's goals. The blunt end is where support functions, from administration to engineering, work. The blunt end designs the system, the sharp end operates it. Practitioner performance is affected by cognitive activities in three areas: activation of knowledge, the flow of attention and interactions among multiple goals.
The knowledge available to practitioners arrives as organized content, but the organization may be poor and the content incomplete or simply wrong. Practitioner mental models may be inaccurate or incomplete without the practitioners realizing it, i.e., they may be poorly calibrated. Knowledge may be inert, i.e., not accessed when it is needed. Oversimplifications (heuristics) may work in some situations but produce errors in others and limit the practitioner's ability to account for uncertainties or conflicts that arise in individual cases. The discussion of heuristics suggests Hollnagel, reviewed here.
“Mindset is about attention and its control.” (p. 114) Attention is a limited resource. Problems with maintaining effective attention include loss of situational awareness, in which the practitioner's mental model of events doesn't match the real world, and fixation, where the practitioner's initial assessment of a situation creates a going-forward bias against accepting discrepant data and a failure to trigger relevant inert knowledge. Mindset seems similar to HRO mindfulness. (see Weick)
Goal conflict can arise from many sources including management policies, regulatory requirements, economic (cost) factors and risk of legal liability. Decision making must consider goals (which may be implicit), values, costs and risks—which may be uncertain. Normalization of deviance is a constant threat. Decision makers may be held responsible for achieving a goal but lack the authority to do so. The conflict between cost and safety may be subtle or unrecognized. “Safety is not a concrete entity and the argument that one should always choose the safest path misrepresents the dilemmas that confront the practitioner.” (p. 139) “[I]t is difficult for many organizations (particularly in regulated industries) to admit that goal conflicts and tradeoff decisions arise.” (p. 139) Overall, the authors present a good discussion of goal conflict.
How Design Can Induce Error
The design of computerized devices intended to help practitioners can instead lead to greater risks of errors and incidents. Specific causes of problems include clumsy automation, limited information visibility and mode errors.
Automation is supposed to increase user effectiveness and efficiency. However, clumsy automation creates situations where the user loses track of what the computer is set up to do, what it's doing and what it will do next. If support systems are so flexible that users can't know all their possible configurations, they adopt simplifying strategies which may be inappropriate in some cases. Clumsy automation leads to more (instead of less) cognitive work, diverts user attention to the machine instead of the task, increases the potential for new kinds of errors and creates the need for new user knowledge and judgments. The machine effectively has its own model of the world, based on user inputs, data sensors and internal functioning, and passes that back to the user.
Machines often hide a mass of data behind a narrow keyhole of visibility into the system. Successful design creates “a visible conceptual space meaningfully related to activities and constraints in a field of practice.” (p. 162) In addition, “Effective representations highlight 'operationally interesting' changes for sequences of behavior . . .” (p. 167) However, default displays typically do not make interesting events directly visible.
Mode errors occur when an operator initiates an action that would be appropriate if the machine were in mode A but, in fact, it's in mode B. (This may be a man-machine problem but it's not the machine's fault.) A machine can change modes based on situational and system factors in addition to operator input. Operators have to maintain mode awareness, not an easy task when viewing a small, cluttered display that may not highlight current mode or mode changes.
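A mode error is easy to sketch in code. The toy device below is our own illustration (the modes and commands are invented, not taken from the book): the same operator input means different things depending on the current mode, and the mode can change without any operator action.

```python
class InfusionPanel:
    """Toy device with two modes; the same key does different things in each.
    Purely illustrative -- the modes and commands are invented for this sketch."""
    def __init__(self):
        self.mode = "RATE"       # the operator believes this is the current mode
        self.rate = 10
        self.volume = 100

    def auto_switch(self):
        # The machine can change modes on its own (here, called explicitly to
        # stand in for a timeout or system-driven transition).
        self.mode = "VOLUME"

    def press_increase(self):
        # One control, two meanings -- which one applies depends on the mode.
        if self.mode == "RATE":
            self.rate += 1
        else:
            self.volume += 10

panel = InfusionPanel()
panel.auto_switch()          # mode changes; a cluttered display may not show it
panel.press_increase()       # operator intends "raise rate"; device raises volume
print(panel.mode, panel.rate, panel.volume)   # VOLUME 10 110 -- a classic mode error
```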
To cope with bad design, “practitioners adapt information technology provided for them to the immediate tasks at hand in a locally pragmatic way, . . .” (p. 191) They use system tailoring, where they adapt the device, often by focusing on a feature set they consider useful and ignoring other machine capabilities. They use task tailoring, where they adapt their strategies to accommodate constraints imposed by the new technology. Both types of adaptation can lead to success or to eventual failure.
The authors suggest various countermeasures and design changes to address these problems.
Reactions to Failure
Different approaches for analyzing accidents lead to different perspectives on human error.
Hindsight bias is “the tendency for people to 'consistently exaggerate what could have been anticipated in foresight.'” (p. 15) It reinforces the tendency to look for the human in the human error. Operators are blamed for bad outcomes because they are available, because tracking back to multiple contributing causes is difficult, because most system performance is good, and because investigators tend to judge process quality by its outcome. Outsiders tend to think operators knew more about their situation than they actually did. Evaluating process instead of outcome is also problematic: process and outcome are only loosely coupled, and what standards should be used to evaluate a process? Formal work descriptions “underestimate the dilemmas, interactions between constraints, goal conflicts, and tradeoffs present in the actual workplace.” (p. 208) A suggested alternative approach is to ask what other practitioners would have done in the same situation and build a set of contrast cases. “What we should not do, . . . is rely on putatively objective external evaluations . . . such as . . . court cases or other formal hearings. Such processes in fact institutionalize and legitimate the hindsight bias . . . leading to blame and a focus on individual actors at the expense of a system view.” (pp. 213-214)
Distancing through differencing is another risk. In this practice, reviewers focus on differences between the context surrounding an accident and their own circumstance. Blaming individuals reinforces belief that there are no lessons to be learned for other organizations. If human error is local and individual (as opposed to systemic) then sanctions, exhortations to follow the procedures and remedial training are sufficient fixes. There is a decent discussion of TMI here, where, in the authors' opinion, the initial sense of fundamental surprise and need for socio-technical fixes was soon replaced by a search for local, technologically-focused solutions.
There is often pressure to hold people accountable after incidents or accidents. One answer is a “just culture” which views incidents as system learning opportunities but also draws a line between acceptable and unacceptable behavior. Since the “line” is an attribution the key question for any organization is who gets to draw it. Another challenge is defining the discretionary space where individuals alone have the authority to decide how to proceed. There is more on just culture but this is all (or mostly) Dekker. (see our Just Culture commentary here)
The authors' recommendations for analyzing errors and improving safety can be summed up as follows: recognize that human error is an attribution; pursue second stories that reveal the multiple, systemic contributors to failure; avoid hindsight bias; understand how work really gets done; search for systemic vulnerabilities; study how practice creates safety; search for underlying patterns; examine how change will produce new vulnerabilities; use technology to enhance human expertise; and tame complexity. (p. 239) “Safety is created at the sharp end as practitioners interact with hazardous processes . . . using the available tools and resources.” (p. 243)
Our Perspective
This is a book about organizational characteristics and socio-technical systems. Recommendations and advice are aimed at organizational policy makers and incident investigators. The discussion of a “just culture” is the only time culture is discussed in detail although safety culture is mentioned in passing in the HRO write-up.
Our first problem with the book is repeatedly referring to medicine, aviation, aircraft carrier operations and nuclear power plants as complex systems.*** Although medicine is definitely complex and aviation (including air traffic control) possibly is, carrier operations and nuclear power plants are simply complicated. While carrier and nuclear personnel have to make some adaptations on the fly, they do not face sudden, disruptive changes in their technologies or operating environments and they are not exposed to cutthroat competition. Their operations are tightly coordinated but, where possible, by design more loosely coupled to facilitate recovery if operations start to go sour. In addition, calling nuclear power operations complex perpetuates the myth that nuclear is “unique and special” and thus merits some special place in the pantheon of industry. It isn't and it doesn't.
Our second problem relates to the authors' recasting of the nature of human error. We decry the rush to judgment after negative events, particularly a search limited to identifying culpable humans. The search for bad apples or outright criminals satisfies society's perceived need to bring someone to justice and the corporate system's desire to appear to fix things through management exhortations and training without really admitting systemic problems or changing anything substantive, e.g., the management incentive plan. The authors' plea for more systemic analysis is thus welcome.
But they push the pendulum too far in the opposite direction. They appear to advocate replacing all human errors (except for gross negligence, willful violations or sabotage) with systemic explanations, aka rationalizations. What is never mentioned is that medical errors lead to tens of thousands of preventable deaths per year.**** In contrast, U.S. commercial aviation has not experienced over a hundred fatalities (excluding 9/11) since 1996; carriers and nuclear power plants experience accidents, but there are few fatalities. At worst, this book is a denial that real human errors (including bad decisions, slip ups, impairments, coverups) occur and a rationalization of medical mistakes caused by arrogance, incompetence, class structure and lack of accountability.
This is a dense book, 250 pages of small print, with an index that is nearly useless. Pressures (most likely cost and schedule) have apparently pushed publishing to the system boundary for copy editing—there are extra, missing and wrong words throughout the text.
This 2010 second edition updates the original 1994 monograph. Many of the original ideas have been fleshed out elsewhere by the authors (primarily Dekker) and others. Some references, e.g., Hollnagel, Perrow and the HRO school, should be read in their original form.
* D.D. Woods, S. Dekker, R. Cook, L. Johannesen and N. Sarter, Behind Human Error, 2d ed. (Ashgate, Burlington, VT: 2010). Thanks to Bill Mullins for bringing this book to our attention.
** There is considerable overlap of the perspectives of the authors and the control theorists (Leveson and Rasmussen are cited in the book). As an aside, Dekker was a dissertation advisor for one of Leveson's MIT students.
*** The authors' different backgrounds contribute to this mash-up. Cook is a physician, Dekker is a pilot and some of Woods' cited publications refer to nuclear power (and aviation).
**** M. Makary, “How to Stop Hospitals From Killing Us,” Wall Street Journal online (Sept. 21, 2012). Retrieved July 4, 2013.
Thursday, December 20, 2012
The Logic of Failure by Dietrich Dörner
This book was mentioned in a nuclear safety discussion forum so we figured this is a good time to revisit Dörner's 1989 tome.* Below we provide a summary of the book followed by our assessment of how it fits into our interest in decision making and the use of simulations in training.
Dörner's work focuses on why people fail to make good decisions when faced with problems and challenges. In particular, he is interested in the psychological needs and coping mechanisms people exhibit. His primary research method is observing test subjects interact with simulation models of physical sub-worlds, e.g., a malfunctioning refrigeration unit, an African tribe of subsistence farmers and herdsmen, or a small English manufacturing city. He applies his lessons learned to real situations, e.g., the Chernobyl nuclear plant accident.
He proposes a multi-step process for improving decision making in complicated situations, then describes each step in detail and the problems people can create for themselves while executing it. These problems generally consist of tactics people adopt to preserve their sense of competence and control at the expense of successfully achieving overall objectives. Although the steps are discussed in series, he recognizes that, at any point, one may have to loop back through a previous step.
Goal setting
Goals should be concrete and specific to guide future steps. The relationships between and among goals should be specified, including dependencies, conflicts and relative importance. When people don't do this, they can become distracted by obvious or unimportant (although potentially achievable) goals, or by peripheral issues they know how to address rather than the important issues that should be resolved. Facing performance failure, they may attempt to turn failure into success with doublespeak or blame unseen forces.
Formulate models and gather information
Good decision-making requires an adequate mental model of the system being studied—the variables that comprise the system and the functional relationships among them, which may include positive and negative feedback loops. The model's level of detail should be sufficient to understand the interrelationships among the variables the decision maker wants to influence. Unsuccessful test subjects were inclined to use a “reductive hypothesis,” which unreasonably reduces the model to a single key variable, or to overgeneralize.
Information gathered is almost always incomplete and the decision maker has to decide when he has enough to proceed. The more successful test subjects asked more questions and made fewer decisions (than the less successful subjects) in the early time periods of the sim.
Predict and extrapolate
Once a model is formulated, the decision maker must attempt to determine how the values of variables will change over time in response to his decisions or internal system dynamics. One problem is predicting that outputs will change in a linear fashion even as the evidence grows for a non-linear, e.g., exponential, function. An exponential variable may suddenly grow dramatically then equally suddenly reverse course when the limits on growth (resources) are reached. Internal time delays mean that the effects of a decision are not visible until some time in the future. Faced with poor results, unsuccessful test subjects implement or exhibit “massive countermeasures, ad hoc hypotheses that ignore the actual data, underestimations of growth processes, panic reactions, and ineffectual frenetic activity.” (p. 152) Successful subjects made an effort to understand the system's dynamics, kept notes (history) on system performance and tried to anticipate what would happen in the future.
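Dörner's point about extrapolation lends itself to a quick worked example. The sketch below is ours, not his (the 10 percent growth rate and the time horizon are arbitrary assumptions): a decision maker who projects the last observed increment forward linearly falls steadily further behind an exponential process.

```python
# Linear extrapolation of an exponential process -- a toy illustration of
# Dörner's point, with an arbitrary 10% growth rate and 10-step horizon.
actual = [100.0]
for _ in range(10):
    actual.append(actual[-1] * 1.10)          # true process grows 10% per step

last_increment = actual[1] - actual[0]        # the increment the decision maker "saw"
linear_guess = [100.0 + last_increment * t for t in range(11)]

for t in (5, 10):
    print(f"step {t}: actual {actual[t]:.0f}, linear guess {linear_guess[t]:.0f}")
# step 5: actual 161, linear guess 150
# step 10: actual 259, linear guess 200
```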
Plan and execute actions, check results and adjust strategy
“The essence of planning is to think through the consequences of certain actions and see whether those actions will bring us closer to our desired goal.” (p. 153) Easier said than done in an environment of too many alternative courses of action and too little time. In rapidly evolving situations, it may be best to create rough plans and delegate as many implementing decisions as possible to subordinates. A major risk is thinking that planning has been so complete that the unexpected cannot occur. A related risk is the reflexive use of historically successful strategies. “As at Chernobyl, certain actions carried out frequently in the past, yielding only the positive consequences of time and effort saved and incurring no negative consequences, acquire the status of an (automatically applied) ritual and can contribute to catastrophe.” (p. 172)
In the sims, unsuccessful test subjects often exhibited “ballistic” behavior—they implemented decisions but paid no attention to, i.e., did not learn from, the results. Successful subjects watched for the effects of their decisions, made adjustments and learned from their mistakes.
Dörner identified several characteristics of people who tended to end up in a failure situation. They failed to formulate their goals, didn't recognize goal conflict or set priorities, and didn't correct their errors. (p. 185) Their ignorance of interrelationships among system variables and the longer-term repercussions of current decisions set the stage for ultimate failure.
Assessment
Dörner's insights and models have informed our thinking about human decision-making behavior in demanding, complicated situations. His use and promotion of simulation models as learning tools was one starting point for Bob Cudlin's work in developing a nuclear management training simulation program. Like Dörner, we see simulation as a powerful tool to “observe and record the background of planning, decision making, and evaluation processes that are usually hidden.” (pp. 9-10)
However, this book does not cover the entire scope of our interests. Dörner is a psychologist interested in individuals; group behavior is beyond his range. He alludes to normalization of deviance but his references appear limited to the flouting of safety rules rather than a more pervasive process of slippage. More importantly, he does not address behavior that arises from the system itself, in particular adaptive behavior as an open system reacts to and interacts with its environment.
From our view, Dörner's suggestions may help the individual decision maker avoid common pitfalls and achieve locally optimum answers. On the downside, following Dörner's prescription might lead the decision maker to an unjustified confidence in his overall system management abilities. In a truly complex system, no one knows how the entire assemblage works. It's sobering to note that even in Dörner's closed,** relatively simple models many test subjects still had a hard time developing a reasonable mental model, and some failed completely.
This book is easy to read and Dörner's insights into the psychological traps that limit human decision making effectiveness remain useful.
* D. Dörner, The Logic of Failure: Recognizing and Avoiding Error in Complex Situations, trans. R. and R. Kimber (Reading, MA: Perseus Books, 1998). Originally published in German in 1989.
** One simulation model had an external input.