
Tuesday, November 17, 2015

Foolproof by Greg Ip: Insights for the Nuclear Industry

This book* is primarily about systemic lessons learned from the 2008 U.S. financial crisis and, to a lesser extent, various European euro crises. Some of the author’s observations also apply to the nuclear industry.

Ip’s overarching thesis is that steps intended to protect a system, e.g., a national or global financial system, may over time lead to over-confidence, increased risk-taking and eventual instability.  Stability breeds complacency.**  As we know, a well-functioning system creates a series of successful outcomes, a line of dynamic non-events.  But that dynamic includes gradual changes to the system, e.g., innovation or adaptation to the environment, that may increase systemic risk and result in a new crisis or unintended consequences.

He sees evidence for his thesis in other fields.  For automobiles, the implementation of anti-lock braking systems leads some operators to drive more recklessly.  In football, better helmets mean increased use of the head as a weapon and more concussions and spinal injuries.  For forest fires, a century of fire suppression has led to massive fuel build-ups and more people moving into forested areas.  For flood control, building more and higher levees has led to increased economic development in historically flood-prone areas.  As a result, both fires and floods can cause huge financial losses when they eventually occur.  In all cases, well-intentioned system “improvements” lead to increased confidence (aka loss of fear) and risk-taking, both obvious and implicit.  In short, “If the surroundings seem safer, the systems tolerate more risk.” (p. 18)
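
To make the dynamic concrete, here is a minimal system dynamics-style sketch in Python.  It is our own toy model, not anything from the book; the variable names, functional forms and parameter values are purely illustrative assumptions.

    import random

    # Toy model of Ip's dynamic: quiet periods raise perceived safety, perceived
    # safety erodes fear and pulls risk-taking up behind it, and accumulated
    # risk-taking eventually produces a crisis that resets the cycle.
    # All numbers are illustrative assumptions, not estimates from the book.

    def simulate(years=100, dt=0.25, seed=1):
        random.seed(seed)
        perceived_safety, risk_taking = 0.5, 0.2
        crises = []
        for step in range(int(years / dt)):
            t = step * dt
            # Reinforcing loop: non-events slowly raise perceived safety...
            perceived_safety += dt * 0.05 * (1.0 - perceived_safety)
            # ...and higher perceived safety raises risk-taking with a lag.
            risk_taking += dt * 0.10 * (perceived_safety - risk_taking)
            # More risk-taking means a higher chance of a crisis in any period.
            if random.random() < dt * 0.03 * risk_taking:
                crises.append(round(t, 1))
                perceived_safety, risk_taking = 0.2, 0.1   # fear returns, briefly
        return crises

    if __name__ == "__main__":
        print("crisis times (years):", simulate())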

Ip uses the nuclear industry to illustrate how society can create larger issues elsewhere in a system when it imposes local responses to a perceived problem.  Closing down nuclear plants after an accident (e.g., Fukushima) or because of green politics does not remove the demand for electric energy.  To the extent the demand shortfall is made up with hydrocarbons, more people will be harmed doing the mining, drilling, processing, etc., and the climate will be made worse.

He cites the aviation industry as an example of a system where near-misses are documented and widely shared in an effort to improve overall system safety.  He notes that the few fatal accidents that occur in commercial aviation serve as lessons learned and keep those responsible for operating the system (pilots and controllers) on their toes.

He also makes an observation about aviation that could be applied to the nuclear industry: “It is almost impossible to improve a system that never has an accident. . . . regulators are unlikely to know whether anything they propose now will have provable benefits; it also means that accidents will increasingly be of the truly mysterious, unimaginable variety . . .” (p. 252)

Speaking of finance, Ip says “A huge part of what the financial system does is try to create the fact—and at times the illusion—of safety.  Usually, it succeeds; . . . On those rare occasions when it fails, the result is panic.” (p. 86)  Could this description also apply to the nuclear industry? 

Our Perspective

Ip’s search for systemic, dynamic factors to explain the financial crisis echoes the type of analysis we’ve been promoting for years.  Like us, he recognizes that people hold different world views of the same system.  Ip contrasts the engineers and the ecologists:  “Engineers satisfy our desire for control, . . . civilization’s needs to act, to do something, . . .” (p. 278)  Ecologists believe “it’s the nature of risk to find the vulnerabilities we missed, to hit when least expected, to exploit the very trust in safety we so assiduously cultivate with all our protection . . .” (p. 279)

Ip’s treatment of the nuclear industry, while positive, is incomplete and somewhat simplistic.  It’s really just an example, not an industry analysis.  His argument that shutting down nuclear plants exacerbates climate harm could have come from the NEI playbook.  He ignores the impact of renewables, efficiency and conservation.

He doesn’t discuss the nuclear industry’s penchant for secrecy, but we have, and we believe it feeds the public’s uncertainty about the industry's safety.  As Ip notes, “People who crave certainty cannot tolerate even a slight increase in uncertainty, and so they flee not just the bad banks, the bad paper, and the bad country, but everything that resembles them, . . .” (p. 261)  If a system that is assumed [or promoted] to be safe has a crisis, even a local one, the result is often panic. (p. 62)

He mentions high reliability organizations (HROs) and their focus on avoiding catastrophe and “being a little bit scared all of the time.” (p. 242)  He does not mention that some of the same systemic factors at work in the financial system also operate in the world of HROs, including exposure to the corrosive effects of complacency and system drift. (p. 242)

Bottom line: Read Foolproof if you have an interest in an intelligible assessment of the financial crisis.  And remember: “Fear serves a purpose: it keeps us out of trouble.” (p. 19)  “. . . but it can keep us from taking risks that could make us better off.” (p. 159)


*  G. Ip, Foolproof (New York: Little, Brown, 2015).  Ip is a finance and economics journalist, currently with the Wall Street Journal and previously with The Economist.

**  He quotes a great quip from Larry Summers: “Complacency is a self-denying prophecy.”  Ip adds, “If everyone worried about complacency, no one would succumb to it.” (p. 263)

Monday, October 12, 2015

IAEA International Conference on Operational Safety, including Safety Culture

IAEA Building
Back in June, the International Atomic Energy Agency (IAEA) hosted an International Conference on Operational Safety.*  Conference sessions covered Peer Reviews, Corporate Management, Post-Fukushima Improvements, Operating Experience, Leadership and Safety Culture, and Long Term Operation.  Later, the IAEA published a summary of conference highlights, including conclusions in the session areas.**  It reported the following with respect to safety culture (SC):

“No organization works in isolation: the safety culture of the operator is influenced by the safety culture of the regulator and vice versa. Everything the regulator says or does not say has an effect on the operator. The national institutions and other cultural factors affect the regulatory framework. Corporate leadership is integral to achieving and improving safety culture, the challenge here is that regulators are not always allowed to conduct oversight at the corporate management level.”

Whoa!  This is an example of the kind of systemic thinking that we have been preaching for years.  We wondered who said that so we reviewed all the SC presentations looking for clues.  Perhaps not surprisingly, it was a bit like gold-mining: one has to crush a lot of ore to find a nugget.

Most of the ore for the quote was provided by an SC panelist who was not one of the SC speakers but a Swiss nuclear regulator (and the only regulator mentioned in the SC session program).  Her slide bullets included “The regulatory body needs to take different perspectives on SC: SC as an oversight issue, impact of oversight on licensees’ SC, the regulatory body’s own SC, [and] Self-reflection on its own SC.”  Good advice to regulators everywhere.

As far as we can tell, no presenter made the point that regulators seldom have the authority to oversee corporate management; perhaps that arose during the subsequent discussion.

SC Presentations

The SC presentations contained hearty, although standard fare.  A couple were possibly more revealing, which we’ll highlight later.

The German, Japanese and United Kingdom presentations reviewed their respective SC improvement plans.  In general these plans are focused on specific issues identified during methodical diagnostic investigations.  The plan for the German Philippsburg plant focuses on specific management responsibilities, personnel attitudes and conduct at all hierarchy levels, and communications.  The Japanese plan concentrates on continued recovery from the Fukushima disaster.  TEPCO company-wide issues include Safety awareness, Engineering capability and Communication ability.  The slides included a good system dynamics-type model.  At EDF’s Heysham 2 in the UK, the interventions are aimed at improving management (leadership, decision-making), trust (just culture) and organizational learning.  As a French operator of a UK plant, EDF recognizes they must tune interventions to the local organization’s core values and beliefs.

The United Arab Emirates presentation described a model for their new nuclear organization; the values, traits and attributes come right out of established industry SC guidelines.

The Entergy presenter parroted the NRC/INPO party line on SC definition, leadership responsibility, traits, attributes and myriad supporting activities.  It’s interesting to hear such bold talk from an SC-challenged organization.  Maybe INPO or the NRC “encouraged” him to present at the conference.  (The NRC is not shy about getting licensees with SC issues to attend the Regulatory Information Conference and confess their sins.)

The Russian presentation consisted of a laundry list of SC improvement activities focused on leadership, personnel reliability, observation and cross-cultural factors (for Hanhikivi 1 in Finland).  It was all top-down.  There was nothing about empowering or taking advantage of individuals’ knowledge or experience.  You can make your own inferences.

Management Presentations

We also reviewed the Management sessions for further clues.  All the operator presenters were European and they had similar structures, with “independent” safety performance advisory groups at the plant, fleet and corporate levels.  They all appeared to focus on programmatic strengths and weaknesses in the safety performance area.  There was no indication any of the groups opined on management performance.  The INPO presenter noted that SC is included in every plant and corporate evaluation and SC issues are highlighted in the INPO Executive Summary to a CEO.

Our Perspective

The IAEA press release writer did a good job of finding appealing highlights to emphasize.  The actual presentations were more ordinary and about what you’d expect from anything involving the IAEA: build the community, try not to offend anyone.  For example, the IAEA SC presentation stressed the value in developing a common international SC language but acknowledged that different industry players and countries can have their own specific needs.

Bottom line: Read the summary and go to the conference materials if something piques your interest—but keep your expectations modest.


*  International Atomic Energy Agency, International Conference on Operational Safety, June 23-26, 2015, Vienna.

**  IAEA press release, “Nuclear Safety is a Continuum, not a Final Destination” (July 3, 2015).

Monday, May 5, 2014

WIPP - Release the Hounds

(Ed. note: This is Safetymatters’ second post on the Phase 1 WIPP report.  Bob and I independently saw the report, concluded it raised important questions about DOE and its investigative process and headed for our keyboards.  We will try to get an official response to our posts—but don’t hold your breath.) 

Earlier this week the DOE released its Accident Investigation Report on the Radiological Release Event at the Waste Isolation Pilot Plant.  The report is a prodigious effort in the just over two months since the event.  It is also a serious indictment of DOE’s management of WIPP and, arguably, of the DOE itself.  There is, however, a significant flaw in the investigation and report: the investigators were kept on too tight a leash.  Itemizing failures, particularly pervasive failures, without pursuing how and why they occurred is not sufficient.  This shortcoming also highlights the essence and value of systems analysis: identifying the fundamental dynamics that produced the failures and developing solutions that change those dynamics.

At first blush the issuance of yet another report on safety issues and safety management performance at a DOE facility would hardly merit a rush to the keyboard to dissect the findings.  Yet we believe this report is a tipping point in the pervasive and continuing issues at DOE facilities and should be a call for much more aggressive action.  It doesn’t take long for the report to get to the point in the Executive Summary:

“The Board identified the root cause of Phase 1 of the investigation of the release of radioactive material from underground to the environment to be NWP’s and CBFO’s management failure to fully understand, characterize, and control the radiological hazard.” [emphasis added] (p. ES-6)  NWP is Nuclear Waste Partnership, the contractor with direct management responsibility for WIPP operations, and CBFO is the Carlsbad Field Office of the DOE.

To complete the picture, the investigation board also found, as a contributing cause, that DOE Headquarters oversight was ineffective.  So in sum, the board found a total failure of the management system responsible for radiological safety at the WIPP.

Interestingly there has been a rather muted response to this report.  The DOE issued the report with a strikingly neutral press release quoting Matt Moury, Environmental Management Deputy Assistant Secretary, Safety, Security, and Quality Programs: “The Department believes this detailed report will lead WIPP recovery efforts as we work toward resuming disposal operations at the facility.”  And Joe Franco, DOE’s Carlsbad Field Office Manager: “We understand the importance of these findings, and the community’s sense of urgency for WIPP to become operational in the future.”*  (We note that both statements focus on resumption of operations versus correction of deficiencies.)  New Mexico’s U.S. Senators Udall and Heinrich called the findings “deeply troubling” but then simply noted that they expected DOE management to take the necessary corrective actions.**  If there is any sense of urgency we would think it might be directed at understanding how and why there was such a total management failure at the WIPP.

To fully appreciate the range and depth of failures associated with this event one really needs to read the board’s report.  Provided below is a brief summary of some of the highlights that illustrate the identified issues:

-    Implementation of the NWP Conduct of Operations Program is not fully compliant with DOE policy;
-    NWP does not have an effective Radiation Protection Program in accordance with 10 Code of Federal Regulations (CFR) 835, Occupational Radiation Protection;
-    NWP does not have an effective maintenance program;
-    NWP does not have an effective Nuclear Safety Program in accordance with 10 CFR 830 Subpart B, Safety Basis Requirements;
-    NWP implementation of DOE O 151.1C, Comprehensive Emergency Management System, was ineffective;
-    The current site safety culture does not fully embrace and implement the principles of DOE Guide (G) 450.4-1C, Integrated Safety Management Guide [note: findings consistent with findings of the 2012 SCWE self assessment results];
-    DOE oversight of NWP was ineffective;
-    Execution of CBFO oversight in accordance with DOE O 226.1B was ineffective; and
-    As previously mentioned, DOE Headquarters (HQ) line management oversight was ineffective. (pp. ES 7-8)

Many of the specific deficiencies cited in the report are not point in time occurrences but stem from chronic and ongoing weaknesses in programs, personnel, facilities and resources. 

Losing the Scent

As mentioned in the opening paragraph we feel that while the report is of significant value it contains a shortcoming that will likely limit its effectiveness in correcting the identified issues.  In so many words the report fails to ask “Why?”  The report is a massive catalogue of failures yet never fully pursues the ultimate and most relevant question: Why did the failures occur?  One almost wonders how the investigators could stop short of systematic and probing interviews of key decision makers.

For example in the maintenance area, “The Board determined that the NWP maintenance and engineering programs have not been effective…”; “Additionally, configuration management was not being maintained or adequately justified when changes were made.”; “There is an acceptance to tolerate or otherwise justify (e.g., lack of funding) out-of-service equipment.” (p. 82)  And that’s where the analysis stops. 

Unfortunately (but predictably), what follows from the constrained analysis is a set of equally unfocused corrective actions based on the following linear construct: “this is a problem - fix the problem”.  Even the corrective action vocabulary becomes numbingly sterile: “needs to take action to ensure…”, “needs to improve…”, “need to develop a performance improvement plan…”,  “needs to take a more proactive role…”.

We do not want to be overly critical, as the current report reflects a little over two months of effort, which may not have afforded sufficient time to pull the string on so many issues.  But it is time to realize that these types of efforts are not sufficient to understand, and therefore ultimately correct, the issues at WIPP and DOE and institutionalize an effective safety management system.


*  DOE press release, “DOE Issues WIPP Radiological Release Investigation Report” (April 24, 2014).  Retrieved May 5, 2014.

**  Senators Udall and Heinrich press release, “Udall, Heinrich Statement on Department of Energy WIPP Radiological Release Investigation Report” (April 24, 2014).  Retrieved May 5, 2014.

Monday, November 11, 2013

Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson

In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory.  The book presents the criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics.  It should be noted the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.

Part I
     
Part I sets the stage for a new safety paradigm.  Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6)   Traditional accident analysis techniques are no longer sufficient.  They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10)  Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).**   We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.

Event-based models are simply inadequate.  There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it.  The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24)  Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19)  In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)

Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors.  “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination.  When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent.  This dependence may be related to common systemic factors that do not appear in an event chain.  Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35)  “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36) 

The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38)  In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46) 

Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51)  Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52)  This is reminiscent of  Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.

When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56)  In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64)  Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60) 

Part II

Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes).  For our purposes we don't need to spend much space on this material.  “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”***   It attempts to achieve the goals listed at the end of Part I.

STAMP treats safety in a system as a control problem, not a reliability one.  Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76)  Controls may be physical or social, including cultural.  There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87)  “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)
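
STAMP itself is a qualitative model, but the control-theoretic framing can be illustrated in a few lines of code.  The sketch below is our own schematic, not Leveson's: a controller tries to enforce a safety constraint using feedback that arrives after a delay, so its process model lags the real process and the constraint can be violated before the controller reacts.  All names and numbers are illustrative assumptions.

    from collections import deque

    # Schematic sketch (ours, not Leveson's) of safety as a control problem:
    # a controller enforces a safety constraint on a process it observes only
    # through delayed feedback, so its process model can lag the real state.

    SAFETY_LIMIT = 100.0       # the safety constraint to be enforced
    FEEDBACK_DELAY = 5         # steps before measurements reach the controller

    def run(steps=60):
        actual = 80.0
        pipeline = deque([actual] * FEEDBACK_DELAY, maxlen=FEEDBACK_DELAY)
        for t in range(steps):
            observed = pipeline[0]          # controller's (stale) process model
            # Throttle when the *observed* state nears the limit; otherwise
            # steady production pressure keeps pushing the process upward.
            action = -2.0 if observed > 0.95 * SAFETY_LIMIT else 1.5
            actual += action + 0.5
            pipeline.append(actual)
            if actual > SAFETY_LIMIT:
                print(f"step {t}: constraint violated (actual {actual:.1f}, "
                      f"controller still sees {observed:.1f})")
                return
        print("constraint held for", steps, "steps")

    if __name__ == "__main__":
        run()

In this toy example the violation occurs while the controller still believes the process is comfortably inside the limit, which is the flavor of “adaptive feedback function that fails to maintain safety” Leveson describes.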

Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system.  There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.

Part III

Part III describes in detail how STAMP can be applied.  There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section.  Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of  STAMP over traditional accident analysis techniques. 

We will focus on chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.

Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421)  Two new components need a bit of explanation: a safety control structure and controls on system migration toward higher risk.  The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40)  The control structure is also responsible for controlling migration: “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26)  Such an approach should be based on the organization's comprehensive hazards analysis.****

The safety culture discussion is also familiar. (pp. 426-33)  Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust.  In such a culture, Leveson says an early warning system for migration toward states of high risk can be established.  A section on Just Culture is taken directly from Dekker's work.  The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.

Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.

Our Perspective
 

Overall, we like this book.  It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial.  The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog.  The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.
 

However, there are only a few direct references to nuclear.  The examples in the book are mostly from aerospace, aviation, maritime activities and the military.  Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory),  shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs.  Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428)  She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7)  In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective.  TMI is used to illustrate specific system design/operation problems.
 

We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists.  First, she accuses HRO of confusing reliability with safety, in other words, an unsafe system can function very reliably. (pp. 7, 12)  But I'm not aware of any HRO work that has been done in an organization that is patently unsafe.  HRO asserts that reliability follows from practices that recognize and contain emerging problems.  She takes another swipe at HRO when she says HRO suggests that, during crises, decision making migrates to frontline workers.  Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44)  Her position may be correct in some cases but as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could.  Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.”  In her view, “Telling managers and operators to be “mindful of weak signals” simply creates a pretext for blame after a loss event occurs.” (p. 410)  I don't think it's pretext but it is challenging to maintain mindfulness and sense faint signals.  Overall, this appears to be academic posturing and feather fluffing.
 

We offer no opinion on the efficacy of using Leveson's STAMP approach.  She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)
 

In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii)  You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.


*  N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (The MIT Press, Cambridge, MA: 2011)  The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed.  All quotes in this post were retyped from the original text.


**  We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier.  These authors are all aware of others' publications and contributions.  Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text. 

***  Nancy Leveson informal bio page.


****  “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157)  The hazards analysis identifies all major hazards the system may confront.  Baseline safety requirements follow from the hazards analysis.  Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk.  The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)

Friday, October 18, 2013

When Apples Decay

In our experience education is perceived as a continual process, accumulating knowledge progressively over time.  A shiny apple exemplifies the learning student or an inspiring insight (see Newton, Sir Isaac). Less consideration is given to the fact that the educational process can work in reverse, leading to a loss of capability over time.  In other words, the apple decays.  As Martin Weller notes on his blog The Ed Techie, “education is about selling apples...we need to recognise and facilitate learning that takes ten minutes or involves extended participation in a community over a number of years.”*

This leads us to a recent Wall Street Journal piece, “Americans Need a Simple Retirement System”.**  The article is about the failure of educational efforts to improve financial literacy.  We admit that this is a bit out of context for nuclear safety culture; nevertheless it provides a useful perspective that seems to be overlooked within the nuclear industry.  The article notes:

“The problem is that, like all educational efforts, financial education decays over time and has negligible effects on behavior after 20 months. The authors suggest that, given this decay, “just in time” financial education...might be a more effective way to proceed.”

We tend to view the safety culture training provided at nuclear plants as being of the 10-minute variety, selling apples that may vary in size and color but are just apples.   Additional training is routinely prescribed in response to findings of inadequate safety culture.  Yet we cannot recall a single reported instance where safety culture issues were associated with inadequate or ineffective training in the first place.  Nor do we see explicit recognition that such training efforts have very limited half-lives, creating cycles of future problems.  We have blogged about the decay of training based reinforcement (see our March 22, 2010 post) and the contribution of decay and training saturation to complacency (see our Dec. 8, 2011 post).
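
To illustrate the decay dynamic, here is a toy model of ours (the half-life and refresher interval are assumptions, not measured values).  Periodic refresher training layered on exponential decay produces a sawtooth, not steadily accumulating knowledge:

    import math

    # Toy model of training-based reinforcement that decays between refreshers.
    # Half-life and schedule are illustrative assumptions, not measured values.

    HALF_LIFE_MONTHS = 12.0          # knowledge loses half its effect in a year
    RETRAIN_INTERVAL_MONTHS = 24     # refresher training every two years
    DECAY_RATE = math.log(2) / HALF_LIFE_MONTHS

    def knowledge_over_time(months=72):
        level = 1.0
        levels = []
        for m in range(1, months + 1):
            level *= math.exp(-DECAY_RATE)        # continuous decay
            if m % RETRAIN_INTERVAL_MONTHS == 0:  # refresher restores the level
                level = 1.0
            levels.append(round(level, 2))
        return levels

    if __name__ == "__main__":
        # Note how far the level falls before each refresher arrives.
        print(knowledge_over_time(36))

In this toy model the level always sags to the same low point just before each refresher; only shortening the interval (or slowing the decay) changes that, which is consistent with the article's “just in time” suggestion.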

The fact that safety culture knowledge and “strength” decay over time is just one example of the dynamics associated with safety management.  Arguably, an effective learning process is itself a (the?) key to building and maintaining a strong safety culture, and it is consistently missing in current nuclear industry programs that emphasize indoctrination in traits and values.  It’s time for better and more innovative approaches - not just more apples.



*  M. Weller, "The long-awaited 'education as fruit' metaphor," The Ed Techie blog (Sept. 10, 2009).  Retrieved Oct. 18, 2013.

**  A.H. Munnell, "Americans need a simple retirement system," MarketWatch blog (Oct. 16, 2013).  Retrieved Oct. 18, 2013.

Friday, September 27, 2013

Four Years of Safetymatters

Aztec Calendar
Over the four plus years we have been publishing this blog, regular readers will have noticed some recurring themes in our posts.  The purpose of this post is to summarize our perspective on these key themes.  We have attempted to build a body of work that is useful and insightful for you.

Systems View

We have consistently considered safety culture (SC) in the nuclear industry to be one component of a complicated socio-technical system.  A systems view provides a powerful mental model for analyzing and understanding organizational behavior. 

Our design and explicative efforts began with system dynamics as described by authors such as Peter Senge, focusing on characteristics such as feedback loops and time delays that can affect system behavior and lead to unexpected, non-linear changes in system performance.  Later, we expanded our discussion to incorporate the ways systems adapt and evolve over time in response to internal and external pressures.  Because they evolve, socio-technical organizations are learning organizations but continuous improvement is not guaranteed; in fact, evolution in response to pressure can lead to poorer performance.
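
A minimal sketch of the kind of behavior we mean (our own toy example, not drawn from Senge or any specific plant): a balancing feedback loop that acts on a delayed indicator overshoots and oscillates rather than settling smoothly.

    # Toy balancing loop with a reporting delay: management adjusts staffing
    # toward a target corrective-action backlog, but it reacts to an indicator
    # that lags reality, so the backlog overshoots and oscillates instead of
    # settling.  Names and numbers are illustrative assumptions only.

    REPORT_LAG = 4            # quarters before actual backlog shows up in reports
    TARGET_BACKLOG = 100.0

    def simulate(quarters=40):
        backlog, staffing = 200.0, 10.0
        reports = [backlog] * REPORT_LAG   # stale measurements in the pipeline
        path = []
        for _ in range(quarters):
            indicator = reports.pop(0)                        # delayed signal
            staffing += 0.05 * (indicator - TARGET_BACKLOG)   # slow hire/release
            staffing = max(staffing, 0.0)
            backlog += 20.0 - 2.0 * staffing                  # new work minus work done
            backlog = max(backlog, 0.0)
            reports.append(backlog)
            path.append(round(backlog, 1))
        return path

    if __name__ == "__main__":
        # The printed series rises and falls around the target instead of
        # converging smoothly, a classic delay-driven oscillation.
        print(simulate())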

The systems view, system dynamics and their application through computer simulation techniques are incorporated in the NuclearSafetySim management training tool.

Decision Making

A critical, defining activity of any organization is decision making.  Decision making determines what will (or will not) be done, by whom, and with what priority and resources.  Decision making is  directed and constrained by factors including laws, regulations, policies, goals, procedures and resource availability.  In addition, decision making is imbued with and reflective of the organization's values, mental models and aspirations, i.e., its culture, including safety culture.

Decision making is intimately related to an organization's financial compensation and incentive program.  We've commented on these programs in nuclear and non-nuclear organizations and identified the performance goals for which executives received the largest rewards; often, these were not safety goals.

Decision making is part of the behavior exhibited by senior managers.  We expect leaders to model desired behavior and are disappointed when they don't.  We have provided examples of good and bad decisions and leader behavior. 

Safety Culture Assessment


We have cited NRC Commissioner Apostolakis' observation that “we really care about what people do and maybe not why they do it . . .”  We sympathize with that view.  If organizations are making correct decisions and getting acceptable performance, the “why” is not immediately important.  However, in the longer run, trying to identify the why is essential, both to preserve organizational effectiveness and to provide a management (and mental) model that can be transported elsewhere in a fleet or industry.

What is not useful, and possibly even a disservice, is a feckless organizational SC “analysis” that focuses on a laundry list of attributes or limits remedial actions to retraining, closer oversight and selective punishment.  Such approaches ignore systemic factors and cannot provide long-term successful solutions.

We have always been skeptical of the value of SC surveys.  Over time, we saw that others shared our view.  Currently, broad-scope, in-depth interviews and focus groups are recognized as preferred ways to attempt to gauge an organization's SC and we generally support such approaches.

On a related topic, we were skeptical of the NRC's SC initiatives, which culminated in the SC Policy Statement.  As we have seen, this “policy” has led to back door de facto regulation of SC.

References and Examples

We've identified a library of references related to SC.  We review the work of leading organizational thinkers, social scientists and management writers, attempt to accurately summarize their work and add value by relating it to our views on SC.  We've reported on the contributions of Dekker, Dörner, Hollnagel, Kahneman, Perin, Perrow, Reason, Schein, Taleb, Vaughan, Weick and others.

We've also posted on the travails of organizations that dug themselves into holes that brought their SC into question.  Some of these were relatively small potatoes, e.g., Vermont Yankee and EdF, but others were actual disasters, e.g., Massey Energy and BP.  We've also covered DOE, especially the Hanford Waste Treatment and Immobilization Plant (aka the Vit plant).

Conclusion

We believe the nuclear industry is generally well-managed by well-intentioned personnel but can be affected by the natural organizational ailments of complacency, normalization of deviation, drift, hubris, incompetence and occasional criminality.  Our perspective has evolved as we have learned more about organizations in general and SC in particular.  Channeling John Maynard Keynes, we adapt our models when we become aware of new facts or better ways of looking at the data.  We hope you continue to follow Safetymatters.  

Thursday, August 29, 2013

Normal Accidents by Charles Perrow

This book*, originally published in 1984, is a regular reference for authors writing about complex socio-technical systems.**  Perrow's model for classifying such systems is intuitively appealing; it appears to reflect the reality of complexity without forcing the reader to digest a deliberately abstruse academic construct.  We will briefly describe the model then spend most of our space discussing our problems with Perrow's inferences and assertions, focusing on nuclear power.  

The Model

The model is a 2x2 matrix with axes of coupling and interactions.  Not surprisingly, it is called the Interaction/Coupling (I/C) chart.

“Coupling” refers to the amount of slack, buffer or give between two items in a system.  Loosely coupled systems can accommodate shocks, failures and pressures without destabilizing.  Tightly coupled systems have a higher risk of disastrous failure because their processes are more time-dependent, with invariant sequences and a single way of achieving the production goal, and have little slack. (pp. 89-94)

“Interactions” may be linear or complex.  Linear interactions are between a system component and one or more other components that immediately precede or follow it in the production sequence.  These interactions are familiar and, if something unplanned occurs, the results are easily visible.  Complex interactions are between a system component and one or more other components outside the normal production sequence.  If unfamiliar, unplanned or unexpected sequences occur, the results may not be visible or immediately comprehensible. (pp. 77-78)

Nuclear plants have the tightest coupling and most complex interactions of the two dozen systems Perrow shows on the I/C chart, a population that included chemical plants, space missions and nuclear weapons accidents. (p. 97)
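
For readers who like something concrete, here is a minimal sketch of the I/C chart as a data structure.  It is our construction, not Perrow's; the numeric placements are rough illustrations (and, as noted below, Perrow says even his own placements are subjective judgments).

    # Minimal sketch of Perrow's Interaction/Coupling (I/C) chart as data.
    # Coordinates run from 0 (linear interactions / loose coupling) to
    # 1 (complex interactions / tight coupling) and are rough illustrations.

    SYSTEMS = {
        "nuclear plant":  (0.95, 0.95),  # most complex, most tightly coupled per the text
        "chemical plant": (0.85, 0.85),  # illustrative placement
        "space mission":  (0.80, 0.90),  # illustrative placement
        "hypothetical loosely coupled, linear system": (0.20, 0.20),
    }

    def quadrant(interaction, coupling):
        interactions = "complex" if interaction >= 0.5 else "linear"
        couplings = "tight" if coupling >= 0.5 else "loose"
        return f"{interactions} interactions, {couplings} coupling"

    if __name__ == "__main__":
        for name, (i, c) in SYSTEMS.items():
            print(f"{name}: {quadrant(i, c)}")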

Perrow on Nuclear Power

Let's get one thing out of the way immediately: Normal Accidents is an anti-nuke screed.  Perrow started the book in 1979 and it was published in 1984.  He was motivated to write the book by the TMI accident and it obviously colored his forecast for the industry.  He reviews the TMI accident in detail, then describes nuclear industry characteristics and incidents at other plants, all of which paint an unfavorable portrait of the industry.  He concludes: “We have not had more serious accidents of the scope of Three Mile Island simply because we have not given them enough time to appear.” (p. 60, emphasis added)  While he is concerned with design, construction and operating problems, his primary fear is “the potential for unexpected interactions of small failures in that system that makes it prone to the system accident.” (p. 61)   

Why has his prediction of such serious accidents not come to pass, at least in the U.S.?

Our Perspective on Normal Accidents

We have several issues with this book and the author's “analysis.”

Nuclear is not as complex as Perrow asserts 


There is no question that the U.S. nuclear industry grew quickly, with upsized plants and utilities specifying custom design combinations (in other words, limited standardization).  The utilities were focused on meeting significant load growth forecasts and saw nuclear baseload capacity as an efficient way to produce electric power.  However, actually operating a large nuclear plant was probably more complex than the utilities realized.  But not any more.  Learning curve effects, more detailed procedures and improved analytic methods are a few of the factors that led to a greater knowledge basis for plant decision making.  The serious operational issues at the “problem plants” (circa 1997) forced operators to confront the reality that identifying and permanently resolving plant problems was necessary for survival.  This era also saw the beginning of industry consolidation, with major operators applying best methods throughout their fleets.  All of these changes have led to our view that nuclear plants are certainly complicated but no longer complex and haven't been for some time.    

This is a good place to point out that Perrow's designation of nuclear plants as the most complex and tightest coupled systems he evaluated has no basis in any real science.  In his own words, “The placement of systems [on the interaction/coupling chart] is based entirely on subjective judgments on my part; at present there is no reliable way to measure these two variables, interaction and coupling.” (p. 96)

System failures with incomprehensible consequences are not the primary problem in the nuclear industry

The 1986 Chernobyl disaster was arguably a system failure: poor plant design, personnel non-compliance with rules and a deficient safety culture.  It was a serious accident but not a catastrophe.*** 

But other significant industry events have not arisen from interactions deep within the system; they have come from negligence, hubris, incompetence or selective ignorance.  For example, Fukushima was overwhelmed by a tsunami that was known to be possible but was ignored by the owners.  At Davis-Besse, personnel ignored increasingly stronger signals of a nascent problem but managers argued that in-depth investigation could wait until the next outage (production trumps safety) and the NRC agreed (with no solid justification).  

Important system dynamics are ignored 


Perrow has some recognition of what a system is and how threats can arise within it: “. . . it is the way the parts fit together, interact, that is important.  The dangerous accidents lie in the system, not in the components.” (p. 351)  However, he is/was focused on interactions and couplings as they currently exist.  But a socio-technical system is constantly changing (evolving, learning) in response to internal and external stimuli.  Internal stimuli include management decisions and the reactions to performance feedback signals; external stimuli include environmental demands, constraints, threats and opportunities.  Complacency and normalization of deviance can seep in but systems can also bolster their defenses and become more robust and resilient.****  It would be a stretch to say that nuclear power has always learned from its mistakes (especially if they occur at someone else's plant) but steps have been taken to make operations less complex. 

My own bias is that Perrow doesn't really appreciate the technical side of a socio-technical system.  He recounts incidents in great detail, but not at great depth, and is often recounting the work of others.  Although he claims the book is about technology (the socio side, aka culture, is never mentioned), the fact remains that he is not an engineer or physicist; he is a sociologist.

Conclusion

Notwithstanding all my carping, this is a significant book.  It is highly readable.  Perrow's discussion of accidents, incidents and issues in various contexts, including petrochemical plants, air transport, marine shipping and space exploration, is fascinating reading.  His interaction/coupling chart is a useful mental model to help grasp relative system complexity although one must be careful about over-inferring from such a simple representation.

There are some useful suggestions, e.g., establishing an anonymous reporting system, similar to the one used in the air transport industry, for nuclear near-misses. (p. 169)  There is a good discussion of decentralization vs centralization in nuclear plant organizations. (pp. 334-5)  But he says that neither is best all the time, which he considers a contradiction.  The possibility of contingency management, i.e., using a decentralized approach for normal times and tightening up during challenging conditions, is regarded as infeasible.

Ultimately, he includes nuclear power with “systems that are hopeless and should be abandoned because the inevitable risks outweigh any reasonable benefits . . .” (p. 304)*****  As further support for this conclusion, he reviews three different ways of evaluating the world: absolute, bounded and social rationality.  Absolute rationality is the province of experts; bounded rationality recognizes resource and cognitive limitations in the search for solutions.  But Perrow favors social rationality (which we might unkindly call crowdsourced opinions) because it is the most democratic and, not coincidentally, he can cite a study that shows an industry's “dread risk” is highly correlated with its position on the I/C chart. (p. 326)  In other words, if lots of people are fearful of nuclear power, no matter how unreasonable those fears are, that is further evidence to shut it down.

The 1999 edition of Normal Accidents has an Afterword that updates the original version.  Perrow continues to condemn nuclear power but without much new data.  Much of his disapprobation is directed at the petrochemical industry.  He highlights writers who have advanced his ideas and also presents his (dis)agreements with high reliability theory and Vaughan's interpretation of the Challenger accident.

You don't need this book in your library but you do need to be aware that it is a foundation stone for the work of many other authors.

 

*  C. Perrow, Normal Accidents: Living with High-Risk Technologies (Princeton Univ. Press, Princeton, NJ: 1999).

**  For example, see Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off (reviewed here); Woods, Dekker et al, Behind Human Error (reviewed here); and Weick and Sutcliffe, Managing the Unexpected: Resilient Performance in an Age of Uncertainty (reviewed here).  It's ironic that Perrow set out to write a readable book without references to the “sacred texts” (p. 11) but it appears Normal Accidents has become one.

***  Perrow's criteria for catastrophe appear to be: “kill many people, irradiate others, and poison some acres of land.” (p. 348)  While any death is a tragedy, reputable Chernobyl studies report fewer than 100 deaths from radiation and project 4,000 radiation-induced cancers in a population of 600,000 people who were exposed.  The same population is expected to suffer 100,000 cancer deaths from all other causes.  Approximately 40,000 square miles of land was significantly contaminated.  Data from Chernobyl Forum, "Chernobyl's Legacy: Health, Environmental and Socio-Economic Impacts" 2nd rev. ed.  Retrieved Aug. 27, 2013.  Wikipedia, “Chernobyl disaster.”  Retrieved Aug. 27, 2013.

In his 1999 Afterword to Normal Accidents, Perrow mentions Chernobyl in passing and his comments suggest he does not consider it a catastrophe but could have been had the wind blown the radioactive materials over the city of Kiev.

****  A truly complex system can drift into failure (Dekker) or experience incidents from performance excursions outside the safety boundaries (Hollnagel).

*****  It's not just nuclear power, Perrow also supports unilateral nuclear disarmament. (p. 347)

Tuesday, July 30, 2013

Introducing NuclearSafetySim

We have referred to NuclearSafetySim and the use of simulation tools on a regular basis in this blog.  NuclearSafetySim is our initiative to develop a new approach to safety management training for nuclear professionals.  It utilizes a simulator to provide a realistic nuclear operations environment within which players are challenged by emergent issues - where they must make decisions balancing safety implications and other priorities - over a five year period.  Each player earns an overall score and is provided with analyses and data on his/her decision making and performance against goals.  It is clearly a different approach to safety culture training, one that attempts to operationalize the values and traits espoused by various industry bodies.  In that regard it is exactly what nuclear professionals must do on a day to day basis. 

At this time we are making NuclearSafetySim available to our readers through a web-based demo version.  To get started you need to access the NuclearSafetySim website.  Click on the Introduction tab at the top of the Home page.  Here you will find a link to a narrated slide show that provides important background on the approach used in the simulation.  It runs about 15 minutes.  Then click on the Simulation tab.  Here you will find another video which is a demo of NuclearSafetySim.  While this runs about 45 minutes (apologies) it does provide a comprehensive tutorial on the sim and how to interact with it.  We urge you to view it.  Finally...at the bottom of the Simulation page is a link to the NuclearSafetySim tool.  Clicking on the link brings you directly to the Home screen and you’re ready to play.

As you will see on the website and in the sim itself, there are reminders and links to facilitate providing feedback on NuclearSafetySim and/or requesting additional information.  This is important to us and we hope our readers will take the time to provide thoughtful input, including constructive criticism.  We welcome all comments. 

Wednesday, July 24, 2013

Leadership, Culture and Organizational Performance

As discussed in our July 18, 2013 post, INPO's position is that creating and maintaining a healthy safety culture (SC) is a primary leadership responsibility.*  That seems like a common sense belief but is it based on any social science?  What is the connection between leader behavior and culture?  And what is the connection between culture and organizational performance? 

To help us address these questions, we turn to a paper** by some Stanford and UC Berkeley academics.  They review the relevant literature and present their own research and findings.  This paper is not a great fit with nuclear power operations but some of the authors' observations and findings are useful.  One might think there would be ample materials on this important topic but “only a very few studies have actually explored the interrelationships among leadership, culture and performance.” (p. 33)

Leaders and Culture


Leaders can be described by different personality types.  Note this does not focus on specific behavior, e.g., how they make decisions, but the attributes of each personality type certainly imply the kinds of behavior that can reasonably be expected.  The authors contend “. . . the myriad of potential personality and value constructs can be reliably captured by five essential personality constructs, the so-called Big Five or the Five Factor Model . . .” (p. 6)  You have all been exposed to the Big 5 or a similar taxonomy.  An individual may exhibit attributes from more than one type but can ultimately be classified as primarily representative of one specific type.  The five types are listed below, with a few selected attributes for each.
  • Agreeableness (Cooperative, Compromising, Compassionate, Trusting)
  • Conscientiousness (Orderly, Reliable, Achievement oriented, Self-disciplined, Deliberate, Cautious)
  • Extraversion (Gregarious, Assertive, Energy, Optimistic)
  • Neuroticism (Negative affect, Anxious, Impulsive, Hostile, Insecure)
  • Openness to Experience (Insightful, Challenge convention, Autonomous, Resourceful)

Leaders can affect culture and later we'll see that some personality types are associated with specific types of organizational culture.  “While not definitive, the evidence suggests that personality as manifested in values and behavior is associated with leadership at the CEO level and that these leader attributes may affect the culture of the organization, although the specific form of these relationships is not clear.” (p. 10)  “. . . senior leaders, because of their salience, responsibility, authority and presumed status, have a disproportionate impact on culture, . . .” (p. 11)

Culture and Organizational Performance

Let's begin with a conclusion: “One of the most important yet least understood questions is how organizational culture relates to organizational performance.” (p. 11)

To support their research model, the authors describe a framework, similar to the Big 5 for personality, for summarizing organizational cultures.  The Organizational Culture Profile (OCP) features seven types of culture, listed below with a few selected attributes for each. 

  • Adaptability (Willing to experiment, Taking initiative, Risk taking, Innovative)
  • Collaborative (Team-oriented, Cooperative, Supportive, Low levels of conflict)
  • Customer-oriented (Listening to customers, Being market driven)
  • Detail-oriented (Being precise, Emphasizing quality, Being analytical)
  • Integrity (High ethical standards, Being honest)
  • Results-Oriented (High expectations for performance, Achievement oriented, Not easy going)
  • Transparency (Putting the organization’s goals before the unit, Sharing information freely)

The linkage between culture and performance is fuzzy.  “While the strong intuition was that organizational culture should be directly linked to firm effectiveness, the empirical results are equivocal.” (p. 14)  “[T]he association of culture and performance is not straightforward and likely to be contingent on the firm’s strategy, the degree to which the culture promotes adaptability, and how widely shared and strongly felt the culture is.” (p. 17)  “Further compounding the issue is that the relationship between culture and firm performance has been shown to vary across industries.” (p. 11)  Finally, “although the [OCP] has the advantage of identifying a comprehensive set of cultural dimensions, there is no guarantee that any particular dimension will be relevant for a particular firm.” (p. 18)  I think it's fair to summarize the culture-performance literature by saying “It all depends.” 

Research Results

The authors gathered and analyzed data on a group of high-technology firms: CEO personalities based on the Big 5 types, cultural descriptions using the OCP, and performance data.  Firm performance was based on financial metrics, firm reputation (an intangible asset) and employee attitudes.*** (p. 23-24) 

“[T]he results reveal a number of significant relationships between CEO personality and firm culture, . . . CEOs who were more extraverted (gregarious, assertive, active) had cultures that were more results-oriented. . . . CEOs who were more conscientious (orderly, disciplined, achievement-oriented) had cultures that were more detail-oriented . . . CEOs who were higher on openness to experience (ready to challenge convention, imaginative, willing to try new activities) [were] more likely to have cultures that emphasized adaptability.” (p. 26)

“Cultures that were rated as more adaptable, results-oriented and detail-oriented were seen more positively by their employees. Firms that emphasized adaptability and were more detail-oriented were also more admired by industry observers.” (p. 28)

In sum, the linkage between leadership and performance is far from clear.  But “consistent patterns of [CEO] behavior shape interpretations of what’s important [values] and how to behave. . . . Other research has shown that a CEO’s personality may affect choices of strategy and structure.” (p. 31)

Relevance to Nuclear Operations


As mentioned in the introduction, this paper is not a great fit with the nuclear industry.  The authors' research focuses on high-technology companies, there is nothing SC-specific, and their financial performance metrics (more important to firms in highly competitive industries) are more robust than their non-financial measures.  Safety performance is not mentioned.

But their framework stimulates us to ask important questions.  For example, based on the research results, what type of CNO would you select for a plant with safety performance problems?  How about one facing significant economic challenges?  Or one where things are running smoothly?  Based on the OCP, what types of culture would be most supportive of a strong SC?  Would any types be inconsistent with a strong SC?  How would you categorize your organization's culture?  

The authors suggest that “Senior leaders may want to consider developing the behaviors that cultivate the most useful culture for their firm, even if these behaviors do not come naturally to them.” (p. 35)  Is that desirable or practical for your CNO?

The biggest challenge to obtaining generalizable results, which the authors recognize, is that so many driving factors are situation-specific, i.e., dependent on a firm's industry, competitive position and relative performance.  They also recognize a possible weakness in linear causality, i.e., the leadership → culture → performance logic may not be one-way.  In our systems view, we'd say there are likely feedback loops, two-way influence flows and additional relevant variables in the overall model of the organization.
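
As a purely illustrative sketch of what we mean, consider the toy model below (ours, not the authors'; every variable, coefficient and update rule is hypothetical).  Once performance feeds back into leader behavior, and employees' response to results also shifts norms, the one-way leadership → culture → performance chain becomes a set of interacting loops.

    # Toy stock-and-flow sketch of a leadership-culture-performance system.
    # Entirely hypothetical; for illustration of feedback loops only.
    def simulate(years=20):
        leadership = 0.8    # leader emphasis on, say, results (0 to 1)
        culture = 0.5       # strength of the corresponding cultural norm (0 to 1)
        performance = 0.5   # firm performance index (0 to 1)
        history = []
        for year in range(years):
            culture += 0.2 * (leadership - culture)         # top-down influence
            performance += 0.3 * (culture - performance)    # culture affects results
            leadership += 0.1 * (performance - leadership)  # results change leader behavior
            culture += 0.05 * (performance - culture)       # results also shift norms directly
            history.append((year, round(leadership, 2), round(culture, 2),
                            round(performance, 2)))
        return history

    for row in simulate():
        print(row)

The point is not the numbers, which are invented, but the structure: once feedback is present, changing the leader is only one of several levers on culture and performance.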

The linear (Newtonian) viewpoint promoted by INPO suggests that culture is mostly (solely?) created by senior executives.  If only it were that easy.  Such a view “runs counter to the idea that culture is a social construct created by many individuals and their behavioral patterns.” (p. 10)  We believe culture, including SC, is an emergent organizational property created by the integration of top-down activities with organizational history, long-serving employees, and strongly held beliefs and values, including the organization's “real” priorities.  In other words, SC is a result of the functioning over time of the socio-technical system.  In our view, a CNO can heavily influence, but not unilaterally define, organizational culture including SC.



*  As another example of INPO's position, a recent presentation by an INPO staffer ends with an Ed Schein quote: “...the only thing of real importance that leaders do is to create and manage culture...”  The quote is from Schein's Organizational Culture and Leadership (San Francisco, CA: Jossey-Bass, 1985), p. 2.  The presentation was A. Daniels, “How to Continuously Improve Cultural Traits for the Management of Safety,” IAEA International Experts’ Meeting on Human and Organizational Factors in Nuclear Safety in the Light of the Accident at the Fukushima Daiichi Nuclear Power Plant, Vienna May 21-24, 2013.
 

**  C. O’Reilly, D. Caldwell, J. Chatman and B. Doerr, “The Promise and Problems of Organizational Culture: CEO Personality, Culture, and Firm Performance”  Working paper (2012).  Retrieved July 22, 2013.  To enhance readability, in-line citations have been removed from quotes.

***  The authors report “Several studies show that culture is associated with employee attitudes . . . ” (p. 14)

Saturday, July 6, 2013

Behind Human Error by Woods, Dekker, Cook, Johannesen and Sarter

This book* examines how errors occur in complex socio-technical systems.  The authors' thesis is that behind every ascribed “human error” there is a “second story” of the context (conditions, demands, constraints, etc.) created by the system itself.  “That which we label 'human error' after the fact is never the cause of an accident.  Rather, it is the cumulative effect of multiple cognitive, collaborative, and organizational factors.” (p. 35)  In other words, “Error is a symptom indicating the need to investigate the larger operational systems and the organizational context in which it functions.” (p. 28)  This post presents a summary of the book followed by our perspective on its value.  (The book has a lot of content so this will not be a short post.)

The Second Story

This section establishes the authors' view of error and how socio-technical systems function.  They describe two mutually exclusive world views: (1) “erratic people degrade an otherwise safe system” vs. (2) “people create safety at all levels of the socio-technical system by learning and adapting . . .” (p. 6)  It should be obvious that the authors favor option 2.

In such a world “Failure, then, represents breakdowns in adaptations directed at coping with complexity.  Indeed, the enemy of safety is not the human: it is complexity.” (p. 1)  “. . . accidents emerge from the coupling and interdependence of modern systems.” (p. 31) 

Adaptation occurs in response to pressures or environmental changes.  For example, systems are under stakeholder pressure to become faster, better, cheaper; multiple goals and goal conflict are regular complex system characteristics.  But adaptation is not always successful.  There may be too little (rules and procedures are followed even though conditions have changed) or too much (adaptation is attempted with insufficient information to achieve goals).  Because of pressure, adaptations evolve toward performance boundaries, in particular, safety boundaries.  There is a drift toward failure. (see Dekker, reviewed here)

The authors present 15 premises for analyzing errors in complex socio-technical systems. (pp. 19-30)  Most are familiar but some are worth highlighting and remembering when thinking about system errors:

  • “There is a loose coupling between process and outcome.”  A “bad” process does not always produce bad outcomes and a “good” process does not always produce good outcomes.
  • “Knowledge of outcome (hindsight) biases judgments about process.”  More about that later.
  • “Lawful factors govern the types of erroneous actions or assessments to be expected.”   In other words, “errors are regular and predictable consequences of a variety of factors.”
  • “The design of artifacts affects the potential for erroneous actions and paths towards disaster.”  This is Human Factors 101 but problems still arise.  “Increased coupling increases the cognitive demands on practitioners.”  Increased coupling plus weak feedback can create a latent failure.

Complex Systems Failure


This section covers traditional mental models used for assessing failures and points out the putative inadequacies of each.  The sequence-of-events (or domino) model is familiar Newtonian causal analysis.  Man-made disaster theory puts company culture and institutional design at the heart of the safety question.  Vulnerability develops over time but is hidden by the organization’s belief that it has risk under control.  A system or component is driven into failure.  The latent failure (or Swiss cheese) model proposes that “disasters are characterized by a concatenation of several small failures and contributing events. . .” (p. 50)  While a practitioner may be closest to an accident, the associated latent failures were created by system managers, designers, maintainers or regulators.  All these models reinforce the search for human error (someone untrained, inattentive or a “bad apple”) and the customary fixes (more training, procedure adherence and personal attention, or targeted discipline).  They represent a failure to adopt systems thinking and concepts of dynamics, learning, adaptation and the notion that a system can produce accidents as a natural consequence of its normal functioning.

A more sophisticated set of models is then discussed.  Perrow's normal accident theory says that “accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled.” (p. 61)  Such systems structurally confuse operators and prevent them from recovering when incipient failure is discovered.  People are part of the Perrowian system and can exhibit inadequate expertise.  Control theory sees systems as composed of components that must be kept in dynamic equilibrium based on feedback and continual control inputs—basically a system dynamics view.  Accidents are a result of normal system behavior and occur when components interact to violate safety constraints and the feedback (and control inputs) do not reflect the developing problems.  Small changes in the system can lead to huge consequences elsewhere.  Accident avoidance is based on making system performance boundaries explicit and known, although the goal of efficiency will tend to push operations toward the boundaries.  In contrast, the authors would argue for a different focus: making the system more resilient, i.e., error-tolerant.**  High reliability theory describes how high-hazard activities can achieve safe performance through leadership, closed systems, functional decentralization, safety culture, redundancy and systematic learning.  High reliability means minimal variations in performance, which in the short term means safe performance, but HROs are subject to incidents indicative of residual system noise and unseen changes from social forces, information management or new technologies. (See Weick, reviewed here)
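
The control-theory view lends itself to a toy demonstration.  The sketch below is ours, not the book's, and every number in it is invented: efficiency pressure pushes an operating point toward a safety boundary while the controller acts only on delayed measurements, so by the time the feedback reflects the developing problem the constraint has already been crossed.

    import random

    # Toy sketch of the control-theory view of accidents.  Hypothetical numbers.
    # The controller only sees delayed feedback, so the safety constraint is
    # violated before the developing problem is visible to it.
    random.seed(1)
    SAFETY_BOUNDARY = 10.0   # constraint the process must stay below
    FEEDBACK_DELAY = 3       # time steps before a reading reaches the controller

    operating_point = 5.0
    readings = [operating_point] * FEEDBACK_DELAY   # queue of delayed measurements

    for t in range(30):
        # Efficiency pressure steadily pushes operations toward the boundary.
        operating_point += 0.5 + random.uniform(-0.2, 0.2)

        observed = readings.pop(0)                   # controller sees old data
        if observed > SAFETY_BOUNDARY - 1.0:         # margin the controller enforces
            operating_point -= 2.0                   # corrective control input

        readings.append(operating_point)
        status = "VIOLATION" if operating_point > SAFETY_BOUNDARY else "ok"
        print(f"t={t:2d}  actual={operating_point:5.2f}  observed={observed:5.2f}  {status}")

Making the boundary explicit (the constant above) is the control theorists' prescription; the authors' resilience argument, discussed next, is that the more useful investment is in the system's capacity to detect and recover once the drift is underway.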

Standing on the shoulders of the above sophisticated models, resilience engineering (RE) is proposed as a better way to think about safety.  According to this model, accidents “represent the breakdowns in the adaptations necessary to cope with the real world complexity.” (p. 83)  The authors use the Columbia space shuttle disaster to illustrate patterns of failure evident in complex systems: drift toward failure, past success as reason for continued confidence, fragmented problem-solving, ignoring new evidence and intra-organizational communication breakdowns.  To oppose or compensate for these patterns, RE proposes monitoring or enhancing other system properties including: buffering capacity, flexibility, margin and tolerance (which means replacing quick collapse with graceful degradation).  RE “focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment.” (p. 93)  In practice, that means detecting signs of increasing risk, having resources for safety available, and recognizing when and where to invest to offset risk.  It also requires focusing on organizational decision making, e.g., cross checks for risky decisions, the safety-production-efficiency balance and the reporting and disposition of safety concerns.  “Enhancing error tolerance, detection and recovery together produce safety.” (p. 26)

Operating at the Sharp End

An organization's sharp end is where practitioners apply their expertise in an effort to achieve the organization's goals.  The blunt end is where support functions, from administration to engineering, work.  The blunt end designs the system, the sharp end operates it.  Practitioner performance is affected by cognitive activities in three areas: activation of knowledge, the flow of attention and interactions among multiple goals.

The knowledge available to practitioners arrives as organized content.  Challenges include: the organization of that content may be poor, and the content itself may be incomplete or simply wrong.  Practitioner mental models may be inaccurate or incomplete without the practitioners realizing it, i.e., they may be poorly calibrated.  Knowledge may be inert, i.e., not accessed when it is needed.  Oversimplifications (heuristics) may work in some situations but produce errors in others and limit the practitioner's ability to account for uncertainties or conflicts that arise in individual cases.  The discussion of heuristics suggests Hollnagel, reviewed here.

“Mindset is about attention and its control.” (p. 114)  Attention is a limited resource.  Problems with maintaining effective attention include loss of situational awareness, in which the practitioner's mental model of events doesn't match the real world, and fixation, where the practitioner's initial assessment of a situation creates a going-forward bias against accepting discrepant data and a failure to trigger relevant inert knowledge.  Mindset seems similar to HRO mindfulness. (see Weick)

Goal conflict can arise from many sources including management policies, regulatory requirements, economic (cost) factors and risk of legal liability.  Decision making must consider goals (which may be implicit), values, costs and risks—which may be uncertain.  Normalization of deviance is a constant threat.  Decision makers may be held responsible for achieving a goal but lack the authority to do so.  The conflict between cost and safety may be subtle or unrecognized.  “Safety is not a concrete entity and the argument that one should always choose the safest path misrepresents the dilemmas that confront the practitioner.” (p. 139)  “[I]t is difficult for many organizations (particularly in regulated industries) to admit that goal conflicts and tradeoff decisions arise.” (p. 139)  Overall, the authors present a good discussion of goal conflict.

How Design Can Induce Error


The design of computerized devices intended to help practitioners can instead lead to greater risks of errors and incidents.  Specific causes of problems include clumsy automation, limited information visibility and mode errors. 

Automation is supposed to increase user effectiveness and efficiency.  However, clumsy automation creates situations where the user loses track of what the computer is set up to do, what it's doing and what it will do next.  If support systems are so flexible that users can't know all their possible configurations, they adopt simplifying strategies which may be inappropriate in some cases.  Clumsy automation leads to more (instead of less) cognitive work, diversion of user attention to the machine instead of the task, increased potential for new kinds of errors and the need for new user knowledge and judgments.  The machine effectively has its own model of the world, based on user inputs, data sensors and internal functioning, and passes that back to the user.

Machines often hide a mass of data behind a narrow keyhole of visibility into the system.  Successful design creates “a visible conceptual space meaningfully related to activities and constraints in a field of practice.” (p. 162)  In addition, “Effective representations highlight  'operationally interesting' changes for sequences of behavior . . .” (p. 167)  However, default displays typically do not make interesting events directly visible.

Mode errors occur when an operator initiates an action that would be appropriate if the machine were in mode A but, in fact, it's in mode B.  (This may be a man-machine problem but it's not the machine's fault.)  A machine can change modes based on situational and system factors in addition to operator input.  Operators have to maintain mode awareness, not an easy task when viewing a small, cluttered display that may not highlight current mode or mode changes.
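
A trivial, invented code sketch shows how little it takes to set this trap: the identical operator action is routed to different functions depending on a mode the machine may have changed on its own.  (The device and behaviors below are hypothetical, chosen only to illustrate the point.)

    # Toy illustration of a mode error.  Entirely hypothetical device and behavior.
    class InfusionPump:
        def __init__(self):
            self.mode = "RATE"          # the operator believes the pump is in RATE mode

        def auto_switch(self):
            # The machine changes mode on its own (e.g., after an alarm or timeout),
            # not in response to operator input.
            self.mode = "VOLUME"

        def press_up_arrow(self):
            # The same key press has a different effect in each mode.
            if self.mode == "RATE":
                return "increase infusion rate by 1 mL/hr"
            return "increase volume to be infused by 10 mL"

    pump = InfusionPump()
    print(pump.press_up_arrow())   # what the operator expects
    pump.auto_switch()             # unnoticed mode change
    print(pump.press_up_arrow())   # same action, very different outcome

Unless the display makes the current mode (and any automatic mode change) conspicuous, nothing in the operator's action sequence signals that anything has gone wrong.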

To cope with bad design “practitioners adapt information technology provided for them to the immediate tasks at hand in a locally pragmatic way, . . .” (p. 191)  They use system tailoring where they adapt the device, often by focusing on a feature set they consider useful and ignoring other machine capabilities.  They use task tailoring where they adapt strategies to accommodate constraints imposed by the new technology.  Both types of adaptation can lead to success or eventual failures. 

The authors suggest various countermeasures and design changes to address these problems. 

Reactions to Failure

Different approaches for analyzing accidents lead to different perspectives on human error. 

Hindsight bias is “the tendency for people to 'consistently exaggerate what could have been anticipated in foresight.'” (p. 15)  It reinforces the tendency to look for the human in the human error.  Operators are blamed for bad outcomes because they are available, tracking back to multiple contributing causes is difficult, most system performance is good and investigators tend to judge process quality by its outcome.  Outsiders tend to think operators knew more about their situation than they actually did.  Evaluating process instead of outcome is also problematic.  Process and outcome are loosely coupled, and what standards should be used for process evaluation?  Formal work descriptions “underestimate the dilemmas, interactions between constraints, goal conflicts, and tradeoffs present in the actual workplace.” (p. 208)  A suggested alternative approach is to ask what other practitioners would have done in the same situation and build a set of contrast cases.  “What we should not do, . . . is rely on putatively objective external evaluations . . . such as . . . court cases or other formal hearings.  Such processes in fact institutionalize and legitimate the hindsight bias . . . leading to blame and a focus on individual actors at the expense of a system view.” (pp. 213-214)

Distancing through differencing is another risk.  In this practice, reviewers focus on differences between the context surrounding an accident and their own circumstance.  Blaming individuals reinforces the belief that there are no lessons to be learned for other organizations.  If human error is local and individual (as opposed to systemic), then sanctions, exhortations to follow the procedures and remedial training are sufficient fixes.  There is a decent discussion of TMI here, where, in the authors' opinion, the initial sense of fundamental surprise and need for socio-technical fixes was soon replaced by a search for local, technologically-focused solutions.
      
There is often pressure to hold people accountable after incidents or accidents.  One answer is a “just culture,” which views incidents as system learning opportunities but also draws a line between acceptable and unacceptable behavior.  Since the “line” is an attribution, the key question for any organization is who gets to draw it.  Another challenge is defining the discretionary space where individuals alone have the authority to decide how to proceed.  There is more on just culture but this is all (or mostly) Dekker. (see our Just Culture commentary here)

The authors' recommendations for analyzing errors and improving safety can be summed up as follows: recognize that human error is an attribution; pursue second stories that reveal the multiple, systemic contributors to failure; avoid hindsight bias; understand how work really gets done; search for systemic vulnerabilities; study how practice creates safety; search for underlying patterns; examine how change will produce new vulnerabilities; use technology to enhance human expertise; and tame complexity. (p. 239)  “Safety is created at the sharp end as practitioners interact with hazardous processes . . . using the available tools and resources.” (p. 243)

Our Perspective

This is a book about organizational characteristics and socio-technical systems.  Recommendations and advice are aimed at organizational policy makers and incident investigators.  The discussion of a “just culture” is the only time culture is discussed in detail although safety culture is mentioned in passing in the HRO write-up.

Our first problem with the book is its repeated references to medicine, aviation, aircraft carrier operations and nuclear power plants as complex systems.***  Although medicine is definitely complex and aviation (including air traffic control) possibly is, carrier operations and nuclear power plants are simply complicated.  While carrier and nuclear personnel have to make some adaptations on the fly, they do not face sudden, disruptive changes in their technologies or operating environments and they are not exposed to cutthroat competition.  Their operations are tightly coordinated but, where possible, by design more loosely coupled to facilitate recovery if operations start to go sour.  In addition, calling nuclear power operations complex perpetuates the myth that nuclear is “unique and special” and thus merits some special place in the pantheon of industry.  It isn't and it doesn't.

Our second problem relates to the authors' recasting of the nature of human error.  We decry the rush to judgment after negative events, particularly a search limited to identifying culpable humans.  The search for bad apples or outright criminals satisfies society's perceived need to bring someone to justice and the corporate system's desire to appear to fix things through management exhortations and training without really admitting systemic problems or changing anything substantive, e.g., the management incentive plan.  The authors' plea for more systemic analysis is thus welcome.

But they push the pendulum too far in the opposite direction.  They appear to advocate replacing all human errors (except for gross negligence, willful violations or sabotage) with systemic explanations, aka rationalizations.  What is never mentioned is that medical errors lead to tens of thousands of preventable deaths per year.****  In contrast, U.S. commercial aviation has not experienced over a hundred fatalities (excluding 9/11) since 1996; carriers and nuclear power plants experience accidents, but there are few fatalities.  At worst, this book is a denial that real human errors (including bad decisions, slip-ups, impairments, cover-ups) occur and a rationalization of medical mistakes caused by arrogance, incompetence, class structure and lack of accountability.

This is a dense book, 250 pages of small print, with an index that is nearly useless.  Pressures (most likely cost and schedule) have apparently pushed publishing to the system boundary for copy editing—there are extra, missing and wrong words throughout the text.

This 2010 second edition updates the original 1994 monograph.  Many of the original ideas have been fleshed out elsewhere by the authors (primarily Dekker) and others.  Some references, e.g., Hollnagel, Perrow and the HRO school, should be read in their original form. 


*  D.D. Woods, S. Dekker, R. Cook, L. Johannesen and N. Sarter, Behind Human Error, 2d ed.  (Ashgate, Burlington, VT: 2010).  Thanks to Bill Mullins for bringing this book to our attention.

**  There is considerable overlap of the perspectives of the authors and the control theorists (Leveson and Rasmussen are cited in the book).  As an aside, Dekker was a dissertation advisor for one of Leveson's MIT students.

***  The authors' different backgrounds contribute to this mash-up.  Cook is a physician, Dekker is a pilot and some of Woods' cited publications refer to nuclear power (and aviation).

****  M. Makary, “How to Stop Hospitals From Killing Us,” Wall Street Journal online (Sept. 21, 2012).  Retrieved July 4, 2013.