
Monday, October 16, 2017

Nuclear Safety Culture: A Suggestion for Integrating “Just Culture” Concepts

All of you have heard of “Just Culture” (JC).  At heart, it is an attitude toward investigating and explaining errors that occur in organizations in terms of “why” an error occurred, including systemic reasons, rather than focusing on identifying someone to blame.  How might JC be applied in practice?  A paper* by Shem Malmquist describes how JC concepts could be used in the early phases of an investigation to mitigate cognitive bias on the part of the investigators.

The author asserts that “cognitive bias has a high probability of occurring, and becoming integrated into the investigator’s subconscious during the early stages of an accident investigation.”

He recommends that, from the get-go, investigators categorize each pertinent action that preceded the event as an error (unintentional act), at-risk behavior (intentional but for a good reason) or reckless behavior (conscious disregard of a substantial risk or intentional rule violation). (p. 5)  For errors or at-risk actions, the investigator should analyze the system, e.g., policies, procedures, training or equipment, for deficiencies; for reckless behavior, the investigator should determine what system components, if any, broke down and allowed the behavior to occur. (p. 12)  Individuals should still be held responsible for deliberate actions that resulted in negative consequences.
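This categorization step amounts to a simple decision rule.  A minimal sketch of that logic (ours, not the paper’s; the names and return strings are illustrative):

```python
from enum import Enum, auto

class Behavior(Enum):
    ERROR = auto()      # unintentional act
    AT_RISK = auto()    # intentional act, but done for a reason that seemed good at the time
    RECKLESS = auto()   # conscious disregard of a substantial risk or intentional rule violation

def investigation_focus(behavior: Behavior) -> str:
    """Route the investigation per the Just Culture categorization described above."""
    if behavior in (Behavior.ERROR, Behavior.AT_RISK):
        # Look for systemic deficiencies: policies, procedures, training, equipment
        return "analyze the system for deficiencies"
    # Reckless: determine which system components, if any, broke down and allowed it;
    # the individual is still held responsible for the deliberate act
    return "identify failed system components and hold the individual accountable"

print(investigation_focus(Behavior.AT_RISK))
```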

Adding this step to a traditional event chain model will enrich the investigation and help keep investigators from going down the rabbit hole of following chains suggested by their own initial biases.

Because JC is added to traditional investigation techniques, Malmquist believes it might be more readily accepted than other approaches for conducting more systemic investigations, e.g., Leveson’s Systems-Theoretic Accident Model and Processes (STAMP).  Such approaches are complex, require lots of data, and can be daunting for even experienced investigators to implement.  In our opinion, these models usually necessitate hiring model experts who may be the only ones who can interpret the ultimate findings—sort of like an ancient priest reading the entrails of a sacrificial animal.  Snide comment aside, we admire Leveson’s work and reviewed it in our Nov. 11, 2013 post.

Our Perspective

This paper is not some great new insight into accident investigation but it does describe an incremental step that could make traditional investigation methods more expansive in outlook and robust in their findings.

The paper also provides a simple introduction to the works of authors who cover JC or decision-making biases.  The former category includes Reason and Dekker and the latter one Kahneman, all of whom we have reviewed here at Safetymatters.  For Reason, see our Nov. 3, 2014 post; for Dekker, see our Aug. 3, 2009 and Dec. 5, 2012 posts; for Kahneman, see our Nov. 4, 2011 and Dec. 18, 2013 posts.

Bottom line: The parts describing and justifying the author’s proposed approach are worth reading.  You are already familiar with much of the contextual material he includes.  


*  S. Malmquist, “Just Culture Accident Model – JCAM” (June 2017).

Friday, January 27, 2017

Leadership, Decisions, Systems Thinking and Nuclear Safety Culture

[Figure: AcciMap excerpt]
We recently read a paper* that echoes some of the themes we emphasize on Safetymatters, viz., leadership, decisions and a systems view.  Following is an excerpt from the abstract:

“Leadership is progressively being recognized as a key** factor in supporting successful performance across a range of domains. . . . the decisions and actions that characterize safety leadership thus become important emergent properties in the prevention of incidents, which should be considered within the context of the broader organizational system and not merely constrained to understanding events or conditions that shape performance at the ‘sharp end’.”  [emphasis added]

The authors go on to analyze decisions and actions after a mining incident (landslide) using a combination of three different schemes: Rasmussen’s Risk Management Framework (RMF) and corresponding AcciMap, and the Critical Decision Method (CDM).

The RMF describes work systems as comprised of various levels and argues that safety performance is affected by decisions and actions at all levels from politicians in the external environment down through company executives and managers and finally to individual workers.  Rasmussen’s AcciMap is an expansive causal diagram for an accident or incident that displays the contributions (or omissions) at each level in the RMF and their connections.

CDM uses semi-structured interviews to obtain information about how individuals formulate their decisions, including context such as background knowledge and immediate influencing factors.  Consistent with the RMF, case study interviews were conducted with individuals at different organizational levels.  CDM data were used to construct the AcciMap.

We won’t go into the details of the analysis but it identified over a dozen key decisions made at different organizational levels before and during the incident; most were connected to at least one other key decision.  The AcciMap illustrates decisions and communications across multiple levels and thus provides a useful picture of how an organization anticipates and responds to an unusual situation.
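To make the structure concrete, an AcciMap can be thought of as a directed graph whose nodes are decisions (or omissions) tagged with an RMF level and whose edges are the influences between them.  A minimal sketch, with invented levels and decisions rather than the paper’s actual case study data:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Decision:
    level: str                              # RMF level where the decision was made
    description: str
    influences: List["Decision"] = field(default_factory=list)  # downstream decisions it shaped

# Toy AcciMap fragment (hypothetical decisions, higher levels down to the sharp end)
suspend = Decision("Company", "Suspend operations in the affected area")
monitor = Decision("Management", "Increase monitoring of ground movement")
evacuate = Decision("Supervisors/Workers", "Evacuate crews when movement accelerates")

suspend.influences.append(monitor)
monitor.influences.append(evacuate)

def print_accimap(decision: Decision, depth: int = 0) -> None:
    """Walk the causal links from a higher-level decision down through the levels."""
    print("  " * depth + f"[{decision.level}] {decision.description}")
    for d in decision.influences:
        print_accimap(d, depth + 1)

print_accimap(suspend)
```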

Our Perspective

The authors argue, and we agree, that this type of analysis provides greater detail and insight into the performance of an organization’s safety management system than traditional accident investigations (especially those focused on finding someone to blame).

This article does not specifically discuss culture.  But the body of decisions an organization produces is the strongest evidence and most visible artifact of its culture.  Organizational decisions are far more important than responses to surveys or interviews where people can report what they believe (or hope) the culture is, or what they think their audience wants to hear.

We like that RMF and AcciMap are agnostic: they can be used to analyze either “what went wrong” or “what went right” scenarios.  (The case study was in the latter category because no one was hurt in the incident.)  If an assessor is looking at a sample of decisions to infer a nuclear organization’s culture, most of those decisions will have had positive (or at least no negative) consequences.

The authors are Australian academics but this short (8 pages total) paper is quite readable and a good introduction to CDM and Rasmussen’s constructs.  The references include people whose work we have positively reviewed on Safetymatters, including Dekker, Hollnagel, Leveson and Reason.

Bottom line: There is nothing about culture or nuclear here, but the overall message reinforces our beliefs about how to think about Nuclear Safety Culture.


*  S-L. Donovan, P.M. Salmon and M.G. Lenné, “The leading edge: A systems thinking methodology for assessing safety leadership,” Procedia Manufacturing 3 (2015), pp. 6644–6651.  Available at sciencedirect.com; retrieved Jan. 19, 2017.

**  Note they do not say “one and only” or even “most important.”

Tuesday, May 26, 2015

Safety Culture “State of the Art” in 2002 per NUREG-1756

Here’s a trip down memory lane.  Back in 2002 a report* on the “state of the art” in safety culture (SC) thinking, research and regulation was prepared for the NRC Advisory Committee on Reactor Safeguards.  This post looks at some of the major observations of the 2002 report and compares them with what we believe is important today.

The report’s Abstract provides a clear summary of the report’s perspective:  “There is a widespread belief that safety culture is an important contributor to the safety of operations. . . . The commonly accepted attributes of safety culture include good organizational communication, good organizational learning, and senior management commitment to safety. . . . The role of regulatory bodies in fostering strong safety cultures remains unclear, and additional work is required to define the essential attributes of safety culture and to identify reliable performance indicators.” (p. iii) 

General Observations on Safety Performance 


A couple of quotes included in the report reflect views on how safety performance is managed or influenced.

 “"The traditional approach to safety . . . has been retrospective, built on precedents. Because it is necessary, it is easy to think it is sufficient.  It involves, first, a search for the primary (or "root") cause of a specific accident, a decision on whether the cause was an unsafe act or an unsafe condition, and finally the supposed prevention of a recurrence by devising a regulation if an unsafe act,** or a technical solution if an unsafe condition." . . . [This approach] has serious shortcomings.  Specifically, ". . . resources are diverted to prevent the accident that has happened rather than the one most likely to happen."” (p. 24)

“"There has been little direct research on the organizational factors that make for a good safety culture. However, there is an extensive literature if we make the indirect assumption that a relatively low accident plant must have a relatively good safety culture." The proponents of safety culture as a determinant of operational safety in the nuclear power industry rely, at least to some degree, on that indirect assumption.” (p. 37) 

Plenty of people today behave in accordance with the first observation and believe (or act as if they believe) the second one.  Both contribute to the nuclear industry’s unwillingness to consider new ways of thinking about how safe performance actually occurs.

Decision Making, Goal Conflict and the Reward System

Decision making processes, recognition of goal conflicts and an organization’s reward system are important aspects of SC and the report addressed them to varying degrees.

One author referenced had a contemporary view of decision making, noting that “in complex and ill-structured risk situations, decisionmakers are faced not only with the matter of risk, but also with fundamental uncertainty characterized by incompleteness of knowledge.” (p. 43)  That’s true in great tragedies like Fukushima and lesser unfortunate outcomes like the San Onofre steam generators.

Goal conflict was mentioned: “Managers should take opportunities to show that they will put safety concerns ahead of power production if circumstances warrant.” (p. 7)

Rewards should promote good safety practices (p. 6) and be provided for identifying safety issues. (p. 37)  However, there is no mention of the executive compensation system.  As we have argued ad nauseam, these systems often pay more for production than for safety.

The Role of the Regulator


“The regulatory dilemma is that the elements that are important to safety culture are difficult, if not impossible, to separate from the management of the organization.  [However,] historically, the NRC has been reluctant to regulate management functions in any direct way.” (pp. 37-38)  “Rather, the NRC " . . . infers licensee organization management performance based on a comprehensive review of inspection findings, licensee amendments, event reports, enforcement history, and performance indicators."” (p. 41)  From this starting point, we now have the current situation where the NRC has promulgated its SC Policy Statement and practices de facto SC regulation using the highly reliable “bring me another rock” method.

The Importance of Context when Errors Occur 


There are hints of modern thinking in the report.  It contains an extended summary of Reason’s work in Human Error.  The role of latent conditions, human error as consequence instead of cause, the obvious interaction between producers and production, and the “non-event” of safe operations are all mentioned. (p. 15)  However, a “just culture” or other more nuanced views of the context in which safety performance occurs had yet to be developed.

One author cited described “the paradox that culture can act simultaneously as a precondition for safe operations and an incubator for hazards.” (p. 43)  We see that in Reason and also in Hollnagel and Dekker: people going about business as usual with usually successful results but, on some occasions, with unfortunate outcomes.

Our Perspective

The report’s author provided a good logic model for getting from SC attributes to identifying useful risk metrics, i.e., from SC to one or more probabilistic risk assessment (PRA) parameters.  (pp. 18-20)  But none of the research reviewed completed all the steps in the model. (p. 36)  He concludes “What is not clear is the mechanism by which attitudes, or safety culture, affect the safety of operations.” (p. 43)  We are still talking about that mechanism today.   

But some things have changed.  For example, probabilistic thinking has achieved greater penetration and is no longer the sole province of the PRA types.  It’s accepted that Black Swans can occur (but not at our plant).

Bottom line: Every student of SC should take a look at this.  It includes a good survey of 20th century SC-related research in the nuclear industry and it’s part of our basic history.

“Those who cannot remember the past are condemned to repeat it.” — George Santayana (1863-1952)


*  J.N. Sorensen, “Safety Culture: A Survey of the State-of-the-Art,” NUREG-1756 (Jan. 2002).  ADAMS ML020520006.  (Disclosure: I worked alongside the author on a major nuclear power plant litigation project in the 1980s.  He was thoughtful and thorough, qualities that are apparent in this report.)

**  We would add “or reinforcing an existing regulation through stronger procedures, training or oversight.”

Monday, November 3, 2014

A Life In Error by James Reason



Most of us associate psychologist James Reason with the “Swiss Cheese Model” of defense in depth or possibly the notion of a “just culture.”  But his career has been about more than two ideas; he has spent his professional life studying errors, their causes and contexts.  A Life In Error* is an academic memoir, recounting his study of errors starting with the individual and ending up with the organization (the “system”), including its safety culture (SC).  This post summarizes relevant portions of the book and provides our perspective.  It is going to read like a sub-titled movie on fast-forward, but a lot of particulars are packed into this short (124 pgs.) book.

Slips and Mistakes 

People make plans and take action, and consequences follow.  Errors occur when the intended goals are not achieved.  The plan may be adequate but the execution faulty because of slips (absent-mindedness) or trips (clumsy actions).  A plan that was inadequate to begin with is a mistake, which is usually more subtle than a slip and may go undetected for long periods of time if no obviously bad consequences occur. (pp. 10-12)  A mistake is a creation of higher-level mental activity than a slip.  Both slips and mistakes can take “strong but wrong” forms, where schema** that were effective in prior situations are selected even though they are not appropriate in the current situation.

Absent-minded slips can occur from misapplied competence where a planned routine is sidetracked into an unplanned one.  Such diversions can occur, for instance, when one’s attention is unexpectedly diverted just as one reaches a decision point and multiple schema are both available and actively vying to be chosen. (pp. 21-25)  Reason’s recipe for absent-minded errors is one part cognitive under-specification, e.g., insufficient knowledge, and one part the existence of an inappropriate response primed by prior, recent use and the situational conditions. (p. 49) 

Planning Biases 

The planning activity is subject to multiple biases.  An individual planner’s database may be incomplete or shaped by past experiences rather than future uncertainties, with greater emphasis on past successes than failures.  Planners can underestimate the influence of chance, overweight data that is emotionally charged, be overly influenced by their theories, misinterpret sample data or miss covariations, suffer hindsight bias or be overconfident.***  Once a plan is prepared, planners may focus only on confirmatory data and are usually resistant to changing the plan.  Planning in a group is subject to “groupthink” problems including overconfidence, rationalization, self-censorship and an illusion of unanimity.  (pp. 56-62)

Errors and Violations 

Violations are deliberate acts to break rules or procedures, although bad outcomes are not generally intended.  Violations arise from various motivational factors including the organizational culture.  Types of violations include corner-cutting to avoid clumsy procedures, necessary violations to get the job done because the procedures are unworkable, adjustments to satisfy conflicting goals and one-off actions (such as turning off a safety system) when faced with exceptional circumstances.  Violators perform a type of cost:benefit analysis biased by the fact that benefits are likely immediate while costs, if they occur, are usually somewhere in the future.  In Reason’s view, the proper course for the organization is to increase the perceived benefits of compliance not increase the costs (penalties) for violations.  (There is a hint of the “just culture” here.) 

Organizational Accidents 

Major accidents (TMI, Chernobyl, Challenger) have three common characteristics: contributing factors that were latent in the system, multiple levels of defense, and an unforeseen combination of latent factors and active failures (errors and/or violations) that defeated the defenses.  This is the well-known Swiss Cheese Model with the active failures opening short-lived holes and latent failures creating longer-lasting but unperceived holes.
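The Swiss Cheese Model can be illustrated with a toy Monte Carlo sketch (the layer count and hole probabilities below are arbitrary, chosen only to show the effect of a latent, always-open hole):

```python
import random

def hazard_penetrates(n_layers: int = 4, p_hole: float = 0.1, latent_layers: int = 0) -> bool:
    """One hazard challenges the defenses; latent_layers are already breached."""
    active_layers = n_layers - latent_layers
    return all(random.random() < p_hole for _ in range(active_layers))

def accident_rate(trials: int = 200_000, **kw) -> float:
    return sum(hazard_penetrates(**kw) for _ in range(trials)) / trials

# All holes short-lived and independent: an accident needs every layer to line up
print(accident_rate())                   # ~0.0001 (0.1 ** 4)

# One latent condition holds a hole open in one layer all the time
print(accident_rate(latent_layers=1))    # ~0.001, an order of magnitude more likely
```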

Organizational accidents are low frequency, high severity events with causes that may date back years.  In contrast, individual accidents are more frequent but have limited consequences; they arise from slips, trips and lapses.  This is why organizations can have a good industrial accident record while they are on the road to a large-scale disaster, e.g., BP at Texas City. 

Organizational Culture 

Certain industries, including nuclear power, have defenses-in-depth distributed throughout the system but are vulnerable to something that is equally widespread.  According to Reason, “The most likely candidate is safety culture.  It can affect all elements in a system for good or ill.” (p. 81)  An inadequate SC can undermine the Swiss Cheese Model: there will be more active failures at the “sharp end”; more latent conditions created and sustained by management actions and policies, e.g., poor maintenance, inadequate equipment or downgrading training; and the organization will be reluctant to deal proactively with known problems. (pp. 82-83)

Reason describes a “cluster of organizational pathologies” that make an adverse system event more likely: blaming sharp-end operators, denying the existence of systemic inadequacies, and a narrow pursuit of production and financial goals.  He goes on to list some of the drivers of blame and denial.  The list includes: accepting human error as the root cause of an event; the hindsight bias; evaluating prior decisions based on their outcomes; shooting the messenger; belief in a bad apple but not a bad barrel (the system); failure to learn; a climate of silence; workarounds that compensate for systemic inadequacies; and normalization of deviance. (pp. 86-92)  Whew!

Our Perspective 

Reason teaches us that the essence of understanding errors is nuance.  At one end of the spectrum, some errors are totally under the purview of the individual, at the other end they reside in the realm of the system.  The biases and issues described by Reason are familiar to Safetymatters readers and echo in the work of Dekker, Hollnagel, Kahneman and others.  We have been pounding the drum for a long time on the inadequacies of safety analyses that ignore systemic issues and corrective actions that are limited to individuals (e.g., more training and oversight, improved procedures and clearer expectations).

The book is densely packed with the work of a career.  One could easily use the contents to develop a SC assessment or self-assessment.  We did not report on the chapters covering research into absent-mindedness, Freud and medical errors (Reason’s current interest) but they are certainly worth reading.

Reason says this book is his valedictory: “I have nothing new to say and I’m well past my prime.” (p. 122)  We hope not.


*  J. Reason, A Life In Error: From Little Slips to Big Disasters (Burlington, VT: Ashgate, 2013).

**  Knowledge structures in long-term memory. (p. 24)

***  This will ring familiar to readers of Daniel Kahneman.  See our Dec. 18, 2013 post on Kahneman’s Thinking, Fast and Slow.

Monday, October 13, 2014

Systems Thinking in Air Traffic Management


A recent white paper* presents ten principles to consider when thinking about a complex socio-technical system, specifically European Air Traffic Management (ATM).  We review the principles below, highlighting aspects that might provide some insights for nuclear power plant operations and safety culture (SC).

Before we start, we should note that ATM is truly a complex** system.  Decisions involving safety and efficiency occur on a continuous basis.  There is always some difference between work-as-imagined and work-as-done.

In contrast, we have argued that a nuclear plant is a complicated system but it has some elements of complexity.  To the extent complexity exists, treating nuclear like a complicated machine via “analysing components using reductionist methods; identifying ‘root causes’ of problems or events; thinking in a linear and short-term way; . . . [or] making changes at the component level” is inadequate. (p. 5)  In other words, systemic factors may contribute to observed performance variability and frustrate efforts to achieve the goal in nuclear of eliminating all differences between work-as-planned and work-as-done.

Principles 1-3 relate to the view of people within systems – our view from the outside and their view from the inside.

1. Field Expert Involvement
“To understand work-as-done and improve how things really work, involve those who do the work.” (p. 8)
2. Local Rationality
“People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time.” (p. 10)
3. Just Culture
“Adopt a mindset of openness, trust and fairness. Understand actions in context, and adopt systems language that is non-judgmental and non-blaming.” (p. 12)

Nuclear is pretty good at getting line personnel involved.  Adages such as “Operations owns the plant” are useful to the extent they are true.  Cross-functional teams can include operators or maintenance personnel.  An effective corrective action program (CAP) that allows workers to identify and report problems with equipment, procedures, etc. is good; an evaluation and resolution process that involves members from the same class of workers is even better.  Having someone involved in an incident or near-miss go around to the tailgates and classes to share the lessons learned can be convincing.

But when something unexpected or bad happens, nuclear tends to spend too much time looking for the malfunctioning component (usually human).   “The assumption is that if the person would try harder, pay closer attention, do exactly what was prescribed, then things would go well. . . . [But a] focus on components becomes less effective with increasing system complexity and interactivity.” (p. 4)  An outside-in approach ignores the context in which the human performed, the information and time available, the competition for focus of attention, the physical conditions of the work, fatigue, etc.  Instead of insight into system nuances, the result is often limited to more training, supervision or discipline.

The notion of a “just culture” comes from James Reason.  It’s a culture where employees are not punished for their actions, omissions or decisions that are commensurate with their experience and training, but where gross negligence, willful violations and destructive acts are not tolerated.

Principles 4 and 5 relate to the system conditions and context that affect work.

4. Demand and Pressure
“Demands and pressures relating to efficiency and capacity have a fundamental effect on performance.” (p. 14)
5. Resources & Constraints

“Success depends on adequate resources and appropriate constraints.” (p. 16)

Fluctuating demand creates far more varied and unpredictable problems for ATM than it does in nuclear.  However, in nuclear the potential for goal conflicts between production, cost and safety is always present.  The problem arises from acting as if these conflicts don’t exist.

ATM has to “cope with variable demand and variable resources,” a situation that is also different from nuclear with its base load plants and established resource budgets.  The authors opine that for ATM, “a rigid regulatory environment destroys the capacity to adapt constantly to the environment.” (p. 2)  Most of us think of nuclear as quite constrained by procedures, rules, policies, regulations, etc., but an important lesson from Fukushima was that under unforeseen conditions, the organization must be able to adapt according to local, knowledge-based decisions.  Even the NRC recognizes that “flexibility may be necessary when responding to off-normal conditions.”***

Principles 6 through 10 concern the nature of system behavior, with 9 and 10 more concerned with system outcomes.  These do not have specific implications for SC other than keeping an open mind and being alert to systemic issues, e.g., complacency, drift or emergent behavior.

6. Interactions and Flows
“Understand system performance in the context of the flows of activities and functions, as well as the interactions that comprise these flows.” (p. 18)
7. Trade-Offs
“People have to apply trade-offs in order to resolve goal conflicts and to cope with the complexity of the system and the uncertainty of the environment.” (p. 20)
8. Performance Variability
“Understand the variability of system conditions and behaviour.  Identify wanted and unwanted variability in light of the system’s need and tolerance for variability.” (p. 22)
9. Emergence
“System behaviour in complex systems is often emergent; it cannot be reduced to the behaviour of components and is often not as expected.” (p. 24)
10. Equivalence
“Success and failure come from the same source – ordinary work.” (p. 26)

Work flow certainly varies in ATM but is relatively well-understood in nuclear.  There’s really not much more to say on that topic.

Trade-offs occur in decision making in any context where more than one goal exists.  One useful mental model for conceptualizing trade-offs is Hollnagel’s efficiency-thoroughness construct, basically doing things quickly (to meet the production and cost goals) vs. doing things well (to meet the quality and possibly safety goals).  We reviewed his work on Jan. 3, 2013.

Performance variability occurs in all systems, including nuclear, but the outcomes are usually successful because a system has a certain range of tolerance and a certain capacity for resilience.  Performance drift happens slowly, and can be difficult to identify from the inside.  Dekker’s work speaks to this and we reviewed it on Dec. 5, 2012.

Nuclear is not fully complex but surprises do happen, some of them not caused by component failure.  Emergence (problems that arise from new or unforeseen system interactions) is more likely to occur following the implementation of new technical systems.  We discussed this possibility in a July 6, 2013 post on a book by Woods, Dekker et al.

Equivalence means that work that results in both good and bad outcomes starts out the same way, with people (saboteurs excepted) trying to be successful.  When bad things happen, we should cast a wide net in looking for different factors, including systemic ones, that aligned (like Swiss cheese slices) in the subject case.

The white paper also includes several real and hypothetical case studies illustrating the application of the principles to understanding safety performance challenges.

Our Perspective 

The authors draw on a familiar cast of characters, including Dekker, Hollnagel, Leveson and Reason.  We have posted about all these folks, just click on their label in the right hand column.

The principles are intended to help us form a more insightful mental model of a system under consideration, one that includes non-linear cause and effect relationships, and the possibility of emergent behavior.  The white paper is not a “must read” but may stimulate useful thinking about the nature of the nuclear operating organization.


*  European Organisation for the Safety of Air Navigation (EUROCONTROL), “Systems Thinking for Safety: Ten Principles” (Aug. 2014).  Thanks to Bill Mullins for bringing this white paper to our attention.

**  “[C]omplex systems involve large numbers of interacting elements and are typically highly dynamic and constantly changing with changes in conditions. Their cause-effect relations are non-linear; small changes can produce disproportionately large effects. Effects usually have multiple causes, though causes may not be traceable and are socially constructed.” (pp. 4-5)

Also see our Oct. 14, 2013 discussion of the California Independent System Operator for another example of a complex system.

***  “Work Processes,” NRC Safety Culture Trait Talk, no. 2 (July 2014), p. 1.  ADAMS ML14203A391.  Retrieved Oct. 8, 2014

Wednesday, September 10, 2014

A Safety Culture Guide for Regulators

This paper* was referenced in a safety culture (SC) presentation we recently reviewed.  It was prepared for Canadian offshore oil industry regulators.  Although not nuclear oriented, it’s a good introduction to SC basics, the different methods for evaluating SC and possible approaches to regulating SC.  We’ll summarize the paper then provide our perspective on it.  The authors probably did not invent anything other than the analysis discussed below but they used a decent set of references and picked appropriate points to highlight.

Introduction to SC and its Importance

 
The paper provides some background on SC, its origins and definition, then covers the Schein three-tier model of culture and the difference between SC and safety climate.  The last topic is covered concisely and clearly: “. . . safety climate is an outward manifestation of culture. Therefore, safety culture includes safety climate, but safety culture uniquely includes shared values about risk and safety.” (p. 11)  SC attributes (from the Canadian Nuclear Safety Commission) are described.  Under attributes, the authors stress one of our basic beliefs, viz., “The importance of safety is made clear by the decisions managers make and how they allocate resources.” (p. 12)  The authors also summarize the characteristics of High Reliability Organizations, Low Accident Organizations, and James Reason’s model of SC and symptoms of poor SC.

The chapter on SC as a causal factor in accidents contains an interesting original analysis.  The authors reviewed reports on 17 offshore or petroleum related accidents (ranging from helicopter crashes to oil rig explosions) and determined for each accident which of four negative SC factors (Normalization of deviance, Tolerance of inadequate systems and resources, Complacency, Work pressure) were present.  The number of negative SC factors per accident ranged from 0 (three instances) to 4 (also three instances, including two familiar to Safetymatters readers: BP Texas City and Deepwater Horizon).  The negative factor that appeared in the most accidents was Tolerance of inadequate systems and resources (10) and the least was Work pressure (4).
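The authors’ analysis is essentially a presence/absence tabulation: code each accident report against the four negative SC factors, then read off the two marginals (factors per accident, accidents per factor).  A sketch of that bookkeeping, with invented placeholder rows rather than the authors’ actual coding of the 17 reports:

```python
FACTORS = [
    "Normalization of deviance",
    "Tolerance of inadequate systems and resources",
    "Complacency",
    "Work pressure",
]

# Placeholder coding of a few accidents (hypothetical, not the authors' data)
accidents = {
    "Accident A": {"Normalization of deviance", "Complacency"},
    "Accident B": set(),                  # no negative SC factors identified
    "Accident C": set(FACTORS),           # all four factors present
}

# Marginal 1: number of negative SC factors per accident (the paper's 0-to-4 range)
per_accident = {name: len(found) for name, found in accidents.items()}

# Marginal 2: number of accidents in which each factor appears
per_factor = {f: sum(f in found for found in accidents.values()) for f in FACTORS}

print(per_accident)
print(per_factor)
```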

Assessing SC

 
The authors describe different SC assessment methods (questionnaires, interviews, focus groups, observations and document analysis) and cover the strengths and weaknesses of each method.  The authors note that no single method provides a comprehensive SC assessment and they recommend a multi-method approach.  This is familiar ground for Safetymatters readers; for other related posts, click on the “Assessment” label in the right hand column.

A couple of highlights stand out.  Under observations the authors urge caution:  “The fact that people are being observed is likely to influence their behaviour [the well-known Hawthorne Effect] so the results need to be treated with caution. The concrete nature of observations can result in too much weight being placed on the results of the observation versus other methods.” (p. 37)  A strength of document analysis is that it can show how (and how well) the organization identifies and corrects its problems, another key artifact in our view.

Influencing SC

This chapter covers leadership and the regulator’s role.  The section on leadership is well-trod ground so we won’t dwell on it.  It is a major (but in our opinion not the only) internal factor that can influence the evolution of SC.  The statement that “Leaders also shape the safety culture through the allocation of resources” (p. 42) is worth repeating.

The section on regulatory influence is more informative and describes three methods: the regulator’s practices, promotion of SC, and enforcement of SC regulations.  Practices refer to the ways the regulator goes about its inspection and enforcement activities with licensees.  For example, the regulator can promote organizational learning by requiring licensees to have effective incident investigation systems and monitoring how effectively such systems are used in practice. (p. 44)  In the U.S. the NRC constantly reinforces SC’s importance and, through its SC Policy Statement, the expectation that licensees will strive for a strong SC.

Promoting SC can occur through research, education and direct provision of SC-related services.  Regulators in other countries conduct their own surveys of industry personnel to appraise safety climate or they assess an organization’s SC and report their findings to the regulated entity.**  (pp. 45-46)  The NRC both supports and cooperates with industry groups on SC research and sponsors the Regulatory Information Conference (which has a SC module).

Regulation of SC means just what it says.  The authors point out that direct regulation in the offshore industry is controversial. (p. 47)  Such controversy notwithstanding, Norway has developed  regulations requiring offshore companies to promote a positive SC.  Norway’s experience has shown that SC regulations may be misinterpreted or result in unintended consequences. (pp. 48-50)  In the nuclear space, regulation of SC is a popular topic outside the U.S.; the IAEA even has a document describing how to go about it, which we reviewed on May 15, 2013.  More formal regulatory oversight of SC is being developed in Romania and Belgium.  We reported on the former on April 21, 2014 and the latter on June 23, 2014.

Our Perspective

 
This paper is written by academics but intended for a more general audience; it is easy reading.  The authors score points with us when they say: “Importantly, safety culture moves the focus beyond what happened to offer a potential explanation of why it happened.” (p. 7)  Important factors such as management decision making and work backlogs are mentioned.  The importance of an effective CAP is hinted at.

The paper does have some holes.  Most importantly, it limits the discussion on influencing SC to leadership and regulatory behavior.  There are many other factors that can affect an organization’s SC including existing management systems; the corporate owner’s culture, goals, priorities and policies; market factors or economic regulators; and political pressure.  The organization’s reward system is referred to multiple times but the focus appears to be on lower-level personnel; the management compensation scheme is not mentioned.

Bottom line: This paper is a good introduction to SC attributes, assessments and regulation.


*  M. Fleming and N. Scott, “A Regulator’s Guide to Safety Culture and Leadership” (no date).

**  No regulations exist in these cases; the regulator assesses SC and then uses its influence and persuasion to affect regulated entity behavior.

Monday, November 11, 2013

Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson

In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory.  The book presents the criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics.  It should be noted the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.

Part I
     
Part I sets the stage for a new safety paradigm.  Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6)   Traditional accident analysis techniques are no longer sufficient.  They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10)  Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).**   We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.

Event-based models are simply inadequate.  There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it.  The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24)  Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19)  In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)

Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors.  “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination.  When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent.  This dependence may be related to common systemic factors that do not appear in an event chain.  Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35)  “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36) 
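Leveson’s point about multiplying probabilities can be made with a couple of lines of arithmetic.  The numbers below are arbitrary; they simply show how a common systemic factor (say, degraded maintenance) inflates the joint probability well beyond the “independent” estimate:

```python
# Two failures, each with nominal probability 1e-3
p_a = p_b = 1e-3

# Assuming independence, the joint probability looks reassuringly tiny
p_joint_independent = p_a * p_b                       # 1.0e-06

# Now suppose a common systemic condition is present 1% of the time and raises
# each failure probability to 5e-2 while it persists
p_common = 0.01
p_fail_given_common = 5e-2
p_fail_given_normal = 1e-3                            # roughly the nominal value otherwise

p_joint_dependent = (p_common * p_fail_given_common ** 2
                     + (1 - p_common) * p_fail_given_normal ** 2)

print(p_joint_independent)    # 1.0e-06
print(p_joint_dependent)      # ~2.6e-05, roughly 26 times higher
```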

The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38)  In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46) 

Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51)  Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52)  This is reminiscent of  Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.

When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56)  In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64)  Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60) 

Part II

Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes).  For our purposes we don't need to spend much space on this material.  “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”***   It attempts to achieve the goals listed at the end of Part I.

STAMP treats safety in a system as a control problem, not a reliability one.  Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76)  Controls may be physical or social, including cultural.  There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87)  “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)
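The flavor of “an adaptive feedback function that fails to maintain safety” can be shown with a toy control loop (ours, not Leveson’s model; the margin, pressure and lag values are arbitrary).  Production pressure erodes the operating margin each step; the safety controller reacts to constraint violations, but only after a reporting lag:

```python
def simulate(steps: int = 24, margin: float = 10.0, pressure: float = 1.0,
             lag: int = 1, constraint: float = 4.0, correction: float = 3.0):
    """Toy loop: pressure erodes margin; corrections arrive 'lag' steps after a violation is seen."""
    history, pending = [], []
    for t in range(steps):
        margin -= pressure                   # adaptation toward higher risk
        if margin < constraint:
            pending.append(t + lag)          # violation observed, correction scheduled
        if pending and pending[0] <= t:
            pending.pop(0)
            margin += correction             # corrective action restores some margin
        history.append(round(margin, 1))
    return history

print(simulate(lag=1))   # prompt feedback: margin oscillates near the constraint
print(simulate(lag=6))   # lagged feedback: margin keeps eroding before corrections land
```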

Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system.  There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.

Part III

Part III describes in detail how STAMP can be applied.  There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section.  Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of  STAMP over traditional accident analysis techniques. 

We will focus on Chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.

Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421)  Two new components need a bit of explanation: a safety control structure and controls on system migration toward higher risk.  The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40)  The control structure is also responsible for controlling migration: “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26)  Such an approach should be based on the organization's comprehensive hazards analysis.****

The safety culture discussion is also familiar. (pp. 426-33)  Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust.  In such a culture, Leveson says an early warning system for migration toward states of high risk can be established.  A section on Just Culture is taken directly from Dekker's work.  The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.

Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.

Our Perspective
 

Overall, we like this book.  It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial.  The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog.  The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.
 

However, there are only a few direct references to nuclear.  The examples in the book are mostly from aerospace, aviation, maritime activities and the military.  Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory),  shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs.  Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428)  She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7)  In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective.  TMI is used to illustrate specific system design/operation problems.
 

We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists.  First, she accuses HRO of confusing reliability with safety, in other words, an unsafe system can function very reliably. (pp. 7, 12)  But I'm not aware of any HRO work that has been done in an organization that is patently unsafe.  HRO asserts that reliability follows from practices that recognize and contain emerging problems.  She takes another swipe at HRO when she says HRO suggests that, during crises, decision making migrates to frontline workers.  Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44)  Her position may be correct in some cases but as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could.  Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.”  In her view, “Telling managers and operators to be “mindful of weak signals” simply creates a pretext for blame after a loss event occurs.” (p. 410)  I don't think it's pretext but it is challenging to maintain mindfulness and sense faint signals.  Overall, this appears to be academic posturing and feather fluffing.
 

We offer no opinion on the efficacy of using Leveson's STAMP approach.  She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)
 

In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii)  You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.


*  N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (Cambridge, MA: The MIT Press, 2011).  The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed.  All quotes in this post were retyped from the original text.


**  We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier.  These authors are all aware of others' publications and contributions.  Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text. 

***  Nancy Leveson informal bio page.


****  “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157)  The hazards analysis identifies all major hazards the system may confront.  Baseline safety requirements follow from the hazards analysis.  Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk.  The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)