
Thursday, October 20, 2016

Korean Perspective on Nuclear Safety Culture

Republic of Korea flag
We recently read two journal articles that present the Korean perspective on nuclear safety culture (NSC), one from a nuclear research institute and the other from the Korean nuclear regulator.  Selected highlights from each article are presented below, followed by our perspective on the articles’ value.

Warning:  Although the articles are in English, they were obviously translated from Korean, probably by a computer, and the translation is uneven.  However, the topics and references (including IAEA, NRC, J. Reason and Schein) will be familiar to you, so with a little effort you can usually figure out what the authors are saying.

Korean NSC Situation and Issues*

The author is with the Korea Atomic Energy Research Institute.  He begins by describing a challenge facing the nuclear industry: avoiding complacency (because plant performance has been good) when the actual diffusion of NSC attributes among management and workers is unknown and major incidents, e.g., Fukushima, point to deficient NSC as a major contributor.  One consequence of this situation is that increased regulatory intervention in licensee NSC is a clear trend. (pp. 249, 254)

However, different countries have differing positions on how to intervene in or support NSC because (1) the objectification of an essentially qualitative factor is necessarily limited and (2) they fear diluting the licensee’s NSC responsibilities and/or causing unintended consequences. 

The U.S. NRC’s NSC history is summarized, including how NSC is addressed in the Reactor Oversight Process and relevant supplemental inspection procedures.  The author’s perception is “If safety culture vulnerability is judged to seriously affect the safety of a nuclear power plant, NRC orders the suspension of its operation, based on the judgment.” (p. 254)  In addition, the NRC has “developed and has been applying a licensee safety culture oversight program, based on site-stationed inspector's observation and assessment . . .” (ibid.)

The perception that the NRC would shut down a plant over NSC issues is a bit of a stretch.  While the agency is happy to pile on over NSC shortcomings when a plant has technical problems (see our June 16, 2016 post on ANO), it has also wrapped itself in knots to rationalize the acceptability of plant NSC in other cases (see our Jan. 30, 2013 post on Palisades).

There is a passable discussion of the methods available for assessing NSC, ranging from observing top management leadership behavior to taking advantage of “Big data” approaches.  However, the author cautions against reliance on numeric indicators; they can have undesirable consequences.  He observes that Europe has a minimal number of NSC regulations while the U.S. has none.  He closes with recommendations for the Korean nuclear industry.

Regulatory Oversight of NSC**

The authors are with the Korea Institute of Nuclear Safety, the nuclear regulatory agency.  The article covers their philosophy and methods for regulating NSC.  It begins with a list of challenges associated with NSC regulatory oversight and a brief review of international efforts to date.  Regulatory approaches include monitoring onsite vulnerabilities (U.S.), performing standard reviews of licensee NSC evaluations (Canada, Korea) and using NSC indicators (Germany, Finland) although the authors note such indicators do not directly measure NSC. (pp. 267-68)

In the Korean view, the regulator should perform independent oversight but not directly intervene in licensee activities.  NSC assessment is separate and different from compliance-based inspection, requires effective two-way communications (i.e., a common language) and aims at creating long-term continuous improvement. (pp. 266-67)  Their NSC model uses a value-neutral definition of NSC (as opposed to strong vs. weak); incorporates Schein’s three levels; includes individuals, the organization and leaders; and emphasizes the characteristics shared by organization members.  It includes elements from IAEA GSR Part 2, the NRC, J. Reason's reporting culture, DOE, INPO, just culture and Korea-specific concerns about economics trumping safety. (pp. 268-69)***

In the detailed description of the model, we were pleased to see “Incentives, sanctions, and rewards correspond to safety competency of individuals.”  (p. 270)  An organization’s reward system has always been a hot-button issue for us; all nuclear organizations claim to value NSC, but few are willing to pay for achieving or maintaining it.  Click the “Compensation” label to see all our posts on this topic.

The article presents a summary of an exercise to validate the model, i.e., link model components to actual plant safety performance.  The usual high-level mumbo-jumbo is not helped by the rough spots in the translation.  Inspection results, outage rates, scrams, incidents, unplanned shutdowns and radiation doses were claimed to be appropriately correlated with NSC model components.

There should be no surprise that the model was validated.  Getting a “right” answer is obviously good for the regulator.  We routinely express some skepticism over studies that validate models when we can’t see the actual data and we don’t know if the analysis was independently reviewed by anyone who actually understands or cares about the subject matter.

During the pilot study, several improvement areas in Korean NPPs' safety culture were identified.  The approach has not been permanently adopted.

Our Perspective

These articles are worth reading just to get a different, i.e., non-U.S., perspective on regulatory evaluation of (and possible intervention in) licensee NSC.  It’s also worthwhile to see how non-U.S. observers interpret what is going on in the U.S. nuclear regulatory space.  Their information sources probably include a June 2015 NRC presentation to Korean regulators referenced in our Aug. 24, 2015 post.

It’s interesting that Europe has some regulations that focus on ongoing communications with the licensees.  In contrast, the U.S. has no regulations but an approach that can stretch like a cheap blanket to cover all possible licensee situations.

Afterword

We haven’t posted for a while.  It’s not because we’ve lost interest; there simply hasn’t been much worth reporting.  The big nuclear news in the U.S. is not about NSC; rather, it’s about plants being scheduled for shutdown because of their economics.  International information sources have not been offering up much either.  For example, the LinkedIn NSC forum has pretty much dried up except for recycled observations and consultants’ self-serving white papers.


*  Y-H Lee, “Current Status and Issues of Nuclear Safety Culture,” Journal of the Ergonomics Society of Korea vol. 35 no. 4 (Aug 2016) 247-261.

**  YS Choi, SJ Jung and YH Chung, “Regulatory Oversight of Nuclear Safety Culture and the Validation Study on the Oversight Model Components,” Journal of the Ergonomics Society of Korea vol. 35 no. 4 (Aug 2016) 263-275.

***  Korea has had problems, mentioned in both articles, caused by deficient NSC.  Also see our Aug. 7, 2013 post for related information.

Monday, November 11, 2013

Engineering a Safer World: Systems Thinking Applied to Safety by Nancy Leveson

In this book* Leveson, an MIT professor, describes a comprehensive approach for designing and operating “safe” organizations based on systems theory.  The book presents criticisms of traditional incident analysis methods, the principles of system dynamics, and essential safety-related organizational characteristics, including the role of culture, in one place; this review emphasizes those topics.  It should be noted that the bulk of the book describes her accident causality model and how to apply it, including extensive case studies; this review does not fully address that material.

Part I
     
Part I sets the stage for a new safety paradigm.  Many contemporary socio-technical systems exhibit, among other characteristics, rapidly changing technology, increasing complexity and coupling, and pressures that put production ahead of safety. (pp. 3-6)   Traditional accident analysis techniques are no longer sufficient.  They too often focus on eliminating failures, esp. component failures or “human error,” instead of concentrating on eliminating hazards. (p. 10)  Some of Leveson's critique of traditional accident analysis echoes Dekker (esp. the shortcomings of Newtonian-Cartesian analysis, reviewed here).**   We devote space to Leveson's criticisms because she provides a legitimate perspective on techniques that comprise some of the nuclear industry's sacred cows.

Event-based models are simply inadequate.  There is subjectivity in selecting both the initiating event (the failure) and the causal chains backwards from it.  The root cause analysis often stops at the first root cause that is familiar, amenable to corrective action, difficult to get beyond (usually the human operator or other human role) or politically acceptable. (pp. 20-24)  Reason's Swiss cheese model is insufficient because of its assumption of direct, linear relationships between components. (pp. 17-19)  In addition, “event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry.” (p. 28)

Probabilistic Risk Assessment (PRA) studies specified failure modes in ever greater detail but ignores systemic factors.  “Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination.  When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability—assuming independence—and come out with impossibly small numbers, when, in fact, the events are dependent.  This dependence may be related to common systemic factors that do not appear in an event chain.  Machol calls this phenomenon the Titanic coincidence . . . The most dangerous result of using PRA arises from considering only immediate physical failures.” (pp. 34-35)  “. . . current [PRA] methods . . . are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management or organizational factors, such as flaws in the safety culture, . . .” (p. 36) 
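To make the quantitative point concrete, here is a minimal sketch (our illustration, not Leveson's or Machol's; all probabilities are hypothetical) of how assuming independence can understate joint risk when a common systemic factor links two individually rare failures.

```python
# Hypothetical illustration of the independence assumption in PRA-style arithmetic.
p_a = 1.0e-3          # P(A): probability that failure A occurs
p_b = 1.0e-3          # P(B): probability that failure B occurs

# Classic shortcut: treat the failures as independent.
p_joint_independent = p_a * p_b          # 1e-6, an "impossibly small" number

# Suppose a common systemic factor (e.g., schedule pressure degrading both
# maintenance and oversight) makes B far more likely once A has occurred.
p_b_given_a = 0.5                        # hypothetical conditional probability
p_joint_dependent = p_a * p_b_given_a    # 5e-4

print(f"Assuming independence:        {p_joint_independent:.1e}")
print(f"With common-cause dependence: {p_joint_dependent:.1e}")
print(f"Understatement factor:        {p_joint_dependent / p_joint_independent:.0f}x")
```

The particular numbers don't matter; the point is that the dependence term, which lives in the systemic factors an event chain never shows, can dominate the result.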

The search for operator error (a fall guy who takes the heat off of system designers and managers) and hindsight bias also contribute to the inadequacy of current accident analysis approaches. (p. 38)  In contrast to looking for an individual's “bad” decision, Leveson says “the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control.” (p. 46) 

Leveson says “Systems are not static. . . . they tend to involve a migration to a state of increasing risk over time.” (p. 51)  Causes include adaptation in response to pressures and the effects of multiple independent decisions. (p. 52)  This is reminiscent of  Hollnagel's warning that cost pressure will eventually push production to the edge of the safety boundary.

When accidents or incidents occur, Leveson proposes that analysis should search for reasons (the Whys) rather than blame (usually defined as Who) and be based on systems theory. (pp. 55-56)  In a systems view, safety is an emergent property, i.e., system safety performance cannot be predicted by analyzing system components. (p. 64)  Some of the goals for a better model include analysis that goes beyond component failures and human errors, is more scientific and less subjective, includes the possibility of system design errors and dysfunctional system interactions, addresses software, focuses on mechanisms and factors that shape human behavior, examines processes and allows for multiple viewpoints in the incident analysis. (pp. 58-60) 

Part II

Part II describes Leveson's proposed accident causality model based on systems theory: STAMP (Systems-Theoretic Accident Model and Processes).  For our purposes we don't need to spend much space on this material.  “The model includes software, organizations, management, human decision-making, and migration of systems over time to states of heightened risk.”***   It attempts to achieve the goals listed at the end of Part I.

STAMP treats safety in a system as a control problem, not a reliability one.  Specifically, the overarching goal “is to control the behavior of the system by enforcing the safety constraints in its design and operation.” (p. 76)  Controls may be physical or social, including cultural.  There is a good discussion of the hierarchy of control in a complex system and the impact of possible system dynamics, e.g., time lags, feedback loops and changes in control structures. (pp. 80-87)  “The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.” (p. 90)

Leveson describes problems that can arise from an inaccurate mental model of a system or an inaccurate model displayed by a system.  There is a lengthy, detailed case study that uses STAMP to analyze a tragic incident, in this case a friendly fire accident where a U.S. Army helicopter was shot down by an Air Force plane over Iraq in 1994.

Part III

Part III describes in detail how STAMP can be applied.  There are many useful observations (e.g., problems with mode confusion on pp. 289-94) and detailed examples throughout this section.  Chapter 11 on using a STAMP-based accident analysis illustrates the claimed advantages of  STAMP over traditional accident analysis techniques. 

We will focus on chapter 13, “Managing Safety and the Safety Culture,” which covers the multiple dimensions of safety management, including safety culture.

Leveson's list of the components of effective safety management is mostly familiar: management commitment and leadership, safety policy, communication, strong safety culture, safety information system, continual learning, education and training. (p. 421)  Two new components need a bit of explanation: a safety control structure and controls on system migration toward higher risk.  The safety control structure assigns specific safety-related responsibilities to management, system designers and operators. (pp. 436-40)  Among the control structure's responsibilities, “the potential reasons for and types of migration toward higher risk need to be identified and controls instituted to prevent it.” (pp. 425-26)  Such an approach should be based on the organization's comprehensive hazards analysis.****

The safety culture discussion is also familiar. (pp. 426-33)  Leveson refers to the Schein model, discusses management's responsibility for establishing the values to be used in decision making, the need for open, non-judgmental communications, the freedom to raise safety questions without fear of reprisal and widespread trust.  In such a culture, Leveson says an early warning system for migration toward states of high risk can be established.  A section on Just Culture is taken directly from Dekker's work.  The risk of complacency, caused by inaccurate risk perception after a long history of success, is highlighted.

Although these management and safety culture contents are generally familiar, what's new is relating them to systems concepts such as control loops and feedback and taking a systems view of the safety control system.

Our Perspective
 

Overall, we like this book.  It is Leveson's magnum opus, 500+ pages of theory, rationale, explanation, examples and infomercial.  The emphasis on the need for a systems perspective and a search for Why accidents/incidents occur (as opposed to What happened or Who is at fault) is consistent with what we've been saying on this blog.  The book explains and supports many of the beliefs we have been promoting on Safetymatters: the shortcomings of traditional (but commonly used) methods of incident investigation; the central role of decision making; and how management commitment, financial and non-financial rewards, and a strong safety culture contribute to system safety performance.
 

However, there are only a few direct references to nuclear.  The examples in the book are mostly from aerospace, aviation, maritime activities and the military.  Establishing a safety control structure is probably easier to accomplish in a new aerospace project than in an existing nuclear organization with a long history (aka memory),  shifting external pressures, and deliberate incremental changes to hardware, software, policies, procedures and programs.  Leveson does mention John Carroll's (her MIT colleague) work at Millstone. (p. 428)  She praises nuclear LER reporting as a mechanism for sharing and learning across the industry. (pp. 406-7)  In our view, LERs should be helpful but they are short on looking at why incidents occur, i.e., most LER analysis does not look at incidents from a systems perspective.  TMI is used to illustrate specific system design/operation problems.
 

We don't agree with the pot shots Leveson takes at High Reliability Organization (HRO) theorists.  First, she accuses HRO of confusing reliability with safety; in other words, an unsafe system can function very reliably. (pp. 7, 12)  But I'm not aware of any HRO work that has been done in an organization that is patently unsafe.  HRO asserts that reliability follows from practices that recognize and contain emerging problems.  She takes another swipe at HRO when she says it suggests that, during crises, decision making migrates to frontline workers.  Leveson's problem with that is “the assumption that frontline workers will have the necessary knowledge and judgment to make decisions is not necessarily true.” (p. 44)  Her position may be correct in some cases but, as we saw in our review of CAISO, when the system was veering off into new territory, no one had the necessary knowledge and it was up to the operators to cope as best they could.  Finally, she criticizes HRO advice for operators to be on the lookout for “weak signals.”  In her view, “Telling managers and operators to be ‘mindful of weak signals’ simply creates a pretext for blame after a loss event occurs.” (p. 410)  I don't think it's pretext, but it is challenging to maintain mindfulness and sense faint signals.  Overall, this appears to be academic posturing and feather fluffing.
 

We offer no opinion on the efficacy of using Leveson's STAMP approach.  She is quick to point out a very real problem in getting organizations to use STAMP: its lack of focus on finding someone/something to blame means it does not help identify subjects for discipline, lawsuits or criminal charges. (p. 86)
 

In Leveson's words, “The book is written for the sophisticated practitioner . . .” (p. xviii)  You don't need to run out and buy this book unless you have a deep interest in accident/incident analysis and/or are willing to invest the time required to determine exactly how STAMP might be applied in your organization.


*  N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (The MIT Press, Cambridge, MA: 2011)  The link goes to a page where a free pdf version of the book can be downloaded; the pdf cannot be copied or printed.  All quotes in this post were retyped from the original text.


**  We're not saying Dekker or Hollnagel developed their analytic viewpoints ahead of Leveson; we simply reviewed their work earlier.  These authors are all aware of others' publications and contributions.  Leveson includes Dekker in her Acknowledgments and draws from Just Culture: Balancing Safety and Accountability in her text. 

***  Nancy Leveson informal bio page.


****  “A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident.” (p. 157)  The hazards analysis identifies all major hazards the system may confront.  Baseline safety requirements follow from the hazards analysis.  Responsibilities are assigned to the safety control structure for ensuring baseline requirements are not violated while allowing changes that do not raise risk.  The identification of system safety constraints allows the possibility of identifying leading indicators for a specific system. (pp. 337-38)

Friday, October 18, 2013

When Apples Decay

In our experience, education is perceived as a continual process, accumulating knowledge progressively over time.  A shiny apple exemplifies the learning student or an inspiring insight (see Newton, Sir Isaac).  Less consideration is given to the fact that the educational process can work in reverse, leading to a loss of capability over time.  In other words, the apple decays.  As Martin Weller notes on his blog The Ed Techie, “education is about selling apples...we need to recognise and facilitate learning that takes ten minutes or involves extended participation in a community over a number of years.”*

This leads us to a recent Wall Street Journal piece, “Americans Need a Simple Retirement System”.**  The article is about the failure of educational efforts to improve financial literacy.  We admit that this is a bit out of context for nuclear safety culture; nevertheless it provides a useful perspective that seems to be overlooked within the nuclear industry.  The article notes:

“The problem is that, like all educational efforts, financial education decays over time and has negligible effects on behavior after 20 months. The authors suggest that, given this decay, “just in time” financial education...might be a more effective way to proceed.”

We tend to view the safety culture training provided at nuclear plants as being of the 10-minute variety, selling apples that may vary in size and color but are just apples.   Additional training is routinely prescribed in response to findings of inadequate safety culture.  Yet we cannot recall a single reported instance where safety culture issues were associated with inadequate or ineffective training in the first place.  Nor do we see explicit recognition that such training efforts have very limited half-lives, creating cycles of future problems.  We have blogged about the decay of training-based reinforcement (see our March 22, 2010 post) and the contribution of decay and training saturation to complacency (see our Dec. 8, 2011 post).

The fact that safety culture knowledge and “strength” decay over time is just one example of the dynamics associated with safety management.  Arguably, an effective learning process itself is a (the?) key to building and maintaining strong safety culture, and such a process is consistently missing in current nuclear industry programs that emphasize indoctrination in traits and values.  It’s time for better and more innovative approaches - not just more apples.



*  M. Weller, "The long-awaited 'education as fruit' metaphor," The Ed Techie blog (Sept. 10, 2009).  Retrieved Oct. 18, 2013.

**  A.H. Munnell, "Americans need a simple retirement system," MarketWatch blog (Oct. 16, 2013).  Retrieved Oct. 18, 2013.

Monday, October 14, 2013

High Reliability Management by Roe and Schulman

This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market.  CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world.  In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.

The High Reliability Management (HRM) Model

The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3)  System Volatility refers to the magnitude and rate of change of  CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall).  Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs. 

System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each.  All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid.  Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes.  Operating conditions are challenging: supply and demand are inherently unstable (p. 34), inadequate supply means some load cannot be served and too much generation can damage the grid. (pp. 27, 142)

High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance.  Some strategies can be substituted for others.  It is a dynamic but manageable environment.

High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance.  They run from pillar to post; it is highly stressful.  Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trial and error.  Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills.  It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)

Low Volatility and Low Options mean generation and demand are not changing quickly.  The critical feature here is demand has been reduced by load shedding.  The operators have exhausted all other strategies for maintaining balance.  It is a command-and-control approach, effected by declaring a  Stage 3 grid situation and run using formal rules and procedures.  It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished. 

Low Volatility and High Options is an HRM's preferred mode.  Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and there is no congestion on transmission lines or backup routes are available.  Procedures based on analyzed conditions exist and are used.  There are few, if any, surprises.  Learning can occur but it is incremental, the result of new methods or analysis.  Performance is important and system behavior operates within a narrow bandwidth.  Loss of attention (complacency) is a risk.  Is this starting to sound familiar?  This is the domain of High Reliability Organization (HRO) theory and practice.  Nuclear power operations is an example of an HRO. (pp. 60-62)          
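As a reader aid (our own summary, not the authors'), the four modes can be compressed into a small sketch; the descriptions paraphrase the passages above.

```python
# Toy summary of Roe and Schulman's 2x2 HRM matrix, as we read it.
def hrm_mode(system_volatility_high: bool, options_variety_high: bool) -> str:
    """Return the operating mode for a given combination of conditions."""
    if not system_volatility_high and options_variety_high:
        return ("Preferred mode: analyzed conditions, procedures apply, "
                "incremental learning; complacency is the main risk (HRO territory)")
    if system_volatility_high and options_variety_high:
        return ("Dynamic but manageable: multiple substitutable strategies "
                "for keeping supply and demand in balance")
    if system_volatility_high and not options_variety_high:
        return ("Most unstable: ad hoc, trial-and-error workarounds, operating "
                "near the edge of technical capabilities and operator skills")
    return ("Command-and-control: load shedding under formal rules; the mission "
            "of meeting all demand is not being accomplished")

# The nuclear operating point the authors associate with HROs:
print(hrm_mode(system_volatility_high=False, options_variety_high=True))
```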

Lessons for Nuclear Operations 


Nuclear plants work hard to stay in the Low Volatility/High Options mode.  If they stray into the Low Options column, they run the risk of facing unanalyzed situations and regulatory non-compliance. (p. 62)  In their effort to optimize performance in the desired mode, plants examine their performance risks at ever finer granularity through new methods and analyses.  Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.**  Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must be mindful of performance drift and nascent problems.

In an HRO, trial and error is not an acceptable method for trying out new options.  No one wants cowboy operators in the control room.  But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233)  In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)  

The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear.  The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system.  In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance.  In a similar vein, CAISO has to worry about complacency in just one mode; for nuclear it's effectively the only mode and complacency is an omnipresent threat. (p. 126)

The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is that overload might occur suddenly and with little or no warning.***  Anticipation and resilience are more obvious needs at CAISO but also necessary in nuclear operations. (pp. 5, 124)

Implications for Safety Culture

Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2)  Both organizational types need a focus on operations as the central activity.  Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.

The willingness to report errors should be encouraged, but we have seen that this is a thin spot in the SC at some plants.  Errors can be a gateway into learning how to create more reliable performance, and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220)

The simultaneous needs to operate within a prescribed envelope while considering how the envelope might be breached has implications for SC.  We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to group think and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?” 

Conclusion

This review gives short shrift to the authors' broad and deep description and analysis of CAISO.****  The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.

The book describes a rare breed, the socio-technical-political start-up, and covers too much ground for the average nuclear practitioner to plow through in search of newfound nuggets that can be applied to nuclear management.  But it's a good read and full of insightful observations, e.g., the description of CAISO's early days (ca. 2001-2004) when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve, with operators reporting less confidence in their ability to manage the grid and accomplish the mission in 2004 vs. 2001. (Ch. 5)

(High Reliability Management was recommended by a Safetymatters reader.  If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)

*  E. Roe and P. Schulman, High Reliability Management (Stanford Univ. Press, Stanford, CA: 2008)  This book reports the authors' study of CAISO from 2001 through 2006. 

**  By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment.  Their political and social environment is similar: The NRC buffers them from direct interference by politicians although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.

The importance of considering low-probability, major consequence events is argued by Taleb (see here) and Dédale (see here).

***  Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits. 

The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone.  For more on Cynefin, see here and here.

****  For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO.  As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)

Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits.  This analysis illustrated how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as precursor to the need to go to Stage 3 command-and-control load shedding.

Wednesday, August 7, 2013

Nuclear Industry Scandal in South Korea

As you know, over the past year trouble has been brewing in the South Korean nuclear industry.  A recent New York Times article* provides a good current status report.  The most visible problem is the falsification of test documents for nuclear plant parts.  Executives have been fired, employees of both a testing company and the state-owned entity that inspects parts and validates their safety certificates have been indicted.

It should be no surprise that the underlying causes are rooted in the industry structure and culture.  South Korea has only one nuclear utility, state-owned Korea Electric Power Corporation (Kepco).  Kepco retirees go to work for parts suppliers or invest in them.  Cultural attributes include valuing personal ties over regulations, and school and hometown connections.  Bribery is used as a lubricating agent.

As a consequence,  “In the past 30 years, our nuclear energy industry has become an increasingly closed community that emphasized its specialty in dealing with nuclear materials and yet allowed little oversight and intervention,” the government’s Ministry of Trade, Industry and Energy said in a recent report to lawmakers. “It spawned a litany of corruption, an opaque system and a business practice replete with complacency.”

Couldn't happen here, right?  I hope not, but the U.S. nuclear industry, while not as closed a system as its Korean counterpart, is hardly an open community.  The “unique and special” mantra promotes insular thinking and encourages insiders to view outsiders with suspicion.  The secret practices of the industry's self-regulator do not inspire public confidence.  A familiar cast of NEI/INPO participants at NRC stakeholder meetings fuels concern over the degree to which the NRC has been captured by industry.  Utility business decisions that ultimately killed plants (CR3, Kewaunee, San Onofre) appear to have been made in conference rooms isolated from any informed awareness of worst-case technical/commercial consequences.  Our industry has many positive attributes, but some of its other traits should make us stop and reflect.

*  C. Sang-Hun, “Scandal in South Korea Over Nuclear Revelations,” New York Times (Aug. 3, 2013).  Retrieved Aug. 6, 2013.

Friday, June 29, 2012

Modeling Safety Culture (Part 2): Safety Culture as Pressure Boundary

No, this is not an attempt to incorporate safety culture into the ASME code.  As introduced in Part 1 we want to offer a relatively simple construct for safety culture - hoping to provide a useful starting point for a model of safety culture and a bridge between safety culture as amorphous values and beliefs, and safety culture that helps achieve desired balances in outcomes.

We propose that safety culture be considered “the willingness and ability of an organization to resist undue pressure on safety from competing business priorities”.  Clearly this is a 30,000-foot view of safety culture and does not try to address the myriad ways in which it materializes within the organization.  This is intentional since there are so many possible moving parts at the individual level that it is too easy to lose sight of the macro forces.

The following diagram conceptualizes the boundary between safety priorities (i.e., safety culture) and other organizational priorities (business pressure).  The plotted line is essentially a threshold where the pressure for maintaining safety priorities (created by culture) may start to yield to increasing amounts of pressure to address other business priorities.  In the region to the left of the plot line, safety and business priorities exist in an equilibrium.  To the right of the line, business pressure exceeds that of the safety culture and can lead to compromises.  Note that this construct supports the view that strong safety performance is consistent with strong overall performance.  Strong overall performance, in areas such as production, cost and schedule, ensures that business pressures are relatively low and in equilibrium with reasonably strong safety culture.  (A larger figure with additional explanatory notes is available here.)



The arc of the plot line suggests that the safety/business threshold increases (requires greater business pressure) as safety culture becomes stronger.  It also illustrates that safety priorities may be maintained even at lower safety culture strengths when there is little competing business pressure.  This aspect seems particularly consistent with determinations at certain plants that safety culture is “adequate” but still requires strengthening.  It also provides an appealing explanation for how complacency can over time erode a relatively strong safety culture.  If overall performance is good, resulting in minimal business pressures, the culture might not be “challenged” or noticed even as it degrades.

Another perspective on safety culture as pressure boundary is what happens when business pressure elevates to a point where the threshold is crossed.  One reason that organizations with strong culture may be able to resist more pressure is a greater ability to manage business challenges that arise and/or a willingness to adjust business goals before they become overwhelming.  And even at the threshold such organizations may be better able to identify compensatory actions that have only minimal and short-term safety impacts.  For organizations with weaker safety culture, the threshold may lead to more immediate and direct tradeoffs of safety priorities.  In addition, the feedback effects of safety compromises (e.g., larger backlogs of unresolved problems) can compound business performance deficiencies and further increase business pressure.  One possible insight from the pressure model is that in some cases, perceived safety culture issues may be more a situation of reasonably strong safety culture being overmatched by excessive business pressures.  The solution may be more about relieving business pressures than exclusively trying to reinforce culture.
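The dynamics described above can be sketched in a toy simulation (our illustration only; all parameter values are invented and the construct is not a validated model): safety culture strength sets the threshold that business pressure must exceed before compromises begin, and unresolved compromises feed back into still more pressure.

```python
# Toy simulation of the "safety culture as pressure boundary" construct.
# All numbers are hypothetical and chosen only for illustration.

def threshold(culture_strength: float) -> float:
    """Business pressure the organization can absorb before safety yields.
    Rises faster than linearly as culture strengthens (the arc of the plot line)."""
    return 0.5 * culture_strength ** 1.5

culture = 2.0        # notional safety culture strength
pressure = 0.8       # notional business pressure (production, cost, schedule)
backlog = 0.0        # unresolved problems that feed back into pressure

for month in range(24):
    pressure += 0.05                          # business pressure slowly ratchets up
    if pressure > threshold(culture):         # boundary crossed: compromises occur
        backlog += 0.1                        # e.g., growing unresolved problem backlog
        pressure += 0.05 * backlog            # feedback: backlog compounds pressure
        culture = max(0.0, culture - 0.02)    # repeated compromises erode culture
    print(f"month {month:2d}: pressure={pressure:.2f}, "
          f"threshold={threshold(culture):.2f}, backlog={backlog:.2f}")
```

In a run of this sketch the system sits in equilibrium for many months; once pressure crosses the threshold, the backlog feedback accelerates the decline, which is the qualitative behavior the construct is meant to convey.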

In Part 3 we hope to further develop this approach through some simple simulations that illustrate the interaction of managing resources and balancing pressures.  In the meantime we would like to hear reactions from readers to this concept.

Monday, February 13, 2012

Is Safety Culture An Inherently Stable System?

The short answer:  No.

“Stable” means that an organization’s safety culture effectiveness remains at about the same level* over time.  However, if a safety culture effectiveness meter existed and we attached it to an organization, we would see that, over time, the effectiveness level rises and falls, possibly even dropping to an unacceptable level.  Level changes occur because of shocks to the system and internal system dynamics.

Shocks

Sudden changes or challenges to safety culture stability can originate from external (exogenous) or internal (endogenous) sources.

Exogenous shocks include significant changes in regulatory requirements, such as occurred after TMI or the Browns Ferry fire, or “it’s not supposed to happen” events that do, in fact, occur, such as a large earthquake in Virginia or a devastating tsunami in Japan that give operators pause, even before any regulatory response.

Organizations have to react to such external events and their reaction is aimed at increasing plant safety.  However, while the organization’s focus is on its response to the external event, it may take its eye off the ball with respect to its pre-existing and ongoing responsibilities.  It is conceivable that the reaction to significant external events may distract the organization and actually lower overall safety culture effectiveness.

Endogenous shocks include the near-misses that occur at an organization’s own plant.  While it is unfortunate that such events occur, it is probably good for safety culture, at least for awhile.  Who hasn’t paid greater attention to their driving after almost crashing into another vehicle?

The insertion of new management, e.g., after a plant has experienced a series of performance or regulatory problems, is another type of internal shock.  This can also raise the level of safety culture—IF the new management exercises competent leadership and makes progress on solving the real problems. 

Internal Dynamics    

Absent any other influence, safety culture will not remain at a given level because of an irreducible tendency to decay.  Decay occurs because of rising complacency, over-confidence, goal conflicts, shifting priorities and management incentives.  Cultural corrosion, in the form of normalization of deviance, is always pressing against the door, waiting for the slightest crack to appear.  We have previously discussed these challenges here.

An organization may assert that its safety culture is a stability-seeking system, one that detects problems, corrects them and returns to the desired level.  However, performance with respect to the goal may not be knowable with accuracy because of measurement issues.  There is no safety culture effectiveness meter; surveys only provide snapshots of instant safety climate and even a lengthy interview-based investigation may not lead to repeatable results, i.e., a different team of evaluators might (or might not) reach different conclusions.  That’s why creeping decay is difficult to perceive.
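In system dynamics terms (see the note on “level” below), the argument can be sketched as a simple stock-and-flow model: a level that decays continuously, gets occasional boosts from shocks or interventions, and is observed only through a noisy, infrequent survey “meter.”  This is our illustration with invented parameters, not a calibrated model.

```python
import random

# Toy sketch: safety culture effectiveness as a decaying level that gets boosts
# from periodic interventions and is observed only through noisy surveys.
random.seed(1)
level = 0.9                 # notional effectiveness, 0..1
DECAY_RATE = 0.02           # per-month erosion absent effective maintenance
BOOST = 0.15                # effect of an intervention (new leader, new program, ...)
SATURATION = 0.7            # each repeated intervention is less effective

boost_effect = BOOST
for month in range(1, 37):
    level *= (1.0 - DECAY_RATE)                  # irreducible tendency to decay
    if month % 12 == 0:                          # an intervention every year
        level = min(1.0, level + boost_effect)
        boost_effect *= SATURATION               # saturation: diminishing returns
    if month % 6 == 0:                           # a semi-annual survey "snapshot"
        observed = level + random.gauss(0.0, 0.05)   # measurement noise
        print(f"month {month:2d}: actual={level:.2f}, survey reads {observed:.2f}")
```

Even in this toy version, the noisy snapshots make it hard to distinguish creeping decay from measurement error, which is the practical problem described above.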

Conclusion

Many different forces can affect an organization’s safety culture effectiveness, some pushing it higher while others lower it.  Measurement problems make it difficult to know what the level is and the trend, if any.  The takeaway is there is no reason to assume that safety culture is a stable system whose effectiveness can be maintained at or above an acceptable level.


*  “Level” is a term borrowed from system dynamics, and refers to the quantity of a variable in a model.  We recognize that safety culture is an organizational property, not something stored in a tank, but we are using “level” to communicate the notion that safety culture effectiveness is something that can improve (go up) or degrade (go down).

Thursday, January 5, 2012

2011 End of Year Summary

We thought we would take this opportunity to do a little rummaging around in the Google analytics and report on some of the statistics for the safetymatters blog.

The first thing that caught our attention was the big increase in page views (see chart below) for the blog this past year.  We are now averaging more than 1000 per month and we appreciate every one of the readers who visits the blog.  We hope that the increased readership reflects that the content is interesting, thought provoking and perhaps even a bit provocative.  We are pretty sure people who are interested in nuclear safety culture cannot find comparable content elsewhere.

The following table lists the top ten blog posts.  The overwhelming favorite has been the "Normalization of Deviation" post from March 10, 2010.  We have consistently commented positively on this concept introduced by Diane Vaughan in her book The Challenger Launch Decision.  Most recently Red Conner noted in his December 8, 2011 post the potential role of normalization of deviation in contributing to complacency.  This may appear to be a bit of a departure from the general concept of complacency as primarily a passive occurrence.  Red notes that the gradual and sometimes hardly perceptible acceptance of lesser standards or non-conforming results may be more insidious than a failure to challenge the status quo.  We would appreciate hearing from readers on their views of “normalization”: whether they believe it is occurring in their organizations (and if so, how is it detected?) and what steps might be taken to minimize its effect.



A common denominator among a number of the popular posts is safety culture assessment, whether in the form of surveys, performance indicators, or other means to gauge the current state of an organization.  Our sense is there is a widespread appetite for approaches to measuring safety culture in some meaningful way; such interest perhaps also indicates that current methods, heavily dependent on surveys, are not meeting needs.  What is even more clear in our research is the lack of initiative by the industry and regulators to promote or fund research into this critical area.   

A final observation:  The Google stats on frequency of page views indicate two of the top three pages were the “Score Decision” pages for the two decision examples we put forward.  They each had 100 or more views.  Unfortunately only a small percentage of the page views translated into scoring inputs for the decisions.  We’re not sure why inputs were so scarce since scoring is anonymous and purely a matter of the reader’s judgment.  Having a larger data set from which to evaluate the decision scoring process would be very useful and we would encourage anyone who visited but did not score to reconsider.  And of course, anyone who hasn’t yet visited these examples, please do and see how you rate these actual decisions from operating nuclear plants.

Thursday, December 8, 2011

Nuclear Industry Complacency: Root Causes

NRC Chairman Jaczko, addressing the recent INPO CEO conference, warned about possible increasing complacency in the nuclear industry.*  To support his point, he noted the two plants in column four of the ROP Action Matrix and two plants in column three, the increased number of special inspections in the past year, and the three units in extended shutdowns.  The Chairman then moved on to discuss other industry issues. 

The speech spurred us to ask: Why does the risk of complacency increase over time?  Given our interest in analyzing organizational processes, it should come as no surprise that we believe complacency is more complicated than the lack of safety-related incidents leading to reduced attention to safety.

An increase in complacency means that an organization’s safety culture has somehow changed.  Causes of such change include shifts in the organization’s underlying assumptions and decay.

Underlying Assumptions

We know from the Schein model that underlying assumptions are the bedrock for culture.  One can take those underlying assumptions and construct an (incomplete) mental model of the organization—what it values, how it operates and how it makes decisions.  Over time, as the organization builds an apparently successful safety record, the mental weights that people assign to decision factors can undergo a subtle but persistent shift to favor the visible production and cost goals over the inherently invisible safety factor.  At the same time, opportunities exist for corrosive issues, e.g., normalization of deviance, to attach themselves to the underlying assumptions.  Normalization of deviance can manifest anywhere, from slipping maintenance standards to a greater tolerance for increasing work backlogs.

Decay

An organization’s safety culture will inevitably decay over time absent effective maintenance.  In part this is caused by the shift in underlying assumptions.  In addition, decay results from saturation effects.  Saturation occurs because beating people over the head with either the same thing, e.g., espoused values, or too many different things, e.g., one safety program or similar intervention after another, has lower and lower marginal effectiveness over time.  That’s one reason new leaders are brought in to “problem” plants: to boost the safety culture by using a new messenger with a different version of the message, reset the decision making factor weights and clear the backlogs.

None of this is new to regular readers of this blog.  But we wanted to gather our ideas about complacency in one post.  Complacency is not some free-floating “thing,” it is an organizational trait that emerges because of multiple dynamics operating below the level of clear visibility or measurement.  

     
*  G.B. Jaczko, Prepared Remarks at the Institute of Nuclear Power Operations CEO Conference, Atlanta, GA (Nov. 10, 2011), p. 2, ADAMS Accession Number ML11318A134.

Thursday, March 3, 2011

Safety Culture in the DOE Complex

This post reviews a Department of Energy (DOE) effort to provide safety culture assessment and improvement tools for its own operations and those of its contractors.

Introduction

The DOE is responsible for a vast array of organizations that work on DOE’s programs.  These organizations range from very small to huge in size and include private contractors, government facilities, specialty shops, niche manufacturers, labs and factories.  Many are engaged in high-hazard activities (including nuclear) so DOE is interested in promoting an effective safety culture across the complex.

To that end, a task team* was established in 2007 “to identify a consensus set of safety culture principles, along with implementation practices that could be used by DOE . . .  and their contractors. . . . The goal of this effort was to achieve an improved safety culture through ISMS [Integrated Safety Management System] continuous improvement, building on operating experience from similar industries, such as the domestic and international commercial nuclear and chemical industries.”  (Final Report**, p. 2)

It appears the team performed most of its research during 2008, conducted a pilot program in 2009 and published its final report in 2010.  Research included reviewing the space shuttle and Texas City disasters, the Davis-Besse incident, works by gurus such as James Reason, and guidance and practices published by NASA, NRC, IAEA, INPO and OSHA.

Major Results

The team developed a definition of safety culture and described a process whereby using organizations could assess their safety culture and, if necessary, take steps to improve it.

The team’s definition of safety culture:

“An organization’s values and behaviors modeled by its leaders and internalized by its members, which serve to make safe performance of work the overriding priority to protect the workers, public, and the environment.” (Final Report, p. 5)

After presenting this definition, the report goes on to say “The Team believes that voluntary, proactive pursuit of excellence is preferable to regulatory approaches to address safety culture because it is difficult to regulate values and behaviors. DOE is not currently considering regulation or requirements relative to safety culture.” (Final Report, pp. 5-6)

The team identified three focus areas that were judged to have the most impact on improving safety and production performance within the DOE complex: Leadership, Employee/Worker Engagement, and Organizational Learning. For each of these three focus areas, the team identified related attributes.

The overall process for a using organization is to review the focus areas and attributes, assess the current safety culture, select and use appropriate improvement tools, and reinforce results. 

The list of tools to assess safety culture includes direct observations, causal factors analysis (CFA), surveys, interviews, review of key processes, performance indicators, Voluntary Protection Program (VPP) assessments, stream analysis and Human Performance Improvement (HPI) assessments.***  The Final Report also mentioned performance metrics and workshops. (Final Report, p. 9)

Tools to improve safety culture include senior management commitment, clear expectations, ISMS training, managers spending time in the field, coaching and mentoring, Behavior Based Safety (BBS), VPP, Six Sigma, the problem identification process, and HPI.****  The Final Report also mentioned High Reliability Organization (HRO), Safety Conscious Work Environment (SCWE) and Differing Professional Opinion (DPO). (Final Report, p. 9)  Whew.

The results of a one-year pilot program at multiple contractors were evaluated and the lessons learned were incorporated in the final report.

Our Assessment

Given the diversity of the DOE complex, it’s obvious that no “one size fits all” approach is likely to be effective.  But it’s not clear that what the team has provided will be all that effective either.  The team’s product is really a collection of concepts and tools culled from the work of outsiders, combined with DOE’s existing management programs, and repackaged as a combination of overall process and laundry lists.  Users are left to determine for themselves exactly which sub-set of tools might be useful in their individual situations.

It’s not that the report is bad.  For example, the general discussion of safety culture improvement emphasizes the importance of creating a learning organization focused on continuous improvement.  In addition, a major point they got right was recognizing that safety can contribute to better mission performance.  “The strong correlation between good safety performance with good mission performance (or productivity or reliability) has been observed in many different contexts, including industrial, chemical, and nuclear operations.” (Final Report, p. 20)

On the other hand, the team has adopted the works of others but does not appear to recognize how, in a systems sense, safety culture is interwoven into the fabric of an organization.  For example, feedback loops from the multitude of possible interventions to overall safety culture are not even mentioned.  And this is not a trivial issue.  An intervention can provide an initial boost to safety culture but then safety culture may start to decay because of saturation effects, especially if the organization is hit with one intervention after another.

In addition, some of the major, omnipresent threats to safety culture do not get the emphasis they deserve.  Goal conflict, normalization of deviance and institutional complacency are included in a list of issues from the Columbia, Davis-Besse and Texas City events (Final Report, pp. 13-15) but the authors do not give them the overarching importance they merit.  Goal conflict, often expressed as safety vs. mission, should obviously be avoided but its insidiousness is not adequately recognized; the other two factors are treated in a similar manner.

Two final picky points:  First, the report says it’s difficult to regulate behavior.  That’s true but companies and government do it all the time.  DOE could definitely promulgate a behavior-based safety culture regulatory requirement if it chose to do so.  Second, the final report (p. 9) mentions leading (vs lagging) indicators as part of assessment but the guidelines do not provide any examples.  If someone has some useful leading indicators, we’d definitely like to know about them. 

Bottom line, the DOE effort draws from many sources and probably represents consensus building among stakeholders on an epic scale.  However, the team provides no new insights into safety culture and, in fact, may not be taking advantage of the state of the art in our understanding of how safety culture interacts with other organizational attributes. 


*  Energy Facility Contractors Group (EFCOG)/DOE Integrated Safety Management System (ISMS) Safety Culture Task Team.

**  J. McDonald, P. Worthington, N. Barker, G. Podonsky, “EFCOG/DOE ISMS Safety Culture Task Team Final Report”  (Jun 4, 2010).

***  EFCOG/DOE ISMS Safety Culture Task Team, “Assessing Safety Culture in DOE Facilities,” EFCOG meeting handout (Jan 23, 2009).

****  EFCOG/DOE ISMS Safety Culture Task Team, “Activities to Improve Safety Culture in DOE Facilities,” EFCOG meeting handout (Jan 23, 2009).

Friday, January 14, 2011

ACRS Weighs In on Safety Culture Policy

In mid-December the Advisory Committee on Reactor Safeguards (ACRS) provided the results of its review of the NRC’s proposed nuclear safety culture policy in a letter to NRC Chairman Jaczko.*  The letter reiterated the approach and general structure of the proposed policy and reached a favorable conclusion.  Perhaps the most interesting comment in the main body of the letter is the following:

“Well-intentioned attempts at improving safety and effectiveness have faltered through efforts to overly prescribe correct behavior and to apply rigid scoring systems. We urge that the staff encourage approaches that emphasize thinking and safety awareness over scorecards of metrics that can induce complacency and rote compliance. Issuance of a policy statement, rather than a regulation, is likely to be a more effective way to appropriately engage all the stakeholders.” (p. 4)

The statement is a bit cryptic and we can only guess what the ACRS has in mind when it refers to “scorecards of metrics” or efforts to “overly prescribe correct behavior.”  Are they referring to the ROP?  Is the ACRS concerned that reliance on the ROP metrics (and their almost uniformly green status) may be lulling the industry and the NRC into complacency?  Equally uncertain is why the ACRS believes that a policy statement will lead to more effective results. 

Apparently we are not the only ones to suffer uncertainty.  The ACRS letter includes “Additional Comments” (read: dissenting comments) by three members** who state:

“It is not entirely clear to us what is meant by implementing a policy statement that lacks the authority of regulation. It appears that implementation of the safety culture policy statement may be an indirect method of imposing requirements on licensees without the discipline of the regulatory process. This, of course, is not acceptable.” (p. 4)

Part of the confusion may lie in the intent and authority associated with NRC policy statements.  It appears that the dissenting members feel a policy statement would be a back-door method of imposing “requirements.”  Is that true?  We will follow up with a detailed look at policy statements and their effect.


*  Letter dated Dec 15, 2010 from S. Abdel-Khalik (ACRS) to G. Jaczko (NRC), subject "Safety Culture Policy Statement," ADAMS Accession Number ML103410358.

** D.A. Powers, J.S. Armijo and J.L. Rempe.

Wednesday, June 30, 2010

Can Safety Culture Be Regulated? (Part 2)

Part 1 of this topic covered the factors important to safety culture and amenable to measurement or assessment, the “known knowns.”   In this Part 2 we’ll review other factors we believe are important to safety culture but cannot be assessed very well, if at all, the “known unknowns” and the potential for factors or relationships important to safety culture that we don’t know about, the “unknown unknowns.”

Known Unknowns

These are factors that are probably important to regulating safety culture but cannot be assessed or cannot be assessed very well.  The hazard they pose is that deficient or declining performance may, over time, damage and degrade a previously adequate safety culture.

Measuring Safety Culture

This is the largest issue facing a regulator.  There is no meter or method that can be applied to an organization to obtain the value of some safety culture metric.  It’s challenging (impossible?) to robustly and validly assess, much less regulate, a variable that cannot be measured.  For a more complete discussion of this issue, please see our June 15, 2010 post.

Trust

If the plant staff does not trust management to do the right thing, even when it costs significant resources, then safety culture will be negatively affected.  How does one measure trust, with a survey?  I don’t think surveys offer more than an instantaneous estimate of any trust metric’s value.

Complacency

Organizations that accept things as they are, or always have been, and see no opportunity or need for improvement are guilty of complacency or, worse, hubris.  Lack of organizational reinforcement for a questioning attitude, especially when the questions may result in lost production or financial costs, is a de facto endorsement of complacency.  Complacency is often easy to see a posteriori but hard to detect as it occurs.  

Management competence

Does management implement and maintain consistent and effective management policies and processes?  Is the potential for goal conflict recognized and dealt with (i.e., are priorities set) in a transparent and widely accepted manner?  Organizations may get opinions on their managers’ competence, but not from the regulator.

The NRC does not evaluate plant or owner management competence.  They used to, or at least appeared to be trying to.  Remember the NRC senior management meetings, trending letters, and the Watch List?  While all the “problem” plants had material or work process issues, I believe a contributing factor was that the regulator had lost confidence in the competence of plant management.  This system led to the epidemic of shutdown plants in the 1990s.*   In reaction, politicians became concerned over the financial losses to plant owners and employees, and the Commission became concerned that the staff’s explicit/implicit management evaluation process was neither robust nor valid.

So the NRC replaced a data-informed subjective process with the Reactor Oversight Process (ROP), which looks at a set of “objective” performance indicators and a more subjective inference of cross-cutting issues: human performance, finding and fixing problems (CAP, a known), and management attention to safety and workers' ability to raise safety issues (SCWE, part known and part unknown).  I don’t believe that anyone, especially an outsider like a regulator, can get a reasonable picture of a plant’s safety culture from the “Rope.”  There most certainly are no leading or predictive safety performance indicators in this system.

External influences

These factors include changes in plant ownership, financial health of the owner, environmental regulations, employee perceptions about management’s “real” priorities, third-party assessments, local socio-political pressures and the like.  Any change in these factors could have some effect on safety culture.

Unknown Unknowns

These are the factors that affect safety culture but we don’t know about.  While a lot of smart people have invested significant time and effort in identifying factors that influence safety culture, new possibilities can still emerge.

For example, a new factor has just appeared on our radar screen: executive compensation.  Bob Cudlin has been researching the compensation packages for senior nuclear executives and some of the numbers are eye-popping, especially in comparison to historical utility norms.  Bob will soon post on his findings, including where safety figures into the compensation schemes, an important consideration since much executive compensation is incentive-based.

In addition, it could well be that there are interactions (feedback loops and the like) between and among the known and unknown factors, perhaps varying in structure and intensity over time, each affecting the evolutionary arc of an organization’s safety culture differently.  Because of such interactions, our hope that safety culture is essentially stable, with a relatively long decay time, may be false; safety culture may be susceptible to sudden drop-offs. 
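
As a purely illustrative sketch of that last point (our own construction, not a validated model), the toy simulation below couples two hypothetical factors, management credibility and staff trust, so that each props up the other.  Once either slips below an arbitrary threshold the mutual support weakens, and a slow drift turns into an accelerating decline.

# Toy illustration, not a validated model: two mutually reinforcing factors
# can hold each other up until one slips, after which decline accelerates.

def step(credibility, trust, external_pressure=0.02):
    # Each factor decays slightly and is propped up by the other; the prop
    # weakens sharply once the partner factor falls below 0.5.
    support_for_trust = credibility if credibility >= 0.5 else credibility * 0.5
    support_for_cred = trust if trust >= 0.5 else trust * 0.5
    new_trust = max(0.0, 0.9 * trust + 0.1 * support_for_trust - external_pressure)
    new_cred = max(0.0, 0.9 * credibility + 0.1 * support_for_cred - external_pressure)
    return new_cred, new_trust

cred, trust = 0.8, 0.8
for quarter in range(24):
    cred, trust = step(cred, trust)
    print(f"Q{quarter + 1}: credibility={cred:.2f}, trust={trust:.2f}")

Again, the parameters are invented.  The shape is what matters: a system held up by mutual reinforcement can look stable right up until it isn’t.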

The Bottom Line

Can safety culture be regulated?  At the current state of knowledge, with some “known knowns” but no standard approach to measuring safety culture and no leading safety performance indicators, we’d have to say “Yes, but only to some degree.”  The regulator may claim to have a handle on an organization’s safety culture through SCWE observations and indirect evidence, but we don’t think the regulator is in a good position to predict or even anticipate the next issue or incident related to safety culture in the nuclear industry. 

* In the U.S. in 1997, one couldn’t swing a dead cat without hitting a shutdown nuclear power plant.  17 units were shut down during all or part of that year, out of a total population of 108 units. 

Sunday, April 18, 2010

Safety Culture: Cause or Context (part 1)

As we have mentioned before, we are perplexed that people are still spending time working on safety culture definitions. After all, it’s not because of some definitional issue that problems associated with safety culture arise at nuclear plants. Perhaps one contributing factor to the ongoing discussion is that people hold different views of what the essence of safety culture is, views that are influenced by individuals’ backgrounds, experiences and expectations. Consultants, lawyers, engineers, managers, workers and social scientists can and do have different perceptions of safety culture. Using a term from system dynamics, they have different “mental models.”

Examining these mental models is not an empty semantic exercise; one’s mental model of safety culture determines (a) the degree to which one believes it is measurable, manageable or independent, i.e. separate from other organizational features, (b) whether safety culture is causally related to actions or simply a context for actions, and (c) most importantly, what specific strategies for improving safety performance might work.

To help identify different mental models, we will refer to a 2009 academic article by Susan Silbey,* a sociology professor at MIT. Her article does a good job of reviewing the voluminous safety culture literature and assigning authors and concepts into three main categories: Culture as (a) Causal Attitude, (b) Engineered Organization, and (c) Emergent and Indeterminate. To fit into our blog format, we will greatly summarize her paper, focusing on points that illustrate our notion of different mental models, and publish this analysis in two parts.

Safety Culture as Causal Attitude

In this model, safety culture is a general concept that refers to an organization’s collective values, beliefs, assumptions, and norms, often assessed using survey instruments. Explanations of accidents and incidents that focus on or blame an organization’s safety culture are really saying that the then-existing safety culture somehow caused the negative events to occur or can be linked to the events by some causal chain. (For an example of this approach, refer to the Baker Report on the 2005 BP Texas City refinery accident.)

Adopting this mental model, it follows logically that the corrective action should be to fix the safety culture. We’ve all seen, or been a part of, this – a new management team, more training, different procedures, meetings, closer supervision – all intended to fix something that cannot be seen but is explicitly or implicitly believed to be changeable and to some extent measurable.

This approach can and does work in the short run. Problems can arise in the longer term as non-safety performance goals demand attention; apparent success in the safety area breeds complacency; or repetitive, monotonous reinforcement becomes less effective, leading to safety culture decay. See our post of March 22, 2010 for a discussion of the decay phenomenon.

Perhaps because this model reinforces the notion that safety culture is an independent organizational characteristic, the model encourages involved parties (plant owners, regulators, the public) to view safety culture with a relatively narrow field of view. Periodic surveys and regulatory observations conclude a plant’s safety culture is satisfactory and everyone who counts accepts that conclusion. But then an event occurs like the recent situation at Vermont Yankee and suddenly people (or at least we) are asking: How can eleven employees at a plant with a good safety culture (as indicated by survey) produce or endorse a report that can mislead reviewers on a topic that can affect public health and safety?

Safety Culture as Engineered Organization

This model is evident in the work of the High Reliability Organization (HRO) writers. Their general concept of safety culture appears similar to the Causal Attitude camp but HRO differs in “its explicit articulation of the organizational configuration and practices that should make organizations more reliably safe.” (Silbey, p. 353) It focuses on an organization’s learning culture where “organizational learning takes place through trial and error, supplemented by anticipatory simulations.” Believers are basically optimistic that effective organizational prescriptions for achieving safety goals can be identified, specified and implemented.

This model appears to work best in a command and control organization, i.e., the military. Why? Primarily because a specific military service is characterized by a homogeneous organizational culture, i.e., norms are shared both hierarchically (up and down) and across the service. Frequent personnel transfers at all organizational levels remove people from one situation and reinsert them into another, similar situation. Many of the physical settings are similar – one ship of a certain type and class looks pretty much like another; military bases have a common set of facilities.

In contrast, commercial nuclear plants represent a somewhat different population. Many staff members work more or less permanently at a specific plant and the industry could not have come up with more unique physical plant configurations if it had tried. Perhaps it is not surprising that HRO research, including reviews of nuclear plants, has shown strong cultural homogeneity within individual organizations but lack of a shared culture across organizations.

At its best, the model can instill “processes of collective mindfulness” or “interpretive work directed at weak signals.” At its worst, if everyone sees things alike, an organization can “[drift] toward[s] inertia without consideration that things could be different.” (Weick 1999, quoted in Silbey, p.354) Because HRO is highly dependent on cultural homogeneity, it may be less conscious of growing problems if the organization starts to slowly go off the rails, a la the space shuttle Challenger.

We have seen efforts to implement this model at individual nuclear plants, usually by trying to get everything done “the Navy way.” We have even promoted this view when we talked back in the late 1990s about the benefits of industry consolidation and the best practices that were being implemented by Advanced Nuclear Enterprises (a term Bob coined in 1996). Today, we can see that this model provides a temporary, partial answer but can face challenges in the longer run if it does not constantly adjust to the dynamic nature of safety culture.

Stay tuned for Safety Culture: Cause or Context (part 2).

* Susan S. Silbey, "Taming Prometheus: Talk of Safety and Culture," Annual Review of Sociology, Volume 35, September 2009, pp. 341-369.

Friday, March 19, 2010

“We have a great safety culture = deep trouble” or what squirrels can teach us...

This podcast is excerpted from an interview of Dr. James Reason in association with VoiceMap (www.voicemap.net), a provider of live guidance applications to improve human performance. In this third segment, Dr. Reason discusses impediments to safety culture. He observes that when management announces that we have a great safety culture, it should be taken as a symptom of an organization that is vulnerable. The proper posture according to Dr. Reason is the “chronic unease” that he sees embodied in squirrels and other species that see constant vulnerability, even when there is no apparent immediate threat. The inverse of chronic unease is, of course, complacency. The “c” word has been invoked more frequently of late by the NRC (see our November 12, 2009 post) which could be viewed as threat enough.

Wednesday, March 10, 2010

"Normalization of a Deviation"

These are the words of John Carlin, Vice President at the Ginna Nuclear Plant, referring to a situation in the past where chronic water leakages from the reactor refueling pit were tolerated by the plant’s former owners. 

The quote is from a piece reported by Energy & Environment Publishing’s Peter Behr in its ClimateWire online publication titled, “Aging Reactors Put Nuclear Power Plant ‘Safety Culture’ in the Spotlight” and also published in The New York Times.  The focus is on a series of incidents with safety culture implications that have occurred at the Nine Mile Point and Ginna plants now owned and operated by Constellation Energy.

The recitation of events and the responses of managers and regulators are very familiar.  The drip, drip, drip is not the sound of water leaking but the uninspired give and take of the safety culture dialogue that occurs each time there is an incident or series of incidents that suggest safety culture is not working as it should.

Managers admit they need to adopt a questioning attitude and improve the rigor of decision making; ensure they have the right “mindset”; and corporate promises “a campaign to make sure its employees across the company buy into the need for an exacting attention to safety.”  Regulators remind the licensee, "The nuclear industry remains ... just one incident away from retrenchment..." but must be wondering why these events are occurring when NRC performance indicators for the plants and INPO rankings do not indicate problems.  Pledges to improve safety culture are put forth earnestly and (I believe) in good faith.

The drip, drip, drip of safety culture failures may not be cause for outright alarm or questioning of the fundamental safety of nuclear operations, but it does highlight what seems to be a condition of safety culture stasis - a standoff of sorts where significant progress has been made but problems continue to arise, and the same palliatives are applied.  Perhaps more significantly, it suggests that the evolution of thinking about safety culture has plateaued.  Peaking too early is a problem in politics and sports, and so it appears in nuclear safety culture.

This is why the remark by John Carlin was so refreshing.  For those not familiar with the context of his words, “normalization of deviation” echoes “normalization of deviance,” a concept developed by Diane Vaughan in her exceptional study of the space shuttle Challenger accident.  Readers of this blog will recall that we are fans of her book, The Challenger Launch Decision, where she uses that mechanism to explain the gradual acceptance of performance results that are outside normal acceptance criteria.  Most scary, an organization's standards can decay and no one even notices.  How this occurs and what can be done about it are concepts that should be central to current considerations of safety culture. 

For further thoughts from our blog on this subject, refer to our posts dated October 6, 2009 and November 12, 2009.  In the latter, we discuss the nature of complacency and its insidious impact on the very process that is designed to avoid it in the first place.