Monday, April 27, 2015

INPO’s View on Fukushima Safety Culture Lessons Learned

In November 2011 the Institute of Nuclear Power Operations (INPO) published a special report* on the March 2011 Fukushima accident.  The report provided an overview and timeline for the accident, focusing on the evolution of the situation during the first several days after the earthquake and tsunami.  Safety culture (SC) was not mentioned in the report.

In August 2012 INPO issued an addendum** to the report covering Fukushima lessons learned in eight areas, including SC.  Each area contains a lengthy discussion of relevant plant activities and experiences, followed by specific lessons learned.  According to INPO, some lessons learned may be new or different from those published elsewhere.  Several caught our attention as we paged through the addendum:

- Invest resources to assess low-probability, high-consequence events (Black Swans).
- Beef up available plant staffing to support the regular staff in case a severe, long-duration event inconveniently occurs on a weekend.
- Evaluate the robustness of off-site event management facilities (TEPCO’s was inaccessible, lost power and did not have filtered ventilation).
- Be aware that assigning most decision-making authority to the control room crew (as TEPCO did) meant other plant groups could not challenge or check ops’ decisions, i.e., efficiency at the cost of thoroughness.
- Conduct additional training for a high-dose environment when normal dosage limits are replaced with emergency ones.
- Ensure that key personnel have in-depth reactor and power plant knowledge to respond effectively if situations evolve beyond established procedures and flexibility is required.

Turning to SC, the introduction to that section of the addendum is clear and unexpectedly strong: “History has shown that accidents and their precursors at commercial nuclear electric generating stations result from a series of decisions and actions that reflect flaws in the shared assumptions, values, and beliefs of the operating organization.” (p. 33)

The SC lessons learned are helpful.  INPO observed that while TEPCO had taken several steps over the years to strengthen its SC, it missed big-picture issues including cultivating a questioning attitude, challenging assumptions, practicing safety-first decision making and promoting organizational learning.  In each of these areas, the report covers specific deficiencies or challenges faced at Fukushima, followed by questions asking readers to consider whether similar conditions exist or could exist at their own facilities.

Our Perspective

The addendum has a significant scope limitation: it does not address public policy (e.g., regulatory or governmental) factors that contributed to the Fukushima accident and yielded their own lessons learned.***  However, given the specified scope, a quick read of the entire addendum suggests it’s reasonably thorough; the SC section certainly is.  The questions aimed at report readers are the kind we ask all the time on Safetymatters, but we award INPO full marks for addressing these general, qualitative, open-ended subjects.  One question INPO raised that we have not specifically asked is “To what extent are the safety implications considered during enterprise business planning and budgeting?” (italics added)  Another, inferred from the report text, is “How do operators create complex, realistic scenarios (e.g., with insufficient information and/or personnel under stress) during emergency training?”  These are legitimate additions to the repertoire.

The addendum is not perfect.  For example, INPO trots out the “special and unique” mantra when discussing the essential requirements to maintain core cooling capability and containment integrity (especially with respect to venting at Fukushima).  This mantra, coupled with INPO’s usual penchant for secrecy, undermines public support for commercial nuclear power.  INPO can be a force for good when its work products, like this report and addendum, are publicly available.  It would be better for the industry if INPO were more transparent and if commercial nuclear power were characterized as a safety-intensive industrial process run by ordinary, albeit highly trained, people.

Bottom line, you should read the addendum looking for bits that apply to your own situation.


*  INPO, “Special Report on the Nuclear Accident at the Fukushima Daiichi Nuclear Power Station,” INPO 11-005 Rev. 0 (Nov. 2011).

**  INPO, “Lessons Learned from the Nuclear Accident at the Fukushima Daiichi Nuclear Power Station,” INPO 11-005 Rev. 0 Addendum (Aug. 2012).  Thanks to Madalina Tronea for publicizing this document.  Dr. Tronea is the founder/moderator of the LinkedIn Nuclear Safety discussion group.

***  Regulatory, government and corporate governance lessons learned have been publicized by other Fukushima reviewers and the findings widely distributed, including on Safetymatters.  Click on the Fukushima label to see our related posts. 

Wednesday, April 22, 2015

More Evidence of Weak Safety Culture in DOE

We have posted many times about safety culture (SC) issues in the Department of Energy (DOE) empire.  Many of those issues have been raised by the Defense Nuclear Facilities Safety Board (DNFSB), an overseer of DOE activities.  Following is a recent example based on a DNFSB staff report.*

The Radcalc Imbroglio

Radcalc is a computer program used across the DOE complex (and beyond) to determine the transportation package classification for radioactive materials, including radioactive waste, based on their isotopic content.  Radcalc errors could lead to serious consequences, e.g., radiation exposure or explosions, in the event of a transportation accident.  DOE classified Radcalc as safety software and assigned it the second-highest level of rigor in DOE’s software quality assurance (SQA) procedures.
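To make that concrete, here is a minimal, hypothetical Python sketch of the kind of calculation such a tool performs: comparing a shipment’s isotopic activity against per-isotope limits to select a package category.  The limit values and category labels below are invented placeholders, not Radcalc logic or regulatory figures; the point is simply that an error in software like this directly affects how hazardous material is packaged for transport.

# Illustrative only: the limits and categories are invented placeholders,
# not actual regulatory values or Radcalc logic.

# Hypothetical per-isotope activity limits (TBq) for a standard package;
# exceeding a limit would require a more robust package type.
PACKAGE_LIMITS_TBQ = {
    "Co-60": 0.4,
    "Cs-137": 0.6,
    "Sr-90": 0.3,
}

def classify_package(contents_tbq):
    """Return a package category based on the most limiting isotope.

    contents_tbq: dict mapping isotope name -> activity in TBq.
    """
    worst_fraction = 0.0
    for isotope, activity in contents_tbq.items():
        limit = PACKAGE_LIMITS_TBQ.get(isotope)
        if limit is None:
            # Unknown isotope: fail safe by requiring the most robust package.
            return "most robust package"
        worst_fraction = max(worst_fraction, activity / limit)

    return "standard package" if worst_fraction <= 1.0 else "most robust package"

# One isotope slightly over its (hypothetical) limit drives the whole shipment
# into the more robust category.
print(classify_package({"Co-60": 0.1, "Cs-137": 0.7}))  # -> most robust package

A bug in the limit table, the unit handling or the comparison logic would silently misclassify shipments, which is why the software quality assurance and configuration management discussed below matter.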

A DNFSB audit found multiple deficiencies with respect to Radcalc, most prominently DOE’s inability to provide any evidence of federal oversight of Radcalc during the software’s lifetime (which dates back to the mid-1990s).  In addition, there was no evidence DOE contractors had any Radcalc-related QA plans or programs, or maintained software configuration management.  Neither DOE nor the contractors effectively used their corrective action programs to identify and correct software problems.  DNFSB identified other problems, but you get the idea.

DNFSB Analysis

As part of its analysis of problems and causes, the DNFSB identified multiple contributing factors including the following related to organization.  “There is an apparent lack of a systematic, structured, and documented approach to determine which organization within DOE is responsible to perform QA audits of contractor organizations.  During the review, different organizations within DOE stated that they thought another organization was responsible for performing Radcalc contractor QA audits.  DOE procedures do not clearly delineate which organization is responsible for QA/SQA audits and assessments.” (Report, p. 4)

Later, the report says “In addition, this review identified potentially significant systemic [emphasis added] concerns that could affect other safety software. These are: inadequate QA/SQA requirement specification in DOE contracts and the lack of policy identifying the DOE organizations in charge of performing QA assessments to ensure compliance; unqualified and/or inadequate numbers of qualified federal personnel to oversee contract work; . . . and additional instances of inadequate oversight of computer work within DOE (e.g., Radtran).” (Report, p. 5)

Our Perspective

Even without the DNFSB pointing out “systemic” concerns, this report practically shouts the question “What kind of SC would let this happen?”  We are talking about a large group of organizations where a significant, safety-related activity failed to take place and the primary reason (excuse) is “Not my group’s job.”  And no one took on the task of determining whose job it was.  This underlying cultural attitude could be as significant as the highly publicized SC problems at individual DOE facilities, e.g., the Hanford Waste Treatment Plant or the Waste Isolation Pilot Plant.

The DNFSB asked DOE to respond to the report within 90 days.  What will that response say?  Let’s go out on a limb here and predict it will call for “improved procedures, training and oversight.”  The probability of anyone facing discipline over this lapse: zero.  The probability of DOE investigating its own and/or contractor cultures for a possible systemic weakness: also zero.  Why?  Because there’s no money in it for DOE or the contractors, and the DNFSB doesn’t have the organizational or moral authority to force it to happen.

We’ve always championed the DNFSB as the good guys, trying to do the right thing with few resources.  But the sad reality is they are a largely invisible backroom bureaucracy.  When a refinery catches fire, the Chemical Safety Board is front and center explaining what happened and what they’ll recommend to keep it from happening again.  When was the last time you saw the DNFSB on the news or testifying before Congress?  Their former chairman retired suddenly late last year, with zero fanfare; we think it’s highly likely the SC initiative he championed and attempted to promulgate throughout DOE went out the door with him.


*  J.H. Roberson (DNFSB) to D.M. Klaus (DOE), letter (Mar. 16, 2015) with enclosed Staff Issue Report “Review of Federal Oversight of Software Quality Assurance for Radcalc” (Dec. 17, 2014).  Thanks to Bill Mullins for bringing this document to our attention.

Monday, April 13, 2015

Safety-I and Safety-II: The Past and Future of Safety Management by Erik Hollnagel

This book* discusses two different ways of conceptualizing safety performance problems (e.g., near-misses, incidents and accidents) and safety management in socio-technical systems.  This post describes each approach and provides our perspective on Hollnagel’s efforts.  As usual, our interest lies in the potential value new ways of thinking can offer to the nuclear industry.

Safety-I

This is the common way of looking at safety performance problems.  It is reactive, i.e., it waits for problems to arise,** and analytic, i.e., it uses specific methods to work back from the problem to its root causes.  The key assumption is that something in the system has failed or malfunctioned, and the purpose of an investigation is to identify the causes and correct them so the problem will not recur.  A second assumption is that chains of causes and effects are linear, i.e., it is actually feasible to start with a problem and work back to its causes.  A third assumption is that a single explanation (the “first story”) can be found. (pp. 86, 175-76)***  Underlying biases include the hindsight bias (p. 176) and the belief that the human is usually the weak link. (pp. 78-79)  The focus of safety management is minimizing the number of things that go wrong.

Our treatment of Safety-I is brief because we have reported on criticism of linear thinking/models elsewhere, primarily in the work of Dekker, Woods et al., and Leveson.  See our posts of Dec. 5, 2012; July 6, 2013; and Nov. 11, 2013 for details.

Safety-II

Safety-II is proposed as a different way to look at safety performance.  It is proactive, i.e., it looks at the ways work is actually performed on a day-to-day basis and tries to identify causes of performance variability and then manage them.  A key cause of variability is the regular adjustments people make in performing their jobs in order to keep the system running.  In Hollnagel’s view, “Finding out what these [performance] adjustments are and trying to learn from them can be more important than finding the causes of infrequent adverse outcomes!” (p. 149)  The focus of safety management is on increasing the likelihood that things will go right and developing “the ability to succeed under varying conditions, . . .” (p. 137).

Performance is variable because, among other reasons, people are always making trade-offs between thoroughness and efficiency.  They may use heuristics, compensate for something that is missing, or take some steps today to avoid future problems.  The underlying assumption of Safety-II is that the same behaviors that almost always lead to successful outcomes can occasionally lead to problems because of performance variability that goes beyond the boundary of the control space.  A second assumption is that chains of causes and effects may be non-linear, i.e., a small variance may lead to a large problem, and may have an emergent aspect: a specific performance variability may occur and then disappear, or the Swiss cheese holes may momentarily line up, exposing the system to latent hazards. (pp. 66, 131-32)  There may be multiple explanations (“second stories”) for why a particular problem occurred.  Finally, Safety-II accepts that there are often differences between Work-as-Imagined (especially as imagined by folks at the blunt end) and Work-as-Done (by people at the sharp end). (pp. 40-41)***
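A toy simulation may help make the non-linearity point concrete.  The Python sketch below is our own invented illustration, not anything from Hollnagel’s book: several defenses each drift by a small random amount every day; on almost every day the combined drift is harmless, but occasionally the drifts line up and cross an assumed control boundary.

import random

# Toy model: several independent defenses each drift by a small random amount
# each day.  Individually the drifts are small; only when they happen to line
# up does the combined drift cross the (assumed) control boundary.
N_DEFENSES = 5
BOUNDARY = 4.0        # hypothetical limit on acceptable total drift
DAYS = 100_000

random.seed(1)
bad_days = 0
for _ in range(DAYS):
    total_drift = sum(random.uniform(0.0, 1.0) for _ in range(N_DEFENSES))
    if total_drift > BOUNDARY:
        bad_days += 1

print(f"Days with tolerable variability: {DAYS - bad_days}")
print(f"Days where small drifts lined up past the boundary: {bad_days}")
# With these made-up numbers the boundary is crossed on roughly 1 day in 120:
# the same everyday variability that almost always "goes right" occasionally doesn't.

The numbers are arbitrary; the behavior is the point.  Ordinary, mostly benign variability is what occasionally produces the bad outcome, which is why Safety-II argues for studying everyday work rather than only the rare failure.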

The Two Approaches

Safety-I and Safety-II are not in some winner-take-all competitive struggle.  Hollnagel notes there are plenty of problems for which a Safety-I investigation is appropriate and adequate. (pp. 141, 146)

Safety-I expenditures are viewed as a cost (to reduce errors). (p. 57)  In contrast, Safety-II expenditures are viewed as bona fide investments to create more correct outcomes. (p. 166)

In all cases, organizational factors, such as safety culture, can impact safety performance and organizational learning. (p. 31)

Our Perspective

The more complex a socio-technical entity is, the more it exhibits emergent properties and the more appropriate Safety-II thinking is.  And nuclear has some elements of complexity.****  In addition, Hollnagel notes that a common explanation for failures that occur in a Safety-I world is “it was never imagined something like that could happen.” (p. 172)  To avoid being the one in front of the cameras saying that, it might be helpful for you to spend a little time reflecting on how Safety-II thinking might apply in your world.

Why do most things go right?  Is it due to strict compliance with procedures?  Does personal creativity or insight contribute to successful plant performance?  Do you talk with your colleagues about possible efficiency-thoroughness trade-offs (shortcuts) that you or others make?  Can thinking about why things go right make one more alert to situations where things are heading south?  Does more automation (intended to reduce reliance on fallible humans) actually move performance closer to the control boundary because it removes the human’s ability to make useful adjustments?  Have any of your root cause evaluations appeared to miss other plausible explanations for why a problem occurred?

Some of the Safety-II material is not new.  Performance variability in Safety-II builds on Hollnagel’s earlier work on the efficiency-thoroughness trade-off (ETTO) principle.  (See our Jan. 3, 2013 post.)   His call for mindfulness and constant alertness to problems is straight out of the High Reliability Organization playbook. (pp. 36, 163-64)  (See our May 3, 2013 post.)

A definite shortcoming is the lack of concrete examples in the Safety-II discussion.  If someone has tried to put Safety-II into practice, it would be nice to hear about it.

Bottom line, Hollnagel has some interesting observations although his Safety-II model is probably not the Next Big Thing for nuclear safety management.

 

*  E. Hollnagel, Safety-I and Safety-II: The Past and Future of Safety Management (Burlington, VT: Ashgate, 2014).

**  In the author’s view, forward-looking risk analysis is not proactive because it is infrequently performed. (p. 57) 

***  There are other assumptions in the Safety-I approach (see pp. 97-104) but for the sake of efficiency, they are omitted from this post.

****  Nuclear power plants have some aspects of a complex socio-technical system but other aspects are merely complicated.   On the operations side, activities are tightly coupled (one attribute of complexity) but most of the internal organizational workings are complicated.  The lack of sudden environmental disrupters (excepting natural disasters) means they have time to adapt to changes in their financial or regulatory environment, reducing complexity.