
Monday, December 14, 2020

Implications of Randomness: Lessons from Nassim Taleb

Most of us know Nassim Nicholas Taleb from his bestseller The Black Swan. However, he wrote an earlier book, Fooled by Randomness*, in which he laid out one of his seminal propositions: many things in life that we believe have identifiable, deterministic causes, such as prescient decision making or exceptional skill, are actually the result of largely random processes. Taleb focuses on financial markets but we believe his observations can refine our thinking about organizational decision making, mental models, and culture.

We'll begin with an example of how Taleb believes we misperceive reality. Consider a group of stockbrokers with successful 5-year track records. Most of us will assume they must be unusually skilled. However, we fail to consider how many other people started out as stockbrokers 5 years ago and fell by the wayside because of poor performance. Even if all the stockbrokers were less skilled than a simple coin flipper, some would still be successful over a 5-year period. The survivors are the result of an essentially random process and their track records mean very little going forward.
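
To make the survivorship point concrete, here is a minimal simulation sketch in Python; the cohort size is invented and each year's performance is treated as a literal coin flip.

```python
import random

def surviving_coin_flippers(cohort_size=10_000, years=5, p_good_year=0.5, seed=1):
    """Count 'stockbrokers' who post a winning year every year for 'years' years
    when each year's result is literally a coin flip."""
    random.seed(seed)
    survivors = 0
    for _ in range(cohort_size):
        if all(random.random() < p_good_year for _ in range(years)):
            survivors += 1
    return survivors

# Expected survivors by luck alone: 10,000 * 0.5**5, roughly 312.
print(surviving_coin_flippers())
```

Even with zero skill anywhere in the cohort, a few hundred "proven winners" emerge; their track records say nothing about their future performance.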

Taleb ascribes our failure to correctly see things (our inadequate mental models) to several biases. First is hindsight bias, where the past is always seen as deterministic, which feeds our willingness to backfit theories or models to experience after the fact. Causality can be very complex but we prefer to simplify it. Second, because of survivorship bias, we see and consider only the current survivors from an initial cohort; the losers do not show up in our assessment of the probability of success going forward. Third, our attribution bias tells us that successes are due to skill, and failures to randomness.

Taleb describes other factors that prevent us from being the rational thinkers postulated by classical economics or Cartesian philosophy. One set of factors arises from how our brains are hardwired and another set from the way we incorrectly process data presented to us.

The brain wiring issues draw on the work of Daniel Kahneman, who describes how we use and rely on heuristics (mental shortcuts that we invoke automatically) to make day-to-day decisions. Thus, we make many decisions without really thinking or applying reason, and we are subject to other built-in biases, including our overconfidence in small samples and the role of emotions in driving our decisions. We reviewed Kahneman's work at length in our Dec. 18, 2013 post. Taleb notes that we also have a hard time recognizing and dealing with risk. Risk detection and risk avoidance are mediated in the emotional part of the brain, not the thinking part, so rational thinking has little to do with risk avoidance.

We also make errors when handling data in more formal settings. For example, we ignore the mathematical truth that the absolute sample size matters greatly, much more than the sample size as a percentage of the overall population. We also ignore regression to the mean, which says that absent systemic changes, performance will eventually return to its average value. More perniciously, ignorant or unethical researchers will direct their computers to look for any significant relationship in a data set, a practice that often produces spurious relationships because each individual test has its own error rate. “Data snoops” will define some rule, then go looking for data that supports it. Why are researchers inclined to fudge their analyses? Because research with no significant result does not get published.
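
As an illustration of the data snooping problem, the following sketch (hypothetical numbers, standard-library Python only) correlates many unrelated random "predictors" with a random outcome; a handful clear a conventional significance bar purely by chance.

```python
import random
from statistics import correlation  # Python 3.10+

def spurious_hits(n_predictors=100, n_obs=30, r_cutoff=0.36, seed=2):
    """Correlate many random 'predictors' with a random outcome and count how
    many exceed a correlation size (roughly |r| for p < 0.05 at n = 30) by chance."""
    random.seed(seed)
    outcome = [random.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [random.gauss(0, 1) for _ in range(n_obs)]
        if abs(correlation(predictor, outcome)) > r_cutoff:
            hits += 1
    return hits

# With 100 unrelated predictors, expect a handful of purely spurious "findings".
print(spurious_hits())
```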

Our Perspective

We'll start with the obvious: Taleb has a large ego and is not shy about calling out people with whom he disagrees or does not respect. That said, his observations have useful implications for how we conceptualize the socio-technical systems in which we operate, i.e., our mental models, and present specific challenges for the culture of our organizations.

In our view, the three driving functions for any system's performance over time are determinism (cause and effect), choice (decision making), and probability. At heart, Taleb's world view is that the world functions more probabilistically than most people realize. A method he employs to illustrate alternative futures is Monte Carlo simulation, which we used to forecast nuclear power plant performance back in the 1990s. We wanted plant operators to see that certain low-probability events, i.e., Black Swans**, could occur in spite of the best efforts to eliminate them via plant design, improved equipment and procedures, and other means. Some unfortunate outcomes could occur because they were baked into the system from the get-go and eventually manifested themselves. This is what Charles Perrow meant by “normal accidents,” where normal system performance excursions go beyond system boundaries. For more on Perrow, see our Aug. 29, 2013 post.
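
The arithmetic behind "baked into the system from the get-go" is simple. The sketch below is not the Monte Carlo model we used in the 1990s; it uses invented numbers purely to show how a rare event becomes a realistic possibility over a fleet of plants and several decades.

```python
def prob_at_least_one(p_per_unit_year, units, years):
    """Probability that an independent low-probability event occurs at least once
    across 'units' systems operated for 'years' years each."""
    return 1.0 - (1.0 - p_per_unit_year) ** (units * years)

# Hypothetical: a 1-in-1,000 per plant-year event, 100 plants, 40 years.
print(f"{prob_at_least_one(0.001, 100, 40):.0%}")   # roughly 98%
```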

Of course, the probability distribution of system performance may not be stationary over time. In the most extreme case, when all system attributes change, it's called regime change. In addition, system performance may be nonlinear, where small inputs may lead to a disproportionate response, or poor performance can build slowly and suddenly cascade into failure. For some systems, no matter how specifically they are described, there will inherently be some possibility of errors, e.g., consider healthcare tests and diagnoses where both false positives and false negatives can be non-trivial occurrences.
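
The healthcare example can be made concrete with a worked base-rate calculation (all numbers hypothetical): even a fairly accurate test produces many false positives when the condition is rare.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive test result is a true positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 1% prevalence, 95% sensitivity, 95% specificity.
# Despite the test being "95% accurate," only about 16% of positives are real.
print(f"{positive_predictive_value(0.01, 0.95, 0.95):.0%}")
```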

What does this mean for organizational culture? For starters, the organization must acknowledge that many of its members are inherently somewhat irrational. It can try to force greater rationality on its members through policies, procedures, and practices, instilled by training and enforced by supervision, but there will always be leaks. A better approach would be to develop defense in depth designs, error-tolerant sub-systems with error correction capabilities, and a “just culture” that recognizes that honest mistakes will occur.

Bottom line: You should think awhile about how many aspects of your work environment have probabilistic attributes.

 

* N.N. Taleb, Fooled by Randomness, 2nd ed. (New York: Random House, 2004).

** Black swans are not always bad. For example, an actor can have one breakthrough role that leads to fame and fortune; far more actors will always be waiting tables and parking cars.

Tuesday, June 7, 2016

The Criminalization of Safety (Part 3)


Our Perspective

The facts and circumstances of the events described in Table 1 in Part 1 point to a common driver - the collision of business and safety priorities, with safety being compromised.  Culture is inferred as the “cause” in several of the events but with little amplification or specifics.[1]  The compromises in some cases were intentional; in others they were the product of a more complex rationalization.  The events have been accompanied by increased criminal prosecutions with varied success. 

We think it is fair to say that so far, criminalization of safety performance does not appear to be an effective remedy.  Statutory limitations and proof issues are significant obstacles with no easy solution.  The reality is that criminalization is at its core a “disincentive”.  To be effective it would have to deter actions or decisions that are not consistent with safety without creating a minefield of culpability.  It is also a blunt instrument, requiring rather egregious behavior to rise to the level of criminality.  Its best use is probably as an ultimate boundary, to deter intentional misconduct without becoming an unintended trap for bad judgment or inadequate performance.  In another vein, criminalization would also seem incompatible with the concept of a “just culture” other than for situations involving intentional misconduct or gross negligence.

Whether effective or not, criminalization reflects the urgency felt by government authorities to constrain excessive risk taking, intentional or not, and enhance oversight.  It is increasingly clear that current regulatory approaches are missing the mark.  All of the events catalogued in Table 1 occurred in industries that are subject to detailed safety and environmental regulation.  After-the-fact assessments highlight missed opportunities for more assertive regulatory intervention, and in the Flint cases criminal charges are actually being applied to regulators.  The Fukushima event precipitated a complete overhaul of the nuclear regulatory structure in Japan, still a work in progress.  Post hoc punishments, no matter how severe, are not a substitute.

Nuclear Regulation Initiatives

Looking specifically at nuclear regulation in the U.S., we believe several specific reforms should be considered.  It is always difficult to reform without the impetus of a major safety event, but we see these as actions that would appear obvious in a post-event assessment if there were ever an “O-ring” moment in the nuclear industry.[2]

1. The NRC should include the safety management system in its regulatory activities.

The NRC has effectively constructed a cordon sanitaire around safety management by decreeing that “management” is beyond the scope of regulation.  The NRC relies on the fact that licensees bear the primary responsibility for safety and the NRC should not intrude into that role.  If one contemplates the trend of recent events scrutinizing the performance of regulators following safety events, this legalistic “defense” may not fare well in a situation where more intrusive regulation could have made the difference.

The NRC does monitor “safety culture” and often requires licensees to address weaknesses in culture following performance issues.  In essence safety culture has become an anodyne for avoiding direct confrontation of safety management issues.  Cynically one could say it is the ultimate conspiracy - where regulators and “stakeholders” come together to accept something that is non-contentious and conveniently abstract to prevent a necessary but unwanted (apparently by both sides) intrusion into safety management.

As readers of this blog know, our unyielding focus has been on the role of the complex socio-technical system that functions within a nuclear organization to operate nuclear plants effectively and safely.  This management system includes many drivers, variables, feedbacks, culture, and time delays in its processes, not all of which are explicit or linear.  The outputs of the system are the actions and decisions that ultimately produce tangible outcomes for production and safety.  Thus it is a safety system and a legitimate and necessary area for regulation.

NRC review of safety management need not focus on traditional management issues, which would remain the province of the licensee.  So organizational structure, personnel decisions, etc. need not be considered.[3]  But here we should heed the view of Daniel Kahneman, who suggests we think of organizations as “factories for producing decisions” and therefore think of decisions as a product.  (See our Nov. 4, 2011 post, A Factory for Producing Decisions.)  Decisions are in fact the key product of the safety management system.  Regulatory focus on how the management system functions and the decisions it produces could be an effective and proactive approach.

We suggest two areas of the management system that could be addressed as a first priority: (1) Increased transparency of how the management system produces specific safety decisions including the capture of objective data on each such decision, and (2) review of management compensation plans to minimize the potential for incentives to promote excessive risk taking in operations.

2. The NRC should require greater transparency in licensee management decisions with potential safety impacts.

Managing nuclear operations involves a continuum of decisions balancing a variety of factors including production and safety.  These decisions may occur with individuals or with larger groups in meetings or other forums.  Some may involve multiple reviews and concurrences.  But in general the details of decision making, i.e., how the sausage is made, are rarely captured in detail during the process or preserved for later assessment.[4]  Typically only decisions that happen to yield a bad outcome (e.g., prompt the issuance of an LER or similar), or actions that require specific, advance regulatory approval via an SER or equivalent,[5] become subject to more intensive review and post mortem.  

Transparency is key.  Some say the true test of ethics is what people do when no one is looking.  Well the converse of that may also be true - do people behave better when they know oversight is or could be occurring?  We think a lot of the NRC’s regulatory scheme is already built on this premise, relying as it does on auditing licensee activities and work products.

Thinking back to the Davis Besse example, the criminal prosecutions of both the corporate entity and individuals were limited to providing false or incomplete information to the NRC.  There was no attempt to charge on the basis of the actual decisions to propose, advocate for, and attempt to justify, that the plant could continue to operate beyond the NRC’s specified date for corrective actions.  The case made by First Energy was questionable as presented to the NRC and simply unjustified when accounting for the real facts behind their vessel head inspections.

Transparency would be served by documenting and preserving the decision process on safety significant issues.  These data might include the safety significance and applicable criteria, the potential impact on business performance (plant output, cost, schedule, etc.), alternatives considered, the participants and their inputs to the decision making process, and how a final decision was reached.  These are the specifics that are so hard or impossible to reproduce after the fact.[6]  The not unexpected result: blaming someone or something but not gaining insight into how the management system failed.
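
As a sketch of what preserving such a decision record might look like, here is one possible structure in Python; the field names and example entry are our own illustration, not any NRC or industry format.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SafetyDecisionRecord:
    """Illustrative record of a safety-significant decision, capturing the
    elements listed above so the decision can be audited after the fact."""
    issue: str
    decision_date: date
    safety_significance: str                      # screening result and criteria applied
    business_impact: str                          # plant output, cost, schedule implications
    alternatives_considered: list[str] = field(default_factory=list)
    participants_and_inputs: dict[str, str] = field(default_factory=dict)
    final_decision: str = ""
    rationale: str = ""

# Hypothetical entry, for illustration only.
record = SafetyDecisionRecord(
    issue="Defer inspection of component X to the next outage",
    decision_date=date(2016, 6, 7),
    safety_significance="Low per screening criteria (illustrative)",
    business_impact="Avoids a three-day outage extension",
    alternatives_considered=["Inspect now", "Partial inspection", "Defer"],
    participants_and_inputs={"Operations": "supports deferral", "Engineering": "no objection"},
    final_decision="Defer",
    rationale="Risk judged acceptable for one cycle; compensatory monitoring added",
)
```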

This approach would provide an opportunity for the NRC to audit decisions on a routine basis.  Licensee self assessment would also be served through safety committee review and other oversight including INPO.  Knowing that decisions will be subject to such scrutiny can also promote careful balancing of factors in safety decisions and help articulate how those balances are achieved and how safety is served.  Having such tangible information shared throughout the organization could be the strongest way to reinforce the desired safety culture.

3. As part of its regulation of the safety management system, the NRC should restrict incentive compensation for nuclear management that is based on meeting business goals.

We started this series of posts focusing on criminalization of safety.  One of the arguments for more aggressive criminalization is essentially to offset the powerful pull of business-based incentives with the fear of criminal sanctions.  This has proved to be elusive.  Similarly, attempting to balance business incentives with safety incentives is also problematic.  The Transocean experience illustrates that quite vividly.[7]

Our survey several years ago of nuclear executive compensation indicated (1) the amounts of compensation are very significant for the top nuclear executives, (2) the compensation is heavily dependent on each year's performance, and (3) business performance measured by EPS is the key to compensation; safety performance is a minor contributor.  A corollary to the third point might be that in no case that we could identify was safety performance a condition precedent or qualification for earning the business-based incentives. (See our July 9, 2010 post, Nuclear Management Compensation (Part 2)).  With 60-70% of total compensation at risk, executives can see their compensation, and that of the entire management team, impacted by as much as several million dollars in a year.  Can this type of compensation structure impact safety?  Intuition says it creates both risk and perception problems.  Virtually every significant safety event in Table 1 includes reference to the undue influence of production priorities on safety.  The issue was directly raised in at least one nuclear organization[8] which revised its compensation system to avoid undermining safety culture. 

We believe a more effective approach is to minimize the business pressures in the first place.  We believe there is a need for a regulatory policy that discourages or prohibits licensee organizations from utilizing significant incentives based on financial performance.  Such incentives invariably target production and budget goals as they are fundamental to business success.  To the extent safety goals are included they are a small factor or based on metrics that do not reflect fundamental safety.  Assuring safety is the highest priority is not subject to easily quantifiable and measurable metrics - it is judgmental and implicit in many actions and decisions taken on a day-to-day basis at all levels of the organization.  Organizations should pay nuclear management competitively and generously and make informed judgments about their overall performance.

Others have recognized the problem and taken similar steps to address it.  For example, in the aftermath of the financial crisis of 2008 the Federal Reserve Board has been doing some arm twisting with U.S. financial services companies to adjust their executive compensation plans - and those plans are in fact being modified to cap bonuses associated with achieving performance goals. (See our April 25, 2013 post, Inhibiting Excessive Risk Taking by Executives.)

Nick Taleb (of Black Swan fame) believes that bonuses provide an incentive to take risks.  He states, “The asymmetric nature of the bonus (an incentive for success without a corresponding disincentive for failure) causes hidden risks to accumulate in the financial system and become a catalyst for disaster.”  Now just substitute “nuclear operations” for “the financial system”.

Central to Taleb's thesis is his belief that management has a large informational advantage over outside regulators and will always know more about risks being taken within their operation. (See our Nov. 9, 2011 post, Ultimate Bonuses.)  Eliminating the force of incentives and providing greater transparency to safety management decisions could reduce risk and improve everybody's insight into those risks deemed acceptable.

Conclusion

In industries outside the commercial nuclear space, criminal charges have been brought for bad outcomes that resulted, at least in part, from decisions that did not appropriately consider overall system safety (or, in the worst cases, simply ignored it.)  Our suggestions are intended to reduce the probability of such events occurring in the nuclear industry.





[1] This raises the question of whether any time business priorities trump safety it is a case of deficient culture.  We have argued in other blog posts that sufficiently high business or political pressure can compromise even a very strong safety culture.  So reflexive resort to safety culture may be easy but not very helpful.
[2] Credit to Adam Steltzner, author of The Right Kind of Crazy, recounting his and other engineers’ roles in the design of the Mars rovers.  His reference is to the failure of O-ring seals on the space shuttle Challenger.
[3] We do recognize that there are regulatory criteria for general organizational matters such as for the training and qualification of personnel. 
[4] In essence this creates a “safe harbor” for most safety judgments and to which the NRC is effectively blind.
[5] In Davis Besse much of the “proof” that was relied on in the prosecutions of individuals was based on concurrence chains for key documents and NRC staff recollections of what was said in meetings.  There was no contemporaneous documentation of how First Energy made its threshold decision that postponing the outage was acceptable, who participated, and who made the ultimate decision.  Much was made of the fact that management was putting great pressure on maintaining schedule but there was no way to establish how that might have directly affected decision making.
[6] Kahneman believes there is “hindsight bias”.  Hindsight is 20/20 and it supposedly shows what decision makers could (and should) have known and done instead of their actual decisions that led to an unfavorable outcome, incident, accident or worse.  We now know that when the past was the present, things may not have been so clear-cut.  See our Dec. 18, 2013 post, Thinking, Fast and Slow by Daniel Kahneman.
[7] Transocean, owner of the Deepwater Horizon oil rig, awarded millions of dollars in bonuses to its executives after “the best year in safety performance in our company's history,” according to an annual report: “Notwithstanding the tragic loss of life in the Gulf of Mexico, we achieved an exemplary statistical safety record as measured by our total recordable incident rate and total potential severity rate.”  See our April 7, 2011 post for the original citation in Transocean's annual report and further discussion.
[8] “The reward and recognition system is perceived to be heavily weighted toward production over safety.”  The reward system was revised “to ensure consistent health of NSC.”  See our July 29, 2010 post, NRC Decision on FPL (Part 2).

Monday, October 14, 2013

High Reliability Management by Roe and Schulman

This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market.  CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world.  In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.

The High Reliability Management (HRM) Model

The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3)  System Volatility refers to the magnitude and rate of change of  CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall).  Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs. 

System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each.  All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid.  Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes.  Operating conditions are challenging: supply and demand are inherently unstable (p. 34), inadequate supply means some load cannot be served and too much generation can damage the grid. (pp. 27, 142)

High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance.  Some strategies can be substituted for others.  It is a dynamic but manageable environment.

High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance.  They run from pillar to post; it is highly stressful.  Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trial and error.  Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills.  It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)

Low Volatility and Low Options mean generation and demand are not changing quickly.  The critical feature here is demand has been reduced by load shedding.  The operators have exhausted all other strategies for maintaining balance.  It is a command-and-control approach, effected by declaring a  Stage 3 grid situation and run using formal rules and procedures.  It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished. 

Low Volatility and High Options is an HRM's preferred mode.  Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and there is no congestion on transmission lines or backup routes are available.  Procedures based on analyzed conditions exist and are used.  There are few, if any, surprises.  Learning can occur but it is incremental, the result of new methods or analysis.  Performance is important and system behavior operates within a narrow bandwidth.  Loss of attention (complacency) is a risk.  Is this starting to sound familiar?  This is the domain of High Reliability Organization (HRO) theory and practice.  Nuclear power operations is an example of an HRO. (pp. 60-62)          
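
The four modes can be summarized in a small sketch; this is simply our own shorthand for the authors' 2x2 matrix, not code from the book.

```python
def hrm_mode(system_volatility: str, options_variety: str) -> str:
    """Map the two HRM axes (each 'high' or 'low') to the operating mode
    described above; a shorthand for the authors' 2x2 matrix."""
    modes = {
        ("high", "high"): "Dynamic but manageable: multiple substitutable strategies",
        ("high", "low"):  "Most unstable: ad hoc, trial-and-error, near the edge of control",
        ("low", "low"):   "Command-and-control: load shedding, one mission unmet",
        ("low", "high"):  "Preferred HRO-like mode: procedures, few surprises, complacency risk",
    }
    return modes[(system_volatility.lower(), options_variety.lower())]

# Nuclear operations aim to stay in the low-volatility / high-options cell.
print(hrm_mode("low", "high"))
```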

Lessons for Nuclear Operations 


Nuclear plants work hard to stay in the Low Volatility/High Options mode.  If they stray into the Low Options column, they run the risks of facing unanalyzed situations and regulatory non-compliance. (p. 62)  In their effort to optimize performance in the desired mode, plants examine their performance risks to ever finer granularity through new methods and analyses.  Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.**  Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must be mindful of performance drift and nascent problems.   

In an HRO, trial and error is not an acceptable method for trying out new options.  No one wants cowboy operators in the control room.  But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233)  In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)  

The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear.  The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system.  In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance.  In a similar vein, CAISO has to worry about complacency in just one mode, for nuclear it's effectively the only mode and complacency is an omnipresent threat. (p. 126) 

The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is that overload might occur suddenly and with little or no warning.***  Anticipation and resilience are more obvious needs at CAISO but are also necessary in nuclear operations. (pp. 5, 124)

Implications for Safety Culture

Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2)  Both organizational types need a focus on operations as the central activity.  Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.

The willingness to report errors should be encouraged but we have seen that is a thin spot in the SC at some plants.  Errors can be a gateway into learning how to create more reliable performance and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220) 

The simultaneous needs to operate within a prescribed envelope while considering how the envelope might be breached has implications for SC.  We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to group think and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?” 

Conclusion

This review gives short shrift to the authors' broad and deep description and analysis of CAISO.****  The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.

The book describes a rare breed, the socio-technical-political start-up, and has too much scope for the average nuclear practitioner to plow through searching for newfound nuggets that can be applied to nuclear management.  But it's a good read and full of insightful observations, e.g., the description of  CAISO's early days (ca. 2001-2004) when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve with operators reporting less confidence in their ability to manage the grid and accomplish the mission in 2004 vs. 2001. (Ch. 5)

(High Reliability Management was recommended by a Safetymatters reader.  If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)

*  E. Roe and P. Schulman, High Reliability Management (Stanford, CA: Stanford Univ. Press, 2008).  This book reports the authors' study of CAISO from 2001 through 2006. 

**  By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment.  Their political and social environment is similar: The NRC buffers them from direct interference by politicians although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.

The importance of considering low-probability, major consequence events is argued by Taleb (see here) and Dédale (see here).

***  Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits. 

The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone.  For more on Cynefin, see here and here.

****  For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO.  As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)

Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits.  This analysis illustrated how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as precursor to the need to go to Stage 3 command-and-control load shedding.

Tuesday, September 24, 2013

Safety Paradigm Shift

We came across a provocative and persuasive presentation by Jean Pariès Dédale, "Why a Paradigm Shift Is Needed," from the IAEA Experts Meeting in May of this year.*  Many of the points resonate with our views on nuclear safety management; in particular complexity, the fallacy of the "predetermination envelope" - making a system more reliable within its design envelope but more susceptible outside that envelope; deterministic and probabilistic rationalization that avoids dealing with the complexity of the system; and unknown-unknowns.  We also believe it will take a paradigm shift, however unlikely that may be, at least in the U.S. nuclear industry.  Interestingly, Dédale does not appear to have a nuclear power background and develops his paradigm argument across multiple events and industries.

Dédale poses a very fundamental question: since the current safety construct has shown vulnerabilities to actual off-normal events, should the response be to do more of the same but better and with more rigor? Or should the safety paradigm itself be challenged?  The key issue underlying the challenge to this construct is how to cope with complexity.  He means complexity in the same terms we have posted about numerous times.

Dédale notes “The uncertainty generated by the complexity of the system itself and by its environment is skirted through deterministic or probabilistic rationality.” (p. 8)  We agree.  Any review of condition reports and Tech Spec variances indicates a wholesale reliance on risk-based rationales for deviations from nominal requirements.  And the risk-based argument is almost always based on an estimated small probability of an event that would challenge safety, often enhanced by a relatively short exposure time frame.  As we highlighted in a prior post, Nick Taleb has long cautioned against making decisions based on assessments of probabilities, which he asserts we cannot know, versus consequences which are (sometimes uncomfortably) knowable.
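
To show the style of arithmetic such risk-based deviations typically rest on (numbers invented for illustration): a small annual probability, prorated over a short exposure window, yields a reassuringly tiny figure while saying nothing about the size of the consequence.

```python
def prorated_event_probability(p_per_year, exposure_days):
    """Naive prorating of an annual event probability over a short exposure
    window, the style of estimate often cited to justify a temporary deviation."""
    return p_per_year * (exposure_days / 365.0)

# Hypothetical: 1e-4 per year challenge frequency, 14-day deviation window.
print(f"{prorated_event_probability(1e-4, 14):.1e}")   # about 3.8e-06
```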

How does this relate to safety management issues including culture?

We see a parallel between the constructs for nuclear safety and safety culture.  The nuclear safety construct is constrained both in focus and evolution, heavily reliant on the design basis philosophy (what Dédale labels the “predetermination fallacy”) dating back to the 1960s.  Little has changed over the succeeding 50 years; even the advent of PRA has been limited to “informing” the implementation of this approach.  Safety culture has emerged over the last 10+ years as an added regulatory emphasis, though highly constrained in its manifestation as a policy statement.  (It is in fact still quite difficult to square the NRC’s characterization of safety culture as critical to safety** with its stopping well short of any regulation or requirements.)  The definitional scope of safety culture is expressed in a set of traits and related values and behaviors.  As with nuclear safety it has a limited scope and relies on abstractions emphasizing, in essence, individual morality.  It does not look beyond people to the larger environment and “system” within which people function.  This environment can bring to bear significant influences that can challenge the desired traits and values of safety culture policy and muddle their application to decisions and actions.  The limitations can be seen in the assessments of safety culture (surveys and similar) as well as the investigation of specific events, violations or non-conformances by licensees and the NRC.  We’ve read many of these and rarely have we encountered any probing of the “why” associated with perceived breakdowns in safety culture.

One exception and a very powerful case in point is contained in our post dated July 29, 2010.***  The cited reference is an internal root cause analysis performed by FPL to address employee concerns and identified weaknesses in their corrective action program.  They cite production pressures as negatively impacting employee trust and recognition, and perceptions of management and operational decisions.  FPL took steps to change the origin and impact of production pressures, relieving some of the burden on the organization to contain those influences within the boundaries of safe operation.

Perhaps the NRC believes that it does not have the jurisdiction to probe these types of issues or even require licensees to assess their influence.  Yet the NRC routinely refers to “licensee burden” - cost, schedule, production impacts - in accepting deviations from nominal safety standards.****  We wonder if a broader view of safety culture in the context of the socio-technical system might better “inform” both regulatory policy and decisions and enhance safety management.


*  J.P. Dédale, "Why a Paradigm Shift Is Needed," IAEA International Experts’ Meeting on Human and Organizational Factors in Nuclear Safety in the Light of the Accident at the Fukushima Daiichi Nuclear Power Plant, Vienna May 21-24, 2013.


**  The NRC’s Information Notice 2013-15 states that safety culture is “essential to nuclear safety in all phases…”
 

***  "NRC Decision on FPL (Part 2)," Safetymatters (July 29, 2010).  See slide 18, Root Cause 2 and Contributing Causes 2.2 and 2.4. 

****  10 CFR 50.55a(g)(6)(i) states that the Commission may grant such relief and may impose such alternative requirements as it determines is authorized by law and will not endanger life or property or the common defense and security and is otherwise in the public interest, given the consideration of the burden upon the licensee (emphasis added).

Tuesday, June 18, 2013

The Incredible Shrinking Nuclear Industry

News came last week that the San Onofre units would permanently shut down - joining Crystal River 3 (CR3) and Kewaunee as the latest early retirees and filling in the last leg of a nuclear bad news trifecta.  This is distressing on many fronts, not the least of which is the loss of jobs for thousands of highly qualified nuclear personnel, and perhaps the suggestion of a larger trend.  Almost as distressing is the characterization by NEI that San Onofre is a unique situation - as were CR3 and Kewaunee, by the way - and the placing of primary blame on the NRC.*  Really?  The more useful question to ponder is what decisions led up to the need for plant closures and whether there is a common denominator. 

We can think of one: decisions that failed to adequately account for the “tail” of the risk distribution where outcomes, albeit of low probability, carry high consequences.  On this score checking in with Nick Taleb is always instructive.  He observes “This idea that in order to make a decision you need to focus on the consequences (which you can know) rather than the probability (which you can’t know) is the central idea of uncertainty.”**
  • For Kewaunee, the decision to purchase the plant with a power purchase agreement (PPA) that extended only for eight years;
  • For CR3, the decision to undertake cutting the containment with in-house expertise;
  • For SONGS, the decision to purchase and install new design steam generators from a vendor working beyond its historical experience envelope.
Whether the decision makers understood this, or even imagined that their decisions included the potential to lose the plants, the results speak for themselves.  These people were in Black Swan and fat tail territory and didn’t realize it.  Let’s look at a few details.

Kewaunee

Many commentators at this point are writing off the Kewaunee retirement based on the miracle of low gas prices.  Dominion cites gas prices and the inability to acquire additional nuclear units in the upper Midwest to achieve economies of scale.  But there is a far greater misstep in the story.  When Dominion purchased Kewaunee from Wisconsin Public Service in 2005, a PPA was included as part of the transaction.  This is an expected and necessary part of the transaction as it established set prices for the sale of the plant’s output for a period of time.  A key consideration in structuring deals such as this is not only the specific pricing terms for the asset and the PPA, but the duration of the PPA.  In the case of Kewaunee the PPA ran for only 8 years, through December 2013.  After 8 years Dominion would have to negotiate another PPA with the local utilities or others or sell into the market.  The question is - when buying an asset with a useful life of 28 years (with grant of the 20 year license extension), why would Dominion be OK with just an 8 year PPA?  Perhaps Dominion assumed that market prices would be higher in 8 years and wanted to capitalize on those higher prices.  Opponents to the transaction believed this to be the case.***  The prevailing expectation at the time was that demand would continue along with appropriate pricing necessary to accommodate current and planned generating units.  But the economic downturn capped demand and left a surplus of baseload.  Local utilities faced with the option of negotiating a PPA for Kewaunee - or thinning the field and protecting their own assets - did what was in their interest. 

The reality is that Dominion rolled the dice on future power prices.  Interestingly, in the same time frame, 2007, the Point Beach units were purchased by NextEra Energy Resources (formerly FPL Energy).  In this transaction PPAs were negotiated through the end of the extended license terms of the units, 2030 and 2033, providing the basis for a continuing and productive future.

Crystal River 3

In 2009 Progress Energy undertook a project to replace the steam generators in CR3.  As with some other nuclear plants this necessitated cutting into the containment to allow removal of the old generators and placement of the new. 

Apparently just two companies, Bechtel and SGT, had managed all the previous 34 steam generator replacement projects at U.S. nuclear power plants. Of those, at least 13 had involved cutting into the containment building. All 34 projects were successful.

For the management portion of the job, Progress got bids from both Bechtel and SGT. The lowest was from SGT but Progress opted to self-manage the project to save an estimated $15 million.  During the containment cutting process, delamination of concrete occurred in several places.  Subsequently an outside engineering firm hired to do the failure analysis stated that cutting the steel tensioning bands in the sequence done by Progress Energy, along with removal of the concrete, had caused the containment building to crack.  Progress Energy disagreed, stating the cracks “could not have been predicted”.  (See Taleb’s view on uncertainty above.)

“Last year, the PSC endorsed a settlement agreement that let Progress Energy refund $288 million to customers in exchange for ending a public investigation of how the utility broke the nuclear plant.”****

When it came time to assess how to fix the damage, Progress Energy took a far more conservative and comprehensive approach.  They engaged multiple outside consultants and evaluated numerous possible repair options.  After Duke Energy acquired Progress, Duke engaged an independent, third-party review of the engineering and construction plan developed by Progress.  The independent review suggested that the cost was likely to be almost $1.5 billion. However, in the worst-case scenario, it could cost almost $3.5 billion and take eight years to complete.   “...the [independent consultant] report concluded that the current repair plan ‘appears to be technically feasible, but significant risks and technical issues still need to be resolved, including the ultimate scope of any repair work.’"*****  Ultimately consideration of the potentially huge cost and schedule consequences caused Duke to pull the plug.  Taleb would approve.

San Onofre

Southern California Edison undertook a project to replace its steam generators almost 10 years ago.  It decided to contract with Mitsubishi Heavy Industries (MHI) to design and construct the generators.  This would be new territory for Mitsubishi in terms of the size of the generators and design complexity.  Following installation and operation for a period of time, tube leakage occurred due to excessive vibrations.  The NRC determined that the problems in the steam generators were associated with errors in MHI's computer modeling, which led to underestimation of thermal hydraulic conditions in the generators.

“Success in developing a new and larger steam generator design requires a full understanding of the risks inherent in this process and putting in place measures to manage these risks….Based upon these observations, I am concerned that there is the potential that design flaws could be inadvertently introduced into the steam generator design that will lead to unacceptable consequences (e.g., tube wear and eventually tube plugging). This would be a disastrous outcome for both of us and a result each of our companies desire to avoid. In evaluating this concern, it would appear that one way to avoid this outcome is to ensure that relevant experience in designing larger sized steam generators be utilized. It is my understanding the Mitsubishi Heavy Industries is considering the use of Westinghouse in several areas related to scaling up of your current steam generator design (as noted above). I applaud your effort in this regard and endorse your attempt to draw upon the expertise of other individuals and company's to improve the likelihood of a successful outcome for this project.”#

Unfortunately these concerns raised by SCE came after letting the contract to Mitsubishi.  SCE placed (all of) its hopes on improving the likelihood of a successful outcome while at the same time stating that a design flaw would be “disastrous”.  They were right about the disaster part.

Take Away

These are cautionary tales on a significant scale.  Delving into how such high risk (technical and financial) decisions were made and turned out so badly could provide useful lessons learned.  That doesn’t appear likely given the interests of the parties and the inconsistency with the industry predicate of operational excellence.

With regard to our subject of interest, safety culture, the dynamics of safety decisions are subject to similar issues and bear directly on safety outcomes.  Recall that in our recent posts on implementing safety culture policy, we proposed a scoring system for decisions that includes the safety significance and uncertainty associated with the issue under consideration.  The analog to Taleb’s “central idea of uncertainty” is intentional and necessary.  Taleb argues you can’t know the probability of consequences.  We don’t disagree but as a “known unknown” we think it is useful for decision makers to recognize how uncertain the significance (consequences) may be and calibrate their decision accordingly.
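
As a purely illustrative sketch of how significance and uncertainty might be combined into a single decision score (the weights and scales below are invented here, not the scoring system from our earlier posts):

```python
def decision_risk_score(safety_significance: int, uncertainty: int) -> int:
    """Combine a 1-5 rating of safety significance (potential consequences) with
    a 1-5 rating of uncertainty about those consequences; higher scores call for
    more conservative choices and more senior review."""
    assert 1 <= safety_significance <= 5 and 1 <= uncertainty <= 5
    return safety_significance * uncertainty

# Example: a moderately significant issue (3) with high uncertainty (5)
# scores 15 of a possible 25 and would warrant extra scrutiny.
print(decision_risk_score(3, 5))
```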


*  “Of course, it’s regrettable...Crystal River is closing, the reasons are easy to grasp, and they are unique to the plant. Even San Onofre, which has also been closed for technical reasons (steam generator problems there), is quite different in specifics and probable outcome. So – unfortunate, yes; a dire pox upon the industry, not so much.”  NEI Nuclear Notes (Feb. 7, 2013).  Retrieved June 17, 2013.  For the NEI/SCE perspective on regulatory foot-dragging and uncertainty, see W. Freebairn et al, "SoCal Ed to retire San Onofre nuclear units, blames NRC delays," Platts (June 7, 2013).  Retrieved June 17, 2013.  And "NEI's Peterson discusses politics surrounding NRC confirmation, San Onofre closure," Environment & Energy Publishing OnPoint (June 17, 2013).  Retrieved June 17, 2013.

**  N. Taleb, The Black Swan (New York: Random House, 2007), p. 211.  See also our post on Taleb dated Nov. 9, 2011.

***  The Customers First coalition that opposed the sale of the plant in 2004 argued: “Until 2013, a complex purchased-power agreement subject to federal jurisdiction will replace PSCW review. After 2013, the plant’s output will be sold at prices that are likely to substantially exceed cost.”  Customers First!, "Statement of Position: Proposed Sale of the Kewaunee Nuclear Power Plant April 2004" (April, 2004).  Retrieved June 17, 2013.

****  R. Trigaux, "Who's to blame for the early demise of Crystal River nuclear power plant?" Tampa Bay Times (Feb. 5, 2013).  Retrieved Jun 17, 2013.  We posted on CR3's blunder and unfolding financial mess on Nov. 11, 2011.

*****  "Costly estimates for Crystal River repairs," World Nuclear News (Oct. 2, 2012).  Retrieved June 17, 2013.

#  D.E. Nunn (SCE) to A. Sawa (Mitsubishi), "Replacement Steam Generators San Onofre Nuclear Generating Station, Units 2 & 3" (Nov. 30, 2004).  Copy retrieved June 17, 2013 from U.S. Senate Committee on Environment & Public Works, attachment to Sen. Boxer's May 28, 2013 press release.