Monday, October 14, 2013

High Reliability Management by Roe and Schulman

This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market.  CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world.  In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.

The High Reliability Management (HRM) Model

The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3)  System Volatility refers to the magnitude and rate of change of  CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall).  Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs. 

System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each.  All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid.  Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes.  Operating conditions are challenging: supply and demand are inherently unstable (p. 34), inadequate supply means some load cannot be served and too much generation can damage the grid. (pp. 27, 142)

High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance.  Some strategies can be substituted for others.  It is a dynamic but manageable environment.

High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance.  They run from pillar to post; it is highly stressful.  Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trail and error.  Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills.  It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)

Low Volatility and Low Options mean generation and demand are not changing quickly.  The critical feature here is demand has been reduced by load shedding.  The operators have exhausted all other strategies for maintaining balance.  It is a command-and-control approach, effected by declaring a  Stage 3 grid situation and run using formal rules and procedures.  It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished. 

Low Volatility and High Options is an HRM's preferred mode.  Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and there is no congestion on transmission lines or backup routes are available.  Procedures based on analyzed conditions exist and are used.  There are few, if any, surprises.  Learning can occur but it is incremental, the result of new methods or analysis.  Performance is important and system behavior operates within a narrow bandwidth.  Loss of attention (complacency) is a risk.  Is this starting to sound familiar?  This is the domain of High Reliability Organization (HRO) theory and practice.  Nuclear power operations is an example of an HRO. (pp. 60-62)          

Lessons for Nuclear Operations 


Nuclear plants work hard to stay in the Low Volatility/High Options mode.  If they stray into the Low Options column, they run the risks of facing unanalyzed situations and regulatory non-compliance. (p. 62)  In their effort to optimize performance in the desired mode, plants examine their performance risks to ever finer granularity through new methods and analyses.  Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.**  Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must be mindful to performance drift and nascent problems.   

In an HRO, trial and error is not an acceptable method for trying out new options.  No one wants cowboy operators in the control room.  But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233)  In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)  

The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear.  The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system.  In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance.  In a similar vein, CAISO has to worry about complacency in just one mode, for nuclear it's effectively the only mode and complacency is an omnipresent threat. (p. 126) 

The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is overload might occur suddenly and with little or no warning.*** Anticipation and resilience are more obvious needs at CAISO but also necessary in nuclear operations. (pp. 5, 124)

Implications for Safety Culture

Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2)  Both organizational types need a focus on operations as the central activity.  Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.

The willingness to report errors should be encouraged but we have seen that is a thin spot in the SC at some plants.  Errors can be a gateway into learning how to create more reliable performance and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220) 

The simultaneous needs to operate within a prescribed envelope while considering how the envelope might be breached has implications for SC.  We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to group think and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?” 

Conclusion

This review gives short shrift to the authors' broad and deep description and analysis of CAISO.****  The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.

The book describes a rare breed, the socio-technical-political start-up, and has too much scope for the average nuclear practitioner to plow through searching for newfound nuggets that can be applied to nuclear management.  But it's a good read and full of insightful observations, e.g., the description of  CAISO's early days (ca. 2001-2004) when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve with operators reporting less confidence in their ability to manage the grid and accomplish the mission in 2004 vs. 2001. (Ch. 5)

(High Reliability Management was recommended by a Safetymatters reader.  If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)

*  E. Roe and P. Schulman, High Reliability Management (Stanford Univ. Press, Stanford, CA: 2008)  This book reports the authors' study of CAISO from 2001 through 2006. 

**  By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment.  Their political and social environment is similar: The NRC buffers them from direct interference by politicians although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.

The importance of considering low-probability, major consequence events is argued by Taleb (see here) and D├ędale (see here).

***  Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits. 

The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone.  For more on Cynefin, see here and here.

****  For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO.  As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)

Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits.  This analysis illustrated how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as precursor to the need to go to Stage 3 command-and-control load shedding.

1 comment:

  1. Great writeup Lew and I'm eagerly awaiting my copy so that I can delve into the details.

    So these comments are preliminary, more lines of inquiry I'll have than anything firm at the moment.

    Since my experience in the Navy, mode shifts have been central to my conception of how uncertainty and complexity compound during employment of the highly engineered artifacts. Over the course of a 12-15 month deployment cycle, the crew of a submarine will likely encounter upwards of 20 distinct modes. A few of them relate to "all hands on deck" response to high uncertainty or significantly off-normal circumstances - most represent a pre-tailored response pattern that brings extra functional specialists and generalist oversight to bear such as during the Maneuvering Watch when coming in and out of port.

    Over the operations cycle, there is a recognizable meta-pattern of stepwise approach to peak fitness, as a crew moves its proficiency from operational effectiveness in a repair and refit situation each stage includes drills, classroom study, rehearsals and walk-throughs an typically culminates in a verification exercise.

    This is one pattern of mode-recognition and response - I anticipate that each recognizable deployment cycle mode would fit clearly in one of the four corners of the author's 2X2 - Variability-Diversity matrix.

    When we look at the CAISO or the National Air Traffic Control System we can see the same kinds of anticipated variability but these are open systems in continual operation - they generally do not follow the "deployment cycle" pattern of military units. Also the tend to have some goodly degree of hub and spoke structure as opposed to the self-healing, auto-catalytic one of the immune system.

    It seems to me, from a systems perspective, there is a pattern matching challenge relating the CAISO experience to the domain of mode recognition and management in commercial nuclear power operations. The challenge is to detect which features of the typical NP enterprise appear similar to each different mode category in terms of the range of potential volatility and the provision for sustainment of diverse defense in depth against both anticipated and some more unlikely patterns of events.

    In the case of a NPP fleet operator like Excelon, Entergy or Duke, a similar identification of similarity exercise would appear to be warranted to the Navy's various nuclear fleet operations experience.

    On first look, I'm inclined to set aside operational tempo as a primary source of difference. The CAISO reviewers point out that in at least one mode complaisance and inattention to details with know legacy importance is a key analytical feature. This sounds a lot like NASA's loss of corporate memory between Challenger and Columbia.

    My sense is that while the character of the equipment generally determines the "pace of variation" during "normal operations" it is our brain's tendency to "regularize" every aspect of anticipation to the equipment's master clock (whatever speed that runs at). This seems to make complacency a comparable phenomenon anywhere significant circumstance changes can be anticipated.

    While I've more thinking to do, I'm leaning toward the notion that it is the critical maintenance and modifications aspects of NPP operations that are the place where considerations of volatility and preservation of diversity will provide highly useful insights from the CAISO to NPP situation.

    As I was first reading this post, the recent Palisades trip during electrical maintenance and the whole subsequent NSC upgrade saga came to mind - the tell in that one for me has always been the on call Senior Manager's over-engagement in the maintenance planning to the point where he violated the overtime limits - that combined with no one from the Control Room at the work site during the actual repair. Inadequate gear-shifting (vigilance mode upgrading) was never described in the RCA or the NRC's SIT report!

    ReplyDelete

Thanks for your comment. We read them all. We'd like to display them under their respective posts on our main page but that's not how Blogger works.