Showing posts with label Cynefin. Show all posts

Monday, October 14, 2013

High Reliability Management by Roe and Schulman

This book* presents a multi-year case study of the California Independent System Operator (CAISO), the government entity created to operate California's electricity grid when the state deregulated its electricity market.  CAISO's travails read like The Perils of Pauline but our primary interest lies in the authors' observations of the different grid management strategies CAISO used under various operating conditions; it is a comprehensive description of contingency management in the real world.  In this post we summarize the authors' management model, discuss the application to nuclear management and opine on the implications for nuclear safety culture.

The High Reliability Management (HRM) Model

The authors call the model they developed High Reliability Management and present it in a 2x2 matrix where the axes are System Volatility and Network Options Variety. (Ch. 3)  System Volatility refers to the magnitude and rate of change of  CAISO's environmental variables including generator and transmission availability, reserves, electricity prices, contracts, the extent to which providers are playing fair or gaming the system, weather, temperature and electricity demand (regional and overall).  Network Options Variety refers to the range of resources and strategies available for meeting demand (basically in real time) given the current inputs. 

System Volatility and Network Options Variety can each be High or Low so there are four possible modes and a distinctive operating management approach for each.  All modes must address CAISO's two missions of matching electricity supply and demand, and protecting the grid.  Operators must manage the system inside an acceptable or tolerable performance bandwidth (invariant output performance is a practical impossibility) in all modes.  Operating conditions are challenging: supply and demand are inherently unstable (p. 34), inadequate supply means some load cannot be served and too much generation can damage the grid. (pp. 27, 142)

High Volatility and High Options mean both generation (supply) and demand are changing quickly and the operators have multiple strategies available for maintaining balance.  Some strategies can be substituted for others.  It is a dynamic but manageable environment.

High Volatility and Low Options mean both generation and demand are changing quickly but the operators have few strategies available for maintaining balance.  They run from pillar to post; it is highly stressful.  Sometimes they have to create ad hoc (undocumented and perhaps untried) approaches using trial and error.  Demand can be satisfied but regulatory limits may be exceeded and the system is running closer to the edge of technical capabilities and operator skills.  It is the most unstable performance mode and untenable because the operators are losing control and one perturbation can amplify into another. (p. 37)

Low Volatility and Low Options mean generation and demand are not changing quickly.  The critical feature here is demand has been reduced by load shedding.  The operators have exhausted all other strategies for maintaining balance.  It is a command-and-control approach, effected by declaring a  Stage 3 grid situation and run using formal rules and procedures.  It is the least desirable domain because one primary mission, to meet all demand, is not being accomplished. 

Low Volatility and High Options is an HRM's preferred mode.  Actual demand follows the forecast, generators are producing as expected, reserves are on hand, and transmission lines are uncongested (or backup routes are available).  Procedures based on analyzed conditions exist and are used.  There are few, if any, surprises.  Learning can occur but it is incremental, the result of new methods or analysis.  Performance is important and system behavior operates within a narrow bandwidth.  Loss of attention (complacency) is a risk.  Is this starting to sound familiar?  This is the domain of High Reliability Organization (HRO) theory and practice.  A nuclear power plant is an example of an HRO. (pp. 60-62)
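The four modes amount to a small decision table on the two axes.  The sketch below (Python) is our own illustration, not anything from the book; the descriptions merely paraphrase the four paragraphs above.

```python
# Our own illustrative rendering of the HRM 2x2 matrix described above:
# System Volatility x Network Options Variety -> operating mode.

HRM_MODES = {
    # (system_volatility, network_options_variety): description
    ("high", "high"): "dynamic but manageable; strategies can substitute for one another",
    ("high", "low"):  "most unstable; ad hoc, trial-and-error operation near the edge",
    ("low",  "low"):  "command and control (Stage 3 load shedding); some demand goes unmet",
    ("low",  "high"): "preferred HRO-like mode; procedures apply, complacency is the risk",
}

def describe_mode(volatility: str, options_variety: str) -> str:
    """Return the operating-mode description for a volatility/options pair."""
    return HRM_MODES[(volatility.lower(), options_variety.lower())]

print(describe_mode("Low", "High"))
```

The table makes one of the authors' points visible at a glance: only one of the four cells is a place an organization would choose to be.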

Lessons for Nuclear Operations 


Nuclear plants work hard to stay in the Low Volatility/High Options mode.  If they stray into the Low Options column, they run the risk of facing unanalyzed situations and regulatory non-compliance. (p. 62)  In their effort to optimize performance in the desired mode, plants examine their performance risks at ever finer granularity through new methods and analyses.  Because of the organizations' narrow focus, few resources are directed at identifying, contemplating and planning for very low probability events (the tails of distributions) that might force a plant into a different mode or have enormous potential negative consequences.**  Design changes (especially new technologies) that increase output or efficiency may mask subtle warning signs of problems; organizations must be mindful of performance drift and nascent problems.

In an HRO, trial and error is not an acceptable method for trying out new options.  No one wants cowboy operators in the control room.  But examining new options using off-line methods, in particular simulation, is highly desirable. (pp. 111, 233)  In addition, building reactive capacity in the organization can be a substitute for foresight to accommodate the unexpected and unanalyzed. (pp. 116-17)  

The focus on the external changes that buffeted CAISO leads to a shortcoming when looking for lessons for nuclear.  The book emphasizes CAISO's adaptability to new environmental demands, requirements and constraints but does not adequately recognize the natural evolution of the system.  In nuclear, it's natural evolution that may quietly lead to performance drift and normalization of deviance.  In a similar vein, CAISO has to worry about complacency in just one mode; for nuclear it's effectively the only mode, and complacency is an omnipresent threat. (p. 126)

The risk of cognitive overload occurs more often for CAISO operators but it has visible precursors; for nuclear operators the risk is that overload might occur suddenly and with little or no warning.*** Anticipation and resilience are more obvious needs at CAISO but are also necessary in nuclear operations. (pp. 5, 124)

Implications for Safety Culture

Both HRMs and HROs need cultures that value continuous training, open communications, team players able to adjust authority relationships when facing emergent issues, personal responsibility for safety (i.e., safety does not inhere in technology), ongoing learning to do things better and reduce inherent hazards, rewards for achieving safety and penalties for compromising it, and an overall discipline dedicated to failure-free performance. (pp. 198, App. 2)  Both organizational types need a focus on operations as the central activity.  Nuclear is good at this, certainly better than CAISO where entities outside of operations promulgated system changes and the operators were stuck with making them work.

The willingness to report errors should be encouraged but we have seen that is a thin spot in the SC at some plants.  Errors can be a gateway into learning how to create more reliable performance and error tolerance vs. intolerance is a critical cultural issue. (pp. 111-12, 220) 

The simultaneous need to operate within a prescribed envelope while considering how that envelope might be breached has implications for SC.  We have argued before that a nuclear organization is well-served by having a diversity of opinions and some people who don't subscribe to groupthink and instead keep asking “What's the worst case scenario and how would we manage it to an acceptable conclusion?”

Conclusion

This review gives short shrift to the authors' broad and deep description and analysis of CAISO.****  The reason is that the major takeaway for CAISO, viz., the need to recognize mode shifts and switch management strategies accordingly as the manifestation of “normal” operations, is not really applicable to day-to-day nuclear operations.

The book describes a rare breed, the socio-technical-political start-up, and has too much scope for the average nuclear practitioner to plow through searching for newfound nuggets that can be applied to nuclear management.  But it's a good read and full of insightful observations, e.g., the description of  CAISO's early days (ca. 2001-2004) when system changes driven by engineers, politicians and regulators, coupled with changing challenges from market participants, prevented the organization from settling in and effectively created a negative learning curve with operators reporting less confidence in their ability to manage the grid and accomplish the mission in 2004 vs. 2001. (Ch. 5)

(High Reliability Management was recommended by a Safetymatters reader.  If you have a suggestion for material you would like to see promoted and reviewed, please contact us.)

*  E. Roe and P. Schulman, High Reliability Management (Stanford Univ. Press, Stanford, CA: 2008)  This book reports the authors' study of CAISO from 2001 through 2006. 

**  By their nature as baseload generating units, usually with long-term sales contracts, nuclear plants are unlikely to face a highly volatile business environment.  Their political and social environment is similarly stable: the NRC buffers them from direct interference by politicians, although activists prodding state and regional authorities, e.g., water quality boards, can cause distractions and disruptions.

The importance of considering low-probability, major consequence events is argued by Taleb (see here) and Dédale (see here).

***  Over the course of the authors' investigation, technical and management changes at CAISO intended to make operations more reliable often had the unintended effect of moving the edge of the prescribed performance envelope closer to the operators' cognitive and skill capacity limits. 

The Cynefin model describes how organizational decision making can suddenly slip from the Simple domain to the Chaotic domain via the Complacent zone.  For more on Cynefin, see here and here.

****  For instance, ch. 4 presents a good discussion of the inadequate or incomplete applicability of Normal Accident Theory (Perrow, see here) or High Reliability Organization theory (Weick, see here) to the behavior the authors observed at CAISO.  As an example, tight coupling (a threat according to NAT) can be used as a strength when operators need to stitch together an ad hoc solution to meet demand. (p. 135)

Ch. 11 presents a detailed regression analysis linking volatility in selected inputs to volatility in output, measured by the periods when electricity made available (compared to demand) fell outside regulatory limits.  This analysis illustrated how well CAISO's operators were able to manage in different modes and how close they were coming to the edge of their ability to control the system, in other words, performance as precursor to the need to go to Stage 3 command-and-control load shedding.

Wednesday, December 12, 2012

“Overpursuit” of Goals

We return to a favorite subject, the impact of goals and incentives on safety culture and performance. Interestingly, this subject comes up in an essay by Oliver Burkeman, “The Power of Negative Thinking,”* which may seem unusual as most people think of goals and their achievement as the product of a positive approach. Traditional business thinking is to set hard, quantitative goals, the bigger the better. But futures are inherently uncertain while goals generally are not. The counterintuitive argument is that the most effective way to address future performance is to focus on worst case outcomes. Burkeman observes that “...rigid goals may encourage employees to cut ethical corners” and “Focusing on one goal at the expense of all other factors also can distort a corporate mission or an individual life…” and result in “...the ‘overpursuit’ of goals…” Case in point, yellow jerseys.

This raises some interesting points for nuclear safety. First we would remind our readers of Snowden’s Cynefin decision context framework, specifically his “complex” space which is indicative of where nuclear safety decisions reside. In this environment there are many interacting causes and effects, making it difficult or impossible to pursue specific goals along defined paths. Clearly an uncertain landscape. As Simon French argues: “Decision support will be more focused on exploring judgement and issues, and on developing broad strategies that are flexible enough to accommodate changes as the situation evolves.”** This would suggest the pursuit of specific, aspirational goals may be misguided or counterproductive.

Second, safety performance goals are hard to identify anyway. Is it the absence of bad outcomes? Or the maintenance of, say, a “strong” safety culture - whatever that is. One indication of the elusiveness of safety goals is their absence as targets in incentive programs. So there is probably little likelihood of overemphasizing safety performance as a goal. But is the same true for operational type goals such as capacity factor, refuel outage durations, and production costs? Can an overly strong focus on such short term goals, often associated with stretching performance, lead to overpursuit? What if large financial incentives are attached to the achievement of the goals?

The answer is not: “Safety is our highest priority”. More likely it is an approach that considers the complexity and uncertainty of nuclear operating space and the potential for hard goals to cut both ways. It might value how a management team prosecutes its responsibilities more than the outcome itself.


* O. Burkeman, “The Power of Negative Thinking,” Wall Street Journal online (Dec. 7, 2012).

** S. French, “Cynefin: repeatability, science and values,” Newsletter of the European Working Group “Multiple Criteria Decision Aiding,” series 3, no. 17 (Spring 2008) p. 2. We posted on Cynefin and French's paper here.

Wednesday, December 5, 2012

Drift Into Failure by Sidney Dekker

Sidney Dekker's Drift Into Failure* is a noteworthy effort to provide new insights into how accidents and other bad outcomes occur in large organizations. He begins by describing two competing world views, the essentially mechanical view of the world spawned by Newton and Descartes (among others), and a view based on complexity in socio-technical organizations and a systems approach. He shows how each world view biases the search for the “truth” behind how accidents and incidents occur.

Newtonian-Cartesian (N-C) Vision

Isaac Newton and René Descartes were leading thinkers during the dawn of the Age of Reason. Newton used the language of mathematics to describe the world while Descartes relied on the inner process of reason. Both believed there was a single reality that could be investigated, understood and explained through careful analysis and thought—complete knowledge was possible if investigators looked long and hard enough. The assumptions and rules that started with them, and were extended by others over time, have been passed on and most of us accept them, uncritically, as common sense, the most effective way to look at the world.

The N-C world is ruled by invariant cause-and-effect; it is, in fact, a machine. If something bad happens, then there was a unique cause or set of causes. Investigators search for these broken components, which could be physical or human. It is assumed that a clear line exists between the broken part(s) and the overall behavior of the system. The explicit assumption of determinism leads to an implicit assumption of time reversibility—because system performance can be predicted from time A if we know the starting conditions and the functional relationships of all components, then we can start from a later time B (the bad outcome) and work back to the true causes. (p. 84) Root cause analysis and criminal investigations are steeped in this world view.

In this view, decision makers are expected to be rational people who “make decisions by systematically and consciously weighing all possible outcomes along all relevant criteria.” (p. 3) Bad outcomes are caused by incompetent or worse, corrupt decision makers. Fixes include more communications, training, procedures, supervision, exhortations to try harder and criminal charges.

Dekker credits Newton et al. for giving man the wherewithal to probe Nature's secrets and build amazing machines. However, the Newtonian-Cartesian vision is not the only way to view the world, especially the world of complex, socio-technical systems. For that a new model, with different concepts and operating principles, is required.

The Complex System

Characteristics

The sheer number of parts does not make a system complex, only complicated. A truly complex system is open (it interacts with its environment), has components that act locally and don't know the full effects of their actions, is constantly making decisions to maintain performance and adapt to changing circumstances, and has non-linear interactions (small events can cause large results) because of multipliers and feedback loops. Complexity is a result of the ever-changing relationships between components. (pp.138-144)

Adding to the myriad information confronting a manager or observer, system performance is often optimized at the edge of chaos, where competitors are perpetually vying for relative advantage at an affordable cost.** The system is constantly balancing its efforts between exploration (which will definitely incur costs but may lead to new advantages) and exploitation (which reaps benefits of current advantages but will likely dissipate over time). (pp. 164-165)
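The exploration/exploitation trade-off described above is the same one studied formally in decision theory. A minimal epsilon-greedy sketch (our own illustration, not Dekker's; the strategy names are hypothetical) makes the tension concrete:

```python
import random

def choose_strategy(known_payoffs, epsilon=0.2):
    """Epsilon-greedy choice: mostly exploit the best-known strategy,
    occasionally explore an alternative at a cost, mirroring the balance
    a complex system strikes between exploration and exploitation."""
    if random.random() < epsilon:
        return random.choice(list(known_payoffs))      # explore: pay now, maybe learn
    return max(known_payoffs, key=known_payoffs.get)   # exploit: reap current advantage

random.seed(0)  # reproducible illustration
payoffs = {"current practice": 1.0, "new method A": 0.0, "new method B": 0.0}
picks = [choose_strategy(payoffs) for _ in range(1000)]
print(picks.count("current practice"))  # mostly exploitation, some exploration
```

The point is not that operators run such an algorithm, but that any adapting system is implicitly making this wager every time it chooses between a proven routine and an untried alternative.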

The most important feature of a complex system is that it adapts to its environment over time in order to survive. And its environment is characterized by resource scarcity and competition. There is continuous pressure to maintain production and increase efficiency (and their visible artifacts: output, costs, profits, market share, etc) and less visible outputs, e.g., safety, will receive less attention. After all, “Though safety is a (stated) priority, operational systems do not exist to be safe. They exist to provide a service or product . . . .” (p. 99) And the cumulative effect of multiple adaptive decisions can be an erosion of safety margins and a changed response of the entire system. Such responses may be beneficial or harmful—a drift into failure.

Drift by a complex system exhibits several characteristics. First, as mentioned above, it is driven by environmental factors. Second, drift occurs in small steps so changes can be hardly noticed, and even applauded if they result in local performance improvement; “. . . successful outcomes keep giving the impression that risk is under control” (p. 106) as a series of small decisions whittle away at safety margins. Third, these complex systems contain unruly technology (think deepwater drilling) where uncertainties exist about how the technology may be ultimately deployed and how it may fail. Fourth, there is significant interaction with a key environmental player, the regulator, and regulatory capture can occur, resulting in toothless oversight.

“Drifting into failure is not so much about breakdowns or malfunctioning of components, as it is about an organization not adapting effectively to cope with the complexity of its own structure and environment.” (p. 121) Drift and occasionally accidents occur because of ordinary system functioning, normal people going about their regular activities making ordinary decisions “against a background of uncertain technology and imperfect information.” Accidents, like safety, can be viewed as an emergent system property, i.e., they are the result of system relationships but cannot be predicted by examining any particular system component.

Managers' roles

Managers should not try to transform complex organizations into merely complicated ones, even if it's possible. Complexity is necessary for long-term survival as it maximizes organizational adaptability. The question is how to manage in a complex system. One key is increasing the diversity of personnel in the organization. More diversity means less group think and more creativity and greater capacity for adaptation. In practice, this means validation of minority opinions and encouragement of dissent, reflecting on the small decisions as they are made, stopping to ponder why some technical feature or process is not working exactly as expected and creating slack to reduce the chances of small events snowballing into large failures. With proper guidance, organizations can drift their way to success.

Accountability

Amoral and criminal behavior certainly exist in large organizations but bad outcomes can also result from normal system functioning. That's why the search for culprits (bad actors or broken parts) may not always be appropriate or adequate. This is a point Dekker has explored before, in Just Culture (briefly reviewed here) where he suggests using accountability as a means to understand the system-based contributors to failure and resolve those contributors in a manner that will avoid recurrence.

Application to Nuclear Safety Culture

A commercial nuclear power plant or fleet is probably not a complete complex system. It interacts with environmental factors but in limited ways; it's certainly not directly exposed to the Wild West competition of, say, the cell phone industry. Groupthink and normalization of deviance*** are constant threats. The technology is reasonably well-understood but changes, e.g., uprates based on more software-intensive instrumentation and control, may be invisibly sanding away safety margin. Both the industry and the regulator would deny regulatory capture has occurred but an outside observer may think the relationship is a little too cozy. Overall, the fit is sufficiently good that students of safety culture should pay close attention to Dekker's observations.

In contrast, the Hanford Waste Treatment Plant (Vit Plant) is almost certainly a complex system and this book should be required reading for all managers in that program.

Conclusion

Drift Into Failure is not a quick read. Dekker spends a lot of time developing his theory, then circling back to further explain it or emphasize individual pieces. He reviews incidents (airplane crashes, a medical error resulting in patient death, software problems, public water supply contamination) and descriptions of organization evolution (NASA, international drug smuggling, “conflict minerals” in Africa, drilling for oil, terrorist tactics, Enron) to illustrate how his approach results in broader and arguably more meaningful insights than the reports of official investigations. Standing on the shoulders of others, especially Diane Vaughan, Dekker gives us a rich model for what might be called the “banality of normalization of deviance.” 


* S. Dekker, Drift Into Failure: From Hunting Broken Components to Understanding Complex Systems (Burlington VT: Ashgate 2011).

**  See our Sept. 4, 2012 post on Cynefin for another description of how the decisions an organization faces can suddenly slip from the Simple space to the Chaotic space.

***  We have posted many times about normalization of deviance, the corrosive organizational process by which yesterday's “unacceptable” becomes today's “good enough.”

Tuesday, September 4, 2012

More on Cynefin

Bob Cudlin recently posted on the work of David Snowden, a decision theorist and originator of the Cynefin decision construct.  Snowden’s Cognitive Edge website has a lot of information related to Cynefin, perhaps too much to swallow at once.  For those who want an introduction to the concepts, focusing on their implications for decision-making, we suggest a paper “Cynefin: repeatability, science and values”* by Prof. Simon French.

In brief, the Cynefin model divides decision contexts into four spaces: Known (or Simple), Knowable (or Complicated), Complex and Chaotic.  Knowledge about cause-and-effect relationships (and thus, appropriate decision making approaches) differs for each space.  In the Simple space, cause-and-effect is known and rules or processes can be established for decision makers; “best” practices are possible.  In the Complicated space, cause-and-effect is generally known but individual decisions require additional data and analysis, perhaps with probabilistic attributes; different practices may achieve equal results.  In the Complex space, cause-and-effect may only be identified after an event takes place so decision making must work on broad, flexible strategies that can be adjusted as a situation evolves; new practices emerge.  In the Chaotic space, there are no applicable analysis methods so decision makers must try things, see what happens and attempt to stabilize the situation; a novel (one-off) practice obtains.   
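The four spaces and their decision approaches amount to a small decision table.  The sketch below (Python) is our own rendering of the summary above, not code from Snowden or French.

```python
# Our own illustrative rendering of the four Cynefin decision contexts
# as summarized above (after French); not code from Snowden or French.

CYNEFIN = {
    "known":    ("Simple",      "cause-and-effect known; apply rules -> best practices"),
    "knowable": ("Complicated", "cause-and-effect knowable with analysis -> equally good practices"),
    "complex":  ("Complex",     "cause-and-effect visible only in retrospect -> emergent practices"),
    "chaotic":  ("Chaotic",     "no applicable analysis; act, observe, stabilize -> novel practice"),
}

def decision_approach(space: str) -> str:
    """Return the decision-making approach associated with a Cynefin space."""
    label, approach = CYNEFIN[space.lower()]
    return f"{label}: {approach}"

print(decision_approach("complex"))
```

Note what the table cannot capture: Snowden's fifth space (Disorder) and the dangerous Simple/Chaotic boundary discussed below, which are about how decision makers move between contexts rather than what to do within one.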

The model in the 2008 French paper is not in complete accord with the Cynefin model currently described by Snowden but French’s description of the underlying considerations for decision makers remains useful.  French’s paper also relates Cynefin to the views of other academics in the field of decision making.  

For an overview of Cynefin in Snowden’s own words, check out “The Cynefin Framework” on YouTube.  There he discusses a fifth space, Disorder, which is basically where a decision maker starts when confronted with a new decision situation.  Importantly, a decision maker will instinctively try to frame the decision in the Cynefin decision space most familiar to the decision maker based on personal history, professional experience, values and preference for action. 

In addition, Snowden describes the boundary between the Simple and Chaotic as the “complacent zone,” a potentially dangerous place.  In the Simple space, the world appears well-understood but as near-misses and low-signal events are ignored, the system can drift toward the boundary and slip into the Chaotic space where a crisis can arise and decision makers risk being overwhelmed.

Both decision maker bias and complacency present challenges to maintaining a strong safety culture.  The former can lead to faulty analysis of problems, forcing complex issues with multiple interactive causes through a one-size-fits-all solution protocol.  The latter can lead to disasters, great and small.  We have posted many times on the dangers of complacency.  To access those posts, click “complacency” in the Labels box.


*  S. French, “Cynefin: repeatability, science and values,” Newsletter of the European Working Group “Multiple Criteria Decision Aiding,” series 3, no. 17 (Spring 2008).  Thanks to Bill Mullins for bringing this paper to our attention. 

Tuesday, August 28, 2012

Confusion of Properties and Qualities

Dave Snowden
In this post we highlight a provocative and, we believe, accurate criticism of the approach taken by many management scientists in focusing on behaviors as the determinant of desired outcomes.  The source is Dave Snowden, a Welsh lecturer, consultant and researcher in the field of knowledge management.  For those of you interested in finding out more about him, the website of Cognitive Edge, the firm Snowden founded (http://cognitive-edge.com/main), contains an abundance of accessible content.

Snowden is a proponent of applying complexity science to inform managers’ decision making and actions.  He is perhaps best known for developing the Cynefin framework, which is designed to help managers understand their operational context based on four archetypes: simple, complicated, complex and chaotic.  In considering the archetypes one can see how various aspects of nuclear operations might fit within the simple or complicated frameworks, where tools such as best practices and root cause analysis are applicable.  But one can also see the limitations of these frameworks in more complex situations, particularly those involving nuanced safety decisions which are at the heart of nuclear safety culture.  Snowden describes “complex adaptive systems” as ones where the system and its participants evolve together through ongoing interaction and influence, and system behavior is “emergent” from that process.  Perhaps most provocative for nuclear managers is his contention that complex adaptive systems are “non-causal” in nature, meaning one shouldn’t think in terms of linear cause and effect and shouldn’t expect that root cause analysis will provide the needed insight into system failures.

With all that said, we want to focus on a quote from one of Snowden’s 2008 lectures, “Complexity Applied to Systems.”*  At approximately the 15-minute mark, he comments on a “fundamental error of logic” he calls “confusion of properties and qualities.”  He says:

“...all of management science, they observe the behaviors of people who have desirable properties, then try to achieve those desirable properties by replicating the behaviors”.

By way of a pithy illustration Snowden says, “...if I go to France and the first ten people I see are wearing glasses, I shouldn’t conclude that all Frenchmen wear glasses.  And I certainly shouldn’t conclude if I put on glasses, I will become French.”

For us Snowden’s observation generated an immediate connection to the approach being implemented around the nuclear enterprise.  Think about the common definitions of safety culture adopted by the NRC and industry.  The NRC definition specifies “... the core values and behaviors…” and “Experience has shown that certain personal and organizational traits are present in a positive safety culture. A trait, in this case, is a pattern of thinking, feeling, and behaving that emphasizes safety, particularly in goal conflict situations, e.g., production, schedule, and the cost of the effort versus safety.”**

The INPO definition defines safety culture as “An organization's values and behaviors – modeled by its leaders and internalized by its members…”***

In keeping with these definitions the NRC and industry rely heavily on the results of safety culture surveys to ascertain areas in need of improvement.  These surveys overwhelmingly focus on whether nuclear personnel are “modeling” the definitional traits, values and behaviors.  This seems to fall squarely in the realm described by Snowden of looking to replicate behaviors in hopes of achieving the desired culture and results.  Most often, identified deficiencies are addressed by retraining to reinforce the desired safety culture traits.  But what seems to be lacking is a determination of why the traits were not exhibited in the first place.  Follow-up surveys may be conducted periodically, again to measure compliance with traits.  This recipe is considered sufficient until the next time there are suspect decisions or actions by the licensee.

Bottom Line

The nuclear enterprise - NRC and industry - appear to be locked into a simplistic and linear view of safety culture.  Values and traits produce desired behaviors; desired behaviors produce appropriate safety management.  Bad results?  Go back to values and traits and retrain.  Have management reiterate that safety is their highest priority.  Put up more posters. 

But what if Snowden’s concept of complex adaptive systems is really an applicable model, and the safety management system is a much more complicated, continuously self-evolving process?  It is a question well worth pondering - and may have far more impact than many of the hardware-centric issues currently being pursued.

Footnote: Snowden is an immensely informative and entertaining lecturer and a large number of his lectures are available via podcasts on the Cognitive Edge website and through YouTube videos.  They could easily provide a stimulating input to safety culture training sessions.

*  Podcast available at http://cognitive-edge.com/library/more/podcasts/agile-conference-complexity-applied-to-systems-2008/. 

**  NRC Safety Culture Policy Statement (June 14, 2011).

***  INPO Definition of Safety Culture (2004).