Monday, October 13, 2014

Systems Thinking in Air Traffic Management

A recent white paper* presents ten principles to consider when thinking about a complex socio-technical system, specifically European Air Traffic Management (ATM).  We review the principles below, highlighting aspects that might provide some insights for nuclear power plant operations and safety culture (SC).

Before we start, we should note that ATM is truly a complex** system.  Decisions involving safety and efficiency occur on a continuous basis.  There is always some difference between work-as-imagined and work-as-done.

In contrast, we have argued that a nuclear plant is a complicated system but it has some elements of complexity.  To the extent complexity exists, treating nuclear like a complicated machine via “analysing components using reductionist methods; identifying ‘root causes’ of problems or events; thinking in a linear and short-term way; . . . [or] making changes at the component level” is inadequate. (p. 5)  In other words, systemic factors may contribute to observed performance variability and frustrate efforts to achieve the goal in nuclear of eliminating all differences between work-as-planned and work-as-done.

Principles 1-3 relate to the view of people within systems – our view from the outside and their view from the inside.

1. Field Expert Involvement
“To understand work-as-done and improve how things really work, involve those who do the work.” (p. 8)
2. Local Rationality
“People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time.” (p. 10)
3. Just Culture
“Adopt a mindset of openness, trust and fairness. Understand actions in context, and adopt systems language that is non-judgmental and non-blaming.” (p. 12)

Nuclear is pretty good at getting line personnel involved.  Adages such as “Operations owns the plant” are useful to the extent they are true.  Cross-functional teams can include operators or maintenance personnel.  An effective CAP that allows workers to identify and report problems with equipment, procedures, etc. is good; an evaluation and resolution process that involves members from the same class of workers is even better.  Having someone involved in an incident or near-miss go around to the tailgates and classes to share the lessons learned can be convincing.

But when something unexpected or bad happens, nuclear tends to spend too much time looking for the malfunctioning component (usually human).   “The assumption is that if the person would try harder, pay closer attention, do exactly what was prescribed, then things would go well. . . . [But a] focus on components becomes less effective with increasing system complexity and interactivity.” (p. 4)  An outside-in approach ignores the context in which the human performed, the information and time available, the competition for focus of attention, the physical conditions of the work, fatigue, etc.  Instead of insight into system nuances, the result is often limited to more training, supervision or discipline.

The notion of a “just culture” comes from James Reason.  It’s a culture where employees are not punished for their actions, omissions or decisions that are commensurate with their experience and training, but where gross negligence, willful violations and destructive acts are not tolerated.

Principles 4 and 5 relate to the system conditions and context that affect work.

4. Demand and Pressure
“Demands and pressures relating to efficiency and capacity have a fundamental effect on performance.” (p. 14)
5. Resources & Constraints
“Success depends on adequate resources and appropriate constraints.” (p. 16)

Fluctuating demand creates far more varied and unpredictable problems for ATM than it does in nuclear.  However, in nuclear the potential for goal conflicts between production, cost and safety is always present.  The problem arises from acting as if these conflicts don’t exist.

ATM has to “cope with variable demand and variable resources,” a situation that is also different from nuclear with its base load plants and established resource budgets.  The authors opine that for ATM, “a rigid regulatory environment destroys the capacity to adapt constantly to the environment.” (p. 2) Most of us think of nuclear as quite constrained by procedures, rules, policies, regulations, etc., but an important lesson from Fukushima was that under unforeseen conditions, the organization must be able to adapt according to local, knowledge-based decisions.  Even the NRC recognizes that “flexibility may be necessary when responding to off-normal conditions.”***

Principles 6 through 10 concern the nature of system behavior, with 9 and 10 more concerned with system outcomes.  These do not have specific implications for SC other than keeping an open mind and being alert to systemic issues, e.g., complacency, drift or emergent behavior.

6. Interactions and Flows
“Understand system performance in the context of the flows of activities and functions, as well as the interactions that comprise these flows.” (p. 18)
7. Trade-Offs
“People have to apply trade-offs in order to resolve goal conflicts and to cope with the complexity of the system and the uncertainty of the environment.” (p. 20)
8. Performance Variability
“Understand the variability of system conditions and behaviour.  Identify wanted and unwanted variability in light of the system’s need and tolerance for variability.” (p. 22)
9. Emergence
“System behaviour in complex systems is often emergent; it cannot be reduced to the behaviour of components and is often not as expected.” (p. 24)
10. Equivalence
“Success and failure come from the same source – ordinary work.” (p. 26)

Work flow certainly varies in ATM but is relatively well-understood in nuclear.  There’s really not much more to say on that topic.

Trade-offs occur in decision making in any context where more than one goal exists.  One useful mental model for conceptualizing trade-offs is Hollnagel’s efficiency-thoroughness construct, basically doing things quickly (to meet the production and cost goals) vs. doing things well (to meet the quality and possibly safety goals).  We reviewed his work on Jan. 3, 2013.

Performance variability occurs in all systems, including nuclear, but the outcomes are usually successful because a system has a certain range of tolerance and a certain capacity for resilience.  Performance drift happens slowly, and can be difficult to identify from the inside.  Dekker’s work speaks to this and we reviewed it on Dec. 5, 2012.

Nuclear is not fully complex but surprises do happen, some of them not caused by component failure.  Emergence (problems that arise from new or unforeseen system interactions) is more likely to occur following the implementation of new technical systems.  We discussed this possibility in a July 6, 2013 post on a book by Woods, Dekker et al.

Equivalence means that work that results in both good and bad outcomes starts out the same way, with people (saboteurs excepted) trying to be successful.  When bad things happen, we should cast a wide net in looking for different factors, including systemic ones, that aligned (like Swiss cheese slices) in the subject case.

The white paper also includes several real and hypothetical case studies illustrating the application of the principles to understanding safety performance challenges.

Our Perspective 

The authors draw on a familiar cast of characters, including Dekker, Hollnagel, Leveson and Reason.  We have posted about all these folks; just click on their labels in the right-hand column.

The principles are intended to help us form a more insightful mental model of a system under consideration, one that includes non-linear cause and effect relationships, and the possibility of emergent behavior.  The white paper is not a “must read” but may stimulate useful thinking about the nature of the nuclear operating organization.

*  European Organisation for the Safety of Air Navigation (EUROCONTROL), “Systems Thinking for Safety: Ten Principles” (Aug. 2014).  Thanks to Bill Mullins for bringing this white paper to our attention.

**  “[C]omplex systems involve large numbers of interacting elements and are typically highly dynamic and constantly changing with changes in conditions. Their cause-effect relations are non-linear; small changes can produce disproportionately large effects. Effects usually have multiple causes, though causes may not be traceable and are socially constructed.” (pp. 4-5)

Also see our Oct. 14, 2013 discussion of the California Independent System Operator for another example of a complex system.

***  “Work Processes,” NRC Safety Culture Trait Talk, no. 2 (July 2014), p. 1.  ADAMS ML14203A391.  Retrieved Oct. 8, 2014.

1 comment:

  1. Lew,

    I can agree that NPPs are complicated at the individual plant level for purposes of licensing power operations. But I would suggest that from experience we know that adding plants at a site, compiling them into fleets, driving for ever shorter outages and more on-line maintenance, and operating under merchant generation finances are complexifying factors, not simple sources of extended complication.

    Throw in stress corrosion cracking (e.g. S/G replacement), and life cycle extension with the attendant equipment aging and staff experience variation and the US National Nuclear Energy Enterprise is surely a complex System of Systems to a degree that is comparable to that of the National Air Traffic Control System.

    What is different between the two enterprises is the time constant of variation (latency period) introduced with faulty or delayed decision-making, or as a result of imprudent over-emphasis on production expediency.

    In the ATC case, there are many smallish units of work (individual flights) and lots of daily experience with more or less "line of sight" variation from complexity to be dealt with. This makes for a lot of real time experience building operational Resilience in all the actors, even as the system works to optimize throughput (Reliability). It seems significant that the Government has an active role in the conduct of the ATC System - a feature of DOE and Naval Reactors Systems but lacking in the Commercial NP sector.

    In the US NNEE, a practical problem has been that while treating the industry as a complicated sum, when only 10 plants were in service, could be achieved by independent oversight of each plant, the individual plant ceases to be the only source of unsafe variability as the total fleet size increases. System of Systems complexity grows exponentially, particularly when there are long latency periods from decision to its manifestation (cf. SONGS S/G design errors) and where the most consequential accidents still happen only very rarely. Long latency in complex feedback loops is not a sign of limited complexity - quite the opposite I suspect is true.

    The statistics of off-normal variability in the ATC case are much better known just owing to the need to incorporate weather variation. Part of the reason that INSAG 4 (and the Naval Reactors Program) made such a big deal about Issues Management (particularly unanticipated ones) is that this is the umbrella under which the most significant residual vulnerabilities rest - most of the ones that can be addressed by engineered safety features have been.

    The real value of an institution-wide questioning attitude is in the way it encourages Systems Thinkers to challenge the issue formulation of unusual circumstances. Alert recognition of anticipated defects is smart business, but Defense in Depth design provides affordance for some delay in recognition.

    Where things get dicey is when DiD is used as cover for an institutional failure to take seriously a conspicuous operations discipline breakdown (à la the 2003 Callaway unmonitored reactivity drift during shutdown). These are the seeds of future troubles - clearly it is much harder for NP managers and sometimes regulators to get a response handle on this type of vulnerability even if they recognize it.

    There is a reason that Commercial Air Transport grows and evolves regularly through generational improvement in technology while maintaining hazard rates at ever larger annual capacity. In nuclear we should be taking lessons from this very large body of experience, but it is far from obvious that we even try. People like Dekker, Reason and Hollnagel have all worked on the ATC issues; but they are generally dismissed in centers of nuclear industry thinking like INPO/WANO, IAEA and NRC. That is something I really don't get.

