
Managing the risks of complex systems, and increasing systemic robustness, is a challenging task. Systems can fail on multiple levels. While some potential failures are straightforward to identify and protect against (an engine failure on a plane, for example, is mitigated by having multiple engines), complex systems can fail in multiple ways, creating unknown complications.

To make matters worse, such failures are often correlated, so multiple simple failures are likely to happen simultaneously. Consequently, unexpected correlations can cause systems to suffer from illusory redundancy, where a backup component is unexpectedly vulnerable to the very risks against which it was designed to protect. The interaction of multiple correlated failures of a complex system can drive the system into a state of criticality.

Criticality is a property of a complex system that occurs when the system’s complex interactions produce wide-ranging consequences. In particular, the system exhibits high correlation across different scales and operates with little tolerance for error: a failure in one place is now directly linked to widespread failures. When that criticality emerges from the local properties of the system, rather than from a central source, it is known as Self-Organized Criticality (SOC). (I am paraphrasing some very technical definitions developed in physics.) Typical models of SOC study “toy systems” with simple dynamics, like simulated sand piles and cellular automata. Systems that exhibit SOC often share similarities, though their study is primarily empirical: there are no analytical rules that guarantee a system will demonstrate SOC.
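To make the sand-pile idea concrete, here is a minimal sketch of the classic Bak–Tang–Wiesenfeld sandpile model, the prototypical SOC toy system. The grid size, grain count, and function name are illustrative choices, not part of any particular study: grains are dropped one at a time, any cell holding four or more grains topples onto its neighbors, and the chain of topplings triggered by a single grain is an “avalanche” whose sizes, empirically, span many scales.

```python
import random

def simulate_sandpile(size=20, grains=5000, seed=0):
    """Bak-Tang-Wiesenfeld sandpile: drop grains one at a time onto a
    size x size grid. Any cell holding 4+ grains topples, sending one
    grain to each of its four neighbors (grains falling off the edge
    are lost). The number of topplings triggered by a single dropped
    grain is that grain's avalanche size."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    avalanche_sizes = []
    for _ in range(grains):
        # Drop one grain on a random cell.
        r, c = rng.randrange(size), rng.randrange(size)
        grid[r][c] += 1
        topples = 0
        unstable = [(r, c)] if grid[r][c] >= 4 else []
        # Relax the pile: keep toppling until every cell is below 4.
        while unstable:
            i, j = unstable.pop()
            if grid[i][j] < 4:
                continue  # already relaxed by an earlier toppling
            grid[i][j] -= 4
            topples += 1
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < size and 0 <= nj < size:
                    grid[ni][nj] += 1
                    if grid[ni][nj] >= 4:
                        unstable.append((ni, nj))
        avalanche_sizes.append(topples)
    return grid, avalanche_sizes
```

Run long enough, the pile organizes itself into a critical state with no external tuning: most dropped grains cause nothing, while a few identical drops trigger system-wide avalanches — the same local rule producing consequences at every scale.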

Though real events like natural disasters are challenging to simulate with simple models, they do provide a hint of SOC. For example, one property of SOC is that, once a system reaches a state of criticality, it is often hard to recover without significant external inputs, a property common to some large-scale disaster responses.

For instance, the March 2011 9.0 magnitude earthquake in Tohoku, Japan, and the subsequent events surrounding the Fukushima Dai-Ichi power plant are an example of multiple failures leading to a state of criticality. The shutdown of the power plant, and the failure of its backup generator systems, disabled the plant’s cooling systems, requiring the deployment of tremendous resources to mitigate disastrous consequences. Furthermore, the loss of power from the plant itself had widespread consequences, as subsequent blackouts shut down train service and workplaces and reduced industrial output, hampering the economic recovery beyond the direct physical damage that occurred. Whereas a naive model may have predicted very local consequences (like potential local radiation exposure and the evacuation of surrounding areas), the actual consequences extended far beyond the scale of the plant itself.

While Fukushima is a striking example, criticality can arise from seemingly prosaic interactions as well. For example, after a flood, an extended loss of power may occur if transmission equipment is damaged. In response, generators would be required for lighting and to operate pumps, slightly increasing the cost and the timeline of the recovery effort. But the correlated failure of transportation systems may prevent those generators from being deployed, resulting in the uniform and widespread disruption of recovery efforts. A similar cascade of public health effects can occur when spoiled food leads to an increase in the rat population and accompanying health concerns.

Criticality can also be a longer-term phenomenon. For example, residents leaving an area after a disaster might reduce the tax base and cause a feedback loop that makes recovery efforts harder to fund and justify. Such an abandonment can also lead to infrastructure concerns, which might create further deteriorating conditions and result in widespread blight.

The challenges of managing criticality are not insurmountable. What we’ve learned from natural disasters and the responses required is that complex interactions that can lead to criticality are hard to simulate and rarely encountered. But identifying SOC as a potential source of catastrophic failure, and bringing together disparate groups with insights into correlated interactions, is a step in the right direction.

Chris Clearfield is a principal at System Logic, an independent consulting firm that helps organizations manage issues of risk and complexity. Follow him on Twitter, and check out his other writings.

As originally published in Forbes.

Figure: Nature News Blog