Professionalism/Northeast Blackout of 2003

The Northeast blackout of 2003 was a widespread power outage that occurred throughout parts of the Northeastern and Midwestern United States and the Canadian province of Ontario on Thursday, August 14, 2003.

Causes
A primary cause of the blackout was determined to be a then-unknown software bug in General Electric's widely used XA/21 grid control system, operating at FirstEnergy's Eastlake 5 power substation in Ohio. The Eastlake plant had experienced recent maintenance issues and was struggling to keep up with demand on the hot summer day. (The plant has since been shut down over reliability and upgrade-cost concerns.) A particular set of conditions at the plant triggered the bug, producing a race condition that prevented necessary alarms from tripping during an overload and degraded the system's performance as further errors accumulated. The bug also caused the backup server system to fail under the increased load, further exacerbating the situation. FirstEnergy employees elsewhere did not notice these initial failures at the Eastlake plant and were unaware that several lines had become excessively loaded. Sag from thermal expansion allowed one high-voltage line to contact a tree, triggering a cascading failure in which one failure led to another; each event strained the remaining equipment, making it more likely to fail similarly or to take itself offline, extending the effects. By this process, a large region around Lake Erie and New York State eventually isolated itself entirely and shut down, crippling the region and leaving most of the areas within it without primary or backup power.

Results
The failure of the Eastlake 5 substation and other power generation sites caused widespread power loss across a large region of the northeastern United States and Ontario, affecting over 50 million people. The affected region came to a near-total technological standstill. Several areas lost water service because of a loss of pressure in pumping systems and were put under contamination advisories. Cleveland, New York, Kingston, and Newark experienced major sewage spills into waterways. Transportation lines shut down across the area. All trains in New York City were rendered inoperable by the blackout, though diesel-powered service eventually came online. Gas stations were unable to pump fuel, halting trucking and supply services. Airports in the region were also unable to function because of security concerns, and even after power returned, computer systems had difficulty accessing ticket information. Cellular communication was totally interrupted, and many factories were unable to operate for days for lack of both power and supplies. New York City, the largest city affected, shut down completely. People were stranded in the city for lack of transportation, and major gridlock occurred without lights to regulate traffic. The problems were exacerbated by the heat and high humidity, and many people slept outside to escape buildings without air conditioning.

Other Effects
In the United States, the blackout reinforced the Bush administration's emphasis on the need for changes to national energy policy, critical infrastructure protection, and homeland security. During the blackout, most detection systems failed. The event highlighted the ease with which the power grid could be disabled and sparked worries about its potential exploitation for terrorism.

After power was restored, some cities in Ontario took part in power conservation challenges such as the Voluntary Blackout Day hosted by the Ontario Power Authority. During these events, citizens were encouraged to maximize their energy conservation activities.

Dispersion of Responsibility
The blackout was not attributable to one specific event. Rather, multiple components were involved in the chain of events.

The North American Electric Reliability Corporation (NERC) investigated the blackout extensively and determined several individual points of failure. According to its report, “the causes of the blackout were rooted in deficiencies resulting from decisions, actions, and the failure to act of the individuals, groups, and organizations involved.” Specifically, it broke these causes into three main categories.

Ineffective support and communication
Other Operator: “Hey, do you think you could help out the 345 voltage a little?”
Eastlake 5 Operator: “Buddy, I am — yeah, I’ll push it to my max max. You’re only going to get a little bit.”
Other Operator: “That’s okay, that’s all I can ask.”

The Midcontinent Independent System Operator (MISO) is an Indiana-based regional electric grid management and coordination group responsible for overseeing and monitoring operations across the middle of the United States and Manitoba. In 2003, however, this monitoring was conducted with systems that were not designed to provide truly real-time data. As a result, MISO was unable to provide the real-time support necessary to stop the failure before it began cascading. It also lacked adequate internal procedures for responding to the preliminary overload reports it did receive.

Lack of situational awareness
A combination of causes interfered with FirstEnergy’s situational awareness. This prevented operators from taking corrective actions to maintain system equilibrium.
 * The Eastlake plant had inadequate redundancy in its alarm failure detection system. An infinite-loop lockup caused by a race condition disabled the alarm processor, yet gave no indication of the problem.
 * Computer support staff did not effectively communicate the loss of the alarm functionality. The lack of knowledge of the alarm processor failure exacerbated already deteriorating conditions.
 * Computer support staff did not fully test functionality after server restoration. The server failure was a separate incident, but a full test upon reboot would have revealed the alarm failure.
 * Operators did not have effective alternative means of visualizing system conditions. There was no other display or status overview with which operators could monitor the system.
 * FirstEnergy did not have an effective contingency analysis capability. Real-time analysis of generators and transmission lines would have notified operators of deteriorating system conditions.
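The first two failure modes above, an alarm processor silenced by a race condition with no redundant detection of the silence, can be made concrete with a small sketch. This is purely illustrative Python, not the actual XA/21 code; all class and function names here are invented. It shows the two standard remedies: a lock that serializes concurrent writers to shared alarm state, and a watchdog heartbeat so that a stalled alarm processor is itself detected.

```python
import threading
import time

class AlarmProcessor:
    """Toy alarm processor; names are hypothetical, not from XA/21."""

    def __init__(self):
        self._lock = threading.Lock()      # serializes concurrent writers
        self._events = []
        self.last_heartbeat = time.monotonic()

    def record(self, event):
        # The 2003 bug involved concurrent writers racing on shared alarm
        # state and locking up the processor; guarding the critical
        # section with a lock prevents that kind of corruption.
        with self._lock:
            self._events.append(event)

    def beat(self):
        # Periodic heartbeat: a separate watchdog checks this timestamp,
        # providing the redundant failure detection Eastlake lacked.
        self.last_heartbeat = time.monotonic()

    def drain(self):
        with self._lock:
            events, self._events = self._events, []
        return events

def watchdog_stalled(proc, timeout=5.0):
    """Redundant check: has the alarm processor gone silent?"""
    return time.monotonic() - proc.last_heartbeat > timeout

proc = AlarmProcessor()

def writer(n):
    for i in range(n):
        proc.record(("overload", i))

# Two concurrent writers, as in a flood of simultaneous grid events.
threads = [threading.Thread(target=writer, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

proc.beat()
print(len(proc.drain()))        # 2000: no events lost
print(watchdog_stalled(proc))   # False: processor is alive
```

Had the real system included an equivalent of `watchdog_stalled`, the silent alarm failure would have been flagged to operators instead of going unnoticed for over an hour.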

Vegetation management
NERC determined that the chain reaction was started by a power line contacting an overgrown tree. Investigators found that trees had been allowed to grow unchecked around the 345-kV lines, causing all three lines to go down within 30 minutes of one another, a statistical improbability had the trees been properly trimmed. Effective vegetation management could have avoided triggering the outage of these lines, though it would not have been enough to mitigate the other failure modes. Even so, at the time there was no standard for tree trimming.

Conclusion
Because responsibility was dispersed and there was no clear individual source of blame, no penalties were issued to any party. Whether or not this was correct, fair assignment of blame is indeed extremely difficult. In the real world, a catastrophic failure of this nature is more often the result of multiple smaller issues compounding on one another than of a single root cause. No one issue would have caused the failure, but taken together, the system could not compensate and failure became inevitable. It is up to the professional to keep this from happening: a professional is one who pays attention to the details and acts to take responsibility. Had a more experienced individual been in the right position, the software bug might have been noticed in testing. Had adequate protocols and technologies been in place at MISO, a contingency plan could have been enacted to respond effectively to the system failure. Had even the trees been trimmed more diligently, the blackout might never have occurred.

Furthermore, communication between professionals is extremely important. For complex systems that depend on interacting subsystems, communication must occur between the various levels. Developers need to inform end users of the deficiencies of their product. It may not be as pleasing or marketable, but it ensures that the user is aware of the product's shortcomings. Operators of the product or system also need to communicate with each other. Deviations from normal operations, areas for improvement, and maintenance requirements all affect system performance, so these details must be made clear among colleagues, including being sent up the chain of command so that proper action can be taken.