This article was originally published by Forbes on October 10, 2024.
On July 19, 2024, the world’s economy received another chilling reminder of the implications of a botched software update when a coding error in software for Windows systems introduced users worldwide to the infamous “Blue Screen of Death.” Systems around the globe were rendered unusable, disrupting healthcare, transportation, finance and other industries and causing billions of dollars in damage. The cause wasn’t a cyberattack. Rather, it was routine maintenance.
A fix was quickly issued, and operations eventually returned to normal. Still, the global outage triggered that day offers plenty of lessons IT organizations can apply to their operational and business continuity plans. Software outages, whether caused maliciously or accidentally, may not happen often, but they do happen. IT and security operations teams must be prepared to respond quickly to limit damage.
Continuity plans are essential to reducing material risk if and when—with a heavy emphasis on “when”—the next outage occurs, while enabling IT teams to answer critical questions from the C-suite and the boardroom as an incident unfolds.
Start with a comprehensive asset inventory.
Asset intelligence is the foundation of an effective continuity plan. A comprehensive and well-maintained inventory of IT assets, including information on their security controls and their integration with other assets, enables IT and security teams to respond in a well-defined, methodical manner. It also gives them the means to answer urgent questions thundering down from upstairs during an outage.
During the global outage in July, I spoke with IT teams at several of our clients who were being peppered with questions from executive teams and board members, who essentially wanted real-time updates on the company’s status. To provide answers, those teams needed complete, easily accessible asset intelligence covering questions such as:
- What assets do we have?
- Which assets are mission-critical?
- Which assets are experiencing an outage?
- What’s not at risk?
- Which users and applications may be affected?
A comprehensive inventory of assets covers a wide range of systems and devices, including temporary assets that currently exist in the cloud. It also covers assets that constitute environmental vulnerabilities, such as active accounts belonging to employees no longer with the company, unpatched end-of-life systems, and systems that either have been retired or are used only internally for lab purposes.
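As a rough illustration of how an inventory can answer the five questions above, here is a minimal Python sketch. The `Asset` fields, the asset names and the `summarize` function are all hypothetical, and treating "not in outage" as "not at risk" is a simplifying assumption. A real inventory would pull these details from discovery and management tools.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    # Field names are illustrative assumptions, not a standard schema.
    name: str
    mission_critical: bool = False
    in_outage: bool = False
    users: set[str] = field(default_factory=set)
    applications: set[str] = field(default_factory=set)


def summarize(inventory: list[Asset]) -> dict:
    """Answer the five status questions from a list of assets."""
    down = [a for a in inventory if a.in_outage]
    return {
        "all_assets": [a.name for a in inventory],
        "mission_critical": [a.name for a in inventory if a.mission_critical],
        "in_outage": [a.name for a in down],
        # Simplification: anything not currently in outage counts as not at risk.
        "not_at_risk": [a.name for a in inventory if not a.in_outage],
        "affected_users": sorted(set().union(*(a.users for a in down))),
        "affected_applications": sorted(
            set().union(*(a.applications for a in down))
        ),
    }


# Hypothetical sample data.
inventory = [
    Asset("pos-terminal-01", mission_critical=True, in_outage=True,
          users={"store-staff"}, applications={"checkout"}),
    Asset("lab-server-07"),
    Asset("hr-laptop-12", in_outage=True,
          users={"hr-team"}, applications={"payroll"}),
]
status = summarize(inventory)
```

Keeping users and applications attached to each asset is what lets the last question be answered from the same record as the first four.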
Answering questions about how we’re doing can be tricky.
In addition to information on what is being affected, executives and board members will demand to know the status of the organization’s response, which relies on complete visibility into the enterprise’s assets. During an outage or other incident, leadership will want to know:
- How is the remediation process going?
- How can we validate that the remediations implemented are effective?
- What are the metrics being used to measure progress? What percentage of issues did we address in the first 24 hours—90%? Less than that? More than that?
- What remains to be done?
- If we need to temporarily disable the application causing the problem, what will be the impact on customers?
- What other IT management and security controls are installed, and what is their status? For example, is information generated by a point solution being integrated into a centralized platform?
Answering each of these questions depends on asset intelligence, which involves a full accounting of an organization’s IT assets and knowing the security status of each one. Teams must know which assets have adequate security controls and which don’t. For assets that don’t, the risks they present must be prioritized according to their vulnerabilities, whether that is a known vulnerability on the Common Vulnerabilities and Exposures (CVE) list or an unsecured endpoint, as well as their importance to mission-critical operations.
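One simple way to picture that prioritization is to multiply how critical an asset is by how severe its exposure is, then work the list from the top. The sketch below assumes a hypothetical 1-to-5 criticality scale and a 0-to-10 risk score. Both scales, the asset names and the `prioritize` function are illustrative choices, not a prescribed method.

```python
from typing import NamedTuple


class AssetRisk(NamedTuple):
    name: str
    criticality: int    # hypothetical scale: 1 (low) to 5 (mission-critical)
    risk_score: float   # hypothetical 0-10 severity, e.g. from a vulnerability scan
    has_controls: bool  # True if required security controls are in place


def prioritize(assets: list[AssetRisk]) -> list[str]:
    """Rank assets lacking controls by criticality x risk, highest first."""
    exposed = [a for a in assets if not a.has_controls]
    ranked = sorted(
        exposed, key=lambda a: a.criticality * a.risk_score, reverse=True
    )
    return [a.name for a in ranked]


# Hypothetical sample data.
queue = prioritize([
    AssetRisk("billing-db", criticality=5, risk_score=7.5, has_controls=False),
    AssetRisk("lab-vm", criticality=1, risk_score=9.8, has_controls=False),
    AssetRisk("web-frontend", criticality=4, risk_score=6.0, has_controls=False),
    AssetRisk("mail-gateway", criticality=4, risk_score=8.0, has_controls=True),
])
print(queue)  # ['billing-db', 'web-frontend', 'lab-vm']
```

Note that the lab machine carries the most severe vulnerability in this sample yet lands last, because its importance to operations is low.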
They also need to be sure the remediations applied have taken hold on all targeted assets. Some configuration management tools have falsely reported a certain percentage of asset remediations as verified. It may have been a small percentage, but anything less than 100% is still a risk when it comes to closing security gaps.
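A hedged sketch of that cross-check: compare what a tool reports as fixed against what an independent check confirms, and report the gap. The function name, the three input sets and the asset names are assumptions for illustration. How an organization independently confirms a fix will vary.

```python
def verify_remediation(
    targeted: set[str],
    reported_fixed: set[str],
    confirmed_fixed: set[str],
) -> dict:
    """Compare tool-reported fixes with independently confirmed fixes."""
    # Assets the tool marked as fixed that the independent check did not confirm.
    falsely_verified = (reported_fixed & targeted) - confirmed_fixed
    # Targeted assets that still need attention.
    still_open = targeted - confirmed_fixed
    if targeted:
        percent = 100.0 * len(targeted & confirmed_fixed) / len(targeted)
    else:
        percent = 100.0  # nothing was targeted, so nothing is outstanding
    return {
        "percent_confirmed": percent,
        "falsely_verified": sorted(falsely_verified),
        "still_open": sorted(still_open),
    }


# Hypothetical sample data: the tool claims all four hosts are fixed,
# but an independent check confirms only three.
report = verify_remediation(
    targeted={"host-a", "host-b", "host-c", "host-d"},
    reported_fixed={"host-a", "host-b", "host-c", "host-d"},
    confirmed_fixed={"host-a", "host-b", "host-c"},
)
print(report["percent_confirmed"])  # 75.0
print(report["falsely_verified"])   # ['host-d']
```

The same percentage can double as the progress metric leadership asks about, provided it is computed from confirmed fixes and not from reported ones.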
A platform that centralizes asset information can give teams complete visibility into the enterprise while providing real-time updates on what’s being done to address an outage or other incident, such as tracking which assets have been added to the risk list and which have been moved off. This enables teams not only to remediate efficiently but also to be ready with answers to questions from the organization’s leadership. A simple question like, “How are we doing?” can be challenging to answer without complete asset intelligence.
What steps should security and IT leaders consider after gaining visibility?
- Prioritize your response based on the intersection of the most critical systems and the highest risks.
- If a process isn’t defined, clarify the groups or individuals responsible for mitigation.
- Monitor the mitigation process, preferably in real time, to ensure what needed to be fixed was actually fixed.
- Measure the effectiveness of the mitigation process to determine where improvements can be made across talent, techniques and technology.
- Implement adjustments from the lessons learned in the previous step to be better prepared for the next time.
Outages, for one reason or another, may not happen often—but they do happen. When they do, complete asset intelligence is required to ensure that IT and security teams can find the answers to get everyone back to work quickly.