Reporting from Chennai, India: A major technology outage on Friday caused widespread disruptions around the world, affecting airlines, banks, media companies, and numerous other businesses. The root cause appears to be a configuration change within Microsoft Azure, the company’s cloud computing platform.
This change led to a breakdown in communication between storage and computing resources, ultimately causing connectivity failures for Microsoft 365 services. The most visible symptom for many users was the dreaded Blue Screen of Death (BSOD) on Windows machines, rendering them unusable. Some reports also suggest that the errors were caused by a CrowdStrike ‘Falcon Sensor’ update, which affected airlines, banks, stock markets, and other businesses across the globe.
The Microsoft / CrowdStrike outage has taken down most airports in India. I got my first hand-written boarding pass today 😅 pic.twitter.com/xsdnq1Pgjr
— Akshay Kothari (@akothari) July 19, 2024
Social media reports documented the issue spreading rapidly across the globe, affecting users in the United States, India, and other countries. Banks, supermarkets, and media companies all reported problems, with some TV and radio studios even going offline.
Due to major global system outage, all gate screens at DEL blank. Flights are being held at gate. Some gates boarding pax and holding on board, some flights holding pax at gate itself, which is better. Seems to be impacting many airports and airlines. pic.twitter.com/9SmHsZ6wYg
— Sanjiv Kapoor (@TheSanjivKapoor) July 19, 2024
CrowdStrike CEO George Kurtz said the issue was identified shortly after the update was released. He said, “We identified this very quickly and remediated the issue. And as systems come back online, as they’re being rebooted, they’re coming back and they’re working. Many of the customers are rebooting the system and it’s coming up and (being) operational because we fixed it on our end. Some of the systems that aren’t recovering, we’re working with them. It could be some time for some systems that just automatically won’t recover.”
The fallout was particularly severe for airlines. Check-in systems failed at major carriers in India, including IndiGo, Akasa, and SpiceJet. Similar disruptions were reported at airports worldwide, with Delhi’s Indira Gandhi International Airport, Sydney Airport, and Berlin Airport experiencing delays and cancellations. While Microsoft quickly resolved the Azure issue, airlines are still working to address the knock-on effects and get passengers back on track.
“This massive global IT outage, reportedly caused by a faulty security update from CrowdStrike affecting Microsoft Windows systems, highlights the delicate balance between maintaining cybersecurity and ensuring operational stability. The incident began when a routine security update inadvertently caused widespread disruptions, affecting businesses across various sectors including airlines, financial services, and healthcare. This demonstrates how interconnected and vulnerable our global IT infrastructure can be. There’s now a risk that companies might become hesitant to apply crucial updates, fearing similar outages. However, this approach would leave them more susceptible to cyber-attacks. Organizations mustn’t overreact by avoiding updates altogether. Instead, this incident underscores the critical importance of managing software updates in a controlled, methodical manner. Companies should implement robust testing procedures, including staging updates in isolated environments that mirror their production systems before rolling them out widely. This approach allows for the identification and mitigation of potential issues before they can impact critical operations. While no update process is entirely risk-free, a careful, staged approach to updates can significantly reduce the likelihood of such widespread disruptions while maintaining strong cybersecurity defences,” explained Andreas Hassellöf, CEO at Ombori.
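The staged approach described above can be illustrated with a minimal Python sketch. It is not CrowdStrike’s or Microsoft’s actual process; the ring names, host names, the `staged_rollout` function, and the `fake_update` stand-in are all hypothetical, and a real deployment would replace `fake_update` with its own install-and-health-check step. The idea is simply that an update reaches a small group first and only moves on if that group stays healthy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Ring:
    """A group of hosts that receives an update at the same stage (hypothetical)."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Apply an update ring by ring, halting when a ring's failure rate is too high.

    apply_update(host) is assumed to return True if the host is healthy
    after the update and False otherwise.
    """
    completed: List[str] = []
    for ring in rings:
        failures = [host for host in ring.hosts if not apply_update(host)]
        rate = len(failures) / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            # Stop here so later (larger) rings never receive the bad update.
            return {"completed": completed, "halted_at": ring.name, "failed_hosts": failures}
        completed.append(ring.name)
    return {"completed": completed, "halted_at": None, "failed_hosts": []}


# Illustrative data only: three rings, from an isolated staging group to production.
rings = [
    Ring("staging", ["stage-1", "stage-2"]),
    Ring("canary", ["prod-1", "prod-2"]),
    Ring("production", ["prod-3", "prod-4", "prod-5"]),
]

touched: List[str] = []


def fake_update(host: str) -> bool:
    """Stand-in for a real update step; pretends that prod-2 fails to boot."""
    touched.append(host)
    return host != "prod-2"


result = staged_rollout(rings, fake_update)
print(result)
```

In this made-up run the rollout stops at the canary ring, so the three production hosts are never updated. Using a zero-tolerance default for `max_failure_rate` is a design choice for the sketch; real thresholds would depend on the fleet and the update.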
Mark Jow, Security Evangelist EMEA at Gigamon, commented, “This Microsoft IT outage demonstrates the need for more robust and resilient solutions so that when these issues do arise, they can be resolved quickly without causing such widespread customer chaos and security risk. Preparedness is key – every IT and security vendor must have a robust system in place across its software development lifecycle to test upgrades before they are rolled out to ensure that there are no security flaws within the updates.”
Alexey Lukatsky, Managing Director, Cybersecurity Business Consultant, Positive Technologies, added, “This case reminds us of the importance of secure development, since in this case it was most likely the lack of update checking both on the side of the manufacturer – CrowdStrike – and on the side of consumers who automatically installed all the updates that reached them, and led to a massive global outage around the globe. With the exception of those countries that are not using infosec products from this American corporation. In addition, this story shows us how firmly information technologies have become embedded in people’s lives and in various business processes, and how catastrophic the consequences of an accidental or unauthorized, malicious impact on the IT infrastructure can be. That is, in other words, businesses are faced with the task of assessing those non-tolerable events with catastrophic consequences that can occur in their activities due to the impact on the IT infrastructure.”
Lukatsky further added, “At the moment, the root cause, based on the scale of the disaster, the way the incident manifested itself, appears to be failure to follow safe development practices. However, there is a version that cannot be ruled out: it has not yet been confirmed, but we, as experts in the field of cybersecurity, cannot completely deny it. This is the intrusion of attackers into the software development process at CrowdStrike, which could have led to the introduction of malicious functionality into the next update, ultimately leading to this kind of massive failure. The only thing that can suggest that these are unlikely to be malicious actions of cybercriminals who have intruded into the development process is that usually in these types of stories the task of cybercriminals is to remain undetected for as long as possible.”
“The recent CrowdStrike outage appears to stem from a bug in their EDR agent, which was unfortunately not thoroughly tested. This resulted in widespread disruption as many installations were affected globally. The flawed update necessitates manual intervention to resolve, specifically rebooting systems in ‘safe mode’ and deleting the faulty driver file. This process is cumbersome and leaves systems vulnerable in the interim, potentially inviting opportunistic attacks. This incident highlights the importance of rigorous testing and staged updates for EDR agents. Normally, testing is done with every release and can take days to weeks, depending on the size of the update or changes. The ease with which their driver files can be deleted also raises questions about the self-protection mechanisms of CrowdStrike’s software. For our Acronis customers, those with recent backups can restore their systems to a stable state, minimizing downtime and exposure. Moving forward, we recommend all businesses ensure robust backup solutions and advocate for better testing protocols from their security vendors,” added Kevin Reed, Chief Information Security Officer, Acronis.
Fortunately, not all businesses were impacted. India’s stock exchanges, BSE and NSE, reported normal operations as they rely minimally on Microsoft applications. The exact cause of the grounding orders issued by major US carriers such as American Airlines, Delta Air Lines, and United Airlines remains unclear, but the orders came just after Microsoft resolved the Azure outage. Other airlines, including Allegiant Air, also grounded flights out of caution.
Darren Anstee, Chief Technology Officer for Security, NETSCOUT, said, “The worldwide IT outage currently affecting airlines, media, banks and much more appears to have been caused by a faulty software update which was automatically applied, and not a cyberattack. This is another demonstration of how dependent we are on both our IT infrastructure and the supply chains that deliver tightly integrated capabilities within it. There will undoubtedly be a huge fallout from this, with a lot of questions set to be raised around how to balance the need for regular security updates for defence, compliance etc, with the risk of applying unqualified updates to systems. Most enterprise software goes through testing and controlled roll-out before it is pushed to a whole population, but this doesn’t seem to be the case in this instance.”
Alois Reitbauer, Chief AI Strategist, Dynatrace, said, “Given the increasing complexity of software, all software developers and organizations are susceptible to outages. When outages do occur, organizations need the capability to pinpoint the root cause and remediate them immediately. AI-driven approaches have become essential for complex IT operations to deploy as manual processes cannot keep up. A power of 3 approach to AI leveraging predictive, causal, and generative AI is increasingly critical to help organizations deliver the highest availability and performance of software as well as minimize disruption to end-user experience.”
The global ripple effects extended to Singapore Airlines, where technical difficulties impacted their service centre and reservation hotlines. Thankfully, their flights continued to operate as scheduled. Passengers are advised to contact their airlines directly for the latest flight information as airports work through the backlog caused by this tech glitch.
Edit: This news article has been updated with inputs from industry experts.