Microsoft has been investing heavily in next-generation security technologies. These technologies use our ability to consolidate large sets of data and build intelligent systems that learn from that data. These machine learning (ML) systems flag and surface threats that would otherwise remain unnoticed amidst the continuous hum of billions of normal events, beyond the reach of first-generation sensors that cannot react to unfamiliar and subtle stimuli.
By augmenting expert human analysis, machine learning has driven an antimalware evolution within Windows Defender Antivirus, providing close to real-time detection of unknown, highly polymorphic malware. At the same time, machine learning has also enhanced how Windows Defender Advanced Threat Protection (Windows Defender ATP) is catching advanced attacks, including apex attacker activities that typically reside only in memory or are camouflaged as events triggered by common tools and everyday applications.
In this blog post, we explore the machine learning techniques that have transformed Windows Defender ATP into a formidable solution for spotting all kinds of breach activity in the enterprise network.
Windows Defender ATP sensors and the Intelligent Security Graph
To deliver effective post-breach detection*, Windows Defender ATP uses endpoint sensors that are built into Windows 10. A notable difference between these sensors and first-gen endpoint sensors is the absence of signatures. Instead of relying on signatures, Windows Defender ATP sensors collect a generic stream of behavioral events. For example, the sensors can capture whenever a process connects to a web server and starts to drop and launch an application.
*As disclosed in June, the upcoming Fall Creators Update will integrate Windows Defender ATP closely with the rest of the Windows threat protection stack, transforming it into a comprehensive pre- and post-breach protection solution that enables enterprise customers to not only detect and respond to threats on their devices and networks but also to deliver proactive protection.
We marry data from these sensors with the Microsoft Intelligent Security Graph to trigger detections in Windows Defender ATP. For instance, in the example above, we can augment sensor data with a variety of information about the web server, including IP address reputation as well as Windows Defender SmartScreen reputation for the sites hosted on the same server. The graph can expand further to cover file prevalence as well as files with similar network activity and other shared behaviors. By referencing contextual information available through the Intelligent Security Graph, Windows Defender ATP can deliver more reliable verdicts.
The detections we build on top of our sensors and graph data can range from simple pinpoint detections that identify specific malicious behavior to more complex heuristics. For example, we can identify the use of a command-line parameter associated with a particular hacking tool or whenever a browser is downloading and executing a binary from a low-reputation website. And, of course, we use full-fledged machine learning to spot subtler breach activity.
The role of machine learning
Human analysts are extremely capable of carving out heuristics that alert on breach activities based on their expertise. However, an analyst can consider only a limited set of signals when creating heuristic rules. By taking into account thousands of signals, ML can slice through data more precisely while being guided by manually created heuristics. Based on our analysis of actual alerts, our ML technologies are at least 20% more precise than manually crafted heuristics.
Machine learning technologies are also able to operate with more generic artifacts. As a result, ML technologies can generalize from various shades of data to detect new and previously unseen threats. Our ML models optimize the use of the vast amounts of data and computational resources available to Windows Defender ATP.
Employing expert classifiers
Windows Defender ATP ML systems are composed of numerous models or classifiers operating together to make detection decisions. These decisions result in the identification of malicious entities and activities, including malicious processes, malicious scripts, social engineering and exploitation involving Microsoft Office, and even ransomware attacks.
Our ML models combine state-of-the-art feature engineering with a wide range of ML algorithms. We use neural networks, which learn predictions from sets of objects, their weighted characteristics, and the relationships among those characteristics. We also leverage ensembles of decision trees, in which successive layers of trees correct the errors of earlier ones to arrive at high-performing predictions.
Some of our models observe a broad set of behaviors, while other models are trained to be “expert classifiers” in particular areas, such as registry and memory activities. All these ML models make layers of decisions about whether observed behaviors are malicious or benign. Windows Defender ATP then uses numeric scores from the models to calculate probabilities and decide whether to raise alerts.
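To make the layered decision process concrete, here is a minimal sketch of how per-area "expert" scores might be fused into an alert probability. The expert functions, event names, weights, and threshold are all invented for illustration; real Windows Defender ATP classifiers are trained models, not hand-written rules like these.

```python
import math

# Hypothetical expert classifiers: each scores one behavioral area.
# The event names and scoring rules are illustrative stand-ins for
# trained models.

def registry_expert(events):
    # Score rises with the number of autostart registry writes.
    return min(1.0, 0.3 * sum(1 for e in events if e == "reg_autostart_write"))

def memory_expert(events):
    # Score rises with cross-process memory writes.
    return min(1.0, 0.5 * sum(1 for e in events if e == "cross_proc_write"))

def combine(scores, weights, bias=-2.0):
    # Logistic fusion of weighted expert scores into a probability.
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

def should_alert(events, threshold=0.5):
    scores = [registry_expert(events), memory_expert(events)]
    return combine(scores, weights=[2.5, 3.0]) >= threshold

benign = ["file_read", "net_connect"]
suspicious = ["reg_autostart_write", "cross_proc_write", "cross_proc_write"]
print(should_alert(benign), should_alert(suspicious))  # False True
```

The key idea the sketch preserves is that no single expert decides alone: the numeric scores are combined into a probability, and only the fused score is compared against the alerting threshold.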
Delivering contextual information
In general, ML models can provide only limited contextual information, such as why an alert has been raised. If available, such contextual information could support SecOps personnel when assessing incident severity and invoking the appropriate response. Context also serves as an initial pointer that guides succeeding investigation work.
Individual ML models can provide some context, but mostly at a very high level. For example, the models described earlier can convey whether an organization is dealing with a malicious process as opposed to a socially engineered attack or a document exploit. Windows Defender ATP augments powerful ML models with contextual information that enables SecOps personnel to hunt for more artifacts and determine the actual scope and breadth of an incident. It can provide information about persistence mechanisms and connections to specific IP addresses. Windows Defender ATP delivers context by surfacing the expert classifiers that voted for an alert while highlighting the high-level behavior that contributed to the alert decision.
In Figure 1, the ML alert identifies a suspicious file and shows the process behavior—memory activity, in particular—and structural signals in the file that led the classifier to flag the file as suspicious. This information can be used to conduct a targeted investigation for the memory activity that is indicative of exploitation, cross-process injection, or both.
Figure 1. Machine learning alert with contextual information
Supervised machine learning and feature engineering
We do employ unsupervised ML methods to identify anomalies on the network, such as abnormal user activity. However, supervised machine learning models constitute the majority of our ML algorithms.
A supervised ML model or classifier is created from a set of examples for which the ground truth class (or the “label”) is known. The goal is to be able to generalize and assign correct labels to new and previously unseen files, emails, processes, events, and all kinds of entities. When assessing supervised classifiers, we focus on their performance while handling these unknown entities.
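As a toy illustration of this supervised setup, the sketch below trains a tiny logistic-regression classifier on labeled examples and then labels a query it has not seen in that exact pairing. The features and labels are invented; production classifiers use far richer data and algorithms.

```python
import math

# Toy supervised classifier: learn from examples with known labels
# ("ground truth"), then assign labels to new entities.
# Logistic regression trained by plain gradient descent.

def train(X, y, lr=0.5, epochs=500):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
    return 1 if p >= 0.5 else 0

# Invented training set: [injects_code, writes_autostart] -> malicious?
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]  # in this toy data, code injection marks the malicious class
w, b = train(X, y)
print(predict(w, b, [1, 0]))  # expected: 1
```

What matters here is the division of roles: the labels supply the ground truth during training, and the measure of success is how well the trained model labels entities outside that training set.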
While ML systems make decisions regarding real-world entities, such as emails (is this spam?) and images (does it show a cat, a dog, or something else?), they are typically built on algorithms that operate on features. Therefore, to apply ML techniques, we need to convert our entities of interest to features in a process known as feature engineering. When working with spam mail, for example, a feature would be the number of identical emails received from the same sender.
Feature engineering can be conducted by relying on the understanding of domain experts. Or, as in the case of recently celebrated deep-learning methods, the process can also leverage large volumes of raw data and computational power to learn appropriate representations. For example, a deep learner can use billions of emails to learn the concepts that represent spam. Both these feature engineering approaches—expert engineering and deep-learning—are used by Windows Defender ATP ML.
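To make the expert-engineering side of this tangible, here is a sketch of turning an email into features for the spam example above, including the "identical emails from the same sender" count mentioned in the text. The field names and word list are hypothetical, not a real schema.

```python
from collections import Counter

# Hypothetical expert-engineered features for the spam example.
# "sender" and "body" are illustrative field names.

def featurize(email, mailbox):
    # The feature from the text: identical messages already received
    # from the same sender.
    dup_from_sender = sum(
        1 for m in mailbox
        if m["sender"] == email["sender"] and m["body"] == email["body"]
    )
    words = email["body"].lower().split()
    counts = Counter(words)
    return {
        "dup_from_sender": dup_from_sender,
        "num_words": len(words),
        # Invented example of a domain-knowledge feature.
        "pct_spammy_words": sum(counts[w] for w in ("free", "winner")) / max(len(words), 1),
    }

mailbox = [{"sender": "a@x", "body": "You are a winner"}] * 3
email = {"sender": "a@x", "body": "You are a winner"}
print(featurize(email, mailbox))
```

A deep-learning pipeline would instead learn such representations directly from raw message data; both routes end in the same place, a numeric feature vector the downstream algorithm can consume.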
The application of ML to cybersecurity presents a unique challenge because human adversaries actively try to avoid detection by obfuscating identifiable traits. We take on this challenge through a multipronged approach. First, we build our ML models on top of behavioral traits that human adversaries are unable to vary easily. For example, while malware can be polymorphic—its many static properties can easily be modified to evade detection—it still needs to utilize a limited number of persistence mechanisms. Second, we retrain our ML models using fresh data constantly, helping ensure that they generalize based on activity currently occurring in the wild. And lastly, we employ a set of graders that validate alerts raised by ML models to help uncover potential misses and ensure that the alerts pass a high bar for precision.
Process behavior trees: Capturing software behavior
How do we convert various software behaviors to features that our ML algorithms can crunch?
Our observation is that behaviors of a software process are defined not only by its own actions but also by the actions of descendant processes and other related processes. Moreover, and this is particularly important for malicious processes, many of the actions associated with process execution are performed by other processes that have been injected with malicious code.
To address these observations, we introduced process behavior trees in Windows Defender ATP ML, encapsulating all actions and behaviors exhibited by a process and its descendants, whether related through process creation or memory injection. An example of a process behavior tree for malware execution is shown in Figure 2.
Figure 2. Process behavior tree with both spawned processes and processes with injected code
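A minimal sketch of the data structure just described: one node per process, children linked either by process creation or by code injection, and a method that attributes the behavior of the whole tree back to the root. The event names and process names are illustrative, not the real sensor schema.

```python
from collections import Counter

class ProcessNode:
    def __init__(self, name):
        self.name = name
        self.events = []    # behavioral events observed for this process
        self.children = []  # (relation, node) pairs

    def spawn(self, name):
        child = ProcessNode(name)
        self.children.append(("created", child))
        return child

    def inject(self, name):
        child = ProcessNode(name)
        self.children.append(("injected", child))
        return child

    def all_events(self):
        # Aggregate events over the whole tree, so behavior performed by
        # injected processes is attributed to the originating process.
        bag = Counter(self.events)
        for _, child in self.children:
            bag.update(child.all_events())
        return bag

root = ProcessNode("dropper.exe")
root.events.append("net_connect")
ps = root.spawn("powershell.exe")
ps.events.append("script_exec")
victim = ps.inject("explorer.exe")
victim.events.append("reg_autostart_write")
print(root.all_events())
```

The aggregated event bag (or features derived from it) is what a classifier would consume, which is exactly why injection-based evasion does not hide behavior from the tree: the injected process's actions still roll up to the root.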
Training ML models with behavioral data poses additional challenges stemming from the collection of training examples. Windows Defender ATP uses a variety of sources with millions of malicious files of different types, such as PE, documents, and scripts. We also collect training examples from non-file activities, including exploitation techniques launched from compromised websites or behaviors exhibited by in-memory or file-less threats. We build training sets based on malicious behaviors observed in the wild and normal activities on typical machines. We augment that with data from controlled detonations of malicious artifacts. Of course, the Windows Defender ATP sensors provide all the necessary data and insights without the use of signatures.
In the process of training ML models, it is quite common to split the labeled data into train and test sets—the model that best extrapolates from train to test data is selected. Such a random split of data may not be sufficient in the cybersecurity domain. In Windows Defender ATP, we aim to be ahead of apex attackers and are aggressively exploring models that generalize well. For example, we partition labeled data by time of arrival and malware family, selecting the best performing models for detecting previously unseen malware families and advanced persistent threats (APTs).
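The partitioning idea can be sketched in a few lines: hold out entire malware families, or all samples newer than a cutoff, so model selection rewards generalization to families the model never saw. The sample records and field names below are invented for illustration.

```python
# Sketch of evaluation splits that go beyond a random split.
# "family" and "first_seen" are illustrative field names.

def split_by_family(samples, holdout_families):
    # Entire families go to the test set, never just some of their samples.
    train = [s for s in samples if s["family"] not in holdout_families]
    test = [s for s in samples if s["family"] in holdout_families]
    return train, test

def split_by_time(samples, cutoff):
    # Train on older samples, test on newer ones, mimicking deployment.
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test

samples = [
    {"family": "Kovter", "first_seen": 10},
    {"family": "Kovter", "first_seen": 20},
    {"family": "Hancitor", "first_seen": 30},
]
train, test = split_by_family(samples, holdout_families={"Hancitor"})
print(len(train), len(test))  # 2 1
```

A model that scores well under these splits has, by construction, detected samples from a family absent from its training data, which is the property that matters against previously unseen threats.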
Apart from enriching detection information, contextual data available to Windows Defender ATP through the Microsoft Intelligent Security Graph also augments the process behavior trees. When Windows Defender ATP flags a process tree—let’s say a tree for a PE file that opens a command-line shell connecting to a remote host—our systems augment this observation with various contextual signals, such as the prevalence of the file, the prevalence of the host, and whether the file was observed in Office 365. Windows Defender ATP classifiers consider these contextual signals before arriving at a decision to raise an alert.
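As a sketch of that augmentation step, the snippet below merges contextual signals from a reputation lookup into a behavioral feature dictionary before classification. The lookup table stands in for the Microsoft Intelligent Security Graph; all keys, hashes, and thresholds here are invented for illustration.

```python
# Hypothetical context store standing in for graph lookups
# (file prevalence, Office 365 sightings, and so on).
CONTEXT = {
    "ab12cd": {"file_prevalence": 3, "seen_in_office365": False},
}

def augment(behavior_features, file_hash):
    ctx = CONTEXT.get(file_hash, {"file_prevalence": 0, "seen_in_office365": False})
    features = dict(behavior_features)
    features["file_prevalence"] = ctx["file_prevalence"]
    features["seen_in_office365"] = ctx["seen_in_office365"]
    # Low prevalence combined with shell-spawning behavior is more
    # suspicious than either signal alone (illustrative rule).
    features["rare_and_spawns_shell"] = (
        ctx["file_prevalence"] < 10 and behavior_features.get("spawns_shell", False)
    )
    return features

print(augment({"spawns_shell": True}, "ab12cd"))
```

The point of the sketch is the ordering: contextual signals are joined to the behavioral observation first, and only the augmented feature set reaches the classifier that decides whether to raise an alert.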
Detecting suspicious PowerShell activities, code injection, and malicious documents
Machine learning technologies enable Windows Defender ATP to generically detect all kinds of advanced attack methods. In the following sections, we explore how these ML technologies detect attacks involving PowerShell scripts, code injection, and polymorphic documents that launch malicious code.
Attackers often use PowerShell, a scripting tool provided with Windows, to perform tasks without introducing malicious binaries, which can be caught by signature-based sensors. PowerShell also draws attackers because malicious payloads stored in scripts are generally easier to maintain and alter for polymorphism. Without relying on signatures, Windows Defender ATP ML detects suspicious PowerShell behaviors, including behaviors exhibited during a Kovter malware attack.
Figure 3. Detection of suspicious PowerShell behavior exhibited during a Kovter attack
Code injection and in-memory attacks
Kovter also uses in-memory or file-less attack methods to stay extremely stealthy. These methods generally help attackers evade signature-based scanners and reduce the chances of leaving forensic evidence. To stay persistent in memory, Kovter uses PowerShell scripts that inject malicious code into other processes.
Windows Defender ATP sensors provide visibility into various memory events, including events related to the Kovter code injection. ML technologies process these events to uncover Kovter activity and similar activities, flagging them as abnormal and likely malicious.
Documents embedded with malicious code
Windows Defender ATP ML also detects documents embedded with malicious macros as they trigger suspicious PowerShell and Microsoft Word behaviors. ML detects this attack method based on behavior signals available only at the time of execution. In contrast, most signature-based technologies are unable to stop this method, which uses the normal processes PowerShell.exe and Winword.exe. Documents themselves are also generally easy to alter for polymorphism.
Figure 4. Detections of suspicious PowerShell and Microsoft Word behavior triggered by a malicious document
Windows Defender ATP ML can also detect suspicious documents used by Chanitor malware (also known as Hancitor), generically flagging suspicious behaviors, including memory injection activities. These ML detections include enough context for SecOps personnel to understand why the documents have been flagged. Like many crafted malicious documents, Chanitor documents are often capable of bypassing signature-based solutions.
Figure 5. Generic behavior-based detection of a Hancitor document
Conclusion: Enhanced behavioral breach detection with machine learning
Behavioral data is a great basis for robust, generic detections of malicious cyber activities. This data is made available to Windows Defender ATP by sensors built into Windows 10. Windows Defender ATP converts these behavioral events into sets of components or features, such as process behavior trees, that can be consumed by powerful machine learning technologies. It also leverages the Microsoft Intelligent Security Graph to augment collected behaviors with important contextual information while applying Microsoft machine learning algorithms, delivering state-of-the-art detection of advanced persistent threats (APTs) and the cyberattacks they enable.
For more information about Windows Defender ATP, check out its features and capabilities and read about why a post-breach detection approach is a key component of any enterprise security stack. Several features planned for release in the Fall Creators Update will be available to all users as part of the public preview.
Windows Defender ATP is built into the core of Windows 10 Enterprise and can be evaluated free of charge.
Shay Kels and Christian Seifert
Windows Defender ATP Research