Network Forensics with Windows DNS Analytical Logging

(Co-authored by Rob Mead (Microsoft Threat Intelligence Center), Kumar Ashutosh and Vithalprasad Gaitonde (Windows DNS Server))

Overview

DNS queries and responses are a key data source for network defenders, supporting both incident response and intrusion discovery. If these transactions are collected for processing and analytics in a big data system, they enable a number of valuable security analytic scenarios. An exercise to this end was conducted with Microsoft internal DNS systems. This document outlines the approach and results so that Windows DNS customers can reproduce the outcome.

Motivation

The at-scale processing and analysis of DNS data in a big data system is a powerful capability that supports analyst investigations and the discovery of intrusions. Below is a selection of the scenarios it enables:

IOC Detection

Domain names and IP addresses are among the most common sources of Indicators of Compromise (IOCs), often referring to the Command and Control servers of attacker infrastructure. Collecting, processing and storing DNS data allows the queried domains and the resource record response data for hosts within the network to be searched for these IOCs, providing quick and accurate detection of whether the network has been impacted by an intrusion. Ongoing collection of this data also allows for powerful retrospective searches for IOCs on computer networks.
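As a rough illustration of this kind of search, the sketch below checks stored DNS records against a set of domain and IP IOCs. The `DnsRecord` schema, the function names and the sample indicators are all hypothetical and are not part of the Windows DNS logging feature; a real deployment would run an equivalent query inside its big data platform.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class DnsRecord:
    """One logged DNS transaction (hypothetical schema for illustration)."""
    timestamp: str            # time of the event, e.g. ISO 8601
    client_ip: str            # host that issued the query
    qname: str                # queried domain name
    answers: Tuple[str, ...]  # IP addresses returned in the response


def normalize(name: str) -> str:
    """Lower-case a domain name and drop any trailing root dot."""
    return name.strip().lower().rstrip(".")


def matches_domain_ioc(qname: str, ioc_domains: Set[str]) -> bool:
    """True if qname equals an IOC domain or is a subdomain of one."""
    labels = normalize(qname).split(".")
    # Test every suffix: a.b.evil.example, b.evil.example, evil.example, example
    return any(".".join(labels[i:]) in ioc_domains for i in range(len(labels)))


def find_ioc_hits(records: Iterable[DnsRecord],
                  ioc_domains: Iterable[str],
                  ioc_ips: Iterable[str]) -> List[DnsRecord]:
    """Return records whose query name or answer data matches an IOC."""
    domains = {normalize(d) for d in ioc_domains}
    ips = set(ioc_ips)
    return [
        rec for rec in records
        if matches_domain_ioc(rec.qname, domains) or ips.intersection(rec.answers)
    ]
```

Because the records are retained, the same function can be re-run over historic data whenever a new indicator is published, which is the retrospective search described above.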

Protocol Agnostic Detection

Network defenders often have access to other data sources that can be searched for IP and domain IOCs, such as web proxy and firewall logs. DNS collection provides higher-fidelity detection of these IOCs where the protocol implemented by attacker Command and Control infrastructure does not involve HTTP, or where DNS itself is being used as a covert channel.

Covert Channel Detection

DNS can be used by adversaries as a covert channel to provide remote configuration or data transfer capability to malware inside a computer network. At-scale analysis of abnormal response packets can be used to identify such covert channels.
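One common heuristic for spotting this kind of abuse, shown here only as a sketch and not necessarily the analysis used internally, is to flag transactions with unusually large responses or with long, random-looking labels in the query name. The thresholds below are illustrative assumptions that would need tuning against real traffic; 512 bytes is used because it is the classic DNS-over-UDP message limit without EDNS0.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_covert_channel(qname: str,
                              response_size: int,
                              min_label_len: int = 40,      # assumed threshold
                              min_entropy: float = 3.5,     # assumed threshold
                              max_response: int = 512) -> bool:
    """Flag a transaction as a candidate for analyst review.

    A transaction is flagged when the response is larger than max_response
    bytes, or when any label in the query name is both long and high-entropy
    (typical of encoded payloads).
    """
    if response_size > max_response:
        return True
    for label in qname.rstrip(".").split("."):
        if len(label) >= min_label_len and shannon_entropy(label) >= min_entropy:
            return True
    return False
```

A flag from a heuristic like this is a lead for investigation rather than proof of a covert channel, since legitimate services can also produce large responses or long labels.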

Adversary Tracking

Historic logging of query and response data, together with associated analysis, enables the tracking of adversary command and control infrastructure over time, where multiple domains and IP addresses are used and infrastructure is transitioned following the discovery of activity.

Analytical Logging in Windows DNS

Windows Server DNS (2012 R2 onwards) implements enhanced logging of various DNS server actions, including the logging of query and response data, with a focus on negligible performance impact.

Negligible Performance Impact of Enabling

A DNS server running on modern hardware that is receiving 100,000 queries per second (QPS) can experience a performance degradation of 5% when analytic logs are enabled. There is no apparent performance impact for query rates of 50,000 QPS and lower.

See the following TechNet article for detailed information about this feature, including how to enable it on your infrastructure:

https://technet.microsoft.com/en-us/library/dn800669.aspx

Details of Logged Data

The Analytic log type implemented through this feature contains much of the day-to-day operational detail of the DNS server. Although many types of data are recorded, including zone transfer requests, responses and dynamic updates, for forensics and threat analytics we focus on the QUERY_RECEIVED and RESPONSE_SUCCESS event types in this example. These form the core of our current internal collection and pose the biggest collection challenge because of the volume of data.

The QUERY_RECEIVED and RESPONSE_SUCCESS events that are logged contain a number of the fields that make up a query and response but, crucially, also contain the full packet data received, enabling the processing of any aspect of these objects. Here is an example response event from the Applications and Services Logs\Microsoft\Windows\DNS-Server analytic log:

 

Implementation

The logging of this data is implemented as an ETW (Event Tracing for Windows) provider. Many event types in Windows are exposed via this mechanism, which allows for high-performance logging of data and for subscription to providers via their unique GUIDs. The Microsoft-Windows-DNSServer provider is identified by the GUID {EB79061A-A566-4698-9119-3ED2807060E7}. Several Windows tools can record samples of this data, such as tracelog, which records events offered by the ETW channel and writes them to a file. The Windows Event Viewer essentially replicates this subscription model when presenting the administrator with a view of these events, writing the collected sample to a temporary file location. As an example, here is the location of the data underlying the enhanced DNS logging feature:

%SystemRoot%\System32\Winevt\Logs\Microsoft-Windows-DNSServer%4Analytical.etl

 

The event viewer acts as a browser for this file.
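Outside the Event Viewer, the full packet data carried in each event can be decoded with any DNS parser. The sketch below is a minimal example that reads the header and first question from a DNS message in standard wire format (RFC 1035). It assumes the packet data has already been extracted from the event as raw bytes, for instance by converting a hex string, and the function names are hypothetical.

```python
import struct
from typing import Tuple


def read_name(packet: bytes, offset: int) -> Tuple[str, int]:
    """Decode a (possibly compressed) domain name starting at offset.

    Returns the dotted name and the offset of the byte following the name
    in the original position (not the position a pointer jumped to).
    """
    labels = []
    resume_at = None  # where parsing continues after the first pointer
    hops = 0
    while True:
        length = packet[offset]
        if (length & 0xC0) == 0xC0:
            # Compression pointer: lower 14 bits give the target offset.
            pointer = struct.unpack_from("!H", packet, offset)[0] & 0x3FFF
            if resume_at is None:
                resume_at = offset + 2
            hops += 1
            if hops > 16:
                raise ValueError("too many compression pointers")
            offset = pointer
            continue
        offset += 1
        if length == 0:  # zero-length label terminates the name
            break
        labels.append(packet[offset:offset + length].decode("ascii", errors="replace"))
        offset += length
    return ".".join(labels), (resume_at if resume_at is not None else offset)


def parse_question(packet: bytes) -> dict:
    """Extract header fields and the first question from a DNS message."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte DNS header")
    ident, flags, qdcount, ancount, _nscount, _arcount = struct.unpack_from("!6H", packet, 0)
    if qdcount < 1:
        raise ValueError("message carries no question")
    qname, offset = read_name(packet, 12)
    qtype, qclass = struct.unpack_from("!2H", packet, offset)
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "rcode": flags & 0x000F,              # response code
        "qname": qname,
        "qtype": qtype,
        "qclass": qclass,
        "answer_count": ancount,
    }
```

For example, `parse_question(bytes.fromhex(hex_string))` would return the query name and type for a packet held as a hex string. Truncated or malformed packets will raise an exception, which a production consumer would need to handle.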

Collection of Data

One method of collecting events from Windows servers is Windows Event Collection (WEC). WEC is a mechanism built into Windows that forwards an XML representation of an event to a configured collection server, based upon a filter specifying an event identifier and selection criteria. WEC, however, can only be configured for log types of ‘Operational’. Operational events are stored in a rolled permanent location inside an .evtx file on the host. When these events are created they are also written via the Windows Event Collector service, which performs forwarding off host. For more information on Windows Event Collection, see the following article on MSDN:

https://msdn.microsoft.com/en-us/library/windows/desktop/bb427443(v=vs.85).aspx

There is an inherent overhead in logging an event in this way, which is the reason DNS query and response logging was implemented as an ‘Analytic’ type. Analytic log types do not write events via the WEC service and as such have a lower performance impact. This gives us the negligible performance impact mentioned earlier, even at a high QPS in the order of 100,000 queries per second.

For servers that do not have such high QPS needs, using WEC from an operational channel becomes a more viable option. Internal DNS servers that serve a dedicated enterprise network may have significantly lower QPS requirements (around 10,000 QPS). At these levels, collection over WEC becomes a more realistic scenario. Further, from a security analytics standpoint, the majority of queries and responses for reputable domains such as *.microsoft.com are less valuable to us and can be dropped, further reducing the effective QPS logged for WEC.
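The dropping of high-repute domains can be pictured as a simple suffix filter applied to each event before it is written for collection. The following is a minimal sketch only: the event dictionary shape, the `qname` key and the function names are assumptions made for illustration, and the suffix set would be chosen by each organisation.

```python
from typing import Dict, Iterable, Iterator, Set

# Illustrative filter list; real deployments would derive this from sample data.
HIGH_REPUTE_SUFFIXES: Set[str] = {"microsoft.com"}


def is_high_repute(qname: str, suffixes: Set[str]) -> bool:
    """True if qname is one of the listed domains or a subdomain of one."""
    name = qname.lower().rstrip(".")
    # The "." prefix prevents "notmicrosoft.com" matching "microsoft.com".
    return any(name == s or name.endswith("." + s) for s in suffixes)


def filter_events(events: Iterable[Dict[str, str]],
                  suffixes: Set[str] = HIGH_REPUTE_SUFFIXES) -> Iterator[Dict[str, str]]:
    """Yield only the events whose query name is not on the high-repute list."""
    for event in events:
        if not is_high_repute(event["qname"], suffixes):
            yield event
```

Matching on whole labels rather than raw string suffixes matters here, because an attacker could otherwise register a look-alike domain that ends with a trusted name and evade logging.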

For the internal Microsoft project, a high-performance event collector was implemented that consumes the QUERY_RECEIVED and RESPONSE_SUCCESS events from the ETW channel on DNS servers. This consumer filters out high-repute domains that are less valuable to us for security analytics and writes the remaining events to an operational log equivalent, ready for collection over normal WEC channels. The following diagram gives a high-level overview of the functionality of this tool:

Query Volume Modelling

Selecting domain names for filtering can be a balancing act that requires modelling using sample data. Front-loading too many domain filters in the tool can cause unnecessary processing, whilst letting high-volume domains through to the event writer can result in unnecessary volume and associated storage costs.

Customers can collect a sample of query and response data from the analytic logging in an .etl file from a DNS server on their network. This ETL file can be analyzed for the most frequently queried domains in terms of Second Level Domain (SLD). Taking the top SLDs, and excluding country TLDs such as .co.uk, customers can model the query volume reduction achieved by applying various SLD filters and extrapolate these results to enterprise network coverage. These figures can then be used to calculate the approximate storage costs of implementing such a solution. Alternatively, data such as the Alexa Top 100 SLDs can be taken as a starting point and refined as required to fit the needs of the enterprise.
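The modelling step can be sketched as follows, assuming the query names have already been extracted from the sample ETL file into a list of strings. The function names and the exclusion set are hypothetical; the SLD is taken naively as the last two labels of the name, which is why entries such as co.uk have to be excluded from the filter candidates.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Naive "SLDs" that are really country-level suffixes (illustrative, not exhaustive).
EXCLUDED_SLDS: Set[str] = {"co.uk"}


def second_level_domain(qname: str) -> str:
    """Return the last two labels of a query name, e.g. 'microsoft.com'."""
    labels = qname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def model_reduction(qnames: Iterable[str],
                    top_n: int,
                    excluded: Set[str] = EXCLUDED_SLDS) -> Tuple[List[str], int, float]:
    """Model the effect of filtering the top_n most queried SLDs.

    Returns the chosen SLD filters, the number of queries that would still
    be logged, and that number as a fraction of the sample total.
    """
    counts = Counter(second_level_domain(q) for q in qnames)
    total = sum(counts.values())
    # Rank by volume, skipping country-level suffixes that must not be filtered.
    candidates = [(sld, n) for sld, n in counts.most_common() if sld not in excluded]
    chosen = candidates[:top_n]
    remaining = total - sum(n for _, n in chosen)
    fraction = remaining / total if total else 0.0
    return [sld for sld, _ in chosen], remaining, fraction
```

Running this for several values of `top_n` shows where the reduction curve flattens, which helps balance filter list size against logged volume before extrapolating to the whole network.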

In our prototype, 100 SLDs were chosen for filtering based upon sample data logged on the network. This reduced the volume in our sample data set from 3,286 QPS to 136 QPS, 4.13% of the total volume. The extrapolated effective QPS rate for the whole of the network drops significantly in this scenario, to a level easily manageable by big data and WEC infrastructure.

The Pilot Deployment and Results

We worked with Microsoft IT to pilot the analytic logging and the ETW consumer/filtering tool on the Microsoft corporate DNS infrastructure. The pilot project was rolled out to 29 DNS caching servers. A snapshot of the query volume that this project was dealing with is:

The total size of raw data storage post filtering for all 29 Microsoft corporate DNS servers peaks at approximately 100 GB/day at the busiest times, and drops to around 15 GB/day during less busy periods such as weekends.

This enables a whole new way of using information related to compromised domains: identifying malicious transactions and infected machines, and thus enabling us to monitor and fortify our network.

Note: You can try out this scenario on a fully patched Windows Server 2012 R2 DNS server as well as on the Windows Server 2016 Technical Preview DNS server.