Managing asset inventory in Office 365

12/21/2017

In Office 365, servers are continuously provisioned and destroyed as the service is upgraded and scaled to meet customer demand. To assess the coverage of our security monitoring and patch management processes, we needed an asset inventory system that met the following criteria:

The system must report the current state of the fleet, accurate to within the last hour
The system must allow engineers to ask new questions of the fleet without a redeployment
The system must handle assets that have been torn down or renamed without returning stale or “ghost” results
The system must be able to identify assets which are online but for which state information could not be returned
The system must scale to many hundreds of thousands of machines and dozens of queries while maintaining performance and reliability

We met this challenge by deploying a lightweight agent to each node and an Azure web service which stores data in Cosmos DB and Redis.

Agent deployment and check-in

The inventory agent is installed on each physical and virtual machine during deployment.

When the agent starts up, it looks in the Windows registry for a unique ID that identifies the machine. If one does not exist, it generates one and stores it. This ID ensures that each machine is associated with only one inventory entry even if it is renamed or moved between Active Directory domains.
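
The get-or-create behavior described above can be sketched as follows. This is a minimal illustration in Python, not the agent's actual code: the `store` mapping stands in for the Windows registry, the value name `InventoryMachineId` is a hypothetical placeholder, and the use of a random UUID is an assumption (the source does not say how the ID is generated).

```python
import uuid

# Hypothetical value name; the source does not say where the ID is stored.
ID_KEY = "InventoryMachineId"

def get_or_create_machine_id(store: dict) -> str:
    """Return the persisted machine ID, generating one on first run.

    `store` stands in for the registry: any persistent key/value mapping.
    """
    machine_id = store.get(ID_KEY)
    if machine_id is None:
        # Assumption: a random UUID, independent of machine name or domain.
        machine_id = str(uuid.uuid4())
        store[ID_KEY] = machine_id
    return machine_id
```

Because the ID is derived from neither the machine name nor the domain, a rename or domain move leaves it unchanged, which is the property the inventory relies on.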

The agent checks in with the inventory service hourly. At startup, the agent sleeps for a random interval between 0 and 60 minutes before the first check-in to ensure that requests to the Azure web service are evenly distributed across the fleet.
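
As a rough sketch of this schedule, assuming a uniform random delay and a fixed hourly interval afterwards (the function names here are illustrative, not taken from the agent):

```python
import random

CHECK_IN_INTERVAL_SECONDS = 60 * 60  # hourly check-in, per the text

def initial_delay_seconds(rng: random.Random) -> float:
    """Pick a uniformly random startup delay between 0 and 60 minutes."""
    return rng.uniform(0, CHECK_IN_INTERVAL_SECONDS)

def check_in_times(start: float, delay: float, count: int) -> list[float]:
    """Times of the first `count` check-ins for an agent started at `start`."""
    return [start + delay + i * CHECK_IN_INTERVAL_SECONDS for i in range(count)]
```

Since each agent keeps its own offset within the hour, machines that were all started at the same moment (for example, during a mass deployment) still spread their requests across the full 60 minutes.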

During each check-in, the agent:

Requests a list of WMI and registry queries from the Azure web service
Performs those queries locally
Reports machine metadata and query results to the web service
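
The three steps above can be sketched as a single function. The callables `fetch_queries`, `run_query`, and `report` are hypothetical stand-ins for the web service calls and the local WMI/registry access, and the shape of the reported payload is an assumption:

```python
from typing import Callable

def check_in(fetch_queries: Callable[[], list],
             run_query: Callable[[dict], list],
             report: Callable[[dict], None],
             metadata: dict) -> None:
    """One check-in: fetch queries, run them locally, report the results."""
    results = {}
    errors = {}
    for query in fetch_queries():
        try:
            results[query["Id"]] = run_query(query)
        except Exception as exc:
            # Record the failure and keep going so one bad query
            # does not block the rest of the check-in.
            errors[query["Id"]] = str(exc)
    report({"metadata": metadata, "results": results, "errors": errors})
```

Fetching the query list on every check-in is what lets engineers ask new questions of the fleet without redeploying the agent.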

Inventory queries

Engineers add WMI and registry queries to the system using PowerShell cmdlets. These queries are stored in Azure Table storage using the query type as the partition key and a query ID as the row key.

A WMI query is defined by the query statement, the WMI namespace under which to perform the query, and a list of property names to return values for:

Id	antimalware_details
Query	SELECT * FROM AntiMalwareHealthStatus
Namespace	root\Microsoft\SecurityClient
Properties	AntivirusEnabled, AntivirusSignatureAge, AntivirusSignatureVersion, …

Likewise, a registry query is defined by the registry hive, the registry path, and a list of value names to return values for:

Id	lsass_ppl
Hive	HKLM
Query	SYSTEM\CurrentControlSet\Control\Lsa
ValueNames	RunAsPPL
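
To illustrate how such a definition might map onto the storage scheme described earlier (query type as partition key, query ID as row key), here is a small sketch. `PartitionKey` and `RowKey` are the standard Azure Table storage key properties; the type string `"registry"` and the helper name are assumptions.

```python
def to_table_entity(query_type: str, definition: dict) -> dict:
    """Build a table entity keyed by query type (partition) and query ID (row)."""
    entity = {k: v for k, v in definition.items() if k != "Id"}
    entity["PartitionKey"] = query_type
    entity["RowKey"] = definition["Id"]
    return entity

# The registry query from the example above.
registry_query = {
    "Id": "lsass_ppl",
    "Hive": "HKLM",
    "Query": r"SYSTEM\CurrentControlSet\Control\Lsa",
    "ValueNames": "RunAsPPL",
}

entity = to_table_entity("registry", registry_query)
```

With this layout, all queries of one type share a partition, so the service can fetch them together when an agent checks in.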

Registry paths may contain wildcards which are expanded on the agent to handle cases where a unique ID (such as a network interface ID or certificate thumbprint) is part of the path:

SYSTEM\CurrentControlSet\services\NetBT\Parameters\Interfaces\*
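
A minimal sketch of wildcard expansion, assuming the registry can be modeled as a nested dictionary of subkey names. The interface key names used in the usage example are made up, and the real agent's matching rules are not described in the source:

```python
from fnmatch import fnmatchcase

def expand_path(tree: dict, path: str) -> list[str]:
    """Expand `*` wildcards in a backslash-separated path against `tree`.

    `tree` is a nested dict of subkey names standing in for a registry hive.
    """
    matches = [("", tree)]
    for segment in path.split("\\"):
        next_matches = []
        for prefix, node in matches:
            for name, child in node.items():
                # Registry key names are case-insensitive.
                if fnmatchcase(name.lower(), segment.lower()):
                    full = name if not prefix else prefix + "\\" + name
                    next_matches.append((full, child))
        matches = next_matches
    return [full for full, _ in matches]
```

For example, if `Interfaces` has the hypothetical subkeys `Tcpip_{A}` and `Tcpip_{B}`, the path above expands into two concrete paths, and the query is then evaluated once per path.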

Inventory results

WMI and registry results are returned as a list of result instances where each instance is a dictionary of name/value pairs. For example, consider a WMI query that selects all IP addresses assigned to the machine:

SELECT IPAddress FROM Win32_NetworkAdapterConfiguration WHERE IPEnabled = 'true'

This returns one result instance per network adapter, each of which contains one key/value pair for the IPAddress property:

[
    {"IPAddress": ["192.168.1.14", "fe80::7cbe:6ae:becf:93ad"]},
    {"IPAddress": ["10.0.0.14", "fe80::e14e:dbd6:df43:cfbc"]}
]
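
As a small illustration of working with this shape, the following sketch collects every value reported for one property across all result instances. The helper name `values_for` is illustrative:

```python
import json

results = json.loads("""
[
    {"IPAddress": ["192.168.1.14", "fe80::7cbe:6ae:becf:93ad"]},
    {"IPAddress": ["10.0.0.14", "fe80::e14e:dbd6:df43:cfbc"]}
]
""")

def values_for(instances: list[dict], name: str) -> list:
    """Collect every value reported for `name` across all result instances."""
    values = []
    for instance in instances:
        value = instance.get(name)
        if isinstance(value, list):
            values.extend(value)  # multi-valued property such as IPAddress
        elif value is not None:
            values.append(value)
    return values

all_addresses = values_for(results, "IPAddress")
```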

In addition to reporting these query results to the web service, the inventory agent reports machine metadata as well:

The machine’s name, domain, and IP addresses
The machine’s uptime and OS image age in hours
The agent version
The time it took to perform the WMI and registry queries
Any exceptions that occurred when executing the queries

Result storage

Machine metadata is stored in Azure Cosmos DB. The unique ID associated with the machine serves as both the ID and the partition key to ensure that updates from the same machine never generate multiple entries. Cosmos DB automatically indexes property values as documents are inserted or updated, allowing engineers to rapidly locate a machine given an IP address, machine name, or range of uptime.
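
The effect of using the machine ID as both document ID and partition key can be sketched with an in-memory stand-in for the container. `id` is the document ID property that Cosmos DB requires; the partition key property name `machineId` and both helper functions are assumptions made for illustration.

```python
def build_metadata_document(machine_id: str, metadata: dict) -> dict:
    """Build a metadata document whose ID and partition key are the machine ID."""
    doc = dict(metadata)
    doc["id"] = machine_id
    doc["machineId"] = machine_id  # hypothetical partition key property
    return doc

def upsert(container: dict, doc: dict) -> None:
    """Insert or overwrite, keyed by (partition key, id) like an upsert."""
    container[(doc["machineId"], doc["id"])] = doc
```

Two check-ins from the same machine, even with a different name in the second one, land on the same key and therefore leave a single entry.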

However, we encountered scale limits when we used Cosmos DB to store query results. As the number of registry and WMI queries increased, our upsert operations were impacted by throttling because our workload put O(M*N) demands on the indexing system, where M is the number of machines checking in each minute and N is the number of result instances and properties that are returned by each machine.

Instead, we built our own inverted index in Azure Redis Cache. We use a set to track the values observed for each WMI property or registry value name – one entry per distinct value. For each value, we create another set to record the machines that returned that value. If a machine does not return a result for a given query, we track that too. This allows us to identify machines that are missing security-critical processes.
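
The structure of this inverted index can be sketched in memory, with Python sets standing in for Redis sets. The key naming scheme (`values:…`, `machines:…`, `noresult:…`) is invented for this example; the source does not describe the actual key layout.

```python
from collections import defaultdict

class InvertedIndex:
    """In-memory stand-in for the Redis sets described above."""

    def __init__(self) -> None:
        self.sets = defaultdict(set)

    def record(self, query_id: str, name: str, value, machine_id: str) -> None:
        # Set of distinct values observed for this property.
        self.sets[f"values:{query_id}:{name}"].add(str(value))
        # Set of machines that returned this particular value.
        self.sets[f"machines:{query_id}:{name}:{value}"].add(machine_id)

    def record_no_result(self, query_id: str, machine_id: str) -> None:
        # Machines that checked in but returned nothing for the query.
        self.sets[f"noresult:{query_id}"].add(machine_id)

    def machines_with(self, query_id: str, name: str, value) -> set:
        """Return the machines that reported `value` for `name`."""
        return set(self.sets.get(f"machines:{query_id}:{name}:{value}", set()))
```

Each check-in touches only the sets for the values that machine actually reported, so the cost of an update no longer depends on a general-purpose index over every property.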

We incorporate the hour (0-23) into each key so that we can update the current hour’s results while locating last hour’s results. When engineers need to identify machines with a specific value, we union the current hour’s results and last hour’s results to ensure that every machine is represented, even if it has not yet checked in this hour. We apply a 2-hour expiration to each Redis set to ensure that memory use stays bounded.
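
A sketch of the hour-bucketed keys, assuming the hour is taken in UTC and appended as a suffix (both are assumptions; the source only says that the hour is part of the key):

```python
from datetime import datetime, timedelta, timezone

SET_TTL_SECONDS = 2 * 60 * 60  # 2-hour expiration, per the text

def hourly_key(base_key: str, when: datetime) -> str:
    """Append the UTC hour (0-23) so each hour gets its own set."""
    return f"{base_key}:{when.astimezone(timezone.utc).hour}"

def keys_to_union(base_key: str, now: datetime) -> list[str]:
    """Keys for the current hour and the previous hour."""
    return [hourly_key(base_key, now),
            hourly_key(base_key, now - timedelta(hours=1))]
```

At 00:30 UTC, for instance, a lookup would union the sets for hours 0 and 23. The 2-hour expiration also keeps the buckets from colliding: a set written during hour 14 has expired long before hour 14 comes around again the next day.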

We take advantage of async support in the StackExchange.Redis library to pipeline our requests, allowing a small number of Azure web roles to perform roughly 30k Redis operations per second.

Locating missing agents

When assessing the health of the fleet, we not only need to identify machines that are reporting an unhealthy result – we also need to identify machines that are online but not running the inventory agent. We solved this by having the inventory agents listen for incoming connections on a TCP port, and having them periodically perform a limited port scan against their neighbors on the same /24 subnet. Each neighbor may be in one of three states:

Healthy: It accepts connections on the inventory agent port
Zombie: It does not accept connections on the inventory agent port, but it does accept connections on well-known Windows ports like SMB or RDP
Offline: It does not accept connections on the inventory agent port or on the other well-known Windows ports

Zombie IP addresses are tracked in Redis for remediation.
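
The three states can be expressed as a small classification function. This is a sketch under stated assumptions: `AGENT_PORT` is a hypothetical value (the source does not give the port number), the probe is injected as a callable so the logic can be exercised without a network, and ports 445 and 3389 are the standard SMB and RDP ports.

```python
from enum import Enum
from typing import Callable, Iterable

class NeighborState(Enum):
    HEALTHY = "healthy"
    ZOMBIE = "zombie"
    OFFLINE = "offline"

AGENT_PORT = 5000            # hypothetical; the source does not give the port
WINDOWS_PORTS = (445, 3389)  # SMB and RDP

def classify(address: str,
             can_connect: Callable[[str, int], bool],
             windows_ports: Iterable[int] = WINDOWS_PORTS) -> NeighborState:
    """Classify a neighbor by which ports accept a TCP connection."""
    if can_connect(address, AGENT_PORT):
        return NeighborState.HEALTHY
    if any(can_connect(address, port) for port in windows_ports):
        return NeighborState.ZOMBIE
    return NeighborState.OFFLINE
```

In a real scan, `can_connect` could wrap `socket.create_connection` with a short timeout and return `False` on `OSError`.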

To limit network traffic, each inventory agent scans its neighbors once each day. The agent uses the last octet of the machine’s current IP address modulo 24 to identify the hour in which it will perform a scan:

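// ipAddress holds the four octets of the machine's IPv4 address
// (for example, the byte array returned by IPAddress.GetAddressBytes()).
int scanHour = ipAddress[3] % 24;
bool shouldScan = (DateTime.UtcNow.Hour == scanHour);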

On a fully saturated /24 subnet, about 11 machines will scan the subnet every hour – roughly one scan every 5 minutes. On a subnet where only one agent is online, one scan per day will be performed.
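
The per-hour figure can be checked with a few lines of arithmetic. The sketch below assumes that all 254 host addresses (.1 through .254) on the /24 are in use and counts how many last octets map to each hour:

```python
from collections import Counter

# Assumes host addresses .1 through .254 are all in use on the /24.
scans_per_hour = Counter(octet % 24 for octet in range(1, 255))

busiest = max(scans_per_hour.values())   # most scans in any single hour
quietest = min(scans_per_hour.values())  # fewest scans in any single hour
```

Under that assumption every hour gets either 10 or 11 scanners (254 / 24 ≈ 10.6), which works out to one scan roughly every 5.5 to 6 minutes.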

Remediating unhealthy machines

Once the inventory system was deployed across the fleet, we used it to build a system which automatically reimages and redeploys machines that do not meet our security requirements.

First, we added inventory queries that measure the security health of the fleet:

Are our security and telemetry agents running?
Is AV running? Are the signatures fresh? Is real-time monitoring enabled?
Is the firewall running?
Is the machine uptime less than 30 days?
Is device firmware fully patched?

Next, we wrote a recurring job that fetches machines from the inventory system that do not meet these criteria and enqueues them for remediation. Remediation is handled by a separate system that reimages virtual and physical machines while maintaining high availability and a minimum set of redundant resources in each capacity unit.
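
The selection step of such a job might look like the sketch below. All field names are hypothetical, the freshness of signatures is reduced to a precomputed boolean because the source gives no threshold, and the queue is a plain list standing in for the separate remediation system.

```python
MAX_UPTIME_HOURS = 30 * 24  # 30 days

# Hypothetical field names, one per health question listed above.
REQUIRED_FLAGS = ("agents_running", "av_running", "av_signatures_fresh",
                  "av_realtime_enabled", "firewall_running", "firmware_patched")

def is_healthy(machine: dict) -> bool:
    """A machine is healthy only if every check is known and passing."""
    if any(machine.get(flag) is not True for flag in REQUIRED_FLAGS):
        return False
    uptime = machine.get("uptime_hours")
    return uptime is not None and uptime < MAX_UPTIME_HOURS

def enqueue_unhealthy(machines: list[dict], queue: list) -> None:
    """Append the ID of every machine that fails a check to `queue`."""
    for machine in machines:
        if not is_healthy(machine):
            queue.append(machine["id"])
```

Treating a missing value as a failure means that machines whose state cannot be determined are queued along with those that report an unhealthy result.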

To measure the effectiveness of this system, we created an Azure Logic App which periodically snapshots these health results and stores them in an Azure SQL database. We use Power BI to query this database for servers which have been persistently unhealthy for multiple hours. When the recurring job and the remediation system are functioning properly, the number of persistently unhealthy servers trends toward single digits even in a fleet of hundreds of thousands.

Summary

In Office 365, we use a distributed inventory system to assess the security health of the fleet. This gives our engineers the ability to rapidly ask questions of the entire environment, and it drives a system which automatically reimages machines which are in an unhealthy state or whose state cannot be determined. This helps us maintain the security health of our fleet even as it scales to multiple hundreds of thousands of servers.

We continue to improve the systems that manage and secure our fleet at cloud scale. If you have questions, suggestions, or comments, I'd love to hear from you! Feel free to reach me on Twitter at @MSwannMSFT.