7 December 2018
Recently, I was reviewing a Microsoft Advanced Threat Analytics (ATA) installation with a customer when we ran into the following symptoms:
- The ATA Center was reporting an unresponsive gateway (a domain controller)
- On the gateway involved, the Microsoft Advanced Threat Analytics Gateway service was stuck in “Starting” status
- Memory was not overused, and the ATA Center URL was reachable from the gateway
- An HTTP 500 error was recorded in the Microsoft.Tri.Gateway-Errors.log file
As all the other gateways were running fine, we first tried deleting the gateway object in the ATA Center, reinstalling the ATA Gateway and rebooting the machine. The service still refused to start, with the same errors.
Finally, we took the time to go through the different ATA Gateway logs to get the big picture, and we noticed these errors:
C:\Program Files\Microsoft Advanced Threat Analytics\Gateway\Logs\Microsoft.Tri.Gateway-Errors.log
Error [WebClient+<InvokeAsync>d__8`1] System.Net.Http.HttpRequestException: PostAsync failed [requestTypeName=StopNetEventSessionRequest] ---> System.Net.Http.HttpRequestException: Response status code does not indicate success: 500 (Internal Server Error).
C:\Program Files\Microsoft Advanced Threat Analytics\Gateway\Logs\Microsoft.Tri.Gateway.Updater.log
2018-10-19 10:22:09.9317 34888 21 Error [ManagementException] System.Management.ManagementException: Not found
at System.Management.ManagementException.ThrowWithExtendedInfo(ManagementStatus errorCode)
at System.Management.ManagementObject.Initialize(Boolean getObject)
at System.Management.ManagementClass.GetInstances(EnumerationOptions options)
at Microsoft.Tri.Gateway.Updater.Gateway.NetEventSessionManager.StopSessionAsync(StopNetEventSessionRequest request)
at async Microsoft.Tri.Gateway.Updater.Service.GatewayUpdaterWebApplication.<>c__DisplayClass3_0.<OnInitializeAsync>b__2(?)
at async Microsoft.Tri.Common.Communication.CommunicationHandler`2.InvokeAsync(?)
C:\Users\ADMINI~1\AppData\Local\Temp\ Microsoft Advanced Threat Analytics\Gateway_20181019192441.log
[12E0:12C0][2018-10-19T19:20:32]e000: Error 0x80096005: Failed authenticode verification of payload: C:\ProgramData\Package Cache\.unverified\vcRuntimeMinimum_x64
[12E0:12C0][2018-10-19T19:20:32]e000: Error 0x80096005: Failed to verify signature of payload: vcRuntimeMinimum_x64
[12E0:12C0][2018-10-19T19:20:32]e310: Failed to verify payload: vcRuntimeMinimum_x64 at path: C:\ProgramData\Package Cache\.unverified\vcRuntimeMinimum_x64, error: 0x80096005. Deleting file.
[12E0:12C0][2018-10-19T19:20:32]e000: Error 0x80096005: Failed to cache payload: vcRuntimeMinimum_x64
[0A0C:1F80][ 2018-10-19T19:20:32]e349: Application requested retry of payload: vcRuntimeMinimum_x64, encountered error: 0x80096005. Retrying...
[12E0:12C0][2018-10-19T19:20:32]e000: Error 0x80096005: Failed while caching, aborting execution.
HTTP error 500 is a generic server-side error, but in this scenario the key clue was in “Microsoft.Tri.Gateway.Updater.log”. Looking closely at that log, we noticed that a WMI “get instances” call was failing in the NetEventSessionManager.
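To get that big picture faster, the error entries of all gateway logs can be pulled into one view. The following is a minimal sketch, not an official ATA tool: the function name Get-AtaLogError and its parameters are made up for illustration, the default folder is the one shown in the excerpts above, and the pattern simply assumes that error entries contain the text "Error [" as they do in those excerpts.

```powershell
# Hypothetical helper: lists the last error entries found in the gateway logs.
function Get-AtaLogError {
    param(
        # Log folder taken from the excerpts above; adjust if ATA is installed elsewhere.
        [string]$LogDirectory = 'C:\Program Files\Microsoft Advanced Threat Analytics\Gateway\Logs',
        # Number of matching lines to keep (counted across all files).
        [int]$Last = 20
    )

    Get-ChildItem -Path $LogDirectory -Filter '*.log' |
        Select-String -Pattern '\bError \[' |
        Select-Object -Last $Last -Property Filename, LineNumber, Line
}
```

Running `Get-AtaLogError | Format-List` on the gateway would show which log file each error comes from, which makes it easier to spot that the interesting exception sits in the Updater log rather than in the Errors log.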
We tried to manually query the class with the following PowerShell command:
Get-WmiObject -Namespace root\standardcimv2 -Class "MSFT_NetEventSession" | Select Name
Result: blank output, meaning the class was either no longer registered or corrupted.
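A blank result from an instance query can be ambiguous, because a healthy class with no instances also returns nothing. One way to check the class definition itself is the Get-CimClass cmdlet. This is only a sketch: the function name Test-WmiClassRegistered is an assumption of this example, and it was not part of the original troubleshooting session.

```powershell
# Hypothetical helper: returns $true when the class definition can be read
# from the repository, $false when the lookup fails (for example "Not found").
function Test-WmiClassRegistered {
    param(
        [string]$Namespace = 'root\standardcimv2',
        [string]$ClassName = 'MSFT_NetEventSession'
    )

    # Get-CimClass reads the class definition, not its instances.
    $class = Get-CimClass -Namespace $Namespace -ClassName $ClassName -ErrorAction SilentlyContinue
    return [bool]$class
}
```

Calling `Test-WmiClassRegistered` without arguments checks the class used by the ATA Gateway updater; a result of False would be consistent with the "Not found" exception in the Updater log.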
Registering a WMI class requires an operation called “MOF recompiling”. Since the installation setup had failed to do it, and other classes might have been in the same situation, we decided to rebuild the entire WMI repository.
Note that rebuilding the repository resets the entire WMI database and recompiles every registered .MOF file listed under the following registry key:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Wbem\CIMOM -> “Autorecover MOFs”
It’s not uncommon for old third-party software to skip registering its .mof files there; in that case you must either compile them manually with the built-in mofcomp.exe or repair/reinstall the software in question.
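Before a rebuild, it can help to see which MOF files the registry value actually lists and whether they still exist on disk. The sketch below assumes the value is a multi-string list of file paths, possibly containing environment variables; the function names Get-AutorecoverMof and Test-MofEntry are invented for this example.

```powershell
# Hypothetical helper: reads the "Autorecover MOFs" list from the registry.
function Get-AutorecoverMof {
    $key  = 'HKLM:\SOFTWARE\Microsoft\Wbem\CIMOM'
    $name = 'Autorecover MOFs'
    (Get-ItemProperty -Path $key -Name $name).$name
}

# Hypothetical helper: reports, for each entry, whether the file is still present.
function Test-MofEntry {
    param([string[]]$Entry)

    foreach ($item in $Entry) {
        if ([string]::IsNullOrWhiteSpace($item)) { continue }   # skip empty entries
        # Expand variables such as %windir% before testing the path.
        $path = [Environment]::ExpandEnvironmentVariables($item)
        [pscustomobject]@{
            Path   = $path
            Exists = (Test-Path -LiteralPath $path)
        }
    }
}
```

For example, `Test-MofEntry -Entry (Get-AutorecoverMof) | Where-Object { -not $_.Exists }` would list registered files that are missing. It cannot reveal software that never registered its .mof in the first place; that still has to be handled with mofcomp.exe or a repair, as described above.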
You are on a domain controller, right? A very sensitive machine, isn’t it? How much (outdated) third-party software do you have installed on it? Let’s keep the focus on the ATA problem.
Steps used to reset the WMI repository (run from an elevated Command Prompt):
- Sc config winmgmt start= disabled
- Net stop winmgmt /y
- Winmgmt /resetrepository
- Sc config winmgmt start= auto
- Net start winmgmt
Rebuilding the WMI repository can take a few minutes, depending on the speed of the system and on the number and content of the .MOF files. Don’t stress the machine; take a two-minute break.
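For those who prefer PowerShell, the same five commands can be wrapped in a small script. This is a hedged sketch, not the procedure we ran: the function names Get-WmiResetStep and Invoke-WmiRepositoryReset are invented here, and the commands are handed to cmd.exe because in Windows PowerShell `sc` is an alias for Set-Content rather than the service control tool.

```powershell
# Hypothetical helper: the five commands from the list above, in the same order.
function Get-WmiResetStep {
    'sc config winmgmt start= disabled'
    'net stop winmgmt /y'
    'winmgmt /resetrepository'
    'sc config winmgmt start= auto'
    'net start winmgmt'
}

# Hypothetical wrapper: runs each step through cmd.exe, asking for confirmation.
function Invoke-WmiRepositoryReset {
    [CmdletBinding(SupportsShouldProcess = $true, ConfirmImpact = 'High')]
    param()

    foreach ($step in (Get-WmiResetStep)) {
        if ($PSCmdlet.ShouldProcess($env:COMPUTERNAME, $step)) {
            cmd.exe /c $step
            if ($LASTEXITCODE -ne 0) {
                # Stop here; the service start type may need to be set back by hand.
                Write-Warning ('Step failed (exit code {0}): {1}' -f $LASTEXITCODE, $step)
                break
            }
        }
    }
}
```

`Invoke-WmiRepositoryReset -WhatIf` only prints the steps without running them, which is a reasonable first run on a domain controller. If the sequence stops midway, check that the winmgmt service start type is back to automatic before leaving the machine.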
If you run the PowerShell query again, it should now be able to retrieve the class information.
Finally, we looked at the ATA Center portal and confirmed that all gateways were reporting a healthy status.
The ATA expert inside you knows that an extended gap in communication between a gateway and the ATA Center is not a good thing.
ATA detects abnormal behavior using behavioral analytics and machine learning. An unhealthy gateway means that a certain amount of information is lost for good. False-positive alerts can then be triggered and cost precious investigation time or, worse, you can miss real suspicious activity.
Troubleshooting ATA using the ATA logs