Monitoring Azure applications - Part 4

Up until now, we’ve worked through getting your Azure application set up for monitoring and gone through the process of getting it discovered.  But, just like anything else in IT (and life…), something always seems to go wrong the first time.  Or maybe you’ve set up your Azure application monitoring, and then one day you get a call from the team that runs the Azure application and they say, “Hey!  Our Azure app is down and your (insert negative adjective here) monitoring didn’t catch it.”  This post covers the ways that I’ve seen the Azure MP fail (I’m sure there are some I’ve not hit yet), how I detected each problem, and what I had to do to fix it.

Up front, I also want to point out that since Azure workflows run on management servers and on watcher nodes, the events mentioned in this post show up in the OperationsManager event log in one or both of the following places:

  • Any watcher node that is trying to access an Azure application with the certificate
  • The root management server in OpsMgr 2007 R2, or one of the management servers in the management server pool in OpsMgr 2012

Your Run As Account(s) are not set up right

In the 2nd post of this series, we talked about setting up the handshake.  There we had to set up 2 run as accounts (one to store the certificate and one to store the password used to access the certificate).  I’ve seen this fail in one of two ways.

Way#1: The password used to access the certificate is incorrect:  Unlike Windows accounts, where the UI will verify the password you enter against AD, the basic authentication account will gladly take any string you give it.  It’s only when OpsMgr tries to use that password on the certificate that it realizes something is wrong.

The way it manifests is in an event in the OperationsManager event log…

Source: Health Service Modules
Event ID: 34000
Description: Failed to load the certificate for the Windows Azure Service Management API. Please verify that a valid certificate and password are assigned to the Run As accounts used for Windows Azure monitoring.

Instance name: <InstanceName>

And an alert…

Alert Rule: Windows Azure Invalid Certificate
Alert Description: Failed to load the certificate for the Windows Azure Service Management API. Please verify that a valid certificate and password are assigned to the Run As accounts used for Windows Azure monitoring.

Instance name: <InstanceName>

I included the “Instance name: <InstanceName>” part above, because this is how you figure out which credentials you need to work on.  If you’ve only got one cert or you’ve only got one app, then you don’t have to dig too much, but for those that have multiple accounts to manage, there is some digging you have to do here, and a best practice I suggest:

  1. Get the value from <InstanceName>
  2. Move to the Monitoring section of the console
  3. Navigate to Monitoring -> Windows Azure -> Role Instance State
  4. Find the object that has the same name as the <InstanceName> and in the Detail View, note the Full Path Name
  5. The format of Full Path Name for an instance is <Hosted Service>\<Deployment>\<Role>\<Instance>, so from this you can take the value from the beginning of the path
  6. Now switch to the Authoring section of the console and navigate to Authoring -> Management Pack Templates -> Windows Azure Application
  7. Unfortunately, this is where you have to hunt and peck, because the UI does not provide any clear connection between the names of the objects and the names of the Windows Azure Applications.
    1. Unless you follow a best practice I suggest: always put the DNS prefix of the hosted service into the Name of the Windows Azure Application template that you create.  That way, you can use the “Hosted Service” from the Full Path Name above to quickly make the connection.
  8. Once you find the right Windows Azure Application, open its properties and capture the name of the run as account used under Azure Certificate Password Run As Account
  9. Now switch to the Administration section and navigate to Administration -> Run As Configuration -> Accounts
  10. Locate the run as account that you captured in step 8, above, update its password to the correct value and save the changes.
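
The lookup in the steps above can be sketched in a few lines of Python. This is only an illustration, not part of the Azure MP or of OpsMgr: the function names and the sample values are hypothetical, and it assumes the Full Path Name always has the four backslash-separated parts described above and that your template names contain the DNS prefix of the hosted service.

```python
def hosted_service_from_path(full_path_name: str) -> str:
    """Return the hosted service part of a role instance's Full Path Name.

    Assumes the format <Hosted Service>\\<Deployment>\\<Role>\\<Instance>.
    """
    parts = full_path_name.split("\\")
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"expected 4 backslash-separated parts, got: {full_path_name!r}"
        )
    hosted_service, _deployment, _role, _instance = parts
    return hosted_service


def matching_templates(hosted_service: str, template_names: list[str]) -> list[str]:
    """Return the template names that contain the hosted service name.

    The comparison ignores case; this only helps if the naming best
    practice (DNS prefix in the template name) was followed.
    """
    needle = hosted_service.lower()
    return [name for name in template_names if needle in name.lower()]
```

For example, feeding it a hypothetical path such as contoso\Production\WebRole1\WebRole1_IN_0 yields "contoso", which you can then match against the list of Windows Azure Application template names.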

You know your updates worked by looking in the event log where you first saw the 34000 events.  If you see an event that looks like the following and the value for <HostedServiceName> matches what you saw in step 5 above, then you’re back up and running:

Source:        Health Service Modules
Event ID:      34011
Description: A client for Windows Azure Storage Service has been initialized.
Subscription ID: <SubscriptionIDGuid>
Service name: <HostedServiceName>
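
If you copy event text out of the log to compare it by hand, a small helper can do the matching for you. The sketch below is an assumption-laden illustration: it expects the description in the "Key: value" line layout shown above, the function names are hypothetical, and it compares service names without regard to case.

```python
def parse_event_fields(description: str) -> dict[str, str]:
    """Collect the 'Key: value' lines of an event description into a dict.

    Lines without a colon (such as the leading sentence) are skipped.
    Only the first colon splits the line, so URL values stay intact.
    """
    fields = {}
    for line in description.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def confirms_recovery(event_id: int, description: str, hosted_service: str) -> bool:
    """True if this is a 34011 event for the hosted service being fixed."""
    if event_id != 34011:
        return False
    service = parse_event_fields(description).get("Service name", "")
    return service.lower() == hosted_service.lower()
```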

Way#2: The account(s) are not correctly configured for distribution: In the 2007 R2 timeframe, a new feature was added to Run As Accounts: “More Secure” or “Less Secure” distribution.  If you followed the steps in the 2nd post of this series, then you set things up as “More Secure”, which means that you have to explicitly pick which systems the account is authorized to be distributed to.  Unfortunately, if you ever add new management servers or start using a new watcher node, those systems may not be authorized for distribution. This manifests as an event…

Source:        HealthService
Event ID:      1108
Description: An Account specified in the Run As Profile "Microsoft.SystemCenter.Azure.RunAsProfile.Password" cannot be resolved. Specifically, the account is used in the Secure Reference Override   "<SomeLongStringHere>".  This condition may have occurred because the Account is not configured to be distributed to this computer. To resolve this problem, you need to open the Run As Profile specified below, locate the Account entry as specified by its SSID, and either choose to distribute the Account to this computer if appropriate, or change the setting in the Profile so that the target object does not use the specified Account.

Management Group:  <ManagementGroupName>
Run As Profile:  Microsoft.SystemCenter.Azure.RunAsProfile.Password 

Object name:  <HostedServiceName>

But no alert!?

So, if you followed the best practice I mentioned above and put the DNS prefix of the hosted service into the Name of the Windows Azure Application template that you created, then you can do the following:

  1. Find the right Windows Azure Application, open its properties and capture the names of both run as accounts used (Azure Certificate Password Run As Account and Azure Certificate Run As Account)
  2. Now switch to the Administration section and navigate to Administration -> Run As Configuration -> Accounts
  3. Locate the run as accounts that you captured in step 1, above.
  4. For both accounts do the following:
    1. Open the properties of the run as account and switch to the Distribution tab
    2. Confirm More secure is selected and click the Add button
    3. Add all management servers and all systems that you use as watcher nodes
    4. Click OK to apply the changes
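
Before clicking through the Distribution tab for each account, it can help to work out which systems are actually missing. The following sketch simply compares lists that you would gather by hand; it does not query OpsMgr, and the function name and server names are hypothetical.

```python
def missing_from_distribution(
    management_servers: list[str],
    watcher_nodes: list[str],
    authorized: list[str],
) -> list[str]:
    """Return the systems that need the account but are not authorized for it.

    Names are compared in lower case, on the assumption that computer
    names are not case-sensitive.
    """
    required = {name.lower() for name in management_servers}
    required |= {name.lower() for name in watcher_nodes}
    allowed = {name.lower() for name in authorized}
    return sorted(required - allowed)
```

Run it once per run as account (certificate and password), since both need to reach every management server and watcher node.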

Just like the previous case, you know your updates worked by looking in the event log where you first saw the 1108 events.  If you see an event that looks like the following and the value for <HostedServiceName> matches what you saw in the 1108 event, then things are back in order:

Source:        Health Service Modules
Event ID:      34011
Description: A client for Windows Azure Storage Service has been initialized.
Subscription ID: <SubscriptionIDGuid>
Service name: <HostedServiceName>

Your watcher node errors out when trying to access the Azure Service Management API

The next group of problems comes when the watcher node (aka proxy agent) tries to reach out and talk to Windows Azure.  There are three distinct types of mishaps here.

Way#1: Typos in the Windows Azure Application template: I don’t know about you, but I constantly mistype things.  Half of the reason I take so much time to do blog posts is because I’m fixing typos :).  So, it’s no surprise that I’ve accidentally mistyped some details into the Windows Azure Application template.  When I do this, it messes up the Azure MP because the Azure MP needs that information to query the Azure Service Management API.

This shows up as an event in the event log…

Source:        Health Service Modules
Event ID:      34004
Description: REST operation against Windows Azure Service has failed because the specified resource in the URI does not exist.

Request method: GET
Request URI: https://management.core.windows.net/<SubscriptionIDGuid you entered>/services/hostedservices/<DNSPrefix you entered>?embed-detail=true
Response status code: NotFound
Response status description: Not Found
Error code: ResourceNotFound
Error message: The hosted service does not exist.
Error XML: <Error xmlns="https://schemas.microsoft.com/windowsazure" xmlns:i="https://www.w3.org/2001/XMLSchema-instance"><Code>ResourceNotFound</Code><Message>The hosted service does not exist.</Message></Error>
Exception: The remote server returned an error: (404) Not Found.

But again, no alert!?  In order to track down the typo, you do the following:

  1. Capture the <DNSPrefix you entered> from the event
  2. Using that string, try full and partial searches of your Windows Azure Application list until you find the right one.
    1. Again, this assumes that you’re following the best practice I mentioned above of putting the DNS prefix of the hosted service into the Name of the Windows Azure Application template.
  3. Open its properties and reconcile the DNS prefix of the hosted service and the Subscription ID values, with what is in the Azure Portal.
  4. Fix any typos and click OK to apply the changes

The same type of 34011 event mentioned above is the clear indicator that you’ve fixed the typo.  Likewise, you can keep refreshing the Hosted Service State view in the Monitoring section of the console until you see the new hosted service show up there.
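
To pull the values you entered back out of a 34004 event, you can parse the Request URI line rather than reading it by eye. This is a minimal sketch that assumes the URI has the shape shown in the event above; the function names are hypothetical, and the GUID check only tells you the subscription ID is well formed, not that it matches what is in the Azure Portal.

```python
import re
import uuid

# Shape of the Request URI as it appears in the 34004 event text.
_URI_PATTERN = re.compile(
    r"^https://management\.core\.windows\.net/"
    r"(?P<subscription>[^/]+)/services/hostedservices/"
    r"(?P<dns_prefix>[^/?]+)"
)


def entered_values(request_uri: str) -> tuple[str, str]:
    """Return (subscription ID, DNS prefix) as they appear in the Request URI."""
    match = _URI_PATTERN.match(request_uri.strip())
    if match is None:
        raise ValueError(f"unexpected Request URI format: {request_uri!r}")
    return match.group("subscription").strip(), match.group("dns_prefix").strip()


def looks_like_guid(value: str) -> bool:
    """True if the value parses as a GUID; catches dropped or extra characters."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True
```

A subscription ID that fails looks_like_guid is an obvious typo; one that passes still has to be compared against the portal by hand.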

Way#2: I need to use a proxy server or I typo’d the proxy server: OK, to be honest I’ve not encountered this one, but I know it can happen.  What I’m showing you below is actually from a management group where I just turned off the network connection to simulate a lack of connectivity.  The symptoms should look very similar though.  You would see the following events…

Source:        Health Service Modules
Event ID:      34016
Description:  The remote name could not be resolved: 'management.core.windows.net'. Please check the connection to the remote server or the proxy setting in the workflow configuration.

Subscription ID: <SubscriptionIDGuid>
Service name: <HostedServiceName>

or

Source:        Health Service Modules
Event ID:      34016
Description: The remote name could not be resolved: '<DNSPrefix you entered>.table.core.windows.net'. Please check the connection to the remote server or the proxy setting in the workflow configuration.

Subscription ID: <SubscriptionIDGuid>
Service name: <DNSPrefix you entered>
Proxy address:
Exception: The remote name could not be resolved: '<DNSPrefix you entered>.table.core.windows.net'

And you will see an alert…

Alert Rule: Windows Azure Remote Name Resolution Failure
Alert Description: The remote name could not be resolved: <SomeURL>. Please check the connection to the remote server or the proxy setting in the workflow configuration.

Subscription ID: <SubscriptionIDGuid>
Service name: <DNSPrefix you entered>
Proxy address:
Exception: The remote name could not be resolved: <SomeURL>

In order to track down the typo, you do the following:

  1. Capture the <DNSPrefix you entered> from the event
  2. Using that string, try full and partial searches of your Windows Azure Application list until you find the right one
    1. Again, this assumes that you’re following the best practice I mentioned above of putting the DNS prefix of the hosted service into the Name of the Windows Azure Application template
  3. Open its properties and reconcile the information entered for the proxy server
  4. Fix any typos and click OK to apply the changes
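
To tell a proxy typo apart from a plain connectivity problem, you can check from the watcher node whether the two host names from the 34016 events resolve at all. The sketch below is a rough aid under a few assumptions: the function names are hypothetical, the host names are the ones shown in the events above, and it tests direct DNS resolution only. If the watcher node is supposed to go through a proxy, a direct lookup may fail even when the proxy setting is correct.

```python
import socket

MANAGEMENT_HOST = "management.core.windows.net"


def hosts_to_check(dns_prefix: str) -> list[str]:
    """Host names that appear in the 34016 events for a given DNS prefix."""
    return [MANAGEMENT_HOST, f"{dns_prefix}.table.core.windows.net"]


def unresolved_hosts(hosts: list[str], resolve=socket.getaddrinfo) -> list[str]:
    """Return the hosts that fail name resolution from this machine.

    `resolve` defaults to socket.getaddrinfo and can be swapped out,
    which keeps the function testable without network access.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host, 443)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

If both names fail, suspect the network connection or proxy settings; if only the table storage name fails, look first at the DNS prefix you entered.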

Way#3: Your certificate is not a management certificate on the Azure subscription: Lastly, it is possible that the certificate used in the run as account for Azure monitoring has not yet been added as a management certificate on the subscription.

If that happens you’ll see the following event…

Source:        Health Service Modules
Event ID:      34015
Description: Failed to authenticate the request. Please verify that the certificate for the Windows Azure Service Management API is valid and is associated with the subscription.

Subscription ID: <SubscriptionIDGuid>
Service name: <HostedServiceName>

Request method: GET
Request URI: https://management.core.windows.net/<SubscriptionIDGuid>/services/hostedservices/<HostedServiceName>?embed-detail=true
Error code: AuthenticationFailed
Error message: The server failed to authenticate the request. Verify that the certificate is valid and is associated with this subscription.
Error XML: <Error xmlns="https://schemas.microsoft.com/windowsazure" xmlns:i="https://www.w3.org/2001/XMLSchema-instance"><Code>AuthenticationFailed</Code><Message>The server failed to authenticate the request. Verify that the certificate is valid and is associated with this subscription.</Message></Error>

Inner exception: The remote server returned an error: (403) Forbidden.

And you will see an alert…

Alert Rule: Windows Azure Management Client Authentication Failure
Alert Description: Failed to authenticate the request. Please verify that the certificate for the Windows Azure Service Management API is valid and is associated with the subscription.
Request method: GET
Request URI: https://management.core.windows.net/<SubscriptionIDGuid>/services/hostedservices/<HostedServiceName>?embed-detail=true
Error code: AuthenticationFailed
Error message: The server failed to authenticate the request. Verify that the certificate is valid and is associated with this subscription.
Error XML: <Error xmlns="https://schemas.microsoft.com/windowsazure" xmlns:i="https://www.w3.org/2001/XMLSchema-instance"><Code>AuthenticationFailed</Code><Message>The server failed to authenticate the request. Verify that the certificate is valid and is associated with this subscription.</Message></Error>

Inner exception: The remote server returned an error: (403) Forbidden.

In order to fix this issue, you’ll need to add the certificate as a management certificate by following the instructions in the Monitoring Azure applications – Part 2 post, under Add the private key (.CER) to your Azure subscriptions as a “Management Certificate”.
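
The Error XML in the 34004 and 34015 events carries the error code in a form that is easy to read programmatically. As a sketch only: the function names below are hypothetical, the code assumes the XML is well formed (attribute values quoted), and it matches elements by local name so it does not depend on the exact namespace string.

```python
import xml.etree.ElementTree as ET


def parse_error_xml(error_xml: str) -> dict[str, str]:
    """Return the child elements of <Error> as a dict, ignoring namespaces."""
    root = ET.fromstring(error_xml)
    result = {}
    for child in root:
        # Tags look like "{namespace}Code"; keep only the part after "}".
        local_name = child.tag.rsplit("}", 1)[-1]
        result[local_name] = (child.text or "").strip()
    return result


def is_management_cert_problem(error_xml: str) -> bool:
    """True if the error code is AuthenticationFailed, as in the 34015 event."""
    return parse_error_xml(error_xml).get("Code") == "AuthenticationFailed"
```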

Your watcher node is unable to access the Windows Azure storage account

The last area where I’ve encountered failures is in how the Azure watchers try to access the Windows Azure storage accounts for each hosted service.  The storage account is where all of the instrumentation is stored for a hosted service, so getting to the account is very important.

For better or worse, the storage account connection details are not something that you can configure statically within OpsMgr.  Instead, the Azure MP uses the service management API to connect to the Hosted Service and then programmatically looks up the connection details for the storage account.  Some of this behavior is also described in the Azure MP guide within the prerequisites section, here.

Following are the two ways in which I’ve seen this part of the process fail:

  • Way#1: The name of the setting in the config file is something other than “DiagnosticsConnectionString”
  • Way#2: The setting in the config file does not exist or has an invalid value

The resolution steps for both cases are already covered in great detail on Walter Myers’ blog, here.
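
Both failure modes can be spotted by inspecting the service configuration file before deploying. The sketch below rests on assumptions the post does not spell out: that the config file is XML with Setting elements carrying name and value attributes, and that an empty value is the only kind of invalid value worth checking here. The function name is hypothetical, and the check looks across the whole file rather than role by role.

```python
import xml.etree.ElementTree as ET

EXPECTED_SETTING = "DiagnosticsConnectionString"


def diagnostics_setting_problems(config_xml: str) -> list[str]:
    """Return a list of problems with the diagnostics setting (empty if none)."""
    root = ET.fromstring(config_xml)
    values = [
        element.get("value", "")
        for element in root.iter()
        # Match on local name so the XML namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "Setting"
        and element.get("name") == EXPECTED_SETTING
    ]
    if not values:
        return [f"no setting named {EXPECTED_SETTING!r} was found"]
    if any(not value.strip() for value in values):
        return [f"setting {EXPECTED_SETTING!r} has an empty value"]
    return []
```

This does not validate that the connection string actually points at a reachable storage account; it only catches a renamed, missing, or empty setting.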

So that’s it!  Those are all of the ways that I know of that the workflows can fail.  Hopefully this is a useful resource for someone out there on the interwebs.
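
As a quick reference, the event IDs covered in this post can be collected into a small lookup table. The sketch below only restates what is described above: the names are hypothetical, and the alert column records whether an alert showed up in the cases described in this post, not a guarantee about the Azure MP's behavior.

```python
from typing import NamedTuple, Optional


class Failure(NamedTuple):
    source: str
    cause: str
    raises_alert: bool


# Event ID -> what it meant in the cases described in this post.
KNOWN_FAILURES = {
    34000: Failure("Health Service Modules",
                   "certificate password in the run as account is wrong", True),
    1108: Failure("HealthService",
                  "run as account is not distributed to this computer", False),
    34004: Failure("Health Service Modules",
                   "subscription ID or DNS prefix typo in the template", False),
    34016: Failure("Health Service Modules",
                   "name resolution failed; check connectivity and proxy settings", True),
    34015: Failure("Health Service Modules",
                   "certificate is not a management certificate on the subscription", True),
}


def triage(event_id: int) -> Optional[str]:
    """Return a one-line summary for a known event ID, or None if unknown."""
    failure = KNOWN_FAILURES.get(event_id)
    if failure is None:
        return None
    alert = "an alert is raised" if failure.raises_alert else "no alert is raised"
    return f"{failure.cause} ({alert})"
```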

Cheers!