ARM concepts in Azure Stack for the WAP Administrator – Troubleshooting IaaS in Azure Stack

Hello Readers! This blog is part 8 (and the last) of the series "ARM concepts in Azure Stack for the WAP Administrator." In this post we'll discuss and share troubleshooting techniques and resources that we have learned when working with customers and partners that are actively validating Microsoft Azure Stack Technical Preview 1.

Note: Some information relates to pre-released product which may be substantially modified before it's commercially released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

I'm including the table of contents for this series of posts so that it's easier for you to navigate across the series:

Table of contents

  1. Introductory post, and some first information on the Azure Stack POC architecture and ARM's role
  2. Cloud Service Delivery
  3. Plans, offers, and subscriptions
  4. Resource Deployment
  5. Packaging and publishing templates on Azure Stack
  6. Multi-tier applications
  7. In-guest configuration with ARM, and technologies such as Virtual Machines Extensions, including PowerShell Desired State Configuration (DSC)
  8. Troubleshooting IaaS deployments in Azure Stack (this post)

With no more delay, let's get started!


WAP Troubleshooting

While we've already discussed the WAP architecture for IaaS in previous posts from this series, let's summarize the components required: a fabric based on Windows Server 2012 R2, a fabric management infrastructure based on System Center 2012 R2, and Windows Azure Pack for offering cloud services to tenants, as depicted in the picture below:

And specifically, for enabling the Virtual Machine Clouds (a.k.a. VM Clouds or simply IaaS) service in WAP, the System Center 2012 R2 components required are:

  • Virtual Machine Manager (VMM)
  • Service Provider Foundation (SPF)
  • (Optional) Operations Manager (OpsMgr) – for usage
  • (Optional) Service Management Automation (SMA) – for executing automation runbooks

This is depicted in the picture below:

As you can see, there are several moving parts involved just for the VM Clouds service in WAP (meaning tenants can deploy VMs and VM networks via a self-service portal). So, when something goes wrong (like a tenant VM deployment failing), the root cause could be in WAP, SPF, VMM, at the storage level, or even in Hyper-V!

To help with this potential challenge, when WAP was released, the Building Clouds blog team (aka.ms/buildingclouds) and the community were very active in providing guidance and troubleshooting for the initial scenarios.

At the same time, the official WAP documentation was growing to cover the different areas (hence, not only IaaS, but PaaS too, such as Web Sites and SQL). The WAP troubleshooting article in TechNet covers the different components and scenarios in great detail:

And, finally, WAP administrators have a plethora of troubleshooting information for Windows Server 2012 R2, Hyper-V, and System Center 2012 R2 (VMM, SPF, Operations Manager, and so on).

Now, let's see what resources we have available for troubleshooting Azure Stack TP1.


Azure Stack Troubleshooting

At the time of this writing, the Azure Stack version we have available is Azure Stack Technical Preview 1. This means that we're working with a very early version of the product, which is deployed in a one-node configuration for evaluation purposes. Hence, the guidance and links provided in this post apply to Azure Stack TP1 only.

First, let's start with a quick overview of the Azure Stack TP1 architecture, which is described in this article:

The same article explains the roles of each of the components, and also, if you read the comments, you'll see that this diagram is missing the BGP VM, which acts as a router between different VMs in the TP1 single-node deployment.

As you can see in the picture above, the architecture and components in Azure Stack TP1 are very different from what you were used to in WAP. New technologies from Windows Server 2016 Technical Preview 4 (such as Storage Spaces Direct) and from Microsoft Azure (such as Service Fabric) are used. With so many new components, let's take a look at the resources available for you to troubleshoot Azure Stack TP1.


Azure Stack Troubleshooting – Where to go and how to contribute

In case you haven't noticed, there is a very comprehensive list of known issues, workarounds and troubleshooting guidance in the Azure Stack documentation (direct link here). I'd suggest that you refer to this site for troubleshooting topics on Azure Stack, as it's updated regularly. The articles are organized by category, so that it's easier to navigate and find answers for a specific area (such as Platform Image Repository, templates or the TP1 deployment itself).

With that said, we will not write additional troubleshooting guidance for Azure Stack in this blog post, because the Azure Stack documentation is available on azure.microsoft.com and every one of us can contribute to it! You only need a GitHub account (if you don't have one, you can get one here). Go to the specific document, and click the Edit on GitHub link as depicted in the picture below:

This will bring you to the article on the Azure GitHub repository, and from here you can easily contribute by clicking on the edit button as highlighted in yellow in the picture below:

Make the edits in your fork of this project, propose a file change and then submit a pull request. Pull requests are reviewed by the Azure Stack team, and if everything looks good, they'll merge the request into their repository, and everybody will see your contribution:


Azure Stack Troubleshooting – Most common issues

Alright, the list of known issues provided in the link above is quite comprehensive, but which are the most common issues faced when working with IaaS in Azure Stack? At the time of this writing, these were some of the most common issues we've seen when working with customers:

Disclaimer – These common issues only apply to Azure Stack TP1 POC and were taken from the Azure Stack troubleshooting article. You could expect these issues to be fixed in future Azure Stack releases.

"Gateway Timeout" error message when working with virtual machines

In Azure PowerShell, the error message may be:

Gateway Timeout: The gateway did not receive a response from 'Microsoft.Compute' within the specified time period.

This is a known issue, and should be fixed in a future release. As a workaround, restarting the Compute Resource Provider (CRP) services on the xRPVM, or restarting this VM, should solve the issue.

Performance issues when deploying or deleting tenant virtual machines

Some improvements to deployment and deletion times have been implemented in the incremental release for Azure Stack TP1 (April 2016). In case you still see issues, here are some steps that may help with poor performance during VM management tasks:
  1. Restart the WinRM service on the Hyper-V Host
  2. If that doesn't work, restart the CRP service on the xRPVM
  3. If that doesn't work, restart the xRPVM

A new image added to the Platform Image Repository (PIR) may not show up in the portal

When adding an image to the Platform Image Repository (PIR) in Azure Stack, it can take some time (5 to 10 minutes) after running "CopyImageToPlatformImageRepository.ps1" for the image to show up in the Azure Stack portal.

Also, if the value for -Offer and/or -SKU contains a space, the manifest will be invalid and a gallery item will not be created. This is a known issue, and the current workaround is to ensure you don't use spaces, for example changing the SKU from "Windows Server 2012 R2 Standard" to either "WindowsServer-2012-R2-Standard" or "WindowsServer2012R2Standard".

Finally, we've seen reports where increasing the number of virtual processors (to 4 or 8) and memory (to 8 GB) for the xRPVM resolved this situation.
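To make the space workaround concrete, here is a minimal Python sketch that strips whitespace from an Offer or SKU value before you pass it to the script. The helper name sanitize_pir_value is hypothetical and is not part of the Azure Stack tooling; it only illustrates the renaming suggested above.

```python
def sanitize_pir_value(value: str, separator: str = "") -> str:
    """Replace every run of whitespace in value with separator.

    Hypothetical helper: with the default separator the spaces are
    simply removed, which matches one of the workarounds in the text.
    """
    return separator.join(value.split())


def contains_space(value: str) -> bool:
    """Return True if value would produce an invalid manifest (any whitespace)."""
    return any(ch.isspace() for ch in value)


if __name__ == "__main__":
    sku = "Windows Server 2012 R2 Standard"
    print(contains_space(sku))      # True
    print(sanitize_pir_value(sku))  # WindowsServer2012R2Standard
```

Passing separator="-" keeps the words readable, at the cost of a slightly different name from the hyphenated example in the text.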

Network security groups cannot be created using default tags

In Azure Stack TP1, it is possible to deploy security rules with a sourceAddressPrefix of "*" or "10.0.0.0/24", but using a tag like "Internet" or "VirtualNetwork" fails. This is because default tags are not supported in TP1. This is a known issue that should be fixed in a future release.
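If you want to catch this before deploying, a small pre-flight check can help. The following Python sketch scans a template (already loaded as a dictionary) for security rules that use one of the tags mentioned above. The function name, the sample template, and the decision to also inspect destinationAddressPrefix are assumptions made for illustration; the tag set only contains the two tags named in this post.

```python
# Tags named in this post; an assumption, not a complete list of default tags.
UNSUPPORTED_TAGS = {"Internet", "VirtualNetwork"}
# Checking the destination field as well is an assumption; the post only
# mentions sourceAddressPrefix.
PREFIX_FIELDS = ("sourceAddressPrefix", "destinationAddressPrefix")
NSG_TYPE = "Microsoft.Network/networkSecurityGroups"


def find_tagged_rules(template: dict) -> list:
    """Return (rule name, field, tag) for each top-level NSG rule that uses a default tag."""
    findings = []
    for resource in template.get("resources", []):
        if resource.get("type") != NSG_TYPE:
            continue
        rules = resource.get("properties", {}).get("securityRules", [])
        for rule in rules:
            props = rule.get("properties", {})
            for field in PREFIX_FIELDS:
                value = props.get(field)
                if value in UNSUPPORTED_TAGS:
                    findings.append((rule.get("name", "<unnamed>"), field, value))
    return findings


# Hypothetical template fragment used only to exercise the check.
SAMPLE_NSG_TEMPLATE = {
    "resources": [
        {
            "type": NSG_TYPE,
            "name": "demo-nsg",
            "properties": {
                "securityRules": [
                    {"name": "allow-rdp",
                     "properties": {"sourceAddressPrefix": "Internet",
                                    "destinationAddressPrefix": "*"}},
                    {"name": "allow-subnet",
                     "properties": {"sourceAddressPrefix": "10.0.0.0/24",
                                    "destinationAddressPrefix": "*"}},
                ]
            },
        }
    ]
}

if __name__ == "__main__":
    for name, field, tag in find_tagged_rules(SAMPLE_NSG_TEMPLATE):
        print(f"Rule '{name}': {field} uses unsupported tag '{tag}'")
```

The sketch only looks at security rules declared inline on top-level network security groups; nested templates and standalone rule resources would need extra handling.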

Network resolution issues from tenant virtual machines

With this release, virtual machines should be able to connect to the internet, for example for some of the virtual machine extensions.

If you are having internet connectivity issues from within the virtual machines, it is likely because the iDNS feature is not yet available in this Technical Preview 1 release, meaning that a shared DNS feature from Azure is not configured by default.

You can confirm this by looking at the "DNS servers" settings for the associated virtual network:

In the portal, you can set the first DNS server to 192.168.100.2 and the second one (which is required) to a public DNS server. This can also be controlled when deploying via a template, by using this setting in the "dhcpOptions" for the virtual network:

"dnsServers": ["192.168.100.2"]

This setting can also be used when deploying a virtual machine via a template that also includes a virtual network.
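As an illustration of where that setting lives, the Python sketch below patches every top-level virtual network in a template dictionary so that its "dhcpOptions" carries the DNS servers you pass in. The function name and the sample template (including the "demo-vnet" name and address range) are hypothetical; only the "dhcpOptions" and "dnsServers" property names and the 192.168.100.2 address come from the text above.

```python
import copy
import json

VNET_TYPE = "Microsoft.Network/virtualNetworks"


def with_dns_servers(template: dict, dns_servers: list) -> dict:
    """Return a copy of template whose top-level virtual networks use dns_servers."""
    patched = copy.deepcopy(template)  # leave the caller's template untouched
    for resource in patched.get("resources", []):
        if resource.get("type") == VNET_TYPE:
            props = resource.setdefault("properties", {})
            props.setdefault("dhcpOptions", {})["dnsServers"] = list(dns_servers)
    return patched


# Hypothetical template fragment, for illustration only.
SAMPLE_VNET_TEMPLATE = {
    "resources": [
        {
            "type": VNET_TYPE,
            "name": "demo-vnet",
            "location": "local",
            "properties": {"addressSpace": {"addressPrefixes": ["10.0.0.0/16"]}},
        }
    ]
}

if __name__ == "__main__":
    patched = with_dns_servers(SAMPLE_VNET_TEMPLATE, ["192.168.100.2"])
    print(json.dumps(patched["resources"][0]["properties"], indent=2))
```

Working on a deep copy keeps the original template intact, which is handy if you deploy the same template to both Azure and Azure Stack.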

If you need to change this for an existing virtual network, virtual machines that are already deployed will need to be stopped and restarted. When logging into the restarted VM, you should confirm it has picked up the new settings from the Network Controller, via DHCP. Making changes directly in the VM may work, but would be an "out of band" change for the Network Controller, so it is not desirable. Disabling and re-enabling the virtual NIC within the VM would also be a possibility at this stage (since you have access to both tenant and service admin sides in the POC).

Error "Operation could not be completed within the specified time" when running the New-StorageContainer cmdlet

This is a known issue that should be fixed in a future release.

Workaround: You can stop the WAC (WacServer.exe) process inside the ACS VM, using Task Manager. Service Fabric should automatically restart it.


Azure Stack Troubleshooting – Tools available

Now, let's review some of the tools available to help you troubleshoot Azure Stack TP1:

Tool: ARM Template Checker for Microsoft Azure Stack

Let's imagine this situation: you have a JSON template that you've been using to deploy resources in your Azure subscription (for example, a virtual network, VMs and NSGs). When you deploy the template in your Azure subscription it works like a charm, but it fails to deploy on your Azure Stack subscription.

For scenarios like this, you can use the ARM template checker tool. As the name implies, it checks your template and reports any incompatibilities it detects that would prevent a successful deployment on Azure Stack. For example, your template might reference an Azure region (such as West Europe) that does not exist on Azure Stack (the only region on Azure Stack TP1 is local). Your template might also reference resource providers or APIs that are available in Azure, but not yet in Azure Stack.
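To give a feel for the kind of check involved, here is a minimal Python sketch of the region example. It is not the actual template checker tool; the function name and sample template are hypothetical, and the only fact taken from the text is that "local" is the single region on Azure Stack TP1.

```python
def find_unsupported_locations(template: dict, allowed=("local",)) -> list:
    """Return (resource name, location) for hard-coded locations outside allowed."""
    findings = []
    for resource in template.get("resources", []):
        location = resource.get("location")
        if not isinstance(location, str):
            continue  # no location declared on this resource
        if location.startswith("["):
            continue  # template expression, only resolved at deployment time
        if location.lower() not in allowed:
            findings.append((resource.get("name", "<unnamed>"), location))
    return findings


# Hypothetical template fragment, for illustration only.
SAMPLE_LOCATION_TEMPLATE = {
    "resources": [
        {"type": "Microsoft.Network/virtualNetworks",
         "name": "vnet1", "location": "West Europe"},
        {"type": "Microsoft.Storage/storageAccounts",
         "name": "storage1", "location": "local"},
        {"type": "Microsoft.Compute/virtualMachines",
         "name": "vm1", "location": "[resourceGroup().location]"},
    ]
}

if __name__ == "__main__":
    for name, location in find_unsupported_locations(SAMPLE_LOCATION_TEMPLATE):
        print(f"{name}: location '{location}' is not available on Azure Stack TP1")
```

Locations written as template expressions are skipped on purpose, because their value depends on the parameters or resource group used at deployment time.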

ARM-Deployment-Troubleshooter

Think about this scenario: you take one of the templates from the Azure Stack Quick Start GitHub repository (or any template you may have written) and deploy it to a resource group in your Azure Stack subscription. For some reason, the deployment fails, and you may get just a generic error in the Azure Stack portal or in PowerShell. It's difficult to know where the deployment failed, isn't it? (And this is even more complex when you have nested templates, such as SharePoint.)

This script can help you to troubleshoot ARM deployments on Azure Stack TP1. You pass the resource group as a parameter, and the script contacts ARM, retrieves the information and logs for all deployments available in that resource group, and saves everything to a log file, so that you have all the logs and deployment details in a single place. Among the details collected from the deployments in the resource group, the script gets you:

  • The template used during the deployment
  • The deployment parameters
  • Details of the deployment operations
    • Here you can see which specific action failed (if any)
  • Resources in the resource group
  • Details about the virtual machines
    • VM status
    • VM Agent Status
    • Installed VM extensions on the VM
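To illustrate the "which specific action failed" part, the Python sketch below filters a list of deployment operations down to the failed ones. It is not the script itself. The field names (provisioningState, targetResource, statusMessage) are an assumption about the shape of the operation records you would get from ARM, and the sample data is entirely made up.

```python
def failed_operations(operations: list) -> list:
    """Return a short summary for each deployment operation that failed."""
    failures = []
    for operation in operations:
        props = operation.get("properties", {})
        if props.get("provisioningState") != "Failed":
            continue
        target = props.get("targetResource") or {}
        failures.append({
            "resource": target.get("resourceName", "<unknown>"),
            "type": target.get("resourceType", "<unknown>"),
            "message": props.get("statusMessage"),
        })
    return failures


# Made-up operation records, shaped according to the assumption above.
SAMPLE_OPERATIONS = [
    {"properties": {
        "provisioningState": "Succeeded",
        "targetResource": {"resourceName": "vnet1",
                           "resourceType": "Microsoft.Network/virtualNetworks"}}},
    {"properties": {
        "provisioningState": "Failed",
        "targetResource": {"resourceName": "vm1/PowerShellExec",
                           "resourceType": "Microsoft.Compute/virtualMachines/extensions"},
        "statusMessage": "Failed to download all specified files."}},
]

if __name__ == "__main__":
    for failure in failed_operations(SAMPLE_OPERATIONS):
        print(f"{failure['type']} {failure['resource']}: {failure['message']}")
```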

For example, one of my colleagues was troubleshooting a complex deployment, and using this script, he got the logs and noticed the following error on the Custom Script VM Extension:

{
    "name": "PowerShellExec",
    "type": "Microsoft.Compute.CustomScriptExtension",
    "typeHandlerVersion": "1.7.0.0",
    "substatuses": null,
    "statuses": [
        {
            "code": "ProvisioningState/failed/3",
            "level": "Error",
            "displayStatus": "Provisioning failed",
            "message": "Failed to download all specified files. Exiting. Error Message: The remote server returned an error: (404) Not Found.",
            "time": "0001-01-02T00:00:00Z"
        }
    ]
}

As you can see in the snippet above, the Custom Script Extension is in a failed state, and the error message clearly indicates that it couldn't download the required files, as it received a 404 (Not Found) error. In this particular case, the environment required a proxy to connect to the internet, and additional configuration was required to allow this particular VM to access the internet to download the required files.
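When the log file contains many extension status blocks like this one, a few lines of Python can pull out just the errors. This is a minimal sketch that assumes each status block has the same shape as the snippet shown earlier; the function name extension_errors is hypothetical.

```python
import json


def extension_errors(status_json: str) -> list:
    """Return one readable line per status entry whose level is "Error"."""
    view = json.loads(status_json)
    errors = []
    # "statuses" may be missing or null, just like "substatuses" in the snippet.
    for status in view.get("statuses") or []:
        if status.get("level") == "Error":
            errors.append(
                f"{view.get('name')}: {status.get('displayStatus')}: {status.get('message')}"
            )
    return errors


# The status block from this post, reproduced as a JSON string.
SAMPLE_EXTENSION_STATUS = """
{
    "name": "PowerShellExec",
    "type": "Microsoft.Compute.CustomScriptExtension",
    "typeHandlerVersion": "1.7.0.0",
    "substatuses": null,
    "statuses": [
        {
            "code": "ProvisioningState/failed/3",
            "level": "Error",
            "displayStatus": "Provisioning failed",
            "message": "Failed to download all specified files. Exiting. Error Message: The remote server returned an error: (404) Not Found.",
            "time": "0001-01-02T00:00:00Z"
        }
    ]
}
"""

if __name__ == "__main__":
    for line in extension_errors(SAMPLE_EXTENSION_STATUS):
        print(line)
```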

Deployment Checker for Azure Stack Technical Preview 1

Let's imagine this scenario: you are eager to test Azure Stack TP1 and you have one server for installing and testing it. After reading the online documentation for hardware requirements, you're still not sure if your server meets the requirements, and you'd like to know whether it can run Azure Stack before you download the Azure Stack TP1 installation files.

This script will help you to check if your hardware meets the requirements / prerequisites for deploying Azure Stack TP1. The script goes through the prerequisite checks done by the Azure Stack TP1 installer and it will indicate if your server meets the requirements beforehand.


Azure Stack Troubleshooting – Additional resources

Now, let's review additional documents / links available for Azure Stack troubleshooting:

  • Microsoft Azure Stack troubleshooting

    Official article from the Azure Stack team with detailed troubleshooting guidance. Expect this list to grow over time!

  • Frequently asked questions for Azure Stack

    Also an official article from the Azure Stack team, which is frequently updated (last update was a couple of weeks ago!) with common questions and topics answered directly by the Azure Stack team.

  • FAQ, known issues and workarounds

    Collection of known issues and workarounds provided and maintained in the Azure Stack Forum.

  • Azure Stack Forum

    MSDN forum dedicated to Azure Stack. A great place to learn from others, and also the right place to post your questions when you face problems with your Azure Stack environment.

  • Azure Stack Logs

    Entry in the Azure Stack forum with a comprehensive list of logs for different Azure Stack components, as well as instructions on how to gather logs manually and automatically.

  • The Azure Stack Channel

    Channel 9 channel dedicated to Azure Stack resources (deployment, best practices, and more).


Conclusion

The resources provided in this post should help you to troubleshoot the most common known issues with Azure Stack TP1, specifically for IaaS (the focus of this series).

Also, with this blog post, we conclude this series, whose original goal was to map the IaaS concepts that WAP administrators are familiar with to the new Azure Stack TP1. We covered this topic from a wide variety of angles, to help you better understand how cloud services are delivered on Azure Stack, and how consistency with Azure via Azure Resource Manager is a key differentiator that brings the power of Azure to your datacenter.

Thanks and until next time!

Victor, Tiander and Bruno