Automating HPC Cluster Deployments in Azure IaaS: Part III - Auto Grow and Shrink Azure Nodes and IaaS VMs

Author: Yutong Sun Date: December 2, 2014

 

HPC Pack 2012 R2 supports bursting to Azure Nodes (PaaS worker role instances) from both an on-premises head node and an IaaS VM head node. It also supports adding IaaS VM compute nodes to an IaaS VM head node. In the Update 1 release (What’s New, Download), we provide the HPC Pack image (Link) in the Azure gallery, together with IaaS deployment (Download) and management scripts, to help customers deploy and manage HPC Pack clusters easily with the PaaS-on-IaaS offering on Microsoft Azure.

After deploying Azure Nodes or IaaS VMs, many customers want to automatically grow and shrink the Azure computing resources according to the current workload on the cluster, to improve cluster utilization and save cost. This blog introduces the auto grow and shrink script available in the Update 1 release. It can grow and shrink Azure Nodes for both on-premises and IaaS VM head nodes, and IaaS VM compute nodes for an IaaS VM head node.

The script can run on either a client machine or the head node. When running it on a client, make sure the environment variable $env:CCP_SCHEDULER is set correctly. When running against IaaS VM compute nodes, the script must run on the IaaS VM head node, and it depends on the Start-HPCIaaSNode.ps1 and Stop-HPCIaaSNode.ps1 scripts to work correctly.

 

Sample 1. To grow and shrink Azure nodes from on-premise head node or IaaS head node

Before running the script in this scenario, make sure the Azure nodes have been added, with the correct management cert imported and the right cloud service and storage account specified in the Azure node templates. The auto grow and shrink script depends on the Start-HpcAzureNode and Stop-HpcAzureNode HPC PowerShell cmdlets.

We can simply run the script as follows to start monitoring the job queue and grow or shrink Azure nodes:

  

 

PS C:\Program Files\Microsoft HPC Pack 2012\bin> .\AzureAutoGrowShrink.ps1 -NodeType AzureNodes -NumOfQueuedJobsPerNodeToGrow 5 -NumOfQueuedJobsToGrowThreshold 8 -NumOfInitialNodesToGrow 3 -GrowCheckIntervalMins 1 -ShrinkCheckIntervalMins 1 -ShrinkCheckIdleTimes 3

[11/02/2014 04:28:26][Info] Log file : AzureAutoGrowShrink_Log.txt

[11/02/2014 04:28:26][Info] Argument file : AzureAutoGrowShrink_Arg.xml

[11/02/2014 04:28:26][Info] Script arguments :

[11/02/2014 04:28:26][Info] NodeTemplates -- []

[11/02/2014 04:28:26][Info] JobTemplates -- []

[11/02/2014 04:28:26][Info] NodeType -- [AzureNodes]

[11/02/2014 04:28:26][Info] NumOfQueuedJobsPerNodeToGrow -- [5]

[11/02/2014 04:28:26][Info] NumOfQueuedJobsToGrowThreshold -- [8]

[11/02/2014 04:28:26][Info] NumOfActiveQueuedTasksPerNodeToGrow -- [0]

[11/02/2014 04:28:26][Info] NumOfActiveQueuedTasksToGrowThreshold -- [0]

[11/02/2014 04:28:26][Info] NumOfInitialNodesToGrow -- [3]

[11/02/2014 04:28:26][Info] GrowCheckIntervalMins -- [1]

[11/02/2014 04:28:26][Info] ShrinkCheckIntervalMins -- [1]

[11/02/2014 04:28:26][Info] ShrinkCheckIdleTimes -- [3]

[11/02/2014 04:29:34][Info] ===============Grow Check==================

[11/02/2014 04:29:34][Info] Number of active jobs : 0

[11/02/2014 04:29:34][Info] Number of queued jobs : 0

[11/02/2014 04:29:34][Info] No enough relevant workload or nodes to grow.

[11/02/2014 04:29:34][Info] ===============Shrink Check================

[11/02/2014 04:30:35][Info] ===============Grow Check==================

[11/02/2014 04:30:35][Info] Number of active jobs : 0

[11/02/2014 04:30:35][Info] Number of queued jobs : 0

[11/02/2014 04:30:35][Info] No enough relevant workload or nodes to grow.

[11/02/2014 04:30:35][Info] ===============Shrink Check================

 

The grow check looks at the current number of queued jobs. If that number exceeds NumOfQueuedJobsToGrowThreshold (8 here), for example with 9 queued jobs, the script grows Azure Nodes from the available capacity: NumOfQueuedJobs [9] / NumOfQueuedJobsPerNodeToGrow [5] = 1.8, which is rounded to 2 Azure Nodes. If all the Azure Nodes in the node template are NotDeployed and NumOfInitialNodesToGrow is specified (3 here), at least 3 Azure Nodes are started for the initial deployment. The grow check repeats GrowCheckIntervalMins (1 minute here) after the last shrink check or grow/shrink operation.
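The grow decision described above can be modelled in a few lines. This is a minimal Python sketch, not the script's actual code: the function name and its parameters are hypothetical, and it assumes that the quotient is rounded up and that the result is capped by the number of nodes available to grow (the "Nodes to grow capacity" value in the log).

```python
import math

def nodes_to_grow(queued_jobs, threshold, jobs_per_node, capacity,
                  initial_nodes=0, all_not_deployed=False):
    """Illustrative model of one grow check; returns the node count to start."""
    # A per-node ratio of 0 is treated here as "job-based growing disabled".
    if jobs_per_node <= 0:
        return 0
    # Grow only when the queue length goes beyond the threshold.
    if queued_jobs <= threshold:
        return 0
    # Assumption: the quotient is rounded up (9 / 5 = 1.8 -> 2).
    wanted = math.ceil(queued_jobs / jobs_per_node)
    # If nothing is deployed yet, start at least the initial node count.
    if all_not_deployed:
        wanted = max(wanted, initial_nodes)
    # Assumption: never request more nodes than are available to grow.
    return min(wanted, capacity)

# 8 queued jobs do not exceed the threshold of 8, so nothing grows.
print(nodes_to_grow(8, 8, 5, capacity=6, initial_nodes=3, all_not_deployed=True))
# 9 queued jobs would need 2 nodes, raised to 3 by NumOfInitialNodesToGrow.
print(nodes_to_grow(9, 8, 5, capacity=6, initial_nodes=3, all_not_deployed=True))
```

With the parameters of this sample, the two calls print 0 and 3, which is consistent with the log excerpts at 05:02:12 (8 queued jobs, no grow) and 05:12:29 (9 queued jobs, 3 nodes grown).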

  

 

[11/02/2014 05:02:12][Info] ===============Grow Check==================

[11/02/2014 05:02:12][Info] Number of active jobs : 8

[11/02/2014 05:02:12][Info] Number of queued jobs : 8

[11/02/2014 05:02:12][Info] No enough relevant workload or nodes to grow.

[11/02/2014 05:02:12][Info] ===============Shrink Check================

[11/02/2014 05:02:12][Info] No idle nodes to shrink

 

 

 

[11/02/2014 05:12:29][Info] ===============Grow Check==================

[11/02/2014 05:12:29][Info] Number of active jobs : 9

[11/02/2014 05:12:29][Info] Number of queued jobs : 9

[11/02/2014 05:12:29][Info] Grow with queued jobs

[11/02/2014 05:12:29][Info] Number of relevant queued jobs : 9

[11/02/2014 05:12:29][Info] Nodes to grow capacity : 6

[11/02/2014 05:12:29][Info] Nodes to grow for relevant queued jobs : 3

[11/02/2014 05:12:29][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 05:12:29][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 05:12:29][Info]     AzureCN-0008 NotDeployed    Unreachable       1.5          AzureNodes

[11/02/2014 05:12:29][Info]     AzureCN-0009 NotDeployed    Unreachable       1.5          AzureNodes

[11/02/2014 05:12:29][Info]     AzureCN-0006 NotDeployed    Unreachable       1.5          AzureNodes

[11/02/2014 05:12:29][Info] +++++++++++++++Grow Nodes++++++++++++++++++

[11/02/2014 05:12:29][Info] Growing the 3 node(s):

[11/02/2014 05:12:29][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 05:12:29][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 05:12:29][Info]     AzureCN-0008 NotDeployed    Unreachable       1.5          AzureNodes

[11/02/2014 05:12:29][Info]     AzureCN-0009 NotDeployed    Unreachable       1.5          AzureNodes

[11/02/2014 05:12:29][Info]     AzureCN-0006 NotDeployed    Unreachable       1.5          AzureNodes

[11/02/2014 05:20:37][Info] Start nodes succeeded

[11/02/2014 05:20:38][Info] Nodes online : AzureCN-0006 AzureCN-0009 AzureCN-0008

[11/02/2014 05:21:39][Info] ===============Shrink Check================

[11/02/2014 05:21:40][Info] No idle nodes to shrink

[11/02/2014 05:21:50][Info] ===============Grow Check==================

[11/02/2014 05:21:50][Info] Number of active jobs : 9

[11/02/2014 05:21:50][Info] Number of queued jobs : 3

[11/02/2014 05:21:50][Info] No enough relevant workload or nodes to grow.

[11/02/2014 05:22:41][Info] ===============Shrink Check================

[11/02/2014 05:22:41][Info] No idle nodes to shrink

 

 

The shrink check finds all idle nodes, that is, nodes with no jobs running on them. If a node is found idle in ShrinkCheckIdleTimes (3 here) consecutive checks, it is shrunk by stopping it. Note that if a deployment has multiple role sizes, the last instance of a role size is kept unless the shrink stops the whole deployment. This is a by-design requirement of Azure PaaS deployments.
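The "idle in consecutive checks" rule can be sketched as a per-node streak counter. The following Python snippet is only an illustration under stated assumptions: the class and method names are hypothetical, a node's streak is assumed to reset as soon as it runs a job, and the role-size rule for PaaS deployments is not modelled.

```python
class IdleTracker:
    """Illustrative model of the consecutive-idle rule in the shrink check."""

    def __init__(self, idle_times):
        self.idle_times = idle_times   # plays the role of ShrinkCheckIdleTimes
        self.streaks = {}              # node name -> consecutive idle checks

    def check(self, online_nodes, busy_nodes):
        """Run one shrink check and return the nodes that should be stopped."""
        to_shrink = []
        for node in online_nodes:
            if node in busy_nodes:
                # Assumption: any running job breaks the idle streak.
                self.streaks[node] = 0
                continue
            self.streaks[node] = self.streaks.get(node, 0) + 1
            if self.streaks[node] >= self.idle_times:
                to_shrink.append(node)
                self.streaks[node] = 0
        return to_shrink

tracker = IdleTracker(idle_times=3)
nodes = ["AzureCN-0006", "AzureCN-0008", "AzureCN-0009"]
busy = {"AzureCN-0008"}
for _ in range(3):
    result = tracker.check(nodes, busy)
print(result)
```

In this run the first two checks return an empty list, and the third returns AzureCN-0006 and AzureCN-0009, the two nodes that stayed idle for three checks in a row.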

 

 

[11/02/2014 05:53:22][Info] ===============Shrink Check================

[11/02/2014 05:53:22][Info] 2 idle node(s) found in this check: AzureCN-0009 AzureCN-0006

[11/02/2014 05:53:22][Info] ---------------Shrink Nodes----------------

[11/02/2014 05:53:22][Info] Idle nodes to shrink : 2

[11/02/2014 05:53:22][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 05:53:22][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 05:53:22][Info]     AzureCN-0009      Online             OK       0.0          AzureNodes

[11/02/2014 05:53:22][Info]     AzureCN-0006      Online             OK       0.0          AzureNodes

[11/02/2014 05:53:22][Info] Shrinking the 2 Azure nodes

[11/02/2014 05:53:22][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 05:53:22][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 05:53:22][Info]     AzureCN-0009      Online             OK       0.0          AzureNodes

[11/02/2014 05:53:22][Info]     AzureCN-0006      Online             OK       0.0          AzureNodes

[11/02/2014 05:53:22][Info] Bringing nodes offline

[11/02/2014 05:53:27][Info] Nodes shrink operation started successfully.

 

 

The script accepts an argument file and a log file. By default the argument file is named ‘AzureAutoGrowShrink_Arg.xml’ and is stored in the same $env:ccp_home\bin folder. It saves all the grow and shrink parameters, which supports two scenarios: 1) the argument file can be modified while the script is running to update the grow and shrink parameters; 2) the script can be rerun with -UseLastConfigurations to load the arguments from the default or a specified argument file. The log file contains all the records of the grow and shrink checks and operations.

 

 

<Objs Version="1.1.0.1" xmlns="https://schemas.microsoft.com/powershell/2004/04">
  <Obj RefId="0">
    <TN RefId="0">
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.Int32</T>
      <T>System.String</T>
      <T>System.String[]</T>
      <T>System.String[]</T>
      <T>System.Object</T>
    </TN>
    <ToString>System.Object</ToString>
    <MS>
      <Obj N="NodeTemplates" RefId="1">
        <TN RefId="1">
          <T>System.String[]</T>
          <T>System.Array</T>
          <T>System.Object</T>
        </TN>
        <LST />
      </Obj>
      <Obj N="JobTemplates" RefId="2">
        <TNRef RefId="1" />
        <LST />
      </Obj>
      <S N="NodeType">AzureNodes</S>
      <I32 N="NumOfQueuedJobsPerNodeToGrow">5</I32>
      <I32 N="NumOfQueuedJobsToGrowThreshold">8</I32>
      <I32 N="NumOfActiveQueuedTasksPerNodeToGrow">0</I32>
      <I32 N="NumOfActiveQueuedTasksToGrowThreshold">0</I32>
      <I32 N="NumOfInitialNodesToGrow">3</I32>
      <I32 N="GrowCheckIntervalMins">1</I32>
      <I32 N="ShrinkCheckIntervalMins">1</I32>
      <I32 N="ShrinkCheckIdleTimes">3</I32>
    </MS>
  </Obj>
</Objs>
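To inspect an argument file outside of PowerShell, the scalar parameters can be read with any XML parser. The sketch below uses Python's standard library on a trimmed copy of the file shown above. The helper name is hypothetical, and the code only assumes what the sample itself shows: integer parameters are stored in I32 elements and string parameters in S elements, each named by an N attribute.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the argument file shown above (type metadata omitted).
SAMPLE = """<Objs Version="1.1.0.1" xmlns="https://schemas.microsoft.com/powershell/2004/04">
  <Obj RefId="0">
    <MS>
      <S N="NodeType">AzureNodes</S>
      <I32 N="NumOfQueuedJobsPerNodeToGrow">5</I32>
      <I32 N="NumOfQueuedJobsToGrowThreshold">8</I32>
      <I32 N="ShrinkCheckIdleTimes">3</I32>
    </MS>
  </Obj>
</Objs>"""

def read_scalar_args(xml_text):
    """Return the named integer and string parameters as a dict."""
    args = {}
    for element in ET.fromstring(xml_text).iter():
        # Drop the "{namespace}" prefix ElementTree puts in front of tag names.
        local_name = element.tag.rsplit("}", 1)[-1]
        name = element.get("N")
        if name is None:
            continue
        if local_name == "I32":
            args[name] = int(element.text)
        elif local_name == "S":
            args[name] = element.text
    return args

print(read_scalar_args(SAMPLE))
```

The array-valued parameters (NodeTemplates and JobTemplates) are skipped here on purpose; they are stored as nested Obj elements and would need separate handling.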

 

We can also use the -NodeTemplates and -JobTemplates parameters to specify the scope of nodes to grow and shrink and the kinds of workload for which nodes should grow. The usage is shown in the second sample below. Note that in the grow check, if a queued job requests nodes or node groups outside the specified node templates or node type, the job is not counted as a relevant queued job, so no nodes grow for it.

 

Sample 2. To grow and shrink IaaS VM compute nodes from IaaS head node

  

To use the auto grow and shrink script for IaaS VM compute nodes on an IaaS head node, you must first import the management cert and set the Azure subscription, or import the Azure publishing settings file, or add Azure users. The auto grow and shrink script depends on the Start-HPCIaaSNode.ps1 and Stop-HPCIaaSNode.ps1 scripts under the same $env:CCP_HOME\bin folder.

In this sample, we monitor queued tasks instead of queued jobs for the grow check. The script counts the queued tasks in Queued and Running jobs to decide whether to grow and by how many nodes. Note that for parametric sweep tasks the sub-tasks are counted, and for service tasks only the root tasks are counted.
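The counting rule can be illustrated with a small Python model. Everything about the data layout here is hypothetical (the dictionaries, the field names, and the mix of tasks); the sketch only encodes the rule stated above: queued tasks in Queued and Running jobs are counted, a parametric sweep contributes its queued sub-tasks, and any other task, including a service root task, counts once. Rounding up in the final division is also an assumption.

```python
import math

def count_queued_tasks(jobs):
    """Count queued tasks in Queued and Running jobs (illustrative model)."""
    total = 0
    for job in jobs:
        if job["state"] not in ("Queued", "Running"):
            continue
        for task in job["tasks"]:
            if task["state"] != "Queued":
                continue
            if task["type"] == "ParametricSweep":
                # Each queued sub-task of a sweep counts separately.
                total += task["queued_subtasks"]
            else:
                # Basic tasks and service root tasks count once.
                total += 1
    return total

# Hypothetical workload: one running job and one finished job.
jobs = [
    {"state": "Running", "tasks": [
        {"type": "ParametricSweep", "state": "Queued", "queued_subtasks": 15},
        {"type": "Service", "state": "Queued"},
    ]},
    {"state": "Finished", "tasks": [
        {"type": "Basic", "state": "Queued"},
    ]},
]

queued = count_queued_tasks(jobs)
threshold, tasks_per_node = 15, 10
# Assumption: grow only above the threshold, rounding the quotient up.
grow = math.ceil(queued / tasks_per_node) if queued > threshold else 0
print(queued, grow)
```

With 16 queued tasks, a threshold of 15, and 10 tasks per node, the sketch asks for 2 nodes, which matches the "Number of active relevant queued tasks : 16" and "Nodes to grow for active relevant queued tasks : 2" lines in the log of this sample.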

This sample also uses the -NodeTemplates and -JobTemplates parameters to narrow the scope of nodes and jobs, and it specifies a custom argument file and log file.

Note that this script does not add or remove IaaS VM compute nodes. When an idle IaaS VM is found, the script only uses Stop-HPCIaaSNode.ps1 to put the VM into the Stopped (Deallocated) state, in which the VM is not charged for compute hours. When new workload arrives, the IaaS VM can be started again to respond. To add more IaaS VMs to the pool, use Add-HPCIaaSNode.ps1; for details, please refer to this blog.

 

 

PS C:\Program Files\Microsoft HPC Pack 2012\bin> .\AzureAutoGrowShrink.ps1 -NodeTemplates 'Default ComputeNode Template' -JobTemplates 'Default' -NodeType ComputeNodes -NumOfActiveQueuedTasksPerNodeToGrow 10 -NumOfActiveQueuedTasksToGrowThreshold 15 -NumOfInitialNodesToGrow 1 -GrowCheckIntervalMins 1 -ShrinkCheckIntervalMins 1 -ShrinkCheckIdleTimes 10 -ArgFile 'IaaSVMComputeNodes_Arg.xml' -LogFile 'IaaSVMComputeNodes_log.txt'

[11/02/2014 11:19:28][Info] Log file : IaaSVMComputeNodes_log.txt

[11/02/2014 11:19:28][Info] Argument file : IaaSVMComputeNodes_Arg.xml

[11/02/2014 11:19:28][Info] Script arguments :

[11/02/2014 11:19:28][Info] NodeTemplates -- [[Default ComputeNode Template]]

[11/02/2014 11:19:28][Info] JobTemplates -- [[Default]]

[11/02/2014 11:19:28][Info] NodeType -- [ComputeNodes]

[11/02/2014 11:19:28][Info] NumOfQueuedJobsPerNodeToGrow -- [0]

[11/02/2014 11:19:28][Info] NumOfQueuedJobsToGrowThreshold -- [0]

[11/02/2014 11:19:28][Info] NumOfActiveQueuedTasksPerNodeToGrow -- [10]

[11/02/2014 11:19:28][Info] NumOfActiveQueuedTasksToGrowThreshold -- [15]

[11/02/2014 11:19:28][Info] NumOfInitialNodesToGrow -- [1]

[11/02/2014 11:19:28][Info] GrowCheckIntervalMins -- [1]

[11/02/2014 11:19:28][Info] ShrinkCheckIntervalMins -- [1]

[11/02/2014 11:19:28][Info] ShrinkCheckIdleTimes -- [10]

[11/02/2014 11:20:29][Info] ===============Grow Check==================

[11/02/2014 11:20:29][Info] Number of active jobs : 0

[11/02/2014 11:20:29][Info] Number of queued jobs : 0

[11/02/2014 11:20:29][Info] No enough relevant workload or nodes to grow.

[11/02/2014 11:20:29][Info] ===============Shrink Check================

 

  

[11/02/2014 11:30:39][Info] ===============Grow Check==================

[11/02/2014 11:30:39][Info] Number of active jobs : 1

[11/02/2014 11:30:39][Info] Number of queued jobs : 1

[11/02/2014 11:30:39][Info] Grow with active queued tasks

[11/02/2014 11:30:39][Info] Number of active relevant queued tasks : 16

[11/02/2014 11:30:39][Info] Nodes to grow capacity : 3

[11/02/2014 11:30:39][Info] Nodes to grow for active relevant queued tasks : 2

[11/02/2014 11:30:39][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 11:30:40][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 11:30:40][Info]     YUTONGSCN-01     Offline    Unreachable5.33333333333333        ComputeNodes

[11/02/2014 11:30:40][Info]     YUTONGSCN-02     Offline    Unreachable5.33333333333333        ComputeNodes

[11/02/2014 11:30:40][Info] +++++++++++++++Grow Nodes++++++++++++++++++

[11/02/2014 11:30:40][Info] Growing the 2 node(s):

[11/02/2014 11:30:40][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 11:30:40][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 11:30:40][Info]     YUTONGSCN-01     Offline    Unreachable5.33333333333333        ComputeNodes

[11/02/2014 11:30:40][Info]     YUTONGSCN-02     Offline    Unreachable5.33333333333333        ComputeNodes

[11/02/2014 11:31:55][Info] Wait for nodes to start.

[11/02/2014 11:36:01][Info] Nodes online : YUTONGSCN-01

[11/02/2014 11:38:05][Info] Nodes online : YUTONGSCN-02

[11/02/2014 11:38:05][Info] All growing nodes are online.

[11/02/2014 11:38:05][Info] ===============Shrink Check================

 

  

[11/02/2014 12:06:46][Info] ===============Grow Check==================

[11/02/2014 12:06:46][Info] Number of active jobs : 0

[11/02/2014 12:06:46][Info] Number of queued jobs : 0

[11/02/2014 12:06:46][Info] No enough relevant workload or nodes to grow.

[11/02/2014 12:07:37][Info] ===============Shrink Check================

[11/02/2014 12:07:37][Info] 2 idle node(s) found in this check: YUTONGSCN-02 YUTONGSCN-01

[11/02/2014 12:07:37][Info] No idle nodes to shrink

[11/02/2014 12:07:47][Info] ===============Grow Check==================

[11/02/2014 12:07:47][Info] Number of active jobs : 0

[11/02/2014 12:07:47][Info] Number of queued jobs : 0

[11/02/2014 12:07:47][Info] No enough relevant workload or nodes to grow.

[11/02/2014 12:08:38][Info] ===============Shrink Check================

[11/02/2014 12:08:38][Info] 2 idle node(s) found in this check: YUTONGSCN-02 YUTONGSCN-01

[11/02/2014 12:08:38][Info] ---------------Shrink Nodes----------------

[11/02/2014 12:08:38][Info] Idle nodes to shrink : 2

[11/02/2014 12:08:38][Info]      NetBiosName   NodeState     NodeHealth    Weight              Groups

[11/02/2014 12:08:38][Info]      -----------   ---------     ----------    ------              ------

[11/02/2014 12:08:38][Info]     YUTONGSCN-02      Online             OK       0.0        ComputeNodes

[11/02/2014 12:08:38][Info]     YUTONGSCN-01      Online             OK       0.0        ComputeNodes

[11/02/2014 12:08:38][Info] Bringing nodes offline

[11/02/2014 12:09:36][Info] Nodes shrink operation started successfully.