I've already shared some high-level best practices for the Cloud OS Network (COSN) Platform in my previous posts. Today I'll continue sharing my knowledge about building IaaS cloud services based on Hyper-V, System Center and Azure Pack for hosting service providers and large enterprises. Here are my best practices for IaaS on the COSN Platform.
04.02.2016: I've updated the list with recommendations that service providers sent in after the post was first published. Thanks for that!
Common best practices
- Don't forget to configure automatic Windows Server activation. I've already shared some best practices in a previous post.
- Use the same Update Rollup (UR) version for all COSN Platform components. Mixing UR versions across VMM, SPF and WAP is not supported and will cause a lot of trouble. Don't forget to update SCOM agents to the newest UR.
- Use high-availability deployment options for all COSN Platform components. Eliminate single points of failure.
- Change the HTTPS port of each service from its default to 443. For example, the SPF service uses port 8090 by default, and Azure Pack components also use non-standard HTTPS ports after installation.
- Use Enterprise CA certificates for internal services and certificates issued by a public CA for services that tenants access. Don't use self-signed certificates.
- Use a dedicated VM with all Windows Server and System Center consoles installed to manage the COSN Platform. Enable remote management for IIS across COSN Platform components.
- Add #WAPWiki to your favorites. This is a very useful resource.
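The "same Update Rollup everywhere" rule above is easy to check with a few lines of script. Here is a minimal Python sketch, assuming you have already collected the installed UR number of each component by hand or from your inventory tool. The helper name `find_ur_mismatches` and the sample numbers are hypothetical and are not part of any Microsoft tooling.

```python
def find_ur_mismatches(installed, target=None):
    """Return the components whose UR number differs from the target UR.

    installed: dict mapping a component name to its UR number.
    target: the UR every component should run; defaults to the highest
            UR found in the inventory (an assumption of this sketch).
    """
    if not installed:
        return {}
    if target is None:
        target = max(installed.values())
    return {name: ur for name, ur in installed.items() if ur != target}


# Hypothetical sample inventory, collected manually
inventory = {"VMM": 9, "SPF": 9, "WAP": 8, "SCOM agent": 9}
print(find_ur_mismatches(inventory))  # prints {'WAP': 8}
```

An empty result means all listed components report the same UR.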
Management Active Directory
- Use naming conventions for user accounts, service accounts and computer accounts.
- Provide full descriptions for service accounts.
- Don't add users to Domain Admins unless it is absolutely necessary. Delegate rights or grant local admin rights instead.
- Create separate security groups for service access. Configure services to grant access to groups instead of individual users.
- Use Group Managed Service Accounts (gMSAs) instead of traditional user accounts for services. This increases security because the service account password is managed automatically. But be sure that the service itself supports gMSAs. For example, SQL Server 2014 supports them.
- If you run the domain controller that holds the PDC role inside a Hyper-V VM, disable "Time Synchronization" on the Integration Services tab in the VM settings. Otherwise the hosts will try to get time from the PDC while the PDC tries to get time from the underlying host. You'll end up with a chicken-and-egg problem, and the time will not be accurate across the whole management domain.
- Always configure time synchronization with an external source on the domain controller that holds the PDC role. For example, run this command on that DC and restart the server:
w32tm /config /manualpeerlist:pool.ntp.org /syncfromflags:MANUAL
Hyper-V Hosts and Clusters
- Don't enable NUMA Spanning in the Hyper-V Settings of the host unless it is necessary. NUMA Spanning can potentially increase VM density, but it decreases CPU performance.
- Don't use checkpoints for management VMs unless necessary. Checkpoints decrease virtual disk performance and consume more storage.
- Use the same version of network adapter drivers across the Hyper-V cluster.
- Don't enable CPU compatibility mode for VMs if you use hosts with the same CPU family in the cluster. CPU Compatibility can potentially decrease CPU performance because not all available CPU instructions will be used. Details are available here.
- Run cluster validation tests every month and check the results.
- Check the consistency of software updates every month.
- There is no need to use a dedicated network adapter for Live Migration traffic on a 10GbE+ LAN. You can use the Cluster network for this.
- Exclude unneeded and unstable network adapters from cluster use.
- To achieve high availability, use multiple similar network adapters and use NIC Teaming in "Dynamic" mode.
- Keep hosts that run NVGRE Gateways separate from hosts that run VMs using NVGRE Network Virtualization. Collocating them is not supported.
- Tune VMQ properly. However, partners have reported issues with Broadcom chipsets, so they disable VMQ on network cards with such chipsets.
- If you use a 10GbE network, check that Live Migration doesn't use "Compression" mode.
- Install all available hotfixes for failover clustering from this article.
- Install all available hotfixes from this article if you use NVGRE.
- Hyper-V hosts must be able to see only the LUNs that they will use for CSV. Configure zoning and masking properly.
- Connect one LUN to several hosts only if they are members of the same cluster. Otherwise there will be no write master, the hosts will try to write simultaneously, and the file system will be corrupted very soon.
- Prefer formatting disks in GPT instead of MBR.
- Use a 64 KB cluster size.
- Use storage classification functionality for storage disks, clouds and VM Images.
- If you are using SMB to store VMs, specify the FQDN instead of the flat name in the share path: \\fs01.contoso.com\VMs instead of \\fs01\VMs. VMM doesn't like flat names for network shares.
- Install all available hotfixes for File Services on SOFSs and VMM Library Servers from this article.
- If you use Storage Spaces:
  - Read this post
  - Use at least 3 JBOD enclosures with 2-way Mirroring to leverage Enclosure Awareness.
  - Apply the recommended registry changes
  - Disable TRIM and Physical Disk cache
  - Specify "-UseLargeFRS" option when formatting drives.
  - Use "Least Blocks" MPIO Global policy and "Round Robin" MPIO policy for SSDs.
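The FQDN advice for SMB shares above (\\fs01.contoso.com\VMs rather than \\fs01\VMs) can be checked before a path is handed to VMM. Here is a minimal Python sketch that only inspects the text of a UNC path. `is_fqdn_unc` is a hypothetical helper and does not call VMM or resolve DNS. Treating any dotted, non-numeric host name as an FQDN is a simplifying assumption.

```python
def is_fqdn_unc(path):
    """Return True if path looks like \\\\host.domain\\share with a dotted host name."""
    # A UNC path starts with two backslashes.
    if not path.startswith("\\\\"):
        return False
    # After the prefix we expect at least "host\share".
    parts = path[2:].split("\\")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return False
    host = parts[0]
    # Reject IP addresses: they contain dots but are not FQDNs.
    if host.replace(".", "").isdigit():
        return False
    labels = host.split(".")
    # A flat (NetBIOS-style) name has a single label; an FQDN has several.
    return len(labels) >= 2 and all(labels)


print(is_fqdn_unc(r"\\fs01.contoso.com\VMs"))  # prints True
print(is_fqdn_unc(r"\\fs01\VMs"))              # prints False
```

The same check applies to file shares and library servers added to VMM.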
Virtual Machine Manager
- VMM Service is very important for COSN Platform. Use at least 2 VMM instances in a cluster.
- Add all used networks as Logical Networks. Define subnets and IP pools for every network.
- Use proper isolation types for Logical Networks. All management logical networks should use "One Connected Network" isolation. NVGRE-based Logical networks for tenants must use "Hyper-V Network Virtualization Isolation". If you use traditional VLANs for tenants instead, create new sites with different VLANs in the same Logical Network with "VLAN-based Isolation". Details are here.
- On every Hyper-V host, check in the host's Hardware Settings that every physical network adapter is mapped to the proper Logical Network. Confirm "Network Compliance" for each adapter in the "Logical Switches" section.
- On every Hyper-V host, check in the host's Hardware Settings that local system drives are not available for new VM placement (uncheck the "Available for Placement" checkbox).
- If you use NVGRE, consider using at least a /22 IP pool for the PA network. Otherwise you will run out of available PA IP addresses very soon.
- Always add file shares and libraries using FQDN instead of flat names.
- Check that all shares have correct permissions. There should be no errors in VMM.
- Use traditional clustered File Server for highly available VMM Library. SOFS is not recommended for this.
- Scope Libraries to proper Host Groups, Networks and Clouds.
- Store the HNV Gateway service template in the VMM Library. You will need it to create additional network services.
- Use Availability Sets for VMs with guest clustering and NLB. This will create anti-affinity rules for you and place VMs with guest clustering on different hosts.
- Use different RunAs Accounts for different purposes. Assign only needed rights to these accounts.
- Under some conditions the host refresher switches to Legacy mode (instead of the default "Event Based" mode). Check the refresher mode periodically and revert it to Event Based mode.
- Configure Bare Metal Deployment to reinstall faulty hosts and add new hosts quickly.
- Deploy different HNV gateways from the same service template.
- Don't miss important parameters in the network service connection string (BackendSwitch, MpDiscovery and MpDiscoveryIpAddress).
- Configure "Update Baselines" to ensure Hyper-V hosts patching levels are consistent.
SQL Server
- Don't use Dynamic Memory for VMs with SQL Server. Here is a good explanation.
- COSN Platform supports SQL Server 2012 and SQL Server 2014. You can use either of them, but don't forget to install the latest cumulative updates.
- Never install SQL Server in standalone mode. SQL Server is a core component: VMM and WAP won't work without a SQL Server connection. Use AlwaysOn Failover Clustering or AlwaysOn Availability Groups for high availability, or both.
- Don't collocate all DBs in the same instance. At the very least, use a dedicated VM for the SCOM DW.
- Use separate domain service accounts for the SQL Server service and SQL Server Agent.
- Use gMSA if you use SQL Server 2014 or newer.
- Use separate disks for system, database files and log files.
Windows Azure Pack
- Don't collocate all Azure Pack components in the same VM. Use Minimal distributed deployment or Scaled distributed deployment architecture, explained here.
- Deploy highly available RDS Gateway for console access to VMs. Details are available here.
- Don't forget to update the Azure Pack databases to the same UR version after you update all Azure Pack components.
- Reconfigure portal names and ports as described here. Use port 443 for HTTPS instead of the defaults.
- If Microsoft NLB doesn't suit your needs, you can use virtual network load balancers instead. KEMP VLM and Citrix NetScaler VPX work OK with Azure Pack services.
- By default, Azure Pack sends the user's password in clear text after a password change initiated by the tenant. It's a good idea to delete this field in Notification Settings in the Azure Pack admin portal: passwords should not be stored in users' inboxes.
- Update Hyper-V integration services for Windows and Linux to the latest available version.
- Use this manual for CentOS VM template creation.
- Don't use Dynamic Memory for templates. You can't predict which workloads tenants will install in the VMs deployed from these templates. Some workloads, like Exchange or SQL Server Standard, don't support Dynamic Memory.
- Tenants are sometimes confused because they need to press CTRL+ALT+END instead of CTRL+ALT+DEL when they connect via console to their Windows Server based VMs. To eliminate this confusion, disable the CTRL+ALT+DEL requirement on the logon screen.
- You can't disable Secure Boot option for Generation 2 VM Roles. Linux distros don't support Secure Boot on Hyper-V 2012 R2. So don't use Generation 2 VM Roles if you need Linux. But there is no such issue with Standalone VMs - you can disable Secure Boot manually in VM template settings in VMM.
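To put the /22 recommendation for the PA network in numbers, here is a small Python sketch that counts the addresses available in pools of different sizes using the standard `ipaddress` module. The 10.10.0.0 range is an arbitrary example, and `usable_addresses` is a hypothetical helper. The count only subtracts the network and broadcast addresses, so any gateway or reserved addresses in your pool reduce it further.

```python
import ipaddress


def usable_addresses(cidr):
    """Count the addresses in a subnet, minus the network and broadcast addresses."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - 2


# Compare a few candidate sizes for the PA pool (example range only)
for prefix in (24, 23, 22):
    cidr = f"10.10.0.0/{prefix}"
    print(cidr, usable_addresses(cidr))
# prints:
# 10.10.0.0/24 254
# 10.10.0.0/23 510
# 10.10.0.0/22 1022
```

How fast the pool is consumed depends on your deployment, so treat these figures as the upper bound for each pool size.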
But remember: none of these recommendations are official statements by Microsoft. I share these best practices based on my personal experience, gained in 10+ COSN Platform deployments at service providers in different countries. So no warranties 🙂