Quite a while back a Managed Service Provider partner (hereafter referred to as "the MSP") approached Microsoft regarding migrating one of their datacenter facilities to Azure. The MSP is an experienced managed service provider delivering Dynamics ERP and ISV solutions focused on a few industries such as automotive, retail and finance. Longer term they are looking to transform into a highly differentiated LOB cloud and managed services provider, and they feel that operating datacenters offers limited value, differentiation and profitability in this transformation. They were already providing managed services to Dynamics ERP customers on Azure, based on the customers' Microsoft Enterprise Agreements, and had significant expertise in managing Azure-based solutions. The colocation contract for their own datacenter came up for renewal and they saw it as an opportune moment to evaluate using Azure as their primary datacenter for delivering Dynamics ERP + ISV solutions to their customers. However, the MSP had limited experience in migrating services from on-premise facilities into Azure, particularly at this scale. The MSP engaged with Microsoft for both technical guidance and business support for the migration. The MSP had initially attempted to proceed with the migration on their own but had not succeeded in making the business case for migration to Azure, and in particular in making the financial model viable.
At the time we used the MAP tool to gain insights into the actual resource consumption of the platform components; these days more advanced tools are available to accomplish this.
Part-1 of this post has been written with my colleague Chris Brown from the Applied Incubation Team.
The following were the MSP's core objectives:
- Transform into a cloud based managed services provider with deep expertise in Microsoft Dynamics ERP, CRM and Azure platforms.
- Leverage Microsoft Azure as their primary DC for customer expansion and growth.
- Migrate the existing customer managed services from their co-location datacenter facility to Azure.
- Minimize downtime, with any service outages scheduled outside the operating hours of the Production service.
- Run the current services in Azure for the same cost or less than in the co-lo facility.
Most guides for migrating to the cloud focus predominantly on the technology and barely consider the business aspects. However, large datacenter migrations are typically managed as business transformation initiatives and as such are signed off by the company's senior business executives, including the CFO. Therefore, we approach datacenter migrations as we would any other business transformation, by considering the business and technical aspects as two sides of the same coin. In doing this we understand the fundamental business case and economics of the potential solution options, so that decisions on which is the most suitable migration option can be made by weighing up both the business and technical implications.
The following are key steps in our approach:
Understand the current business environment and economics
In order to make the datacenter migration to Azure successful you need to understand the business environment and the business drivers for the MSP wanting to migrate their datacenter to Azure in the first place.
The MSP already had a significant portion of their infrastructure running in Azure and their long-term vision was to be fully Azure-hosted. The colo contract for the datacenter in question was nearing the end of the contract period, and hence they had the opportunity to minimize exit fees by migrating this datacenter to Azure at this time, or alternatively to renew the colo contract. The key factor for the MSP was understanding the cost of moving to Azure (including migration, colo contract obligations, etc.) compared to the alternative of simply renewing their colo contract.
Symbiosis between business and technical solutions
Datacenter change programs often seem to be driven by the technical solution, whether intentionally or otherwise. In other words, the technical migration solution is designed and then the resulting business impact is determined from it. A better approach is to adopt a business-driven paradigm, where the goal is to derive certain desired business outcomes, and the technical solution is then engineered accordingly. However, the business solution and the technical solution are inherently connected; changes in the technical solution impact the business solution/outcomes and vice versa. Therefore, at a micro level it is important to understand the key connection points and trade-offs between the business and technical realms and then determine the optimal holistic solution. In practice, this starts by identifying the fixed and variable parameters across the business and technical areas. Fixed parameters constitute constraints which will bound the potential solution (see the next section). Variable parameters can be flexed, but may have dependencies on other parameters, so that changes in some parameters affect others. It can be tempting to try to design a unified equation that represents the entire ecosystem, but this can take a long time and ultimately prove to be inaccurate. The most expeditious approach is to model a variety of options - see the final section of this post, which describes how to approach this.
The most common inflection point is when the existing datacenter infrastructure is out of capacity or end of life. It is crucially important to target these inflection points because trying to convince a company to undertake a transformation project (to migrate to Azure) when their existing datacenter solution is perceived to be fit for purpose is much harder.
Therefore, key aspects to consider are:
- What are the strategic objectives for the partner, e.g. business model transformation, customer and revenue growth, moving away from additional capital investment?
- What are the business drivers for migration to Azure?
- What is the budgeting cycle in the company and who approves projects?
- When are the inflection points? These could include expiring colo/maintenance contracts, the hardware depreciation/refresh cycle, end of support, existing hardware reaching capacity, and customer expansion/growth in newer geographies.
- Existing contractual arrangements? Who owns the existing infrastructure and maintenance, timelines, break-points, dependencies.
- What is the detailed financial breakdown for the as-is on-premise hosting versus the Azure migration?
Understand the as-is architecture and constraints
Core to an optimal migration to Azure is understanding the on-premise solution. This obviously includes the technology, but crucially you must understand the business outcomes currently being delivered by the on-premise solution, which often involves probing several levels underneath the surface, past the technical specifications to understand the design decisions taken and why the solution has been configured the way it has. Also important is to understand the business and technical constraints, both for the current on-premise solution and for the migrated solution in Azure if these are different. Even for a "lift-and-shift" migration to Azure don't necessarily assume that the existing constraints still apply to the migrated solution, as you could be leaving options and/or money on the table.
Gather detailed as-is usage data
These days most on-premise infrastructure is virtualized. The most common approach when companies are investigating migrating their datacenter or solution to Azure is to map their on-premise virtual machine sizes to the nearest Azure VM t-shirt size running 24x7, rounding capacity up to the next biggest Azure VM t-shirt size when there isn't an exact match. This is a fundamentally flawed approach and results in Azure appearing very expensive.
For solutions which need to run 24x7 at a constant capacity level, Azure may not be the most cost effective option compared to on-premise. However, very few real-world solutions fit this pattern. The reasons why a direct one-to-one mapping of on-premise provisioned capacity to Azure will over specify capacity are:
- Constant capacity, specified for peak load. Infrastructure has been designed to handle peak demand at all times rather than flexing capacity over time to meet demand. Most demand follows cycles (daily, monthly, quarterly or based upon predictable triggers) where median load is a small percentage of the peak.
- On-premise capacity is not quick to change. Adding capacity to an on-premise server farm can, in the worst case, involve procuring additional hardware, racking/stacking, etc., which could take several weeks. Therefore, architects and System Administrators will add an on-premise capacity buffer to mitigate against this.
- On-premise solution capacity is typically specified for estimated demand years into the future. Traditional on-premise projects often have 3 to 5 year business cases and it is common for infrastructure capacity to be provisioned on day 1 for the forecast capacity in 3+ years' time.
- Required capacity was forecast but not measured. On-premise capacity was set based on a forecast and not retuned for actual demand.
- On-premise solution capacity will always be over-specified by design. When architects perform capacity planning they will naturally add a capacity safety net to their forecasts; it's human nature. This is because under-provisioning capacity can have visible implications (the solution could fail) but over-provisioning has far less visible implications (infrastructure costs are higher than they should be), and companies rarely hold architects accountable for their capacity estimates.
What does this mean in practice? Our experience shows that average on-premise infrastructure utilization is often less than 40%. This means that companies could be paying 2.5 times what they need to, for infrastructure capacity that is not being used. For the MSP, the average on-premise CPU and RAM utilization (95th percentile) were 9% and 57% respectively.
Therefore, the most efficient approach is to ensure that Azure infrastructure capacity tracks the actual demand as closely as possible. This can be achieved by:
- Consider customer growth, but provision infrastructure capacity in Azure appropriate for the actual solution demand now.
- Use smaller percentage capacity headroom values. Scaling for 90% utilization is recommended for most solutions.
- Ensure that the Azure capacity will scale up and down in response to changes in demand.
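As a minimal sketch of the difference between a one-to-one mapping and sizing from measured demand, the Python below picks a VM size both ways. The size catalogue, its relative prices and the 12 core / 24 GB example VM are hypothetical (they are not real Azure sizes or prices). The 9% and 57% values are the MSP's 95th percentile figures quoted above, applied here to a single VM purely for illustration, and the 90% target is the headroom value recommended in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    name: str
    cores: int
    ram_gb: float
    hourly_cost: float  # hypothetical relative price, not real Azure pricing


# Hypothetical catalogue, ordered from cheapest to most expensive.
CATALOGUE = [
    VmSize("small", 2, 8, 1.0),
    VmSize("medium", 4, 16, 2.0),
    VmSize("large", 8, 32, 4.0),
    VmSize("xlarge", 16, 64, 8.0),
]


def smallest_fit(cores_needed: float, ram_needed_gb: float) -> VmSize:
    """Return the cheapest size that satisfies both requirements."""
    for size in CATALOGUE:
        if size.cores >= cores_needed and size.ram_gb >= ram_needed_gb:
            return size
    raise ValueError("no size in the catalogue is large enough")


def naive_size(provisioned_cores: int, provisioned_ram_gb: float) -> VmSize:
    """One-to-one mapping: round the provisioned spec up to the next size."""
    return smallest_fit(provisioned_cores, provisioned_ram_gb)


def right_size(provisioned_cores: int, provisioned_ram_gb: float,
               cpu_p95: float, ram_p95: float,
               target_utilization: float = 0.90) -> VmSize:
    """Size from measured 95th percentile usage at a target utilization."""
    cores_needed = provisioned_cores * cpu_p95 / target_utilization
    ram_needed = provisioned_ram_gb * ram_p95 / target_utilization
    return smallest_fit(cores_needed, ram_needed)


# Hypothetical on-premise VM: 12 cores, 24 GB RAM.
naive = naive_size(12, 24)
sized = right_size(12, 24, cpu_p95=0.09, ram_p95=0.57)
print(naive.name, sized.name, naive.hourly_cost / sized.hourly_cost)
```

In this made-up case the one-to-one mapping is forced up to the largest size by the core count alone, while the measured demand is driven by RAM and fits a much smaller size.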
The actual demand across the data center can be measured using the Microsoft Assessment and Planning (MAP) Toolkit, which is available as a free download. This can collect infrastructure usage data such as CPU, RAM, network and storage utilization and much more besides.
For the MSP we used the MAP Performance Metrics Wizard to measure basic infrastructure information about the CPU, memory, disk, and network utilization of all the virtual machines every 5 minutes. MAP uses an agentless approach, which was very simple to set up and only required a machine to run the tool on, located on the network with the required port access to the VMs. It is recommended to run the data collection over an extended period to cover all expected performance cycles, e.g. over a month end, as applicable. Usefully, MAP also records the provisioned VM infrastructure specification (number of CPU cores, RAM, etc.) plus details of the operating system version.
Develop a detailed financial model for a selection of solution options
Often companies thinking of migrating to Azure simply model a single option, which is a one-to-one equivalent in Azure of their on-premise deployment. As described in the previous section, mapping one-to-one into Azure is a mistake, but considering only a single option is also not a good idea. At a bare minimum, two options should be considered: a "migrate to Azure" option and a "stay on-premise" option. Note: the "stay on-premise" option is rarely the same as "do nothing" because the economics of the current on-premise solution will invariably change over time. However, it is recommended that multiple Azure target options are considered and modelled along with the stay on-premise option to determine which has the best business case over time.
- Stay on-premise option (baseline 1)
- Direct one-to-one port into Azure (baseline 2)
Capacity focused variations to consider in the Azure options (some can be combined as needed):
- Pre-Production environments started on demand (different types of pre-Production environments will have different capacity requirements, e.g. Functional Test versus Performance Test)
- Pre-Production environments started on schedule (e.g. Monday to Friday 8am to 6pm - i.e. switched-off out of hours)
- Single geography Production environments scaled back during non-peak hours
- Tracking available capacity to actual demand in pseudo real-time, with a small buffer.
- Pre-loading capacity for known peak cycles (e.g. start of business day logins, end of month batch run, etc.)
- Consolidation of SQL hosts
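To get a feel for the scheduled pre-Production variation above, the short sketch below works out how much of the week an environment actually runs on a Monday to Friday, 8am to 6pm schedule. It only covers compute hours and assumes the environment is fully switched off outside the schedule; costs that typically continue while a VM is stopped, such as storage, are ignored. The function names are ours, chosen for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def scheduled_hours_per_week(days_per_week: int, start_hour: int, end_hour: int) -> int:
    """Hours per week an environment runs on a fixed daily schedule."""
    if not 0 <= days_per_week <= 7:
        raise ValueError("days_per_week must be between 0 and 7")
    if not 0 <= start_hour < end_hour <= 24:
        raise ValueError("expected 0 <= start_hour < end_hour <= 24")
    return days_per_week * (end_hour - start_hour)


def running_fraction(days_per_week: int, start_hour: int, end_hour: int) -> float:
    """Share of the week the environment is running, compared with 24x7."""
    return scheduled_hours_per_week(days_per_week, start_hour, end_hour) / HOURS_PER_WEEK


# Monday to Friday, 8am to 6pm.
office_hours = scheduled_hours_per_week(5, 8, 18)
fraction = running_fraction(5, 8, 18)
print(f"{office_hours} of {HOURS_PER_WEEK} hours per week ({fraction:.1%})")
# 50 of 168 hours per week (29.8%)
```

On this schedule the environment runs for under a third of the hours it would consume running 24x7.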
Solution focused variations in the Azure options (suitability depends on the applications used and will likely have architecture impact):
- Moving VM hosted SQL to Azure SQL
- Utilizing PaaS services to replace some IaaS components
We have found that running Azure 24x7 is typically more expensive on average across the entire datacenter than the equivalent on-premise deployment, but factoring in actual utilization makes it cheaper. This is because Azure is more agile and can scale much more quickly in response to changes in demand, and therefore can be deployed more efficiently (i.e. lower capacity provisioned for the same output performance) than the equivalent on-premise deployment. In particular, peaky and unpredictable workloads are a sweet spot for Azure. In fact, any workload with large variations between the average and peak performance plays to Azure's pay-per-use strengths, as you only pay for the capacity provided at a point in time. However, scaling Azure capacity up or down is not instantaneous and can take a few minutes depending on the application running. Where capacity requirements increase substantially in a short period, pre-loading nodes is a good idea. All of this needs to be factored into the financial model.
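The effect of pay-per-use on a peaky workload can be sketched by counting node-hours for one day under two policies: capacity fixed at the peak, and capacity that tracks hourly demand with a small buffer. The demand profile, node capacity and 10% buffer below are invented for illustration, and the sketch assumes capacity can be changed every hour with no scaling delay.

```python
import math

NODE_CAPACITY = 100  # hypothetical units of demand that one node can serve
BUFFER = 0.10        # headroom kept on top of the measured demand

# Hypothetical hourly demand for one day (24 values, midnight to midnight).
HOURLY_DEMAND = [50] * 7 + [300, 900, 600] + [400] * 7 + [200] * 3 + [50] * 4


def nodes_needed(demand: float, node_capacity: float = NODE_CAPACITY,
                 buffer: float = BUFFER) -> int:
    """Whole nodes required to serve the demand plus the buffer (at least one)."""
    return max(1, math.ceil(demand * (1 + buffer) / node_capacity))


# Policy 1: provision for the peak hour and run that capacity all day.
fixed_node_hours = nodes_needed(max(HOURLY_DEMAND)) * len(HOURLY_DEMAND)

# Policy 2: re-size every hour to follow the demand.
tracked_node_hours = sum(nodes_needed(d) for d in HOURLY_DEMAND)

print(fixed_node_hours, tracked_node_hours)
print(f"tracking uses {tracked_node_hours / fixed_node_hours:.0%} of the fixed node-hours")
```

With this made-up profile the demand-tracking policy consumes roughly a third of the node-hours of the fixed policy, which is why a higher price per node-hour can still result in a lower total cost.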
Target utilization when capacity planning
What approach should we use for analyzing historical utilization? Measuring historical utilization will invariably be done by sampling the utilization in fixed intervals. The first decision to make is how often to retrieve the measured statistics from the infrastructure; every 5 minutes is a good starter, but for highly varying loads you may want to reduce this. The longer the sampling interval (that is, the less frequently the statistics are retrieved), the greater the possibility that significant variation in the traffic during the sampling interval may be hidden due to the effects of averaging, but conversely the higher the sampling frequency the more load on the system.
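The averaging effect described above is easy to reproduce. The sketch below takes 30 one-minute utilization samples containing a three-minute spike (values invented for illustration) and averages them into longer sampling intervals to show how the spike fades.

```python
from statistics import mean


def downsample(samples: list[float], factor: int) -> list[float]:
    """Average consecutive groups of `factor` samples into one value each."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical CPU utilization (%): 20% baseline with a 3-minute spike to 95%.
one_minute = [20.0] * 30
for minute in (10, 11, 12):
    one_minute[minute] = 95.0

five_minute = downsample(one_minute, 5)
thirty_minute = downsample(one_minute, 30)

print(max(one_minute), max(five_minute), max(thirty_minute))
```

The same spike reads as 95% at one-minute resolution, 65% when averaged over 5 minutes and 27.5% when averaged over 30 minutes.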
The next decision is which statistical method to apply to the data. Mean averages or peak capacity are common measures, but these approaches will underestimate or overestimate capacity needs respectively. Typical infrastructure usage is actually quite spiky, so a high percentile such as the 95th percentile is commonly used: it captures the spikes that a mean average hides, while discarding the most extreme outliers that determine the peak. Basically, the 95th percentile says that 95% of the time the usage is at or below this amount. Conversely, this means your infrastructure will be saturated up to 5% of the time. The MAP tool described in the previous section provides 95th percentile figures out of the box in its Performance Metrics Wizard report.
What is the optimal target utilization? This will vary depending on the type of application. When modeling using the 95th percentile figure you are essentially targeting the infrastructure to be fully utilized. Do not then add an additional utilization buffer on top; instead change the percentile figure to suit your workload, e.g. if 5% saturation is too high for your workload then model the 98th percentile and compare the results. Note: this is different from the parameters you will set for Azure Autoscale and VM scale sets.
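To make the difference between mean, peak and percentile figures concrete, here is a small sketch using the nearest-rank percentile method on 100 invented utilization samples. MAP may calculate its percentile differently; the nearest-rank method is simply an assumption that keeps the example easy to follow.

```python
import math
from statistics import mean


def percentile(samples: list[float], percent: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    `percent`% of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical utilization samples (%): mostly idle, with occasional spikes.
samples = [10.0] * 90 + [50.0] * 6 + [100.0] * 4

print("mean:", mean(samples))
print("95th percentile:", percentile(samples, 95))
print("98th percentile:", percentile(samples, 98))
print("peak:", max(samples))
```

Here the mean (16%) would badly under-size the workload and the peak (100%) would size for the rarest spikes, while the 95th percentile (50%) sits in between. Moving to the 98th percentile jumps to 100% for this data set, which shows why it is worth modelling more than one percentile.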
Understanding your specific cost drivers from a variety of viewpoints (environments, function, location, etc.) is the first place to start, as these can vary dramatically by company, although some generic truths do apply, as described below. For the MSP, more than half of their on-premise costs were due to networking and storage, caused by an over-engineered network architecture and expensive SAN storage, which is atypical. From an environment viewpoint, 40% of their costs were due to pre-Production environments which were running 24x7. Then from a functional viewpoint, 30% of their costs were driven by SQL. Therefore, analyzing the costs through different lenses helped to identify potential optimization solutions.
Azure costs are driven by the CPU/RAM needed, with storage and networking costs tending to be a much smaller component of the total costs. However, interestingly this is typically different with on-premise costs where networking and storage (if SAN based) can be a much bigger cost driver.
Things that are included in Azure but are usually missed in financial modeling comparisons
Azure includes a number of elements as part of the core service that are usually missed in financial modeling comparisons. To ensure an "apples for apples" comparison with the on-premise option, ensure you include the following line items in your on-premise financial model (non-exhaustive list):
|Financial Model|Azure model|On-premise model|
|Capital costs for hardware| | |
|Hardware support/maintenance| | |
|Other factors (not necessarily to include in your financial model)| | |
|Financial risk (not directly bottom-line impacting)| | |
Include all existing contractual liabilities in the model
When financial modeling you must make sure all additional on-premise liabilities are included as costs in the to-be Azure model. There will be existing contractual liabilities for the on-premise datacenter such as hardware rental, hardware maintenance, software licensing, support, etc. which will probably not neatly finish at the same time, so there could be exit fees. These exit fees are part of the "cost to change" and so should be included in the Azure financial model. There will also be costs for the migration itself in terms of manpower and potentially also license fees for migration software. In addition, the biggest factor is that you will have to account for both the on-premise costs and the Azure costs during the migration period, with the former ramping down and the latter ramping up.
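A rough sketch of the parallel-run cost during the migration period follows. It assumes a linear migration (an equal share of workloads moves each month), that on-premise cost falls in proportion to the workloads that have left, and that workloads moved during a month are paid for on both sides in that month. Real colo and lease contracts often do not ramp down this smoothly, and all monetary values are invented.

```python
def parallel_run_cost(onprem_monthly: float, azure_monthly: float,
                      months: int, one_off_costs: float = 0.0) -> float:
    """Total cost of the migration period under a linear ramp.

    one_off_costs covers exit fees, migration manpower and tool licenses.
    """
    if months < 1:
        raise ValueError("months must be at least 1")
    total = 0.0
    for month in range(1, months + 1):
        onprem_share = (months - (month - 1)) / months  # still on-premise at month start
        azure_share = month / months                    # in Azure by month end
        total += onprem_monthly * onprem_share + azure_monthly * azure_share
    return total + one_off_costs


# Hypothetical figures: 100 per month on-premise, 80 per month in Azure,
# a 4 month migration, and 60 of exit fees plus migration effort.
migration_total = parallel_run_cost(100, 80, months=4, one_off_costs=60)
steady_state_azure = 80 * 4
print(migration_total, steady_state_azure)
```

With these numbers the four migration months cost 510, compared with 320 for four months of steady-state Azure, and that gap is the "cost to change" that the Azure option has to carry in the model.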
Sunk costs and depreciated assets
One common sticking point, with CIOs in particular, is existing on-premise datacenter assets that have not yet fully depreciated. Companies often buy datacenter equipment upfront with the plan to straight-line depreciate it over, say, 5 years. If there is still a remaining book value for the assets (i.e. they haven't yet fully depreciated) then there can be hesitance to invest in migrating to Azure. However, in the case of upfront payments the datacenter equipment is a sunk cost. The remaining book value of the equipment is therefore irrelevant, as the asset write-down for the remaining book value also represents a sunk cost because the cost of the equipment was incurred in the past. What matters is how the financial model for migrating to Azure looks compared to the remain on-premise option.
An exception to this is where the company has not paid upfront for the datacenter equipment and is instead leasing it for example. However, even in this situation the important thing is to model and compare financials of the potential future options, irrespective of any sunk costs that occurred in the past.
Time horizon for financial modeling
When creating the financial model, the time horizon to model over depends on the business environment, but 3 years is typical. However, you should ensure that the period is long enough to incorporate major cost events such as hardware refreshes, support contract renewals, etc.
Note: This paper is not intended to cover licensing in any depth, but rather highlight this as an important focus area.
There are several Azure licensing options available to MSPs across both direct and indirect models. Determining which vehicle to use for licensing Azure depends on the MSP's business model and whether or not they want to own the customer relationship and solutions end-to-end. Most MSPs want to do this as it allows them to demonstrate more value to their customers and hence build higher-margin and stickier customer engagements. As such, CSP should be seen as the primary Microsoft cloud channel program for Managed Service Providers. However, in some cases you may need to consider a hybrid model that combines both EA-based subscriptions and CSP-based subscriptions.
For the MSP, we used a combination of CSP for 75% of the workloads, with EA used to cover the remaining shared services. We also leveraged existing SQL licenses in Azure.
Scalability of the solution
Some of the key characteristics of cloud native applications include:
- Runs as one or more stateless processes,
- Supports concurrency and the ability to scale processes horizontally (elastic scaling),
- Disposability, designed to fail gracefully so that nodes can be replaced easily.
The vast majority of applications that you will be migrating to Azure will not be cloud native. In fact, most will probably be traditionally designed 3-tier applications running in virtualized form. For some initiatives you may have the luxury of re-architecting some of the applications during the migration, but our experience is that most datacenter migrations are primarily a lift-and-shift port of the current applications to Azure, and any re-architecting comes as a stage 2 activity.
Therefore, the ability of the application to scale horizontally up and down in response to changes in demand will vary depending on the application's design, as will your ability to track the provisioned capacity to the demand. In an ideal world your provisioned capacity curve will be a function of your demand curve plus your capacity buffer, but in reality you will need to factor the ability of your application to scale into the equation.
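One way to express this in a model is to start from the ideal curve (demand plus buffer) and then apply the application's scaling limits. The sketch below uses two such limits as examples: a minimum node count the application cannot go below, and a lead time, in intervals, that new nodes need before they can serve traffic. It assumes the upcoming demand is known in advance, and all values and parameter names are hypothetical.

```python
import math


def provisioned_nodes(demand: list[float], node_capacity: float, buffer: float,
                      min_nodes: int = 1, lead_intervals: int = 0) -> list[int]:
    """Nodes to run in each interval.

    Capacity must cover the highest demand in the current interval and in
    the next `lead_intervals` intervals, because nodes requested now only
    become useful after that lead time.
    """
    plan = []
    for t in range(len(demand)):
        window = demand[t:t + lead_intervals + 1]
        needed = math.ceil(max(window) * (1 + buffer) / node_capacity)
        plan.append(max(min_nodes, needed))
    return plan


# Hypothetical demand per interval, in the same units as node_capacity.
demand = [100, 100, 500, 900, 400, 100]

ideal = provisioned_nodes(demand, node_capacity=100, buffer=0.10)
constrained = provisioned_nodes(demand, node_capacity=100, buffer=0.10,
                                min_nodes=2, lead_intervals=1)
print(ideal)        # demand plus buffer, scaling instantly
print(constrained)  # same application with a floor and a scaling lead time
print(sum(constrained) - sum(ideal), "extra node-intervals")
```

The difference between the two plans is the capacity you pay for because of how the application scales, rather than because of the demand itself.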
When to consolidate resources
Companies will try to consolidate resources as part of their migration to Azure in order to reduce costs. In some areas this is a good thing to do, but more often than not this tactic can have negative implications if not done correctly. Decoupling your applications and enabling independently scalable micro-services is a good architectural pattern to follow, and therefore consolidation of VMs is a trade-off made to reduce costs. Consolidating properly sized VMs does not reduce costs, because the consolidated "big VM" infrastructure resources will be the same as the sum of the constituent VMs that were combined, and now you have coupled them together. However, for poorly sized VMs, consolidation can appear to allow you to "run more on less", but in reality all you have done is reduce the total (over-specified) capacity buffer you had before. Where consolidation can be beneficial to the overall solution economics is where it facilitates more efficient licensing, e.g. for SQL Server. A larger consolidated highly-available SQL cluster is typically more cost-effective than several smaller HA SQL clusters.
Options with SQL and using PaaS to reduce costs
For the MSP 30% of their costs were driven by SQL. The reason for this was that they had a large number of small SQL Servers so in the majority of cases each customer tenant was isolated. They had used the pattern of most customers having a single primary SQL node running SQL Server Standard Edition with a cold standby node. This provided basic fault-tolerance as in the event of a failure with the primary node, service was manually failed-over to the secondary cold standby node. This design was entirely driven by the need to minimize SQL licensing costs, and hence did not use SQL clusters with SQL Enterprise Edition for that reason.
There are two approaches that can be taken with this design:
- Azure now supports high availability with SQL Server Standard edition, which can mitigate the lack of scheduling ability for Azure maintenance windows.
- Use Azure SQL PaaS service to provide HA service, either provisioned by database or by required performance.
For the MSP, we utilized approach #1 above as this had minimal architectural impact. Modelling the financials for both approaches showed that approach #2 would be more cost effective, but it requires that your application layer supports Azure SQL, which the COTS product run by the MSP did not. However, over time as more applications support Azure SQL this will become less of an issue.
Existing SLAs and Contracts
Developing new options to drive financial improvements can make a big difference to the overall Azure business case, but some options may not fit with existing contractual obligations or SLAs. For example, one approach which has been mentioned already that can reduce costs significantly is to switch off some pre-Production environments outside of office hours. However, for the MSP this was not possible due to customer contractual SLAs that stated that Test environments needed to be available 24x7. Even in this situation, though, we were able to leverage Azure to make cost savings:
- Reduce Test environment capacity out of office hours (based on usage data) while still keeping the environments available 24x7.
- Plan for a future product update that gives the customer the choice of Test environments being available 24x7 or just during office hours, with different price points.
Migration approach - a business view
The duration of the migration period is a trade-off between cost and risk. Slow migration is lower risk, but the longer the migration the higher the double-counting costs where you are running the on-premise systems and Azure in parallel.
Another element to consider is big-bang cutover versus iterative migration. If an iterative migration is an option, this is preferable from both a cost and risk perspective. A big-bang cutover requires you run full capacity on-premise and Azure environments in parallel and keep them synchronized up to the cutover point. An iterative approach also allows you to learn from mistakes on a smaller scale and refine your migration as you go.
For the MSP we adopted an iterative migration over a 4 month period, striking a balance between parallel-run cost and risk.
Refine the model after initial Azure deployment
Modeling is just a forecast. It is recommended you use MAP to measure actual utilization in Azure where possible and tune your available capacity and autoscale rules accordingly.
Ensure contractual cover for all agreements
The resulting decisions from the financial modeling must be documented as a set of agreements in a contract between Microsoft and the MSP, spelling out clearly the expectations on both sides. Make sure you adhere to Microsoft policies and seek the required legal and contractual support. The contractual documentation should include but not be limited to the following elements:
- The agreed migration approach and high-level solution
- Agreed goals and KPIs for the partnership
- The financial model headlines, e.g. MSP consumption/revenue expectations over time
- Financial gives and gets across Microsoft and the MSP (including resource commitments)
- Technical gives and gets across Microsoft and the MSP
- High-level timeline for the change program, showing work-streams
- Protection of Microsoft and/or Partner IP used during the program
- Ownership of any new IP resulting from the program
- Specific roles and responsibilities between Microsoft and the MSP
- Governance approach including Steering Committee membership and communication cadence
- Escalation path and agreed criteria for escalation
- Agreement from the MSP to create a Case Study.
We'll cover the technical aspects in the second part of this article: Migration Datacenter to Azure – Part 2.