This post was contributed by Rhoderick Milne, everyone’s favourite Canadian-Scottish Premier Field Engineer.
Microsoft Exchange Server is the leading messaging platform in the world today, and can be found in SMBs, large enterprises and also behind cloud services of unimaginable scale. Exchange has been with us since its release in April 1996 and from that first version, it has evolved to offer businesses and consumers outstanding features that they have come to heavily rely upon.
This is fantastic! Until something happens…. That is, something bad happens, or something really bad happens.
When people are unable to access the services that they have become dependent upon, then: “Houston, we have a problem!” (Well, technically that should read “Houston we’ve had a problem here”, but… oh well!)
There are multiple preventative steps that we can take to help ensure that bad things don’t happen.
Are you an IT or business professional who needs to ensure:
- that your organisation’s Exchange environment meets the needs of its user base
- that Exchange is available
- and (if that’s not enough) that Exchange performance is also excellent?
If the answer to any of these questions is yes, then the chances are you may be wondering how best to achieve that! Microsoft has an offering that can help you and your business address these challenges!
So, what is it? The answer you have been seeking is the Exchange Risk Assessment Program (ExRAP).
The ExRAP is designed to review an enterprise’s Exchange organization. An ExRAP engagement helps identify both existing problems and risks of future problems by reviewing performance, operational processes and Exchange configuration settings.
ExRAP can help you solve the challenges above, and many more. By engaging with a certified Microsoft Exchange Premier Field Engineer (PFE) you have access to the ExRAP toolset, with all the solutions and knowledge that it contains.
Who uses ExRAPs?
Microsoft Premier Support helps thousands of customers each year achieve higher uptime and performance from their Exchange servers. Premier Field Engineering conducts ExRAPs against environments of all imaginable levels, ranging from organisations with only a couple of Exchange servers to large corporates with tens or hundreds of servers. The largest ExRAP to date was for over 900 servers! Most ExRAPs are performed at a slightly lower number than this, typically in the 10 – 30 server range, but it varies by geography and customer request.
What’s the process?
Your Technical Account Manager (TAM) or Service Delivery Manager (SDM) will typically have discussed and positioned the ExRAP with your organisation, and a PFE is normally contacted once the engagement has been agreed upon.
At this time you and your TAM should have discussed the size of the Exchange environment, as that directly correlates to the amount of time (hours) needed to analyse it. As you can imagine, the larger the environment, the longer it takes, and thus the cost increases. ExRAP pricing is broken into distinct tiers and it is the tier that will dictate the number of hours required.
The ExRAP is then booked into the engagement system, and an accredited PFE is contacted to deliver it.
The next step is to acquire scoping information, using a scoping tool. An on-site delivery cannot commence until valid scoping details have been sent to Microsoft, and this is required for two reasons:
- it demonstrates to Microsoft that the environment is ready for a full ExRAP
- it confirms the size of the environment, and thus the number of days required to complete the engagement
So submission of the scoping data is mandatory before the engagement can continue. Other questions that sometimes arise at this point may be something like:
- I don’t want to analyse the Exchange servers in the DR site
- I don’t want to analyse the old Exchange servers, only the new ones
- I don’t want to analyse the new Exchange servers only the old ones as the new ones only have some pilot mailboxes
While the team behind ExRAP at Microsoft certainly understands why customers ask these questions, they unfortunately defeat the design principles of ExRAP: ExRAP is designed to work against a holistic view of the Exchange environment, and as such we look at the entire Exchange organisation – not just a subset. Time and time again we have seen issues caused by one server impact other Exchange servers, and if we were to only analyse a portion of the environment we would not be able to accurately determine its true risk and health status. In some other RAPs, like SQL, SharePoint or Clustering, isolated instances are scanned, but ExRAP and ADRAP (for Active Directory) both require scanning of all of the relevant servers in a forest.
What are the top findings?
Now that we have reviewed some of the background to ExRAP, what are some of the top items that we see commonly reoccurring? Let’s look at the top 5 critical severity and top 5 high severity issues that are typically found globally during ExRAPs.
If you can address these top issues then you have already started to make fantastic inroads into maintaining Exchange server uptime!
Top 5 Critical Severity issues
- Critical security updates are missing
- Defined Service Level Agreements do not exist for backup and recovery processes
- No defined Service Level Agreements (SLAs) exist for the Messaging/Email service
- A documented disaster recovery plan does not exist
- No defined Operating Level Agreements (OLAs) exist between IT groups who own the services that support the Messaging/Email services
Critical security updates are missing
Failure to keep your core infrastructure servers updated with the latest security updates and service packs, especially those with a "critical" severity rating, leaves the entire environment at extreme risk of service outages, data loss and exposure, and other malicious activities. An outage of the Exchange infrastructure can be a business loss-generating event; a quiet compromise of the Exchange infrastructure could be even more critical.
Your Microsoft TAM will send out an email every month describing the upcoming updates that will be released on “patch Tuesday” and then a more detailed email after the updates are available. You can also sign up for the update notification service yourself, and pass this onto others. Microsoft strongly recommends proactively reviewing these bulletins for applicability, testing the updates in a lab, and once validated, installing into production in a defined maintenance window.
The Microsoft Baseline Security Analyzer (MBSA) is a useful tool for both scanning and reporting security status for a single server or across the computing environment. It scans for common incorrect configurations, overlooked default options, and the installation status of the latest security hot fixes from Microsoft. It also fully integrates with Microsoft Windows Server Update Services to scan systems according to a predetermined configuration.
The MBSA tool and its documentation can be obtained from the Microsoft Security Web site.
Microsoft Windows Server Update Services (WSUS) is available within the latest versions of Windows Server.
Defined Service Level Agreements do not exist for backup and recovery processes
Not having a clearly documented disaster recovery SLA can have a number of consequences, including the following:
- Disproportionate or unrealistic expectations during disaster recovery
- Uncertain recovery guidelines or timeframes during disaster recovery
- Difficulty in accurately gauging a DR plan’s appropriateness or effectiveness
In consultation with management, administrators should develop disaster recovery SLAs that accurately reflect the business needs and requirements of the user community. These plans should then be documented to provide clear guidelines for handling a disaster recovery incident. Disaster recovery SLAs also allow administrators to set appropriate expectations during a disaster recovery incident, which can help provide administrators with time to perform basic analysis of the root cause of an issue, to prevent reoccurrence.
Disaster recovery service level agreements should take into account the recovery levels and the various business requirements of the user community. At a minimum, a disaster recovery plan should also include the following:
- Realistic and attainable metrics
- Mailbox uptime, server uptime, restoration of service, and restoration of access to historical data.
- Information about the different levels of disaster recovery to be used in the environment. For example, when should a database be dial toned instead of launching an immediate restore job? A complete plan would also take into account the various requirements and concerns of the specific environment in question, as well as the information necessary for administrators to gauge their own performance in meeting the needs of the user community.
No defined Service Level Agreements (SLAs) exist for the Messaging/Email service
Service Level Agreements (SLAs) are negotiated agreements between IT and end customers. These agreements should contain several service targets including availability targets and windows of measurement.
Sometimes customers and IT do not effectively communicate such details, and as a consequence, misunderstandings regrettably occur. Typically this results in customers expecting 100% availability, whether or not that was funded, or is even possible, in the environment.
Another consequence of having no SLA is that IT does not have a mark to shoot for and by extension no bar from which to measure success. By having at least minimal service level agreements with customers, IT can improve the relationship with customers and also set expectations that they can manage.
- Quantify customer expectations
- Negotiate initial service level agreements
- Measure and report achievement(s) to both management and users
- Strive to improve the service level agreements through iteration
A documented disaster recovery plan does not exist
Disaster recovery is one of the most important functions of Exchange Server administrators. Recovering from many disasters frequently requires the coordination of multiple individuals, perhaps across multiple teams. Without a predefined plan for activating and coordinating these critical resources, the success of your recovery is left to chance and circumstance.
A well-documented Disaster Recovery Plan reduces the time spent deciding what to do, helps keep those involved up-to-date, and ensures that your organization can recover as quickly and efficiently as possible. Your plan should also ensure that the services and infrastructure upon which Exchange Server relies are available, reliable, and recoverable. An additional benefit in creating a Disaster Recovery Plan is that, during the plan-development process, you may discover areas where your systems are vulnerable. These vulnerabilities can then be reduced or removed to make your systems more robust and recoverable.
Create, test and maintain a detailed DR plan. Ensure that all documentation and the various prerequisites are available should the primary site totally cease to function. There have been several cases where issues arose because documentation and files were stored only in the primary site. When that datacentre failed, all access to the required documentation was lost.
No defined Operating Level Agreements (OLAs) exist between IT groups who own the services that support the Messaging/Email services
Service Level Agreements are agreements between IT and the customer, while Operational Level Agreements (OLAs) are agreements between the messaging team and the other groups that own the services supporting Exchange. Creating and maintaining SLAs and OLAs for your organization is a critical first step in being able to measure your own rate of success with Exchange Server. If SLAs and the corresponding OLAs are not present then it is difficult to design or accurately predict the outcome of an Exchange Server implementation.
- Negotiate initial operational level agreements
- Measure and report achievement(s)
- Strive to improve the agreements over time
Top 5 High Severity issues
- All Exchange environment counters are not monitored
- Successful DNS record registration by all domain controllers in the forest is not verified
- Patches are not tested before deployment to production
- Detailed Windows and Exchange server build documents do not exist
- Service Level Availability of the Messaging/Email service is not measured
All Exchange environment counters are not monitored
Monitoring an Exchange Server environment is a critical aspect of running a successful Exchange organisation. Ineffective or absent monitoring can lead to negative effects on performance, availability, and security.
Ensure that all relevant performance counters are monitored. This includes not just RPC latency, store RPC latency, disk latency and RPC operations, but all other relevant counters. Installing any monitoring tool and then assuming that the default installation will successfully monitor Exchange is a mistake. Some counters will need to be added and others tuned down. TechNet has values for the counters. But as Captain Jack Sparrow often says, they are guidelines rather than rules. For example, organisations that run Outlook in online mode exclusively will be far less tolerant of disk IO blips and thus the thresholds must be considered in the milieu of a given organisation.
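As a rough illustration of "guidelines rather than rules", here is a minimal Python sketch of checking sampled counter values against thresholds that you tune for your own organisation. The counter labels and limits are hypothetical placeholders, not real Exchange counter names or TechNet values, and collecting the samples is left to whichever monitoring tool you use.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Threshold:
    counter: str        # hypothetical label, not a real counter path
    max_average: float  # highest acceptable average over the sample window


def evaluate_counters(samples, thresholds):
    """Compare averaged samples with per-organisation thresholds.

    samples maps a counter label to a list of numeric readings.
    Returns (breaches, unmonitored): breaches is a list of
    (counter, average, limit) tuples, and unmonitored lists counters
    that have a threshold defined but no samples collected.
    """
    breaches = []
    unmonitored = []
    for threshold in thresholds:
        values = samples.get(threshold.counter)
        if not values:
            # A threshold with no data means the counter is not being watched.
            unmonitored.append(threshold.counter)
            continue
        average = mean(values)
        if average > threshold.max_average:
            breaches.append((threshold.counter, average, threshold.max_average))
    return breaches, unmonitored
```

Reporting the unmonitored counters separately is deliberate: a counter that nobody collects is exactly the gap this finding describes, and it should not be mistaken for a healthy one.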
Successful DNS record registration by all domain controllers in the forest is not verified
DNS is the primary name resolution mechanism for Active Directory, Exchange Server and Outlook clients. Invalid DNS data will break AD replication, authentication and resource lookups. A direct result of this will be that Exchange cannot locate catalog servers and Outlook clients cannot communicate with Exchange.
Utilise automated tools to ensure that all of the records required by DCs are registered in DNS. This should include checking both SRV and A records; it is not sufficient to just ping a DC, as this does not fully validate the records used by NetLogon. DNSLint and DCDIAG.exe /test:DNS are two tools that can do this. Other monitoring tools and platforms are also able to achieve the same.
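To show why a ping is not enough, here is a minimal Python sketch of the comparison an automated check performs: build the set of records each DC should have, then subtract what is actually registered. It assumes the registered records have already been exported by some other tool, and it checks only an illustrative subset (the host A record and the `_ldap._tcp.dc._msdcs` SRV record), not everything NetLogon registers. The function names and record tuple layout are hypothetical.

```python
def expected_records(domain, dcs):
    """Build the records each DC should have registered.

    dcs maps a DC host name to its IPv4 address. Records are
    (type, name, data) tuples. Only an illustrative subset is covered.
    """
    records = set()
    domain = domain.lower()
    for host, ip in dcs.items():
        fqdn = f"{host}.{domain}".lower()
        records.add(("A", fqdn, ip))
        records.add(("SRV", f"_ldap._tcp.dc._msdcs.{domain}", fqdn))
    return records


def missing_records(domain, dcs, registered):
    """Return the expected records that are absent from `registered`."""
    return sorted(expected_records(domain, dcs) - set(registered))
```

A DC can answer a ping and still appear in the missing list, because its A record resolves while its SRV record was never registered.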
Patches are not tested before deployment to production
Testing updates before deployment can help minimize the risk of adverse effects that an update might introduce in your environment. Although the depth of testing should depend on the importance of email to your business, some level of testing needs to be performed.
Based on the business importance of the messaging environment, one can then create a test plan that works to meet the SLA. At a bare minimum, a test Exchange server should be established and the update installed to verify that no obvious issues arise. The issue of not testing patches is often encountered where a customer has no test lab at all, where the test lab is badly out of sync with production, or where the test lab has been cannibalised for parts & resources to the point that it is ineffective.
Create a test environment that ideally mirrors production. Some organisations choose to deploy the test environment on virtual machines (VMs), which is fine; however, the crucial aspect is ensuring that the relevant aspects of the environment are tested, for example mail flow, client connectivity and server to server interoperability, ideally extended to include 3rd party services. Note that this should be a separate forest. How can you test schema extensions on a “test” machine that is in the corporate forest? The answer is that you cannot, as the schema will be updated and replicated to all servers regardless of whether or not they are deemed “test”.
Detailed Windows and Exchange server build documents do not exist
Complexity and inconsistencies are two demons that will challenge any infrastructure. Complexity for complexity’s sake is generally a poor idea, and simplicity will win out as it is easier to support in the long run. If there are no build standards this will result in servers having different configurations depending upon who built them and what they had for breakfast. As a direct result of this, failed changes will increase. Thus the time (and cost) of troubleshooting will also increase.
The first step in bringing servers under control and minimizing complexity is to create and follow a step-by-step build document for servers. By following a detailed build document, servers will be consistent when they go into production. Change management will then assist in keeping them consistent during their lifecycle.
Detailed build documentation should be created to document a server’s entire configuration. This is to include hardware specific configuration, OS, Exchange, service pack and update levels and also all of the 3rd party components that make up your messaging ecosystem.
Additionally, drift from the known configuration should be proactively tracked and monitored. This is called Desired Configuration Management (DCM) and is a function of SCCM. If you are interested in obtaining assistance from PFE with this please speak to your TAM as we have a specific offering that meets this need.
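The idea behind drift tracking can be sketched in a few lines of Python: compare what the build document says against what a server actually reports. This is only a conceptual sketch, not how DCM itself is implemented; the setting names are hypothetical, and gathering the actual values from a server is assumed to happen elsewhere.

```python
def find_drift(baseline, actual):
    """Return settings whose actual value differs from the build baseline.

    Both arguments map a setting name to its value. The result maps each
    drifted setting to an (expected, found) pair. A setting that is
    missing on one side shows up with None for that side.
    """
    drift = {}
    for key in sorted(set(baseline) | set(actual)):
        expected = baseline.get(key)
        found = actual.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift
```

Walking the union of both key sets means the check catches settings that were added outside the build document as well as those that were changed or removed.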
Service Level Availability of the Messaging/Email service is not measured
Measuring, reporting, and publishing availability data for the Messaging service is essential and assists with:
- Communicating clearly
- Setting realistic service level agreements
- Allocating resources & funding to the most suitable areas
- Measuring impact of improvement initiatives
There is no point in having an SLA and not measuring to see if you are meeting it. This can be called “driving in the dark” or not “keeping yourself honest”; either way, you do not know if you are actually meeting the SLA requirements.
Leverage an automated toolset that calculates and reports upon the Messaging and email availability as defined by your SLA. These reports should be available within your organisation and can then be used to drive improvements to the messaging services that you provide to end users. Should you find that SLAs are not being met, a conversation can now happen with the business to ask for funding or development time, or to modify the SLA. It may not be ideal, but all parties know how the environment is performing.
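The arithmetic behind such a report is simple, and a minimal Python sketch is shown here. It assumes availability is defined as the percentage of the measurement window with no outage, and that outages are already recorded in minutes; your own SLA may define the window and exclusions (such as planned maintenance) differently, and the function names are hypothetical.

```python
def availability_percent(window_minutes, outage_minutes):
    """Percentage of the measurement window during which service was up.

    window_minutes is the length of the SLA measurement window;
    outage_minutes is an iterable of individual outage durations.
    """
    if window_minutes <= 0:
        raise ValueError("measurement window must be positive")
    downtime = sum(outage_minutes)
    if downtime < 0 or downtime > window_minutes:
        raise ValueError("total downtime must fit inside the window")
    return 100.0 * (window_minutes - downtime) / window_minutes


def meets_sla(window_minutes, outage_minutes, target_percent):
    """True when measured availability reaches the agreed target."""
    return availability_percent(window_minutes, outage_minutes) >= target_percent
```

For a 30-day window (43,200 minutes), two outages of 20 and 10 minutes give roughly 99.93% availability, which meets a 99.9% target but not a 99.95% one.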
Now that we have gone through the top 10 issues, you are prepared to work to address them within your organisation, and by doing so can improve the uptime and reliability of messaging services in your environment.
You may have noticed that the majority of this article has been around the “softer” side of managing Exchange, and not just gnarly and arcane technical facts. Why is that, you may ask?
In a nutshell: technology does not exist in isolation.
You may have seen the MOF diagram that discusses people, processes and technology? Of these three areas the smallest is technology. I have personally seen customers have better uptime from well-maintained standalone systems, compared with others that have badly maintained “highly-available” clusters.
Because of this, the biggest impact can often be generated from improving the policies, processes and management practices within the messaging environment. By creating the necessary documentation and processes, you ensure everyone knows how the Exchange environment should look, how it is meant to be administered and the level of services that end users should expect!
Posted by Tristan Kington, MSPFE Editor, and three fifths of a salad.