You may encounter an issue where the UM service is not starting. When checking the event log you will see the following events:
Log Name: Application
Source: MSExchange Unified Messaging
Event ID: 1038
Task Category: UMService
The Microsoft Exchange Unified Messaging service was unable to start. More information: "Microsoft.Exchange.UM.UMService.UMServiceException: The worker process didn't start in the allotted time.
Log Name: Application
Source: MSExchange Unified Messaging
Event ID: 1430
Task Category: UMCore
The Unified Messaging server shut down process umservice (PID=4048) because a fatal error occurred.
This type of service startup issue can happen after performing one of these actions:
1. Creating new UM dialplan, IP gateway, or hunt group objects.
2. Installing a new UM language pack.
3. Running the ExchUCUtil.ps1 script to integrate UM with Lync (Skype for Business) Server.
Let’s look at what happens in the background when the UM service starts. When you start the UM service, two processes need to start: UMService.exe and UMWorkerProcess.exe. UMService.exe is the watcher process for UMWorkerProcess.exe. UMWorkerProcess.exe is the primary process and performs all UM functions. When the service starts, it loads the major UM configuration objects into memory. The primary function of UM is to accept incoming calls for voicemail. When a call comes in, UM has to determine:
1. Is this call from a valid UM IP gateway?
2. Which hunt group is a match?
3. What is the dialplan that corresponds to the hunt group?
If UM does not have all the major configuration objects in memory, it has to query Active Directory (AD) for each call. Querying AD and retrieving all the information is not instantaneous and may cause a noticeable delay, depending on the network and the location of the Global Catalog. As a result, the “real time experience” of a call is degraded when someone calls the UM server. To provide a seamless real-time experience, most of the frequently queried objects are loaded into memory when the UM service starts, so that UM can answer a new call instantly. In addition, UM loads the compiled GAL grammar file, so that when a caller dials in to subscriber access (SA) or an auto attendant (AA) and searches the organization’s entire GAL, the GAL is already loaded.
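The three lookups above can be pictured as dictionary lookups against a cache that is built once at startup. The following is a minimal Python sketch of that idea only, not UM's actual implementation. The class names, the field names, and the choice to key hunt groups by gateway address and pilot identifier are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuntGroup:
    # Hypothetical, simplified model of a hunt group object.
    name: str
    gateway_address: str
    pilot_identifier: str
    dialplan: str


class ConfigCache:
    """Illustrative in-memory cache, populated once at service startup."""

    def __init__(self, gateways, hunt_groups):
        # Loading everything up front is what makes startup slower
        # and per-call lookups fast.
        self._gateways = set(gateways)
        self._hunt_groups = {
            (hg.gateway_address, hg.pilot_identifier): hg for hg in hunt_groups
        }

    def route_call(self, source_address, pilot_identifier):
        """Return the dialplan name for a call, or None if nothing matches."""
        # 1. Is this call from a valid UM IP gateway?
        if source_address not in self._gateways:
            return None
        # 2. Which hunt group is a match?
        hunt_group = self._hunt_groups.get((source_address, pilot_identifier))
        if hunt_group is None:
            return None
        # 3. What is the dialplan that corresponds to the hunt group?
        return hunt_group.dialplan


cache = ConfigCache(
    gateways=["10.0.0.5"],
    hunt_groups=[HuntGroup("HG-Seattle", "10.0.0.5", "5000", "DP-Seattle")],
)
print(cache.route_call("10.0.0.5", "5000"))
```

Every call is answered from memory without a directory query, which is the trade-off the design makes: more work at startup in exchange for no per-call AD latency.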
The startup problem arises when there are a large number of configuration objects. By default, UMWorkerProcess has approximately 240 seconds in total to start. During this time it has to check certificates, allocate memory, and load the UM configuration objects before the service can start. If there are a large number of dialplans, hunt groups, and IP gateways, startup takes a long time, and in some cases the available time is not enough to complete all the prerequisites. In that case, service startup fails with a timeout. A similar problem can occur if an organization has a large GAL (over 100k users) and multiple language packs, because the GAL has to be loaded for each language pack. Note that loading the GAL for several language packs does not usually cause a startup problem on its own, since that process is fairly fast. It is the combination of multiple GALs with a large number of dialplans (and/or other configuration objects) that causes the issue.
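To make the timeout reasoning concrete, here is a rough Python sketch of a startup budget check. Only the 240-second figure comes from the text above. The linear cost model, the function names, and every per-object timing in the example are hypothetical placeholders, so real values would have to be measured in your own environment.

```python
# Approximate default startup window mentioned in the text above.
STARTUP_BUDGET_SECONDS = 240


def estimate_startup_seconds(object_counts, seconds_per_object, fixed_seconds=0.0):
    """Estimate startup time with a simple linear model (an assumption).

    object_counts:      e.g. {"dialplan": 150, "hunt_group": 300}
    seconds_per_object: measured or guessed load cost for each object type
    fixed_seconds:      work that does not scale with object count,
                        such as certificate checks and memory allocation
    """
    total = fixed_seconds
    for kind, count in object_counts.items():
        total += count * seconds_per_object[kind]
    return total


def fits_startup_budget(estimated_seconds, budget_seconds=STARTUP_BUDGET_SECONDS):
    """Return True if the estimate is below the startup window."""
    return estimated_seconds < budget_seconds


# Placeholder costs, purely illustrative and not measured values.
costs = {"dialplan": 1.0, "hunt_group": 0.5}
small = estimate_startup_seconds({"dialplan": 100, "hunt_group": 200}, costs, 30.0)
large = estimate_startup_seconds({"dialplan": 150, "hunt_group": 200}, costs, 30.0)
print(small, fits_startup_budget(small))
print(large, fits_startup_budget(large))
```

The point of the sketch is that the load time grows with every object type at once, so a moderate increase in several counts can push the total past the window even when no single count looks excessive.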
A similar situation may arise when you integrate your UM environment with Lync/Skype for Business. In that case, you run the script that creates UM objects corresponding to each pool. If there are many pools in the Lync environment, you end up with a large number of configuration objects on the UM side.
Is there a hard-coded limit on the number of objects that the UM service can load in memory during startup?
Unfortunately, there is no hard-coded limit. The configuration of each object also affects what gets loaded into memory. For example, a dialplan with multiple subscriber access (SA) numbers takes more time to load than a dialplan with a single SA number. That is why we cannot say that the UM service will fail after loading a specific number of dialplans. The hardware specification of the machine where the UM server is installed also plays a role. From the cases that we have seen in support, organizations generally start to experience this issue when they have 150+ dialplans and two or three times as many hunt group objects. This number can change dramatically when an organization with 100k+ users has multiple language packs. In that scenario, organizations have experienced this issue with 70+ dialplans and 5 or 6 language packs.
How do you fix this issue?
For Exchange 2010, make sure the UM server is installed on a standalone machine with more than adequate memory and CPU.
The same goes for Exchange 2013 and 2016. There, the issue is seen on the Mailbox server running the UM service, so make sure the Mailbox server has more than adequate memory and CPU.
If adding memory and CPU does not resolve the issue, you will need to reconsider the overall UM design and start reducing the number of dialplans. Check the overall usage of each dialplan. Typically, a single dialplan can serve an entire organization operating in a single geographical location with a single dial code, for example, a US-based company operating in multiple US states. For global organizations, a single dialplan can serve a country or region, since most countries and regions have a single numbering plan (meaning the number of digits in a DID number). The only time you need to create multiple dialplans is when the “number of digits” in the extension differs and the “dialcode” also differs. These are unique properties of a dialplan and cannot be changed.
If you have two dialplans with the same values for “number of digits in extension” and “country code”, they can easily be combined into one dialplan. Sometimes UM admins wind up creating an excessive number of dialplans for administrative purposes. Note that this is not the intent of the dialplan object. Having fewer dialplans decreases administrative overhead and reduces the chance of a call being misrouted.
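As a starting point for such a review, the consolidation rule above can be sketched in Python: group dialplans by their number of digits in the extension and their country code, and report any group with more than one member. This is only an illustrative sketch. The dictionary keys and the sample dialplan names are hypothetical, and the input is assumed to be an inventory you have already exported from your own environment.

```python
from collections import defaultdict


def find_merge_candidates(dialplans):
    """Group dialplans that share extension length and country code.

    dialplans: iterable of dicts with the hypothetical keys
               "name", "digits_in_extension", and "country_code".
    Returns a dict mapping (digits, country_code) to a sorted list of
    dialplan names, including only groups with more than one dialplan.
    """
    groups = defaultdict(list)
    for plan in dialplans:
        key = (plan["digits_in_extension"], plan["country_code"])
        groups[key].append(plan["name"])
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical inventory, for illustration only.
inventory = [
    {"name": "DP-Seattle", "digits_in_extension": 5, "country_code": "1"},
    {"name": "DP-Chicago", "digits_in_extension": 5, "country_code": "1"},
    {"name": "DP-London", "digits_in_extension": 4, "country_code": "44"},
]
candidates = find_merge_candidates(inventory)
for key, names in candidates.items():
    print(key, names)
```

A group reported here is only a candidate. Whether the dialplans can really be merged still depends on how each one is used, so treat the output as a list to review rather than a list to act on automatically.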
Is this a bug? Are there any plans to fix this?
This is not considered a bug. To provide a real-time experience with UM, Microsoft developed a design that retrieves configuration details in a timely manner, which means loading the objects into the cache during service startup. Since this design is intentional, it is not considered a bug.
An efficient UM topology, consisting of dialplans that align with PBX systems as well as geographical locations, will result in a more manageable number of dialplans and hunt groups. Remember that a smaller number of dialplans is easier to manage and helps avoid the service startup issue.