With the release of the Service Manager 2010 MP, I wanted to take a moment and highlight some of the best practices we have uncovered so far. The items listed below include known issues, hints, and advice. As time goes on, I will continue to update this post. If you have any comments or feedback, please use the Service Manager forums so that the resolutions are public.
The Service Manager MP provides monitoring for the Service Manager services, connectors, and workflows. It can be obtained from here.
1. Distributed Application Diagram not loading
Symptom 1: When you click on the Application Diagram view, under the Service Manager folder, nothing loads.
Resolution 1: This is a known OM issue where the distributed application diagram sometimes does not load. To make it load, click another view and then click back.
Symptom 2: Clicking the Application Diagram view may instead show only the Service Manager node (shown in the picture below).
Resolution 2 A: Discovery can take as long as two days. If fewer than 48 hours have passed since you imported the MP, let discovery keep running until they have.
Resolution 2 B: Ensure that Agent Proxy is enabled for the management servers that do not appear in the diagram (this may be some or all of them).
Resolution 2 C: Restart the Health Service on the management servers that do not appear in the diagram. Note that this may affect any SM workflows running on that management server and will briefly disrupt monitoring. After the Health Service restarts, discovery needs another full cycle (a few hours) to run again.
2. Service Manager and Operations Manager Share the Health Service
The Health Service is used both by the Agent role of Operations Manager and by the Management Server role of Service Manager. Since Service Manager uses the 2007 R2 version of the Operations Manager Health Service, you must use a 2007 R2 agent to monitor Service Manager 2010. Also note that the Service Manager 2010 MP cannot monitor the health of the Health Service, since that would require the Health Service to monitor itself. Keep a close eye on the alerts that Operations Manager raises about its agents on Service Manager management servers (both SM and DW): the root cause of such a problem may lie in either Service Manager or Operations Manager, and both will most likely be affected.
Example: A faulty Service Manager workflow that crashes the Health Service would cause the Operations Manager MP to raise a “Health Service Heartbeat Failure” alert.
3. SM Management Groups Registered to a DW Management Group
This management pack can detect which Service Manager management groups are registered with the Data Warehouse management group (shown in the picture below). However, for this to happen, the DW management server must first run a full ETL for the SM management group; after that, discovery detects the SM management group.
4. Multiple Duplicate Alerts for Workflows, Connectors, and Run As accounts
Symptom: The information about workflows, connectors, and Run As accounts is stored in the CMDB of the management group. Since we do not represent the Service Manager database as a CI, we chose to attribute the alerts about workflows, connectors, and Run As accounts to the management servers. This means that every management server in a management group alerts on an issue with any of these components, resulting in duplicate alerts: one per management server in the management group.
Resolution: Disable the monitors that cause duplicate alerts on the management servers which are not running workflows. Although you could leave the monitors enabled on any one of the management servers, we recommend keeping them enabled on the server that runs the workflows, since this maintains a relationship between the server running the workflows and the server monitoring them.
You can see which management server is running workflows in the State Diagram view.
On the management servers not running workflows, disable the following health service, connector, and workflow monitors:
(Picture: Server Running Workflows)
(Picture: Server Not Running Workflows)
5. Repeat Count for Workflow and Connector Alerts
Workflow and connector alerts are generated by rules in Service Manager, not by monitors. This has a few implications that I would like to explain.
Connector alerts show up as one alert per connector (per actual connector, not per connector type). The repeat count will increase by one every time the connector failure is detected (which happens every time the connector runs and fails). If the connector starts to run successfully, the alert will not auto-resolve (since it was generated by a rule). If the alert is manually resolved, the next time that a failure is detected, a new alert with 0 as the repeat count will be created.
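To make these rules concrete, here is a toy Python model of the connector alert behavior described above. It is a sketch, not Operations Manager or Service Manager code: the class and method names are hypothetical, and it only encodes the statements in the previous paragraph (one alert per connector, +1 per failure, no auto-resolve on success, a fresh alert at 0 after a manual resolve).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Alert:
    """A rule-generated alert: it carries a repeat count and never auto-resolves."""
    connector: str
    repeat_count: int = 0
    resolved: bool = False


@dataclass
class ConnectorAlertModel:
    """Hypothetical model of connector alerts (not Operations Manager code)."""
    # At most one open alert per actual connector, keyed by connector name.
    open_alerts: Dict[str, Alert] = field(default_factory=dict)

    def connector_ran(self, connector: str, succeeded: bool) -> Optional[Alert]:
        alert = self.open_alerts.get(connector)
        if succeeded:
            # A successful run leaves any open alert untouched (rules do not auto-resolve).
            return alert
        if alert is None:
            # First failure, or first failure after a manual resolve: new alert at 0.
            alert = Alert(connector)
            self.open_alerts[connector] = alert
        else:
            # Every further failed run bumps the repeat count by one.
            alert.repeat_count += 1
        return alert

    def resolve_manually(self, connector: str) -> None:
        alert = self.open_alerts.pop(connector, None)
        if alert is not None:
            alert.resolved = True


if __name__ == "__main__":
    model = ConnectorAlertModel()
    model.connector_ran("AD connector", succeeded=False)   # hypothetical connector name
    model.connector_ran("AD connector", succeeded=False)
    alert = model.connector_ran("AD connector", succeeded=True)
    print(alert.repeat_count, alert.resolved)  # 1 False
```

The success branch deliberately does nothing to the alert, which is the practical consequence of the alert coming from a rule: someone has to resolve it by hand.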
For workflows, the behavior is a little different. Workflow alerts show up as one alert per type of workflow (for example, “Software Deployment Activity”). The repeat count increases by one for every failed workflow instance, so the failed workflows view in Service Manager should show one more failed workflow than the repeat count (remember that the repeat count starts at 0). Fixing the underlying issue does not make the alert go away: you need to use the failed workflows view to either retry or ignore the failed workflows. Once the failed workflows view in Service Manager is clear, the alert auto-resolves. If you remove only some of the failures, the repeat count does not drop unless you remove them all. If you manually close the alert, it is opened again with a repeat count equal to the current number of failed workflows.
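The workflow behavior can be sketched the same way. This is again a hypothetical model rather than product code; it assumes each workflow type is tracked independently (so clearing the failures of one type resolves that type's alert), and it follows the manual-close rule exactly as stated above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorkflowAlert:
    workflow_type: str
    repeat_count: int = 0


@dataclass
class WorkflowAlertModel:
    """Hypothetical model of per-workflow-type alerts (not Service Manager code)."""
    failed: Dict[str, List[str]] = field(default_factory=dict)  # type -> failed instance ids
    alerts: Dict[str, WorkflowAlert] = field(default_factory=dict)

    def workflow_failed(self, workflow_type: str, instance_id: str) -> WorkflowAlert:
        self.failed.setdefault(workflow_type, []).append(instance_id)
        alert = self.alerts.get(workflow_type)
        if alert is None:
            alert = WorkflowAlert(workflow_type)  # first failure: repeat count 0
            self.alerts[workflow_type] = alert
        else:
            alert.repeat_count += 1               # one increment per failed instance
        return alert

    def retry_or_ignore(self, workflow_type: str, instance_id: str) -> None:
        """Clear one failure from the failed workflows view."""
        instances = self.failed.get(workflow_type, [])
        if instance_id in instances:
            instances.remove(instance_id)
        if not instances:
            # Nothing left for this type: the alert auto-resolves.
            self.failed.pop(workflow_type, None)
            self.alerts.pop(workflow_type, None)
        # Otherwise the repeat count is left as it was.

    def close_manually(self, workflow_type: str) -> Optional[WorkflowAlert]:
        """Closing while failures remain reopens the alert, as described above."""
        self.alerts.pop(workflow_type, None)
        remaining = self.failed.get(workflow_type, [])
        if remaining:
            alert = WorkflowAlert(workflow_type, repeat_count=len(remaining))
            self.alerts[workflow_type] = alert
            return alert
        return None


if __name__ == "__main__":
    model = WorkflowAlertModel()
    wf = "Software Deployment Activity"
    for run_id in ("run-1", "run-2", "run-3"):   # hypothetical instance ids
        model.workflow_failed(wf, run_id)
    print(model.alerts[wf].repeat_count)  # 2: three failures, count starts at 0
    model.retry_or_ignore(wf, "run-1")
    print(model.alerts[wf].repeat_count)  # still 2
    for run_id in ("run-2", "run-3"):
        model.retry_or_ignore(wf, run_id)
    print(wf in model.alerts)  # False: alert auto-resolved
```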
6. Run As Account Alerts
The Run As accounts that are monitored by the Service Manager MP reside in the Service Manager database. Periodically, the Health Service checks whether it can log on as each Run As account and, if it cannot, writes event 7000 to the Operations Manager event log (which Service Manager also uses). The Service Manager MP then detects this event and creates an alert.
If you have deleted Run As accounts, or are not sure whether you have, continue reading this section. If you have not deleted any Run As accounts, skip to the “Best Practice to Avoid this Issue” section below.
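If you want to confirm the events behind such an alert directly on a management server, the built-in Windows wevtutil tool can query the Operations Manager event log. The sketch below only builds and runs that query; the event count of 20 and the helper name build_event_query are illustrative choices, and it assumes you run it locally on a Service Manager management server with Python installed.

```python
import subprocess
import sys
from typing import List


def build_event_query(event_id: int = 7000, max_events: int = 20) -> List[str]:
    """Build a wevtutil command that lists the newest events with the given ID
    from the "Operations Manager" event log, rendered as text."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", "Operations Manager",
        f"/q:{xpath}",
        f"/c:{max_events}",  # maximum number of events to return
        "/rd:true",          # reverse direction: newest events first
        "/f:text",           # human-readable output instead of XML
    ]


if __name__ == "__main__":
    cmd = build_event_query()
    if sys.platform != "win32":
        # wevtutil only exists on Windows; just show the command elsewhere.
        print("Run this on a Service Manager management server:", cmd)
    else:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
```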
I Have Previously Deleted Run As Accounts from the UI: If you have deleted Run As accounts from the UI, the symptom is an alert telling you that a Run As account is invalid, while the account named in the alert is not shown in the Run As Accounts view in the Service Manager console.
You can either ignore the alert (if you close it, it will come right back) or disable the monitor. We are currently looking into how we can help you get out of this state and hope to have a solution for SP1. I will update this post once we have a definitive plan.
Best Practice to Avoid this Issue: The best way to avoid this issue is to never delete Run As accounts from the UI. You can reuse existing Run As accounts by changing their name and/or credentials. If you would like to stop using a Run As account, you can change its credentials to Local System and change its name to something easy to remember, such as “Inactive.”
This way, you will not end up with stale Run As accounts which cause events to be placed in the Operations Manager event log.
7. Integration with Service Manager
Like other management packs, the Service Manager management pack (for Operations Manager) can be imported into Service Manager. This can be done either through the Operations Manager CI connector (in Service Manager) or manually by importing the Microsoft System Center Service Manager Library MP.
Importing the MP gives you the following in Service Manager:
Service CI and Service Map: This is the representation of the Distributed Application Diagram in Service Manager. It is obtained from importing the library management pack along with importing the Configuration Items from Operations Manager.
Search Filter by Service Manager Class: This comes in handy when filtering a large number of Configuration Items.
Configuration Item View: This presents property information about Service Manager management servers without your having to look in Operations Manager.
To create the view above:
1. Navigate to the Configuration Items Workspace
2. Right-click “Computers”
3. Select “Create View”
4. Enter a Name and Description.
5. Under “Search for objects of a specific class:”, browse for SCSM Management Server in the “All basic classes” list.
Note that this view only shows the Service Manager management servers (it does not include the Data Warehouse management servers). To create a second view for Data Warehouse management servers, target the DW Management Server class. If you would like both DW and SM management servers to show up in the same view, target the Management Server class from the SM library (the OM MP also contains a Management Server class). The drawback of targeting the Management Server class is that it does not have all of the properties that the more specific classes have.
6. Select the display columns. Note that the Asset Status, Name, and Notes columns will not be populated.
Service Manager Incidents View: If you would like a view in Service Manager that shows all of the incidents which affect the Service Manager service, the procedure below is one way to do it. It assumes that every alert raised by the Service Manager MP affects Service Manager, which is a fair assumption. However, it misses other problems that affect the Service Manager service, such as SQL Server being down. To pick up alerts which affect the Service Manager service but were not raised by the Service Manager MP, you can add further rules to the Operations Manager Alert connector (for example, a rule for alerts from ServerA, if you know that ServerA is the SQL server hosting the Service Manager database). Even so, you may end up with alerts which affect ServerA but do not affect the Service Manager service.
1. Import the CIs using the Operations Manager CI connector.
2. Create an Incident template which adds the Service Manager Service to the Affected Service field in an Incident form.
3. Create a rule in the Operations Manager Alert connector which looks for alerts raised by the Microsoft.SystemCenter.ServiceManager.Monitoring MP and applies the template created in step 2.
4. Create an incident view which is targeted at the incident class and uses “About Configuration Item Display name Equals Service Manager”
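The filtering trade-off described above can be sketched as a small predicate. This is a hypothetical model of the connector rules, not the Alert connector's actual rule engine: the Alert fields and the optional sql_server parameter are assumptions for illustration, and only the MP name and ServerA come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

SM_MONITORING_MP = "Microsoft.SystemCenter.ServiceManager.Monitoring"


@dataclass(frozen=True)
class Alert:
    name: str
    management_pack: str  # MP whose rule or monitor raised the alert
    source_computer: str  # computer the alert is about


def affected_service(alert: Alert, sql_server: Optional[str] = None) -> Optional[str]:
    """Return "Service Manager" if the alert would get the incident template
    that fills in Affected Service, otherwise None."""
    if alert.management_pack == SM_MONITORING_MP:
        # Rule from step 3: everything the Service Manager MP raises.
        return "Service Manager"
    if sql_server is not None and alert.source_computer.lower() == sql_server.lower():
        # Optional extra rule: catches SQL outages, but also over-matches
        # alerts on ServerA that have nothing to do with Service Manager.
        return "Service Manager"
    return None


if __name__ == "__main__":
    disk = Alert("Logical disk free space is low", "SomeOther.MP", "ServerA")  # hypothetical alert
    print(affected_service(disk))                         # None
    print(affected_service(disk, sql_server="ServerA"))   # Service Manager
```

The second call shows the over-matching the paragraph warns about: a disk alert on ServerA is tagged even if it never affects Service Manager.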
8. Health Rollup for Service Manager and for Management Groups
By default, we set the health rollup for Service Manager (the application) and for management groups to 50%. We made it a percentage so that it would be easy to override (by changing only the percentage value instead of creating other overrides). The idea is that each organization has a slightly different environment and will therefore want a different percentage. You can get the behaviors listed below by setting the percentages to the respective values.
The image above shows the node which controls the health rollup of Service Manager as a whole. It also shows, on the right side, how to change the percentage. If you would like to:
- Let me know if one management group is unhealthy (organizations most likely want this if all management groups are production management groups) – Set the percentage to 99% (or any percentage larger than (n-1)/n, where n is the number of management groups).
- Let me know if all management groups are unhealthy (organizations most likely want this if they do not care about the health of Service Manager as a whole and instead focus on the health of individual management groups) – Set the percentage to 1% (or any percentage smaller than 1/n, where n is the number of management groups).
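To reason about these thresholds, it helps to use a simplified model of percentage rollup: the parent is healthy only while at least the configured percentage of its members are healthy. This is an approximation for illustration (the function name is hypothetical, and Operations Manager's actual rollup policy also deals with warning states and rounding, which this sketch ignores), but it reproduces the two rules above.

```python
from typing import Sequence


def rollup_is_healthy(member_healthy: Sequence[bool], percentage: int) -> bool:
    """Simplified percentage rollup: healthy only if at least `percentage`% of members are healthy."""
    healthy = sum(1 for ok in member_healthy if ok)
    # Integer form of healthy / n >= percentage / 100, avoiding floating point.
    return healthy * 100 >= percentage * len(member_healthy)


if __name__ == "__main__":
    # Three management groups (n = 3); True means the group is healthy.
    print(rollup_is_healthy([True, True, False], 99))   # False: one unhealthy group is flagged
    print(rollup_is_healthy([True, False, False], 1))   # True: at least one group is still healthy
    print(rollup_is_healthy([False, False, False], 1))  # False: all groups are unhealthy
    print(rollup_is_healthy([True, True, False], 50))   # True: the default 50% tolerates one of three
```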
The image above shows the node which controls the health rollup of a management group as a whole. It also shows, on the right side, how to change the percentage. If you would like to:
- Let me know if one management server is unhealthy (organizations most likely want this if each management server is hosting console connections individually) – Set the percentage to 99% (or any percentage larger than (m-1)/m, where m is the number of management servers in this particular management group).
- Let me know if X% of management servers are unhealthy (organizations most likely want this if they are using a network load balancer to host console connections) – Set the percentage to a percentage which suits your environment.
Note that in the latter case, even if you are using an NLB, workflows run on only one of the management servers, so the management group may show as healthy even though the workflows in that management group are unhealthy. We are looking at how to avoid this situation in the next release of the MP.
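For the NLB case, the same simplified model can suggest which percentage to pick. Assuming the parent is healthy only while at least the configured percentage of members are healthy, the hypothetical helper below returns the smallest whole percentage at which k unhealthy servers out of m turn the management group unhealthy, while k-1 unhealthy servers still leave it healthy.

```python
def min_percentage_to_flag(total: int, unhealthy: int) -> int:
    """Smallest whole percentage at which `unhealthy` of `total` members turn the
    parent unhealthy, under the simplified rule:
    parent healthy  <=>  healthy * 100 >= percentage * total."""
    if not 1 <= unhealthy <= total <= 100:
        # With more than 100 members, whole percentages may no longer separate k-1 from k.
        raise ValueError("expected 1 <= unhealthy <= total <= 100")
    healthy = total - unhealthy
    # Unhealthy  <=>  percentage > 100 * healthy / total; take the next whole number.
    return (100 * healthy) // total + 1


if __name__ == "__main__":
    print(min_percentage_to_flag(4, 1))  # 76: flag as soon as one of four servers is unhealthy
    print(min_percentage_to_flag(4, 2))  # 51: tolerate one unhealthy server, flag at two
    print(min_percentage_to_flag(3, 3))  # 1: flag only when every member is unhealthy
```

As the note above says, this only reflects server health; it cannot tell you whether the one server running workflows is the unhealthy one.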
9. “WF Workflow” in Warning State Right After MP Import
You may find that the WF Workflow node is in the warning state right after you import the Service Manager 2010 MP. The knowledge for this monitor says that an alert is expected and that the alert should contain information to help you identify the workflow. If you look for the alert and do not find one, the rule which determines the state of the node (which rolls up to the management server) has found workflows that failed before the Service Manager MP was imported. We decided not to alert on workflows that failed before the MP was imported, since this could trigger tens if not hundreds of alerts, depending on whether the failed workflows had been taken care of. When new workflow failures are detected, a proper alert is generated for each workflow failure (the repeat count increases if the same type of workflow failure is detected again).
To turn the state monitor green, open the Service Manager console, navigate to the Administration pane, expand the Workflows node, and select the Status view. Use this view to look for failed workflows by going through each workflow and checking the “Need Attention” tab. For each failed workflow, either retry it so that it succeeds or click Ignore. After this is done, the state monitor becomes green. If this does not resolve your issue, please let us know in the Service Manager forums.
I hope that you found this information useful. Once again, any feedback about the OM MP is very welcome in the Service Manager forums.