This post is authored by Henrik Jørgensen from Microsoft Services in Denmark.
The following is based on real-life experience with the CQM framework from a Microsoft Consulting Services project.
One of our customers reported poor experiences with Lync. Specifically, their end-users complained about audio and video problems, which fell into two scenarios:
- PC-to-PC communication, where two end-users communicated via the Lync 2010 client.
- Conferences hosted on the Lync 2010 platform, where multiple end-users participated in a conference call.
The customer is a major global player in its area. It hosts several Lync pools and has a presence in countries around the globe.
One approach to analyzing these problems is the Call Quality Methodology (CQM) framework introduced by the Microsoft Lync product group.
Our approach was to first establish a baseline, both to understand the extent of the problems and to have a benchmark to compare against after changes had been made to the Lync environment and related IT infrastructure components.
We divided the work into the following areas:
- BIOS and device drivers
- Patch level of the operating system in use
- Analysis of the data in the QoE database using the CQM SQL queries
- Definition of persona profiles for Lync usage
- Logical map of the network topology
- Bandwidth estimate calculation
- Network assessment for UC
The work involved several IT teams at the customer. A key learning was that operating a complex IT infrastructure such as Lync requires cooperation and communication between the IT teams responsible for operating and maintaining it.
The analysis work revealed several findings:
- The BIOS needed to be upgraded on some PCs
- Newer network interface card drivers had to be deployed on the clients
- The hardware requirements were met
- Newer BIOS and firmware versions were needed on the servers
The Key Health Indicator (KHI), CQM and network analyses revealed further findings, which are presented in more detail below.
Server Key Health Indicators
We used the KHI collection PowerShell script from the Networking Guide and collected data for five working days. The data was then imported into Microsoft Excel for analysis.
Among other critical findings, we observed packet loss on some of the front-end servers. This called for further analysis of the server problems. A firmware upgrade was part of the solution.
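The KHI analysis itself was done in Excel. As a rough illustration of the same idea, the Python sketch below averages each performance counter in a CSV export and flags counters whose average exceeds a threshold. The PerfMon-style CSV layout, the counter names, the server name FE01 and the threshold values are assumptions for illustration only; the counters and thresholds to actually check are the ones listed in the Networking Guide.

```python
import csv
import io
from statistics import mean

# Hypothetical thresholds for illustration only (NOT from the Networking Guide):
# counter-path suffix -> maximum acceptable average value.
THRESHOLDS = {
    "\\% Processor Time": 80.0,
    "\\Packets Received Discarded": 0.0,
}


def summarize_khi(csv_text, thresholds):
    """Return {counter column: (average, maximum, flagged)}.

    Assumes a PerfMon-style CSV export: the first column is a timestamp and
    each remaining column holds the samples of one counter. A column is
    checked against a threshold when its name ends with the threshold key.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = header[1:]  # skip the timestamp column
    samples = {name: [] for name in columns}
    for row in reader:
        for name, cell in zip(columns, row[1:]):
            try:
                samples[name].append(float(cell))
            except ValueError:
                continue  # skip blank or non-numeric cells
    summary = {}
    for column, values in samples.items():
        for suffix, limit in thresholds.items():
            if values and column.endswith(suffix):
                avg = mean(values)
                summary[column] = (avg, max(values), avg > limit)
    return summary


# Tiny hypothetical export from one front-end server ("FE01").
sample = (
    '"(PDH-CSV 4.0)","\\\\FE01\\Processor(_Total)\\% Processor Time",'
    '"\\\\FE01\\Network Interface(NIC1)\\Packets Received Discarded"\n'
    '"01/07/2013 09:00:00","35.0","0"\n'
    '"01/07/2013 09:00:15","45.0","12"\n'
)

if __name__ == "__main__":
    for counter, (avg, peak, flagged) in summarize_khi(sample, THRESHOLDS).items():
        marker = "  <-- investigate" if flagged else ""
        print(f"{counter}: avg={avg:.1f} max={peak:.1f}{marker}")
```

With five working days of samples, the same summary per server quickly points at the counters, such as discarded packets on a front-end NIC, that deserve a closer look.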
The CQM SQL queries are divided into three areas:
- Endpoints – the queries help determine whether there are problems with the client PCs or the devices connected to them.
- Server to server / gateway traffic – the queries document whether the Lync media servers are healthy, and whether conditions on the servers contribute to packet loss and jitter.
- Network – the queries document which LAN subnets the poor Lync calls originate from.
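As an illustration of the network area, the following sketch marks audio streams as poor when any metric crosses a threshold and then computes the share of poor streams per client subnet. The AudioStream record, the /24 grouping and the threshold values are assumptions: the values are those commonly associated with Lync poor-call classification, but they should be checked against the CQM queries before use. In practice the input would be queried from the QoE database rather than built in memory.

```python
import ipaddress
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AudioStream:
    # Hypothetical record shape; real data would come from the QoE database.
    client_ip: str
    packet_loss_rate: float  # fraction of packets lost, 0.0-1.0
    round_trip_ms: float
    jitter_ms: float
    degradation_avg: float   # average network MOS degradation
    concealed_ratio: float   # ratio of concealed (healed) audio samples


def is_poor(s: AudioStream) -> bool:
    # Assumed thresholds; verify against the CQM queries before relying on them.
    return (
        s.degradation_avg > 1.0
        or s.round_trip_ms > 500
        or s.packet_loss_rate > 0.1
        or s.jitter_ms > 30
        or s.concealed_ratio > 0.07
    )


def poor_ratio_by_subnet(streams, prefix_len=24):
    """Return {subnet: (poor, total, ratio)}, grouping clients by /prefix_len."""
    counts = defaultdict(lambda: [0, 0])  # subnet -> [poor, total]
    for s in streams:
        net = ipaddress.ip_network(f"{s.client_ip}/{prefix_len}", strict=False)
        entry = counts[str(net)]
        entry[1] += 1
        if is_poor(s):
            entry[0] += 1
    return {net: (poor, total, poor / total) for net, (poor, total) in counts.items()}


# Hypothetical sample data.
streams = [
    AudioStream("10.1.1.10", 0.01, 80, 5, 0.2, 0.01),
    AudioStream("10.1.1.20", 0.15, 90, 8, 1.3, 0.09),   # high loss and degradation
    AudioStream("10.2.5.7", 0.00, 620, 12, 0.4, 0.02),  # RTT above 500 ms
    AudioStream("10.2.5.9", 0.02, 60, 4, 0.1, 0.01),
    AudioStream("10.2.5.11", 0.00, 40, 3, 0.1, 0.00),
]

if __name__ == "__main__":
    ranked = sorted(poor_ratio_by_subnet(streams).items(),
                    key=lambda item: item[1][2], reverse=True)
    for subnet, (poor, total, ratio) in ranked:
        print(f"{subnet}: {poor}/{total} poor ({ratio:.0%})")
```

Ranking subnets by their poor-stream ratio, rather than by the raw count, keeps small sites with serious problems from being hidden behind large sites with many calls.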
We used the queries in the Networking Guide. A summary of the findings is provided below.
The CQM queries revealed that a majority of end-users at certain locations did not use Lync-certified devices. The customer initiated a process to:
- Provide the end-users with certified devices
- Train the end-users to use the devices
The CQM queries documented packet loss between the AVMCU and the Mediation Server, and from the Mediation Server to the gateway, at some sites. Further analysis looked at:
- Non-Lync software on the servers
- Lync prerequisites regarding antivirus exclusions
- Whether the network equipment was healthy and followed Microsoft guidelines from the Open Interoperability List
- Firmware on the servers' network interface cards
We identified several issues:
- When working remotely, the customer's employees accessed the Lync infrastructure exclusively via VPN, and split tunneling was not implemented.
- Some internal peer-to-peer traffic was relayed via the Edge servers.
- Round-trip times (RTT) above 500 ms were observed on some network connections.
- Packet loss occurred on WiFi networks.
All of the above findings called for additional analysis and work to solve the problems.
Together with the customer, we defined three persona profiles. These were defined in the bandwidth calculator.
The customer's HR department provided us with the number of users for each persona profile at each location where the customer is represented.
The calculations in the bandwidth calculator revealed:
- Possible overflow of the QoS queue allocated for Lync traffic
- Locations where more bandwidth was needed to handle the Lync traffic
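The core arithmetic behind a persona-based estimate can be sketched as follows: for each site, multiply the number of users of each persona by the fraction expected to be in a call at the busy hour and by the per-call bandwidth of the modality, then compare the sum with the capacity reserved for the Lync QoS queue. The persona names, concurrency ratios, per-call bitrates, site names and queue sizes below are hypothetical placeholders, not values from the Lync bandwidth calculator or from this customer; the calculator itself models codecs, modalities and overheads in much more detail.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical persona profile; all ratios are placeholders.
    name: str
    audio_concurrency: float  # share of users in an audio call at the busy hour
    video_concurrency: float  # share of users in a video call at the busy hour


# Assumed per-call bandwidth including overhead (placeholders, not calculator values).
AUDIO_KBPS = 100
VIDEO_KBPS = 600

PERSONAS = {
    "office": Persona("office", 0.10, 0.02),
    "mobile": Persona("mobile", 0.15, 0.05),
    "heavy": Persona("heavy", 0.25, 0.10),
}


def peak_kbps(user_counts):
    """Estimate busy-hour Lync media bandwidth for a site from {persona: users}."""
    total = 0.0
    for persona_name, users in user_counts.items():
        p = PERSONAS[persona_name]
        total += users * (p.audio_concurrency * AUDIO_KBPS
                          + p.video_concurrency * VIDEO_KBPS)
    return total


def check_site(user_counts, qos_queue_kbps):
    """Return (estimated demand in kbps, True if it exceeds the QoS queue)."""
    demand = peak_kbps(user_counts)
    return demand, demand > qos_queue_kbps


if __name__ == "__main__":
    sites = {
        # site: (users per persona, kbps reserved for the Lync QoS queue)
        "Site A": ({"office": 200, "heavy": 20}, 5000),
        "Site B": ({"office": 50}, 2000),
    }
    for site, (users, queue) in sites.items():
        demand, overflow = check_site(users, queue)
        suffix = " -> overflow" if overflow else ""
        print(f"{site}: {demand:.0f} kbps vs {queue} kbps queue{suffix}")
```

In this made-up example, Site A needs roughly 6100 kbps against a 5000 kbps queue, which is the kind of result that points either to a larger QoS allocation or to more WAN bandwidth.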
Furthermore, the customer initiated a network assessment of the WAN, which confirmed the predictions from the bandwidth calculator.
Customer-initiated actions to improve the Lync experience
The customer initiated several actions to improve the Lync experience for their end-users. In summary, these are:
- Acquiring more WAN bandwidth for given locations
- Implementation of Quality of Service on the network
- Renewal of network equipment at given locations
- Improvements to existing WiFi implementations at given sites
- Training of the end-users
Proactive use of the CQM methodology to monitor improvements
With the CQM approach, we not only helped our customer troubleshoot and fix problems with their Lync infrastructure; we also established a methodology that is now used proactively in their environment to prevent problems in Lync communications, both internally and with external parties.
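Using CQM proactively essentially means re-running the same queries periodically and comparing the results with the baseline established at the start of the project. The sketch below shows that comparison, assuming per-site poor-call percentages have already been extracted from the QoE database; the function name, the tolerance parameter, the site names and the figures are all hypothetical.

```python
def compare_to_baseline(baseline, current, tolerance=1.0):
    """Classify each site's poor-call percentage against the baseline.

    baseline, current: {site: poor-call percentage}
    tolerance: change in percentage points below which a site counts as
    "unchanged" (hypothetical parameter; choose a value that fits your data).
    """
    result = {}
    for site, now in current.items():
        before = baseline.get(site)
        if before is None:
            result[site] = "new"  # no baseline measurement for this site
        elif now <= before - tolerance:
            result[site] = "improved"
        elif now >= before + tolerance:
            result[site] = "regressed"
        else:
            result[site] = "unchanged"
    return result


# Hypothetical figures, not measurements from the project described above.
baseline = {"Site A": 12.5, "Site B": 3.0, "Site D": 2.0}
current = {"Site A": 4.0, "Site B": 3.4, "Site C": 2.0, "Site D": 5.0}

if __name__ == "__main__":
    for site, status in sorted(compare_to_baseline(baseline, current).items()):
        print(f"{site}: {status}")
```

A tolerance band avoids raising alarms over small week-to-week fluctuations, while a site that moves to "regressed" is a prompt to repeat the endpoint, server and network analysis for that location.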
A key learning is that CQM is a very good framework, but its value can be very limited if the customer's processes are not aligned with CQM and a proper Lync service mapping is not in place. Furthermore, the different IT teams at the customer need to communicate closely about the operation and maintenance of the IT infrastructure.