SharePoint Performance Troubleshooting – part 2

Hello,

after publishing performanceTester in my last post Sharepoint performancetester identify your bottlenecks, lets get started to troubleshoot performance issues.

In solving performance issues i could identify different places of bottlenecks. These places are inside your IT infrastructure. One of my colleagues is always saying: “Lets paint a picture” and in performance troubleshooting it’s display us the whole world of product we need to check.

image 
SharePoint Processing Pipeline

Performance issues can happen:

  1. Client side on the user machine: e.g. Browser issues, Java Scripts, bad network connection, slow name resolution, …
  2. on Proxy Server side: proxy server busy
  3. on WebFrontEnd Server, e.g.: high CPU, high Memory, high IO network traffic, Antivirus engine, …
  4. on Application Servers slow performance, e.g. Excel Services, Conversion Services, html viewer service, …
  5. on SQL Server: many concurrent queries cause IO blocking, Deadlocks, …

Lets paint the picture of possible bottlenecks:

image
SharePoint Processing Pipeline with bottleneck places

The main problem i could identify of bad performance is mostly the time of page rendering while browsing. This means many user create incoming HTTP request to the WebFrontEnd Server (WFE). In the page rendering process are Microsoft Components involved and custom code components like CustomWebParts,CustomUserControls. Performance issues can be inside of any component and we need to identify them. But normally the code isn’t bad, it’s more an issue with: the code did not scale good enough with the used information design.

Pre-actions before start stress test: Enable Logging
For troubleshooting we have the task to go step by step from the client to the WebFrontEnd server. So we need to measure the Client page rendering. We also need to measure the time to render a page on the WFE. This is possible over IIS W3C logs (time-taken). As last step we need to measure the performance of each incoming SQL-Request from WFE to SQL server via SQL Profiler.

To measure the page rendering client side i prefer to use Fiddler and IE 8 with Developer Tools bar. These tools are good for analyzing one single page request but are useless to check it in a stress situation.

To measure the time of page rendering i will use IIS W3C logs. Therefore i will enable the IIS Logging on WFE.
image

After configuring the IIS logs, please wait for 1-2 business days for analyzing the current performance baseline.

Note: With the IIS W3C Logs we get information about the .net rendering speed which is measured under the hood of SharePoint. We focus on the time-Taken value, which is the total time the IIS server need to send the response for an incoming request.

To analyze IIS W3C logs i use logparser and queries like:

Display the 10 slowest Request
logparser.exe “select top 10 time-taken, cs-method, cs-uri-stem from ‘path to IIS logfile’ order by time-taken desc”
image 

It’s also important to know, which url has a server error –> http 500
logparser “Select date,time, cs-uri-stem from ‘ex100508.log’ where sc-status > 500”

Now we have a target on which URL we can focus. In my demo case it’s the root site of the site collection “/” or “/default.aspx”.
In normal business hours all incoming user request have a different URL. If we create only bad performance URL requests  on the same time then we can create the performance issue in a non randomly situation. So with the right URL we will create now a constant traffic with using performanceTester.

Lets create 100 sample request: 
image
or with debug output:

image

Now we know the performance value from client side. It’s measured in a stress situation. Lets measure the value from WFE IIS W3C logs of the webapplication with using logparser: (more)

image 

The Average is nice and small with 90 ms on server side to render pages. 302 ms is the average of client side receiving data. This show me in the transmission phase we lose 302-90 =  212 ms in average. But for performance troubleshooting we should always focus on the BAD situation and the average isn’t bad enough. We need to focus on the max-values. On WFE we see 41485 ms to render the page. This is very long and a real performance issue.

As next we should focus on the time when the long page rendering happen, e.g. using  logparser, e.g. get all request which a time-taken high then 40 sec. = 40000 ms:

logparser.exe “select date, time, cs-method, cs-uri-stem from ‘ex100508.log where time-taken > 40000”

During the execution of performanceTester we need to check main resources like CPU, Process, Memory, …
on WFE we see this:

Wow, cpu is under pressure from one instance of performanceTester

image 

In more deep i like to focus on perfom counters like:

  • Process / Virtual Bytes / for all Instances of w3wp.exe
  • Process / Private Bytes / for all Instances of w3wp.exe
  • Process / % Processor Time / for all instances of w3wp.exe
  • .NET CLR Memory / % Time in GC / for all instances of w3wp.exe

Note:
1. On 32 bit environment the process is limit to 2 GB of user mode virtual address space. If the process reach the limit then an outOfMemory exception occurs.
2. If the system is under pressure (e.g. in memory) the garbage collection will be the next break for the process. If the process want to create a new object then i will request a bit of memory from operation system. If the memory is not available then the Garbage collect will run until free memory is available. Normally this is good but the garbage collect is a very expense operation for the CPU. If w3wp process has “% Time in GC” higher than 15 then we should focus on coding and consuming memory.

In a debug view of performanceTester, i see client side a “random” slow performance issue, page received in 24 seconds.
image 

Know we see the real performance issue. Client side we see page rendering of 24 seconds and server side we see it’s worst, 41 seconds after a few minutes of research.

To solve performance issue we need to check the IIS recycle and reset operations as first step before looking deeper. You can see the recycle operation in IIS w3c log with a find command:

check of iis resets & recycle operation:
1. open a cmd.exe
2. run: find <iis w3c log> “#Date”

image

Unfortunately this will not show us all IIS mechanism when a recycle happens but in this case we see the issue happens 3 times in our stress test.

The rest of IIS recycle operation you can find in Event viewer of the server operation system. Focus on Internet information service related event id’s. (more, kb 332088)

Every time when the whole service is stopped and restarted the application pool get’s restarted too and this means for our SharePoint webapplication: start the recompile and rendering from zero and all caches are empty. This you can monitor too with using the task manager in comparing the process id of w3wp process or with using:

on windows server 2003: iisapp

on windows server 2008: %systemroot%\system32\inetsrv\APPCMD.EXE list WP

If the process id has changed then the process was restarted.

Normally the IIS web server and worker process should recycle the process only one time in 24 h. My personally suggestion is ones a week or never. So disabling all recycle operation will help us for performance troubleshooting too.

How to deactivate the recycle operation?

Open IIS manager, open the Application pool properties and unselect every check box on these 3 tabs:

image

image

image 

Now the performance is better because the recycle is switched off. In my demo the slowest request is around 600 ms without caching. :-)

image

In my next post i will go deeper and move from IIS into SharePoint API and object model.

[Update: MyWebcast with step by step guide to debug sharepoint performance issues (in germany but take i look into the last 20 min. you will see a detect method for bottlenecks with 4 simples commands]

regards

Patrick