Today I want to talk about something quite odd that I ran across at one of my customers...
They had some issues with one of their workflows not "performing" well... It took ages for the workflow to complete.
One of the most important troubleshooting steps with workflows is usually quite simple...
It's just "Wait..."
But since that was already done for a few days, we had to take a closer look to see what was going on. SharePoint workflows can be run in either the W3WP process, or they can be handed over to the OWS Timer service. Usually this transition happens when you add a workflow action that calls for a "wait" or when some events need to be picked up as a reaction to edits... So we checked the timer service and reviewed the running timer jobs. The odd thing was that the workflow timer job is running on every server on the farm for several hours already... That is very odd, since usually that job only runs for a few minutes for each run, so we suspected a hung workflow. The workflow engine provides a lot of ULS entries if it fails somewhere, but non of those showed up... A good collection of the events that are output on failure can be found here... http://blogs.technet.com/b/sharepointdevelopersupport/archive/2013/03/12/sharepoint-workflow-architecture-part-3.aspx (This also provides some very good information on how to troubleshoot workflows in general)
So we increased the ULS logging and that helped us to identify what was happening in the job. The next step was to take the data, port it to a single server, and analyze it (merge-splogfile). But since it was a busy production server the amount gathered was very high. We found that all different OWS timer jobs were constantly updating list items in the same sitecollection, but there were no obvious workflow errors in the ULS logs. So now we had at least a place were we could have a closer look.
As it turned out the sitecollection in question was already known to the customer... The sitecollection was Nintex-Workflow enabeled (Only users with a special inhouse training are allowed to use the nintex workflow components for that customer).
Now lets go a bit of topic here... The power user on that sitecollection had a small issue, he actually managed to solve...
He had a list and wanted to utilize [today] in a calculated field. But SharePoint does not support the use of dynamic data (Like [me] or [today]) in a calculated field. But we can use other fields in a calculated fields. So the user was outsmarting SharePoint by adding a field called "today" to his lists... But in order for this to work, the field had to be updated each day. So he found a workflow action that helped him out:
Armed with this knowledge, he went on and created a site workflow, which should be executed daily... Here is a picture of the workflow the user designed:
So back to our workflow issue... And why has this caused a Problem?
Well... There are 2 timeout when it comes to workflows. One of these timeouts is the Event delivery timeout... This timeout determines how long a workflow can run. It is documented here: http://technet.microsoft.com/en-us/library/cc262968(v=office.12).aspx
If you create a workflow solution that has a very long processing time to start your workflows, complete tasks, or modify workflows, you should consider increasing this value. View the ULS logs and watch the Microsoft SQL Server table ScheduledWorkItems to determine if the workflow jobs are timing out. The default folder location for the ULS log is Program Files\Common Files\Microsoft Shared\Web server extensions\12\Logs. In the ULS log file, you can use "workflow" or "workflow infrastructure" as search keywords.
There is also a 2nd timeout, that determins when a workflow has died. This timeout will kick in after 20 minutes and will reset the state of the workflow back, so it will be processed again.
Now there was a point when the server tipped over, and the processing for all those lists exceeded the 2nd timeout. Which caused the workflows to be reset. Each day there was a new workflow added to the queue... None of these workflows managed to complete, and we piled up a total of about 100 neverending workflows over time. The issue was never noticed, since the workflows managed to update "most" items, and so everything looked fine... There was only one strange result. The Workflow timer server keept increasing in duration, and took longer and longer to complete... (100 X 20 minutes = 2000 minutes, 5 Servers total in the Farm means an average of 6-7 hours per workflow timer job run per server)
This started to raise other alerts, since the normal workflows got "stuck" behind this one (Processing is done one contentDB at a time) and things like approval took to long to be processed and users started to complain.
How could this have been fixed?
Assuming the user insists on keeping his solution, the easiest way to fix it would be to use the FILTER in the workflow action in order to only retrieve items that are NOT todays date. If the workflow fails the 1st time, it will just restart and fix the items it left over from the first run, and not doing millions over millions update actions to the same records setting todays date to today...