Timer jobs missing (too many CPUs)

Ran into a funny behavior these days: checking the timer job status on a SharePoint server (totally new: new box, new OS, new SharePoint), I found the page blank.

I had never encountered this before, so I was puzzled, stunned, blocked and intrigued...

What could have gone wrong?

Believing this to be a configuration cache issue (not refreshed), I tried to clear the cache by stopping the timer service.

The timer service hung on stopping, so I killed the process. I deleted the xml files from C:\documents and settings\All users\app data ...\Config, edited the cache.ini file and set the value to 1, started the Windows SharePoint Services Timer again and waited for the xml files to reappear. Nope. No success. Stopped the service again, rebooted the box. Same result.

Now I had no xml files, no cache.ini, no timer jobs....

The interesting part was that I had ABSOLUTELY NO log entries in ULS with the source OWSTIMER.

Digging deeper, I finally realized what was going on: the owstimer service was in fact not working, not even starting, although the service was showing up as started and the process was showing up in Task Manager.

What was wrong?

After memory dump analysis, the conclusion was that the process was deadlocked. Why? Because on startup the process tries to create the heaps it needs (by default, owstimer creates 2 x <nr of CPU cores> heaps). When too many heaps are created, we log a warning message in the ULS logs to signal a potential memory leak. The default threshold for the warning message is 32. In our case the box had more than 16 cores, so the process tried to create the 33rd heap, hit the threshold, and tried to send a warning message to ULS before continuing. But at this point in the process's life the logging part is not initialized yet, so we found ourselves in a deadlock: heap management waits for ULS logging to complete, and ULS logging waits for the process to finish starting before it can be initialized. That is also why nothing from OWSTIMER ever reached the ULS logs.
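The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration of the numbers described above (2 heaps per core, warning threshold of 32); the constant and function names are made up for the example and are not SharePoint APIs.

```python
# Illustrative arithmetic only; these names are hypothetical, not SharePoint APIs.
HEAPS_PER_CORE = 2            # owstimer creates 2 x <nr of CPU cores> heaps by default
DEFAULT_WARN_THRESHOLD = 32   # default heap count that triggers the ULS warning


def heaps_at_startup(cpu_cores: int) -> int:
    """Number of heaps the timer process creates on a box with cpu_cores cores."""
    return HEAPS_PER_CORE * cpu_cores


def hits_warning(cpu_cores: int, threshold: int = DEFAULT_WARN_THRESHOLD) -> bool:
    """True when startup creates more heaps than the warning threshold allows."""
    return heaps_at_startup(cpu_cores) > threshold


if __name__ == "__main__":
    for cores in (8, 16, 24):
        print(cores, heaps_at_startup(cores), hits_warning(cores))
```

With 16 cores the process stays exactly at the threshold; anything above 16 cores crosses it during startup, which is the case that ended in the deadlock.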

To get out of this situation, here is what you can do (apart from the obvious solution of using a box with fewer CPUs):

Warning: Serious problems might occur if you modify the registry incorrectly by using Registry Editor or by using another method. These problems might require that you reinstall the operating system. Microsoft cannot guarantee that these problems can be solved. Modify the registry at your own risk.

Open Regedit and go to:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions

  • Right click on "Web Server Extensions" and click [New] - [Key]. Name the new key "HeapSettings"
  • Ensure the following key is created:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\HeapSettings

  • Right click on "HeapSettings" key and click [New] - [DWORD value] "LocalHeapWarnCount"
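  • Double click on "LocalHeapWarnCount"
  • "Edit DWORD Value" dialog will open. Enter [Value data] = double the number of CPU cores (the dialog defaults to the Hexadecimal base, so select Decimal if you type the number in decimal)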

Reboot and you are done.
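If you would rather script the registry change than click through Regedit, here is a minimal Python sketch that only builds the text of a .reg file for the key and value described above. The helper names are hypothetical, the core count is something you pass in yourself, and I am assuming you will review the generated text before importing it on the server.

```python
# Sketch only: builds .reg file text for the HeapSettings change described above.
# The helper names are hypothetical; the key and value names come from the steps.
HEAP_SETTINGS_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools"
    r"\Web Server Extensions\HeapSettings"
)


def local_heap_warn_count(cpu_cores: int) -> int:
    """Value for LocalHeapWarnCount: double the number of CPU cores."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be a positive integer")
    return 2 * cpu_cores


def build_reg_file(cpu_cores: int) -> str:
    """Return the .reg text; DWORD values are written as 8 hex digits."""
    value = local_heap_warn_count(cpu_cores)
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{HEAP_SETTINGS_KEY}]",
        f'"LocalHeapWarnCount"=dword:{value:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: a box with 24 cores needs a threshold of 48 (0x30).
    print(build_reg_file(24))
```

The script does not touch the registry itself. It just saves you from doing the decimal to hex conversion by hand, and the same warning about modifying the registry applies to importing the result.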

To prevent other issues from appearing later on: if you got as far as opening Central Admin and spotting the missing timer jobs, I would strongly recommend rebuilding the farm. During the farm provisioning process a lot of administrative jobs should have run through owstimer and did not, because the timer was not running. So there might be configurations that were not propagated across all servers, websites not provisioned, and features not installed, to name just a few.