The Case of the Out of Memory BizTalk Server

About a month ago, I was conducting a BizTalk Health Check for a customer (yes, Microsoft BizTalk Server is one of my other specialties along with several others) whose BizTalk server had been running slower than expected. They had noticed Out of Memory exceptions in the event logs, but since BizTalk doesn’t lose messages and restarts on failure, they hadn’t treated it as a big deal.

When a process runs out of memory, it means that it ran out of virtual memory – meaning it ran out of memory in its virtual world. If you are thinking, “What the heck is Clint talking about?”, then this blog entry is for you because I will explain it. Otherwise, you can skip the next few paragraphs.

Begin virtual memory history lesson…

How many people remember DOS? When I ask this question when teaching the Vital Signs workshop, I typically get about half the class raising their hands. How much memory can DOS address? 640K of course. If you wanted to run a game (such was my life back in those days), you had to free up enough memory to run it. This meant editing the Autoexec.bat and config.sys files – wow, I still remember those file names :-). Anyway, this behavior was the same in Windows 3.1 because it was based on DOS. When you tried to open a large Word document (or game), it would fail because it didn’t have enough memory. Users just wanted it to work no matter how slow it was, which is where virtual memory comes in.

In 32-bit Windows NT4, the operating system is built on a virtual memory architecture. Windows NT v3.5 supported it as well, but that was before my time. In any case, virtual memory is a way to fake the process into thinking it has 2GBs of memory by placing the process in a “virtual” world similar to “The Matrix” movie. Now, back then 2GBs was a *huge* amount of memory and *each* process had its own 2GBs of virtual address space. So an NT4 server with 8MBs (yes, I am talking megabytes) of memory could open up a bunch of 10MB Word documents with no problem other than being… very slow. But who cares, it worked, right? Well, the kernel still has to put this memory somewhere. Since it couldn’t fit the 10MB documents into RAM, it used the page file on the disk as overflow for physical memory, which is why the documents were so slow to open.

Technically speaking, I lied to you about the size of the virtual address space. Each process actually gets 4GBs of virtual address space – it just has to share that space with the kernel. The kernel gets 2GBs and the process gets 2GBs by default. When I first heard of this, I thought that there was a kernel for each process because that is what it sounds like. There isn’t: there is one kernel, and its 2GBs show up in every process’s address space. Think of it as a hub and spoke model where the kernel is the hub with 2GBs of memory and each process is a 2GB spoke. The 4GB virtual address space is simply the amount of memory that a 32-bit processor can address.

Now, let’s upgrade to Windows Server 2003 32-bit. Today, we have massive servers with 16GBs or more of RAM. Guess what? It’s still a 32-bit world. Each process still lives in the same 4GB virtual address space with the kernel (2GBs to each process and 2GBs to the kernel). The problem is that our servers now have far more RAM than virtual address space. This means that products like SQL Server are limited by default to the 2GBs of user-mode virtual memory regardless of how much RAM is installed on the server. Your server can still use all of that RAM, but since each process is limited to 2GBs of virtual memory, it just takes a *lot* of processes to use all of that RAM.

Quiz Time… If a 32-bit application is out of memory, is it out of virtual memory or out of physical memory? Answer: It is out of virtual memory. Remember that each process is living in a virtual world and has no idea what the real world (physical memory) looks like.

If you want a single process to be able to use all of the RAM on the server, then the process needs to have more virtual address space than the amount of RAM installed. Therefore, the answer is really to just go to 64-bit (implemented as x64), which gives each process 16TBs of virtual address space: 8TB for the process and 8TB for the kernel.
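If you like seeing the arithmetic, here is a small Python sketch of the numbers in this lesson. The function names are mine and purely illustrative, and I am assuming the 16TB x64 figure corresponds to 44 address bits (2^44 bytes), split evenly between the process and the kernel just like the 32-bit default.

```python
# Sizes in bytes.
GB = 2 ** 30
TB = 2 ** 40


def address_space_bytes(address_bits):
    """Total virtual address space reachable with the given number of address bits."""
    return 2 ** address_bits


def user_mode_bytes(address_bits):
    """Default split described in the text: half for the process, half for the kernel."""
    return address_space_bytes(address_bits) // 2


def min_processes_to_fill_ram(ram_bytes, per_process_bytes):
    """How many processes it takes to touch all of the RAM (ceiling division)."""
    return -(-ram_bytes // per_process_bytes)


total_32 = address_space_bytes(32)   # 4GB
user_32 = user_mode_bytes(32)        # 2GB
total_x64 = address_space_bytes(44)  # 16TB (assumption: 44 address bits)
user_x64 = user_mode_bytes(44)       # 8TB

# A 16GB server needs at least this many 2GB processes to use all of its RAM.
processes_needed = min_processes_to_fill_ram(16 * GB, user_32)
print(processes_needed)  # 8
```

On a 16GB 32-bit server, that works out to at least eight fully loaded processes before all of the RAM is in use, which is what I mean by a *lot* of processes.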

End Virtual Memory History Lesson

Okay, let’s get back to the mystery… we have a BizTalk Server service (technically called a BizTalk host instance) that is running out of memory. Did the server run out of physical memory or did the service/process run out of virtual address space? If you read my history lesson above, then you know that BizTalk ran out of virtual address space. Since the server is a 32-bit server, this means that BizTalk is not able to address more than 2GBs. Effectively, the BizTalk server had an aggressive memory leak, soaked up the 2GBs of virtual memory very quickly, then once it ran out of virtual memory, it crashed.

Question: What could have happened if a process is only using about 1.2GBs of virtual address space (about 60% of its virtual memory) yet ran out of memory?

Answer: An issue called heap fragmentation.

Heap memory is just memory that a process uses to store and retrieve data. If a process uses a 64K block of memory, then frees 60K of that memory in the middle of the block, you now have two 2K blocks of memory with a 60K gap in the middle. The .NET Framework allocates memory in 64K blocks of memory. This memory *must* be contiguous, therefore, the 64K memory allocation cannot fit between the two 2K blocks of memory, so the memory allocation must go higher in the memory addresses to find a contiguous block. If this pattern continued indefinitely, then you would end up with about 1.2GBs out of 2GBs of used virtual memory yet be out of virtual memory because the .NET Framework is unable to create a new 64K block of contiguous memory.
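To make the idea concrete, here is a toy Python model of the pattern I just described. It is only a sketch: the address space is a list of 1K slots, the function names are made up for illustration, and real heaps are far more complicated. It carves a small address space into 64K blocks, frees the middle 60K of each one, and then checks whether a new contiguous 64K allocation would fit.

```python
def fragment(address_space_kb, block_kb=64, keep_edge_kb=2):
    """Return a list of 1K slots (True = in use) after every block_kb block
    has had its middle freed, leaving keep_edge_kb in use at each end."""
    used = [False] * address_space_kb
    for start in range(0, address_space_kb - block_kb + 1, block_kb):
        for offset in range(keep_edge_kb):
            used[start + offset] = True                  # leading edge still in use
            used[start + block_kb - 1 - offset] = True   # trailing edge still in use
    return used


def largest_free_run_kb(used):
    """Size of the largest contiguous run of free 1K slots."""
    best = run = 0
    for slot_in_use in used:
        run = 0 if slot_in_use else run + 1
        best = max(best, run)
    return best


def can_allocate(used, size_kb):
    """A contiguous allocation only fits if some free run is at least size_kb."""
    return largest_free_run_kb(used) >= size_kb


heap = fragment(1024)            # a pretend 1MB address space
free_kb = heap.count(False)      # total free memory
print(free_kb, largest_free_run_kb(heap), can_allocate(heap, 64))
```

In this toy model almost all of the address space is free, yet no free run is larger than 60K, so the 64K request fails. That is the same situation as being “out of memory” with plenty of memory left.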

Luckily, the .NET Framework is well aware of heap fragmentation and it has a garbage collector (literally called the Garbage Collector or often called the “GC”) that defragments memory (consolidates the memory to make it contiguous) as it goes. This works great when you are using all .NET assemblies, but if you have a COM object (native code – non-.NET), then that COM object cannot be collected/defragged by the GC and could potentially fragment the heap memory aka “heap fragmentation”.

In this case, the BizTalk server ran out of virtual memory, and the likely cause was a non-.NET COM object. Once I mentioned this behavior to the BizTalk developer, he immediately had an idea of which object was likely causing the memory leak.

We know that we have a memory leak because we looked at the “\Process(*)\Private Bytes” counter. If this counter shows an increasing trend over a long period of time and the process eventually crashes with an out of [virtual] memory exception, then you are most likely looking at a memory leak.
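If you have exported the counter samples (from a performance log, for example), a quick way to judge the trend is to fit a straight line through them. The Python sketch below is just one way to do that and is not part of any Microsoft tool: the function names and the sample values are hypothetical, and it assumes the samples were taken at a fixed interval.

```python
MB = 2 ** 20
GB = 2 ** 30


def slope_per_sample(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def samples_until_limit(values, limit_bytes):
    """Rough estimate of how many more samples until the trend line reaches
    limit_bytes, or None if the counter is flat or shrinking."""
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    return (limit_bytes - values[-1]) / slope


# Hypothetical Private Bytes samples, one per hour, growing 100MB each hour.
private_bytes = [(500 + 100 * hour) * MB for hour in range(6)]

growth = slope_per_sample(private_bytes)
remaining = samples_until_limit(private_bytes, 2 * GB)
print(growth / MB, remaining)
```

With these made-up numbers the process grows about 100MB per hour and would reach the 2GB user-mode limit in roughly ten more hours. A steady positive slope over days is the pattern to worry about; a counter that climbs and then levels off is usually just a cache warming up.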

To get more information about what is leaking memory, use a tool like DebugDiag, which is a free download from Microsoft.com. Use it to attach to the leaky process once it has been running for a little while. DebugDiag will track the memory allocations, then eventually report which components are leaking memory, that is, not releasing memory that they should have released. DebugDiag was written by some guys I know in Microsoft Support in Charlotte, NC such as Jeremy Phelps.

Another tactic is to use ADPlus.vbs from the Microsoft Debugging Tools for Windows package to get a memory dump (*.dmp) of the leaky process before and after the process has consumed a large amount of memory. Next, use WinDBG to debug the dmp files. The Microsoft Debugging Tools for Windows package is a free download from Microsoft.com.

ADPlus.vbs was originally written by a good friend of mine, Robert Hensing, who is now a Windows security demi-god. Check out his blog postings at: https://blogs.technet.com/robert_hensing

Yet another tactic is to use a tool called VAView. This tool reads a process dump file (*.dmp) and visually shows memory allocations very similar to the disk layout when you do a disk defrag analysis. All you do is look for the “speckling” or plaid pattern – processes really don’t look good in plaid. VAView was written by a colleague of mine named John Allen. I don’t think he has it published anywhere on the internet, so if you would like this tool, then let me know.

Getting back to the case at hand, I confirmed that they had a memory leak in BizTalk by watching the “\Process(*)\Private Bytes” performance counter trend upwards, eventually hit the 2GB virtual memory limitation, and then drop when the process crashed. The “\Memory\Available MBytes” counter showed that the server still had plenty of RAM available, so I came up with the following recommendations:

  1. Separate the BizTalk artifacts/components into other BizTalk Hosts: If there is more than one artifact in the BizTalk host (process), then move the artifacts to different BizTalk hosts. Distributing the artifacts across different processes (each with 2GBs of virtual memory) might be enough to handle the memory consumption.
  2. Migrate to 64-bit BizTalk Server: The 64-bit virtual address space is exponentially larger than the 32-bit virtual address space. Moving to 64-bit will effectively resolve this issue because the server still has about 1GB of free RAM. Unfortunately, the BizTalk Servers were on 32-bit hardware, so this would mean new hardware.
  3. Track the Memory Consumption: We can use memory leak tools to troubleshoot this issue. A good one to use is DebugDiag which is a free tool from https://download.microsoft.com.

The following knowledge base article has a lot of great information on possible causes of memory leaks in Microsoft BizTalk Server.

  How to troubleshoot a memory leak or an out-of-memory exception in the BizTalk Server process:
  https://support.microsoft.com/kb/918643

This was a really large blog posting, but I had to cover a lot of background concepts first. I hope these blog posts help you with your own mysteries.