When customers call us with issues, in particular application or program failures, one of the first questions that we ask is, “What changed in the environment?” More often than not, the answer is, “Nothing”. In some cases that may be true; however, in the majority of cases there has been some change of which the system administrator we are working with is unaware.
Tim Newton discussed some aspects of program crashes in his recent post, Access Violation? How dare you …, but let’s go ahead and recap some of them. The most common cause of an application crash is a program trying to read or write memory that is not allocated for reading or writing by the application: a general protection fault. Some other causes are listed below:
- Attempting to execute privileged or invalid instructions
- Unforeseen circumstances or poorly written code that cause the program to enter an endless loop
- Attempting to perform I/O operations on hardware devices that the application does not have permission to access
- Passing invalid arguments to system calls
- Attempting to access other system resources that the application does not have permission to access
At this point, let’s digress a little bit and introduce a couple of quirky terms that we use to discuss “bugs”.
Heisenbug: The Heisenbug takes its name from the Heisenberg Uncertainty Principle. A Heisenbug is a bug that disappears or alters its characteristics when it is observed. The most common example of a Heisenbug is being unable to reproduce a problem when running a program in debug mode. In debug mode, memory is often cleaned before the program starts, and variables may be forced onto stack locations instead of being kept in registers. Another reason that you may see a Heisenbug in debug mode is that debuggers commonly provide watches or other user interfaces that cause code (such as property accessors) to be executed, which in turn may alter the state of the program.
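To make the property accessor case concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `PrintQueue` class, its `jobs` property and the `run` helper are invented for illustration), and the `observed` flag merely stands in for a debugger watch evaluating the property. It is only meant to show how the act of looking at a value can hide a bug.

```python
class PrintQueue:
    """Hypothetical class, invented purely for illustration."""

    def __init__(self):
        # BUG: the backing list is never created on the normal code path.
        self._jobs = None

    @property
    def jobs(self):
        # Side effect: reading the property lazily creates the list.
        if self._jobs is None:
            self._jobs = []
        return self._jobs

    def submit(self, job):
        # Uses the private field directly, so this fails unless something
        # has already read the `jobs` property.
        self._jobs.append(job)


def run(observed):
    """Submit one job; `observed` simulates a debugger watch on `queue.jobs`."""
    queue = PrintQueue()
    if observed:
        _ = queue.jobs  # the "watch window" evaluates the property
    try:
        queue.submit("report.pdf")
        return "ok"
    except AttributeError:
        return "crash"


print(run(observed=False))  # normal run
print(run(observed=True))   # run "under the debugger"
```

In this sketch the unobserved run fails while the observed run succeeds, which is exactly the kind of behavior that makes such a bug hard to reproduce in a debugging session.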
Bohrbug: The Bohrbug takes its name from the Bohr Atomic Model. A Bohrbug is a bug that manifests reliably under a well-defined (but possibly unknown) set of conditions. Thus, in contrast with Heisenbugs, a Bohrbug does not disappear or alter its characteristics when it is investigated. Bohrbugs include the easiest bugs to fix (where the nature of the problem is obvious), but also bugs that are hard to find and fix and that remain in the software during the operational phase.
Most of the application issues that we deal with are Bohrbugs, although we often encounter Heisenbugs when dealing with applications that exhibit heap corruption. In some cases, simply enabling PageHeap on an application makes the problem stop occurring.
OK, getting back to our original discussion, let’s take a look at a couple of common scenarios:
Scenario One: The Spooler service has started crashing on a print cluster that has been online “since forever” (yes, that’s actually how some administrators may describe their problem to us!). It ran without problems until today, and no changes have been made. From the administrator’s perspective nothing has changed in the environment. By this, the administrator usually means that the drivers are still the same, and there have been no recent updates to the OS. However, there are some variables to consider:
- The problem may be caused by a specific driver that has an inherent bug related to the number of print devices using it. The issue suddenly begins to manifest once the number of print devices and / or users grows beyond a critical point
- A bug related to an input data pattern may be triggered because a new application elsewhere in the environment is passing data to the driver that the driver is unable to interpret
- The Spool folder hasn’t been excluded from Real-time Antivirus Scanning (or was excluded previously but for some reason is no longer excluded). A recent Pattern or Engine update may be causing corruption of spooled data
- There may be an inherent bug in the printer driver that is related to the size of the print job that it can accept
- There may be a printer driver for a Network Printer that does not handle network issues gracefully. A network issue may be triggering some fault within the driver
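The first item in the list, a driver bug tied to the number of print devices, can be illustrated with a toy model. The Python sketch below is not how any real printer driver works; `ToyDriver`, `MAX_PORTS` and the limit of 64 are all assumptions made up for the example. It only shows how code that has not changed can start failing once growth crosses a fixed internal limit.

```python
MAX_PORTS = 64  # hypothetical internal limit, invented for this example


class ToyDriver:
    """Toy stand-in for a driver with a fixed-size internal table."""

    def __init__(self):
        self._ports = [None] * MAX_PORTS
        self._count = 0

    def add_device(self, name):
        # BUG: no bounds check, so the (MAX_PORTS + 1)-th device fails.
        self._ports[self._count] = name
        self._count += 1


def first_failure(device_count):
    """Return the 1-based number of the device that triggers the bug, or None."""
    driver = ToyDriver()
    for i in range(device_count):
        try:
            driver.add_device(f"printer-{i}")
        except IndexError:
            return i + 1
    return None


print(first_failure(64))  # the environment as it has "always" been
print(first_failure(65))  # one more print device is added
```

The driver code is identical in both calls. Only the number of devices differs, so from the administrator's point of view "nothing changed" even though the failing condition was reached.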
As you can see, from the Print Server administrator’s perspective, nothing in fact has changed. However, subtle changes in related system or external conditions are causing a problem. With that, let’s take a look at our second scenario …
Scenario Two: The server is experiencing a hang. It has been running fine since the day it was brought online, and all of a sudden the server is experiencing issues. The last server maintenance was performed a couple of months ago, but beginning yesterday morning, the server keeps locking up. So what’s going on?
In many enterprises, IT departments are somewhat autonomous. A single server may have components that are managed by several different teams. For example, the Security team manages Antivirus and Anti-Spyware software, while the Storage team is responsible for the SAN environment, Host Bus Adapters (HBAs) and related firmware. Meanwhile, the Windows team is responsible for the Server Operating System, including the overall system configuration and performance. With this type of division of ownership, it can be difficult for all the teams to stay in sync. This is not an indictment of any of the teams; it is an unavoidable by-product of decentralization. So what might be going on in this scenario?
- The Security team may have pushed out a new Antivirus pattern update to equip the systems to defend themselves against some high-risk security threat in the wild. This pattern update might have a bug related to high server workload. This might manifest as memory depletion (Paged / NonPaged Pool depletion for example)
- An Antivirus Pattern update was released which has a conflict with an OS component but surfaces only under certain conditions. For example, excessive real-time scanning may be taking place because a large number of users have their “My Music” folder redirected to their Network Drive
- The storage team updated the firmware and drivers on a SQL server. The new Multipath I/O (MPIO) driver may have a bug which manifests when the I/O activity reaches a certain threshold. After the update, the server ran fine for almost a month. However, month-end processing brings heavy SQL queries and reporting. This puts additional stress on the disk subsystem, causing the inherent bug to surface and affect the production environment
- Although it may be rare, the problem could be caused by a hardware component that has developed a fault over time. This problem results in “bit-flips”, which might cause a fault in the driver based on the code logic of the driver. The end result is a system hang or crash
- The backup agent hasn’t been updated for quite some time. However, there may be a bug related to pool memory leaks under certain circumstances (such as the size of backup data being pulled from the server). Over time the utilization of the server has increased. The bug surfaces under these conditions and causes exhaustion of pool memory – resulting in the server hang
- Normal usage of the server may put the server beyond a critical point in terms of what the hardware and software are able to handle. The most common example of this is a file or application server. Over time, as a result of normal business growth, the workload of the server may have reached the point where the operating system or application is simply unable to keep up. At this point, it would be time to consider scaling the environment or server(s) to address the problem
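The backup agent item, where an old leak only becomes visible as utilization grows, comes down to simple arithmetic. The Python sketch below is a back-of-the-envelope model, not a measurement: the pool size, the bytes leaked per operation, the operation rates and the reboot interval are all made-up numbers, and the function names are hypothetical.

```python
def hours_until_exhaustion(pool_bytes, leak_per_op, ops_per_hour):
    """Hours until a steady leak consumes the whole pool (None if no leak)."""
    leaked_per_hour = leak_per_op * ops_per_hour
    if leaked_per_hour == 0:
        return None
    return pool_bytes / leaked_per_hour


def hangs_before_reboot(pool_bytes, leak_per_op, ops_per_hour, hours_between_reboots):
    """True if the pool runs out before the next scheduled reboot."""
    hours = hours_until_exhaustion(pool_bytes, leak_per_op, ops_per_hour)
    return hours is not None and hours < hours_between_reboots


# All values below are assumptions chosen only to illustrate the effect.
POOL = 256 * 1024 * 1024   # 256 MiB of pool memory
LEAK = 64                  # bytes leaked per backup operation
REBOOT_INTERVAL = 720      # hours between maintenance reboots (30 days)

print(hangs_before_reboot(POOL, LEAK, 1_000, REBOOT_INTERVAL))   # earlier workload
print(hangs_before_reboot(POOL, LEAK, 10_000, REBOOT_INTERVAL))  # workload after growth
```

With these numbers the leak takes roughly 4,194 hours to exhaust the pool at the lower rate, so a monthly reboot always masks it. At ten times the workload it takes roughly 419 hours, and the same unchanged agent now hangs the server between reboots.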
Again, based on the scenario above, there are some fairly innocuous changes that, at the time of implementation, did not result in issues. However, over time or under certain conditions, problems do surface – but, “Nothing changed in the environment” …
With that, it’s time to bring this post to a close. Thanks for stopping by! By the way, you can find more information on the quirky terms Heisenbug and Bohrbug as well as other similar terms on the Wikipedia page devoted to Unusual Software Bugs.
– Pushkar Prasad
EDIT (6/23): Added Wikipedia link to article