by jcannon on July 11, 2006 04:32pm
*Updated* So, this week's tech tip is about memtest, and yes, I am sure there are some who might scoff at this... But I think we have a tendency to lose sight of the basics. For instance, last week we had quite an interesting time debugging a problem that occurred intermittently, and we were not able to find a way to reproduce it consistently. We ran through all kinds of things until we decided, for grins and giggles, to run memtest. And lo and behold, it found memory errors. We replaced the memory and have not seen a recurrence of the problem.
We decided to pull this old but trusty tool back out of the stable. Special thanks go to Kyle Adams and Stephen Zarkos, who did a lot of the footwork on this one.
Memtest is a simple program designed for the x86 architecture. You would reach for it when hardware hangs or when your computer doesn’t boot at all. Because memtest boots from its own media rather than running under your operating system, you can grab it and run it on the machine either way.
Actually, there is no hard and fast rule for when to use it. There are two ways I would put it in the toolbox, and they have to do with methodology more than anything.
- I have no clue what is wrong, I am completely out of ideas, and I am just stabbing in the dark.
- I run memtest as part of a standard suite of checks and tests whenever I debug a problem.
Honestly, I think the first one is what happens more often in real life. A lot of people do not think that hardware like memory could be causing their problems. They forget that with memory it is often not a black or white issue. It is not an all or nothing failure. Sometimes the error shows up and sometimes it does not. I have never really seen a failure where I could say, without question, “that is memory!”
You can get it here (Linux GPL, and Windows version): http://www.memtest86.com/
(There is also a non-GPL, but still free, version for Windows available here: http://hcidesign.com/ I am not too familiar with how it works, but the web page gives you a lot of information.)
One thing to note: this can run for a very long time, several hours in some cases.
Memtest exercises the memory directly in an effort to pinpoint a problem in the memory itself. It uses a set of algorithms to check for consistency and for errors in how data is stored and addressed. The algorithms that memtest uses to test the memory are the following:
- Address test, walking ones, no cache
a. Walks a single one bit through the address bits in sequential order, with the cache disabled, to test every address bit
- Address test, own address
a. Writes each memory location's own address into that location
b. Tests for addressing errors
- Moving inversions, ones and zeros
a. Checks memory using patterns of all ones and all zeros
- Moving Inversions, 8 bit pattern
a. Uses an 8 bit wide pattern to test for errors on “wide” memory
- Moving inversions, random pattern
a. Generates random numbers and their complements and writes them to each address.
- Block Move, 64 moves
a. Memory is initialized with patterns that invert every 8 bytes.
b. The data is then moved around in 4 MB blocks and checked
- Moving inversions, 32 bit patterns
a. Shifts data patterns one bit for each successive address
- Random number sequence
a. Writes a set of random numbers into memory
b. Checks the memory for consistency on the next pass
- Modulo 20, ones and zeros
a. Uses the Modulo-X algorithm to check for errors that the moving inversions tests miss because of cache and buffering interference
- Bit fade test, 90 minute, 2 patterns
a. Initializes memory and then sleeps for 90 minutes
b. Checks memory after the 90 minutes is up
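To make a couple of these concrete, here is a rough Python sketch of the "own address" test and the moving inversions idea. This is only an illustration under some loud assumptions: `SimulatedMemory` is a made-up class that fakes memory as a list of 32-bit words with one optional stuck bit, and the function names are mine, not memtest's. The real tool works on physical memory from its own boot environment, which a user-space script cannot do.

```python
from typing import List, Optional

WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit words for this sketch


class SimulatedMemory:
    """Hypothetical stand-in for RAM: a list of words, with one bit
    of one word optionally stuck at 0 to imitate a faulty cell."""

    def __init__(self, size: int, stuck_addr: Optional[int] = None,
                 stuck_bit: int = 0) -> None:
        self.words = [0] * size
        self.stuck_addr = stuck_addr
        self.stuck_bit = stuck_bit

    def __len__(self) -> int:
        return len(self.words)

    def write(self, addr: int, value: int) -> None:
        value &= WORD_MASK
        if addr == self.stuck_addr:
            value &= ~(1 << self.stuck_bit)  # the faulty cell cannot hold a 1
        self.words[addr] = value

    def read(self, addr: int) -> int:
        return self.words[addr]


def own_address_test(mem: SimulatedMemory) -> List[int]:
    """Write each address into its own location, then verify every location."""
    for addr in range(len(mem)):
        mem.write(addr, addr)
    return [addr for addr in range(len(mem)) if mem.read(addr) != addr]


def moving_inversions(mem: SimulatedMemory, pattern: int) -> List[int]:
    """Fill with a pattern, sweep upward checking it and writing the
    complement, then sweep downward checking the complement and
    writing the pattern back. Returns the addresses that failed."""
    pattern &= WORD_MASK
    inverse = ~pattern & WORD_MASK
    bad = set()
    for addr in range(len(mem)):
        mem.write(addr, pattern)
    for addr in range(len(mem)):            # lowest address upward
        if mem.read(addr) != pattern:
            bad.add(addr)
        mem.write(addr, inverse)
    for addr in reversed(range(len(mem))):  # highest address downward
        if mem.read(addr) != inverse:
            bad.add(addr)
        mem.write(addr, pattern)
    return sorted(bad)


faulty = SimulatedMemory(64, stuck_addr=10, stuck_bit=0)
print(own_address_test(faulty))      # [] (address 10 never needs bit 0 set)
print(moving_inversions(faulty, 0))  # [10]
```

In this toy run the own address test misses the bad cell, because the value written there happens not to use the stuck bit, while moving inversions catches it once the complement is written. That is a small picture of why memtest runs so many different patterns instead of just one.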
The point: applications still need error-free memory to execute correctly, especially today, with application complexity increasing all the time. With such diverse environments across your network, how do you replicate problems in your lab, or, even more importantly, separate hardware failures from software failures?
As always, comments, suggestions, etc. are appreciated.