Self-healing NTFS

Windows Vista and Windows Server 2008 contain an often overlooked feature called NTFS Self-Healing. In short, it is an improvement to NTFS whereby Windows detects a file system error and fixes it on the fly. All of this happens in the background without anyone noticing, unless you have something (such as MOM or SCOM) watching the event log for the relevant events as they are logged.

This is a major improvement: previously you often had no idea that a file was corrupt until you went to open it, which was normally the moment when you most needed the data! Running Chkdsk.exe on the machine nearly always gives the message:

“Chkdsk cannot run because the volume is in use by another process. Would you like to schedule this volume to be checked the next time the system restarts (Y/N) ?”

Although understandable, this is never helpful at that moment, especially if the machine in question is a production server. The most common way to try to avoid problems with corrupt files is to schedule regular Chkdsk runs on the machine, but this often requires lengthy downtime while the scan runs.

In Windows Server 2008, the self-healing feature is turned on by default. You can double-check this with the command “fsutil repair query c:”. The companion command “fsutil repair set c: <flags>” enables or disables self-healing (a flags value of 0 disables it, and 1 turns it back on). When you run the query, you should see the following:

C:\Windows\system32>fsutil repair query c:
Self healing is enabled for volume c: with flags 0x1.
flags: 0x01 - enable general repair
       0x08 - warn about potential data loss
       0x10 - disable general repair and bugcheck once on first corruption
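
If you want to check the flags from a script rather than by eye, the output above can be decoded with a few lines of Python. This is only a rough sketch: the function names are my own, the flag meanings are copied from the output above, and the regular expression assumes the exact wording shown here, which may differ on other Windows versions.

```python
import re

# Flag meanings as listed in the fsutil output above.
REPAIR_FLAGS = {
    0x01: "enable general repair",
    0x08: "warn about potential data loss",
    0x10: "disable general repair and bugcheck once on first corruption",
}


def parse_repair_flags(output: str) -> int:
    """Extract the numeric flags value from 'fsutil repair query' output.

    Assumes the 'with flags 0x..' wording shown above.
    """
    match = re.search(r"with flags (0x[0-9a-fA-F]+)", output)
    if match is None:
        raise ValueError("no flags value found in fsutil output")
    return int(match.group(1), 16)


def describe_repair_flags(flags: int) -> list[str]:
    """Return the description of every known bit that is set in flags."""
    return [text for bit, text in sorted(REPAIR_FLAGS.items()) if flags & bit]


sample = "Self healing is enabled for volume c: with flags 0x1."
print(describe_repair_flags(parse_repair_flags(sample)))
```

On a real machine you would capture the text to parse by running the fsutil command from an elevated prompt; the sample string here is just the line from the output above.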

The whole process is transparent to the user, who will probably not even realise that anything has taken place, although I have not seen any figures on whether it uses noticeable CPU cycles or RAM. As Mark Russinovich explains: “If a corruption is detected, an NTFS worker thread is spawned which will go off and perform a localized fix-up of those data structures. The only effect that an application would see is that files would be unavailable for the period of time that it was trying to access, had been corrupted. If it retried later after the corruption was healed, then it would succeed. But the system never has to come down, so there's no reason to have to reboot the system and perform a low-level CHKDSK offline.”
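
The "retry later" behaviour described in that quote might look something like this from the application's side. It is a minimal sketch, not anything Microsoft documents: the function name, attempt count and delay are made up, and I am assuming that a file being repaired surfaces to Python as an OSError, which the quote does not specify.

```python
import time


def read_with_retry(path, attempts=5, delay_seconds=2.0):
    """Read a file, retrying if it is temporarily unavailable.

    Assumption: a file that NTFS is busy repairing raises OSError on access.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except FileNotFoundError:
            # A file that does not exist will not appear by waiting.
            raise
        except OSError as error:
            last_error = error
            # Pause before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

An application would call this in place of a plain open-and-read for files it cannot afford to give up on at the first error. The delay is deliberately short, since the repair is supposed to be a localized fix-up and not a full offline scan.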