Disaster Recovery Planning for Work Folders

Recently, Matt Hrynkow from Microsoft helped a customer deploy Work Folders, and we talked about how to plan for disaster recovery (DR). I thought it would be good to share the details.

Overview

The purpose of DR is to return a system to a normal state after a disastrous event. In the context of Work Folders, there are two main goals:

  1. Allow users to continue accessing their data, i.e. users experience minimal to no downtime when the server fails.

  2. Allow IT admins to bring the server back online and resume client sync.

Note: A DR plan is not intended to recover files deleted by users. Other Windows features, such as File History, can be evaluated for individual file restore.

Unlike traditional file servers, where the data is stored only on the server, Work Folders is a sync technology and stores data in multiple locations. Each sync client (e.g. a user’s Windows or iOS device) and the sync server (i.e. a Windows Server 2012 R2 file server) has a local copy of the same data for a given user. With Work Folders, users can always access their data on their devices, even when the Work Folders server is down. This is very different from a traditional file share, where user access depends on the availability of the file server. A Work Folders server failure means that user data is no longer synced across a user’s devices, because device sync depends on the server to mediate the process.

In general, two methods for data recovery can be used. You can either use the client data to repopulate the server copy, or replicate the server data to another server at another site to pre-stage the file set.

Recovery using client data

If users in your environment typically have a single device, and devices are always connected to the file server through high-speed connections, you may evaluate recovery using only the client copy of the data. With this approach, after a primary server failure you configure a replacement server, and the clients re-upload their data once the replacement is up and running. The advantage is that you don’t need a standby server and storage. The downside is that the network can be busy once the replacement server is in place and all the clients try to upload their data at the same time. Dataset size versus network bandwidth should be carefully evaluated for this option.
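To get a feel for the dataset size versus bandwidth trade-off, a rough back-of-the-envelope estimate can help. The sketch below is only an illustration: the function name Get-ReuploadEstimateHours and the Utilization parameter are made up for this example, it uses decimal units (1 GB = 8,000 megabits), and it ignores protocol overhead, client availability and server-side throttling.

```powershell
# Hypothetical helper: estimates how long a full client re-upload could take.
# Assumes all clients upload continuously and share one link to the server.
function Get-ReuploadEstimateHours {
    param(
        [Parameter(Mandatory)] [double] $TotalDataGB,   # total user data across all clients
        [Parameter(Mandatory)] [double] $LinkMbps,      # link speed to the replacement server
        [ValidateRange(0.01, 1.0)] [double] $Utilization = 0.5  # share of the link sync may use
    )

    $totalMegabits = $TotalDataGB * 8 * 1000            # GB -> megabits (decimal units)
    $seconds = $totalMegabits / ($LinkMbps * $Utilization)
    [math]::Round($seconds / 3600, 1)                   # hours, one decimal place
}

# Example: 2 TB of user data over a 1 Gbps link at 50% utilization
Get-ReuploadEstimateHours -TotalDataGB 2000 -LinkMbps 1000 -Utilization 0.5   # 8.9
```

If the estimate comes out in days rather than hours, that is a hint that one of the server-side options described later may be a better fit.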

Export Work Folders server configuration

If you plan to use the client data for recovery after a DR event, you should keep track of the sync share configurations on the server to ease setup on the replacement server.

You can run the following cmdlet on the Work Folders server to get all the sync share settings:

Get-SyncShare | Export-Clixml -path c:\temp\config.xml

And later run the import cmdlet on the replacement server to configure the sync share settings:

Import-Clixml -Path c:\temp\config.xml | New-SyncShare
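Before creating the shares on the replacement server, it can be useful to review what the exported file contains and whether the share paths exist there. This is a minimal sketch, assuming the exported objects expose Name, Path and User properties and that the same drive layout is intended on the replacement server:

```powershell
# Review the exported sync share settings without creating anything.
# Name, Path and User are assumed to be present on the exported objects.
Import-Clixml -Path C:\temp\config.xml | ForEach-Object {
    [pscustomobject]@{
        Name       = $_.Name
        Path       = $_.Path
        User       = $_.User -join ', '
        PathExists = Test-Path -LiteralPath $_.Path   # checked on the local (replacement) server
    }
} | Format-Table -AutoSize
```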

Certificate

To have clients automatically sync with the replacement server, you can use the same server name on the replacement server. To ensure the certificate works on the replacement server, export the certificate you acquired for the original Work Folders server together with its private key, then install the resulting .pfx file on the new server.

Users don’t need to do anything; sync will continue after the replacement server is configured.
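The certificate move can be scripted with the built-in PKI cmdlets. The sketch below is one possible way to do it, not the only one. The subject filter workfolders.contoso.com and the file path C:\temp\workfolders.pfx are placeholders, and it assumes the certificate lives in the local machine’s Personal store and that its private key was marked exportable when it was installed.

```powershell
# --- On the original Work Folders server ---
# Placeholder subject; adjust to the name on your certificate.
$pfxPassword = Read-Host -AsSecureString -Prompt 'Password to protect the PFX'
$cert = Get-ChildItem -Path Cert:\LocalMachine\My |
    Where-Object { $_.Subject -like '*workfolders.contoso.com*' } |
    Select-Object -First 1
Export-PfxCertificate -Cert $cert -FilePath 'C:\temp\workfolders.pfx' -Password $pfxPassword

# --- On the replacement server, after copying the .pfx file over ---
$pfxPassword = Read-Host -AsSecureString -Prompt 'Password for the PFX'
Import-PfxCertificate -FilePath 'C:\temp\workfolders.pfx' `
    -CertStoreLocation Cert:\LocalMachine\My -Password $pfxPassword
```

Importing only places the certificate in the store. It still has to be bound to the HTTPS endpoint that Work Folders uses, which is not shown here. Because the .pfx file contains the private key, store it somewhere safe and delete temporary copies afterwards.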

DR using server to server replication

For DR planning, you need to be aware of two types of data for Work Folders:

  1. File data: the actual file set to be synced across clients and server.

  2. Metadata database: tracking the file set version and sync status.

In order to resume sync successfully after a DR event, the following initial data states on the secondary server (or replacement server) are acceptable:

  1. No data present: neither the file set nor the metadata database exists on the replacement server. Once sync starts, clients will upload the data to the server.

  2. Only file data is present: once sync starts, the server will reconcile the data and generate the metadata database on the replacement server.

  3. Both file data and metadata database are present, and they are in a consistent state. It’s important to ensure the database and the file data are consistent with each other, otherwise data loss may occur. For example, taking a VSS snapshot with the Work Folders VSS writer ensures that the database and the file set are in a consistent state, whereas a plain file copy of the file data and the metadata database does not guarantee consistency.

Different DR plans result in different initial data states on the secondary server, each with its own pros and cons, which are discussed below.

DR recommendations

At a high level, there are three options you can consider:

  1. Using BCDR (business continuity and disaster recovery) products: many storage products offer block-level replication. If your data is stored on these products, you may consider leveraging them to build the DR plan. These products need to guarantee the data replication order and data consistency at a given point in time.

  2. Using VSS backup and restore: many VSS backup products offer incremental backups. You can configure a snapshot to be taken daily, and restore the file data and metadata database to the secondary server after a primary server failure.

  3. Scheduled file replication with Windows applications (e.g. Robocopy or DFSR): make sure only the file data is replicated, and not the metadata database. After failover, the client and server will reconcile the data and create the metadata database on the replacement server.

Using BCDR products
  Pros:
  • Depending on the replication frequency, the data on the secondary site typically lags the primary site by only seconds or minutes, so low RPO and RTO targets can be achieved.
  • Metadata and file data are in a consistent state, so no reconciliation is needed after failover, which reduces the number of file conflicts created.
  Cons:
  • Requires BCDR products
  • Can be expensive
  Support: This should be supported by the BCDR product, to ensure replication consistency.

VSS based backup and restore
  Pros:
  • Metadata and file data are in a consistent state, so no reconciliation is needed after failover, which reduces the number of file conflicts created.
  Cons:
  • Requires a VSS based backup application, at additional cost
  • Longer RPO, since incremental backups are typically configured on daily schedules
  • More complex failover procedure, to restore data on the secondary site
  Support: Database consistency is ensured by the Work Folders VSS writer.

Scheduled file replication using Robocopy or DFSR for file data only
  Pros:
  • No additional software required; can be configured with only built-in applications
  Cons:
  • Metadata database is missing after failover, which requires data reconciliation
  • Copy process can introduce more sync concurrency errors
  • More complex failover procedure, to avoid file conflicts or data loss
  • Requires configuring the correct user folder ownership on the replacement server
  • File changes on the client during the failover will result in conflicts, which require the user to do a manual merge
  Support: Supported

Deployment guidance

Using BCDR products

I won’t go into details for this option, as the configuration depends on the BCDR product. Follow the guidance of the BCDR product, and make sure the file data and the metadata database are in the same replication group, so that the IO ordering is maintained across both data sets and the product can guarantee consistency between the file data and the metadata database at any given point in time.

Using VSS backup and restore

Backing up and restoring through VSS with the Work Folders VSS writer keeps the file set and the metadata database on the server consistent with each other. How you configure the backup to use the Work Folders VSS writer depends on the VSS backup application.

Upon restore, both the file data and the metadata will be restored to the replacement server. Restoring data on the server using VSS is called a “non-authoritative” restore. That means the files restored on the server could be overwritten by the client, if newer changes were made on the client.

(Note: In contrast, there is another mode of restore called “authoritative restore”, done by copying and pasting files from the backup. The restored files will be treated as a newer version and will overwrite the client copy through sync, even when the client actually has a newer copy than the restored file. This approach can be used to recover individual corrupted files when necessary, but it is not the focus of this blog.)

With a non-authoritative restore, when the client comes to sync, the sync engine can compare the file versions between the client (current) and the server (restored to a past point in time), and sync the changes between the client and the server.

File replication with Robocopy or DFSR

This approach is not recommended, because:

  • The file replication app scans the entire sync share for changes every time, which can be expensive on the server if there are many files.

  • When file changes are detected, Work Folders sync and the file replication app may compete to open the same files, which increases failures for both file sync and file replication.

  • You will need to actively monitor file replication progress, to ensure the app continues successfully. (Work Folders sync will retry after getting concurrency errors.)

  • You must exclude the sync share state folder on the server from file replication (both staging files and the metadata database). After failover, the metadata database must not be present before the syncsharesvc service starts. This will trigger data reconciliation, which may have a performance impact on the server if the file set is large.

  • Before the secondary server is put into action, make sure file replication (from the primary server to the secondary server) is stopped. Otherwise, after clients start to sync with the secondary server, replication could delete their changes and cause data loss.

  • Because the metadata database is not tracking file changes, any change made on the client during the server failover can become a conflict file, deleted files can come back, and moved directories may resurface. Although this is not data loss, users will need to manually figure out what to keep and what to delete or move again.

  • You need to ensure that user folder ownership is properly replicated. Work Folders can only sync the data if the user folder ownership is properly maintained as the user or local admin on the replacement server.

However, if you don’t have any VSS backup application and want to use a file replication application to build a poor man’s DR solution, you can follow the steps below to set it up:

  1. Assume the sync server URL for client connections is https://workfolders.contoso.com

  2. Configure two servers (server1 and server2) with the same sync shares. (Note: you can use the cmdlets listed in the Export Work Folders server configuration section to export and import the settings.)

  3. Configure DNS A records for server1 and server2

  4. Configure a DNS CNAME record for WorkFolders that points to server1. Configure a TTL on the record that gives clients an acceptable time to “notice” the record change when it is needed.

  5. Get a server certificate with hostnames including: WorkFolders.contoso.com, server1.contoso.com and server2.contoso.com

  6. Configure the certificate on both server1 and server2 (details on how to configure the certificate can be found here).

  7. For Robocopy, you can run the following command to replicate the data and its ACLs:

Robocopy \\<server1>\sharename \\<server2>\sharename /MIR /SEC

  8. For DFSR, make sure the replicated folders on server2 are configured as read-only, so that files replicate only from server1 to server2.

  9. Have clients configure Work Folders against https://workfolders.contoso.com
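The DNS alias from the setup steps above could be created with the DnsServer cmdlets, roughly as sketched here. This assumes the contoso.com zone is hosted on a Windows DNS server where the DnsServer module is available. The 5-minute TTL is only an example value, not a recommendation from this post.

```powershell
# Run on (or against) the Windows DNS server that hosts the contoso.com zone.
# Creates the WorkFolders alias pointing at server1, with an example 5-minute TTL.
Add-DnsServerResourceRecordCName -ZoneName 'contoso.com' `
    -Name 'workfolders' `
    -HostNameAlias 'server1.contoso.com' `
    -TimeToLive ([TimeSpan]::FromMinutes(5))

# From a client machine: check what the alias currently resolves to.
Resolve-DnsName -Name 'workfolders.contoso.com' -Type CNAME |
    Select-Object -Property Name, NameHost, TTL
```

A shorter TTL lets clients pick up a failover sooner, at the cost of more frequent DNS lookups.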

When server1 fails and you want to fail over to the secondary server (e.g. server2):

  1. Stop the file replication app on server1

  2. On server2, net stop syncsharesvc

  3. On server2, delete the sync share metadata database, which can be found at VolumeDrive:\SyncShareState\<SyncShareName>

  4. On server2, net start syncsharesvc

  5. Switch DNS by changing the WorkFolders CNAME record to point to server2

  6. Wait for the TTL to expire, so that clients purge the old record from their DNS cache

  7. Clients automatically initialize a new sync partnership and start to sync with server2
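The server2 part of the failover (stop the service, remove the metadata database, start the service) can be wrapped in a small function so that it can be rehearsed with -WhatIf first. This is a sketch only: Reset-SyncShareState is a hypothetical name, the state path must be supplied by you because it depends on your volume and sync share name, and the function assumes that the whole state folder for that share can be removed.

```powershell
# Hypothetical helper for the server2 failover steps. Run in an elevated session.
function Reset-SyncShareState {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        # State folder of the sync share, e.g. <VolumeDrive>:\SyncShareState\<SyncShareName>
        [Parameter(Mandatory)] [string] $StatePath
    )

    # Stop the Work Folders sync service before touching its state
    Stop-Service -Name SyncShareSvc

    # Remove the metadata database so reconciliation is triggered on next start
    if (Test-Path -LiteralPath $StatePath) {
        Remove-Item -LiteralPath $StatePath -Recurse -Force
    }

    Start-Service -Name SyncShareSvc
}

# Dry run first: -WhatIf is passed on to the service and file cmdlets,
# so nothing is stopped or deleted.
# Reset-SyncShareState -StatePath '<VolumeDrive>:\SyncShareState\<SyncShareName>' -WhatIf
```

Make sure file replication from server1 is stopped before running this for real, since a replication pass after the reset could remove changes that clients have already synced to server2.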

Conclusion

Although all three options are supported, we recommend DR using BCDR products or the VSS backup and restore approach, as the secondary server will be in a consistent state after failover. As you can see from the steps, using a file replication app is very error prone, and may result in data loss if the procedures are not followed correctly.