Every once in a while I hear from someone that they believe they have a performance problem with their Scale-Out File Server. When I dig a little further, it’s very common to find that file copies are being used as the mechanism for measuring storage performance for all kinds of scenarios.
This blog post is about why file copies are not a good metric for evaluating storage performance. It also covers what other tools you could use instead to measure things. As usual, my focus here is on copies that use SMB3 file shares or File Server Clusters.
2. Why people use file copies to measure storage performance
First of all, it’s important to understand why people use file copies to measure performance. There are actually quite a few reasons:
2.1. It’s so easy…
File copies are a simple thing to do. You can even do it from File Explorer.
2.2. It’s so pretty…
File Explorer now has a nice visualization of the file copy operation, including a chart showing bandwidth as the copy progresses.
2.3. I have lots of large VHDX files sitting around ready to copy
We now have many large VHDX files readily available for testing file copies, and they will take a while to transfer. When someone is looking for a simple way to generate IOs against a storage subsystem, they might be tempted to simply copy those files in or out.
2.4. Someone told me it was a good idea
There are a lot of blog posts and demo videos out there about how fast file copies are in this new storage system or this new protocol. I might have done that myself a few times (sorry about that).
3. Why using file copies to measure storage performance is not a good idea
Using file copy to measure storage performance has a number of issues. You might essentially end up reaching incorrect conclusions. Here are a few reasons for that:
3.1. Copy might not be optimized
In order to get the most of your storage subsystem, you need to queue up enough IO requests to keep the entire system busy end-to-end. For instance, when using hard disk drives, we like to watch the queue depth performance counters to make sure we maintain at least two IOs queued for each HDD you have in a system.
In order to take advantage of this, when you’re copying large files using the Windows copy engine, it will issue 8 asynchronous writes of 1MB each by default. However, when copying lots of small files (and using a fast storage backend), you are less likely to stand up the required IO queue depth unless you queue more IOs and use multiple threads.
To make matters worse, most file copy programs will copy only one file at a time, waiting for one file copy to finish before starting the next one. This serialization will give you a much lower storage performance than the system is actually capable of delivering when more requests are being queued in parallel.
The SMB protocol is able to queue up multiple asynchronous requests, but that does not happen unless the application (the file copy program in this case) takes advantage of it. ROBOCOPY is one of these programs that can use multiple threads to copy multiple files at once.
More importantly, the file copy behavior is not a good stand-in for all other workloads. You will find that its behavior is different from a SQL Server OLTP databases or a Hyper-V Virtual Desktop deployment, both in regards to IO sizes and the number of IOs that are queued up.
3.2. Every copy has two sides
People will often forget that file copies measure the performance of both source storage and the destination storage.
If you’re copying from a single slow disk at the source to 10 fast disks in a striped volume on the other end, the speed at which you can read from that source will determine the overall speed of the file copy process.
You’re basically measuring the performance of the slower of the two sides.
3.3. Offloads and caching
There are a number of specific technologies that can accelerate certain kinds of copies like Offloaded Data Transfers (ODX) and SMB file copy acceleration (COPYCHUNK) which will impact certain kinds of file copies and it might be hard to determine when they apply or not.
File copies will attempt to use buffered IOs, which can be accelerated by regular windows file caching. On the other hand, buffered file copies will not be accelerated by cluster CSV caching, which only accelerates unbuffered IOs.
As with item 3.1, these are examples of how file copy performance will not match the performance of other types of workload, like an OLTP database or a VDI deployment.
3.4. CA will not cache
Continuously Available (CA) file shares are commonly used to provide Hyper-V and SQL Server with a storage solution that will transparently failover in case of node failure.
In order to deliver on its availability promise, CA file shares will make sure that every write goes straight to disk (write-through) and no writes are cached only in RAM. That’s how SMB can recover from the failure of a file server node at any time without data loss.
While this is great for resiliency, it will slow down file copies, particularly if you are copying a single large file.
It’s important to note that most server workloads (including SQL Server and Hyper-V) will always use write-through regardless of CA, so while file copies are affected by the write-through behavior change caused by CA, most server workloads are not.
You can read more about this at http://blogs.technet.com/b/filecab/archive/2013/12/05/to-scale-out-or-not-to-scale-out-that-is-the-question.aspx.
4. Better ways to measure performance
4.1. Run the actual workload
The best way to measure performance is to run the actual workload that you are targeting.
If you’re configuring some storage that will be used for an OLTP database, install SQL Server, run an OLTP workload and see how many transactions per second you can get out of it.
If you’re creating a solution for static web sites running inside a VM, install Hyper-V, create a VM configured with IIS, set up a few static web sites and see how they handle multiple clients accessing the static web content.
4.2. Run a workload simulator
When it’s not possible or practical to run the actual workload, you can at least run a synthetic workload simulator.
There are many of these simulators out there that mimic the actual behavior of the application and allow you to simulate a large number of clients accessing your machine.
For instance, to simulate running SQL Server databases, you might want to try the DiskSpd tool (see https://aka.ms/DiskSpd).
DiskSpd is actually fairly flexible and can go beyond simulating just SQL Server behavior. I even created a specific blog post on how to use DiskSpd, which you can find at http://blogs.technet.com/b/josebda/archive/2014/10/13/diskspd-powershell-and-storage-performance-measuring-iops-throughput-and-latency-for-both-local-disks-and-smb-file-shares.aspx.
4.3. Keep your simulations as real as possible
While it’s more practical to use workload simulators, you should try to stay as close as possible to the actual solution you will deploy.
For instance, if you planning to deploy 4 SQL Servers in a Hyper-V-based private cloud, you should measure the storage performance by actually creating 4 VMs and running your DiskSpd simulation inside each one.
That is a much better simulation than just running 4 instances of DiskSpd in the host, since the IO pattern of 4 instances of SQL Server running on bare metal will be different from the IO pattern of four VMs each running one instance of SQL Server.
5. If your workload is actually file copies
All that aside, there’s a chance that what you are actually trying to test a file copy workload. I mean, you actually have a production scenario where you will be transferring files. In that case (and only in that case), here are a few tips to optimize that specific scenario.
5.1. Check both sides of the copy
Remember to optimize both the source and the destination storage subsystems. As mentioned before, you will be as fast as the weakest link in the chain. You might want to redesign your storage solution so that source and destination have better performance or are “closer” to each other.
5.2. Use the right copy tool
Most file copy tools like the EXPLORER.EXE GUI, the COPY command in the shell, the XCOPY.EXE tool and the PowerShell Copy-Item cmdlet not optimized for performance. They are single-threaded, one-file-at-a-time solutions that will do the job but are not designed to transfer files as fast as possible.
The best file copy tool included in Windows is actually the ROCOBOPY.EXE tool. It includes very useful options like /MT (for using multiple threads to copy multiple files at once) and /J (copy using unbuffered I/O, which is recommended for large files).
That tool got some love from the Performance Fundamentals team at Microsoft and it’s usually much faster than anything else in Windows.
It’s important to note that even ROBOCOPY with the /MT option won’t help if you’re copying a single file. Like most other file copy programs, it uses a common file copy API instead of custom code.
5.3. Offload with COPYCHUNK
If it’s an option for you, put source and destination of your copy on the same file server to leverage the built-in SMB COPYCHUNK. This optimization is part of the SMB protocol and basically avoids sending data over the wire if the source and destination are on the same machine.
You can read about it at http://msdn.microsoft.com/en-us/library/cc246475.aspx (yes, this has been there since SMB1 and it’s still there in SMB3).
Note that COPYCHUNK only applies if the source and destination shares are on the same file server and if the file size is at least 64KB.
5.3. Offload with ODX
If your file server uses a SAN back-end, consider using the Offloaded Data Transfers (ODX). This T10 standard improves performance by using a system of tokens to avoid transferring actual data over the wire.
It works only if the source and destination paths live on the same SAN (or somehow connected in the back-end). This also works with SMB file shares (SMB basically lets the request pass down to the underlying storage subsystem).
ODX support was introduced in Windows Server 2012 and requires specific support from your SAN vendor. You can read about it at http://msdn.microsoft.com/en-us/library/windows/hardware/dn265439.aspx.
5.4. Create a non-CA file share
If your file server is clustered, you can use SMB Continuously Available file shares that allow you to lose any node of the cluster at any time without impact to the applications. The file clients and file servers will automatically recover though a process we call SMB Transparent Failover.
However, this requires that every write be written through to the storage (instead of potentially being cached). Most server workloads (like Hyper-V and SQL Server) already have this unbuffered IO behavior, but not file copies. So, CA has the potential of slowing down file copy operations, which are normally done with buffered IOs.
If you want to trade reliability for performance during file copies, you can create a file share with the Continuous Availability property turned off (it’s on by default on all clustered file shares).
In that case, if there is a failover during a file copy, you might get an error and the copy might be aborted. But if you don’t have any failovers, the copy will go faster.
For server workloads like Hyper-V and SQL Server, turning off CA will not make things any faster, but you will lose the ability to transparently failover.
Note that you can create two shares pointing to the same folder, one without CA for file copy operations only and one with CA for regular server workloads. Having those two shares might have the side effect of confusing your management software and your file server administrators.
5.5. Use SMB Multichannel
If you can afford some extra hardware, consider adding a second network interface of the same type and speed to leverage SMB Multichannel (using multiple network paths simultaneously).
This was introduced in Windows Server 2012 (along with SMB3) and you must have it on both sides of the copy to be effective.
SMB Multichannel might be able to help with many scenarios, including a single large file copy when you are constrained by bandwidth or IOPS.
Also check if you have a second port on your NIC that is not wired to the switch, which might be an even easier upgrade (you will still need some extra cables and switch ports to make it happen).
You can learn more about SMB Multichannel at http://blogs.technet.com/b/josebda/archive/2012/05/13/the-basics-of-smb-multichannel-a-feature-of-windows-server-2012-and-smb-3-0.aspx.
When using SMB Multichannel in a file server cluster, be sure to use multiple subnets, as described at http://blogs.technet.com/b/josebda/archive/2012/11/12/windows-server-2012-file-server-tip-use-multiple-subnets-when-deploying-smb-multichannel-in-a-cluster.aspx.
5.6. Windows Server 2012 R2 Update
There are specific copy-related improvements in the Windows Server 2012 R2 Update released in April 2014. That update is especially important if you are using a Continuously Available file share as the destination of your file copy. You can find information on how to obtain the update at http://technet.microsoft.com/en-us/library/dn645472.aspx.
By the way, we are constantly evolving Windows Server and the Scale-Out File Server. We release updates regularly and keeping up with them will give you the best results.
5.7. File copy for Hyper-V VM provisioning
One special case of file copy is related to the provisioning of virtual machines from a template. Typically you keep a “Library Share” with your VHDX files and you copy from this library to the deployment folder, where the VHDX will be associated with a running virtual machine.
You can avoid this by using differencing VHDX files, or you can use some interesting tricks (like live VHDX file re-parenting, introduced in Windows Server 2012) to optimize your VM provisioning.
You can find more details about your options at http://blogs.technet.com/b/josebda/archive/2012/03/20/windows-server-8-beta-hyper-v-over-smb-quick-provisioning-a-vm-on-an-smb-file-share.aspx.
5.8. System Center Virtual Machine Manager 2012 R2
If you’re using SCVMM for provisioning your VMs from a library, it’s highly recommended that you upgrade to SCVMM 2012 R2.
Before that release, SCVMM used the slower BITS protocol to transfer files from the library to their final destination. In the 2012 R2 release, VMM uses a new way to copy files, which will leverage things like SMB Multichannel, SMB COPYCHUNK and ODX offloads to significantly improve performance.
You can find some details on how VMM deploys virtual machines at http://technet.microsoft.com/en-us/library/hh368991.aspx(that bottom of the page includes the details on Fast File Copy).
If you’re using copies to measure performance of anything except file copies, I hope this post made it clear that’s not a good idea and convinced you to use other methods of storage performance testing.
If you’re actually trying to optimize file copies, I hope you were able to find at least one or two useful tips here.
Feel free to share your own file copy experiences and tips using the comment section below.
Note: This post was updated on 10/26/2014 to use DiskSpd instead of SQLIO.