Linux Blob / File Transfer Python Code Sample

HPC customers have been using AzCopy to copy files in and out of Azure Blob (block) Storage for quite a while, but a similar binary does not exist for Linux. The code sample linked below is an example of how you might build the basics of a similar blob copy program (though without all of the optimizations). It shows how to chunk files for faster parallel transfer operations, enable back-off policies, and use hashes for reliability.

You can find the code for it here:

https://github.com/Azure/azure-batch-samples/tree/master/Python/Storage

Usage

The blobxfer.py script allows interacting with storage accounts using any of the following methods: (1) management certificate, (2) shared account key, (3) SAS key. In addition to working with single files, the script can mirror entire directories into and out of Azure Storage containers. Block- and file-level MD5 checksumming for data integrity is supported along with various transfer optimizations, built-in retries, and user-specified timeouts.

The blobxfer script is a Python script that can be used on any platform where a modern Python interpreter can be installed. The script requires two prerequisite packages: (1) azure and (2) requests. The azure package lets the script use the Azure Python SDK to interact with Azure using a management certificate or a shared key. The requests package is required for SAS support; if SAS is not needed, you can remove all of the requests references from the script to reduce the prerequisite footprint. You can install these packages using pip, easy_install, or standard setup.py procedures.
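
For example, installing both prerequisites with pip might look like this:

pip install azure requests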

Program parameters and command-line options can be listed via the -h switch. At a minimum, three positional arguments are required: storage account name, container name, and local resource. Additionally, one of the following authentication switches must be supplied: --subscriptionid with --managementcert, --storageaccountkey, or --saskey. It is recommended to use SAS keys wherever possible; only HTTPS transport is used by the script.
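
For example, a SAS-authenticated upload of a single file might be invoked like this (the SAS key value is a placeholder):

blobxfer.py mystorageacct container0 mylocalfile.txt --saskey "<your-sas-key>"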

The script will attempt to perform a smart transfer by detecting whether the local resource exists. For example:

blobxfer.py mystorageacct container0 mylocalfile.txt

If mylocalfile.txt exists locally, then the script will attempt to upload the file to container0 on mystorageacct. One can use --forcedownload or --forceupload to force a particular transfer direction. Note that you may use the --remoteresource flag to rename the local file as the blob name on Azure storage if uploading.
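
For instance, the following would upload mylocalfile.txt but store it under a different blob name (renamedblob.txt here is just an illustrative name):

blobxfer.py mystorageacct container0 mylocalfile.txt --remoteresource renamedblob.txt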

If the local resource is a directory that exists, the script will attempt to mirror (recursively copy) the entire directory to Azure storage while maintaining subdirectories as virtual directories in Azure storage. You can disable the recursive copy (i.e., upload only the files in the directory) using the --no-recursive flag.
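
For example, to upload only the files at the top level of mylocaldir, a command might look like:

blobxfer.py mystorageacct container0 mylocaldir --no-recursive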

To download an entire container from your storage account, an example command line would be:

blobxfer.py mystorageacct container0 mylocaldir --remoteresource .

Assuming the mylocaldir directory does not exist, the script will attempt to download all of the contents of container0 because “.” is specified with the --remoteresource flag. To download individual blobs, specify the blob name instead of “.” with the --remoteresource flag. If the mylocaldir directory exists, the script will attempt to upload the directory instead of downloading it; in this case, if you want to force the download, use --forcedownload. When downloading an entire container, the script will attempt to pre-allocate file space and recreate the sub-directory structure as needed.
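
For example, forcing a download of the entire container into an existing mylocaldir directory might look like:

blobxfer.py mystorageacct container0 mylocaldir --remoteresource . --forcedownload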

Please remember when using SAS keys that only container-level SAS keys will allow for entire directory uploading or container downloading. The container must also have been created beforehand, as containers cannot be created using SAS keys.

Code Sample Snippet Explanations

Now let’s examine how the script performs a few of the fundamental operations. One of the things that you may want to do is use MD5 checksums to validate file transfers. We take advantage of Python’s hashlib library to conveniently compute MD5 checksums. In the code sample, there are two functions: one computes block-level checksums and the other computes file-level checksums.

    hasher = hashlib.md5()
    hasher.update(data)
    return base64.b64encode(hasher.digest())

The data parameter is the block of data to compute the MD5 hash for; here we simply instantiate a hasher and call update on the data itself. Because Azure storage expects the MD5 digest as a base64-encoded string, we call the base64.b64encode function on the digest.
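
As a minimal, self-contained sketch (the function name and the chunk size below are illustrative, not necessarily what the code sample uses), the block-level computation could be packaged like this:

    import base64
    import hashlib

    def block_md5_asbase64(data):
        # hash a single block of data and return the digest base64-encoded,
        # which is the form Azure storage expects for content_md5
        hasher = hashlib.md5()
        hasher.update(data)
        return base64.b64encode(hasher.digest())

    # example: compute the MD5 for one 4 MB chunk of a local file
    with open('mylocalfile.txt', 'rb') as filedesc:
        chunk = filedesc.read(4 * 1024 * 1024)
        print(block_md5_asbase64(chunk))

The second function in the code sample computes the file-level digest: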

    hasher = hashlib.md5()
    with open(filename, 'rb') as filedesc:
        while True:
            buf = filedesc.read(blocksize)
            if not buf:
                break
            hasher.update(buf)
        return base64.b64encode(hasher.digest())
    return None

The above function computes an MD5 digest for a file, which may be very large, and therefore computes the MD5 in chunks so the entire file does not have to be loaded into memory. With the file open, we read blocks of data and feed them to the hasher, continuing this loop until there is no more data to read from the file. We then get the digest and encode it as base64 for Azure storage.
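
As an illustrative usage sketch (the function name and expected digest below are placeholders, not values from the code sample), the same routine wrapped as a function could be used to verify a file after a transfer:

    import base64
    import hashlib

    def compute_file_md5_asbase64(filename, blocksize=65536):
        # illustrative wrapper around the chunked routine shown above
        hasher = hashlib.md5()
        with open(filename, 'rb') as filedesc:
            while True:
                buf = filedesc.read(blocksize)
                if not buf:
                    break
                hasher.update(buf)
        return base64.b64encode(hasher.digest())

    # placeholder value: compare against the Content-MD5 reported by the service
    expected_md5 = b'XrY7u+Ae7tCTyyK7j1rNww=='
    if compute_file_md5_asbase64('mylocalfile.txt') != expected_md5:
        raise IOError('MD5 mismatch for mylocalfile.txt')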

One of the other things you’ll likely want is the ability to process SAS requests, which is one of the common ways of authenticating to Azure Storage. The code sample uses the requests package to perform storage REST API requests with SAS keys. The “http_request_wrapper” function takes a function object and performs the request, retrying on retryable status codes.

        except (requests.exceptions.ConnectTimeout,
                requests.exceptions.ReadTimeout):
            pass
        except requests.exceptions.HTTPError as exc:
            if exc.response.status_code < 500 or \
                    exc.response.status_code == 501 or \
                    exc.response.status_code == 505:
                raise

In the except blocks, we see two kinds of exceptions being caught. The first except catches the requests library’s ConnectTimeout and ReadTimeout; when these are caught, we do not want to re-raise the exception but instead invoke the sleep-wait/retry cycle, hence the pass statement. For HTTP status codes, we re-raise for anything below 500 as well as for 501 and 505, which are generally not retryable. Everything else, i.e., the remaining 500-level codes, is implicitly silenced so that we retry.
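
As a rough sketch of how such a wrapper can be structured (this is a simplified stand-in, not the exact http_request_wrapper from the code sample), with an exponential back-off between attempts:

    import time

    import requests

    def http_request_wrapper_sketch(func, *args, **kwargs):
        # simplified retry loop: retryable failures sleep and try again,
        # non-retryable HTTP errors propagate to the caller
        wait = 1
        for _ in range(5):
            try:
                response = func(*args, **kwargs)
                response.raise_for_status()
                return response
            except (requests.exceptions.ConnectTimeout,
                    requests.exceptions.ReadTimeout):
                pass
            except requests.exceptions.HTTPError as exc:
                if exc.response.status_code < 500 or \
                        exc.response.status_code == 501 or \
                        exc.response.status_code == 505:
                    raise
            # back off before the next attempt
            time.sleep(wait)
            wait *= 2
        raise IOError('request failed after retries')

    # example usage:
    # response = http_request_wrapper_sketch(requests.get, url, timeout=60)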

Finally, one of the ways to increase transfer speed is to chunk up a file and run some of the transfers in parallel. When not using a SAS key, this functionality is easily provided for us by the Azure Python Storage SDK’s blob service.

        azure_request(self.blob_service.put_block, timeout=self.timeout,
                container_name=container, blob_name=remoteresource,
                block=blockdata, blockid=blockid, content_md5=blockmd5)

Here, we wrap the put_block call in the azure_request wrapper, which, much like the example above, retries on our behalf when using the Azure Python SDK. The block parameter is simply the data for the block, while blockid is a statically sized string that associates the block data with the subsequent put_block_list call. The content_md5 is the block MD5, which we computed using the function mentioned first above.
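
Below is a hedged sketch of the surrounding chunking logic: it splits a file into fixed-size blocks with statically sized block ids and uploads them through a small thread pool. The upload_block placeholder stands in for the azure_request/put_block call above; the id format, chunk size, and worker count are illustrative choices, not the code sample’s.

    import concurrent.futures

    def upload_block(blockid, blockdata):
        # placeholder: in the code sample this corresponds to the
        # azure_request(self.blob_service.put_block, ...) call shown above
        pass

    def upload_file_in_blocks(filename, chunksize=4 * 1024 * 1024):
        # split the file into fixed-size chunks, each tagged with a
        # statically sized (zero-padded) block id
        blocks = []
        with open(filename, 'rb') as filedesc:
            index = 0
            while True:
                data = filedesc.read(chunksize)
                if not data:
                    break
                blocks.append(('{0:08d}'.format(index), data))
                index += 1
        # upload the blocks in parallel with a small thread pool
        # (a real implementation would avoid holding every block in memory)
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(upload_block, bid, data)
                       for bid, data in blocks]
            for future in futures:
                future.result()
        # the ordered block id list is what put_block_list would then commit
        return [bid for bid, _ in blocks]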

The code sample also shows how to formulate a requests call to perform a range-based get on a blob:

        reqheaders = {'x-ms-range': x_ms_range}
        response = http_request_wrapper(requests.get, url=url,
                headers=reqheaders, timeout=self.timeout)
        if response.status_code != 200 and response.status_code != 206:
            raise IOError('incorrect status code returned for get_blob: {}'.format(
                response.status_code))

Here we set the appropriate x-ms-range header with the range string. A range string is formatted as “bytes=<start>-<end>”. After we make the call, we check the status_code of the response to ensure that we received either the entire blob (200) or a successful partial result (206).
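
As an illustrative sketch (calling requests.get directly rather than through the wrapper, and with the chunk size and function name as placeholders), a range-based download loop might look like this:

    import requests

    def download_blob_ranged(url, localfile, chunksize=4 * 1024 * 1024,
                             timeout=60):
        # url is assumed to be a full blob URL with a SAS token appended
        with open(localfile, 'wb') as filedesc:
            start = 0
            total = None
            while total is None or start < total:
                # request the next range of bytes: "bytes=<start>-<end>"
                x_ms_range = 'bytes={0}-{1}'.format(
                    start, start + chunksize - 1)
                response = requests.get(
                    url, headers={'x-ms-range': x_ms_range}, timeout=timeout)
                if response.status_code not in (200, 206):
                    raise IOError(
                        'incorrect status code returned for get_blob: '
                        '{0}'.format(response.status_code))
                filedesc.write(response.content)
                if response.status_code == 200:
                    # the service returned the entire blob in one response
                    break
                # a 206 response carries "Content-Range: bytes <s>-<e>/<total>"
                total = int(
                    response.headers['Content-Range'].split('/')[-1])
                start += chunksize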

Fred Park - Senior Software Engineer, Azure Big Compute

Alan Stephenson - Program Manager, Azure Big Compute