Rethinking Enterprise Storage – Archiving Data with the Hybrid Cloud

03/18/2015

Recently, Microsoft published a book titled Rethinking Enterprise Storage – A Hybrid Cloud Model. The book takes a close look at an innovative storage architecture called hybrid cloud storage.

Last week we published excerpts from Chapter 4. This week we provide an excerpt from Chapter 5, Archiving Data with the Hybrid Cloud. Over the next several weeks on this blog, we will continue to publish excerpts from each chapter of this book via a series of posts. We think this is valuable information for all IT professionals, from executives responsible for determining IT strategies to administrators who manage systems and storage. We would love to hear from you and we encourage your comments, questions and suggestions.

As you read this material, we also want to remind you that the Microsoft StorSimple 8000 series provides customers with an innovative, game-changing hybrid cloud storage architecture, and it is quickly becoming a standard for many global corporations that are deploying hybrid cloud storage. You can learn more about the StorSimple 8000 series here: https://www.microsoft.com/storsimple

Here are the chapters we will excerpt in this blog series:

Chapter 1 Rethinking enterprise storage

Chapter 2 Leapfrogging backup with cloud snapshots

Chapter 3 Accelerating and broadening disaster recovery protection

Chapter 4 Taming the capacity monster

Chapter 5 Archiving data with the hybrid cloud

Chapter 6 Putting all the pieces together

Chapter 7 Imagining the possibilities with hybrid cloud storage

That’s a Wrap! Summary and glossary of terms for hybrid cloud storage

So, without further ado, here is an excerpt from Chapter 5 of Rethinking Enterprise Storage – A Hybrid Cloud Model.

Chapter 5 Archiving data with the hybrid cloud

In this chapter the word archive means to store digital business records for an extended period of time. Organizations depend on their IT teams to find and restore data that was archived for historical purposes in order to recall the conditions, discussions, decisions, and results of the past. Digital archiving is required in many industries to comply with government regulations for storing financial, customer, and patient information. For example, the healthcare industry is required to keep patient records for many years in order to inform future health care providers of a patient’s history of conditions, diagnoses, and treatments. Many businesses that do not have explicit regulations governing digital archiving have defined internal policies and best practices that archive data for legal reasons because courts expect companies to produce internal records when they are requested. As businesses, governments, societies, and individuals increase their dependence on data, archiving it becomes more important.

This chapter discusses the technologies used for digital archiving and describes how the Microsoft hybrid cloud solution (HCS) can be used to archive data to Microsoft Azure Storage.

Digital archiving and electronic discovery

There are two different use cases for digital archiving. The first is to create a repository of data that has intrinsic value and that people are interested in accessing. Libraries are excellent examples of archiving repositories that contain all sorts of information, including research data or documents that students and scientists may need to reference as part of their work. This form of digital archiving has become one of the most important elements of library science, with specialized requirements for very long-term data storage (think millenniums) and methods which are beyond the scope of this book.

The other use case for digital archiving is for business purposes and is the subject of this chapter. Digital archiving in the business context is one of the most challenging management practices in all of IT because it attempts to apply legal and compliance requirements over a large and growing amount of unstructured data. Decisions have to be made about what data to archive, how long to keep the data that is archived, how to dispose of archived data that is no longer needed, what performance or access goals are needed, and where and how to store it all.

Like disaster recovery (DR), archiving for legal and compliance purposes is a cost without revenue potential. For that reason, companies tend to limit their expenditures on archiving technology without hindering their ability to produce documents when asked for them. There are other reasons to archive business data but, in general, business archiving is closely tied to compliance and legal agendas.

The ability to find and access archived data tends to be a big problem. Storing dormant data safely, securely, and affordably for long periods of time is at odds with being able to quickly find specific files and records that are pertinent to unanticipated future queries. The selection of the storage technology used to store archived data has big implications for the long-term cost of archiving and the service levels the IT team will be able to provide.

Compliance and legal requirements have driven the development of electronic discovery (eDiscovery) software that is used to quickly search for data that may be relevant to an inquiry or case. Courts expect organizations to comply with orders to produce documents and have not shown much tolerance for technology-related delays. Due to the costs incurred in court cases, eDiscovery search and retrieval requirements are often given higher priority than storage management requirements. In other words, despite the desire to limit the costs of archiving, in some businesses, the cost of archival storage is relatively high, especially when one considers that the best case scenario is one where archived data is never accessed again.

Complete archiving solutions often combine long-term, archival storage with eDiscovery software, but there is a great deal of variation in the ways archiving is implemented. Many companies shun eDiscovery software because they can’t find a solution that fits their needs or they don’t want to pay for it. Unlike backup, where best practices are fairly common across all types of organizations, archiving best practices depend on applicable regulations and the experiences and opinions of an organization’s business leadership and legal team. Digital archiving is a technology area that is likely to change significantly with the development of cloud storage archiving tools in the years to come.

Protecting privacy and ensuring integrity and availability

Despite the relative importance of eDiscovery, it is not always the most important consideration in a digital archiving solution—protecting the privacy of individuals is. Privacy concerns extend to all forms of data storage, but archives have been explicitly addressed in regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States. In Europe, the Data Protection Directive covers all forms of stored data, including archives.

Encryption technology is commonly used to protect private, archived data from theft when it is in-flight or stored in the public cloud. That means IT teams need to consider the encryption features of their digital archiving solutions, including how encryption keys are managed in their storage solutions.

The integrity of archived data also needs to be ensured. Archival systems should be able to determine if the data being read is the same as the data that was written. The fact that data has been stored for an extended period of time makes it more susceptible to errors introduced by the physical degradation of stored data, sometimes called bit rot.
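One common way to check that data read back matches what was written is to record a cryptographic digest at write time and compare it on every read. The following is a minimal sketch of that idea using SHA-256 from the Python standard library. The function names and the sample record are illustrative assumptions, and the sketch does not describe how any particular archival product performs its integrity checks.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_record(data: bytes, recorded_digest: str) -> bool:
    """Return True if the data read back matches the digest recorded at write time."""
    return sha256_digest(data) == recorded_digest


# Hypothetical archived record and the digest stored alongside it at write time.
original = b"patient record 0001"
digest_at_write = sha256_digest(original)

# Simulate bit rot by flipping a single bit in the first byte.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

intact_ok = verify_record(original, digest_at_write)
corrupted_ok = verify_record(corrupted, digest_at_write)
```

Even a single flipped bit changes the digest, so the corrupted copy fails verification while the intact copy passes.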

Some regulations for archiving require that redundant copies of archived data are stored in geographically remote locations to protect them from disasters. This can be done by making copies of archive tapes, taking regular backups of archive systems, replicating archive data to a remote site, or using cloud snapshots with Microsoft HCS. All remote copies need to meet the requirements for privacy and integrity.

Policies for managing data archives

One of the most common ways to manage data archives is by implementing policies and rules for data. Policies can be used to determine what data is selected for archiving and how long or where the archived data will be retained. For example, a policy could be established to retain all the data placed in a special archive folder for a minimum of ten years.
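The ten-year folder policy in the example above can be expressed as a simple rule. The sketch below is only an illustration: the folder path /archive, the function names, and the handling of February 29 are assumptions made for this example, not features of any specific archiving tool.

```python
from datetime import date
from pathlib import PurePosixPath

ARCHIVE_FOLDER = PurePosixPath("/archive")  # hypothetical archive folder
RETENTION_YEARS = 10                        # minimum retention from the example policy


def earliest_disposal_date(archived_on: date, years: int = RETENTION_YEARS) -> date:
    """Return the first date on which the policy allows disposal."""
    try:
        return archived_on.replace(year=archived_on.year + years)
    except ValueError:
        # archived_on is Feb 29 and the target year is not a leap year;
        # this sketch assumes the policy rolls forward to Mar 1.
        return date(archived_on.year + years, 3, 1)


def is_under_retention(path: str, archived_on: date, today: date) -> bool:
    """Return True if the file is in the archive folder and still must be retained."""
    if ARCHIVE_FOLDER not in PurePosixPath(path).parents:
        return False  # the policy covers only the archive folder
    return today < earliest_disposal_date(archived_on)
```

For instance, a file archived under /archive on March 18, 2015 would remain under retention until March 18, 2025, while a file outside that folder is not covered by this rule at all.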

Most IT teams enforce data archiving policies through automated tools in their backup, archiving and eDiscovery solutions. Automation removes most of the human errors that could result in the loss of archived data and establishes the intent to comply with regulations. Corporate IT auditors will typically look to see that policies exist and are implemented to comply with applicable regulations.

Storage options for data archives

IT teams have chosen between tape and disk storage for storing archived data based on cost and the ease with which archived data can be located and accessed. Cloud storage is now being looked at as an alternative. The sections that follow compare these three options.

Archiving to tape

IT teams often choose tape for storing digital archives because it is the least expensive option. Tapes can be stored for long periods of time with minimal operating costs, although they should be stored in low-humidity, air-conditioned facilities and be maintained by periodically rewinding them and checking their error rates, which may necessitate making new copies.

While data transfer speeds for tape are very good, the time it takes to retrieve tapes from an off-site facility is not. In addition, if no disk-resident archive index exists to identify the files or messages with pertinent data and map them to individual tapes, the process of finding archived information that may have been stored across a large number of tapes can take days or weeks. In this case, the best approach may be to restore the contents of archive tapes to a temporary disk storage volume and then search the data there.

Depending on the amount of disk capacity available, this process may need to be repeated multiple times, clearing out the capacity of the temporary search volume each time and refilling it with data from different archive tapes. The money saved using tapes for archiving can be offset by lengthy data searches through the archives. Courts have not shown much patience in these matters and have levied expensive penalties on businesses that have not been able to produce data in a timely manner.
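As a rough illustration of how often the restore-and-search cycle repeats, the number of passes is the archive size divided by the capacity of the temporary volume, rounded up. The figures below (120 TB of archive tapes and a 16 TB temporary volume) are hypothetical and chosen only to show the arithmetic.

```python
def restore_passes(archive_tb: int, scratch_volume_tb: int) -> int:
    """Return how many times the temporary volume must be filled to search the whole archive."""
    if scratch_volume_tb <= 0:
        raise ValueError("temporary volume capacity must be positive")
    # Ceiling division using integers only.
    return -(-archive_tb // scratch_volume_tb)


# Hypothetical sizes: 120 TB on tape, 16 TB of temporary disk capacity.
passes = restore_passes(archive_tb=120, scratch_volume_tb=16)
```

With these assumed sizes the volume would have to be filled, searched, and cleared eight times, which is where much of the elapsed time in a tape-based search goes.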

Archiving to disk

High-capacity disk systems are also used for archiving, even though they are much more expensive to operate than tape. A feature, referred to as drive spindown, has been incorporated into some disk systems to reduce power and cooling costs by selectively stopping individual disk drives in the storage system. When data is needed on drives that are spun down, the system starts them again and reads the data. The problem with spindown technology is that disk drives are generally not made to be powered on and off and sometimes they do not respond as expected. Application performance can also be erratic.

There is no question that disk is superior to tape for searching with eDiscovery solutions. The immediate access to files and the ability to search both production data and online archived data on disk saves everybody involved a great deal of time—which is a big deal to corporate legal teams. However, disk-based archiving still requires some form of disaster protection, which is usually tape, and all the overhead related to data protection, including administrative time, equipment, media, and facilities costs.

Archiving to cloud storage

Cloud storage is another option for archiving data that will evolve in the years to come with new cloud storage services. There are many variables to consider with cloud-based data archiving, including the type of interfaces used and the level of integration with on-premises storage. For example, cloud storage for archiving could be achieved by making it look like virtual tape in the cloud for backup, as discussed in the section titled “Comparing recovery times with cloud storage as virtual tape” in Chapter 3, “Accelerating and broadening disaster recovery protection.” In general, disk-resident indices that are accessed on-premises to identify data objects in the cloud should be used for the same reasons they should be used with tape.
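To make the idea of an on-premises index concrete, the sketch below builds a small keyword index that maps search terms to the names of objects stored in the cloud, so that a search can identify the objects to retrieve without reading the archive itself. The catalog contents, object names, and function names are all hypothetical; a real eDiscovery index would be far richer.

```python
from collections import defaultdict


def build_index(catalog):
    """Map each lower-cased keyword to the set of object names tagged with it."""
    index = defaultdict(set)
    for object_name, keywords in catalog.items():
        for keyword in keywords:
            index[keyword.lower()].add(object_name)
    return index


def lookup(index, *terms):
    """Return the sorted object names that match every search term."""
    matches = [index.get(term.lower(), set()) for term in terms]
    if not matches:
        return []
    return sorted(set.intersection(*matches))


# Hypothetical catalog: cloud object name -> keywords extracted at archive time.
catalog = {
    "archive/2012/contract-017.pdf": ["contract", "acme"],
    "archive/2013/email-0042.eml": ["acme", "invoice"],
    "archive/2013/memo-7.docx": ["invoice"],
}
index = build_index(catalog)
hits = lookup(index, "acme", "invoice")
```

Only the objects named in the result would then need to be downloaded from cloud storage, which keeps the search itself local and fast.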

Another important consideration for using cloud storage for archival and compliance purposes is the documentation that is required by regulations. A cloud solution for corporate compliance needs to meet fairly strict guidelines to be a valid solution.

The remainder of this chapter examines Microsoft HCS as storage for long-term data archives using Microsoft Azure Storage.

Archiving with the Microsoft HCS solution

IT professionals are accustomed to thinking about archiving and backup as two related but different tasks and practices. The data and media used are often managed and maintained separately. Confusion over tapes for backup and archive can result in archive tapes being overwritten by backup processes and unexpected problems during recoveries. The media and equipment for tape backup and tape archiving might be similar, but the practices for both are decidedly different.

In contrast, Microsoft HCS automates both archiving and backup using cloud snapshots to upload fingerprints to Microsoft Azure Storage. Cloud snapshots used for backups typically expire in a few days to a few months, but cloud snapshots used for archiving may expire many years in the future, depending on compliance and governance requirements for archives.

Data archiving with Microsoft Azure Storage

Microsoft Azure Storage is being used successfully for data archiving. One of the most obvious advantages is that data stored there is off-site, but online, combining remote protection against site disasters with immediate access. Combined with the geo-replication service, Microsoft Azure Storage makes it significantly easier for IT teams to comply with regulations that mandate multisite disaster protection for archived data.

Archived data can be uploaded or downloaded from multiple corporate locations, enabling IT teams to flexibly design archiving workflows while simultaneously providing a centralized repository for accessing and exchanging data archives. Consolidating archives in Microsoft Azure Storage simplifies management of archived data by reducing the number of variables involved, including security management and encryption keys for all stored data.

As mentioned previously in the section “Archiving to cloud storage,” it is recommended that data archived to the cloud for long-term storage be searchable through on-premises, disk-resident indices.

Compliance advantages of Microsoft Azure Storage

Compliance with regulations can be complicated, especially the documentation that is required by auditors. Microsoft Azure has a website called the Trust Center that has information about compliance topics related to Microsoft Azure, including the HIPAA Business Associate Agreement (BAA), ISO/IEC 27001:2005 certification, and SSAE 16 / ISAE 3402 attestation. Microsoft Azure Storage services are named features for these compliance documents. The URL for this site is: https://www.windowsazure.com/en-us/support/trust-center/compliance/.

Integrated archiving with the Microsoft HCS solution

Long-term storage for archived data is an integrated feature of the Microsoft HCS solution. Data can be kept for an extended period of time on Microsoft Azure Storage simply by configuring cloud snapshot operations for that purpose. The next section, “A closer look at data retention policies with the Microsoft HCS solution,” describes how this is done.

An important advantage of archiving with the Microsoft HCS solution is that archived data on Microsoft Azure Storage is viewable on-premises by scanning folders or mounting cloud snapshots. The details of how this works depend on how archiving is implemented, either by archiving data in place or by copying data to archive folders.

Archiving data in place provides default archival storage for the contents of a storage volume. It is essentially the same as backing up data with cloud snapshots, but with extended data retention policies for storing data in the cloud. IT team members can view data that was archived in place and later deleted from primary storage by mounting cloud snapshots that were taken before the data was deleted. For instance, a cloud snapshot with a data retention policy of five years could be mounted to look for archived data that was deleted from primary storage any time in the last five years.
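The selection logic described above can be sketched as a filter: a cloud snapshot is worth mounting if it was taken before the data was deleted and its retention period has not yet expired. The CloudSnapshot class, its fields, and the dates below are assumptions made for illustration and do not reflect the actual Microsoft HCS data model or interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CloudSnapshot:
    """Hypothetical record of a cloud snapshot and its retention expiry."""
    taken_on: date
    expires_on: date


def candidate_snapshots(snapshots, deleted_on, today):
    """Return snapshots taken before the deletion that have not yet expired."""
    return [
        s for s in snapshots
        if s.taken_on < deleted_on and s.expires_on > today
    ]


# Hypothetical snapshots, each with a five-year retention policy.
snapshots = [
    CloudSnapshot(date(2011, 1, 1), date(2016, 1, 1)),
    CloudSnapshot(date(2013, 6, 1), date(2018, 6, 1)),
    CloudSnapshot(date(2014, 9, 1), date(2019, 9, 1)),
]
found = candidate_snapshots(snapshots, deleted_on=date(2014, 1, 15), today=date(2017, 3, 1))
```

In this example only the 2013 snapshot qualifies: the 2011 snapshot has already expired, and the late-2014 snapshot was taken after the data was deleted.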

Often IT teams want to create special folders for archive data. Containerizing archives this way may be required by archive software or best practices designed to enforce special treatment of archived data. For instance, an archive volume could be established with long-term data retention policies so that data written to it would be protected, long-term, by the next cloud snapshot process that runs.

To learn more about Microsoft HCS as storage for long-term data archives using Microsoft Azure Storage and more about StorSimple, visit https://www.microsoft.com/storsimple and be sure to download your copy of Rethinking Enterprise Storage: A Hybrid Cloud Model.