Storing Unstructured Data – From file servers to cloud services

After joining the Storage Solutions Division at Microsoft, I was exposed to many challenges that were not so familiar to me before. One of them is how we store and manage unstructured data, including things like file servers, NAS devices, document management systems and blob storage solutions. Here’s my initial attempt to cover a little of this area’s history and summarize its main issues.

As the personal computer gained popularity, a lot of data started being stored in files, like text documents and spreadsheets. Those personal computers were eventually networked together and started sharing those files. In the ’80s, file servers stored those files for a small group of computers, usually organized under a folder hierarchy. Unstructured data existed side by side with database servers, which stored data as sets of tables connected by key fields. The personal computer evolved to store more complex documents, like presentations, long manuals, diagrams, messages, pictures and video. Over time, more storage became available and networks got fast enough to allow all of that to be shared. File servers got larger hard drives and faster network interfaces. They also improved in terms of security, fault tolerance and scalability.

With the popularity of the internet, the file server went beyond the company borders. Standard protocols that allow different types of computers to transfer large files to and from remote servers became very popular. Companies adopted these protocols for internal use as well. Streaming media files also became popular, with specific protocols to handle the gradual transfer of very large media files as the clients played them. Files were now being served by old-style file servers, web servers and media servers, each with a different approach to how the data got transferred to and from them.

File formats are another source of issues. Many document formats are proprietary, and that can make it challenging to properly index the documents. Long term, you might also not have the right tool to open a specific file format. I recently found some old documents I created with a desktop publishing tool I no longer have on my system, and I simply could not access them. HTML helps in some scenarios, but there is a need to also migrate to standard document formats like Open XML, which is clearly better than older proprietary file formats. Speaking of standards, the protocol for accessing those files is another item that evolved over time. If you’ve configured file servers before, you have probably heard of protocols like FTP, HTTP, SMB, WebDAV, CIFS and NFS. Most file servers and NAS devices will support a combination of those protocols to allow clients to get access to the files and documents.

With the ability to store all these documents over long periods of time came the need to index them to facilitate search. Unlike traditional relational databases, these document stores required a completely different way of indexing, usually starting with breaking the readable text in the documents down into keywords and indexing those. The document metadata (if available) would also be good data to index. Searching large pools of documents is not unlike searching the Internet. You crawl the source for all pages (documents), break down the content and then store the results in a more traditional index linking keywords back to the original URL. Some types of documents, like pictures, music and videos, are especially hard to index, since it’s not easy to extract searchable tidbits from them, especially if you have no additional metadata available.
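To make that more concrete, here’s a minimal sketch (in Python, with made-up names) of the kind of keyword index I’m describing, assuming the readable text has already been extracted from each document. A real indexer would also handle stemming, stop words, ranking and metadata fields.

    # Minimal inverted index: keywords map back to the documents that contain them.
    import re
    from collections import defaultdict

    def build_index(documents):
        """documents: dict mapping a document URL/path to its extracted text."""
        index = defaultdict(set)
        for url, text in documents.items():
            # Break the readable text down into lowercase keywords.
            for keyword in re.findall(r"[a-z0-9]+", text.lower()):
                index[keyword].add(url)
        return index

    def search(index, keyword):
        """Return the URLs of all documents containing the keyword."""
        return index.get(keyword.lower(), set())

    docs = {
        "//server/share/manual.txt": "Installation manual for the file server",
        "//server/share/notes.txt": "Meeting notes about the NAS migration",
    }
    index = build_index(docs)
    print(search(index, "server"))   # {'//server/share/manual.txt'}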

One additional category of server is the document management system. These servers include additional features like storing the history of changes to documents or allowing a document to be checked out (locked for editing by a user to avoid conflicts). These systems can also include additional metadata about the documents, like project name, keywords, author, reviewer, approver and the associated dates. Some systems will handle the workflows and help you with the lifecycle of the document, from creation to approval to publishing to archiving. Some highly regulated industries also require additional control of that lifecycle, including the guarantee that a specific document, once approved and/or sent to a third party, will be archived for a specific time.
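Here’s a rough sketch, again in Python with purely illustrative names, of how check-out/check-in and version history might hang together. It’s an in-memory toy, not how any particular document management product implements it.

    # Toy document store: a check-out lock per document plus a simple version history.
    class DocumentStore:
        def __init__(self):
            self._versions = {}     # doc_id -> list of (author, content) tuples
            self._checked_out = {}  # doc_id -> user currently holding the lock

        def check_out(self, doc_id, user):
            # Lock the document so only one user edits it at a time.
            if doc_id in self._checked_out:
                raise RuntimeError(doc_id + " is locked by " + self._checked_out[doc_id])
            self._checked_out[doc_id] = user

        def check_in(self, doc_id, user, content):
            # Release the lock and record a new version in the history.
            if self._checked_out.get(doc_id) != user:
                raise RuntimeError(doc_id + " is not checked out by " + user)
            self._versions.setdefault(doc_id, []).append((user, content))
            del self._checked_out[doc_id]

        def history(self, doc_id):
            return self._versions.get(doc_id, [])

    store = DocumentStore()
    store.check_out("spec.doc", "alice")
    store.check_in("spec.doc", "alice", "first draft")
    print(store.history("spec.doc"))   # [('alice', 'first draft')]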

Another important trend is the regulatory requirements around document lifecycle. It is increasingly common to have laws that require companies to retain data and/or dispose of it in a specific way, like SOX (the United States’ Sarbanes-Oxley Act of 2002) and HIPAA (the United States’ Health Insurance Portability and Accountability Act of 1996). Records management in general is largely driven by these types of regulation. This applies to both structured data (like records in the Human Resources, Accounting or Customer Relationship Management system) and unstructured data (like e-mails and spreadsheets). Because of that, there is an increasing need to manage the information lifecycle and properly classify data (so that important and/or regulated documents can be treated differently).

As the data stored in these systems becomes vital, you also want to make sure you can back up and restore the data properly. Traditional backup to tape (and tape libraries) is alive and well, but it now competes with devices that work like tape drives but actually store only to disks, called virtual tape libraries. These VTL systems can replace tape in most cases. The one exception is the fact that you can’t ship a “virtual tape” to another location in a truck. However, you can use replication (or something similar) to ship the data to another location. The other difference is that you can always buy more tape media, while a VTL has a finite number of disks that will fit in an enclosure. With the price of disks going down, many are just buying enough disk to store many times the amount of data and using snapshot techniques to create backup copies. You could also keep a first copy on disk (for quicker recovery) and take an additional copy to tape later, as a second step. This is usually referred to as D2D2T backup, as opposed to traditional D2T or D2D backups. Many are actually bypassing tape and VTL systems altogether and backing up only to disk, replicating to another remote disk device for disaster recovery.

Going beyond traditional backup, document management systems will often provide features to allow a file to be reverted to a previous version. With this, restoring to a previous state can usually be done by end users, especially when you can check who made the changes and when. Requiring a user to provide a specific comment with each new version saved helps. However, file servers and NAS servers will commonly implement a different type of versioning that requires no user action, simply allowing a user to revert the file to a previous date. Many systems will keep every version ever stored, in fact storing just the differences between versions. Some of these systems will ship a copy of the file changes to a remote server, allowing the files to be restored in the event of a complete disaster or simply serving as a remote archival system. These are usually referred to as replication systems or continuous data protection systems.
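To illustrate the idea of keeping every version by storing just the differences, here’s a small Python sketch built on the standard difflib module. It keeps the first version in full and one delta per later version; real versioning systems work at the block level and also track timestamps and authors.

    # Store a file as its original lines plus a chain of text deltas.
    import difflib

    class VersionedFile:
        def __init__(self, initial_lines):
            self._base = list(initial_lines)   # the first version is kept in full
            self._deltas = []                  # one ndiff delta per later version

        def save(self, new_lines):
            current = self.version(len(self._deltas))
            self._deltas.append(list(difflib.ndiff(current, list(new_lines))))

        def version(self, n):
            # Reconstruct version n (0 = the original) by replaying the stored deltas.
            lines = list(self._base)
            for delta in self._deltas[:n]:
                lines = list(difflib.restore(delta, 2))   # take the "after" side
            return lines

    doc = VersionedFile(["hello world\n"])
    doc.save(["hello world\n", "second line\n"])
    print(doc.version(0))   # ['hello world\n']
    print(doc.version(1))   # ['hello world\n', 'second line\n']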

Large enterprises will also typically end up with a large number of file servers and NAS devices, each one with its own name, folder structure and, many times, different solutions in terms of security, high availability, backup, etc. Since these different file stores are typically disconnected, there are numerous issues around indexing them, managing them and making sure they scale properly. It’s also pretty common to end up with some servers being underutilized and others being overloaded. This led corporations to consolidate their file and NAS solutions, migrating the documents to new, centralized, fault-tolerant servers. That also comes with challenges, as the path to the documents changes. If the initial design for the document store did not include a permanent ID, users are left with broken network file paths and URLs. You can try tricking clients via DNS updates, using some sort of redirection, adding links/stubs to the new consolidated locations or using your search tool to find documents after the move, but there is clearly a need to think about how you link to documents in the first place, especially if you plan to keep them around for a long time.

Consolidation paradise is having a single system that stores all the files in the company. This solution needs to be secure, fault-tolerant, highly available and should easily scale with the company’s storage needs. This is obviously not a single-server solution and might not even be located in a single data center. It will likely be a set of servers that are managed as a single entity. It must also include a way to uniquely identify the data blobs, usually by some sort of GUID that will be used as the access key. Many vendors in this space already offer some sort of fixed content storage or virtualized file storage. There’s also an effort by SNIA (the Storage Networking Industry Association) to create a standard API for accessing this type of storage, called XAM (eXtensible Access Method).
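Here’s a tiny sketch of what addressing blobs by GUID instead of by path could look like (in Python, in memory, with names I made up; this is not the XAM API or any vendor’s interface).

    # Fixed-content store: put() assigns a GUID, and that GUID is the permanent key.
    import uuid

    class BlobStore:
        def __init__(self):
            self._blobs = {}

        def put(self, data):
            # Assign a GUID and hand it back as the permanent access key.
            blob_id = str(uuid.uuid4())
            self._blobs[blob_id] = data
            return blob_id

        def get(self, blob_id):
            return self._blobs[blob_id]

    store = BlobStore()
    key = store.put(b"quarterly report")
    print(key, store.get(key))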

Even if you do manage to consolidate everything, you cannot treat every file and document the same way. Tagging every item can be a very hard proposition, though. In general, you can try to properly mark what is considered “high business impact”. That’s the kind of information that, if leaked to the Internet, would impact your company’s stock price. You also want to make sure that your company complies with government regulations, if you are under SOX or HIPAA, for example. There are also some smart ways to save money if you can tell which documents will rarely be accessed and/or will never change, like using compression. You might even consider storing some of the less important documents on cheaper storage or taking some of it offline. In the unlikely event a user needs one of those, they might have to wait a few hours for a tape restore. That’s the basis for hierarchical storage management (HSM) systems. Some tagging is fairly easy to do, like classifying files based on age by looking at the last change date. With millions of documents out there already, tagging every one by hand is usually not feasible. However, if your file system keeps additional metadata, like author or project, more of it can be done automatically.
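The age-based classification I mentioned is simple enough to sketch. The snippet below (Python, illustrative only, with a made-up archive_after_days threshold) walks a directory tree and splits files into active files and archive candidates based on the last change date; a real HSM system would then migrate the cold files to a cheaper tier.

    # Classify files by age using the last-modified timestamp as a proxy for "hotness".
    import os, time

    def classify_by_age(root, archive_after_days=365):
        """Walk a directory tree and split files into hot files and archive candidates."""
        cutoff = time.time() - archive_after_days * 86400
        hot, archive = [], []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Files untouched for longer than the cutoff go to cheaper storage.
                (archive if os.path.getmtime(path) < cutoff else hot).append(path)
        return hot, archive

    hot, archive = classify_by_age(".")
    print(len(hot), "active files,", len(archive), "archive candidates")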

Another way to save space is making sure you’re not wasting it by storing multiple copies of the same file. A centralized or virtualized file storage system opens the door to storing duplicate files just once, while keeping this transparent to the end user. This is usually referred to as single-instance storage or deduplication. It’s not as hard as it seems, since there are fairly established hashing algorithms that can tell if two files are the same. This sounds great, but it can also waste a lot of CPU cycles on your storage server. The real trick is to dedup efficiently, like looking at pieces of files instead of entire documents and finding more optimized ways of processing so users don’t perceive any delays.
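Here’s a minimal sketch of the single-instance idea, using SHA-256 content hashes so that identical content is physically stored only once. Real deduplication engines work on chunks of files, index the hashes efficiently and worry about the (very small) possibility of hash collisions.

    # Single-instance store: identical content hashes to the same digest and is kept once.
    import hashlib

    class DedupStore:
        def __init__(self):
            self._chunks = {}   # digest -> content, stored exactly once
            self._files = {}    # file name -> digest of its content

        def add(self, name, data):
            digest = hashlib.sha256(data).hexdigest()
            self._chunks.setdefault(digest, data)   # duplicate content is not stored again
            self._files[name] = digest

        def read(self, name):
            return self._chunks[self._files[name]]

    store = DedupStore()
    store.add("report_v1.doc", b"same bytes")
    store.add("copy of report_v1.doc", b"same bytes")
    print(len(store._chunks))   # 1 - only one physical copy is kept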

Smaller companies are also being tempted to host the storage of unstructured data with an Internet-based service provider as opposed to implementing it with their own in-house servers. This vision is becoming more realistic as many large players are starting to offer a complete set of cloud services. For larger corporations, this is probably not a good way to store your high business impact documents, but it could be a part of the overall solution, maybe covering your archival needs or a tier of your HSM system.

As you can see, storing unstructured data can get quite involved. It’s also something that is clearly still evolving and certainly in high demand. It involves technologies that are not as established as relational databases and you should continue to see a lot of activity in this area in the coming years.