Documents in Databases
The first time I tell anyone that SharePoint Server 2007 stores documents as blobs (binary objects) in the database, people stare at me for a few seconds and usually ask: “Really?” It takes a few moments to explain that this is actually a good thing and that SQL has no problem handling those types of jobs.
Separating Files and Metadata
In fact, I have worked with other document management systems out there and I understand that many of them save only the document metadata in a database, leaving the actual documents in a regular file system somewhere. This usually sounds OK at first, but it creates trouble in the long run.
The main issue is keeping the database and the file system in synch. If you upload a file, for instance, you need to update the database and the file system in a single atomic operation, so that you never end up with orphaned database entries and/or files.
It is also very challenging to backup those systems, since you need to make sure the database backup and file system backup are in the same state. You see, by the time you’re finished the database backup, someone could have updated the file system (or vice versa). The only have truly consistent backup is to somehow freeze the two stores during the operation, but that usually means you either need downtime or a snapshot ability (usually available in SAN storage) that applies simultaneously to the database and the file system.
The other issue is where you keep those files. You might be tempted to have separate sets of database servers and file servers. Maybe you even use a NAS server for the files, which will further complicate your backup synchronization issues (this increases the possibility of having separate teams and tools for backing up the database and the file system). Or worst of all, you might fall in the trap of storing the files in the application servers, as if the files were not really data. Lots of trouble there as well, especially when you need to have multiple application servers in a scale-out solution.
It’s all in the Databases
With SharePoint, all you need to do is backup the content databases. They include both the documents and the metadata. No struggle there. You can even backup the system while it’s happily serving users. And, because this is a database, you’re never running into the consistency issues I mentioned earlier.
You also do not need to worry if you setup a system with multiple front-end web servers. Since they all connect to the same back-end database for everything (including configuration), it is pretty simple to add more web servers or indexing servers at the application layer.
Most people will like that idea, particularly if they run a small or medium size company where the amount of data is, say, under a terabyte. They will typically configure the database and application layers, have a single database server (or maybe two servers in a cluster, for fault tolerance) and one application server (or maybe two or three, using Network Load Balancing).
However, if you’re dealing with a larger enterprise, two things happen. First of all, the amount of data grows into multiple terabytes. Second, the teams are usually spread across multiple locations (if not multiple continents).
At this point, I take the time to explain that this is not a problem. We can easily set up SharePoint to work in such distributed environments and having terabytes of information is no problem either. Then there’s always someone that was not really paying attention to the details and will ask: “So you’re saying we can have a single, multi-terabyte, replicating SharePoint database?”
When they do say that out loud, it’s great! It gives me a chance to talk about the database architecture for SharePoint. The problem is when they just think about asking and leave the room just thinking I’m crazy or not exactly truthful :-). If only they let me explain myself…
It’s not a single database
I’m not saying SQL cannot actually have a single, multi-terabyte database. It certainly can and there are plenty of case studies to prove it. You could also use replication with those very large databases, but you typically do not want to replicate terabytes over a WAN without a lot of planning. That’s when you start learning about the different types of replication, partitioning of data and the like.
However, that’s not really what you do with SharePoint databases, even if you decide to put it all in the same database server.
You see, even the simplest of the SharePoint Server 2007 installations will use at least 8 databases (see other post on this blog). If you plan to host a lot of content (let’s say, 5 TB), you will likely separate that content into multiple content databases (like 10 databases with 500 GB each, or 100 databases with 50 GB each). This is a good idea, since it will facilitate your backup procedure, among other things. Some people like full database backups that fit in a single physical tape for some reason. 🙂
SharePoint natively supports the concept of spreading the content for a web application across multiple content databases and this is actually very simple to configure.
Also, if you’re hosting large amounts of SharePoint content of multiple types (like “My Sites”, team sites and a corporate portal), it’s probably the case of having multiple SharePoint farms that work together in a hierarchy. The corporate portal becomes the parent farm and the other SharePoint farms can be set up as children, consuming Shared Services (like Search or Profile Services) from the parent.
Each one of these SharePoint Server 2007 farms (even when they have this parent-child relationships) have their own databases (again, 8 or more for each). They could all share the same database server (if they all live in the same data center) or they could each have their own dedicated SQL Server cluster.
If you’re working in such complex SharePoint projects, you should have a good Architect working with you to figure out the best solution. It can get quite interesting.
SharePoint over a WAN
Another typical scenario for SharePoint farms is when you have multiple locations separated by a WAN. In this case, you also will end up with multiple SharePoint farms. However, this case needs careful handling.
First of all, each farm will need its own database server, since you’re not supposed to have your SharePoint application servers talking to your SQL server over a WAN. Second, you’re not going to be creating parent-child relationships either. These are not supported over a WAN in SharePoint Server 2007.
Your clients, however, will be able to access SharePoint over the WAN, since the client-server communication is all HTTP-based and that protocol can tolerate the latency.
Another thing you can do over the WAN is indexing and searching. It’s fairly simple to set up a content source for each of you remote farms in the corporate SharePoint Search portal. This way, you can have a single Search tool to find all SharePoint content in the entire enterprise.
As you see, once again, careful design and planning is required.
We’re not really replicating the databases
One thing you’re not doing here, though, is replication. Each SharePoint farm is “independently owned and operated” and their databases do not replicate. All cross-farm communications happen at the application layer, either with the parent-child relationship for Shared Services within the same LAN or with the Indexing across the WAN scenario.
That’s true even in the most complex scenarios with a corporate portal, child farms with different types of content and remote farms being indexed. It’s all about multiple different implementations of SharePoint, and there’s no linking between them at the database layer.
However, when we talk about SharePoint Server and we explain that the documents are all in the database, that we can have terabytes of documents, that we can have multiple farms that talk to each other and that each farm can be in a different continent, people sometimes visualize this single, multi-terabyte, replicating SharePoint database.
The fact is that this is just a legend, a myth. Or, better yet, it is an abstraction created by layers of software.
From a large Enterprise user standpoint, there is a single portal to search for terabytes of documents all over the globe. However, the resulting URL for the document they find will take them to one of many farms and the actual document will be retrieved from one of many, many databases.
You can compare this with the Exchange Server implementation in most large enterprises. You can usually calculate that, with 10,000 users each with a 1 GB mailbox quota, the Exchange database would be over 10 TB in size. However, this is actually divided into many smaller databases in separate mailbox servers.