One of the factors that will determine the ongoing success of the enterprise search solution is whether you are able to plan for and specify disk space requirements for full-text catalog files and search databases. These space requirements are affected by many factors, including the characteristics and size of the corpus being indexed.
The relationship between a corpus and the disk space requirement for full-text catalog files is very complex. Although the relationship is generally governed by corpus size, there are many other factors that can cause considerable variation in this relationship. Furthermore, there is a complex relationship between a corpus and the disk space requirement for the search database.
The most apparent (although by no means only) factor that governs the full-text catalog space requirements is the size of the corpus. Therefore, you must attempt to determine the corpus size before you calculate disk space requirements.
Measuring Corpus Size
You can attempt to measure the corpus size by adding up the size of all files and other items to be indexed. For example, you can investigate the disk space used in file shares, and you can retrieve the file sizes for Microsoft® Office SharePoint® Server 2007 items and other content. However, because content sizes for similar files vary by system, this approach may yield misleading data. For example, if you measure SharePoint content database sizes, the figures can vary depending on your versioning strategy for documents and other items. Also, this approach can be unnecessarily time-consuming.
Estimating Corpus Size
A more typical, and probably more robust and manageable, approach is to estimate corpus size rather than measuring it. You can simply estimate corpus size by:
· Categorizing the different content forms, such as files, Web pages, list items, database items, and so on.
· Multiplying the average size of each content form by the number of items in each form, to obtain size estimates of each content form.
· Adding together all of the size estimates.
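The three estimation steps above can be sketched in a few lines of Python. This is only an illustration: the content forms, item counts, and average sizes are hypothetical placeholders, and you would substitute figures gathered from your own environment.

```python
from dataclasses import dataclass


@dataclass
class ContentForm:
    """One category of content to be indexed (all values are assumptions)."""
    name: str
    item_count: int
    avg_size_bytes: int


def estimate_corpus_bytes(forms):
    """Multiply average size by item count per form, then add the results."""
    return sum(f.item_count * f.avg_size_bytes for f in forms)


# Hypothetical figures for illustration only.
forms = [
    ContentForm("files", 200_000, 250 * 1024),
    ContentForm("web pages", 50_000, 40 * 1024),
    ContentForm("list items", 1_000_000, 2 * 1024),
]

total_bytes = estimate_corpus_bytes(forms)
print(f"Estimated corpus size: {total_bytes / 1024**3:.1f} GiB")
```

Keeping each content form as a separate entry makes it easy to revise one estimate later without redoing the whole calculation.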
When you have either estimated or measured the corpus size, you must also estimate corpus growth characteristics. You can start to base these estimates on past growth patterns, but you must be aware that content creation patterns for any organization vary over time. So you must also attempt to gather expected growth characteristics from analysts and other people in the organization.
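As a rough aid for the growth estimate, the following Python sketch projects corpus size year by year. It assumes a constant compound annual growth rate, which is a simplification: as noted above, real content creation patterns vary over time, so the rate used here is a hypothetical input rather than a recommendation.

```python
def project_corpus_size(current_bytes, annual_growth_rate, years):
    """Return projected sizes for year 0..years under constant compound growth.

    annual_growth_rate is a fraction, e.g. 0.30 for an assumed 30% per year.
    """
    return [current_bytes * (1 + annual_growth_rate) ** year
            for year in range(years + 1)]


# Hypothetical starting point: 500 GiB growing at an assumed 30% per year.
start = 500 * 1024**3
for year, size in enumerate(project_corpus_size(start, 0.30, 3)):
    print(f"Year {year}: {size / 1024**3:.0f} GiB")
```

Running the projection with a low, expected, and high growth rate gives a range to plan against rather than a single figure.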
Although the main governing factor that affects full-text catalog size is the size of the corpus, the relationship is not a simple one. The characteristics of the content in the corpus can affect this relationship in many ways.
One of the factors that affect the ratio between total file sizes and full-text catalog size is file format.
To help you understand this concept, imagine that your entire corpus consists of the text for the novel War and Peace. Now imagine that the text is stored in a plain text file, and that the size of that file is 10 megabytes (MB). Then, suppose that you index that file, and that the resultant full-text catalog is around 1 MB in size. The ratio between corpus and full-text catalog is, in this case, about 10:1.
Now imagine that the text for War and Peace has been typed into a Microsoft PowerPoint® presentation file, instead of the plain text file. Imagine that formatting has been applied to the text, and that the presentation consists of multiple slides representing pages of the book. The file size will be considerably larger for the same textual content than the plain text file. For this example, imagine that the file size is 50 MB. Suppose that you now index that file. It is likely that the full-text catalog will still be around 1 MB in size, because the same words have been indexed. The ratio between the corpus and the full-text catalog, however, is now about 50:1.
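The arithmetic behind these two ratios is shown in the short Python sketch below. The sizes are the illustrative figures from the War and Peace example, not measurements from a real index.

```python
def corpus_to_catalog_ratio(corpus_mb, catalog_mb):
    """Return how many MB of corpus correspond to 1 MB of full-text catalog."""
    return corpus_mb / catalog_mb


# Same text, same catalog size, different file formats (example figures).
plain_text_ratio = corpus_to_catalog_ratio(corpus_mb=10, catalog_mb=1)
powerpoint_ratio = corpus_to_catalog_ratio(corpus_mb=50, catalog_mb=1)

print(f"Plain text: {plain_text_ratio:.0f}:1")   # 10:1
print(f"PowerPoint: {powerpoint_ratio:.0f}:1")   # 50:1
```

The catalog size is the same in both cases; only the file size, and therefore the ratio, changes.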
File compression can also affect the ratio between corpus size and full-text catalog size. For example, the compressed nature of the Microsoft Office Word 2007 format will result in a smaller file size than if the equivalent content is stored in an Office Word 2003 file.
Another factor that affects the ratio between corpus size and full-text catalog size is the ratio of textual content in files to embedded objects. Continuing with the War and Peace example, imagine that the PowerPoint version described previously includes an embedded picture file on each page to illustrate a scene in the novel. The resultant file size will be much larger than the previous 50 MB because of the embedded graphics. But the textual content will be the same, and indexing the file will still result in a full-text catalog of about 1 MB.
A further factor that affects the ratio between corpus size and full-text catalog size is the uniqueness of the content being indexed. Office SharePoint Server 2007 tokenizes indexed words for efficient storage and lookup; the less unique the words being indexed, the smaller the full-text catalog is relative to the corpus size.
This factor applies both to uniqueness of words within files, and to uniqueness of content between files. As an example of the first concept, a 10-MB file containing technical content about Microsoft Office SharePoint Server 2007 Enterprise Search is likely to have many occurrences of the words SharePoint, search, Microsoft, enterprise, document, file, server, index, query, and so on. Because of the tokenizing of these common words, the space required to index the file will be smaller than that required to index 10 MB of a novel that has a rich and varied vocabulary.
As for the second concept, indexing a corpus that consists of many unique documents about various subjects will result in a full-text catalog that is larger than the catalog for a corpus consisting of many copies of similar documents. For example, imagine that your organization stores a copy of terms and conditions in each project site within a site collection. The terms and conditions are likely to be very similar for each project, with perhaps only minor variations on a project-by-project basis. The words within these documents will be tokenized by the indexer and so will result in a smaller full-text catalog than if each file had relatively unique content.
Because all vocabularies are essentially limited, there is a relationship between total corpus size and the ratio of that size to the full-text catalog space requirements. This is simply a statistical phenomenon: 10 terabytes of data will usually contain less unique content as a proportion of the corpus size than 1 terabyte of data. To illustrate this point further, as a corpus grows, it tends to include more and more occurrences of words that have already been used elsewhere in the corpus, until at some point the corpus contains every word in the organization’s vocabulary. Further additions to the corpus will not introduce new words.
As an extreme example, imagine that every French speaker in the world writes a novel. Over all the content, imagine that 98 percent of common French words have been used. Then imagine that these authors all write a second novel. The size of the content will double, but the additional content will largely consist of words that have already been used in the first set of novels. This concept applies to organizations and their common vocabularies as much as it applies to a written language.
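The saturation effect can be seen with a toy Python sketch that counts distinct words as a body of text doubles in size. This is purely illustrative: splitting on whitespace is an assumption made for simplicity and does not represent how the Office SharePoint Server 2007 indexer tokenizes or stores content.

```python
def vocabulary_stats(text):
    """Return (distinct_words, total_words) using naive whitespace splitting."""
    words = text.lower().split()
    return len(set(words)), len(words)


first_novel = ("the index stores each word once and "
               "the index points to each document")
# A "second novel" that reuses the same vocabulary doubles the content.
both_novels = first_novel + " " + first_novel

for label, text in [("one novel", first_novel), ("two novels", both_novels)]:
    distinct, total = vocabulary_stats(text)
    print(f"{label}: {distinct} distinct words out of {total} "
          f"({distinct / total:.0%} unique)")
```

The total word count doubles while the number of distinct words stays the same, so unique content falls as a proportion of the whole.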
Content Metadata and Crawled Properties
The amount of data stored in indexed and Managed Properties affects the ratio between the size of the corpus and the sizes of both the full-text catalog and search database.
Crawled properties are simply those attributes that are discovered and indexed at crawl time. Crawled properties include attributes from content source systems, such as the last modified date for files in file shares, and the column data for items in SharePoint lists and libraries. They also include embedded property values from the property sheets of specific file types, such as Microsoft Office documents.
Crawled property values are stored in the full-text catalog and so can affect the ratio between the size of the corpus and the size of the full-text catalog file.
By default, Office SharePoint Server discovers and indexes textual crawled properties. For SharePoint content, you can specify precisely which properties are indexed by setting the indexed properties of site columns.
Managed Properties, as previously described in Module 1, represent a virtual mapping between multiple and various indexed properties. Managed Properties and their corresponding values for each item are stored in the search database. Therefore, the number of Managed Properties and their mappings to indexed properties can affect the ratio between the number of files in a corpus and the size of the search database.
Access Control Lists
Permissions are retrieved for each secured item when content is crawled, and they are then used at query time for security trimming. Access control lists that represent permissions are stored in the search database for each secured item. Therefore, the ratio between the number of items in the corpus and the size of the search database varies depending on whether those items are secured.
Access control lists are not retrieved in every scenario, even for secured files. For example, if you have a non-SharePoint Web site that provides links to Office documents stored in a file system, and those items are secured, the indexer will not receive security information because the HTTP protocol handler does not retrieve permissions.
Another factor that affects the ratio between corpus size and full-text catalog size is the versioning strategy in the farm.
SharePoint Versioning and Indexing
The indexer only indexes one version of each item, so it is not possible to index all versions of files in a document library, or all versions of items in a list.
On a related point, if you want to index all versions of files, you must implement some sort of authoring process (perhaps with workflows) where new documents are created at each step in the process, rather than relying on versions of single documents.
Versioned Corpus and Index Ratios
If your corpus is characterized by many versions of items in SharePoint lists or libraries, the ratio of the entire corpus (including all item versions) to the size of the full-text catalog file will be higher than if you disable versioning in SharePoint lists and libraries. You should remember this if you are measuring corpus size based on content database sizes.
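To see why versioning inflates a measured corpus size, consider the following Python sketch. It compares the space used by all stored versions with the space of the single version per item that is indexed. The item names and sizes are hypothetical, and the sketch assumes for simplicity that the indexed version is the most recent one.

```python
def stored_and_indexed_bytes(items):
    """Return (bytes stored across all versions, bytes in one version per item).

    items maps an item name to a list of version sizes, oldest first.
    Assumption: the last entry is the version that gets indexed.
    """
    stored = sum(sum(versions) for versions in items.values())
    indexed = sum(versions[-1] for versions in items.values())
    return stored, indexed


# Hypothetical library: one document with three versions, one with a single version.
library = {
    "spec.docx": [100, 120, 130],
    "plan.docx": [200],
}

stored, indexed = stored_and_indexed_bytes(library)
print(f"Stored: {stored} bytes, indexed: {indexed} bytes")
```

The gap between the two figures is content that occupies the content database but contributes nothing to the full-text catalog.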
Content Access Accounts and Versioning
You must be aware of the effect that the content access account has on the versioned content that is being indexed (although it does not affect the ratio between corpus size and index space requirements).
SharePoint technologies can maintain multiple versions of a page or document and will present specific versions to different users based on their roles. For example, if you have checked out and modified a published page and then saved it but not checked it back in, the next time that you request the page, you will be presented with your saved version. Anyone else who requests the page will be presented with the latest published version. Then, imagine that you make further changes and check the page back in and submit it for approval. The next time that you request the page, you will be presented with your edited version that is waiting for approval. And any person who is in the approver’s role will also be presented with that version. All other readers will still be presented with the latest published version.
The point is that when the indexer requests a page or file for indexing purposes, SharePoint technologies will present the version of the item that is appropriate for the account being used to perform the crawl. While there is no fixed rule for selecting content access accounts, you must be aware of this behavior so that you can specify an appropriate account for the crawl. In general, if you want to ensure that only approved, published content is indexed, you should use a reader’s account to crawl SharePoint content. However, if you want to index unpublished content, perhaps for a volatile authoring environment, then you can consider using an editor’s account or approver’s account, or another administrative account.
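The following Python sketch is a deliberately simplified model of the behavior described above. It is not SharePoint code: the Page class, the role names, and the selection rule are all hypothetical, and exist only to show how the choice of crawl account changes which version reaches the index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Toy model of a page with a published and an optional pending version."""
    published: str
    pending_approval: Optional[str] = None


def version_seen_by(role: str, page: Page) -> str:
    """Return the version this toy model presents to an account in the given role.

    Assumption: editors and approvers see a pending version if one exists;
    everyone else sees the latest published version.
    """
    if role in ("editor", "approver") and page.pending_approval is not None:
        return page.pending_approval
    return page.published


page = Page(published="v1.0 (published)", pending_approval="v1.1 (pending)")

# The crawl indexes whatever the content access account is shown.
for crawl_role in ("reader", "approver"):
    print(f"Crawl as {crawl_role}: indexes {version_seen_by(crawl_role, page)}")
```

In this model, crawling as a reader indexes only the published version, while crawling as an approver picks up the content that is still waiting for approval.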
In the next part, Part 2, I shall talk about planning the schedules, architecture, and topologies for Enterprise Search capacity planning.