To estimate the size of a search installation, we need to ask customers a couple of questions. For example, we need to know the total number of documents and how big the current corpus is. Usually customers can only give us a reliable answer to the 'how many' part. The big question is:
What's the average size of an office document?
Whenever I ask a customer what kind of content they have, the answer is always the same: "Just regular content, very average stuff". In most Enterprise Search scenarios this actually means they have no idea.
So what is average content? We think we know. But do we have any actual data supporting our thoughts? I decided to collect data from a couple of real-life scenarios. In one case we have about 100 different data sources spread across tens of millions of searchable items. In the other case we have more or less a single content source (SharePoint) with hundreds of thousands of items.
|Document Type|Customer #1|Customer #2|
|---|---|---|
| |686 kB|134 kB|
|PowerPoint|1489 kB|2693 kB|
|Excel|674 kB|732 kB|
|Word|270 kB|401 kB|
This is how the content is distributed across the different types of content. Some observations:
- Most web content is smaller than 200 kB
- PowerPoint and PDF consume the most space
- Word documents are most frequent
Conclusion: The average size of an office document is 321 kB.
The next time you need to do some search engine sizing, you have some data to back you up. For example, you know that:
- 1 million documents consume about 306 GB (321 kB each, counting 1 GB as 1024 × 1024 kB), and 10 million documents about 2.99 TB
- 40% of space will be used to store PDF documents
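The arithmetic behind these figures can be sketched in a few lines of Python. This is only a rough illustration: the function name `source_size_gb` is my own, and it assumes binary units (1 GB = 1024 × 1024 kB, 1 TB = 1024 GB), which is what the 306 GB figure implies.

```python
# Average office document size in kB, taken from the conclusion above.
AVG_DOC_KB = 321

def source_size_gb(num_docs, avg_kb=AVG_DOC_KB):
    """Estimate source corpus size in GB.

    Assumes binary units: 1 GB = 1024 * 1024 kB.
    """
    return num_docs * avg_kb / (1024 * 1024)

for n in (1_000_000, 10_000_000):
    gb = source_size_gb(n)
    # Convert to TB with 1 TB = 1024 GB (same binary-unit assumption).
    print(f"{n:>12,} documents: {gb:,.0f} GB ({gb / 1024:.2f} TB)")
```

Swap in your own average if you have measured one. The 321 kB default is only as representative as the two customers it came from.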
Note that this space requirement covers the source content only. In addition, you will need space for your index. For example, the footprint of a JPEG image in the index is typically 3% of the image size. For a PDF, the average footprint is about 35%. But more on this in a later post.
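Until then, here is a minimal sketch of how the index footprint could be estimated from the source size. It uses only the two ratios quoted above (JPEG 3%, PDF 35%). The names `INDEX_RATIO` and `index_size_gb` are hypothetical, and the function deliberately refuses document types for which no ratio has been given rather than guessing one.

```python
# Index footprint as a fraction of source size. Only the two ratios
# mentioned in this post are included; other types are unknown here.
INDEX_RATIO = {"jpeg": 0.03, "pdf": 0.35}

def index_size_gb(source_gb_by_type, ratios=INDEX_RATIO):
    """Estimate index size in GB from source size per document type."""
    unknown = set(source_gb_by_type) - set(ratios)
    if unknown:
        # No ratio available: fail loudly instead of inventing a number.
        raise KeyError(f"no index ratio known for: {sorted(unknown)}")
    return sum(gb * ratios[doc_type] for doc_type, gb in source_gb_by_type.items())

# Example: 1 million documents (about 306 GB), of which 40% is PDF.
pdf_source_gb = 0.40 * 306
print(f"PDF index footprint: {index_size_gb({'pdf': pdf_source_gb}):.1f} GB")
```

Treat the result as an order-of-magnitude figure. The actual footprint depends on the search engine and on how much text each document really contains.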