File size reduction for Open XML

I spend a lot of time working on the adoption of the Open XML Formats,

For IT organizations, it can be a daunting task to migrate document formats in Office, and it the benefits are not always immediately obvious. Microsoft spent a fair bit of time on tools / guidance to make the introduction of Open XML easier, and I'll drive deep on those in future posts. But I wanted to use this opportunity to discuss one of the primary reasons why you should let Open XML in, and how it can help. This will be the first in a 3 part series on file size reduction, document "sanitization" and improvements in document format security.

A tangible benefit of Open XML is file size reduction. Reducing file sizes means lower storage costs and reduced bandwidth consumption. Particularly for those paying for bandwidth on a meter, this can be quite helpful.

Why are Open XML Files smaller? With Open XML, and the Open Packaging Conventions, the file architecture is much more modular and is compressed using a ZIP archive. Storing XML content in a ZIP container lends itself very well to compression, so we do see great results for text-intensive documents like documents and spreadsheets. The benefits don’t translate as well for presentation files, because those tend to be image-intensive (and therefore do not benefit from ZIP compression), but even those are smaller.

The data in this post is a preview of a more comprehensive study we’re working on, but I thought I’d share some of the early returns. There’s no real magic in the study, it’s a pretty simple project. If you want to try this for yourself, you can do what we’re doing: use your favorite search engine / content store to retrieve 100 documents each for word processing (Word 97-2003), spreadsheet (Excel 97-2003) and presentation (PowerPoint 97-2003) format documents, and convert them to Open XML. Results will always vary slightly depending on your data set, but the results should be somewhat consistent with what we’re showing here.

You can do the document conversion using the desktop products, or the Office Migration Planning Manager (and the Office File Conversion tool, specifically), which has a command line interface. Other conversion tools are also available. Quality / results will vary depending on the translation environment.

This post will only discuss the Word documents converted using Word 2007, but the data will illustrate the survey results clearly.

Word Documents: Converting .DOC to .DOCX

"docx" Sizes

"doc" Sizes

Size Change

Storage Gain

Median

30Kb

69Kb

29Kb

52%

Minimum

11Kb

20Kb

-2Kb

-2%

Maximum

559Kb

975Kb

784Kb

87%

Percentiles

25

18Kb

35Kb

15Kb

40%

50

30Kb

69Kb

29Kb

52%

75

76Kb

160Kb

67Kb

62%

A median size reduction of 52% for documents is quite significant, and translates to real savings for disks and network traffic. We can assume a linear correlation between document size and the number of packets transmitted over a network; therefore we can assume a similar result in bandwidth consumption (bandwidth consumption data will be published in the final paper as well.)

Don’t believe it? – try this simple test:

Create a simple document in Word 2007. A great way to generate sample text in Word is by using a formula: “=rand(10,5)”, where 10 is the number of paragraphs in your document, and 5 is the number of sentences per paragraph. You can use this formula to generate documents of increasing length. In doing so, the benefit of compression in Open XML becomes instantly clear. I conducted this test 5 times, on documents ranging from 10 paragraphs of text to over 60 pages. (I have attached them here for you to use.)

I simply added the text, saved the file in binary format first, then saved the file again as Open XML. There is no formatting (beyond my default template, no tables, images or anything other than simple paragraphs.) As the documents increase in length, the benefit of compression is obvious:

Sample file name

.doc size

.docx size

Test 1

31k

11k

Test 2

86k

13k

Test 3

147k

15k

Test 4

269k

18k

Test 5

513k

26k

If you’re a graph type, we can make the relationship more clear:

Size Graph

 

This isn’t to say that 5,000 page documents stored using Open XML are going to be 1 – 2 % of their original size, but this is to point out that it is very easy to demonstrate real space savings with Open XML. Depending on the nature of the documents you are creating, especially if they are text-intensive, the size difference can be quite dramatic.

We’ll eventually publish the full data set in a more detailed (and scientific) white paper, and the paper will publish in late January. But as an introductory post, I thought I’d make this an easy one, with a pretty clear benefit. I’ll let you work out the math for your own storage & bandwidth savings, but if you can ask yourself “what would I gain if my files were half of their current size?” – I’ll bet the answer will usually be a good one.

 

Word Docs.zip