Open XML SDK: Merging Documents

(This post courtesy Natalia Efimsteva)

Office Open XML (OpenXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The Office Open XML specification has been standardized by ECMA (ECMA-376) [wiki]. Open XML is the native format for MS Office 2007/2010.

Open XML lets you manipulate MS Office files in whatever way you need. For example, you can create .docx files programmatically on the server side (something that wasn’t recommended for binary MS Office formats like .doc).

The Open XML SDK 2.0 for Microsoft Office is built on top of the System.IO.Packaging API and provides strongly typed part classes to manipulate Open XML documents. The SDK also uses the .NET Framework Language-Integrated Query (LINQ) technology to provide strongly typed object access to the XML content inside parts of Open XML documents.

The Open XML SDK 2.0 simplifies the task of manipulating Open XML packages and the underlying Open XML schema elements within a package. The Open XML Application Programming Interface (API) encapsulates many common tasks that developers perform on Open XML packages, so you can perform complex operations with just a few lines of code.

Now let’s look at a frequently asked question: how to merge Open XML documents programmatically. It’s not a very complicated task, but there are a few things to think about.

First of all, let’s look at the internal .docx structure. Below is an unzipped view:

[Image: unzipped view of a .docx package]
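The same structure can be listed from code. This is a minimal sketch that uses the System.IO.Packaging API mentioned above; the file name Doc1.docx is only a placeholder, and the part names in the comment are typical for Word documents rather than guaranteed.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // WindowsBase.dll on .NET Framework

class ListParts
{
    static void Main()
    {
        // "Doc1.docx" is a placeholder path for any Word document.
        using (Package package = Package.Open("Doc1.docx", FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                // Typical entries include /word/document.xml and /word/styles.xml.
                Console.WriteLine("{0}  ({1})", part.Uri, part.ContentType);
            }
        }
    }
}
```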

The Open XML SDK 2.0 includes a great tool, Document Explorer, which lets us view the XML markup as well as the .NET code that constructs this markup:

[Image: Document Explorer showing the XML markup and the corresponding .NET code]

So when we merge documents, we need to merge not only the content (text) but also the document’s styles and other formatting settings.

The Open XML SDK operates on Open XML elements such as paragraphs rather than on the logical objects a user sees, such as pages, content, and so on.
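To make this concrete, here is a small sketch of what working at the element level looks like with the Open XML SDK. It only reads a document (Doc1.docx is a placeholder name) and reports a few of the parts that a merge would have to take care of.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class InspectDocument
{
    static void Main()
    {
        // Open read-only; "Doc1.docx" is a placeholder path.
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Doc1.docx", false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;
            Body body = mainPart.Document.Body;

            // Top-level paragraphs only; paragraphs nested in tables are not visited here.
            foreach (Paragraph paragraph in body.Elements<Paragraph>())
            {
                Console.WriteLine(paragraph.InnerText);
            }

            // Styles, headers and footers live in separate parts of the package.
            Console.WriteLine("Has styles part: {0}", mainPart.StyleDefinitionsPart != null);
            Console.WriteLine("Header parts: {0}", mainPart.HeaderParts.Count());
            Console.WriteLine("Footer parts: {0}", mainPart.FooterParts.Count());
        }
    }
}
```

Note that there is no "page" object anywhere in this model: pagination is computed by Word when it lays out the document.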

Fortunately, there is a tool that makes our life easier: DocumentBuilder from PowerTools for Open XML. Another way is to use altChunk. This element specifies a location within a document for the insertion of the contents of a specified file containing external content to be imported into the main WordprocessingML document. The differences between these two approaches are described in the post “Comparison of altChunk to the DocumentBuilder Class”. From here on we will talk about the DocumentBuilder approach.

Using DocumentBuilder is straightforward:
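For comparison, the altChunk approach might look roughly like the sketch below. The file names and the chunk ID are placeholders chosen for illustration, and the code works on a copy so that Doc1.docx is left untouched. Keep in mind that the chunk is only embedded here; the actual import happens when Word opens the resulting file.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class AltChunkMerge
{
    static void Main()
    {
        // Placeholder file names; work on a copy of the first document.
        File.Copy("Doc1.docx", "MergedAltChunk.docx", true);

        using (WordprocessingDocument doc = WordprocessingDocument.Open("MergedAltChunk.docx", true))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;
            string altChunkId = "AltChunkId1"; // arbitrary relationship ID

            // Embed the second document as an alternative format import part.
            AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.WordprocessingML, altChunkId);
            using (FileStream stream = File.OpenRead("Doc2.docx"))
            {
                chunk.FeedData(stream);
            }

            // Reference the embedded part from the body, after the last paragraph.
            AltChunk altChunk = new AltChunk();
            altChunk.Id = altChunkId;

            Body body = mainPart.Document.Body;
            Paragraph lastParagraph = body.Elements<Paragraph>().LastOrDefault();
            if (lastParagraph != null)
            {
                body.InsertAfter(altChunk, lastParagraph);
            }
            else
            {
                body.InsertAt(altChunk, 0);
            }

            mainPart.Document.Save();
        }
    }
}
```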

    // Open both source documents read-only.
    using (WordprocessingDocument part1 = WordprocessingDocument.Open(@"Doc1.docx", false))
    using (WordprocessingDocument part2 = WordprocessingDocument.Open(@"Doc2.docx", false))
    {
        // The second constructor argument is keepSections (see below).
        List<Source> sources = new List<Source>();
        sources.Add(new Source(part1, true));
        sources.Add(new Source(part2, true));

        // Write the merged result to a new file.
        DocumentBuilder.BuildDocument(sources, "MergedDoc.docx");
    }

The most interesting part is the second argument of the Source constructor, keepSections. Using the keepSections argument appropriately allows you to precisely control which sets of section properties (in other words, the visual formatting) are moved from the source documents into the destination document. For more information, please see the post “How to Control Sections when using OpenXml.PowerTools.DocumentBuilder”.
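As an illustration, the sketch below is a variation of the code above that passes false for the first source, so that only Doc2.docx contributes its section properties. The namespace in the using directive is an assumption taken from the title of the post cited above (newer PowerTools releases use a different namespace and API), the output file name is a placeholder, and the exact layout of the result depends on the rules described in that post.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;
using OpenXml.PowerTools; // assumed namespace; adjust to your PowerTools version

class MergeWithSectionControl
{
    static void Main()
    {
        using (WordprocessingDocument part1 = WordprocessingDocument.Open(@"Doc1.docx", false))
        using (WordprocessingDocument part2 = WordprocessingDocument.Open(@"Doc2.docx", false))
        {
            List<Source> sources = new List<Source>();

            // keepSections = false: do not carry over Doc1's section properties.
            sources.Add(new Source(part1, false));

            // keepSections = true: Doc2's section properties are kept.
            sources.Add(new Source(part2, true));

            // Placeholder output name.
            DocumentBuilder.BuildDocument(sources, "MergedDoc_Doc2Sections.docx");
        }
    }
}
```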

We have two documents:

Doc1.docx

[Image: Doc1.docx]

Doc2.docx

[Image: Doc2.docx]

DocumentBuilder does all the work of merging the two documents for you, preserving:

  • formatting
  • page numbers (including Link Sections)
  • headers and footers
  • orientation
  • and so on.

That’s magic!

[Image: the merged document]

Additional Resources