Document Inspector, a privacy feature in the 2007 Office system, is a convenient tool for removing private or personal information (often known as metadata) from Excel 2007, PowerPoint 2007, and Word 2007 files. However, as an out-of-the-box solution, Document Inspector can’t be all things to all organizations, and there are many types of metadata that it can’t or won’t remove. This article looks at some of the types of metadata that Document Inspector doesn’t remove, and lists companies that offer tools for scrubbing this type of information from 2007 Office system files.
Note: Document Inspector is extensible and can be programmatically customized to suit a wide range of document workflow requirements. For more information, see Customizing the 2007 Office System Document Inspector (http://go.microsoft.com/fwlink/?LinkId=78577).
What you can’t see can indeed byte you
Excel, PowerPoint, and Word each have Document Inspector modules that remove some types of hidden or invisible data: Excel has three modules, which remove invisible objects, hidden rows and columns, and hidden worksheets; PowerPoint has one module for removing invisible on-slide content; Word has one module for removing hidden text. For these modules to work, the content must be formatted as hidden or invisible, such as hidden text in Word or invisible shapes in Excel. Why is this distinction important? Because there are many ways to make text and objects seem hidden or invisible, and Document Inspector removes such content only when it is specifically formatted as hidden or invisible.
For example, if you put white text on a white background, you effectively hide the text, but Document Inspector assumes you meant to make the text white on a white background, so it doesn’t consider the text hidden and won’t remove it. Likewise, if you create a blue shape on a blue background, the shape is effectively invisible, but Document Inspector doesn’t treat it as invisible and won’t remove it. Similarly, a shape that is covered by another shape isn’t considered invisible, and neither is a shape that has no fill and no outline, even though both shapes are hidden from view. In both cases, Document Inspector leaves the shapes in the file.
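Because Document Inspector does not flag this kind of disguised text, a separate check can help. The following Python sketch shows one possible way to look for it in a Word 2007 file: it treats the .docx file as a ZIP package, reads word/document.xml, and lists runs whose font color is set directly to white (FFFFFF). The function names are illustrative and are not part of any Office API. The check is only a heuristic: it does not resolve colors inherited from styles or themes, and it does not compare the text color with the actual page or shading color.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2007 (.docx) files.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def find_white_text_runs(document_xml):
    """Return the text of runs whose font color is explicitly FFFFFF."""
    root = ET.fromstring(document_xml)
    hits = []
    for run in root.iter(W + "r"):
        color = run.find(W + "rPr/" + W + "color")
        if color is None:
            continue  # no direct color formatting on this run
        if color.get(W + "val", "").upper() != "FFFFFF":
            continue
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if text.strip():
            hits.append(text)
    return hits


def white_text_in_docx(path_or_file):
    """Open a .docx package and scan its main document part."""
    with zipfile.ZipFile(path_or_file) as package:
        return find_white_text_runs(package.read("word/document.xml"))
```

Any run that this check reports still needs a human decision, because white text on a dark background is perfectly legitimate content.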
You can also hide data in Excel by putting it in a distant row or column, like row 1,000,000 or column 10,000 (an Excel 2007 worksheet has 1,048,576 rows and 16,384 columns). This effectively hides the data because it is off-screen, far beyond the standard display area of a typical spreadsheet. But Document Inspector sees it as normal data that you meant to put there, so the modules for removing hidden or invisible content do not remove it. The same applies to shapes or SmartArt that you move out of the visible viewing area on a spreadsheet: they might not be visible to someone casually viewing the spreadsheet, but Document Inspector still treats them as visible (unhidden) content and doesn’t remove them.
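A simple scan can surface data that has been parked far from the working area. The sketch below is one hedged approach: it reads each worksheet part in an .xlsx package and reports non-empty cells beyond a row or column threshold. The thresholds (row 1,000 and column 100) and the function names are assumptions chosen for illustration; pick limits that match how your own workbooks are laid out.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by Excel 2007 (.xlsx) files.
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
CELL_REF = re.compile(r"^([A-Z]+)([0-9]+)$")


def column_number(letters):
    """Convert column letters to a 1-based number (A -> 1, Z -> 26, AA -> 27)."""
    number = 0
    for ch in letters:
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


def distant_cells(sheet_xml, max_row=1000, max_col=100):
    """Return references of non-empty cells beyond the given limits."""
    root = ET.fromstring(sheet_xml)
    found = []
    for cell in root.iter(S + "c"):
        match = CELL_REF.match(cell.get("r", ""))
        if match is None:
            continue  # reference missing or in an unexpected form
        col = column_number(match.group(1))
        row = int(match.group(2))
        # A cell counts as non-empty if it has a value, formula, or inline string.
        has_content = any(cell.find(S + tag) is not None for tag in ("v", "f", "is"))
        if has_content and (row > max_row or col > max_col):
            found.append(cell.get("r"))
    return found


def distant_cells_in_workbook(path_or_file, max_row=1000, max_col=100):
    """Map each worksheet part name to its distant cell references."""
    report = {}
    with zipfile.ZipFile(path_or_file) as package:
        for name in package.namelist():
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                cells = distant_cells(package.read(name), max_row, max_col)
                if cells:
                    report[name] = cells
    return report
```

This sketch looks only at cell data. Shapes or SmartArt moved off-screen are stored in separate drawing parts and would need a different check.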
A good rule to remember is that the Document Inspector modules for hidden and invisible content do not remove any content unless the content is explicitly formatted as hidden or invisible. Trickery and sleight of hand might make things invisible to a casual viewer, but not to Document Inspector.
Note: PowerPoint, unlike Excel and Word, has a Document Inspector module for removing off-screen content. The off-screen content module removes the off-screen content even if it is not formatted as hidden or invisible.
Several program features rely on data caching to increase performance. This can be a problem if the cached data contains metadata because Document Inspector does not remove cached data from files.
Pivot tables are one example where Document Inspector doesn’t remove cached data. When you create a pivot table in a new worksheet, Excel creates a data cache of the data you selected so it can quickly render the pivot table. In some cases, the cached data may remain in the new worksheet after you delete the pivot table. Running Document Inspector will not help you remove this information because Document Inspector does not remove cached data. If you are concerned about the pivot table data that is cached, you can clear the Save source data with file check box that is on the Data tab in PivotTable Options. Also, if you want to display the information that appears in a pivot table, but delete the cached data, you can copy the pivot table, use Paste Special to paste only the values and formats into a new area on the worksheet, and then delete the original pivot table.
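To confirm whether a workbook still carries PivotTable cache data, you can look inside the .xlsx package. The following sketch assumes the part names that Excel typically writes (files named pivotCacheDefinitionN.xml and pivotCacheRecordsN.xml under xl/pivotCache/). The package format permits other names, so a more thorough check would follow the package relationships instead. The function names are illustrative.

```python
import posixpath
import zipfile


def pivot_cache_part_names(names):
    """Filter a list of package part names down to PivotTable cache parts."""
    return sorted(
        name
        for name in names
        if name.startswith("xl/pivotCache/")
        and name.endswith(".xml")  # skips the .rels files in the _rels folder
        and posixpath.basename(name).startswith("pivotCache")
    )


def pivot_cache_parts(path_or_file):
    """List the PivotTable cache parts stored in an .xlsx package."""
    with zipfile.ZipFile(path_or_file) as package:
        return pivot_cache_part_names(package.namelist())
```

An empty result suggests that no cache parts remain under the usual names. A non-empty result after the PivotTable has been deleted is a signal to review the workbook before you distribute it.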
Using the sort and filter features can also create cached data because the filter and sort states are cached. Generally, this is not an issue because the cached data is derived from the data that’s visible in the worksheet, but it is possible for the cache to hold values that no longer exist in the spreadsheet. For example, say you apply a filter to a column, and then you remove some rows and columns. The filtered values can still appear because they’re in a cache, even though the underlying data in the worksheet has been deleted. Document Inspector doesn’t remove this type of cached data from a worksheet.
Embedded objects can also be a source of cached data. For example, if you copy a chart from Excel and use the default paste options to paste it into a PowerPoint slide, you are actually pasting the chart and the underlying data for the chart into the PowerPoint slide. The chart is visible, but the data associated with the chart isn’t visible, although it’s cached. Removing or deleting the chart does not necessarily remove the cached data that’s associated with the chart, and Document Inspector does not remove the cached data that’s associated with the chart. In general, Document Inspector does not remove any data that’s associated with an embedded object. If you paste an object into a document, and you don’t want to include the data that’s associated with the object, use the Paste as Picture option.
Database connections and printer connections are two common types of external connections that might put metadata into a file without you knowing it. In both cases, Document Inspector cannot remove this information from the file.
Database connections can be particularly tricky because you usually must provide private information in order to create the database connection, such as a user name, password, path to the database, database name, and the name of the machine from which you are creating the connection. This private information makes up the connection string, which is cached in the Excel file. Document Inspector does not remove this information from the file. However, you can remove the cached connection string data by deleting the connection. You can also configure connection properties so that passwords are not saved with connection information, which is a recommended best practice.
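To see what a workbook’s connections actually store, you can read the connections part of the .xlsx package directly. The sketch below assumes the usual layout, in which xl/connections.xml holds one connection element per connection and database connections carry their connection string in the connection attribute of a dbPr child element. The function names and the simple password test are illustrative assumptions, not a complete audit.

```python
import zipfile
import xml.etree.ElementTree as ET

S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
PASSWORD_KEYS = {"password", "pwd"}


def connection_strings(connections_xml):
    """Return (name, connection string, has_password) for each database connection."""
    root = ET.fromstring(connections_xml)
    results = []
    for conn in root.iter(S + "connection"):
        db = conn.find(S + "dbPr")
        if db is None:
            continue  # not a database connection (for example, a web query)
        text = db.get("connection", "")
        # Split "Key=Value;Key=Value" pairs and collect the lowercased keys.
        keys = {
            part.split("=", 1)[0].strip().lower()
            for part in text.split(";")
            if "=" in part
        }
        results.append((conn.get("name"), text, bool(keys & PASSWORD_KEYS)))
    return results


def connections_in_workbook(path_or_file):
    """Read xl/connections.xml from an .xlsx package, if it exists."""
    with zipfile.ZipFile(path_or_file) as package:
        if "xl/connections.xml" not in package.namelist():
            return []
        return connection_strings(package.read("xl/connections.xml"))
```

Even when no password is stored, the connection string can still reveal server names, database names, and user names, so review the full string and not just the password flag.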
Printer information is also tricky because Office applications pass printer-specific information to printer drivers, and they do this by embedding the information in the document, workbook, or presentation file. Printer-specific information can include the path to the printer and the printer name. It can also include a user name and password if you’re using secure printing features. Document Inspector can remove the printer name and printer path from a file, but it can’t remove all of the printer-specific information, because printer drivers usually don’t provide enough detail for Document Inspector to determine what type of metadata is embedded in the file.
Protected and restricted files
Document Inspector doesn’t remove any metadata that’s in a protected or restricted file, such as a file that has editing restrictions, is digitally signed, or is protected by restricted permissions. For example, if you apply editing restrictions to a file or you add a digital signature to a file, Document Inspector can’t access the file and so it can’t remove any metadata. As a rule, be sure to run Document Inspector before you restrict or protect a file.
In addition, Document Inspector doesn’t remove comments that a user adds when applying a digital signature to a document. Since you have no control over what a document signer might say in a comment, it’s possible that a comment could contain metadata that you don’t want revealed. This can occur when you insert a Microsoft Office signature line in a document and you select the Allow the signer to add comments in the Sign dialog check box. This option enables a signer to create a comment when they add their signature to a document. Anyone can view the comment by looking at the signature details. But because the document is digitally signed and can’t be modified, Document Inspector can’t remove the comment after the signature is applied. To avoid this, don’t allow signers to add a comment when they sign a document.
VBA and ActiveX
Document Inspector doesn’t remove any code or comments from Visual Basic for Applications (VBA) modules, and it doesn’t remove any data that’s associated with an ActiveX control. In both cases it’s impossible for Document Inspector to determine whether or not it’s removing critical data, so it leaves VBA modules and ActiveX controls as they are.
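Since these items are left untouched, it can be useful to know whether a file contains them at all. The following sketch lists the package parts that usually hold VBA projects and ActiveX control data. It assumes the conventional part names (vbaProject.bin for VBA, and an activeX folder for controls) that Office 2007 typically writes; the function names are illustrative.

```python
import posixpath
import zipfile


def code_parts(names):
    """Group package part names into VBA project parts and ActiveX parts."""
    report = {"vba": [], "activex": []}
    for name in sorted(names):
        if posixpath.basename(name) == "vbaProject.bin":
            report["vba"].append(name)
        elif "/activeX/" in name and not name.endswith(".rels"):
            report["activex"].append(name)
    return report


def code_parts_in_package(path_or_file):
    """Report VBA and ActiveX parts in a Word, Excel, or PowerPoint 2007 file."""
    with zipfile.ZipFile(path_or_file) as package:
        return code_parts(package.namelist())
```

The check only reports that such parts exist. Reviewing the code comments or control data inside them still has to be done in the Visual Basic Editor or in the application itself.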
Other things to keep in mind when you use Document Inspector
Some collaboration or workflow features embed an email address in a file as metadata. Document Inspector usually removes these types of embedded email addresses, unless you use a send-for-review feature to embed an email address. In this case, Document Inspector doesn’t remove the email address because it assumes that you want someone to send the document back to you after they are done reviewing it. Keep in mind, Document Inspector doesn’t remove email addresses that are added to the content of a document, workbook, or presentation, such as an email address that appears in a cell or in a paragraph or on a slide.
Also, Document Inspector doesn’t remove hyperlinks, unless the hyperlinks are contained in some type of metadata that Document Inspector does remove, such as a document property, a watermark, a header, or a footer. For example, if you add a hyperlink to a comment, and you use Document Inspector to remove comments, then the hyperlink is removed along with the comment. But if you add a hyperlink to a paragraph or put a hyperlink in a cell or on a slide, Document Inspector will just see it as ordinary content and it won’t remove it.
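If you need an inventory of the hyperlinks that remain in a file, the relationship parts of the package are a reasonable place to look, because external hyperlink targets in Office 2007 files are generally stored there. The sketch below scans every .rels part for relationships of the hyperlink type. The function names are illustrative, and links that point to locations inside the same file (such as bookmarks) are not stored as relationships, so this check does not report them.

```python
import zipfile
import xml.etree.ElementTree as ET

REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_targets(rels_xml):
    """Return the targets of all hyperlink relationships in one .rels part."""
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target")
        for rel in root.iter(REL + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]


def hyperlinks_in_package(path_or_file):
    """Map each .rels part name to the hyperlink targets it declares."""
    report = {}
    with zipfile.ZipFile(path_or_file) as package:
        for name in package.namelist():
            if name.endswith(".rels"):
                targets = hyperlink_targets(package.read(name))
                if targets:
                    report[name] = targets
    return report
```

The name of the .rels part indicates where each link lives. For example, links declared in word/_rels/document.xml.rels belong to the body of a Word document.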
File names, file paths, template names, and template paths can all be problematic as well, especially if you use template names or file names that contain metadata. In general, Document Inspector does not remove any of these things from a file. If it did, the program would not know which template to attach to your file or where to save it. A good rule to remember when choosing template names and file names is to keep them generic and not use naming conventions that contain personal or private information.
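One place where a template name can be recorded is the extended properties part of the package (docProps/app.xml), which may contain a Template element. The sketch below reads that element so you can check whether the name reveals anything private. It is an illustration under that assumption only: the function names are hypothetical, and template paths can also be stored elsewhere in the package, which this check does not cover.

```python
import zipfile
import xml.etree.ElementTree as ET

EP = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def template_name(app_xml):
    """Return the Template value from an extended properties part, or None."""
    element = ET.fromstring(app_xml).find(EP + "Template")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


def template_name_in_package(path_or_file):
    """Read docProps/app.xml from an Office 2007 package, if it exists."""
    with zipfile.ZipFile(path_or_file) as package:
        if "docProps/app.xml" not in package.namelist():
            return None
        return template_name(package.read("docProps/app.xml"))
```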
Field codes in Word documents can be problematic, too, because Document Inspector removes the contents of field codes, but it doesn’t remove the field code itself. For example, if you add the author field code to a document, Document Inspector removes the author name from the field code, but it keeps the author field code in the document. For more information about field codes, see Field codes in Word (http://go.microsoft.com/fwlink/?LinkId=154134).
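To find out which field codes remain in a Word document, you can list the field instructions stored in word/document.xml. The sketch below collects instructions from both simple fields (the instr attribute of fldSimple elements) and complex fields (instrText elements). The function names are illustrative. Note that Word can split the instruction of a complex field across several instrText elements, in which case this simple version reports the pieces separately.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def field_instructions(document_xml):
    """Return field instruction strings in document order."""
    root = ET.fromstring(document_xml)
    found = []
    for element in root.iter():
        if element.tag == W + "fldSimple":
            text = element.get(W + "instr", "")
        elif element.tag == W + "instrText":
            text = element.text or ""
        else:
            continue
        if text.strip():
            found.append(text.strip())
    return found


def field_instructions_in_docx(path_or_file):
    """Scan the main document part of a .docx package for field codes."""
    with zipfile.ZipFile(path_or_file) as package:
        return field_instructions(package.read("word/document.xml"))
```

A result that includes an instruction such as AUTHOR indicates that the field code is still in the document, even if Document Inspector has already cleared the value it displayed.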
Document Inspector also doesn’t inspect SmartArt, WordArt, shapes, or quick parts. It assumes these items are part of the content you are creating, so it removes neither the items themselves nor the labels or text that you add to them.
Finally, Document Inspector doesn’t remove custom prompt text in PowerPoint presentations. You can add custom prompt text to a slide master, thereby overwriting the placeholder text that users see when they create new slides. Be sure to replace or remove the custom prompt text if it contains personal or private information. If you don’t, anyone who opens the presentation and then views the slide master will be able to see the custom prompt text.
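Custom prompt text can be located by scanning the slide master and slide layout parts of a .pptx package. The sketch below assumes that placeholders with custom prompts are marked with the hasCustomPrompt attribute on their ph element, and that the prompt text is stored in the shape’s text runs. The function names are illustrative, and the check is a starting point rather than a complete review of a presentation’s masters.

```python
import zipfile
import xml.etree.ElementTree as ET

P = "{http://schemas.openxmlformats.org/presentationml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def custom_prompts(part_xml):
    """Return the text of placeholders flagged as having a custom prompt."""
    root = ET.fromstring(part_xml)
    prompts = []
    for shape in root.iter(P + "sp"):
        placeholder = shape.find(P + "nvSpPr/" + P + "nvPr/" + P + "ph")
        if placeholder is None:
            continue  # ordinary shape, not a placeholder
        if placeholder.get("hasCustomPrompt") not in ("1", "true"):
            continue
        text = "".join(t.text or "" for t in shape.iter(A + "t"))
        if text.strip():
            prompts.append(text.strip())
    return prompts


def custom_prompts_in_presentation(path_or_file):
    """Map each slide master or layout part to its custom prompt text."""
    report = {}
    folders = ("ppt/slideMasters/", "ppt/slideLayouts/")
    with zipfile.ZipFile(path_or_file) as package:
        for name in package.namelist():
            if name.startswith(folders) and name.endswith(".xml"):
                prompts = custom_prompts(package.read(name))
                if prompts:
                    report[name] = prompts
    return report
```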
Some final thoughts about using Document Inspector
Document Inspector is just a tool that helps you remove various types of metadata from Excel, PowerPoint, and Word files. Like the spelling checker, it’s designed to help you perform a specific publishing task, but it’s not designed to take the place of common workflow processes, such as technical and legal review, peer review, and editorial review. Also, as you incorporate Document Inspector into your organization’s publishing workflows, make sure that those workflows aren’t inadvertently putting unwanted metadata back into a file after Document Inspector removes it. Some workflows might add metadata back into a file when it’s sent for review or when it’s printed. Examples of things that could be added back include watermarks, author information, and printer paths.
If your organization has specific compliance requirements or workflow requirements that aren’t met by the default Document Inspector modules, try using the Document Inspector API to create a custom solution or try using a third-party scrubbing tool. The following companies provide applications and services for scrubbing metadata from Office files.
Unedged Software, LLC
Esquire Innovations, Inc.
Payne Consulting Group
BEC Legal Systems