Debugging the Crawler Parsing in FAST Search for SharePoint 2010

I was recently clued into a pretty interesting little tool that can be useful when debugging how a document is crawled using FAST Search for SharePoint 2010. It will crack a document and break out exactly which properties are being extracted and with what values, as well as how it’s crawling the contents of the document itself.

The tool is in the %FastSearch%\Bin directory and is called ifilter2html.exe. Save the file you want to examine to disk first, because the tool won’t fetch it over HTTP. Once you have it locally you can run the tool like this: ifilter2html.exe -x <yourFileName> someFileToSaveItTo.xml
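If you want to run the tool over a handful of files, a small wrapper can save some typing. This is only a sketch in Python: the function names are my own, and it assumes the FASTSEARCH environment variable points at the install directory and that the only arguments needed are the ones shown above (-x, the input file, the output file).

```python
import os
import subprocess


def build_command(source_file, output_xml, tool_dir=None):
    """Build the argument list for ifilter2html.exe (hypothetical helper)."""
    if tool_dir is None:
        # Assumption: FASTSEARCH points at the FAST install directory.
        tool_dir = os.path.join(os.environ.get("FASTSEARCH", ""), "bin")
    tool = os.path.join(tool_dir, "ifilter2html.exe")
    # Same arguments as the command line in the post: -x <input> <output>
    return [tool, "-x", str(source_file), str(output_xml)]


def dump_chunks(source_file, output_xml, tool_dir=None):
    """Run the tool on a local file; raises CalledProcessError on failure."""
    subprocess.run(build_command(source_file, output_xml, tool_dir), check=True)


# Show the command that would be run, without actually running it.
print(build_command("sample.docx", "sample.xml", tool_dir=r"C:\FASTSearch\bin"))
```

Calling dump_chunks("sample.docx", "sample.xml") on the FAST server would then write the chunk XML to sample.xml.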

For example, here are the contents of a simple Word doc I created:

This is a sample document for the Share-n-Dipity blog. It is being used to demonstrate the crawl debugging features of FAST Search for SharePoint 2010.

What’s also interesting is that I set the properties for Title (Crawl Sample Document), Subject (FAST Crawl Debugging), Manager (Terri Peschka), etc. You’ll see all of these properties broken out in the output. Here’s what the XML looks like (I deleted several of the other chunks for brevity):

  <?xml version="1.0" encoding="utf-8" ?>

<chunks>

  <chunk id="1" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/2" chunkSourceID="1" startSource="0" lenSource="0">Crawl Sample Document</chunk>

  <chunk id="2" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/3" chunkSourceID="2" startSource="0" lenSource="0">FAST Crawl Debugging</chunk>

  <chunk id="3" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/4" chunkSourceID="3" startSource="0" lenSource="0">Steve Peschka</chunk>

  <chunk id="9" breakType="EOP" flags="VALUE" vartype="64" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/12" chunkSourceID="9" startSource="0" lenSource="0">2010-11-02T23:27:00Z</chunk>

  <chunk id="11" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/7" chunkSourceID="11" startSource="0" lenSource="0">Normal.dotm</chunk>

  <chunk id="16" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/18" chunkSourceID="16" startSource="0" lenSource="0">Microsoft Office Word</chunk>

  <chunk id="21" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="D5CDD502-2E9C-101B-9397-08002B2CF9AE/27" chunkSourceID="21" startSource="0" lenSource="0">Draft</chunk>

  <chunk id="29" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="D5CDD502-2E9C-101B-9397-08002B2CF9AE/14" chunkSourceID="29" startSource="0" lenSource="0">terri.peschka@foo.com</chunk>

  <chunk id="30" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="D5CDD502-2E9C-101B-9397-08002B2CF9AE/15" chunkSourceID="30" startSource="0" lenSource="0">Microsoft</chunk>

  <chunk id="36" breakType="EOP" flags="TEXT" vartype="30" locale="1033 (en-US)" attribute="B725F130-47EF-101A-A5F1-02608C9EEBAC/19" chunkSourceID="36" startSource="0" lenSource="0">This is a sample document for the Share-n-</chunk>

  <chunk id="37" breakType="NO_BREAK" flags="TEXT" vartype="30" locale="1033 (en-US)" attribute="B725F130-47EF-101A-A5F1-02608C9EEBAC/19" chunkSourceID="37" startSource="0" lenSource="0">Dipity</chunk>

  <chunk id="38" breakType="NO_BREAK" flags="TEXT" vartype="30" locale="1033 (en-US)" attribute="B725F130-47EF-101A-A5F1-02608C9EEBAC/19" chunkSourceID="38" startSource="0" lenSource="0">blog. It is being used to demonstrate the crawl debugging features of FAST Search for SharePoint 2010.</chunk>

  </chunks>

Several of the mappings are obvious from their values. You may also find it interesting how it split up the word “Share-n-Dipity”: “Dipity” lands in a chunk of its own, so even “Dipity” is searchable.
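Once the output has more than a few chunks, reading it by eye gets tedious. Here is a rough Python sketch that separates the property chunks (flags="VALUE") from the body text chunks (flags="TEXT") and splits each attribute into its GUID and property id. The function name is made up, and the sample is just three of the chunks from the output above.

```python
import xml.etree.ElementTree as ET

# Three chunks copied from the output above, trimmed for brevity.
SAMPLE = """<chunks>
  <chunk id="1" breakType="EOP" flags="VALUE" vartype="31" locale="0 (en-US)" attribute="F29F85E0-4FF9-1068-AB91-08002B27B3D9/2" chunkSourceID="1" startSource="0" lenSource="0">Crawl Sample Document</chunk>
  <chunk id="36" breakType="EOP" flags="TEXT" vartype="30" locale="1033 (en-US)" attribute="B725F130-47EF-101A-A5F1-02608C9EEBAC/19" chunkSourceID="36" startSource="0" lenSource="0">This is a sample document for the Share-n-</chunk>
  <chunk id="37" breakType="NO_BREAK" flags="TEXT" vartype="30" locale="1033 (en-US)" attribute="B725F130-47EF-101A-A5F1-02608C9EEBAC/19" chunkSourceID="37" startSource="0" lenSource="0">Dipity</chunk>
</chunks>"""


def summarize_chunks(xml_text):
    """Return (properties, text_chunks) from ifilter2html-style chunk XML."""
    root = ET.fromstring(xml_text)
    properties = []   # (property set GUID, property id, value)
    text_chunks = []  # (breakType, text)
    for chunk in root.iter("chunk"):
        text = chunk.text or ""
        if chunk.get("flags") == "TEXT":
            text_chunks.append((chunk.get("breakType"), text))
        else:
            # The attribute looks like "<GUID>/<property id>".
            propset, _, prop_id = chunk.get("attribute", "").partition("/")
            properties.append((propset, prop_id, text))
    return properties, text_chunks


properties, text_chunks = summarize_chunks(SAMPLE)
for propset, prop_id, value in properties:
    print(propset, prop_id, value)
for break_type, text in text_chunks:
    print(break_type, repr(text))
```

For a real output file you would read the file contents and pass them in instead of SAMPLE.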

If you want to dig further into the property mappings, you can. I suggest you start by getting a list of all the crawled properties. In PowerShell on the FAST server you can dump them all to a text file like this: Get-FASTSearchMetadataCrawledProperty > c:\CrawledProps.txt. Now you can copy the attribute value from the XML above and look for it. For example, what looks like the Title property has this attribute value: F29F85E0-4FF9-1068-AB91-08002B27B3D9/2. The first part is the property set GUID, so I start by searching my CrawledProps.txt file for that GUID, and the first hit is this entry:

CategoryName : Office

IsMappedToContents : False

IsNameEnum : True

IsMultiValued : False

Name : 4

Propset : f29f85e0-4ff9-1068-ab91-08002b27b3d9

VariantType : 31

 

The second part of the attribute value is the property name, in this case “2”. (The entry above happens to be property “4”, but it is in the same property set, which is enough to give us the category.) So now you can go into Central Administration, Manage Service Applications, click on your FAST Query service application, then click on the FAST Search Administration link in the top left. Click on Crawled Property Categories and you’ll see all the property categories. From the CrawledProps.txt file we see that the category name is “Office”, so click the “Office” link in the list of categories. When I do that, I see a property named “2”, and I see it is mapped to “Title”. So hey, it really does work! This can be pretty nice when you’re curious or trying to troubleshoot why some attribute or content from a document isn’t showing up in the search results as you expect.
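If you would rather not scroll through CrawledProps.txt by hand, the lookup can be scripted. The sketch below is Python and makes a few assumptions: the dump is a series of “Key : Value” lines like the entry shown earlier, and a new entry starts whenever a key repeats. The function names are mine, and the second sample entry (Name 2) is made up for illustration; only the first one comes from my actual dump.

```python
SAMPLE_DUMP = """
CategoryName : Office
IsMappedToContents : False
IsNameEnum : True
IsMultiValued : False
Name : 4
Propset : f29f85e0-4ff9-1068-ab91-08002b27b3d9
VariantType : 31

CategoryName : Office
IsMappedToContents : False
IsNameEnum : True
IsMultiValued : False
Name : 2
Propset : f29f85e0-4ff9-1068-ab91-08002b27b3d9
VariantType : 31
"""


def parse_records(dump_text):
    """Split a 'Key : Value' listing into one dict per crawled property."""
    records, current = [], {}
    for line in dump_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not key:
            continue  # blank or unrecognized line
        if key in current:
            # A repeated key means the next entry has started.
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def find_property(records, attribute):
    """Look up an attribute of the form '<GUID>/<name>'; None if not found."""
    propset, _, name = attribute.partition("/")
    for record in records:
        same_set = record.get("Propset", "").lower() == propset.lower()
        if same_set and record.get("Name") == name:
            return record
    return None


hit = find_property(parse_records(SAMPLE_DUMP),
                    "F29F85E0-4FF9-1068-AB91-08002B27B3D9/2")
print(hit["CategoryName"] if hit else "not found")
```

One thing to watch for: Windows PowerShell’s > redirection writes UTF-16 by default, so when reading the real file you will probably need open(path, encoding="utf-16").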