What is WebAnalyzer and why do multiple content SSAs break it?

What is WebAnalyzer?

The WebAnalyzer analyzes data to automatically improve the relevance of search results by counting and analyzing links. The document processing pipeline contains a stage named WebAnalyzer that parses links from documents at document processing time.

One gotcha is that the WebAnalyzer depends on other stages in the document processing pipeline to parse out hyperlinks from non-text documents. In most cases, for example with Word documents, hyperlinks are parsed out as pure text instead of <a href> hyperlinks. In those cases the WebAnalyzer will not analyze those hyperlinks.
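To make the gotcha concrete, here is a minimal sketch of a link extractor that only looks at <a href> markup. It is not the WebAnalyzer's actual parser; the class name, the function name, and the sample URL are all made up for illustration, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class AnchorLinkExtractor(HTMLParser):
    """Hypothetical extractor: collects href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only real anchor markup counts as a link.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return every <a href> target found in the given markup."""
    parser = AnchorLinkExtractor()
    parser.feed(markup)
    parser.close()
    return parser.links


# A web page keeps its link as markup, so the extractor sees it.
html_page = '<p>See <a href="http://intranet/policy">the policy</a>.</p>'

# A converted Word document often carries the same link as plain text only
# (assumed sample output, not real converter output).
converted_doc = "See the policy at http://intranet/policy."

html_links = extract_links(html_page)
doc_links = extract_links(converted_doc)

print(html_links)
print(doc_links)
```

The second document mentions the same URL, but since it never appears inside an anchor tag there is nothing for a link-based analysis to count.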

 

The problem

With multiple content SSAs, the SP crawler creates a contentid for each document instead of using the URL/path. With two SP crawlers, the same contentid will (eventually) be used for two different documents. WebAnalyzer does not handle this well, because it uses the contentid alone as the identifier, not contentid and collection name as the indexer does. Having the same contentid for different documents will break the link analysis in a catastrophic, if not laughable, manner.
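The collision can be sketched with a toy example. This is not how WebAnalyzer or the indexer is implemented; the collection names, contentid values, and titles are invented, and the only point is the difference between keying on contentid alone and keying on collection name plus contentid.

```python
from collections import defaultdict

# Hypothetical documents from two SharePoint crawlers (two content SSAs).
# Each crawler hands out its own ids, so both end up emitting contentid "1001".
documents = [
    {"collection": "sp_a", "contentid": "1001", "title": "HR policy"},
    {"collection": "sp_b", "contentid": "1001", "title": "Lunch menu"},
    {"collection": "sp_a", "contentid": "1002", "title": "HR portal"},
]


def build_store(docs, key_fn):
    """Group document titles under whatever key key_fn produces."""
    store = defaultdict(list)
    for doc in docs:
        store[key_fn(doc)].append(doc["title"])
    return dict(store)


# Keyed the way the post describes WebAnalyzer: contentid only.
by_contentid = build_store(documents, lambda d: d["contentid"])

# Keyed the way the post describes the indexer: collection name and contentid.
by_collection_and_id = build_store(
    documents, lambda d: (d["collection"], d["contentid"])
)

# Any key holding more than one title is two different documents merged into one.
collisions = {k: v for k, v in by_contentid.items() if len(v) > 1}

print(collisions)
print(len(by_collection_and_id))
```

With the contentid-only key, "HR policy" and "Lunch menu" look like the same document, so any links counted for one are credited to the other. The composite key keeps all three documents apart.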

 

 

Official Documentation:

https://technet.microsoft.com/en-us/library/ff599525.aspx

https://technet.microsoft.com/en-us/library/gg405120.aspx

 

Short and sweet one this week. Next up is troubleshooting refiners through FS4SP to SharePoint.