Interpreting “Crawl Queue” report with Sharepoint search

We’ve noticed something interesting a few weeks ago while working on a FAST Sharepoint 2010 system.  One of the advanced Search Administration Reports called Crawl Queue report was suddenly showing a consistent spike in “Links to Process” counter.  This report is specific to the FAST Content SSA and the two metrics shown are:

  • Links to Process:  Incoming links to process
  • Transactions Queued:  Outgoing transactions queued

New crawls would then start and complete without problems, which led us to believe that this had to do with a specific crawl that never completed.  We took a guess that perhaps one of the crawls has been in a Paused state, which ended up to be a correct assumption and saved us from writing SQL statements to figure out what state various crawls were in and so on.  Once this particular crawl was resumed, Links to Process went down as expected.  This process did give me a reason to explore what exactly happens when a crawl starts up and is either Paused or Stopped.

My colleague Brian Pendergrass describes a lot of these details in the following articles:

https://blogs.msdn.com/b/sharepoint_strategery/archive/2012/10/30/sp2010-search-explained-crawling.aspx

https://blogs.msdn.com/b/sharepoint_strategery/archive/2014/02/10/sharepoint-search-and-deadlocks-in-sql-server.aspx

If I had to just do a very high-level description, here is what happens when a crawl starts up:

  • The HTTP protocol handler starts by making an HTTP GET request to the URL specified by the content source.
  • If the file is successfully retrieved, a Finder() method will enumerate items-to-be-crawled from the WFE(Web Front-End) and all the links found in each document will be added to the MSSCrawlQueue table.  Gathering Manager called msssearch.exe is responsible for that.  This is exactly what “Links to Process” metric shows on a graph, it’s the links found in each document but not yet crawled.  If there are links still to process, they will be seen by querying the MSSCrawlQueue table.
  • Items actually scheduled to be crawled and waiting on callbacks are also seen in MSSCrawlURL table.  This corresponds to “Transactions queued” metric in the graph.
  • Each Crawl Component involved with the crawl will then pull a subset of links from MSSCrawlQueue table and actually attempt to retASTrieve each link from the WFE.
  • These links are removed from the MSSCrawlQueue once each link has been retrieved from a WFE and there is a callback that indicates that this item has now been processed/crawled.

Once a specific crawl was set to a Paused state, enumerated items-to-be crawled stayed in MSSCrawlQueue table and were not cleared out, corresponding to “Links to Process” metric in the graph.  If instead we attempted to Stop a crawl, these links would have actually be cleared out from the table.

This behavior should be similar with Sharepoint 2013 Search.