SharePoint 2013 Search Architecture Part 1 - Intro

06/06/2013

Several things have changed in SharePoint 2013 Search. As you might already be aware, we merged the core SharePoint 2010 Search product and FAST Search Server 2010 for SharePoint, and built upon both, to create a single product\engine simply called SharePoint 2013 Search. So much has changed that it takes a series of blog posts to cover all of the areas, and this is the first of them. This post just highlights the major changes and provides a quick intro to the Search architecture. The remaining Search Architecture posts will dive into greater detail on Crawling, Feeding, Indexing, Fetching, and Search Administration.

SharePoint Search 2013 Framework Intro

Back in SharePoint 2010, the majority of search ran under mssearch.exe. To be more specific, when a Crawl Component or Query Component was created, the server hosting the component would start the logical search service instance, which enabled and started the SharePoint Server Search 14 Windows service. This service started and initialized the mssearch.exe process. The mssearch.exe process was the core process that ran search in SharePoint 2010, and it still exists in SharePoint 2013.

In SharePoint 2013, we have an entirely new framework. When a component is provisioned on a server, the logical SharePoint Server Search service instance must be started first, and this is done in PowerShell via Start-SPEnterpriseSearchServiceInstance. This starts the logical search service instance on the specified server, where it is visible within Central Administration\System Settings\Manage Services on Server.

After the search service instance is started on a server, the SharePoint Server Search 15 Windows service is started, which is responsible for mssearch.exe. A new Windows service called the SharePoint Search Host Controller service is also started. It spawns an instance of the hostcontrollerservice.exe process. In SharePoint 2013, we moved the majority of Search processing underneath the SharePoint Search Host Controller service. However, we still use mssearch.exe for crawls via the crawl component.

Question: What is the Search Host Controller Service?

From TechNet: https://technet.microsoft.com/en-us/library/jj219591.aspx

"This service manages the search topology components. The service is automatically started on all servers that run search topology components."

I agree with TechNet that it manages the search topology components, but what does that really mean? First, each search component, with the exception of the crawl component, runs within a noderunner.exe process. This is essentially a .NET application, and each search component runs within its own noderunner.exe process. So if 3 search components are provisioned on a server and none of them are crawl components, Task Manager will show 3 noderunner.exe processes in addition to hostcontrollerservice.exe. The host controller service is responsible for initializing, stopping, and monitoring the search components ("nodes") that run within the noderunner.exe processes. If the SharePoint Search Host Controller Windows service is stopped, the search components running on that server will not function until it's restarted.
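To make the process counting above concrete, here's a minimal Python sketch. It is only a toy model of the rule described in this post (one noderunner.exe per non-crawl component, plus one host controller process), not anything that ships with SharePoint. The function name and the component labels are my own for illustration, and it assumes the server hosts at least one search component.

```python
def expected_processes(components):
    """Return the process counts this post describes for one search server.

    `components` is a list of component type names hosted on the server.
    Assumption: the list is non-empty, so the host controller is running.
    """
    # The crawl component is the exception: it runs under mssearch.exe,
    # so it does not get its own noderunner.exe process.
    node_components = [c for c in components if c != "Crawl"]
    return {
        "hostcontrollerservice.exe": 1,
        "noderunner.exe": len(node_components),
    }


# Three non-crawl components on one server: three noderunner.exe processes.
print(expected_processes(["Index", "Query Processing", "Admin"]))
# → {'hostcontrollerservice.exe': 1, 'noderunner.exe': 3}
```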

Component Intro

SharePoint Search 2013 consists of the following components:

Admin Component
Crawl Component
Content Processing Component
Analytics Processing Component
Index Component
Query Processing Component

For a complete Search product, each one of these components needs to be provisioned. Each Search Service Application contains an active search topology, and the components are assigned to that topology upon creation. In SharePoint 2010, you had to deal with two topologies, Crawl and Query, which added complexity when configuring Search 2010 via PowerShell. The good news is that everything is now managed in a single topology. Initially, when creating a Search Service Application within Central Administration, all components are provisioned on one server. For larger Search deployments, it's recommended to scale these components out to other servers. The only way to do that after Search is provisioned is by using PowerShell. Please see the following article for more information on how to do this with PowerShell.

https://technet.microsoft.com/en-us/library/jj862354.aspx

All of these components can be made fault tolerant or scaled out for performance gains, depending on the size of your Search deployment. I won't go into too much detail about this during the intro, but I may dive into some of it in more in-depth blog posts.
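As a rough illustration of the "every component type must be provisioned" rule, here's a small Python sketch that checks a topology laid out across servers. The dictionary layout and the function name are hypothetical and only stand in for what you would actually inspect through PowerShell.

```python
# The six component types listed in this post.
REQUIRED_COMPONENTS = {
    "Admin",
    "Crawl",
    "Content Processing",
    "Analytics Processing",
    "Index",
    "Query Processing",
}


def missing_components(topology):
    """Return the component types that no server in the topology hosts.

    `topology` maps a server name to the list of component types on it
    (a made-up structure used only for this example).
    """
    provisioned = {c for comps in topology.values() for c in comps}
    return sorted(REQUIRED_COMPONENTS - provisioned)


# A scaled-out topology where nobody hosts the index component yet.
scaled_out = {
    "APP1": ["Admin", "Crawl", "Content Processing", "Analytics Processing"],
    "APP2": ["Query Processing"],
}
print(missing_components(scaled_out))  # → ['Index']
```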

Crawl Component Intro

The crawl component works very much like it did in SharePoint 2010. It still runs under mssearch.exe and invokes daemon processes during a crawl to fetch content. I'll go into greater depth in the next blog post, on Crawl and Feed. Just like in SharePoint 2010, the crawl component doesn't store the index. However, the major difference between crawling in SharePoint 2010 and SharePoint 2013 is the destination for crawled items. During a SharePoint 2010 crawl, as items were being indexed (built in memory), they were streamed\propagated to a Query Server and indexed there. During a SharePoint 2013 crawl, crawled items are sent over to the Content Processing Component for further processing. Subsequent blog posts will cover how this works behind the scenes in more depth.

During a SharePoint 2010 crawl, properties were written to the Property Store Database as they were discovered. In SharePoint 2013, we no longer have a Property Store Database, so crawled items and their properties are sent over to the content processing component for further processing. Yes, the final destination for crawled items and their properties is the index component, but they must pass through the content processing component first.

In SharePoint 2010, it was easy to end up in a state where a very long incremental crawl would overlap with the next scheduled incremental crawl. This usually occurred because lots of security changes were being processed. Please see some of my original blog posts if you're curious about security-only crawls. Because the original crawl was still running, the scheduled incremental crawl was skipped, and another one wouldn't start until the next scheduled time, assuming the original crawl had completed by then. This ultimately affects crawl freshness, which is the time it takes from the moment a user uploads a document to when it's indexed and available in search results. SharePoint 2013 has two new crawl types, Continuous Crawl and Cleanup Crawl, which improve crawl freshness tremendously. I'll discuss this in more depth in the next blog post.

Content Processing Component Intro

The Content Processing Component (CPC) receives crawled content from the crawl component and performs tasks like document parsing, metadata extraction, link extraction, and property mapping, to name a few. After processing items, the content processing component sends them over to the index component to be indexed. This is new to SharePoint 2013: in SharePoint 2010, the crawl component was ultimately responsible for extracting metadata and links and for property mappings, and it used multiple plug-ins for this purpose. In SharePoint 2013, the interesting thing about the CPC and link extraction is that it extracts links and stores them in the links database. These links are processed later by the Analytics Processing Component.
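The hand-off described above can be sketched in a few lines of Python. This is a toy stand-in, not how the CPC is actually implemented: the function, the in-memory "links database" list, the dictionary "index", and the property map are all assumptions made for illustration.

```python
import re

# Naive link pattern, good enough for the example only.
LINK_PATTERN = re.compile(r'href="([^"]+)"')

# Illustrative crawled-property to managed-property mapping.
PROPERTY_MAP = {"ows_Title": "Title", "ows_Author": "Author"}


def process_crawled_item(item, links_db, index):
    """Toy content processing step for one crawled item."""
    # Link extraction: discovered links go to the links store, where the
    # analytics processing component would pick them up later.
    for target in LINK_PATTERN.findall(item["body"]):
        links_db.append((item["url"], target))
    # Property mapping: rename crawled properties to managed properties.
    managed = {PROPERTY_MAP.get(k, k): v for k, v in item["properties"].items()}
    # Hand-off: the processed item is sent on to the index.
    index[item["url"]] = managed


links_db, index = [], {}
process_crawled_item(
    {
        "url": "http://intranet/doc1",
        "body": '<a href="http://intranet/doc2">related</a>',
        "properties": {"ows_Title": "Doc 1"},
    },
    links_db,
    index,
)
print(links_db)  # → [('http://intranet/doc1', 'http://intranet/doc2')]
print(index)     # → {'http://intranet/doc1': {'Title': 'Doc 1'}}
```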

Analytics Processing Component Intro

In SharePoint 2010, the Web Analytics service application was responsible for analytics processing. In SharePoint 2013, the Web Analytics service application has been removed, and all analytics processing is now performed by the analytics processing component within the Search Service Application. During a crawl, as analytics information such as links and anchor text is discovered, it is eventually processed by the analytics processing component. This is referred to as search analytics. The analytics processing component also processes user-initiated analytics, such as items clicked. This is referred to as usage analytics. The analytics processing component uses both the links database and the analytics reporting database. It makes a lot of sense to put this under the Search umbrella for the simple fact that, after analytics processing, the analytics data is committed to the index and is used in a variety of ways, like boosting the relevance of search results or showing the number of clicks in the hover panel over a search result. Not only does this make handling analytics and search data more efficient, it also improves crawl freshness. Back in SharePoint 2010, we had the concept of a secondary crawl behind the scenes called an anchor crawl. It processed links discovered during a crawl and was visible when the crawl status showed "computing ranking" on the Search Administration page. The crawler no longer performs an anchor crawl; this processing is now performed by the analytics processing component, so crawl freshness is again improved.

Please see the following TechNet article for additional details on the analytics processing component:

https://technet.microsoft.com/en-us/library/jj219554.aspx

Index Component Intro

The index component hosts the actual index itself. It receives data from the content processing component and indexes this data. It also receives search query requests coming in from the Query Processing Component and returns results. As I stated before, the index stores both crawled items and their associated properties. The index is more efficient now because it's been broken up into update groups. A single crawled document could be indexed across several different update groups. Yes, each update group contains a unique portion of the index. This allows for partial updates, which means that if I make a change to a document, only that change is updated within the index of the associated update group instead of the entire document. Back in SharePoint 2010, we would need to update and re-index the entire document. Also, we no longer store the index on servers hosting a Query component, which was the case in SharePoint 2010. The whole concept of propagating index items from the crawler to a query server hosting a query component no longer applies in SharePoint 2013.
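To illustrate the idea of partial updates, here's a minimal Python sketch of an index split into update groups. The class, the group names ("content" and "security"), and the property-to-group assignment are all hypothetical; the sketch only shows that a change touching one group leaves the other groups alone.

```python
class ToyIndex:
    """Toy index where each update group holds its own slice of a document."""

    def __init__(self, update_groups):
        # update_groups maps a group name to the set of properties it owns.
        self.group_of = {
            prop: group for group, props in update_groups.items() for prop in props
        }
        self.groups = {group: {} for group in update_groups}

    def index_document(self, doc_id, properties):
        """Write only the groups that own a changed property; return them."""
        touched = {}
        for prop, value in properties.items():
            touched.setdefault(self.group_of[prop], {})[prop] = value
        for group, values in touched.items():
            self.groups[group].setdefault(doc_id, {}).update(values)
        return sorted(touched)


toy = ToyIndex({"content": {"title", "body"}, "security": {"acl"}})

# First indexing of the document spans both update groups.
print(toy.index_document("doc1", {"title": "Plan", "body": "...", "acl": ["alice"]}))
# → ['content', 'security']

# A permissions-only change is a partial update to a single group.
print(toy.index_document("doc1", {"acl": ["alice", "bob"]}))
# → ['security']
```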

Query Processing Component Intro

When a search query comes in from a web front end (WFE), the Query Processing Component (QPC) analyzes and processes the query to optimize precision, recall, and relevancy. The processed query is then submitted to the index components. The QPC is where query rule matching takes place, and it will transform the query if there is a match. The QPC also performs word breaking, linguistics\stemming, and parsing operations. It packages the results from the index and passes them back to the WFE, which passes them on to the user. The biggest change in SharePoint 2013 is that the QPC now runs under noderunner.exe and no longer retrieves properties from the property store or calls the Search Admin database for security trimming when processing a query. It only fetches results from the index components, which simplifies things tremendously.

Note: Yes you read correctly, security descriptors associated with crawled items are also now solely stored in the index.
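Here's a small Python sketch of what "everything comes from the index" means for a query. It is a deliberately simplified model: real security trimming works with security descriptors, while this example assumes a plain set of user names stored with each index entry, and the function name and data layout are made up for illustration.

```python
def run_query(index, term, user):
    """Return URLs of items that match `term` and that `user` may see.

    Both the searchable text and the (simplified) ACL come from the same
    index entry, so no second store is consulted during the query.
    """
    results = []
    for url, entry in index.items():
        matches = term.lower() in entry["body"].lower()
        allowed = user in entry["acl"]
        if matches and allowed:
            results.append(url)
    return sorted(results)


toy_index = {
    "http://intranet/budget": {"body": "Budget plan 2013", "acl": {"alice"}},
    "http://intranet/lunch": {"body": "Lunch plan", "acl": {"alice", "bob"}},
}

print(run_query(toy_index, "plan", "bob"))
# → ['http://intranet/lunch']
```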

Search Admin Component Intro

The search admin component manages and controls the entire search infrastructure. It maps to a Search Admin database, and it can be made fault tolerant by adding additional search admin components, which is yet another improvement over SharePoint 2010 search.

The search admin component governs topology changes and stores things like the following:

Topology
Crawl and Query Rules
Managed Property Mappings (Search Schema)
Content Sources
Crawl Schedules

That's it for the intro. As I publish additional posts that dive a little deeper into the SharePoint Search 2013 architecture, I'll link them below.

SharePoint 2013 Search Architecture Part 2 – Crawl and Feed

Thanks,

Russ Maxwell, MSFT
