Some interesting facts about SharePoint 2007 Search

Can we search in languages other than English? Do we need a language pack for that?

A language pack has nothing to do with searching in languages other than English, or other than the language in which SharePoint is deployed. Out of the box, MOSS already ships with word breakers and stemmers for the major languages, although their quality is very poor for some languages, such as Chinese and German.

Regardless of the quality of the word breakers, you may run into two problems that are by design:

1. At index time, the IFilter should emit the correct LCID for the document's language. However, this is not possible for some file types. For example, when processing TXT, XLS (XLSX) and RTF files, the IFilters return 1033 (en-us) instead of the correct LCID. As a result, for Japanese or Chinese content in these files, a search for any multi-character word may return nothing, and only single-character queries match. Other languages may have the same problem, although less obviously.

2. At query time, when a user submits a keyword through the browser, MOSS detects the browser's language setting and uses that value to call the corresponding word breaker. If this word breaker does not match the one used at index time, you are in trouble again. For example, if you search for a Chinese term from an English client without changing the browser language setting to Chinese, you will still get no results for a word, even if the files were indexed in the right language.
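The two problems above can be illustrated with a small toy model. This is only a sketch: the two breaker functions, the tiny dictionary and the helper names are all hypothetical stand-ins, not MOSS components. The per-character breaker is modeled on the symptom described above, where only single-character queries match.

```python
# Toy stand-ins for word breakers; real MOSS word breakers are
# language-specific components and behave differently in detail.
TOY_ZH_DICTIONARY = {"搜索", "引擎"}  # hypothetical two-word dictionary


def break_per_character(text):
    """Breaker that does not understand Chinese: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def break_with_dictionary(text, dictionary=TOY_ZH_DICTIONARY):
    """Chinese-aware breaker: greedy longest match against a dictionary."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first; fall back to a single character.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


def build_index(text, breaker):
    """Index = the set of tokens the index-time breaker emitted."""
    return set(breaker(text))


def search(index, query, breaker):
    """A query matches only if every query-time token is in the index."""
    return all(token in index for token in breaker(query))


document = "搜索引擎"

# Problem 1: the file was indexed with the wrong LCID (wrong breaker).
index_wrong_lcid = build_index(document, break_per_character)
# Correct case: the file was indexed with the Chinese-aware breaker.
index_right_lcid = build_index(document, break_with_dictionary)
```

In this model, `search(index_wrong_lcid, "搜索", break_with_dictionary)` finds nothing while the single character `"搜"` still matches, which mirrors problem 1. Searching `index_right_lcid` with `break_per_character` also fails, which mirrors problem 2: the index is fine, but the query-time breaker does not agree with it.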

The space needed for the index on the query machine is approximately 2.8 times the size of the actual index. What is the logic behind this?

Let's say the index size is X.
During crawls, we accumulate more shadow indexes because of items that are indexed. When these shadow indexes cover about 0.1 times X (10% of X), we do a master merge.

A master merge takes the 1.1X (X + 0.1X) and creates a new index from it in the same location as the old index. The size of the new index is roughly 1.1X. So until the old index is deleted, at least 2.2X is required (1.1X for the existing index and its shadow indexes, plus 1.1X for the new index).

However, since query servers are expected to be online at all times, the master merge should have minimal impact on query latencies. To achieve this, we use more than the 1.1X space by creating temp files during the master merge.

This leads to the worst case number of 2.8X so that master merges can succeed while not impacting query latencies.

The old index is then deleted on both the indexer and the query machines immediately after the master merge completes.
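The arithmetic above can be written out as a short calculation. This is a sketch using only the figures quoted in this answer (10% merge threshold, 2.8X worst case); the function name and the returned keys are made up for illustration, and the temp-file share is simply what remains after subtracting 2.2X from 2.8X.

```python
MERGE_THRESHOLD = 0.1    # shadow indexes reach about 10% of X before a master merge
WORST_CASE_FACTOR = 2.8  # worst-case multiplier quoted above


def master_merge_space(index_size_gb):
    """Break down the disk space needed around a master merge (sizes in GB)."""
    shadow = MERGE_THRESHOLD * index_size_gb       # 0.1X of shadow indexes
    old_total = index_size_gb + shadow             # 1.1X: old index + shadow indexes
    new_index = old_total                          # new merged index, roughly 1.1X
    both_indexes = old_total + new_index           # 2.2X while both exist
    worst_case = WORST_CASE_FACTOR * index_size_gb # 2.8X including temp files
    temp_files = worst_case - both_indexes         # implied temp-file share, ~0.6X
    return {
        "shadow": shadow,
        "both_indexes": both_indexes,
        "temp_files": temp_files,
        "worst_case": worst_case,
    }


space = master_merge_space(100)
for name, size_gb in space.items():
    print(f"{name}: {size_gb:.1f} GB")
```

For a 100 GB index this works out to roughly 220 GB for the two indexes side by side and 280 GB in the worst case, so the temp files account for about 60 GB.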

How are duplicate documents identified when we do a search?

Document similarity for the purpose of identifying duplicates is based only on a hash of the content of the document. No file properties (e.g. file name, type, author, create and modify dates) are input to this hash. The MSSDuplicateHashes table in the SSP’s search database holds, for each document, all the 64-bit hashes necessary to determine whether one document is a near-duplicate of another. This table is read at search time if duplicate collapsing is enabled.
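The content-only idea can be sketched in a few lines. This is not the hashing MOSS uses, which is not described here: the sketch assumes a single 64-bit BLAKE2b hash per document and therefore only catches exact content duplicates, whereas MSSDuplicateHashes stores several hashes per document to detect near-duplicates. The function names and sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_hash64(content: bytes) -> int:
    """Return a 64-bit hash computed from the document body only."""
    digest = hashlib.blake2b(content, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def collapse_duplicates(documents):
    """Group document names by content hash; name and author are ignored."""
    groups = defaultdict(list)
    for doc in documents:
        groups[content_hash64(doc["content"])].append(doc["name"])
    return list(groups.values())


documents = [
    {"name": "report.txt", "author": "alice", "content": b"Quarterly results"},
    {"name": "copy of report.txt", "author": "bob", "content": b"Quarterly results"},
    {"name": "notes.txt", "author": "alice", "content": b"Meeting notes"},
]

groups = collapse_duplicates(documents)
```

Here the two report files land in the same group even though their names and authors differ, because only the content feeds the hash.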

What are discovered definitions and how does search find those?

Discovered Definitions are a feature in MOSS that can be enabled/disabled in the properties of the SearchCoreResults webpart. When enabled, the results web part will display not only document matches for a term, but also any definitions it has discovered for that word during crawling.

Definition extraction in MOSS 2007 extracts definitions of terms from the indexed text.

Definition extraction is done during the crawl. The crawler looks for verb phrases such as ‘is a’ or ‘is the’ and then, when an unspecified threshold is reached, it extracts the definition of the related word for later use in the search results display, under the words “What people are saying about <term>”.

At query time, the search token that was passed in is compared with the existing entries in the definitions database. If a match is found, the definitions link is populated at the bottom of the search results page. Collapsing the link shows the number of definitions.
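The crawl-time extraction and query-time lookup described above can be pictured with a rough sketch. Everything in it is an assumption made for illustration: the regular expression, the threshold of two matching sentences, the in-memory dictionary standing in for the definitions database, and the function names. The actual MOSS rules and threshold are not documented here.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "<Term> is a/an/the <rest of sentence>."
DEFINITION_PATTERN = re.compile(r"\b([A-Z][A-Za-z0-9]*) (is (?:a|an|the) [^.]+)\.")
MIN_SENTENCES = 2  # hypothetical stand-in for the unspecified threshold


def extract_definitions(crawled_texts):
    """Crawl time: collect candidate sentences and keep terms over the threshold."""
    candidates = defaultdict(list)
    for text in crawled_texts:
        for term, rest in DEFINITION_PATTERN.findall(text):
            candidates[term.lower()].append(f"{term} {rest}.")
    return {
        term: sentences
        for term, sentences in candidates.items()
        if len(sentences) >= MIN_SENTENCES
    }


def lookup_definitions(definitions, query):
    """Query time: compare the search token with the stored entries."""
    return definitions.get(query.strip().lower(), [])


crawled = [
    "MOSS is a portal server product. It ships with search.",
    "People say MOSS is the successor to SharePoint Portal Server 2003.",
    "IFilter is a plug-in that extracts text from files.",
]

definitions = extract_definitions(crawled)
for sentence in lookup_definitions(definitions, "MOSS"):
    print("What people are saying about MOSS:", sentence)
```

In this sketch "MOSS" is defined in two crawled sentences and so passes the threshold, while "IFilter" appears in only one and is dropped, so a query for it would show no definitions link.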