How to use the Robots.txt file and HTML tags to prevent robots from crawling or accessing content on the site

12/06/2010

You can use a Robots.txt file to control where robots (Web crawlers) can go on a Web site. You can also use the Robots.txt file to exclude specific crawlers. Robots that honor the file read these rules and stay out of the areas that the rules exclude. The Web server itself does not enforce the rules. SharePoint Portal Server 2003 and SharePoint Server 2007 look for this file when they crawl, and they obey the restrictions that it contains.
You can prevent another server from crawling content on the portal site by modifying the Robots.txt file. For example, you might want to restrict a specific robot from accessing the server because the frequency of requests from the robot is overloading the Web site. You might also want to restrict all robots from certain areas on the server.
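The two cases above can be sketched and checked with Python's standard `urllib.robotparser` module. This is a minimal illustration, not a SharePoint feature: the robot name `BadBot`, the `/private/` path, and the host `portal.example` are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one robot entirely, and keep all
# other robots out of a single area of the site.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("BadBot", "http://portal.example/default.aspx"),
        ("OtherBot", "http://portal.example/private/report.htm"),
        ("OtherBot", "http://portal.example/default.aspx"),
    ]:
        print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Testing rules this way before you deploy the file helps catch a record that blocks more, or less, than intended.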
SharePoint Portal Server 2003 and SharePoint Server 2007 do not install a Robots.txt file. However, you can create a Robots.txt file and put it in the home directory of the default Web site on the server. To determine the home directory of the default Web site on the server, follow these steps:

1. Start Internet Information Services (IIS) Manager.

2. Expand server name, and then expand Web Sites.

3. Right-click Default Web Site, and then click Properties.

4. Click the Home Directory tab.

5. Make a note of the path that appears in the Local Path box, and then click Cancel.
Put the Robots.txt file in the path that appears in the Local Path box. For example, if the path is D:\Inetpub\Wwwroot, put the Robots.txt file in the D:\Inetpub\Wwwroot folder on the server. To confirm that the Robots.txt file is in the correct folder on the server, start your Web browser, and then type https://server name/robots.txt in the address bar.
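If you prefer to script the copy step, the following Python sketch writes the rules into the home directory and builds the address to check afterward. It assumes you already noted the Local Path value in the steps above. The function names `install_robots_txt` and `robots_url` are illustrative only and are not part of IIS or SharePoint.

```python
from pathlib import Path
from urllib.parse import urljoin


def install_robots_txt(home_directory: str, rules: str) -> Path:
    """Write rules to robots.txt in the given home directory.

    home_directory is the value from the Local Path box,
    for example D:\\Inetpub\\Wwwroot.
    """
    target = Path(home_directory) / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target


def robots_url(site_root: str) -> str:
    """Return the address at which the file should be reachable."""
    return urljoin(site_root, "/robots.txt")
```

After the file is written, open the address that `robots_url` returns in a Web browser to confirm that the server serves the file.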

You can restrict access to certain documents by using HTML META tags. The content attribute of a robots META tag tells the robot whether the document can be included in the index (INDEX or NOINDEX) and whether the robot can follow the links in the document (FOLLOW or NOFOLLOW). For example, you can mark a document with the following tag if you do not want the document indexed and you do not want links in the document followed:

<META name="robots" content="NOINDEX, NOFOLLOW">
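To illustrate how a robot might interpret this tag, the following Python sketch reads the robots META tag with the standard `html.parser` module. It is a simplified assumption about crawler behavior, not the SharePoint implementation, and the names `RobotsMetaParser` and `robots_meta_policy` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <META name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html_text: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow); both default to True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)
```

For a page that contains the tag shown above, `robots_meta_policy` returns `(False, False)`. For a page without a robots META tag, it returns `(True, True)`.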

SharePoint Portal Server 2003 and SharePoint Server 2007 automatically obey the restrictions that are contained in the Robots.txt file.
Note: For Microsoft Office SharePoint Server 2007, you must restart the Office SharePoint Server Search service before thesaurus updates are applied to search queries. Also, changes to thesaurus files must be manually copied to every server in the farm that is serving search queries. To be thorough and allow for topology changes, you can copy the changes to all servers in the farm.
