How to Exclude Pages
This how-to covers various techniques for preventing pages from being included in your search engine index. It also explains how to prevent the spider (indexer) from following one or more links on a page, and how to keep pages out of your site map and what's new page.

This tutorial is not a web/html primer. It assumes that you already know how "web surfing" works (a browser requests a page from a server, which returns the page to be viewed), what HTML tags are, and how to use them. If you are not familiar with these concepts, please read a basic web/html primer first.
Contents
Overview
Excluding Pages using the Control Center
Excluding Pages using Robots.txt
Excluding Part of a Page
Preventing Links from being Followed
Excluding Pages from the Site Map
Excluding Pages from What's New

Overview

You can tell the spider (indexer) to ignore parts of your site in a few different ways. In order of preference you can either use the Control Center or use a "robots.txt" file. If you want to prevent only part of a page from being indexed you can use special search engine HTML tags.

You can also make the spider ignore specific links or types of links. Since the spider finds the pages of your site by following links, preventing it from following specific links can be used to prevent parts of your site from being indexed. This technique is also used to change the structure of your site map by changing how the spider thinks your site is linked together.

The sections below cover your various options.

Excluding Pages using the Control Center

This is the preferred way of preventing pages from being included in your index. Simply log in to your account, go to the build index page, and use the exclude pages link. When the wizard appears, add your list of "exclusions", one per line (line wrapping in the browser can be ignored), and press the finish button to save your changes.

Each exclusion consists of a "URL mask" optionally followed by one or more exclusion modifiers.

The URL mask is simply a standard web address, but it may contain the common wildcards "*" and "?" so that it matches more than one web address. The "*" matches any sequence of characters (including none) and the "?" matches any single character. Non-wildcard characters are matched without regard to case (case-insensitive). URL masks which do not begin with "http://" are treated as if they begin with "*". Because of this it is recommended that you include the "http://" in your URL masks.
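The wildcard rules above can be sketched in Python. This is an illustrative helper assuming the semantics just described, not FreeFind's actual matcher:

```python
import re

def mask_to_regex(mask):
    """Compile a FreeFind-style URL mask into a regex (a sketch of the
    rules described above, not the spider's real implementation):
      - "*" matches any sequence of characters (including none)
      - "?" matches exactly one character
      - matching is case-insensitive
      - masks not beginning with "http://" behave as if prefixed with "*"
    """
    if not mask.lower().startswith("http://"):
        mask = "*" + mask
    pattern = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in mask
    )
    return re.compile(pattern, re.IGNORECASE)

# "/archive/*" matches an "archive" directory on any site:
print(bool(mask_to_regex("/archive/*").fullmatch("http://MySite.com/archive/old.html")))
```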

The URL mask may be followed by exclusion modifiers. There are two:
   index=no/yes
   follow=no/yes
The "index" modifier specifies whether pages matching the mask will be included in the index. The "follow" modifier specifies whether pages matching the mask will have their links followed in order to locate other pages to index. The default values are:
   index=no follow=no

When determining which exclusion to apply, the entire list of exclusions is considered and the last matching exclusion is used. This allows convenient expression of "exclude everything but..." logic. For example, to prevent everything in your "http://mysite.com/cgi-bin/" directory from being indexed except pages generated by the CGI "content.cgi", you can use the following:
   http://mysite.com/cgi-bin/*
   http://mysite.com/cgi-bin/content.cgi* index=yes follow=yes
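The last-match-wins rule can be sketched as follows. This is a hypothetical simulation assuming the wildcard semantics described earlier; `apply_exclusions` is not a FreeFind API:

```python
import re

def matches(mask, url):
    # "*" matches any run of characters, "?" matches one character,
    # comparison is case-insensitive; masks without "http://" behave
    # as if prefixed with "*" (assumed semantics from the text above).
    if not mask.lower().startswith("http://"):
        mask = "*" + mask
    pat = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c)
                  for c in mask)
    return re.fullmatch(pat, url, re.IGNORECASE) is not None

def apply_exclusions(exclusions, url):
    """Return (index, follow) flags for a URL.

    `exclusions` is a list of (mask, index, follow) tuples; the LAST
    matching entry wins, and a URL that matches no entry is indexed
    and followed normally.
    """
    index = follow = True              # no exclusion matched: normal indexing
    for mask, idx, fol in exclusions:
        if matches(mask, url):
            index, follow = idx, fol   # later matches override earlier ones
    return index, follow

rules = [
    ("http://mysite.com/cgi-bin/*", False, False),           # default: index=no follow=no
    ("http://mysite.com/cgi-bin/content.cgi*", True, True),  # index=yes follow=yes
]
print(apply_exclusions(rules, "http://mysite.com/cgi-bin/content.cgi?page=2"))
```

Note that reversing the order of the two rules would exclude "content.cgi" pages as well, since the broader mask would then match last.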

Here are some more examples:

The exclusion:
   http://mysite.com/ignore.html
prevents that file from being included in the index.

The exclusion:
   http://mysite.com/archive/*
prevents everything in the "archive" directory from being included in the index.

The exclusion:
   /archive/*
prevents everything in any "archive" directory from being included in the index regardless of the site it's on.

The exclusion:
   http://mysite.com/*.txt
prevents files on "mysite.com" that end with the extension ".txt" from being included in the index.

The exclusion:
   *.txt
prevents all files that end with the extension ".txt" from being included in the index regardless of what site they're on.

The exclusion:
   http://mysite.com/alphaindex/?.html
prevents a file like "http://mysite.com/alphaindex/a.html" from being indexed, but would allow a file "http://mysite.com/alphaindex/aardvark.html" to be indexed.

The exclusion:
   http://mysite.com/alphaindex/?.html index=no follow=yes
prevents a file like "http://mysite.com/alphaindex/a.html" from being added to the index but would allow the spider to find and follow the links in that page.

The exclusion:
   http://mysite.com/endwiththis.html index=yes follow=no
allows that file to be added to the index but prevents the spider from following any of the links in that file.

Excluding Pages using Robots.txt

If for some reason you cannot use the Control Center to exclude pages, you can use a robots.txt file to make the spider ignore certain parts of your site. This mechanism is more limited than Control Center exclusions, so it should typically not be used.

Excluding Part of a Page

You can use special search engine tags to prevent part of a page from being indexed. If you want to prevent an entire page from being indexed then see Excluding Pages Using the Control Center.

To prevent part of a page from being indexed add the tag
   <!-- FreeFind Begin No Index -->
before the section of the page to be ignored, and the tag
   <!-- FreeFind End No Index -->
after the section of the page to be ignored, then respider your site. When the spider encounters these tags, it omits the text between them from the index. Note that the spider will still follow the links on this page to locate the other pages of your site.
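As a rough illustration, here is what removing the no-index sections looks like. This is a sketch of the described behavior, not the actual indexer:

```python
import re

def strip_noindex(html):
    """Remove everything between the FreeFind begin/end no-index
    comments, roughly as the spider does when building the index
    (an illustrative sketch only)."""
    return re.sub(
        r"<!--\s*FreeFind Begin No Index\s*-->.*?<!--\s*FreeFind End No Index\s*-->",
        "",
        html,
        flags=re.DOTALL,
    )

page = """<p>Indexed text.</p>
<!-- FreeFind Begin No Index -->
<p>Navigation boilerplate the index should skip.</p>
<!-- FreeFind End No Index -->
<p>More indexed text.</p>"""
print(strip_noindex(page))
```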

Note that the spider does not use the "noindex" robots meta tag (<meta name="robots" content="noindex">) even though it does pay attention to the "nofollow" robots meta tag.

Preventing Links from Being Followed

Since the spider determines the pages of your site by following all the links it can, preventing it from following links can change which parts of your site get indexed.

The spider uses the standard "nofollow" robots meta tag (<meta name="robots" content="nofollow">) although it does not pay attention to the "noindex" robots meta tag.

By default, the spider is very thorough in its link detection. In addition to following standard HTML links, it will also try to extract links from any javascript on your pages and will follow links that contain query strings (like "/cgi-bin/doit.cgi?page=1&option=2").

Robust javascript link extraction is essentially impossible, so the javascript link extractor may make some invalid guesses. If your server receives a lot of "404 page not found" requests when the spider runs, you may want to consider turning off javascript link extraction.

All of this can be customized. The sections below outline the various techniques available for controlling the spider's link-finding process.

Preventing all links in javascript from being followed
Add the tag
   <meta name="FreeFind" content="neverFollowScript">
to the very start of the first page the spider reads. After the spider processes this tag it will ignore all javascript in the current page and all subsequent pages.

Preventing links in javascript on a single page from being followed
Add the tag
   <meta name="FreeFind" content="noFollowScript">
to the very start of the page. This causes the spider to ignore all javascript after the tag, for the current page only. You can also enable javascript link processing for the current page by using the tag
   <meta name="FreeFind" content="followScript">
at the very start of a page.

Preventing links in selected javascript from being followed
Add the tag
   <nofollowscript>
before the javascript the spider should ignore, and
   </nofollowscript>
after the javascript the spider should ignore. Note that these tags should not be in the javascript itself.

Making the spider ignore your robots meta tags
Add the tag
   <meta name="FreeFind" content="noRobotsTag">
to the very start of the first page the spider reads (before any existing robots meta tags!). After the spider processes this tag it will ignore all robots meta tags in the current page and all subsequent pages.

Preventing all the links in a page from being followed
Add the tag
   <meta name="FreeFind" content="nofollow">
to the very start of the page. This causes the spider to ignore all links in the page, both before and after the tag, for the current page only.

Preventing specific links from being followed
Add the tag
   <nofollow>
before the link(s) the spider should ignore, and
   </nofollow>
after the link(s) the spider should ignore.

Preventing links with query strings from being followed
Add the tag
   <meta name="FreeFind" content="noQueries">
to the very start of the first page the spider reads. After the spider processes this tag it will ignore all links with query strings.

Stripping the query string off links before following
Add the tag
   <meta name="FreeFind" content="stripQueries">
to the very start of the first page the spider reads. After the spider processes this tag it will remove any query strings from the links it follows.
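The effect of "stripQueries" on an individual link can be sketched as follows. `strip_query` is an illustrative helper, not FreeFind code:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    # Drop the query string from a link before it is followed,
    # roughly what the "stripQueries" option does (a sketch).
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

print(strip_query("http://mysite.com/cgi-bin/doit.cgi?page=1&option=2"))
# -> http://mysite.com/cgi-bin/doit.cgi
```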

Excluding Pages from the Site Map

To prevent a page from being included in the site map, add the tag
   <!-- FreeFind No Map -->
to that page, then respider your site. This tag often removes the pages linked "beneath" the current page from the site map as well: if the only link to a page is on the "no mapped" page, that page will not appear in the site map either.

Excluding Pages from What's New

To prevent a page from being included in the what's new page, add the tag
   <!-- FreeFind Not New -->
to that page, then respider your site.

 

FreeFind and FreeFind.com are trademarks of FreeFind.com.
Copyright 1998 - 2014