The Internet Connection
(November 1999 Action for Libraries newsletter)

By Michael Sauers

Many Web Sites Not Indexed by Search Engines

The last major study showed that only about one-third of the estimated 800 million pages on the World Wide Web are actually indexed by search engines. What about those pages that aren't indexed?

Certain Web pages, and sometimes even whole sites, go unindexed for a variety of reasons. The following categories of Web pages and sites either are not or cannot be indexed by a search engine.

  • Databases -- Any Web page that is created by querying a database in response to user input will not be indexed by a search engine. Example: Search for an author's name in a search engine; you will never find a link to the Amazon.com page offering one of that author's books for sale.

  • Dynamic -- Dynamic Web pages generated for a particular user based on preferences set by that user are not indexed. Typically these pages end in the extension .asp (Active Server Pages) instead of .html. Example: You are asked for your name and zip code the first time you visit a Web site. From that point on, whenever you visit that site, you are welcomed by name to your own page, which displays information relevant to your location as in the case of MovieLink.

  • Constant updating -- Sites and pages that are different almost every time you visit them are most likely changed at least once a day. No search engine can keep up with such frequent changes. Example: Search for a current event such as "Monica Lewinsky." You'll never find The New York Times articles as hits unless you are specifically using the search function on The New York Times on the Web site.

  • Frames -- Because of the way in which frameset documents are coded for the Web, almost all information presented in frames is not indexed.

  • Restricted -- Sites that require user registration or a password, or that have other security measures in place, are not indexed. Examples: The New York Times site can be used only by those who register. An internal corporate Web site that is located behind a firewall cannot be indexed.

  • Blocked -- A webmaster may block search engines from indexing the webmaster's Web site by using a robots.txt file. Examples: At www.ibm.com/robots.txt, IBM is instructing all search engines (user agents) to keep out of certain directories on its servers. Yahoo! at www.yahoo.com/robots.txt is blocking all search engines from three particular directories and one search engine in particular (Roverbot, which no longer exists) from its entire Web site.

  • Unknown -- Search engines can find a Web site under only two circumstances: the site is announced to the search engine, or the search engine finds a link to the site from a site it already indexes. If neither of these circumstances is met, a site will not be indexed. Example: A Web site will not be indexed if it is created without any other pages linking to it, the URL is given to only a few close friends and a search engine isn't told about its existence.
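
For readers curious about how the "Blocked" category works in practice, here is a minimal sketch in Python of how a well-behaved search engine might consult a robots.txt file before indexing a page. The file contents, the directory names and the www.example.com address are all invented for illustration; they mirror the general shape of the Yahoo! example above (one robot shut out entirely, all others kept out of a few directories) but are not copied from any real site. The helper name can_index is likewise an assumption of this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this illustration.
# Roverbot is barred from the whole site; every other robot is barred
# only from three made-up directories.
SAMPLE_ROBOTS_TXT = """\
User-agent: Roverbot
Disallow: /

User-agent: *
Disallow: /private/
Disallow: /drafts/
Disallow: /tmp/
"""


def can_index(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robot is permitted to fetch the URL.

    can_index is a hypothetical helper; it simply wraps the standard
    library's robots.txt parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Roverbot", "http://www.example.com/index.html"),
        ("SomeOtherBot", "http://www.example.com/index.html"),
        ("SomeOtherBot", "http://www.example.com/private/report.html"),
    ]
    for agent, url in checks:
        verdict = "may index" if can_index(SAMPLE_ROBOTS_TXT, agent, url) else "is blocked from"
        print(f"{agent} {verdict} {url}")
```

Note that robots.txt is a request rather than a lock: it keeps out only those search engines that choose to honor it, which is why password protection and firewalls fall under the separate "Restricted" category.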

