The Internet Connection
(November 1999 Action for Libraries newsletter)
By Michael Sauers
Many Web Sites Not Indexed by Search Engines
The last major study showed that
only about one-third of the estimated 800 million pages on the World Wide Web are
actually indexed by search engines. What about those pages that aren't indexed?
Certain Web pages, and sometimes even whole sites, are not indexed for a variety of
reasons. The following categories of Web pages and sites are not or cannot be indexed
by a search engine.
- Databases -- Any Web page that is created by querying a database of information
through user input will not be indexed by a search engine. Example: Search for
an author's name in a search engine; you will never find a link to a book by that
author for sale at Amazon.com.
- Dynamic -- Dynamic Web pages generated for a particular user based on
preferences set by that user are not indexed. Typically these pages end in the
extension .asp (Active Server Pages) instead of .html. Example: You are asked
for your name and zip code the first time you visit a Web site. From that point
on, whenever you visit that site, you are welcomed by name to your own page,
which displays information relevant to your location as in the case of MovieLink.
- Constant updating -- Sites and pages that are different almost every time you visit
them most likely are changed at least once a day. No search engine can keep up
with such frequent changes. Example: Search for a current event such as
"Monica Lewinsky." You'll never find The New York Times articles as hits
unless you are specifically using the search function on The New York Times on
the Web site.
- Frames -- Because of the way in which frameset documents are coded for the
Web, almost all information presented in frames is not indexed.
- Restricted -- Sites that require user registration or a password, or that have
other security measures set up, are not indexed. Examples: The New York Times site can be
used only by those who register. An internal corporate Web site that is located
behind a firewall cannot be indexed.
- Blocked -- A webmaster may block search engines from indexing the
webmaster's Web site by using a robots.txt file. Examples: At
www.ibm.com/robots.txt, IBM is instructing all search engines (user agents) to
keep out of certain directories on its servers. Yahoo! at www.yahoo.com/robots.txt
is blocking all search engines from three particular directories and one search
engine in particular (Roverbot, which no longer exists) from its entire Web site.
- Unknown -- Search engines can find a Web site under only two circumstances:
the site is announced to the search engine, or the search engine finds a link to
the site from a site it already indexes. If neither of these circumstances is
met, a site will not be indexed. Example: A Web site will not be indexed if it is
created without any other pages linking to it, the URL is given to only a few close
friends and a search engine isn't told about its existence.
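To make the "Blocked" category more concrete, here is a minimal sketch of how a well-behaved search engine might read a robots.txt file before indexing a page. It uses Python's standard urllib.robotparser module. The sample file, the directory names, the example.com addresses and the "ExampleBot" user agent are all made up for illustration; they are not the actual contents of IBM's or Yahoo!'s files. The sample only mirrors the pattern described above: every user agent is kept out of three directories, and Roverbot is kept out of the whole site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
# The first record applies to every user agent; the second shuts one
# named robot out of the entire site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /cgi-bin/

User-agent: Roverbot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch this URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# An ordinary robot may index the home page but not a blocked directory.
print(is_allowed(parser, "ExampleBot", "http://www.example.com/index.html"))
print(is_allowed(parser, "ExampleBot", "http://www.example.com/private/report.html"))

# Roverbot is turned away from every page on the site.
print(is_allowed(parser, "Roverbot", "http://www.example.com/index.html"))
```

Keep in mind that robots.txt is only a request. Reputable search engines honor it, but nothing in the file itself prevents a robot from ignoring the instructions, which is why restricted sites rely on passwords or firewalls instead.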