Internet Search Engine - Online Article

Abstract

Search engines fall into four categories: primary search engines, aka pure search engines; mega-index search engines; simultaneous mega-index search engines, aka parallel mega-index search engines; and niche search engines, aka specialized search engines.

There are ten primary search engines; the top three are Alta Vista, Lycos, and Open Text. The most famous niche search engine is Yahoo.

The primary search engine uses a robot to travel the Internet to retrieve documents; it has a computer program that will search these documents; and it covers a significant part of the World Wide Web.

Internet Search Engines

At the New York Futuristic World's Fair of 1939, there was a robot which could vacuum rugs. People expected robots that would soon clean their homes. This hasn't happened.

A robot of another kind is alive and well on the Internet. It travels the Net retrieving documents and also retrieves all documents that are referenced by those documents ... It works day and night and can garner data by the gigabytes.

A search engine is a computer program that will search those documents for the ones which contain the keywords of interest to you. Together, the retrieving robot and the search engine are truly an Information Robot or "knowbot", a sort of obedient servant who can reply with dozens or even thousands of documents in a matter of seconds (all right, maybe a minute or two). And to make it all the more surreal, the knowbot allows you to retrieve and examine each document with just a mouse click.
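The division of labour between the robot and the search engine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" here is a small in-memory dictionary with made-up example URLs, not the real Internet, and the names WEB, crawl, and search are hypothetical. The robot follows every reference it finds, and the search program then keeps only the documents containing all of the query's keywords.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (document text, URLs it references).
# These pages are invented for illustration only.
WEB = {
    "http://a.example/": ("Trade embargo history",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Woodrow Wilson and the Fourteen Points",
                          ["http://a.example/"]),
    "http://c.example/": ("French wine regions", []),
    "http://d.example/": ("Unlinked page about embargo law", []),
}

def crawl(start_url, web):
    """Retrieve a document, then every document it references, and so on."""
    seen = {start_url}
    queue = deque([start_url])
    collected = {}
    while queue:
        url = queue.popleft()
        if url not in web:          # dead link: nothing to retrieve
            continue
        text, links = web[url]
        collected[url] = text
        for link in links:
            if link not in seen:    # never fetch the same document twice
                seen.add(link)
                queue.append(link)
    return collected

def search(collected, query):
    """Return the URLs whose text contains every keyword (case-insensitive)."""
    keywords = query.lower().split()
    return sorted(
        url for url, text in collected.items()
        if all(k in text.lower().split() for k in keywords)
    )

documents = crawl("http://a.example/", WEB)
hits = search(documents, "embargo")
```

Note that the page at d.example also mentions "embargo", but no other page refers to it, so a robot starting from a.example never finds it. A robot can only retrieve what is linked from somewhere it has already been.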

In an information-driven society and economy, it is information about information that reigns supreme. In a marriage made in heaven, the computer stands ready with its power and speed to interact with the information needs of the query maker. Information retrieval is then the skill of the age. The search machines are at our beck and call in our quest for the Holy Grail of information.

So far, so good. But there are so many search engines that it is all quite bewildering. Occam's razor -- the razor that slices through complexity and chaos to produce simplicity and clarity -- is much needed. A good beginning is to realize there is more than one type of search engine.

To put it in marketing terms, there are four major market segments in the search engine "industry":

  1. Primary or pure search engines. Their robots have collected data from a significant part of the World Wide Web -- that portion of the Internet which has uniform resource locator (URL) addresses starting with http://, and which has the hypertext mark-up language (HTML) links -- "hotlinks". As indicated in the table, "The Four Types of Search Engine," the primary search engines consist of the "Top Three" -- Alta Vista, Lycos, and Open Text -- and of the "Second Tier" -- Excite, Harvest Broker, InfoSeek, Magellan, NlightN, WebCrawler, and World Wide Web (WWW) Worm.
  2. Mega-indexes. These search engines do not have their own information databases. Instead, they have access to other search engines. Although three mega-indexes are included in the table, this is just a sampling from hundreds of such entities.
  3. Simultaneous ("Parallel") Mega-indexes. These are mega-indexes which access other search engines in parallel (simultaneously) and present the unified results as a single package. There are just two: MetaCrawler and Savvy Search.
  4. Niche ("Specialized") Search Engines. These are primary search engines which focus on a small or specialized segment of the Internet. The most famous one is Yahoo, a manually-maintained subject directory which covers about one or two percent of the Web and which has 21 subject headings or categories. For searching within these categories, Yahoo has its own search engine. For searching the Web-at-large, Yahoo uses the primary search engine, Open Text.
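The difference between an ordinary mega-index and a simultaneous one is easiest to see in code. The following Python sketch is an assumption-laden illustration, not a description of how MetaCrawler or Savvy Search are actually built: engine_a and engine_b are hypothetical stand-ins that return fixed lists of made-up URLs, and parallel_search is an invented name. The point is only that all engines are queried at the same time and the results are merged into a single package with duplicates removed.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines. A real mega-index would
# send the query over the network; these just return canned results.
def engine_a(query):
    return ["http://x.example/1", "http://x.example/2"]

def engine_b(query):
    return ["http://x.example/2", "http://x.example/3"]

def parallel_search(query, engines):
    """Query every engine simultaneously and merge the results.

    `engines` must be a non-empty list of callables taking a query string.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map runs the calls concurrently but yields the result
        # lists in the same order as `engines`.
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    merged = []
    seen = set()
    for results in result_lists:
        for url in results:
            if url not in seen:     # present each document only once
                seen.add(url)
                merged.append(url)
    return merged

unified = parallel_search("embargo", [engine_a, engine_b])
```

A plain mega-index would instead hand the query to one engine at a time and show each engine's results separately, leaving the merging to the reader.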

The Internet itself has seven major market segments:

  1. World Wide Web
  2. Gopherspace
  3. Newsgroups and discussion lists
  4. Files available by file transfer protocol (FTP)
  5. People (White Pages)
  6. Companies (Yellow Pages)
  7. Software

The four niche search engines in the table are just a sample drawn from many such specialized search engines. There are also specialized search engines for Gopherspace ("Veronica") and for finding people, companies, and software. Being specialized, they are more efficient in finding data within their spheres than the primary search engines.

The niche search engines are a small but growing part of the Internet and are worthy of a study in and of themselves. The remainder of the present paper, however, will focus on the ten primary search engines and on how the top three were discerned from the other seven ("Second Tier"). To put the ten primary search engines through their paces, two queries were submitted to each: "embargo" and "Woodrow Wilson's Fourteen Points". I remember submitting "embargo" to search engines a year ago and frequently getting no returns. The situation is much better today.

Woodrow Wilson's Fourteen Points is not a scavenger hunt item, but it is certainly challenging because a perfect match requires all four terms. One of the most important speeches made by a U. S. President was made just before the end of World War I. It was an appeal for a forgiving rather than a punitive peace and for the establishment of a League of Nations.

Most of the search engines did fine with "embargo", returning between 40 and over 300 hits. The exceptions were Harvest Broker (14) and WWW Worm (7), which thereby dropped out of the race.

The grand prize winner for the Fourteen Points query was Alta Vista, which had four returns, one of which was the full text of Wilson's Fourteen Points speech. InfoSeek came up with two references which briefly referred to the speech, and the other search engines struck out.

But these two queries are just a rough and ready approximation. Equally important is the way in which the returns are presented. The Top Three all communicated the following:

  1. Total number of returns or hits. For "embargo", Alta Vista had 20,000 documents, Open Text 1026, Lycos 366, and NlightN 1030, but neither Excite nor InfoSeek disclosed the number of returns.
  2. Keywords. As you look over 50 to 300 or even more returns to determine their possible relevance to your quest, it is extremely important that the keyword(s) show up in the material presented to you. For example, if you are looking for French wine, you'll be able to disregard returns with just "wine" and returns dealing only with just "France" or "French". With Woodrow Wilson's Fourteen Points, there were literally hundreds of returns with Woodrow Wilson and nothing else. The returns with all four terms are usually presented first. But if the keywords are not presented, you're left in the dark. The four search engines which did not show keywords -- Excite, InfoSeek, Magellan, and WebCrawler -- were thereby eliminated. Down to the Final Four.
  3. Summaries. With one exception, each of the Final Four gave a summary of the document returned. This is of vital importance for determining whether to spend the time to view the document itself. NlightN gives no summary. Down to the Final Three: Alta Vista, Lycos, and Open Text. They are the most generous in terms of total returns you are allowed to view: Open Text and Lycos (both unlimited) and Alta Vista (200). Lycos is outstanding for the most complete summary for documents returned -- 4 to 8 lines or twice that of the other two. This is very helpful. And Lycos is the only search engine which makes keywords stand out by making them bold.

Not to be outdone by bold keywords, however, Open Text displays the bold keywords in the context of the phrase in which they appear (via their "see matches on the page" option).
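Two of the presentation features discussed above, putting the returns that match the most query terms first and showing each keyword in the context of its surrounding phrase, can be illustrated with a short Python sketch. Everything here is hypothetical: the function names, the sample documents, and the scoring rule are assumptions made for illustration, and none of the search engines reviewed has published its method in this form. Asterisks stand in for bold type, since this is plain text.

```python
import re

def rank_by_terms_matched(documents, query):
    """Order document titles so those matching the most query terms come first.

    `documents` maps a title to its text. Documents matching no term are dropped.
    """
    terms = query.lower().split()
    scored = []
    for title, text in documents.items():
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        matched = sum(1 for term in terms if term in words)
        if matched:
            scored.append((matched, title))
    # Most terms matched first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]

def keyword_in_context(text, keyword, width=20):
    """Return each occurrence of `keyword` with up to `width` characters of
    context on either side, the keyword itself marked with asterisks."""
    snippets = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        snippets.append(
            text[start:m.start()] + "*" + m.group(0) + "*" + text[m.end():end]
        )
    return snippets

# Invented sample documents.
docs = {
    "speech": "Woodrow Wilson's Fourteen Points speech of 1918",
    "bio": "Woodrow Wilson was born in Virginia",
    "wine": "French wine",
}
ranking = rank_by_terms_matched(docs, "Woodrow Wilson's Fourteen Points")
context = keyword_in_context("A trade embargo was declared.", "embargo", width=8)
```

With these sample documents the speech, which contains all four terms, is listed ahead of the biography, which contains only one, and the wine page is not returned at all. Without the keywords shown in context, the reader would have no way to tell those two returns apart at a glance.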

Overall, my personal favourite is Yahoo. Besides being a niche search engine, it is also a mega-index with a quick response time and an easy interface which connects you to the Top Three primary search engines.

Types of Search Engines

Primary (Pure) Search Engines

  • The Top Three (Alphabetical Order)
    • Alta Vista. Developed by Digital Equipment Corporation. Went public December 15, 1995. According to their estimates, they have access to 16 million Web pages (over 90% of total Web pages) and they have a full-text index of 13,000 newsgroups updated in real time. Many helpful search techniques explained. Handles 2 million info requests per day. They provide a paragraph of HTML which allows you to put the Alta Vista Search engine on your home page. Praised to the moon by Netsurfer Digest (Jan. 19, 1996). http://www.altavista.digital.com/
    • Lycos. Born June 1995. Lycos comes from the first 5 letters of the Latin name (Lycosidae) for wolf spider, a wandering ground spider. According to their FAQ (Frequently Asked Questions), as of September 1995, Lycos had 7.2 million URLs (documents), or about 90% of the Web's 8 million documents at that time. Search techniques explained. http://www.lycos.com/
    • Open Text. Ten billion words indexed. Searches every word of every page indexed. Extensive guidelines on search procedures. Open Text's software is commercially available for Intranets (private Webs). http://www.opentext.com:8080/
  • The Second Tier (Alphabetical Order)
    • Excite. Access to 1.5 million Web pages, 1.0 million articles from Usenet newsgroups, and two weeks of current Usenet classified advertisements. Searchable database of 50,000 Web site reviews; these are crisply described -- a nice plus! http://www.excite.com/
    • Harvest Broker. 70,000 Web pages indexed. Extensive explanation and examples of search techniques. http://www.town.hall.org/Harvest/brokers/
    • InfoSeek. Handles up to 5 million information requests per day. Limit of 100 returns per search. Many search tips. Can search newsgroups and reviewed pages as well as Web pages. Formerly a for-fee, full-service search engine. http://guide.infoseek.com/
    • Magellan. Named for Ferdinand Magellan, the Portuguese explorer. Has 1.5 million reviewed and rated sites; ratings are * (worst) to **** (best). FAQ available and search techniques described. Magellan provides a paragraph of HTML which allows you to put the Magellan search engine on your home page. http://www.mckinley.com/
    • NlightN. Their database indexes the World Wide Web, reference works, news wires, literary works, dissertations and abstracts. Can query, browse, and graze for free. Document retrieval is available at $0.10 per abstract and $0.25 per document via a credit account which is obtained electronically, by FAX, phone, or postal mail. Extensive FAQ and search procedure tips. http://www.nlightn.com/
    • WebCrawler. 100,000 Web pages indexed. Indexes documents by content. Week of January 2, 1996: 17 million queries from 1.7 million users. Purchased July, 1995 by America On-line, but is freely available on the Internet. http://www.webcrawler.com/
    • WWW Worm. 3 million URL's indexed; 2 million information requests per month. Help and search examples provided. Returns anywhere from 1 to 5000 hits. http://guano.cs.colorado.edu/WWWW/

Mega-Indexes

  • All-in-One Search Page. Includes our Top Three primary search engines. Also searches for software and for people. A mirror of this site is available in French. http://www.albany.net/allinone/
  • My Virtual Reference Desk. Includes our Top Three primary search engines. Has "My Virtual Newspaper" with newspaper links to the Christian Science Monitor, New York Times, and National Public Radio (Real Audio). Searches Usenet newsgroups. http://www.refdesk.com/cgi-bin/refsrch.cgi/search/me
  • Net Search. Includes our Top Three primary search engines. Has the "Electric Library" with 150 full-text newspapers and 900 full-text magazines ($9.95 per month). http://home.netscape.com/home/internet-search.html

Simultaneous (Parallel) Mega-Indexes

  • MetaCrawler Multi-Threaded Web Search Service. Includes two of our Top Three primary search engines, and several others. These are simultaneously queried. In its option, "Verification Mode," MetaCrawler loads all references returned to make sure the links are active and that they contain valid data. http://metacrawler.cs.washington.edu:8080/
  • Savvy Search. Includes two of our Top Three primary search engines, and several others. Search results can be reviewed for each search engine individually or collectively. Available in 15 languages. Can also search reference, people, commercial sites, and software. Their FAQ and Help Page are quite informative. http://www.cs.colostate.edu/~dreiling/smartform.html

Niche (Specialized) Search Engines

  • Conferences. The full name is "WWW Virtual Library: Conferences." Has up-to-date information on conferences, symposia, workshops, seminars, exhibitions, and meetings throughout the world. Retrieval of information is by keyword search or browsing by subject category, alphabetically-ordered acronyms, and date. http://www.iao.fhg.de/Library/conferences
  • DejaNews. Largest collection of archived Usenet news on the Net according to their estimates. Most Usenet groups covered; excluded are alt.*, soc.*, talk.* and *.binaries. Can retrieve the entire thread of articles on a particular topic and can give you a profile of the author of an article which gives a history of their previous postings. Extensive explanation of search techniques. DejaNews FAQ available. http://www.dejanews.com/
  • Liszt. World's largest searchable directory of 25,000 listserv, list proc, and majordomo e-mail discussion groups. Searchable directory of 13,000 Usenet newsgroups. http://www.liszt.com/
  • Yahoo! A manually-maintained searchable directory which covers 21 major categories: Art, Business, Computers, Economy, Education, Entertainment, Environment/Nature, Events, Government, Health, Humanities, Law, News, Politics, Reference, Regional Information, Science, Social Science, Society/Culture. For searching the entire Internet, Yahoo uses Open Text and has easy access to Alta Vista, Lycos, WebCrawler, and DejaNews. http://www.yahoo.com/
