SEOMining


Information Architecture

Search Types

  1. Known item searching: Search for a specific item that the user is familiar with
  2. Existence searching: Search for something the user hopes to find for the first time
  3. Exploratory searching: Search for more information on a familiar topic
  4. Comprehensive searching: Search for many sources of information on a topic or item

On your own

Open another browser window and call up your preferred search engine.
Try a few of the following searches and think about which of the four kinds of searches you are carrying out:
  1. What was the highest grossing film for the previous calendar year?
  2. Is there a tuba repair shop in the capital of Madagascar?
  3. What were the jobs held by the current governor of New Hampshire before he or she was governor?
  4. What is some general information on the company called Associated Services?
  5. Who were the U.S. secretaries of state during the Vietnam War?


Text Retrieval

Text (information) retrieval deals with the problem of how to find relevant (useful) documents for any given query from a collection of text documents. Documents are typically preprocessed and represented in a format that facilitates efficient and accurate retrieval. In this section, we provide a brief overview of some basic concepts in classical text retrieval.
The contents of a document may be represented by the words contained in it. Some words, such as "a", "of", and "is", carry no semantic information. These words are called stop words[1] and are usually not used for document representation. The remaining words are content words and can be used to represent the document. Variations of the same word may be mapped to the same term. For example, the words "beauty", "beautiful", and "beautify" can all be denoted by the term "beaut". This is achieved by a stemming[2] program, which removes suffixes or replaces them with other characters. After stop words are removed and the remaining words are stemmed, each document can be logically represented by a vector of n terms, where n is the total number of distinct terms across all documents in the collection.
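The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list and the suffix list are small assumptions made for this example, and the toy `stem` function is not a real stemming algorithm such as Porter's. The function names (`tokenize`, `stem`, `preprocess`, `build_vocabulary`, `to_count_vector`) are likewise hypothetical.

```python
import re

# Assumed, deliberately tiny stop-word list; real systems use longer lists.
STOP_WORDS = {"a", "an", "of", "is", "the", "and", "to", "in", "with"}

# Assumed toy suffix list, checked in this order; not a real stemmer.
SUFFIXES = ("iful", "ify", "y")


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Drop stop words, then stem the remaining content words."""
    return [stem(word) for word in tokenize(text) if word not in STOP_WORDS]


def build_vocabulary(documents):
    """Collect the distinct terms of the whole collection in sorted order."""
    terms = set()
    for doc in documents:
        terms.update(preprocess(doc))
    return sorted(terms)


def to_count_vector(text, vocabulary):
    """Represent one document as a vector with one entry per vocabulary term."""
    terms = preprocess(text)
    return [terms.count(term) for term in vocabulary]


docs = [
    "The beauty of a tuba",
    "A tuba is beautiful",
    "Repair shops beautify the town",
]
vocabulary = build_vocabulary(docs)
print(vocabulary)                            # ['beaut', 'repair', 'shops', 'town', 'tuba']
print(to_count_vector(docs[0], vocabulary))  # [1, 0, 0, 0, 1]
```

Here n is 5, the number of distinct terms in the three sample documents, and each document becomes a vector of length 5. The entries are raw counts; weighting them is a separate step.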

Vector Document

Suppose the document d is represented by the vector (d1, ..., di, ..., dn), where di is a number (weight) indicating the importance of the ith term in representing the contents of the document d. Most of the entries in the vector will be zero because most terms do not appear in any given document. When a term is present in a document, the weight assigned to the term is usually based on two factors, namely the term frequency (tf) factor and the document frequency (df) factor. The term frequency of a term in a document is the number of times the term appears in the document. Intuitively, the higher the term frequency of a term is, the more important the term is in representing the contents of the document. Consequently, the term frequency weight (tfw) of a term in a document is usually a monotonically increasing function of its term frequency. The document frequency of a term is the number of documents having the term in the entire document collection. Usually, the higher the document frequency of a term is, the less important the term is in differentiating documents having the term from documents not having it. Thus, the weight of a term with respect to document frequency is usually a monotonically decreasing[3] function of its document frequency and is called the inverse document frequency weight (idfw).
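As a rough sketch of how the two factors can be combined, the Python code below computes a weight for every term in every document. The text above only requires tfw to increase with tf and idfw to decrease with df; the specific formulas used here (1 + log tf and log(N / df), multiplied together) are one common choice and are an assumption of this example, as are the function names.

```python
import math
from collections import Counter


def tf_weight(tf):
    """Assumed tfw: 1 + log(tf), increasing in tf; 0 when the term is absent."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def idf_weight(df, num_docs):
    """Assumed idfw: log(N / df), decreasing in df (needs 1 <= df <= N)."""
    return math.log(num_docs / df)


def tfidf_vectors(docs_terms):
    """Build weighted vectors from a list of preprocessed term lists."""
    num_docs = len(docs_terms)
    vocabulary = sorted({term for terms in docs_terms for term in terms})
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for terms in docs_terms for term in set(terms))
    vectors = []
    for terms in docs_terms:
        counts = Counter(terms)  # term frequency within this document
        vectors.append(
            [tf_weight(counts[t]) * idf_weight(df[t], num_docs) for t in vocabulary]
        )
    return vocabulary, vectors


# Term lists as they might look after stop-word removal and stemming.
docs_terms = [
    ["beaut", "tuba"],
    ["tuba", "beaut", "tuba"],
    ["repair", "shop"],
]
vocabulary, vectors = tfidf_vectors(docs_terms)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

With these formulas, a term that occurs in every document gets an idfw of log(1) = 0, so it contributes nothing to distinguishing one document from another, which matches the intuition given above.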

[1]stop words: Non-content words such as "a", "an", and "with" that are excluded from the keywords the search software attempts to match in the database.
[2]stemming: The process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form.
[3]monotonically decreasing: A function f is called monotonically decreasing if f(x) >= f(y) whenever x <= y.