SEOMining


Search Engine Basics

Information Retrieval Services and Search Engines

Manually maintaining an accurate directory of every document available on the Internet is virtually impossible, given the Internet's growth rate. For this reason, SEO requires developers to create strong landing pages within their websites that search engines such as Google can find.
Search engines automate the time-intensive process of gathering information and categorizing documents. They do this with computer programs that search the Web for new and updated Web pages, gather information about them, and store it in a database.
There are also search engines that gather information on Usenet newsgroups (we will be using one, Deja.com, later in the course).
Search engines typically provide an input area for you to place a search query, which is one or more words related to what you are looking for.

Some examples of Search Engines are:
  1. bing.com
  2. Startpage.com
  3. Google.com
  4. duckduckgo.com
Search engines will be discussed in more detail in the next module. For now, take a quick look at the main page (or "home page") of each search site. Note the engines that include a list of categories (as a Directory does). Also note which, if any, allow you to refine or expand your search through drop-down menus or small circles (called radio buttons) or squares (called check boxes).
Clicking on any of these links will open the Web site in a separate browser window, so you can switch between the lesson and the website.

Domain analysis, a form of systems analysis applied to multiple related systems, is a method for developing an information retrieval framework.
By means of domain analysis, one attempts to discover and record the similarities and differences among related systems.
The first steps in domain analysis are to identify important concepts and vocabulary in the domain, define them, and organize them with a faceted classification.

Conceptual Models of IR

The most general facet in such a classification scheme is the conceptual model. An information retrieval conceptual model is a general approach to IR systems. Several taxonomies for IR conceptual models have been proposed. One taxonomy distinguishes three basic approaches:
  1. text pattern search,
  2. inverted file search, and
  3. signature search.
Belkin and Croft categorize information retrieval conceptual models differently. They divide retrieval techniques first into exact match and inexact match. The exact match category contains text pattern search and Boolean search techniques. The inexact match category contains such techniques as probabilistic, vector space, and clustering, among others. The problem with these taxonomies is that the categories are not mutually exclusive, and a single system may contain aspects of many of them. Almost all of the information retrieval systems fielded today are either Boolean information retrieval systems or text pattern search systems.

Text pattern search queries

Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files. The grep family of tools used in the UNIX environment is a well-known example of text pattern searchers.
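The grep-style search described above can be sketched in a few lines of Python. This is only an illustration, not how grep itself is implemented: the `documents` dictionary, the file names in it, and the `pattern_search` helper are all hypothetical, and the "collection" is held in memory instead of being read from disk.

```python
import re

# Hypothetical in-memory collection: file name -> file contents.
documents = {
    "notes.txt": "Search engines index Web pages.\nDirectories are built by hand.",
    "todo.txt": "Read about indexing.\nTry the grep tool.",
}

def pattern_search(pattern, docs):
    """Scan every line of every document and return (name, line_number, line)
    for each line that matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name, text in docs.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, number, line))
    return hits

# The query is itself a regular expression.
matches = pattern_search(r"index(es|ing)?", documents)
for name, number, line in matches:
    print(f"{name}:{number}: {line}")
```

Because every query scans the full text of every document, this approach is practical mainly for small collections, which matches the point made above.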
Almost all of the information retrieval systems for searching large document collections are Boolean systems. In a Boolean information retrieval system, documents are represented by sets of keywords, usually stored in an inverted file. An inverted file is a list of keywords and identifiers of the documents in which they occur.
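As a rough sketch of the inverted file just described, the following Python builds a mapping from each keyword to the set of document identifiers in which it occurs. The sample `docs`, the whitespace tokenizer, and the function name `build_inverted_file` are assumptions made for illustration; real systems typically add stop-word removal, stemming, and on-disk storage.

```python
from collections import defaultdict

# Hypothetical collection: document identifier -> text.
docs = {
    1: "search engines gather web pages",
    2: "directories list web pages by category",
    3: "search queries contain keywords",
}

def build_inverted_file(docs):
    """Map each keyword to the set of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

inverted = build_inverted_file(docs)
print(sorted(inverted["web"]))     # documents containing "web"
print(sorted(inverted["search"]))  # documents containing "search"
```

Looking up a keyword is then a single dictionary access, with no need to rescan the documents themselves.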
Boolean queries are keywords connected with Boolean logical operators (AND, OR, NOT). While Boolean systems have been criticized, improving their retrieval effectiveness has been difficult. Some extensions to the Boolean model that may improve information retrieval performance will be discussed later. Researchers have also tried to improve information retrieval performance by using information about the statistical distribution of terms, that is, the frequencies with which terms occur in documents, in document collections, or in subsets of document collections such as the documents considered relevant to a query.
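One way to picture Boolean query evaluation is as set operations over the posting sets of an inverted file: AND is intersection, OR is union, and NOT is the complement with respect to the whole collection. The sketch below assumes a small hand-written `index` and hypothetical helper names; it is not taken from any particular retrieval system.

```python
# Hypothetical inverted file: keyword -> set of document identifiers.
index = {
    "search": {1, 3},
    "web": {1, 2},
    "pages": {1, 2},
    "keywords": {3},
}
all_docs = {1, 2, 3}

def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

def q_and(a, b):
    return a & b          # documents in both sets

def q_or(a, b):
    return a | b          # documents in either set

def q_not(a):
    return all_docs - a   # documents not in the set

both = q_and(postings("search"), postings("web"))            # search AND web
either = q_or(postings("search"), postings("web"))           # search OR web
web_only = q_and(postings("web"), q_not(postings("search")))  # web AND NOT search

print(sorted(both), sorted(either), sorted(web_only))
```

Note that the result of each query is an unordered set: a document either matches or it does not, which is why Boolean retrieval on its own provides no ranking.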
Term distributions are exploited within the context of some statistical model such as the vector space model, the probabilistic model, or the clustering model.
Using these statistical models and information about term distributions, it is possible to assign a probability of relevance to each document in a retrieved set, allowing the retrieved documents to be ranked in order of probable relevance. Ranking is useful because the retrieved document sets are often large.
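To make the idea of ranking by term statistics concrete, here is a minimal sketch using the vector space model with TF-IDF weights and cosine similarity. Several things here are assumptions chosen for illustration: the sample `docs`, the whitespace tokenizer, and the particular weighting formula (raw term frequency times `log(N / df) + 1`). The resulting scores are similarity values, not probabilities of relevance; a probabilistic model would score documents differently.

```python
import math
from collections import Counter

# Hypothetical collection: document name -> text.
docs = {
    "d1": "search engines gather web pages",
    "d2": "directories list web pages by category",
    "d3": "boolean search uses keywords and operators",
}

def tokenize(text):
    return text.lower().split()

def idf_table(docs):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector; terms absent from the collection are ignored."""
    tf = Counter(tokenize(text))
    return {term: tf[term] * idf[term] for term in tf if term in idf}

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, name) pairs sorted from most to least similar."""
    idf = idf_table(docs)
    query_vec = vectorize(query, idf)
    scored = [(cosine(query_vec, vectorize(text, idf)), name)
              for name, text in docs.items()]
    return sorted(scored, reverse=True)

ranking = rank("web search", docs)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Unlike a Boolean query, this returns every document with a score, so a large result set can be presented best match first.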