How Search Engines Work

In this article, we look at how search engines such as Google, Yahoo and Bing (Microsoft) help people find information online. The next article, on Search Engine Optimization (SEO), covers what you can do if you want to be found.

A search engine stores information about a large number of web documents (or pages), and allows people to search for ones that contain information of interest. When the user enters a ‘search term’, the search engine sifts through the data it has stored, and displays a list of pages that its software considers relevant to the user’s request. The pages are presented in descending order of rank, which reflects both each page’s relevance to the user’s search and its ‘importance’.

To accomplish this, the search engine must perform three major functions:

(1) Web crawling

The search engine starts by reading one or more known documents (seeds).

Web pages contain ‘hyperlinks’ (links) that you use to navigate to other pages. When it scans a page, the search engine identifies the links on the page, and adds them to the list of pages that it may scan (visit) in the future. As it scans those pages, it in turn adds the links it finds on each of them to its list of documents to scan. In this way, it can branch out to a very large number of documents after starting with a small number of seeds.
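The branching-out process described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "web" is a small hypothetical dictionary (PAGES) rather than pages fetched over a network, links are pulled out with a simple regular expression, and the function name crawl is made up for this example. Real crawlers are far more involved.

```python
import re
from collections import deque

# Hypothetical in-memory "web": address -> page content.
# A real crawler would download each page over HTTP instead.
PAGES = {
    "a.example/": '<a href="b.example/">B</a> <a href="c.example/">C</a>',
    "b.example/": '<a href="c.example/">C</a> <a href="d.example/">D</a>',
    "c.example/": '<a href="a.example/">A</a>',
    "d.example/": "No links here.",
}

# Simplified link detection; real pages need a proper HTML parser.
LINK_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seeds, pages):
    """Visit pages starting from the seeds and return them in visit order."""
    to_visit = deque(seeds)   # the list of pages to scan in the future
    seen = set(seeds)         # addresses already added to that list
    visited = []
    while to_visit:
        url = to_visit.popleft()
        content = pages.get(url)
        if content is None:
            continue  # the link points to a page we cannot read
        visited.append(url)
        for link in LINK_PATTERN.findall(content):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                to_visit.append(link)
    return visited


print(crawl(["a.example/"], PAGES))
```

Starting from the single seed a.example/, the sketch reaches all four pages. The seen set is what stops it from looping forever when pages link back to each other.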

(2) Indexing web pages

As the search engine scans each document (page), it builds and stores indexes based on the page’s content, which allow the page to be looked up quickly. (It may also store the full content of the page. This is known as caching the page.) As an example, an index might consist of a list of the words found on the pages scanned (usually omitting words that add little meaning, such as ‘the’ and ‘and’), along with the pages each word is found on. The indexes allow the search engine to quickly locate information without reading through every document on every search, which would take a very long time.
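The word-list index in the example above is often called an inverted index. A minimal Python sketch follows, assuming a tiny hypothetical set of documents (DOCS) and a short, made-up list of words to omit (STOP_WORDS). It is meant to show the idea, not how any particular search engine stores its data.

```python
import re
from collections import defaultdict

# Illustrative list of words that add little meaning (an assumption).
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}

# Hypothetical pages and their text.
DOCS = {
    "page1": "The cat and the dog",
    "page2": "A dog in the park",
}


def build_index(docs):
    """Map each meaningful word to the set of pages it is found on."""
    index = defaultdict(set)
    for page, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(page)
    return index


index = build_index(DOCS)
print(sorted(index["dog"]))  # ['page1', 'page2']
print(sorted(index["cat"]))  # ['page1']
```

Looking up a word is now a single dictionary access, however many pages have been scanned.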

(3) Searching

When you perform a search, the search engine will examine its indexes in order to find pages that pertain to your query, and present the results in a list on a Search Engine Results Page (SERP). The listing will contain a link to each page, along with short descriptive text taken from the page. Pages are typically sorted according to a formula which includes relevance to your search terms, as well as the ‘importance’ of the page, and possibly other factors.
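As a rough sketch of this step, the Python below looks up each search term in a small index and sorts the matching pages by a score. The index contents, the importance values, and the scoring formula (the fraction of search terms matched, plus the importance value) are all assumptions chosen for illustration. Real ranking formulas are not public and weigh many more factors.

```python
# Hypothetical index: word -> pages containing it.
INDEX = {
    "search": {"page1", "page2", "page3"},
    "engine": {"page1", "page3"},
    "crawler": {"page2"},
}

# Hypothetical 'importance' value for each page, between 0 and 1.
IMPORTANCE = {"page1": 0.2, "page2": 0.9, "page3": 0.5}


def search(query, index, importance):
    """Return matching pages, best score first."""
    terms = query.lower().split()
    matches = {}  # page -> number of search terms found on it
    for term in terms:
        for page in index.get(term, ()):
            matches[page] = matches.get(page, 0) + 1

    def score(page):
        relevance = matches[page] / len(terms)  # share of terms matched
        return relevance + importance.get(page, 0.0)

    # Sort by descending score; ties are broken by page name.
    return sorted(matches, key=lambda page: (-score(page), page))


print(search("search engine", INDEX, IMPORTANCE))
# ['page3', 'page2', 'page1']
```

Note that page2 matches only one of the two terms but still outranks page1, because its assumed importance is much higher. This is the kind of trade-off a ranking formula has to make.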

Off-page factors

Prior to Google, search engines looked in only one place to determine whether to include a document on the SERP, and how to rank it. They simply looked at the document itself.

Google’s biggest innovation was the concept of PageRank, under which a page’s rank is determined largely by links to it from other pages. For years, people had been trying to influence search engine results, often in improper ways. They would stuff a page with keywords, often using invisible text, and do anything else they could think of that would elevate the page in the search results, often on topics (such as contraband medications) that bore little relevance to the user’s search.

PageRank involves assigning each page a value (the publicly displayed version ranged from 0 to 10), based primarily on links back to the page from other websites that the search engine considers to be important. The effect is that your page is evaluated based on how others vote (via their choice of links), rather than by the content on the page itself. This technique was quickly adopted by other search services.
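The idea of pages voting for each other through links can be sketched in Python as follows. This is a simplified version of the algorithm as it is commonly described, not Google’s actual implementation. The link graph (LINKS) is hypothetical, and the damping value of 0.85 and the fixed number of rounds are assumptions. The scores it produces are fractions that add up to 1, not values on a fixed small scale.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a score for each page from the links between pages.

    links maps every page to the list of pages it links to. Each link
    target must itself appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page receives a small base score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no links shares its vote among all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: who links to whom.
LINKS = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
# ['home', 'about', 'blog', 'orphan']
```

The page called orphan links out but receives no links, so it ends up with only the base score. The page called home, which every other page links to, comes out on top. A vote from a high-scoring page is also worth more than a vote from a low-scoring one, because each page passes on a share of its own score.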

Of course, this new system was gamed almost as much as the old one, leading to pages with long lists of links to low-quality pages (link farms). The algorithms used by Google and other search services have evolved over the years as they try to stay one step ahead of black-hat Search Engine Optimization (SEO) practitioners, and now include many factors relating to the page, the website that surrounds it, and other sites that link to it. A more recent addition to this arsenal is consideration of the freshness of content on the website, favoring sites that regularly publish new, high-quality content.