Distributed web crawling
Distributed web crawling is a technique in which Internet search engines employ many computers to perform the web crawling necessary to index the Internet. The idea is to spread the required computing power and bandwidth across many computers and network connections.

As of 2003, most modern commercial search engines use this technique. Companies such as Google use thousands of individual computers in multiple locations to crawl the Web.

Newer projects are attempting a less structured, more ad hoc form of collaboration by enlisting volunteers, who in many cases contribute their home or personal computers. LookSmart is the largest search engine to use this technique, in its Grub distributed web-crawling project.

The following is one proposed solution, though it is unclear whether Grub (or any other project) actually uses this algorithm: every computer connected to the Internet crawls some Internet addresses (URLs) in the background. After downloading pages, each client compresses them and sends them back to powerful central servers, together with a status flag (changed, new, down, or redirected). The servers manage a large URL database and send out new URLs to be tested to all clients.
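The client side of the proposed scheme might be sketched as follows. This is an illustrative reconstruction, not Grub's actual protocol: the status-flag names, the `classify` and `build_report` helpers, and the use of a CRC-32 digest to detect changed pages are all assumptions made here for the example.

```python
import zlib

# Hypothetical status flags matching the four cases described above.
STATUS_NEW, STATUS_CHANGED = "new", "changed"
STATUS_DOWN, STATUS_REDIRECTED = "down", "redirected"

def classify(url, body, http_status, known_digests):
    """Classify a fetched page relative to what the central server
    already knows (known_digests maps URL -> last-seen CRC-32)."""
    if body is None:
        return STATUS_DOWN                     # fetch failed entirely
    if http_status in (301, 302):
        return STATUS_REDIRECTED               # server sent a redirect
    if url not in known_digests:
        return STATUS_NEW                      # first time we see this URL
    if known_digests[url] != zlib.crc32(body):
        return STATUS_CHANGED                  # content differs from last crawl
    return "unchanged"

def build_report(url, body, http_status, known_digests):
    """Compress the page and attach its status flag, ready to be
    uploaded to the central servers."""
    status = classify(url, body, http_status, known_digests)
    payload = zlib.compress(body) if body is not None else b""
    return {"url": url, "status": status, "payload": payload}
```

A client would loop over the URLs assigned to it by the server, call `build_report` for each, and upload the results in a batch; the server would decompress the payloads, update its database, and hand out fresh URLs in the response.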


All Wikipedia text is available under the terms of the GNU Free Documentation License

 